From 2011 to 2014, I worked on the basic research project “Adaptive Audio-Visual Dialect Speech Synthesis” at FTW, funded by the FWF under project number P22890-N23 and led by Dr. Michael Pucher as principal investigator. Building on the results of this work, I wrote a doctoral dissertation, which I defended at TU Graz in November 2014. My supervisor at TU Graz was Assoc.-Prof. Dr. Franz Pernkopf of the Institute of Signal Processing and Speech Communication (SPSC).
The title of my dissertation is:
Audiovisual Speech Synthesis Based on Hidden Markov Models
Here’s its abstract:
In this dissertation, new methods for audiovisual speech synthesis using Hidden Markov Models (HMMs) are presented and their properties are investigated. The problem of audiovisual speech synthesis is to computationally generate both audible speech as well as a matching facial animation or video (a “visual speech signal”) for any given input text. This results in “talking heads” that can read any text to a user, with applications ranging from virtual agents in human-computer interaction to characters in animated films and computer games.
For recording and playback of facial motion, an optical marker-based facial motion capturing hardware system and 3D animation software are employed, which represent the state of the art in the animation industry. For modeling the acoustic and motion parameters of the synchronously recorded speech data, an existing HMM-based acoustic speech synthesis framework has been extended to the visual and audiovisual domains.
The most important scientific contributions are on the one hand a novel joint audiovisual approach, where speech and facial motion are generated from a single model which combines both modalities. An analysis of the resulting HMMs and subjective perceptual experiments show that this way of modeling results in better synchronization between speech and motion than separate acoustic and visual modeling, which is the most commonly followed strategy in related work. On the other hand, average voice training and target speaker adaptation are investigated for the visual domain. The concept of adaptation has been one of the key factors for the popularity of the HMM-based framework for acoustic speech synthesis. Again, objective analysis and subjective perceptual experiments show that this concept is also applicable to the visual domain.
In order to study these modeling approaches, suitable data collections are required. To this end, several synchronous labeled corpora of speech and facial motion recordings in Austrian German have been created as part of this dissertation. The resulting data collections have been released on the Internet for research purposes, and may turn out to be valuable resources for the scientific community.
Download the full text PDF here: schabus_dissertation_final.pdf (20.5 MB)
Below is an example of computer-generated speech and facial motion from a combined audiovisual model for a previously unknown sentence.