Audiovisual Speech Synthesis Based on Hidden Markov Models

by Dietmar Schabus
Abstract:
In this dissertation, new methods for audiovisual speech synthesis using Hidden Markov Models (HMMs) are presented and their properties are investigated. The problem of audiovisual speech synthesis is to computationally generate both audible speech and a matching facial animation or video (a “visual speech signal”) for any given input text. The result is a “talking head” that can read any text to a user, with applications ranging from virtual agents in human-computer interaction to characters in animated films and computer games.

For recording and playback of facial motion, an optical marker-based facial motion capture system and 3D animation software are employed, which represent the state of the art in the animation industry. For modeling the acoustic and motion parameters of the synchronously recorded speech data, an existing HMM-based acoustic speech synthesis framework has been extended to the visual and audiovisual domains.

The most important scientific contributions are, on the one hand, a novel joint audiovisual approach, where speech and facial motion are generated from a single model that combines both modalities. An analysis of the resulting HMMs and subjective perceptual experiments show that this way of modeling yields better synchronization between speech and motion than separate acoustic and visual modeling, which is the most commonly followed strategy in related work. On the other hand, average voice training and target speaker adaptation are investigated for the visual domain. The concept of adaptation has been one of the key factors in the popularity of the HMM-based framework for acoustic speech synthesis. Again, objective analysis and subjective perceptual experiments show that this concept is also applicable to the visual domain.

Studying these modeling approaches requires suitable data collections. To this end, several synchronous labeled corpora of speech and facial motion recordings in Austrian German have been created as part of this dissertation. The resulting data collections have been released on the Internet for research purposes, and may prove to be valuable resources for the scientific community.
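As a rough illustration of the joint modeling idea described above (a sketch, not the author's actual implementation), synchronized acoustic and visual parameter vectors can be stacked frame by frame so that a single HMM state observes both modalities at once; the feature dimensions below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 200

# Hypothetical per-frame parameters (dimensions invented for illustration):
acoustic = rng.normal(size=(n_frames, 40))  # e.g. spectral/excitation features
visual = rng.normal(size=(n_frames, 60))    # e.g. 3D facial marker trajectories

# Separate modeling would train one HMM per stream; the joint approach
# stacks both streams into a single observation vector per frame, so one
# model (and one shared state alignment) covers speech and motion together.
joint = np.hstack([acoustic, visual])       # shape: (n_frames, 100)

# A single Gaussian state over the joint vector then captures cross-modal
# statistics, with a mean and covariance spanning both modalities:
state_mean = joint.mean(axis=0)             # length-100 mean vector
state_cov = np.cov(joint, rowvar=False)     # 100 x 100 covariance matrix

print(joint.shape, state_mean.shape, state_cov.shape)
```

Because both modalities share one state sequence, the model cannot drift in timing between audio and motion, which is the intuition behind the improved synchronization reported in the dissertation.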
Reference:
Dietmar Schabus, “Audiovisual Speech Synthesis Based on Hidden Markov Models”, PhD thesis, Graz University of Technology, Graz, Austria, 2014.
Bibtex Entry:
@PhdThesis{Schabus2014b,
  Title                    = {Audiovisual Speech Synthesis Based on Hidden Markov Models},
  Author                   = {Dietmar Schabus},
  School                   = {Graz University of Technology},
  Year                     = {2014},

  Address                  = {Graz, Austria},
  Month                    = nov,

  Abstract                 = {In this dissertation, new methods for audiovisual speech synthesis using Hidden Markov Models (HMMs) are presented and their properties are investigated. The problem of audiovisual speech synthesis is to computationally generate both audible speech and a matching facial animation or video (a "visual speech signal") for any given input text. The result is a "talking head" that can read any text to a user, with applications ranging from virtual agents in human-computer interaction to characters in animated films and computer games.

For recording and playback of facial motion, an optical marker-based facial motion capture system and 3D animation software are employed, which represent the state of the art in the animation industry. For modeling the acoustic and motion parameters of the synchronously recorded speech data, an existing HMM-based acoustic speech synthesis framework has been extended to the visual and audiovisual domains.

The most important scientific contributions are, on the one hand, a novel joint audiovisual approach, where speech and facial motion are generated from a single model that combines both modalities. An analysis of the resulting HMMs and subjective perceptual experiments show that this way of modeling yields better synchronization between speech and motion than separate acoustic and visual modeling, which is the most commonly followed strategy in related work. On the other hand, average voice training and target speaker adaptation are investigated for the visual domain. The concept of adaptation has been one of the key factors in the popularity of the HMM-based framework for acoustic speech synthesis. Again, objective analysis and subjective perceptual experiments show that this concept is also applicable to the visual domain.

Studying these modeling approaches requires suitable data collections. To this end, several synchronous labeled corpora of speech and facial motion recordings in Austrian German have been created as part of this dissertation. The resulting data collections have been released on the Internet for research purposes, and may prove to be valuable resources for the scientific community.
},
}