Joint Audiovisual Hidden Semi-Markov Model-based Speech Synthesis

by Dietmar Schabus, Michael Pucher and Gregor Hofer
Abstract:
This paper investigates joint speaker-dependent audiovisual hidden semi-Markov models (HSMMs), in which the visual models produce a sequence of 3D motion-tracking data used to animate a talking head and the acoustic models are used for speech synthesis. Acoustic, visual, and joint audiovisual models were trained for four Austrian German speakers, and we show that the joint models outperform the other approaches in terms of the synchronization quality of the synthesized visual speech. In addition, a detailed analysis of the acoustic and visual alignment is provided for the different models. Importantly, joint audiovisual modeling does not decrease the quality of the acoustic synthetic speech compared to acoustic-only modeling, so the common duration model of the joint approach, which synchronizes the acoustic and visual parameter sequences, offers a clear advantage. Finally, the joint approach yields a single model that integrates the visual and acoustic speech dynamics.
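The central idea in the abstract, a single explicit duration model driving jointly modeled acoustic and visual parameter streams so that the two outputs stay synchronized, can be illustrated with a small toy sketch. The Python snippet below is not the authors' implementation: all dimensions, state means, and durations are invented placeholders, and the trained HSMM machinery (Gaussian covariances, duration distributions, parameter generation with dynamic features) is omitted. It only shows why sharing one duration sequence keeps the acoustic and visual trajectories aligned by construction.

# Toy illustration (not the authors' implementation) of the joint-modeling idea:
# acoustic and visual parameters are modeled together per HSMM state, and one
# shared duration per state drives both streams, so the generated trajectories
# change state at exactly the same frames. All numbers are placeholders.

import numpy as np

ACOUSTIC_DIM = 3   # e.g. a few spectral coefficients (placeholder dimension)
VISUAL_DIM = 2     # e.g. a few 3D marker coefficients (placeholder dimension)
JOINT_DIM = ACOUSTIC_DIM + VISUAL_DIM

# Per-state joint observation means and explicit state durations (in frames).
# In a real HSMM these would be trained; here they are arbitrary placeholders.
state_means = [np.full(JOINT_DIM, fill) for fill in (0.0, 1.0, -0.5)]
state_durations = [4, 7, 5]  # one shared duration per state for both streams

def generate_joint_trajectory(means, durations):
    """Emit each state's joint mean vector for its duration (deterministic toy)."""
    frames = []
    for mean, dur in zip(means, durations):
        frames.extend([mean] * dur)
    return np.vstack(frames)  # shape: (total_frames, JOINT_DIM)

joint = generate_joint_trajectory(state_means, state_durations)

# Splitting the joint trajectory yields acoustic and visual streams of equal
# length that switch states at the same frames, i.e. they are synchronized.
acoustic = joint[:, :ACOUSTIC_DIM]
visual = joint[:, ACOUSTIC_DIM:]

assert acoustic.shape[0] == visual.shape[0] == sum(state_durations)
print("frames:", acoustic.shape[0],
      "acoustic dim:", acoustic.shape[1],
      "visual dim:", visual.shape[1])

Two independently trained single-modality models would each produce their own duration sequence, so the acoustic and visual outputs could drift apart; the common duration model of the joint approach is what avoids this synchronization problem.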
Reference:
Joint Audiovisual Hidden Semi-Markov Model-based Speech Synthesis (Dietmar Schabus, Michael Pucher and Gregor Hofer), In IEEE Journal of Selected Topics in Signal Processing, volume 8, number 2, pp. 336-347, April 2014.
Bibtex Entry:
@Article{Schabus2014a,
  author    = {Dietmar Schabus and Michael Pucher and Gregor Hofer},
  title     = {Joint Audiovisual Hidden Semi-Markov Model-based Speech Synthesis},
  journal   = {IEEE Journal of Selected Topics in Signal Processing},
  year      = {2014},
  volume    = {8},
  number    = {2},
  pages     = {336-347},
  month     = apr,
  issn      = {1932-4553},
  abstract  = {This paper investigates joint speaker-dependent audiovisual hidden semi-Markov models (HSMMs), in which the visual models produce a sequence of 3D motion-tracking data used to animate a talking head and the acoustic models are used for speech synthesis. Acoustic, visual, and joint audiovisual models were trained for four Austrian German speakers, and we show that the joint models outperform the other approaches in terms of the synchronization quality of the synthesized visual speech. In addition, a detailed analysis of the acoustic and visual alignment is provided for the different models. Importantly, joint audiovisual modeling does not decrease the quality of the acoustic synthetic speech compared to acoustic-only modeling, so the common duration model of the joint approach, which synchronizes the acoustic and visual parameter sequences, offers a clear advantage. Finally, the joint approach yields a single model that integrates the visual and acoustic speech dynamics.},
  comment   = {<br><a href="/phd/audiovisual">Website showing example stimuli from evaluation</a><br><a href="/phd/audiovisual/jstsp2013schabus.mp4">Short video showing example stimuli from evaluation</a> (6 MB)},
  doi       = {10.1109/JSTSP.2013.2281036},
  file      = {http://dx.doi.org/10.1109/JSTSP.2013.2281036},
  groups    = {FTW, Visual},
  keywords  = {Acoustics;Hidden Markov models;Joints;Speech;Synchronization;Training;Visualization;Audiovisual speech synthesis;HMM-based speech synthesis;facial animation;hidden Markov model;speech synthesis;talking head},
  owner     = {schabus},
  timestamp = {2013.11.21},
}