Visual Control of Hidden-Semi-Markov-Model based Acoustic Speech Synthesis

by Jakob Hollenstein, Michael Pucher, Dietmar Schabus
Abstract:
We show how to visually control acoustic speech synthesis by modelling the dependency between visual and acoustic parameters within the Hidden-Semi-Markov-Model (HSMM) based speech synthesis framework. A joint audio-visual model is trained with 3D facial marker trajectories as visual features. Since the dependencies of acoustic features on visual features are only present for certain phones, we implemented a model where dependencies are estimated for a set of vowels only. A subjective evaluation consisting of a vowel identification task showed that we can transform some vowel trajectories in a phonetically meaningful way by controlling the visual parameters in PCA space. These visual parameters can also be interpreted as fundamental visual speech motion components, which leads to an intuitive control model.
Reference:
Jakob Hollenstein, Michael Pucher, Dietmar Schabus, “Visual Control of Hidden-Semi-Markov-Model based Acoustic Speech Synthesis”, In Proceedings of the 12th International Conference on Auditory-Visual Speech Processing (AVSP), Annecy, France, pp. 31–36, 2013.
Bibtex Entry:
@InProceedings{Hollenstein2013a,
  Title                    = {Visual Control of Hidden-Semi-Markov-Model based Acoustic Speech Synthesis},
  Author                   = {Jakob Hollenstein and Michael Pucher and Dietmar Schabus},
  Booktitle                = {Proceedings of the 12th International Conference on Auditory-Visual Speech Processing (AVSP)},
  Year                     = {2013},

  Address                  = {Annecy, France},
  Month                    = sep,
  Pages                    = {31--36},

  Abstract                 = {We show how to visually control acoustic speech synthesis by modelling the dependency between visual and acoustic parameters within the Hidden-Semi-Markov-Model (HSMM) based speech synthesis framework. A joint audio-visual model is trained with 3D facial marker trajectories as visual features. Since the dependencies of acoustic features on visual features are only present for certain phones, we implemented a model where dependencies are estimated for a set of vowels only. A subjective evaluation consisting of a vowel identification task showed that we can transform some vowel trajectories in a phonetically meaningful way by controlling the visual parameters in PCA space. These visual parameters can also be interpreted as fundamental visual speech motion components, which leads to an intuitive control model.},
  Url                      = {http://avsp2013.loria.fr/proceedings/papers/paper_20.pdf}
}