Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis

by Dietmar Schabus, Michael Pucher and Gregor Hofer
Abstract:
We have created a synchronous corpus of acoustic and 3D facial marker data from multiple speakers for adaptive audio-visual text-to-speech synthesis. The corpus contains data from one female and two male speakers, with 223 Austrian German sentences from each. In this paper, we first describe the recording process, which used professional audio equipment and a marker-based 3D facial motion capture system for the audio-visual recordings. We then turn to post-processing, which incorporates forced alignment, principal component analysis (PCA) on the visual data, and some manual checking and corrections. Finally, we describe the resulting corpus, which will be released under a research license at the end of our project. We show that the standard PCA-based feature extraction approach also works on a multi-speaker database in the adaptation scenario, where no data from the target speaker is available in the PCA step.
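
To make the adaptation scenario concrete: a PCA basis is estimated on pooled visual frames from the non-target speakers only, and the unseen target speaker's frames are then projected onto that basis. The sketch below is a minimal illustration under assumed dimensions, not the authors' implementation; the marker count, component count, and random placeholder data are hypothetical.

```python
# Minimal sketch of PCA-based visual feature extraction in the
# adaptation scenario: the PCA basis is fit without any target-speaker
# data, then used to project the target speaker's frames.
# Marker count, frame counts, and component count are assumptions.
import numpy as np
from sklearn.decomposition import PCA

N_MARKERS = 40                 # hypothetical number of facial markers
DIM = 3 * N_MARKERS            # x, y, z per marker, flattened per frame

rng = np.random.default_rng(0)
# Placeholder data standing in for recorded marker trajectories:
# pooled frames from the non-target speakers, and target-speaker frames.
pool_frames = rng.standard_normal((5000, DIM))
target_frames = rng.standard_normal((800, DIM))

pca = PCA(n_components=20)     # keep the leading principal components
pca.fit(pool_frames)           # basis estimated without the target speaker

# Project the target speaker onto the pooled basis to obtain
# low-dimensional visual feature vectors for adaptation.
target_features = pca.transform(target_frames)
print(target_features.shape)   # (800, 20)
```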
Reference:
Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis (Dietmar Schabus, Michael Pucher and Gregor Hofer), In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), 2012.
BibTeX Entry:
@InProceedings{Schabus2012,
  author    = {Dietmar Schabus and Michael Pucher and Gregor Hofer},
  title     = {Building a synchronous corpus of acoustic and 3D facial marker data for adaptive audio-visual speech synthesis},
  booktitle = {Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC)},
  year      = {2012},
  pages     = {3313--3316},
  address   = {Istanbul, Turkey},
  month     = may,
  abstract  = {We have created a synchronous corpus of acoustic and 3D facial marker data from multiple speakers for adaptive audio-visual text-to-speech synthesis. The corpus contains data from one female and two male speakers, with 223 Austrian German sentences from each. In this paper, we first describe the recording process, which used professional audio equipment and a marker-based 3D facial motion capture system for the audio-visual recordings. We then turn to post-processing, which incorporates forced alignment, principal component analysis (PCA) on the visual data, and some manual checking and corrections. Finally, we describe the resulting corpus, which will be released under a research license at the end of our project. We show that the standard PCA-based feature extraction approach also works on a multi-speaker database in the adaptation scenario, where no data from the target speaker is available in the PCA step.},
  file      = {/download/schabus_LREC_2012},
  groups    = {FTW, Visual},
  isbn      = {978-2-9517408-7-7},
  language  = {english},
  owner     = {schabus},
  timestamp = {2013.11.21},
  url       = {http://www.lrec-conf.org/proceedings/lrec2012/summaries/302.html},
}