Videos for Interspeech 2012

Speaker-adaptive visual speech synthesis in the HMM-framework

Dietmar Schabus, Michael Pucher, Gregor Hofer

FTW Telecommunications Research Center Vienna

This website shows visual speech synthesis results from different models. For each condition, we show the corresponding marker point cloud and a rendered video of a 3D head with these marker movements applied to it. For reference, the first row shows grayscale videos taken during recording, including the recorded audio.
The second row uses the recorded visual data together with synthesized audio, where the "true" phone durations from the recording are used for the audio synthesis. The following rows show visual data synthesized by different models; here too, the "true" durations are used in both visual and acoustic synthesis. Hence, all videos in the same column use the same synthetic audio.
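To make the rendering step concrete, the following is a minimal sketch of one way to drive a head mesh from marker trajectories. The nearest-marker binding, the array shapes, and the placeholder data are illustrative assumptions, not the pipeline actually used to produce these videos.

import numpy as np

def bind_vertices_to_markers(rest_vertices, rest_markers):
    """For each mesh vertex, find the closest marker in the neutral (rest)
    pose. Returns per-vertex marker indices and rest-pose offsets."""
    # (V, 1, 3) - (1, M, 3) -> (V, M) pairwise distances
    d = np.linalg.norm(rest_vertices[:, None, :] - rest_markers[None, :, :], axis=2)
    nearest = d.argmin(axis=1)                       # (V,) marker index per vertex
    offsets = rest_vertices - rest_markers[nearest]  # (V, 3) offset per vertex
    return nearest, offsets

def deform_frame(markers_t, nearest, offsets):
    """Move each vertex rigidly with its bound marker for one frame."""
    return markers_t[nearest] + offsets              # (V, 3)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, M, T = 1000, 40, 25                 # vertices, markers, frames (assumed sizes)
    rest_vertices = rng.normal(size=(V, 3))
    rest_markers = rng.normal(size=(M, 3))
    # Trajectories: (T, M, 3) marker positions, recorded or HMM-synthesized.
    trajectories = rest_markers + 0.01 * rng.normal(size=(T, M, 3))

    nearest, offsets = bind_vertices_to_markers(rest_vertices, rest_markers)
    animated = np.stack([deform_frame(f, nearest, offsets) for f in trajectories])
    print(animated.shape)                  # (T, V, 3): one deformed mesh per frame

A production rig would use smooth skinning weights over several markers rather than a single nearest marker, but the data flow (per-frame marker positions in, per-frame mesh deformation out) is the same.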

Click on one of the images below to play the corresponding video. Click the "close" button in the overlay to return to this page. Flash is required.

Condition                               Speaker 1       Speaker 2       Speaker 3
Grayscale video                         video           video           video
Recorded data                           video  video    video  video    video  video
Speaker-dependent (212 utterances)      video  video    video  video    video  video
Speaker-dependent (19 utterances)       video  video    video  video    video  video
Adapted (212 utterances)                video  video    video  video    video  video
Adapted (19 utterances)                 video  video    video  video    video  video

The 3D head was designed by NaturalPoint (http://www.naturalpoint.com/optitrack/).
This research was funded by the Austrian Science Fund (FWF): P22890-N23.
The Competence Center FTW Forschungszentrum Telekommunikation Wien GmbH is funded within the program COMET – Competence Centers for Excellent Technologies by BMVIT, BMWA, and the City of Vienna. The COMET program is managed by the FFG.