This website presents visual speech synthesis results from different models. For each condition, we show the corresponding marker point cloud and a rendered video of a 3D head animated with these marker movements.
For reference, the first row shows grayscale videos taken during recording, including the recorded audio.
The second row combines the recorded visual data with synthesized speech, where the "true" phone durations from the recording are used for the audio synthesis.
The following rows show synthesized visual data from the different models; here, too, the "true" durations are used in both visual and acoustic synthesis. Hence, all videos in the same column use the same synthetic audio.
Click one of the images below to play the corresponding video, and click the "close" button in the overlay to return to this page. Flash is required.
|                                    | Speaker 1 | Speaker 2 | Speaker 3 |
| Grayscale video                    |  [video]  |  [video]  |  [video]  |
| Recorded data                      |  [video]  |  [video]  |  [video]  |
| Speaker-dependent (212 utterances) |  [video]  |  [video]  |  [video]  |
| Speaker-dependent (19 utterances)  |  [video]  |  [video]  |  [video]  |
| Adapted (212 utterances)           |  [video]  |  [video]  |  [video]  |
| Adapted (19 utterances)            |  [video]  |  [video]  |  [video]  |
The 3D head was designed by NaturalPoint (http://www.naturalpoint.com/optitrack/).
This research was funded by the Austrian Science Fund (FWF): P22890-N23.
The Competence Center FTW Forschungszentrum Telekommunikation Wien GmbH is funded within the program COMET – Competence Centers for Excellent Technologies by BMVIT, BMWA, and the City of Vienna. The COMET program is managed by the FFG.