Intelligibility of time-compressed synthetic speech: Compression method and speaking style

by Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, Junichi Yamagishi
Abstract:
We present a series of intelligibility experiments performed on natural and synthetic speech time-compressed at a range of rates and analyze the effect of speech corpus and compression method on the intelligibility scores of sighted and blind individuals. In particular, we are interested in comparing linear and non-linear compression methods applied to normal and fast speech of different speakers. We recorded English and German language voice talents reading prompts at a normal and a fast rate. To create synthetic voices we trained a statistical parametric speech synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to the generated speech waveform. Word recognition results for the English voices show that generating speech at a normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, particularly when applied to the fast data. Phonemic-level annotation of the normal and fast databases showed that the German speaker was able to reproduce speech at a fast rate with fewer deletion and substitution errors compared to the English speaker, supporting the intelligibility benefits observed when compressing his fast speech. This shows that the use of fast speech data to create faster synthetic voices does not necessarily lead to more intelligible voices, as results are highly dependent on how successful the speaker was at speaking fast while maintaining intelligibility. Linear compression applied to normal-rate speech can more reliably provide higher intelligibility, particularly at ultra-fast rates.
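As a rough illustration of how the three compression strategies named in the abstract differ, the sketch below works on per-state durations only. It is not the authors' implementation: the function names, the toy numbers, and the specific variance-based formula (shifting each state mean by a common factor times its variance, a common convention in HMM-based synthesis) are assumptions made for this example. The paper applies linear compression to the generated waveform; here it is shown simply as uniform scaling of durations.

```python
import numpy as np

def variance_scaled_durations(mean, var, target_total):
    """Non-linear compression (assumed formula): shift each state duration by
    rho * variance, with rho chosen so the durations sum to target_total.
    No clipping is applied, so extreme targets can give negative values."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    rho = (target_total - mean.sum()) / var.sum()
    return mean + rho * var

def interpolated_means(mean_normal, mean_fast, alpha):
    """Interpolate between normal (alpha=0) and fast (alpha=1) duration means.
    alpha > 1 would extrapolate beyond the fast voice."""
    mean_normal = np.asarray(mean_normal, dtype=float)
    mean_fast = np.asarray(mean_fast, dtype=float)
    return (1.0 - alpha) * mean_normal + alpha * mean_fast

def linear_durations(durations, rate):
    """Linear compression: every segment is shortened by the same factor."""
    return np.asarray(durations, dtype=float) / rate

# Hypothetical three-state example, durations in frames.
mean_normal = [10.0, 20.0, 10.0]
var_normal = [1.0, 9.0, 2.0]
mean_fast = [8.0, 12.0, 8.0]

nonlinear = variance_scaled_durations(mean_normal, var_normal, target_total=20.0)
interpolated = interpolated_means(mean_normal, mean_fast, alpha=0.5)
linear = linear_durations(mean_normal, rate=2.0)
```

In this toy case both the variance-based and the linear result total 20 frames, but the variance-based method takes most of the reduction from the high-variance middle state, whereas linear compression halves every state equally.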
Reference:
Cassia Valentini-Botinhao, Markus Toman, Michael Pucher, Dietmar Schabus, Junichi Yamagishi, “Intelligibility of time-compressed synthetic speech: Compression method and speaking style”, in Speech Communication, vol. 74, pp. 52–64, 2015.
Bibtex Entry:
@Article{Valentini-Botinhao2015,
  Title                    = {Intelligibility of time-compressed synthetic speech: Compression method and speaking style},
  Author                   = {Cassia Valentini-Botinhao and Markus Toman and Michael Pucher and Dietmar Schabus and Junichi Yamagishi},
  Journal                  = {Speech Communication},
  Year                     = {2015},

  Month                    = nov,
  Pages                    = {52--64},
  Volume                   = {74},

  Abstract                 = {We present a series of intelligibility experiments performed on natural and synthetic speech time-compressed at a range of rates and analyze the effect of speech corpus and compression method on the intelligibility scores of sighted and blind individuals. In particular, we are interested in comparing linear and non-linear compression methods applied to normal and fast speech of different speakers. We recorded English and German language voice talents reading prompts at a normal and a fast rate. To create synthetic voices we trained a statistical parametric speech synthesis system based on the normal and the fast data of each speaker. We compared three compression methods: scaling the variance of the state duration model, interpolating the duration models of the fast and the normal voices, and applying a linear compression method to the generated speech waveform. Word recognition results for the English voices show that generating speech at a normal speaking rate and then applying linear compression resulted in the most intelligible speech at all tested rates. A similar result was found when evaluating the intelligibility of the natural speech corpus. For the German voices, interpolation was found to be better at moderate speaking rates but the linear method was again more successful at very high rates, particularly when applied to the fast data. Phonemic-level annotation of the normal and fast databases showed that the German speaker was able to reproduce speech at a fast rate with fewer deletion and substitution errors compared to the English speaker, supporting the intelligibility benefits observed when compressing his fast speech. This shows that the use of fast speech data to create faster synthetic voices does not necessarily lead to more intelligible voices, as results are highly dependent on how successful the speaker was at speaking fast while maintaining intelligibility. Linear compression applied to normal-rate speech can more reliably provide higher intelligibility, particularly at ultra-fast rates.},
  Doi                      = {10.1016/j.specom.2015.09.002},
  ISSN                     = {0167-6393},
  Keywords                 = {Fast speech},
}