Computer Speech & Language

Volume 41, January 2017, Pages 116-127
Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis

https://doi.org/10.1016/j.csl.2016.06.004

Highlights

  • Secondary prosodic features contribute to paralinguistic information in speech.

  • Concatenative speech synthesis has difficulty controlling many prosodic features.

  • Here, articulatory synthesis is used for rule-based control of prosodic features.

  • Vocal tract length, articulatory precision and nasality are controlled effectively.

Abstract

Vocal emotions, as well as different speaking styles and speaker traits, are characterized by a complex interplay of multiple prosodic features. Natural-sounding speech synthesis with the ability to control such paralinguistic aspects requires the manipulation of the corresponding prosodic features. With traditional concatenative speech synthesis it is easy to manipulate the “primary” prosodic features of pitch, duration, and intensity, but very hard to individually control “secondary” prosodic features like phonation type, vocal tract length, articulatory precision and nasality. These secondary features can be controlled more directly with parametric synthesis methods. In the present study we analyze the ability of articulatory speech synthesis to control secondary prosodic features by rule. To this end, nine German words were re-synthesized with the software VocalTractLab 2.1 and then manipulated in different ways at the articulatory level to vary vocal tract length, articulatory precision and degree of nasality. Listening tests showed that most of the intended prosodic manipulations could be reliably identified, with recognition rates between 77% and 96%. Only the manipulations intended to increase articulatory precision were poorly recognized. The results suggest that rule-based manipulations in articulatory synthesis are generally sufficient for the convincing synthesis of secondary prosodic features at the word level.

Introduction

Speech prosody encodes linguistic and paralinguistic information (Grichkovtsova et al., 2009; Ladd, 2008). Paralinguistic information includes, for example, the emotional state of the speaker (Schröder, 2001), speaker traits (Schuller et al., 2015), the speaking style (Yamagishi et al., 2005), and speech and voice disorders. The simulation of these paralinguistic aspects is still a challenging problem in speech synthesis technology. Each paralinguistic function (e.g., the expression of a certain vocal emotion) is implemented by the complex interplay of multiple prosodic features. To synthesize speech that conveys certain paralinguistic information, it is thus necessary to be able to manipulate the involved prosodic features. In this study we analyzed the potential of articulatory speech synthesis to individually control specific prosodic features that are highly relevant for the encoding of paralinguistic information but have rarely been controlled in speech synthesis so far.

In the following, we differentiate the “primary” prosodic features of pitch, duration, and intensity on the one hand, and “secondary” prosodic features like voice quality (Campbell and Mokhtari, 2003; Pfitzinger, 2006), nasality (Scherer, 1978), vocal tract length (Chuenwattanapranithi et al., 2008), and articulatory precision (Beller et al., 2008; Burkhardt and Sendlmeier, 2000) on the other hand. The primary features are easy to manipulate with the prevailing concatenative speech synthesis methods (Hunt and Black, 1996), as well as easy to quantify and analyze in recordings of natural speech. Therefore, the relation between these features and many paralinguistic aspects has been well studied, e.g., in the context of vocal emotions (Scherer et al., 2003; Scherer et al., 2015; Schröder, 2001).

In contrast, most secondary prosodic features are difficult to manipulate in the acoustic domain of concatenative speech synthesis, because rather simple changes at the articulatory level may have complex consequences in the acoustic domain. For example, a nasal voice quality is produced by a very specific and localized action at the articulatory level (lowering of the velum) but strongly affects the speech spectrum (introduction of pole–zero pairs in the vocal tract transfer function, shift of resonances). Analogously, a change of articulatory effort can be considered a simple change of the speed of the articulators in the articulatory domain, but results in a complicated change of the formant trajectories in the acoustic domain. Hence, to manipulate secondary prosodic features like nasality or articulatory effort for speech synthesis, it is most favorable to do so in the articulatory domain of an articulatory speech synthesizer.
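
To make this asymmetry concrete, the following minimal sketch (our illustration, not taken from the paper; all formant and antiformant frequencies and bandwidths are arbitrary example values) models the oral tract as an all-pole filter and approximates nasalization by inserting a single pole–zero pair, then compares the two frequency responses:

    # Illustrative sketch: one localized articulatory change (velum lowering)
    # modeled as an added pole-zero pair in an otherwise all-pole filter.
    import numpy as np
    from scipy import signal

    fs = 16000  # sampling rate in Hz

    def resonator(freq_hz, bw_hz):
        """Second-order all-pole section (one formant) as a polynomial."""
        r = np.exp(-np.pi * bw_hz / fs)
        theta = 2 * np.pi * freq_hz / fs
        return np.poly([r * np.exp(1j * theta), r * np.exp(-1j * theta)]).real

    # Oral (non-nasal) tract: three formants, all-pole transfer function.
    a_oral = np.convolve(np.convolve(resonator(500, 80), resonator(1500, 100)),
                         resonator(2500, 120))
    b_oral = [1.0]

    # Nasalized tract: same formants plus one pole-zero pair around 1 kHz.
    b_nasal = resonator(1000, 100)                        # spectral zero (antiformant)
    a_nasal = np.convolve(a_oral, resonator(1200, 150))   # extra nasal pole

    for name, b, a in [("oral", b_oral, a_oral), ("nasal", b_nasal, a_nasal)]:
        w, h = signal.freqz(b, a, worN=512, fs=fs)
        peak = w[np.argmax(np.abs(h))]
        print(f"{name}: strongest resonance near {peak:.0f} Hz")

A single binary articulatory decision (velum up or down) thus changes the entire shape of the spectrum, which is why such manipulations are awkward to perform on recorded waveforms but straightforward at the articulatory level.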

There is ample evidence that secondary prosodic features are just as important as the primary features for the implementation of diverse paralinguistic functions. For example, with regard to the expression of vocal emotions, phonation type was found to be a major cue in the expression of anger or fear (Airas and Alku, 2006; Birkholz et al., 2015; Burkhardt, 2009; Campbell and Mokhtari, 2003; Gobl and Ní Chasaide, 2003). Vocal tract length is also an important feature in the expression of emotions (Chuenwattanapranithi et al., 2008). According to the bio-informational dimensions theory, speakers modify their vocal tract length to project a larger body size to appear dominant and a smaller body size to appear friendly (Xu et al., 2013). Furthermore, articulatory precision is related to certain vocal emotions (Burkhardt and Sendlmeier, 2000; Murray and Arnott, 1993). For example, precise articulation contributes to a joyful impression and imprecise articulation reduces it (Burkhardt and Sendlmeier, 2000). With regard to other paralinguistic functions, a nasal voice quality was, for example, identified as a vocal cue for the expression of body complacency (Sendlmeier and Heile, 1998), extroversion (Scherer, 1978) and sarcasm (Gibbs, 1986).
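
The acoustic link between vocal tract length and projected body size can be illustrated with textbook tube acoustics (again our sketch, not an analysis from the paper): for a uniform tube closed at the glottis and open at the lips, the resonances lie at odd multiples of c/(4L), so all formants scale inversely with tract length L:

    # Quarter-wavelength resonator model: F_n = (2n - 1) * c / (4 * L).
    # Lengths below are arbitrary example values.
    C = 35000.0  # speed of sound in warm, humid air, cm/s

    def formants(length_cm, n=3):
        """First n resonances of a uniform tube of the given length."""
        return [(2 * k - 1) * C / (4 * length_cm) for k in range(1, n + 1)]

    for label, length in [("neutral (17.5 cm)", 17.5),
                          ("lengthened, 'larger body' (19 cm)", 19.0),
                          ("shortened, 'smaller body' (16 cm)", 16.0)]:
        print(label, [f"{f:.0f} Hz" for f in formants(length)])

Lengthening the tract lowers all formants uniformly, which listeners associate with a larger (more dominant) speaker, and vice versa.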

To vary these features with concatenative speech synthesis, the synthesizer needs a database of speech units that cover the necessary variation. However, since humans are not used to controlling prosodic features individually, they cannot help letting them co-vary with other features when asked to perform a manipulation. For example, to create a concatenative speech synthesizer for the synthesis of vocal emotions, the speech corpus needs to be recorded with multiple emotions that contain the emotion-specific feature combinations (Black, 2003; Iida et al., 2003). However, this is not only very laborious, but the coverage of the feature space also remains rather limited with respect to the variety of possible feature combinations.

In contrast to concatenative synthesis, parametric synthesis methods can manipulate the features of the voice source and the vocal tract independently. The main parametric synthesis methods are formant synthesis (Klatt, 1980), HMM-based synthesis (Zen et al., 2009) and articulatory synthesis (Aryal and Gutierrez-Osuna, 2016; Birkholz, 2013a; van den Doel et al., 2006). Formant synthesis has long been the first choice for the synthesis and analysis of (secondary) prosodic features for, e.g., emotions (Burkhardt and Sendlmeier, 2000; Murray and Arnott, 1995). More recently, HMM-based synthesis has been applied to modify secondary prosodic features for modeling different speaking styles and emotions (Yamagishi et al., 2005) or hypo- and hyperarticulated speech (Picart et al., 2014). However, both formant synthesis and HMM-based synthesis model speech in the temporal and spectral domain instead of the articulatory domain.

Articulatory speech synthesis can in principle vary all prosodic features directly at the articulatory and physiological level. Therefore, this kind of synthesis is generally considered the best choice for research on paralinguistic effects like emotions (Schröder et al., 2010). However, despite considerable progress in recent years, articulatory speech synthesis still sounds somewhat less natural than unit-selection synthesis, and the articulatory and acoustic simulations are rather time-consuming. Hence, articulatory synthesis is not yet at a level of development where it is competitive for text-to-speech synthesis, but it is very well suited for analysis-by-synthesis experiments like the present study. The effectiveness of articulatory synthesis in such an experiment for the analysis of phonation type in vocal emotions was recently demonstrated by Birkholz et al. (2015). However, the articulatory synthesis of further secondary prosodic features has not yet been demonstrated in a systematic way. In this study, we therefore examined different ways to vary the prosodic features of vocal tract length, nasality and articulatory precision using articulatory speech synthesis. It is shown that rule-based articulatory manipulations suffice for the perceptually convincing generation of these features.

Section snippets

Methods

Nine German words (Banane [banaːnə], Birne [bɪrnə], Blaubeere [blaʊbeːʀə], Himbeere [hɪmbeːʀə], Mandarine [mandaʀiːnə], Melone [məloːnə], Mirabelle [mirabɛlə], Orange [oʀanʒə], Rosine [ʀoziːnə]; English: banana, pear, blueberry, raspberry, mandarin, melon, mirabelle, orange, raisin) were spoken in a neutral way by a male German native speaker and used as basis words for this study. These words were then re-synthesized as accurately as possible using the articulatory speech synthesizer VocalTractLab 2.1.
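
As a rough illustration of what a rule-based manipulation of articulatory precision could look like, the following hypothetical sketch interpolates vowel targets between a neutral, schwa-like configuration and the full target; all parameter names and values are invented for this example and do not correspond to VocalTractLab's actual articulatory parameters:

    # Hypothetical precision rule: scale the articulatory distance of a vowel
    # target from a neutral (schwa-like) configuration by a factor alpha.
    # alpha < 1 centralizes the target (less precise articulation);
    # alpha > 1 overshoots it (more precise articulation).
    def adjust_precision(target, schwa, alpha):
        """Interpolate/extrapolate a vowel target relative to schwa."""
        return {p: schwa[p] + alpha * (target[p] - schwa[p]) for p in target}

    # Invented parameter set and values, for illustration only.
    schwa = {"jaw_opening": 0.4, "tongue_height": 0.5, "lip_rounding": 0.2}
    target_a = {"jaw_opening": 0.9, "tongue_height": 0.1, "lip_rounding": 0.1}

    print("centralized /a/:", adjust_precision(target_a, schwa, 0.5))
    print("overshot /a/:   ", adjust_precision(target_a, schwa, 1.3))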

Results and discussion

Fig. 4a shows the results of the first task where the subjects rated the naturalness of the stimuli. The standard stimuli, i.e., those without any further manipulations, were rated as most natural with a mean score of 2.96. Among the stimuli variants with prosodic manipulations, the stimuli with higher effort were rated best with a mean score of 2.84, and the very centralized stimuli received the lowest mean score of 2.21. As detailed in Appendix A, there was no statistically significant

Conclusions

This study demonstrated that the secondary prosodic features of vocal tract length, nasality and articulatory precision can be created by simple rule-based articulatory manipulations of neutrally spoken re-synthesized words. The perceptual prosodic correlates of the articulatory variations were identified well by the subjects in a listening experiment. Only the articulatory manipulations made for increased articulatory precision were not successful. The studied prosodic features are potentially

Acknowledgments

The authors would like to thank all volunteers for their participation in the perception experiments and the two anonymous reviewers for their valuable comments on an earlier version of the paper.

References (53)

  • M. Airas et al. Emotions in vowel segments of continuous speech: analysis of the glottal flow using the normalised amplitude quotient. Phonetica (2006)

  • G. Beller et al. Articulation degree as a prosodic dimension of expressive speech (2008)

  • P. Birkholz. 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis] (2005)

  • P. Birkholz. Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets

  • P. Birkholz. Modeling consonant-vowel coarticulation for articulatory speech synthesis. PLoS ONE (2013)

  • P. Birkholz. VocalTractLab [software]

  • P. Birkholz et al. Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system

  • P. Birkholz et al. Articulatory synthesis of words in six voice qualities using a modified two-mass model of the vocal folds

  • P. Birkholz et al. Model-based reproduction of articulatory trajectories for consonant-vowel sequences. IEEE Trans. Audio Speech Lang. Process. (2011)

  • P. Birkholz et al. Synthesis of breathy, normal, and pressed phonation using a two-mass model with a triangular glottis

  • P. Birkholz et al. The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. J. Acoust. Soc. Am. (2015)

  • A.W. Black. Unit selection and emotional speech

  • P. Boersma et al. Praat: doing phonetics by computer [software]

  • F. Burkhardt. Rule-based voice quality variation with formant synthesis

  • F. Burkhardt et al. Verification of acoustical correlates of emotional speech using formant-synthesis

  • N. Campbell et al. Voice quality: the 4th prosodic dimension