Manipulation of the prosodic features of vocal tract length, nasality and articulatory precision using articulatory synthesis
Introduction
Speech prosody encodes linguistic and paralinguistic information (Grichkovtsova et al., 2009; Ladd, 2008). Paralinguistic information includes, for example, the emotional state of the speaker (Schröder, 2001), speaker traits (Schuller et al., 2015), the speaking style (Yamagishi et al., 2005), and speech and voice disorders. The simulation of these paralinguistic aspects is still a challenging problem in speech synthesis technology. Each paralinguistic function (e.g., the expression of a certain vocal emotion) is implemented by the complex interplay of multiple prosodic features. To synthesize speech that conveys certain paralinguistic information, it is thus necessary to be able to manipulate the involved prosodic features. In this study we analyzed the potential of articulatory speech synthesis to individually control specific prosodic features that are highly relevant for the encoding of paralinguistic information but have rarely been controlled in speech synthesis so far.
In the following, we differentiate between the “primary” prosodic features of pitch, duration, and intensity on the one hand, and “secondary” prosodic features like voice quality (Campbell and Mokhtari, 2003; Pfitzinger, 2006), nasality (Scherer, 1978), vocal tract length (Chuenwattanapranithi et al., 2008), and articulatory precision (Beller et al., 2008; Burkhardt and Sendlmeier, 2000) on the other. The primary features are easy to manipulate with the prevailing concatenative speech synthesis methods (Hunt and Black, 1996), and easy to quantify and analyze in recordings of natural speech. Therefore, the relation between these features and many paralinguistic aspects has been well studied, e.g., in the context of vocal emotions (Scherer et al., 2003; Scherer et al., 2015; Schröder, 2001).
In contrast, most secondary prosodic features are difficult to manipulate in the acoustic domain of concatenative speech synthesis, because rather simple changes at the articulatory level may have complex consequences in the acoustic domain. For example, a nasal voice quality is produced by a very specific and localized action at the articulatory level (lowering of the velum), but this action strongly affects the speech spectrum (introduction of pole–zero pairs in the vocal tract transfer function and shifts of the resonances). Analogously, a change of articulatory effort can be regarded as a simple change of articulator speed in the articulatory domain, but it results in a complicated change of the formant trajectories in the acoustic domain. Hence, to manipulate secondary prosodic features like nasality or articulatory effort for speech synthesis, it is most natural to do so in the articulatory domain of an articulatory speech synthesizer.
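The spectral effect of nasal coupling described above can be illustrated with a simple source–filter sketch. The following Python snippet is an illustration only, not the acoustic model used in this study: the formant values and the nasal pole–zero frequencies and bandwidths are assumed for demonstration. It compares the magnitude response of an all-pole formant filter with and without one additional nasal pole–zero pair, showing how a single localized change (the added pair) reshapes the whole spectrum.

```python
import numpy as np

FS = 16000.0  # sampling rate in Hz (an assumed value)

def second_order(f, bw):
    """Coefficients [1, a1, a2] of a conjugate pole (or zero) pair
    at center frequency f (Hz) with bandwidth bw (Hz)."""
    r = np.exp(-np.pi * bw / FS)
    return np.array([1.0, -2.0 * r * np.cos(2.0 * np.pi * f / FS), r * r])

def eval_poly(c, zinv):
    """Evaluate c[0] + c[1] z^-1 + c[2] z^-2 on the unit circle."""
    return c[0] + c[1] * zinv + c[2] * zinv * zinv

def transfer_magnitude(freqs, formants, nasal_pair=None):
    """|H(f)| of an all-pole formant filter; nasal_pair optionally adds
    one extra pole pair and one zero pair (the nasal anti-resonance)."""
    zinv = np.exp(-2j * np.pi * np.asarray(freqs) / FS)
    h = np.ones_like(zinv)
    for f, bw in formants:
        h /= eval_poly(second_order(f, bw), zinv)      # resonance: pole pair
    if nasal_pair is not None:
        (fp, bwp), (fz, bwz) = nasal_pair
        h *= eval_poly(second_order(fz, bwz), zinv)    # anti-resonance: zero pair
        h /= eval_poly(second_order(fp, bwp), zinv)    # additional nasal pole pair
    return np.abs(h)

# Oral /a/-like formants vs. the same filter with a nasal pole-zero pair
freqs = np.linspace(50.0, 4000.0, 200)
oral = transfer_magnitude(freqs, [(700, 80), (1200, 90), (2600, 120)])
nasal = transfer_magnitude(freqs, [(700, 80), (1200, 90), (2600, 120)],
                           nasal_pair=((1000, 100), (900, 100)))
```

The zero pair carves a spectral dip near its center frequency while the extra pole pair adds a nearby nasal resonance, which is qualitatively the pattern produced by velopharyngeal coupling.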
There is ample evidence that secondary prosodic features are just as important as the primary features for the implementation of diverse paralinguistic functions. With regard to the expression of vocal emotions, for example, phonation type was found to be a major cue in the expression of anger or fear (Airas and Alku, 2006; Birkholz et al., 2015; Burkhardt, 2009; Campbell and Mokhtari, 2003; Gobl and Ní Chasaide, 2003). Vocal tract length is also an important feature in the expression of emotions (Chuenwattanapranithi et al., 2008). According to the bio-informational dimensions theory, speakers modify their vocal tract length to project a larger body size to appear dominant and a smaller body size to appear friendly (Xu et al., 2013). Furthermore, articulatory precision is related to certain vocal emotions (Burkhardt and Sendlmeier, 2000; Murray and Arnott, 1993). For example, precise articulation contributes to a joyful impression, while imprecise articulation reduces it (Burkhardt and Sendlmeier, 2000). With regard to other paralinguistic functions, a nasal voice quality was, for example, identified as a vocal cue for the expression of body complacency (Sendlmeier and Heile, 1998), extroversion (Scherer, 1978) and sarcasm (Gibbs, 1986).
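The acoustic consequence of the body-size projection described above can be sketched with a uniform-tube approximation: the resonances of a tube closed at the glottis and open at the lips are F_n = (2n − 1)·c / (4L), so lengthening the tract by some factor lowers all formants by that same factor. The snippet below is a sketch under this assumption; the tract length and scaling factors are illustrative values, not measurements from the study.

```python
# Uniform-tube sketch of vocal tract length (VTL) effects on formants.
# All numeric values (17.5 cm tract, +/-10% length changes) are assumptions.

C = 35000.0  # speed of sound in cm/s

def uniform_tube_formants(length_cm, n=3):
    """First n resonance frequencies (Hz) of a uniform tube of the given
    length, closed at the glottis and open at the lips."""
    return [(2 * k - 1) * C / (4.0 * length_cm) for k in range(1, n + 1)]

def project_body_size(formants_hz, length_factor):
    """Scale formants as if the vocal tract were length_factor times longer."""
    return [f / length_factor for f in formants_hz]

neutral  = uniform_tube_formants(17.5)        # -> [500.0, 1500.0, 2500.0] Hz
dominant = project_body_size(neutral, 1.10)   # longer tract -> lower formants
friendly = project_body_size(neutral, 0.90)   # shorter tract -> higher formants
```

This uniform scaling is only a first-order picture; a real articulatory synthesizer can lengthen individual tract sections (e.g., lower the larynx or protrude the lips), which affects the formants non-uniformly.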
To vary these features with concatenative speech synthesis, the synthesizer needs a database of speech units that covers the necessary variation. However, since humans are not used to controlling prosodic features individually, they cannot help but let them co-vary with other features when asked to perform a manipulation. For example, to create a concatenative speech synthesizer for the synthesis of vocal emotions, the speech corpus needs to be recorded with multiple emotions that contain the emotion-specific feature combinations (Black, 2003; Iida et al., 2003). However, this is not only very laborious, but the coverage of the feature space also remains rather limited with respect to the variety of possible feature combinations.
In contrast to concatenative synthesis, parametric synthesis methods can manipulate the features of the voice source and the vocal tract independently. The main parametric synthesis methods are formant synthesis (Klatt, 1980), HMM-based synthesis (Zen et al., 2009) and articulatory synthesis (Aryal and Gutierrez-Osuna, 2016; Birkholz, 2013a; van den Doel et al., 2006). Formant synthesis has long been the first choice for the synthesis and analysis of (secondary) prosodic features for, e.g., emotions (Burkhardt and Sendlmeier, 2000; Murray and Arnott, 1995). More recently, HMM-based synthesis has been applied to modify secondary prosodic features for modeling different speaking styles and emotions (Yamagishi et al., 2005) or hypo- and hyperarticulated speech (Picart et al., 2014). However, both formant synthesis and HMM-based synthesis model speech in the temporal and spectral domain rather than the articulatory domain.
Articulatory speech synthesis can in principle vary all prosodic features directly at the articulatory and physiological level. Therefore, this kind of synthesis is generally considered the best choice for research on paralinguistic effects like emotions (Schröder et al., 2010). However, despite considerable progress in recent years, articulatory speech synthesis still sounds somewhat less natural than unit-selection synthesis, and the articulatory and acoustic simulations are computationally rather time-consuming. Hence, articulatory synthesis is not yet at a level of development where it is competitive for text-to-speech synthesis, but it is very well suited for analysis-by-synthesis experiments as in the present study. The effectiveness of articulatory synthesis in such an experiment for the analysis of phonation type in vocal emotions was recently demonstrated by Birkholz et al. (2015). However, the articulatory synthesis of further secondary prosodic features has so far not been demonstrated in a systematic way. In this study we therefore examined different ways to vary the prosodic features of vocal tract length, nasality and articulatory precision using articulatory speech synthesis. We show that rule-based articulatory manipulations suffice for the perceptually convincing generation of these features.
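To make the notion of a rule-based manipulation concrete, the following sketch varies "articulatory precision" by linearly interpolating vowel targets toward (or away from) a neutral, schwa-like reference. The target representation and all numeric values are hypothetical illustrations, not the parameters of the synthesizer used in this study; a real articulatory synthesizer would centralize articulatory targets rather than formant values, but the interpolation rule is the same.

```python
# Hypothetical vowel targets given as formant values (Hz) for illustration.
SCHWA = {"F1": 500.0, "F2": 1500.0}  # assumed neutral (schwa-like) reference

def centralize(target, degree):
    """degree = 0 keeps the vowel unchanged, degree = 1 collapses it onto
    the neutral reference; negative degrees expand the vowel space,
    which models increased articulatory precision."""
    return {k: v + degree * (SCHWA[k] - v) for k, v in target.items()}

vowel_a  = {"F1": 700.0, "F2": 1200.0}
reduced  = centralize(vowel_a, 0.5)    # imprecise: halfway toward schwa
expanded = centralize(vowel_a, -0.25)  # more precise: pushed away from schwa
```

A single scalar (`degree`) thus controls the whole continuum from hyper- to hypoarticulation, which is exactly what makes rule-based articulatory manipulation attractive for controlled perception experiments.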
Section snippets
Methods
Nine German words (Banane [banaːnə], Birne [bɪrnə], Blaubeere [blaʊbeːʀə], Himbeere [hɪmbeːʀə], Mandarine [mandaʀiːnə], Melone [məloːnə], Mirabelle [mirabɛlə], Orange [oʀanʒə], Rosine [ʀoziːnə]; English: banana, pear, blueberry, raspberry, mandarin, melon, mirabelle, orange, raisin) were spoken in a neutral way by a male German native speaker and used as basis words for this study. These words were then re-synthesized as accurately as possible using the articulatory speech synthesizer
Results and discussion
Fig. 4a shows the results of the first task where the subjects rated the naturalness of the stimuli. The standard stimuli, i.e., those without any further manipulations, were rated as most natural with a mean score of 2.96. Among the stimuli variants with prosodic manipulations, the stimuli with higher effort were rated best with a mean score of 2.84, and the very centralized stimuli received the lowest mean score of 2.21. As detailed in Appendix A, there was no statistically significant
Conclusions
This study demonstrated that the secondary prosodic features of vocal tract length, nasality and articulatory precision can be created by simple rule-based articulatory manipulations of neutrally spoken re-synthesized words. The perceptual prosodic correlates of the articulatory variations were identified well by the subjects in a listening experiment. Only the articulatory manipulations made for increased articulatory precision were not successful. The studied prosodic features are potentially
Acknowledgments
The authors would like to thank all volunteers for their participation in the perception experiments and the two anonymous reviewers for their valuable comments on an earlier version of the paper.
References
- et al. Data driven articulatory synthesis with deep neural networks. Comput. Speech Lang. (2016)
- et al. The role of voice quality in communicating emotion, mood and attitude. Speech Commun. (2003)
- et al. A corpus-based speech synthesis system with emotion. Speech Commun. (2003)
- et al. Implementation and testing of a system for producing emotion-by-rule in synthetic speech. Speech Commun. (1995)
- et al. Mapping emotions into acoustic space: the role of voice production. Biol. Psychol. (2011)
- et al. Analysis and HMM-based synthesis of hypo and hyperarticulated speech. Comput. Speech Lang. (2014)
- et al. Comparing the acoustic expression of emotion in the speaking and the singing voice. Comput. Speech Lang. (2015)
- et al. A survey on perceived speaker traits: personality, likability, pathology, and the first challenge. Comput. Speech Lang. (2015)
- Acoustic vowel reduction as a function of sentence accent, word stress, and word class. Speech Commun. (1993)
- et al. Statistical parametric speech synthesis. Speech Commun. (2009)
- Emotions in vowel segments of continuous speech: analysis of the glottal flow using the normalised amplitude quotient. Phonetica
- Articulation degree as a prosodic dimension of expressive speech
- 3D-Artikulatorische Sprachsynthese [3D articulatory speech synthesis]
- Control of an articulatory speech synthesizer based on dynamic approximation of spatial articulatory targets
- Modeling consonant-vowel coarticulation for articulatory speech synthesis. PLoS ONE
- VocalTractLab [software]
- Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system
- Articulatory synthesis of words in six voice qualities using a modified two-mass model of the vocal folds
- Model-based reproduction of articulatory trajectories for consonant-vowel sequences. IEEE Trans. Audio Speech Lang. Process.
- Synthesis of breathy, normal, and pressed phonation using a two-mass model with a triangular glottis
- The contribution of phonation type to the perception of vocal emotions in German: an articulatory synthesis study. J. Acoust. Soc. Am.
- Unit selection and emotional speech
- Praat: doing phonetics by computer [software]
- Rule-based voice quality variation with formant synthesis
- Verification of acoustical correlates of emotional speech using formant-synthesis
- Voice quality: the 4th prosodic dimension