In Section 1, we present speech samples synthesized by ProsodyFM with controlled phrasing and intonation. The spectrograms and pitch contours of these samples are provided in our paper and in Appendix D.
In Section 2, we provide speech samples synthesized by ProsodyFM and four other state-of-the-art (SOTA) models under both parallel and non-parallel settings.
Alongside the raw text to be synthesized, we provide labels for phrase breaks and terminal intonation. <b ↘> marks a phrase break with a falling tone on the last word of the intonational phrase, <b ↗> marks a phrase break with a rising tone, and <b ➡️> marks a phrase break with a level tone.
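For readers who want to process the labeled text programmatically, the label format described above can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_labeled_text` and the tone names in `TONES` are illustrative assumptions and are not part of the ProsodyFM code.

```python
import re

# Hypothetical mapping from the arrow in each label to a tone name.
TONES = {
    "\u2198": "falling",  # down-right arrow, as in <b ↘>
    "\u2197": "rising",   # up-right arrow, as in <b ↗>
    "\u27a1": "level",    # right arrow, as in <b ➡️>
}

# A label is "<b ARROW>". The level arrow may be followed by an
# emoji variation selector (U+FE0F), so that character is optional.
LABEL = re.compile(r"<b\s*([\u2198\u2197\u27a1])\ufe0f?\s*>")

def parse_labeled_text(text):
    """Split labeled text into (phrase, tone) pairs, one per intonational phrase.

    Text after the last label, if any, is returned with tone None.
    """
    phrases = []
    start = 0
    for match in LABEL.finditer(text):
        phrase = text[start:match.start()].strip()
        phrases.append((phrase, TONES[match.group(1)]))
        start = match.end()
    trailing = text[start:].strip()
    if trailing:
        phrases.append((trailing, None))
    return phrases
```

Applied to a labeled sentence, the function returns one pair per intonational phrase, each holding the phrase text and the tone assigned to its final word.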
Sample 1: 1363_139304_000009_000005.wav from the LibriTTS corpus.
Text: Quite suddenly he rolled over <b ↘> stared for a moment <b ↗> and struggled into a sitting position <b ↘>
GT (vocoder)
gt_hifigan_1363_139304_000009_000005.wav
ProsodyFM (original)
prosodyfm_1363_139304_000009_000005.wav
We control the intonation of the word “moment”.
Level k=0
Rising k=+2
Rising k=+4
Falling k=-2
Falling k=-4
We remove the break between “over” and “stared”, or add a break after the word “for”.
Remove the break
Add the break
Sample 2: 7302_86815_000052_000000.wav from the LibriTTS corpus.
Text: Well madam <b ↘> it will be a laudable action on your part <b ↘> and I will thank you for it <b ↘>