In Section 1, we present speech samples synthesized by ProsodyFM with controlled phrasing and intonation. The spectrograms and pitch contours of these samples are provided in our paper and in Appendix D.

In Section 2, we provide speech samples synthesized by ProsodyFM and four other SOTA models under both parallel and non-parallel settings.

In addition to the raw text to be synthesized, we also provide labels for phrase breaks and terminal intonation. <b ↘> marks a phrase break with a falling tone on the last word of the intonational phrase, <b ↗> marks a phrase break with a rising tone, and <b ➡️> marks a phrase break with a level tone.
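To make the label format concrete, the following is a minimal sketch of how the labeled text could be split into intonational phrases, each paired with its terminal tone. It is an illustration only, not part of ProsodyFM: the names `TONES`, `LABEL`, and `parse_labeled_text` are hypothetical, and the sketch assumes every label has the form `<b ARROW>` as described above.

```python
import re

# Hypothetical mapping from the arrow inside a label to a tone name.
# Escapes are used so the code does not depend on how the arrows render.
TONES = {
    "\u2198": "falling",  # south-east arrow
    "\u2197": "rising",   # north-east arrow
    "\u27a1": "level",    # rightwards arrow
}

# Assumed label shape: "<b", whitespace, one arrow token, ">".
LABEL = re.compile(r"<b\s+([^>\s]+)\s*>")


def parse_labeled_text(text):
    """Split labeled text into (phrase, tone) pairs.

    Each phrase is the text preceding a label; the tone comes from
    the arrow inside that label.
    """
    phrases = []
    start = 0
    for match in LABEL.finditer(text):
        phrase = text[start:match.start()].strip()
        # Drop the emoji variation selector that may follow the arrow.
        arrow = match.group(1).replace("\ufe0f", "")
        phrases.append((phrase, TONES.get(arrow, "unknown")))
        start = match.end()
    # Trailing text with no closing label has no annotated tone.
    tail = text[start:].strip()
    if tail:
        phrases.append((tail, None))
    return phrases
```

Applied to the text of Sample 1 below, this sketch would yield three phrases, ending in a falling, a rising, and a falling tone respectively.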



Section 1 Prosody Controllability


Sample 1: 1363_139304_000009_000005.wav from the LibriTTS corpus.

Text: Quite suddenly he rolled over <b ↘> stared for a moment <b ↗> and struggled into a sitting position <b ↘>

GT (vocoder)

gt_hifigan_1363_139304_000009_000005.wav

ProsodyFM (original)

prosodyfm_1363_139304_000009_000005.wav

We control the intonation of the word “moment”.

Level k=0

prosofyfm_level.wav

Rising k=+2

prosodyfm_rising_2.wav

Rising k=+4

prosodyfm_rising_4.wav

Falling k=-2

prosodyfm_falling_2.wav

Falling k=-4

prosodyfm_falling_4.wav

We remove the break between “over” and “stared”, or add a break after the word “for”.

Remove the break

prosodyfm_remove_break.wav

Add the break

prosodyfm_add_break.wav


Sample 2: 7302_86815_000052_000000.wav from the LibriTTS corpus.

Text: Well madam <b ↘> it will be a laudable action on your part <b ↘> and I will thank you for it <b ↘>