Optimizing Speech Synthesis Quality with Speech Synthesis Markup Language (SSML)

Speech Synthesis Markup Language (SSML) is a markup language for controlling aspects of synthesized speech such as pauses, volume, pitch, speech rate, and the pronunciation of particular words. It is an XML-based standard from the World Wide Web Consortium (W3C) and is supported by many online speech synthesis services, including those from Google Cloud, AWS, and Alibaba Cloud. Compared with text-to-speech (TTS) from plain text, SSML gives finer control over how speech is synthesized, which can improve the quality of the result.

The SSML standard (latest version: 1.1) defines a set of tags for controlling speech synthesis. The tags are written in the same way as HTML or XML elements. A simple example:

<speak>
<p><s><say-as interpret-as="characters">12345</say-as>，<emphasis level="moderate">大家注意听了：</emphasis><break time="200ms"/></s>
<s><prosody rate="90%" pitch="+3st" volume="110%">我能吞下玻璃而不伤害身体。</prosody></s></p>
</speak>

Most cloud providers of online speech synthesis implement only a subset of the W3C standard. The tags in the example above are among the most widely supported: the <say-as> tag for describing how a piece of text should be interpreted, the <emphasis> and <prosody> tags for controlling emphasis and delivery, and the <p> and <s> tags for dividing text into paragraphs and sentences. Google Text-To-Speech provides an interactive demo page for testing the effect of your SSML code.
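Because each provider supports a different subset, it can help to check a document's tags before sending it to a service. The following is a minimal sketch in Python using only the standard library; the ASSUMED_SUPPORTED_TAGS set and the unsupported_tags function are hypothetical names for illustration and do not reflect any particular provider's list.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset for illustration only; consult your provider's documentation.
ASSUMED_SUPPORTED_TAGS = {
    "speak", "p", "s", "break", "emphasis", "prosody", "say-as", "sub",
}


def unsupported_tags(ssml, supported=ASSUMED_SUPPORTED_TAGS):
    """Return the set of tag names in `ssml` that are not in `supported`."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError if the
    # document is not well-formed XML.
    root = ET.fromstring(ssml)
    return {element.tag for element in root.iter()} - set(supported)


sample = (
    '<speak><p><s>Hello<break time="200ms"/></s>'
    '<s><voice name="demo">world</voice></s></p></speak>'
)
print(unsupported_tags(sample))  # {'voice'}
```

An empty result only means that every tag is in the assumed set; attribute values can still be rejected by a given service.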

Basic Structure

Let's start with the simplest case: place the content to be read aloud inside the <speak> tag. Wrapping plain text in <speak> is the quickest way to convert it to SSML.

Plain Text	SSML
I can eat glass, it does not hurt me	<speak>I can eat glass, it does not hurt me</speak>
Clearly, “5>3”	<speak>Clearly, “5>3”</speak>
Hey! Yes, you’ve already learned	<speak>Hey! Yes, you’ve already learned</speak>

Try it out on Google Text-To-Speech’s interactive testing page
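When the wrapping is done programmatically, keep in mind that SSML is XML, so characters such as & and < in the source text need to be escaped. A minimal sketch in Python, where text_to_ssml is a hypothetical helper name:

```python
from xml.sax.saxutils import escape


def text_to_ssml(text):
    """Wrap plain text in <speak>, escaping &, < and > for XML."""
    return f"<speak>{escape(text)}</speak>"


print(text_to_ssml("Clearly, 5>3"))  # <speak>Clearly, 5&gt;3</speak>
print(text_to_ssml("Tom & Jerry"))   # <speak>Tom &amp; Jerry</speak>
```

Strictly speaking, only & and < must be escaped in XML text content; escape() also converts >, which is harmless.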

Segmentation of Paragraphs and Sentences

SSML uses the <p> and <s> tags to mark paragraphs and sentences. In plain-text speech synthesis, a text analyzer infers paragraph and sentence boundaries with automated rules and inserts appropriate pauses. If the inferred pauses are not what you want, you can mark the boundaries explicitly with these tags:

<speak>
  <p>
    <s>This is the first sentence of this paragraph.</s>
    <s>This is another sentence.</s>
  </p>
</speak>

In some implementations, a paragraph boundary marked with the <p> tag produces the same pause as a <break> tag with strength="x-strong", and a sentence boundary marked with the <s> tag produces the same pause as a <break> tag with strength="strong".
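If the text is already split into paragraphs and sentences, the markup can be generated rather than typed by hand. The sketch below uses Python's xml.etree.ElementTree; the function name paragraphs_to_ssml and the list-of-lists input shape are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def paragraphs_to_ssml(paragraphs):
    """Build SSML from a list of paragraphs, each a list of sentence strings."""
    speak = ET.Element("speak")
    for sentences in paragraphs:
        p = ET.SubElement(speak, "p")
        for sentence in sentences:
            s = ET.SubElement(p, "s")
            s.text = sentence  # special characters are escaped on serialization
    return ET.tostring(speak, encoding="unicode")


print(paragraphs_to_ssml([
    ["This is the first sentence of this paragraph.", "This is another sentence."],
]))
```

The output is the same document as the hand-written example above, serialized on a single line without indentation.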

Break

To declare an explicit pause in the audio, use the <break> tag, which controls pauses or other prosodic boundaries between words. It has two attributes, both optional:

strength: The strength of the pause, one of “none”, “x-weak”, “weak”, “medium” (default), “strong”, or “x-strong”. The value “none” means that no prosodic pause should be produced at this point.
time: The duration of the pause, written in the time notation of CSS2, such as “3s” or “200ms”.

Shh<break time="1000ms"/>Don't make a sound<break time="1.5s"/>Did you hear that?

If a <break> tag appears without either attribute, the standard specifies that the synthesizer should produce a stronger pause than it would have produced without the tag. When both attributes are given, the time attribute determines the duration of the pause, since it is the more precise of the two.
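Malformed attribute values are easy to introduce when <break> tags are generated from code. The following Python sketch validates both attributes before emitting the tag; break_tag is a hypothetical helper, and the regular expression is a simplified approximation of the CSS2 time notation that accepts values such as "200ms" or "1.5s".

```python
import re

STRENGTHS = {"none", "x-weak", "weak", "medium", "strong", "x-strong"}
# Simplified pattern: digits, optional fraction, then "ms" or "s".
TIME_PATTERN = re.compile(r"^\d+(\.\d+)?(ms|s)$")


def break_tag(time=None, strength=None):
    """Return a <break/> tag string, validating the optional attributes."""
    attrs = []
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError(f"unknown strength: {strength!r}")
        attrs.append(f'strength="{strength}"')
    if time is not None:
        if not TIME_PATTERN.match(time):
            raise ValueError(f"invalid time: {time!r}")
        attrs.append(f'time="{time}"')
    if not attrs:
        return "<break/>"
    return "<break " + " ".join(attrs) + "/>"


print("Shh" + break_tag(time="1000ms") + "Don't make a sound")
# Shh<break time="1000ms"/>Don't make a sound
```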

Intonation, Volume, and Speech Rate

Emphasis

The <emphasis> tag marks text that should be emphasized. Its optional level attribute accepts “strong”, “moderate”, “none”, or “reduced”. “strong” and “moderate” are two degrees of emphasis, and how each one sounds is up to the speech synthesizer’s implementation; “none” tells the synthesizer not to apply any emphasis it would otherwise infer automatically; “reduced” asks for less emphasis than normal:

<emphasis level="strong">Important Notice</emphasis>

Advanced Control

To control pitch, speech rate, and synthesis volume, you need to use the <prosody> tag. This tag provides the following attributes:

pitch: The pitch of the spoken text. The “baseline pitch” is defined by each speech synthesizer, so the audible effect of a given value varies between implementations. Possible values are a frequency in Hz, a relative change such as “+3st” (semitones), or one of “x-low”, “low”, “medium”, “high”, “x-high”, or “default”; the labels from “x-low” to “x-high” form a non-decreasing sequence of pitches.
rate: The speech rate. The examples in this article use percentage values such as “80%” and “90%”; the standard also defines the labels “x-slow”, “slow”, “medium”, “fast”, “x-fast”, and “default”.
volume: The volume of the spoken text. Possible values are a percentage, one of the labels “silent”, “x-soft”, “soft”, “medium”, “loud”, “x-loud”, or “default”, or a signed decibel value of the form ±x dB, which expresses the change relative to the current volume. “default” corresponds to +0.0dB and “silent” to -∞dB.

He came over quietly and whispered to me, "<prosody pitch="x-low" rate="80%">Don't let anyone see.</prosody>"

The W3C standard also specifies further attributes for this tag, such as range and duration. For details on the <prosody> element, refer to the W3C specification.
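As a rough illustration, the Python sketch below builds a <prosody> element from optional attribute values and converts a decibel offset to an amplitude ratio. Both function names are hypothetical, and the conversion uses the conventional amplitude formula ratio = 10^(dB/20), under which +6dB is roughly a doubling; how a given synthesizer actually applies the value is up to its implementation.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text, pitch=None, rate=None, volume=None):
    """Wrap `text` in a <prosody> tag, including only the attributes given."""
    attrs = {"pitch": pitch, "rate": rate, "volume": volume}
    rendered = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in attrs.items()
        if value is not None
    )
    return f"<prosody{rendered}>{escape(text)}</prosody>"


def db_to_amplitude_ratio(db):
    """Convert a decibel offset to a linear amplitude ratio (10 ** (dB / 20))."""
    return 10 ** (db / 20)


print(prosody("Don't let anyone see.", pitch="x-low", rate="80%"))
# <prosody pitch="x-low" rate="80%">Don't let anyone see.</prosody>
print(round(db_to_amplitude_ratio(6.0), 2))  # 2.0
```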

Specifying Text Structure

SSML provides several mechanisms for describing what kind of text a span contains, so that you can control how specific pieces of text are read.

Substitution Reading

The <sub> tag replaces the text it contains with the text given in its alias attribute when the speech is synthesized:

The <sub alias="Hypertext Transfer Protocol">HTTP</sub> has been around for many years on the Internet.

In practice, the <sub> tag is commonly used to give a simpler reading for text that is hard to pronounce, as in the following example from the Google Text-To-Speech documentation:

<sub alias="にっぽんばし">日本橋</sub>
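When the same abbreviations recur across many texts, the substitutions can be applied from a lookup table. The following Python sketch is one possible approach; apply_aliases and the shape of the aliases dictionary are assumptions for this example, and the matching is a plain substring search, so a key such as "HTTP" would also match inside "HTTPS".

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text, aliases):
    """Wrap every occurrence of a key of `aliases` in a <sub alias="..."> tag."""
    if not aliases:
        return escape(text)
    # Try longer keys first so that overlapping keys prefer the longest match.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    parts = []
    last = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[last:match.start()]))
        term = match.group(0)
        parts.append(f"<sub alias={quoteattr(aliases[term])}>{escape(term)}</sub>")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(apply_aliases(
    "The HTTP has been around for many years.",
    {"HTTP": "Hypertext Transfer Protocol"},
))
# The <sub alias="Hypertext Transfer Protocol">HTTP</sub> has been around for many years.
```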

Literal Pronunciation

In addition to the <sub> tag, the standard provides the <say-as> tag for text whose pronunciation depends on its type, such as numbers, ordinals, dates, and currencies.

The <say-as> element has a mandatory interpret-as attribute, which determines how the enclosed text is pronounced. Depending on the interpret-as value, the optional format and detail attributes may also be used. The supported interpret-as values vary between speech synthesis providers; refer to the documentation of the specific service for details.

cardinal – Direct Pronunciation of Numbers

The following example reads as “Twelve thousand three hundred forty-five” (American English) or “Twelve thousand three hundred and forty-five” (British English):

<speak> <say-as interpret-as="cardinal">12345</say-as> </speak>

For Chinese text-to-speech, the number is read as a whole number rather than digit by digit, i.e., “一万两千三百四十五” (yī wàn liǎng qiān sān bǎi sì shí wǔ).

ordinal – Pronunciation of Ordinals

The following example reads as “First”:

<speak>
  <say-as interpret-as="ordinal">1</say-as>
</speak>

verbatim – Letter-by-Letter Pronunciation

The following example specifies that the digits and letters are read out one character at a time:

<speak>
  The license plate is <say-as interpret-as="verbatim">SBS4567</say-as>.
</speak>

telephone – Pronunciation of Telephone Numbers

The following example reads the number digit by digit in Mandarin, as “yāo sān líng yāo èr sān sì wǔ liù qī bā”, using yāo for 1 as is customary for telephone numbers:

<speak>
  我的手机号是<say-as interpret-as="telephone">13012345678</say-as>。
</speak>
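Tagging such numbers by hand does not scale to longer texts. The Python sketch below wraps anything that looks like a phone number in a <say-as> tag. The helper names are hypothetical, and the pattern (11 digits starting with 1, as in the example above) is an assumption that will not fit every numbering scheme.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed pattern: a standalone run of 11 digits starting with 1.
PHONE_PATTERN = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def say_as(text, interpret_as):
    """Wrap `text` in a <say-as> tag with the given interpret-as value."""
    return f"<say-as interpret-as={quoteattr(interpret_as)}>{escape(text)}</say-as>"


def mark_phone_numbers(text):
    """Return SSML in which every matched number is tagged as a telephone number."""
    parts = []
    last = 0
    for match in PHONE_PATTERN.finditer(text):
        parts.append(escape(text[last:match.start()]))
        parts.append(say_as(match.group(0), "telephone"))
        last = match.end()
    parts.append(escape(text[last:]))
    return "<speak>" + "".join(parts) + "</speak>"


print(mark_phone_numbers("Call 13012345678 now."))
# <speak>Call <say-as interpret-as="telephone">13012345678</say-as> now.</speak>
```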

Acknowledgements

Thanks to Taotie Languages.