Improving text-to-speech with automatic SSML tagging
In 2018, 24.8 million people in the UK (45.5%) listened to online audio. We believe this number will continue to grow, and with it the share of news readers who listen to audio narratives such as news articles.
SpeechKit was designed to help news publishers pivot to audio by offering audio versions of their news stories without the time and cost required to narrate them. This gives readers the choice of listening to news articles when reading is undesirable, and boosts engagement with audio-streaming-native demographics.
To keep costs low and the audio scalable, we’re using the newly released neural voices available through Amazon’s text-to-speech (TTS) service, Amazon Polly, to generate lifelike audio.
However, news brands seeking to use a TTS service such as Amazon Polly to deliver the best possible audio experience at scale will need to use SSML (Speech Synthesis Markup Language) tags. Not using SSML will, without a doubt, lead to a sub-par audio experience and dissatisfied listeners.
SSML gives you additional control over how Amazon generates speech from text. Enhancing audio with SSML involves inserting specific tags into the text. Doing this manually for a single news article takes time; doing so for every published article is almost impossible.
Amazon Polly supports SSML tags (see table below), but the service does not insert them for you. Choosing the right tags usually requires context, and Amazon did not develop Amazon Polly specifically for the news industry.
We’ve developed a middle-layer, called NewsNet, that, amongst other things, automates the SSML tagging process for news articles using a combination of rule-based and neural-network-based techniques.
This post will demonstrate the importance of using SSML when it comes to converting news articles into audio, and highlight the benefit to publishers of using SpeechKit to automate this process.
Amazon Polly accepts inputs as either plain text or SSML. For publishers using SpeechKit, NewsNet automatically cleans and converts all plain text from news articles into SSML, wrapping paragraphs, sentences and specific words and phrases in SSML tags, amongst a few other things we’ll discuss in another post.
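To make the plain-text versus SSML distinction concrete, here is a minimal sketch of how the request parameters differ. The helper name polly_request and the default voice are our own illustrative choices, not part of NewsNet; the dictionary keys are the parameter names of Polly's SynthesizeSpeech operation.

```python
def polly_request(text, voice_id="Joanna"):
    """Build keyword arguments for Polly's SynthesizeSpeech operation.

    Hypothetical helper for illustration only: it treats any input that
    starts with a <speak> root element as SSML and everything else as
    plain text.
    """
    is_ssml = text.lstrip().startswith("<speak")
    return {
        "Text": text,
        "TextType": "ssml" if is_ssml else "text",
        "VoiceId": voice_id,      # assumed voice; pick any neural voice
        "Engine": "neural",
        "OutputFormat": "mp3",
    }


plain = polly_request("Markets rallied on Friday.")
ssml = polly_request("<speak>Markets rallied on Friday.</speak>")
print(plain["TextType"], ssml["TextType"])
```

With AWS credentials configured, the resulting dictionary could be passed to boto3.client("polly").synthesize_speech(**kwargs).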
The first step is to wrap the text in a <speak> tag. This tells Amazon Polly to process the input as SSML. The second step is to indicate to Amazon Polly that the text should be read as a news item, using the <amazon:domain name="news"> tag. The third step, and this is where NewsNet starts to shine, is to tokenise all words, sentences and phrases in the text and apply specific SSML tags to them using either our hardcoded rules or neural nets.
The first of these are the <s> and <p> tags, which indicate whether a string of text is a sentence or a paragraph so that appropriate pauses are inserted into the speech. Periods are not always reliable segmentation points in news stories.
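The structural steps above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not NewsNet's implementation: the function names are hypothetical, paragraphs are assumed to be separated by blank lines, and the sentence splitter is a naive punctuation rule standing in for the more careful segmentation that news text needs.

```python
import re
from xml.sax.saxutils import escape


def split_sentences(paragraph):
    # Naive stand-in: split after ., ! or ? followed by whitespace.
    # Real news copy (abbreviations, quotes, initials) needs better rules.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def to_news_ssml(article):
    """Wrap an article in <speak>, the news domain tag, and <p>/<s> tags."""
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    body = []
    for paragraph in paragraphs:
        # escape() protects characters such as & and < that are reserved in SSML.
        sentences = "".join(
            "<s>{}</s>".format(escape(s)) for s in split_sentences(paragraph)
        )
        body.append("<p>{}</p>".format(sentences))
    inner = "".join(body)
    return '<speak><amazon:domain name="news">{}</amazon:domain></speak>'.format(inner)


print(to_news_ssml("Stocks rose. Bonds fell.\n\nOil was flat."))
```

Escaping the text before wrapping it matters because a stray ampersand in a headline would otherwise make the SSML invalid.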
Other SSML tags inserted using NewsNet include, but are not limited to, the <phoneme>, <sub>, <say-as> and <lang> tags.
In some cases, Amazon Polly struggles to pronounce specific words. This is quite common with brands, or in the case below with the president of South Africa. Over time we’ve added hundreds of words that are common in the news, along with their corrected pronunciations, to NewsNet so that they are detected, tagged with the <phoneme> tag and pronounced correctly.
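A rule-based version of this lookup might look like the sketch below. The lexicon, its single entry and the function name are hypothetical placeholders chosen for illustration (the IPA string is an illustrative transcription of "pecan", not an entry from NewsNet), and matching is a simple case-insensitive whole-word search.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lexicon: lowercase word -> IPA transcription.
LEXICON = {"pecan": "pɪˈkɑːn"}


def tag_pronunciations(text, lexicon=LEXICON):
    """Wrap every lexicon word found in `text` in a <phoneme> tag."""
    if not lexicon:
        return escape(text)
    # Longest entries first so multi-word names win over their parts.
    words = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE
    )
    out, last = [], 0
    for match in pattern.finditer(text):
        out.append(escape(text[last:match.start()]))
        ipa = lexicon[match.group(0).lower()]
        out.append(
            "<phoneme alphabet=\"ipa\" ph={}>{}</phoneme>".format(
                quoteattr(ipa), escape(match.group(0))
            )
        )
        last = match.end()
    out.append(escape(text[last:]))
    return "".join(out)


print(tag_pronunciations("A pecan pie & tea"))
```

Keeping the original word inside the tag means the text still reads normally if the markup is ever stripped.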
Different news domains use different acronyms and abbreviations that might sound unusual when spoken. NewsNet detects them and uses SSML to instruct Amazon Polly either to expand them into their full spoken form with the <sub> tag or to interpret the text in a particular way with the <say-as> tag.
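As a rough sketch of the rule-based side of this, the snippet below expands one abbreviation with a sub tag and asks for another to be read letter by letter with say-as. The two lookup tables and the function name are hypothetical examples, not NewsNet's actual rules, and only runs of two or more capital letters are considered.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical rules for illustration only.
EXPANSIONS = {"GDP": "gross domestic product"}  # read as the full phrase
SPELL_OUT = {"BBC"}                             # read letter by letter


def tag_abbreviations(text):
    """Tag known all-caps abbreviations with <sub> or <say-as>."""

    def replace(match):
        token = match.group(0)
        if token in EXPANSIONS:
            return "<sub alias={}>{}</sub>".format(
                quoteattr(EXPANSIONS[token]), token
            )
        if token in SPELL_OUT:
            return '<say-as interpret-as="characters">{}</say-as>'.format(token)
        return token  # unknown abbreviations are left untouched

    # Escape first: XML entities such as &amp; are lowercase, so the
    # uppercase-only pattern cannot match inside them.
    return re.sub(r"\b[A-Z]{2,}\b", replace, escape(text))


print(tag_abbreviations("UK GDP grew, the BBC said."))
```

A lookup table like this only covers abbreviations that are unambiguous; choosing between readings that depend on the surrounding story is where context-aware models would come in.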
TTS services can struggle when it comes to pronouncing foreign language words and phrases in news articles (quite common!). NewsNet detects foreign language words...