Neural voices
Amazon Polly has a Neural text-to-speech (NTTS) engine that can produce even higher-quality voices than its standard voices. Standard TTS voices use concatenative synthesis: the standard engine concatenates phonemes of recorded speech, producing very natural-sounding synthesized speech. However, the inevitable variations in speech and the techniques used to segment the waveforms limit the quality of the resulting speech. The Amazon Polly NTTS engine doesn't use standard concatenative synthesis to produce speech. It has two parts:
- A neural network that converts a sequence of phonemes (the most basic units of language) into a sequence of spectrograms. (Spectrograms are snapshots of the energy levels in different frequency bands.)
- A vocoder that converts spectrograms into a nearly continuous audio signal.
The first component of the neural TTS system is a sequence-to-sequence model. This model doesn't generate each output solely from the corresponding input element; it also considers how the elements of the input sequence work together. The model chooses the spectrograms that it outputs so that their frequency bands emphasize acoustic features that the human brain uses when processing speech.
The output of this model then passes to a neural vocoder, which converts the spectrograms into speech waveforms. When trained on the large datasets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach yields higher-quality, more natural-sounding voices.
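To make the notion of a spectrogram concrete, the following is a toy sketch that splits a signal into frames and measures the energy in each frequency band of every frame. It is only an illustration of "snapshots of the energy levels in different frequency bands" and is not how the Amazon Polly NTTS engine is implemented. The function name `frame_energies`, the frame size, and the test tone are all assumptions chosen for the example.

```python
import cmath
import math


def frame_energies(samples, frame_size):
    """Return one energy spectrum per non-overlapping frame (a toy spectrogram).

    Each row is a snapshot in time; each column is a frequency band.
    Uses a plain discrete Fourier transform, which is slow but dependency-free.
    """
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        spectrum = []
        # Bands 0 .. frame_size/2 cover 0 Hz up to half the sample rate.
        for k in range(frame_size // 2 + 1):
            coeff = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            spectrum.append(abs(coeff) ** 2)  # energy in band k
        spectrogram.append(spectrum)
    return spectrogram


# Hypothetical input: a pure 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
TONE_HZ = 1000

tone = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(256)]
spec = frame_energies(tone, FRAME_SIZE)

# Index of the strongest band in each frame.
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in spec]
# Band k is centered at k * SAMPLE_RATE / FRAME_SIZE Hz, so band 8 is 1000 Hz.
```

In this sketch every frame peaks in the band centered on the tone's frequency. A neural TTS model produces a sequence of such snapshots from phonemes, and a vocoder performs the reverse step of turning them into a waveform.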
Available neural voices
Neural voices are available in 35 languages and language variants. The following table lists the voices.
| | Language and language variants | Language code | Name/ID | Gender |
|---|---|---|---|---|
| 1 | Arabic (Gulf) | ar-AE | Hala, Zayd | Female, Male |
| 2 | Belgian Dutch (Flemish) | nl-BE | Lisa | Female |
| 3 | Catalan | ca-ES | Arlet | Female |
| 4 | Czech | cs-CZ | Jitka | Female |
| 5 | Chinese (Cantonese) | yue-CN | Hiujin | Female |
| 6 | Chinese (Mandarin) | cmn-CN | Zhiyu | Female |
| 7 | Danish | da-DK | Sofie | Female |
| 8 | Dutch | nl-NL | Laura | Female |
| 9 | English (Australian) | en-AU | Olivia | Female |
| 10 | English (British) | en-GB | Amy*, Emma, Brian, Arthur | Female, Female, Male, Male |
| 11 | English (Indian) | en-IN | Kajal | Female |
| 12 | English (Irish) | en-IE | Niamh | Female |
| 13 | English (New Zealand) | en-NZ | Aria | Female |
| 14 | English (South African) | en-ZA | Ayanda | Female |
| 15 | English (US) | en-US | Danielle, Gregory, Ivy, Joanna*, Kendra, Kimberly, Salli, Joey, Justin, Kevin, Matthew*, Ruth, Stephen | Female, Male, Female (child), Female, Female, Female, Female, Male, Male (child), Male (child), Male, Female, Male |
| 16 | Finnish | fi-FI | Suvi | Female |
| 17 | French (Belgian) | fr-BE | Isabelle | Female |
| 18 | French (Canadian) | fr-CA | Gabrielle, Liam | Female, Male |
| 19 | French | fr-FR | Léa, Rémi | Female, Male |
| 20 | German | de-DE | Vicki, Daniel | Female, Male |
| 21 | German (Austrian) | de-AT | Hannah | Female |
| 22 | German (Swiss) | de-CH | Sabrina | Female |
| 23 | Hindi | hi-IN | Kajal | Female |
| 24 | Italian | it-IT | Bianca, Adriano | Female, Male |
| 25 | Japanese | ja-JP | Takumi, Kazuha, Tomoko | Male, Female, Female |
| 26 | Korean | ko-KR | Seoyeon | Female |
| 27 | Norwegian | nb-NO | Ida | Female |
| 28 | Polish | pl-PL | Ola | Female |
| 29 | Portuguese (Brazilian) | pt-BR | Camila, Vitória/Vitoria, Thiago | Female, Female, Male |
| 30 | Portuguese (European) | pt-PT | Inês/Ines | Female |
| 31 | Spanish (Spain) | es-ES | Lucia, Sergio | Female, Male |
| 32 | Spanish (Mexican) | es-MX | Mia, Andrés | Female, Male |
| 33 | Spanish (US) | es-US | Lupe*, Pedro | Female, Male |
| 34 | Swedish | sv-SE | Elin | Female |
| 35 | Turkish | tr-TR | Burcu | Female |
*The Amy, Joanna, Lupe, and Matthew voices can be used with the Newscaster speaking style. For more information, see Applying the newscaster voice.
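As a minimal sketch of how the Newscaster restriction might be enforced in application code, the helper below wraps text in SSML only for the four voices named in the note above. It assumes the `<amazon:domain name="news">` tag described in the Amazon Polly SSML documentation; the function name `newscaster_ssml` and the constant `NEWSCASTER_VOICES` are hypothetical and not part of any AWS SDK.

```python
from xml.sax.saxutils import escape

# Voices that the note above lists as supporting the Newscaster style.
NEWSCASTER_VOICES = frozenset({"Amy", "Joanna", "Lupe", "Matthew"})


def newscaster_ssml(text, voice_id):
    """Wrap plain text in Newscaster SSML, rejecting unsupported voices.

    Assumes the amazon:domain tag with name="news" selects the style.
    """
    if voice_id not in NEWSCASTER_VOICES:
        raise ValueError(f"{voice_id} does not support the Newscaster style")
    # Escape &, <, and > so the text cannot break the SSML markup.
    body = escape(text)
    return f'<speak><amazon:domain name="news">{body}</amazon:domain></speak>'


ssml = newscaster_ssml("Markets rose & closed higher.", "Joanna")
```

The resulting string would be sent with the text type set to SSML and the neural engine selected, because the style applies only to neural voices.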
Feature and region compatibility
Neural voices aren't available in all AWS Regions, nor do they support all Amazon Polly features.
Neural voices are supported in the following Regions:
- US East (N. Virginia): us-east-1
- US West (Oregon): us-west-2
- Africa (Cape Town): af-south-1
- Asia Pacific (Tokyo): ap-northeast-1
- Asia Pacific (Seoul): ap-northeast-2
- Asia Pacific (Osaka): ap-northeast-3
- Asia Pacific (Mumbai): ap-south-1
- Asia Pacific (Singapore): ap-southeast-1
- Asia Pacific (Sydney): ap-southeast-2
- Canada (Central): ca-central-1
- Europe (Frankfurt): eu-central-1
- Europe (Ireland): eu-west-1
- Europe (London): eu-west-2
- Europe (Paris): eu-west-3
- AWS GovCloud (US-West): us-gov-west-1
Endpoints and protocols for these Regions are identical to those used for standard voices. For more information, see Amazon Polly endpoints and quotas.
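An application that can run in several Regions might check this list before requesting a neural voice. The following is a small sketch under that assumption; the names `NEURAL_REGIONS` and `pick_engine` are hypothetical, and the set simply mirrors the Regions listed above, so it would need updating whenever that list changes.

```python
# Region codes copied from the list above.
NEURAL_REGIONS = frozenset({
    "us-east-1", "us-west-2", "af-south-1",
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3",
    "ap-south-1", "ap-southeast-1", "ap-southeast-2",
    "ca-central-1", "eu-central-1",
    "eu-west-1", "eu-west-2", "eu-west-3",
    "us-gov-west-1",
})


def pick_engine(region):
    """Return "neural" where the list above allows it, else "standard"."""
    return "neural" if region in NEURAL_REGIONS else "standard"
```

Falling back to the standard engine is one possible design choice; an application could equally raise an error or route the request to a supported Region.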
The following features are supported for neural voices:
- Real-time and asynchronous speech synthesis operations.
- Newscaster speaking style. For more information about the speaking styles, see Applying the newscaster voice.
- All speech marks.
- Many (but not all) of the SSML tags that are supported by Amazon Polly. For more information about NTTS-supported SSML tags, see Supported Tags.
As with standard voices, you can choose from various sampling rates to optimize the bandwidth and audio quality for your application. Valid sampling rates for standard and neural voices are 8 kHz, 16 kHz, 22 kHz, or 24 kHz. The default for standard voices is 22 kHz. The default for neural voices is 24 kHz. Amazon Polly supports MP3, OGG (Vorbis), and raw PCM audio stream formats.
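The sketch below shows one way to assemble the parameters for a synthesis request while checking the engine, format, and sampling rate against the values given above. It assumes the `SynthesizeSpeech` parameter names used by the AWS SDK for Python (`Engine`, `OutputFormat`, `SampleRate`, `Text`, `VoiceId`), that the rate is passed as a string in Hz, and that "22 kHz" corresponds to `22050`. The function name `build_request` is hypothetical.

```python
ENGINES = frozenset({"standard", "neural"})
OUTPUT_FORMATS = frozenset({"mp3", "ogg_vorbis", "pcm"})
# 8, 16, 22, and 24 kHz expressed in Hz (22 kHz assumed to mean 22050 Hz).
SAMPLE_RATES_HZ = frozenset({8000, 16000, 22050, 24000})


def build_request(text, voice_id, engine="neural",
                  output_format="mp3", sample_rate_hz=None):
    """Build keyword arguments for a speech synthesis call.

    When sample_rate_hz is None the key is omitted, so the service applies
    its own default (22 kHz for standard voices, 24 kHz for neural voices).
    This check only mirrors the rates listed above; the service may apply
    further per-format limits.
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {output_format}")
    params = {
        "Engine": engine,
        "OutputFormat": output_format,
        "Text": text,
        "VoiceId": voice_id,
    }
    if sample_rate_hz is not None:
        if sample_rate_hz not in SAMPLE_RATES_HZ:
            raise ValueError(f"unsupported sampling rate: {sample_rate_hz}")
        params["SampleRate"] = str(sample_rate_hz)
    return params


params = build_request("Hello from a neural voice.", "Joanna",
                       sample_rate_hz=16000)
# With boto3 installed and credentials configured, the call would look like:
#   import boto3
#   polly = boto3.client("polly", region_name="us-east-1")
#   audio = polly.synthesize_speech(**params)["AudioStream"].read()
```

Keeping the validation separate from the network call makes it possible to test the request shape without AWS credentials.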