Neural voices - Amazon Polly

Neural voices

Amazon Polly has a neural text-to-speech (NTTS) engine that can produce higher-quality voices than its standard voices. Standard TTS voices use concatenative synthesis: the standard engine concatenates phonemes of recorded speech, producing very natural-sounding synthesized speech. However, the inevitable variations in speech and the techniques used to segment the waveforms limit the quality of the speech. The Amazon Polly NTTS engine doesn't use standard concatenative synthesis to produce speech. It has two parts:

  • A neural network that converts a sequence of phonemes (the most basic units of language) into a sequence of spectrograms. (Spectrograms are snapshots of the energy levels in different frequency bands.)

  • A vocoder that converts spectrograms into a nearly continuous audio signal.

The first component of the neural TTS system is a sequence-to-sequence model. This model doesn't derive each output solely from the corresponding input element; it also considers how the elements of the input sequence work together. The model chooses the spectrograms that it outputs so that their frequency bands emphasize acoustic features that the human brain uses when processing speech.

The output of this model then passes to a neural vocoder, which converts the spectrograms into speech waveforms. When trained on the large datasets used to build general-purpose concatenative-synthesis systems, this sequence-to-sequence approach yields higher-quality, more natural-sounding voices.
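To make the idea of a spectrogram concrete, here is a small illustrative sketch. It is not how Amazon Polly computes anything; it only shows what "energy levels in different frequency bands" means for a single short frame of audio, using a plain discrete Fourier transform. A spectrogram is a sequence of such frames. The function name `frame_energies` and the frame size are assumptions chosen for the example.

```python
import cmath
import math


def frame_energies(samples):
    """Return the energy in each frequency bin for one frame of audio samples.

    Uses a direct discrete Fourier transform (O(n^2)), which is fine for a
    tiny demo. Only the first half of the bins is returned, because the
    second half mirrors it for real-valued input.
    """
    n = len(samples)
    energies = []
    for k in range(n // 2):
        total = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        energies.append(abs(total) ** 2)
    return energies


# One frame of a pure tone that completes exactly 4 cycles in 32 samples.
FRAME_SIZE = 32
tone = [math.sin(2 * math.pi * 4 * t / FRAME_SIZE) for t in range(FRAME_SIZE)]

energies = frame_energies(tone)
loudest_bin = max(range(len(energies)), key=energies.__getitem__)
print(loudest_bin)  # the tone's energy is concentrated in bin 4
```

Real speech spreads its energy over many bins at once, and those per-frame energy patterns are the kind of data a vocoder turns back into a waveform.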

Available neural voices

Neural voices are available in 35 languages and language variants. The following table lists the voices.

No.  Language and language variants  Language code  Name/ID          Gender
1    Arabic (Gulf)                   ar-AE          Hala             Female
                                                    Zayd             Male
2    Belgian Dutch (Flemish)         nl-BE          Lisa             Female
3    Catalan                         ca-ES          Arlet            Female
4    Czech                           cs-CZ          Jitka            Female
5    Chinese (Cantonese)             yue-CN         Hiujin           Female
6    Chinese (Mandarin)              cmn-CN         Zhiyu            Female
7    Danish                          da-DK          Sofie            Female
8    Dutch                           nl-NL          Laura            Female
9    English (Australian)            en-AU          Olivia           Female
10   English (British)               en-GB          Amy*             Female
                                                    Emma             Female
                                                    Brian            Male
                                                    Arthur           Male
11   English (Indian)                en-IN          Kajal            Female
12   English (Irish)                 en-IE          Niamh            Female
13   English (New Zealand)           en-NZ          Aria             Female
14   English (South African)         en-ZA          Ayanda           Female
15   English (US)                    en-US          Danielle         Female
                                                    Gregory          Male
                                                    Ivy              Female (child)
                                                    Joanna*          Female
                                                    Kendra           Female
                                                    Kimberly         Female
                                                    Salli            Female
                                                    Joey             Male
                                                    Justin           Male (child)
                                                    Kevin            Male (child)
                                                    Matthew*         Male
                                                    Ruth             Female
                                                    Stephen          Male
16   Finnish                         fi-FI          Suvi             Female
17   French (Belgian)                fr-BE          Isabelle         Female
18   French (Canadian)               fr-CA          Gabrielle        Female
                                                    Liam             Male
19   French                          fr-FR          Léa              Female
                                                    Rémi             Male
20   German                          de-DE          Vicki            Female
                                                    Daniel           Male
21   German (Austrian)               de-AT          Hannah           Female
22   German (Swiss)                  de-CH          Sabrina          Female
23   Hindi                           hi-IN          Kajal            Female
24   Italian                         it-IT          Bianca           Female
                                                    Adriano          Male
25   Japanese                        ja-JP          Takumi           Male
                                                    Kazuha           Female
                                                    Tomoko           Female
26   Korean                          ko-KR          Seoyeon          Female
27   Norwegian                       nb-NO          Ida              Female
28   Polish                          pl-PL          Ola              Female
29   Portuguese (Brazilian)          pt-BR          Camila           Female
                                                    Vitória/Vitoria  Female
                                                    Thiago           Male
30   Portuguese (European)           pt-PT          Inês/Ines        Female
31   Spanish (Spain)                 es-ES          Lucia            Female
                                                    Sergio           Male
32   Spanish (Mexican)               es-MX          Mia              Female
                                                    Andrés           Male
33   Spanish (US)                    es-US          Lupe*            Female
                                                    Pedro            Male
34   Swedish                         sv-SE          Elin             Female
35   Turkish                         tr-TR          Burcu            Female

*The Amy, Joanna, Lupe, and Matthew voices can be used with the Newscaster speaking style. For more information, see Applying the newscaster voice.
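Because the set of voices changes over time, it can be useful to query the current list rather than hard-code it. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its `describe_voices` operation with the `Engine` and `LanguageCode` parameters. The helper name `list_neural_voices` is hypothetical. The helper accepts any client object, so it can be exercised without network access.

```python
def list_neural_voices(polly_client, language_code=None):
    """Return (voice ID, gender, language code) tuples for neural voices.

    `polly_client` is expected to behave like boto3.client("polly"): it must
    provide describe_voices(), which returns a dict with a "Voices" list and
    an optional "NextToken" when more results are available.
    """
    params = {"Engine": "neural"}
    if language_code is not None:
        params["LanguageCode"] = language_code

    voices = []
    while True:
        page = polly_client.describe_voices(**params)
        for voice in page.get("Voices", []):
            voices.append((voice["Id"], voice["Gender"], voice["LanguageCode"]))
        next_token = page.get("NextToken")
        if not next_token:
            return voices
        # Ask for the next page of results on the following iteration.
        params["NextToken"] = next_token


def main():
    # Requires the boto3 package and AWS credentials. The import lives here
    # so that the helper above stays usable without them.
    import boto3

    client = boto3.client("polly", region_name="us-east-1")
    for voice_id, gender, code in list_neural_voices(client, "en-GB"):
        print(voice_id, gender, code)
```

Calling `main()` with valid credentials prints the neural voices currently offered for English (British); the exact output depends on the service at the time of the call.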

Feature and region compatibility

Neural voices aren't available in all AWS Regions, nor do they support all Amazon Polly features.

Neural voices are supported in the following Regions:

  • US East (N. Virginia): us-east-1

  • US West (Oregon): us-west-2

  • Africa (Cape Town): af-south-1

  • Asia Pacific (Tokyo): ap-northeast-1

  • Asia Pacific (Seoul): ap-northeast-2

  • Asia Pacific (Osaka): ap-northeast-3

  • Asia Pacific (Mumbai): ap-south-1

  • Asia Pacific (Singapore): ap-southeast-1

  • Asia Pacific (Sydney): ap-southeast-2

  • Canada (Central): ca-central-1

  • Europe (Frankfurt): eu-central-1

  • Europe (Ireland): eu-west-1

  • Europe (London): eu-west-2

  • Europe (Paris): eu-west-3

  • AWS GovCloud (US-West): us-gov-west-1

Endpoints and protocols for these Regions are identical to those used for standard voices. For more information, see Amazon Polly endpoints and quotas.
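An application that lets users pick a Region may want to check neural voice support before sending a request. The sketch below hard-codes the Region list from this page as a snapshot; the names `NEURAL_VOICE_REGIONS`, `supports_neural_voices`, and `pick_region` are hypothetical, and the list should be refreshed from the documentation rather than treated as permanent.

```python
# Snapshot of the Regions listed on this page as supporting neural voices.
NEURAL_VOICE_REGIONS = frozenset({
    "us-east-1", "us-west-2", "af-south-1",
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3",
    "ap-south-1", "ap-southeast-1", "ap-southeast-2",
    "ca-central-1", "eu-central-1",
    "eu-west-1", "eu-west-2", "eu-west-3",
    "us-gov-west-1",
})


def supports_neural_voices(region_name):
    """Return True if the Region code appears in the snapshot above."""
    return region_name.strip().lower() in NEURAL_VOICE_REGIONS


def pick_region(preferred, fallback="us-east-1"):
    """Return the preferred Region if it supports neural voices, else the fallback."""
    if supports_neural_voices(preferred):
        return preferred.strip().lower()
    return fallback


print(pick_region("eu-west-1"))
print(pick_region("sa-east-1"))
```

Here `sa-east-1` is used only as an example of a Region code that does not appear in the list above, so the second call falls back to `us-east-1`.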

The following features are supported for neural voices:

  • Real-time and asynchronous speech synthesis operations.

  • Newscaster speaking style. For more information about the speaking styles, see Applying the newscaster voice.

  • All speech marks.

  • Many (but not all) of the SSML tags that are supported by Amazon Polly. For more information about NTTS-supported SSML tags, see Supported Tags.
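The Newscaster speaking style is requested through SSML. As a minimal sketch, the helper below wraps text in the `<amazon:domain name="news">` tag and rejects any voice other than the four that this page marks as supporting the style. The helper name `newscaster_ssml` is hypothetical.

```python
from xml.sax.saxutils import escape

# Voices this page marks (with an asterisk) as supporting the Newscaster style.
NEWSCASTER_VOICES = frozenset({"Amy", "Joanna", "Lupe", "Matthew"})


def newscaster_ssml(voice_id, text):
    """Return an SSML document that applies the Newscaster style to `text`.

    Raises ValueError for voices not listed as supporting the style.
    Characters that are special in XML (&, <, >) are escaped.
    """
    if voice_id not in NEWSCASTER_VOICES:
        raise ValueError(f"{voice_id!r} is not listed as a Newscaster voice")
    return (
        '<speak><amazon:domain name="news">'
        + escape(text)
        + "</amazon:domain></speak>"
    )


print(newscaster_ssml("Joanna", "Markets & more"))
# <speak><amazon:domain name="news">Markets &amp; more</amazon:domain></speak>
```

When the resulting string is sent to the service, the request also needs to declare the input as SSML (in boto3, `TextType="ssml"`) and use the neural engine.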

As with standard voices, you can choose from various sampling rates to optimize the bandwidth and audio quality for your application. Valid sampling rates for standard and neural voices are 8 kHz, 16 kHz, 22 kHz, or 24 kHz. The default for standard voices is 22 kHz. The default for neural voices is 24 kHz. Amazon Polly supports MP3, OGG (Vorbis), and raw PCM audio stream formats.
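The following sketch shows one way to assemble a synthesis request that selects the neural engine and an explicit sampling rate. It assumes boto3's `synthesize_speech` operation and its `Engine`, `OutputFormat`, `SampleRate`, `Text`, `TextType`, and `VoiceId` parameters. The helper names are hypothetical. Two details come from the API reference rather than from this page and should be verified there: sampling rates are passed as strings in Hz (so 22 kHz is `"22050"`), and raw PCM accepts only `"8000"` and `"16000"`.

```python
# Sampling rates as the API expects them: strings in Hz.
_COMPRESSED_RATES = ("8000", "16000", "22050", "24000")
VALID_SAMPLE_RATES = {
    "mp3": _COMPRESSED_RATES,
    "ogg_vorbis": _COMPRESSED_RATES,
    # Assumption taken from the API reference: PCM is limited to these two.
    "pcm": ("8000", "16000"),
}


def build_synthesis_request(text, voice_id, engine="neural",
                            output_format="mp3", sample_rate=None,
                            text_type="text"):
    """Build keyword arguments for a synthesize_speech call.

    If `sample_rate` is None, the key is omitted so the service applies its
    default (24 kHz for neural voices, 22 kHz for standard voices).
    """
    if output_format not in VALID_SAMPLE_RATES:
        raise ValueError(f"unsupported output format: {output_format!r}")

    request = {
        "Engine": engine,
        "OutputFormat": output_format,
        "Text": text,
        "VoiceId": voice_id,
    }
    if text_type != "text":
        request["TextType"] = text_type
    if sample_rate is not None:
        rate = str(sample_rate)
        if rate not in VALID_SAMPLE_RATES[output_format]:
            raise ValueError(
                f"sample rate {rate} is not valid for {output_format}"
            )
        request["SampleRate"] = rate
    return request


def synthesize_to_file(request, path, region_name="us-east-1"):
    """Send the request with boto3 and write the audio stream to `path`.

    Requires the boto3 package and AWS credentials.
    """
    import boto3

    client = boto3.client("polly", region_name=region_name)
    response = client.synthesize_speech(**request)
    with open(path, "wb") as out:
        out.write(response["AudioStream"].read())
```

For example, `synthesize_to_file(build_synthesis_request("Hello", "Joanna", sample_rate=16000), "hello.mp3")` would request a 16 kHz MP3 from the neural Joanna voice. Validating the rate locally gives an early error instead of a failed service call.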