Google Text-to-Speech AI

Convert text into natural-sounding speech using an API powered by the best of Google’s AI technologies.

Google Cloud Text-to-Speech enables developers to synthesize natural-sounding speech with 100+ voices, available in multiple languages and variants.

It applies DeepMind’s groundbreaking research in WaveNet and Google’s powerful neural networks to deliver high-fidelity audio. With this easy-to-use API, you can create lifelike interactions with your users across many applications and devices.

Features and Demo:

All features

Custom Voice (beta)

Train a custom speech synthesis model using your own audio recordings to create a unique and more natural-sounding voice for your organization. You can define and choose the voice profile that suits your organization and quickly adjust to changes in voice needs without recording new phrases.

Voice and language selection

Choose from an extensive selection of 220+ voices across 40+ languages and variants, with more to come soon.

WaveNet voices

Take advantage of 90+ WaveNet voices built on DeepMind’s groundbreaking research to generate speech that significantly closes the gap with human performance.
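As a rough illustration of voice selection, the sketch below filters a voice list down to the WaveNet voices for one language. The sample data is hand-written for illustration and only assumed to match the shape of the API's voice listing (`name`, `languageCodes`, `ssmlGender`); the helper `wavenet_voices` is hypothetical.

```python
from typing import Dict, List

# Hand-written sample, assumed to mirror the shape of a voice-list response.
SAMPLE_RESPONSE: Dict[str, List[dict]] = {
    "voices": [
        {"name": "en-US-Wavenet-D", "languageCodes": ["en-US"], "ssmlGender": "MALE"},
        {"name": "en-US-Standard-B", "languageCodes": ["en-US"], "ssmlGender": "MALE"},
        {"name": "de-DE-Wavenet-A", "languageCodes": ["de-DE"], "ssmlGender": "FEMALE"},
    ]
}


def wavenet_voices(response: dict, language_code: str) -> List[str]:
    """Return names of WaveNet voices that support the given language code."""
    return [
        voice["name"]
        for voice in response.get("voices", [])
        if language_code in voice.get("languageCodes", [])
        and "Wavenet" in voice["name"]
    ]


print(wavenet_voices(SAMPLE_RESPONSE, "en-US"))
```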

Text and SSML support

Customize your speech with SSML tags that allow you to add pauses, control how numbers, dates, and times are read, and give other pronunciation instructions.
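A minimal sketch of what such SSML input might look like, built as a Python string. The helper `build_ssml` and the sample values are hypothetical; the `break` and `say-as` elements come from the SSML standard, and the exact attribute values a given voice supports should be checked against the SSML reference.

```python
from xml.sax.saxutils import escape


def build_ssml(greeting: str, amount: str, date: str) -> str:
    """Return an SSML document with a pause plus number and date hints."""
    return (
        "<speak>"
        f"{escape(greeting)}"
        '<break time="500ms"/>'  # half-second pause
        "Your balance is "
        f'<say-as interpret-as="cardinal">{escape(amount)}</say-as> points as of '
        f'<say-as interpret-as="date" format="mdy">{escape(date)}</say-as>.'
        "</speak>"
    )


# escape() protects characters such as "&" that would otherwise break the XML.
ssml = build_ssml("Hello & welcome.", "1250", "10/31/2024")
print(ssml)
```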

Pitch tuning

Personalize the pitch of your selected voice, up to 20 semitones above or below the default.

Speaking rate tuning

Adjust your speaking rate to be up to 4x faster or 4x slower than the normal rate.

Volume gain control

Increase the volume of the output by up to 16 dB or decrease it by up to 96 dB.
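The pitch, speaking rate, and volume limits above can be checked before a request is sent. The sketch below is one way to do that; the helper `make_audio_config` is hypothetical, and the field names (`pitch`, `speakingRate`, `volumeGainDb`, `audioEncoding`) are assumed to follow the REST API's JSON audio configuration.

```python
from typing import Dict, Tuple, Union

# Limits taken from the feature descriptions above.
PITCH_SEMITONES: Tuple[float, float] = (-20.0, 20.0)
SPEAKING_RATE: Tuple[float, float] = (0.25, 4.0)   # 4x slower to 4x faster
VOLUME_GAIN_DB: Tuple[float, float] = (-96.0, 16.0)


def _checked(name: str, value: float, bounds: Tuple[float, float]) -> float:
    """Return value as a float, or raise if it falls outside bounds."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return float(value)


def make_audio_config(
    pitch: float = 0.0,
    speaking_rate: float = 1.0,
    volume_gain_db: float = 0.0,
    encoding: str = "MP3",
) -> Dict[str, Union[str, float]]:
    """Build an audio configuration dict after validating the tuning values."""
    return {
        "audioEncoding": encoding,
        "pitch": _checked("pitch", pitch, PITCH_SEMITONES),
        "speakingRate": _checked("speaking_rate", speaking_rate, SPEAKING_RATE),
        "volumeGainDb": _checked("volume_gain_db", volume_gain_db, VOLUME_GAIN_DB),
    }


config = make_audio_config(pitch=-2.0, speaking_rate=1.25, volume_gain_db=3.0)
print(config)
```

Validating locally gives a clear error message early, rather than relying on the service to reject an out-of-range value.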

Integrated REST and gRPC APIs

Easily integrate with any application or device that can send a REST or gRPC request, including phones, PCs, tablets, and IoT devices (e.g., cars, TVs, speakers).
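To give a sense of what a REST call might involve, the sketch below builds (but does not send) a synthesis request using only the Python standard library. The endpoint URL, JSON field names, and bearer-token authentication are assumptions based on the v1 REST API; `build_synthesize_request` and the placeholder token are hypothetical.

```python
import json
import urllib.request
from typing import Optional

# Assumed v1 REST endpoint for speech synthesis.
ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"


def build_synthesize_request(
    text: str,
    access_token: str,
    language_code: str = "en-US",
    voice_name: Optional[str] = None,
) -> urllib.request.Request:
    """Build a POST request asking the service to synthesize plain text as MP3."""
    body = {
        "input": {"text": text},
        "voice": {"languageCode": language_code},
        "audioConfig": {"audioEncoding": "MP3"},
    }
    if voice_name:
        body["voice"]["name"] = voice_name
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


request = build_synthesize_request("Hello, world!", access_token="YOUR_ACCESS_TOKEN")
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request); that requires a valid
# token and a project with billing enabled, so it is left out of this sketch.
```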

Audio format flexibility

Convert text to MP3, Linear16, OGG Opus, and a number of other audio formats.
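A sketch of how a synthesized result might be saved in the requested format. It assumes the REST response carries base64-encoded audio in an `audioContent` field and that the encoding names are `MP3`, `LINEAR16`, and `OGG_OPUS`; the helper `save_audio` and the extension mapping are hypothetical.

```python
import base64
from pathlib import Path

# Assumed mapping from encoding name to a conventional file extension.
# LINEAR16 output is assumed to include a WAV header, hence ".wav".
EXTENSIONS = {"MP3": ".mp3", "LINEAR16": ".wav", "OGG_OPUS": ".ogg"}


def save_audio(response: dict, encoding: str, stem: str) -> Path:
    """Decode the base64 audio in response and write it to stem + extension."""
    audio_bytes = base64.b64decode(response["audioContent"])
    path = Path(stem + EXTENSIONS[encoding])
    path.write_bytes(audio_bytes)
    return path
```

For example, `save_audio(response, "OGG_OPUS", "greeting")` would write `greeting.ogg` in the current directory.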

Audio profiles

Optimize for the type of speaker from which your speech is intended to play, such as headphones or phone lines.
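As an illustration, an audio profile might be attached to an audio configuration like this. The profile IDs and the `effectsProfileId` field name are assumptions to verify against the current API reference; `with_audio_profile` is a hypothetical helper.

```python
from typing import Dict

# Assumed profile IDs for the two playback targets mentioned above.
PROFILES: Dict[str, str] = {
    "headphones": "headphone-class-device",
    "phone_line": "telephony-class-application",
}


def with_audio_profile(audio_config: dict, target: str) -> dict:
    """Return a copy of audio_config with the profile for target applied."""
    updated = dict(audio_config)
    updated["effectsProfileId"] = [PROFILES[target]]
    return updated


print(with_audio_profile({"audioEncoding": "MP3"}, "phone_line"))
```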

Text-to-Speech pricing

Text-to-Speech is priced per character, based on the number of characters sent to the service to be synthesized into audio each month. You must enable billing to use Text-to-Speech, and you will be charged automatically if your usage exceeds the number of free characters allowed per month. For information about how to keep track of your character totals, see Monitoring API usage.

The total number of characters in the input string is counted for billing purposes, including spaces. All Speech Synthesis Markup Language (SSML) tags except mark are also included in the character count. For example, this input string counts as 79 characters, including the SSML tags, newlines, and spaces:
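The counting rule itself can be approximated in a few lines. The sketch below is a hypothetical helper, not the service's billing logic and not the example string referred to above: it counts every character, including tags, spaces, and newlines, after removing `mark` tags. It assumes one character per Unicode code point, which may not match how the service counts in every case.

```python
import re

# Matches self-closing <mark .../> tags and empty <mark ...></mark> pairs.
_MARK_TAG = re.compile(r"<mark\b[^>]*?/>|<mark\b[^>]*>\s*</mark>")


def billable_characters(text_or_ssml: str) -> int:
    """Estimate billable characters: everything except mark tags is counted."""
    return len(_MARK_TAG.sub("", text_or_ssml))


print(billable_characters("Hello world"))
print(billable_characters('<speak>Hi<mark name="a"/> there</speak>'))
```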

