MARS5 TTS: Open Source Text to Speech with insane prosodic control! 🔥

  • Voice cloning with less than 5 seconds of audio

  • Two-stage Auto-Regressive (AR, 750M) + Non-Auto-Regressive (NAR, 450M) model architecture

  • Uses a BPE tokenizer to enable control over punctuation, pauses, stops, etc.

  • The AR model predicts L0 coarse tokens, which are refined by the NAR DDPM model and then passed to the vocoder.

Using a Byte Pair Encoding (BPE) tokenizer in a text-to-speech (TTS) model matters for handling the nuances of language. Because BPE works on raw text, it can keep punctuation and other surface cues in the token sequence while holding the vocabulary to a manageable size. The model can then use those tokens to shape pronunciation, pauses, and other prosodic features, which leads to more natural and expressive speech synthesis.
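To make the BPE idea concrete, here is a minimal toy sketch in Python. It is not the MARS5 tokenizer; the function names (`learn_bpe`, `apply_merge`, `encode`) and the tiny corpus are made up for illustration. It only shows the general technique: the most frequent adjacent symbol pairs get merged, and punctuation such as "..." stays in the token stream as its own symbols.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word with the newly learned merge.
        updated = Counter()
        for symbols, freq in words.items():
            updated[apply_merge(symbols, best)] += freq
        words = updated
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the learned merges in order."""
    tokens = []
    for word in text.split():
        symbols = tuple(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy corpus; punctuation is left attached to the words on purpose.
    corpus = "wait... wait, wait! no. no... no,"
    merges = learn_bpe(corpus, 10)
    print(merges)
    print(encode("wait... no!", merges))
```

Nothing is normalized away in this sketch: joining the tokens gives back the input text minus whitespace, so a model reading these tokens still sees the difference between "wait," and "wait...".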

Great job Camb AI team! Kudos for open sourcing the artifacts - looking forward to what comes next ;)

