Meta Speech-to-Speech Translation
In our increasingly interconnected world, where language differences may present a barrier to communication, translation systems can enable people from different linguistic backgrounds to share knowledge and experiences more seamlessly. However, many of these systems today do not preserve key elements of speech that make human communication human. More specifically, it’s not just the words we choose that convey what we want to say—it’s also how we speak them. Tone of voice, pauses, and emphasis carry important signals that help us communicate emotions and intent. Moreover, human speech and translation are sensitive to nuances such as turn-taking and timing. Picture, for example, how human interpreters work: they find just the right balance between low latency and accuracy. Waiting too long stifles the flow of communication, while going too fast compromises the overall quality of a translation. Translation systems that enable authentic conversations should deliver across all of these elements of communication.
Try the expressive translation demo
Today, we are excited to share Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real time. To build Seamless, we developed SeamlessExpressive, a model for preserving expression in speech-to-speech translation, and SeamlessStreaming, a streaming translation model that delivers state-of-the-art results with around two seconds of latency. All of the models are built on SeamlessM4T v2, the latest version of the foundational model we released in August. SeamlessM4T v2 demonstrates performance improvements for automatic speech recognition, speech-to-speech, speech-to-text, and text-to-speech capabilities. Compared to previous efforts in expressive speech research, SeamlessExpressive addresses certain underexplored aspects of prosody, such as speech rate and pauses for rhythm, while also preserving emotion and style. The model currently preserves these elements in speech-to-speech translation between English, Spanish, German, French, Italian, and Chinese.
SeamlessStreaming unlocks real-time conversations with someone who speaks a different language by generating the translation while the speaker is still talking. In contrast to conventional systems, which wait until the speaker has finished their sentence before translating, SeamlessStreaming begins translating mid-sentence, so the listener hears the translation in close to real time, with a delay of only a few seconds. SeamlessStreaming supports automatic speech recognition and speech-to-text translation for nearly 100 input and output languages, and speech-to-speech translation for nearly 100 input languages and 36 output languages. In keeping with our approach to open science, we’re publicly releasing all four models to allow researchers to build on this work.
Introducing metadata, data and data alignment tools
Today, alongside our models, we are releasing metadata, data and data alignment tools to assist the research community, including:
· Metadata of an extension of SeamlessAlign corresponding to an additional 115,000 hours of speech and text alignments on top of the existing 470,000 hours. In addition to more hours, the latest version of SeamlessAlign covers a broader range of languages (from 37 previously to 76 with the extension). This corpus is the largest public speech/speech and speech/text parallel corpus to date in terms of total volume and language coverage.
· Metadata of SeamlessAlignExpressive, an expressivity-focused version of the dataset above. In this dataset, the pairs are parallel from both a semantic and prosodic perspective. SeamlessAlignExpressive is released as a benchmark to validate our expressive alignment approach. In order to train our expressive models, we applied our alignment method to a proprietary dataset.
· Translated text data for mExpresso, a multilingual, parallel extension of the read speech in Expresso, a high-quality expressive speech dataset that includes both read speech and improvised dialogues rendered in different styles. This benchmark enables the evaluation of expressive translation systems from English into other languages.
· Tools to assist the research community in collecting more datasets for translation.
In particular, we are updating our stopes library and SONAR encoders. With these tools, anyone can automatically create multimodal translation pairs from their own speech and/or text monolingual data through parallel data alignment methods.
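As a rough illustration of how these encoders can be used, the sketch below mines candidate translation pairs by embedding sentences from two languages into the shared SONAR space and matching them by cosine similarity. It assumes the text embedding pipeline exposed by the SONAR package as documented in its repository; the full stopes mining pipeline builds on this idea with margin-based scoring, approximate nearest-neighbor indexing, and speech encoders.

```python
# A sketch of mining parallel pairs with SONAR sentence embeddings.
# Assumes the text pipeline from the SONAR package; the full stopes pipeline
# adds margin scoring, nearest-neighbor indexing, and speech encoders.
import torch
from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline

encoder = TextToEmbeddingModelPipeline(
    encoder="text_sonar_basic_encoder",
    tokenizer="text_sonar_basic_encoder",
)

english = ["The weather is nice today.", "Where is the train station?"]
spanish = ["¿Dónde está la estación de tren?", "Hoy hace buen tiempo."]

# Embed both sides into the shared SONAR space (languages given as FLORES codes).
emb_en = encoder.predict(english, source_lang="eng_Latn")
emb_es = encoder.predict(spanish, source_lang="spa_Latn")

# Cosine similarity between every English/Spanish sentence; keep the best match.
sim = torch.nn.functional.normalize(emb_en, dim=-1) @ torch.nn.functional.normalize(emb_es, dim=-1).T
for i, j in enumerate(sim.argmax(dim=-1).tolist()):
    print(f"{english[i]!r} <-> {spanish[j]!r} (cosine similarity {sim[i, j]:.2f})")
```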
Our approach
All our models run on fairseq2, the latest update of our sequence modeling toolkit. Similar to our previous work on SeamlessM4T, fairseq2 offers an ideal framework for building our streaming and expressivity updates because it is lightweight, easily composable with other PyTorch ecosystem libraries, and has more efficient modeling and data loader APIs.
UnitY2, a new architecture with a non-autoregressive text-to-unit decoder, is also instrumental to our work. In SeamlessM4T v2, we used multitask-UnitY2 to enable text input (updated from v1's multitask-UnitY). We also used the architecture for SeamlessStreaming and SeamlessExpressive. As our next-generation multitask model, UnitY2 has superior speech generation capabilities thanks to its improved text-to-unit model. This leads to improved consistency between text output and speech output compared to the SeamlessM4T v1 model.
Instead of using an autoregressive text-to-unit model as in UnitY, we used a non-autoregressive model. Autoregressive models predict the next token based on the previously generated tokens. While they are a natural fit for modeling speech, they scale poorly as sequence length increases and are more likely to exhibit repetitive degeneration. Non-autoregressive models instead predict the duration of each segment, which allows every segment to be decoded in parallel. This makes them robust to long sequences, and we see improvements over the initial iteration of UnitY. Because the model inherently predicts durations, it also adapts much more easily to the streaming use case: we know exactly how much speech needs to be generated for each piece of text, which is not the case for autoregressive models.
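The toy sketch below illustrates the core idea of duration-based, non-autoregressive unit decoding: predict a duration for each text token, expand the token states accordingly, and decode all acoustic units in parallel. The dimensions, single-layer "predictors," and unit vocabulary are placeholders for illustration, not the actual SeamlessM4T v2 layers.

```python
# Toy sketch of non-autoregressive unit decoding via duration prediction.
# Dimensions, module choices, and the unit vocabulary are illustrative only.
import torch
import torch.nn as nn

hidden_dim, num_units, text_len = 256, 1000, 8
text_states = torch.randn(1, text_len, hidden_dim)   # encoder states for the text tokens

duration_predictor = nn.Linear(hidden_dim, 1)         # predicts frames per text token
unit_decoder = nn.Linear(hidden_dim, num_units)       # predicts acoustic units in parallel

# 1) Predict how many unit frames each text token should expand into.
durations = duration_predictor(text_states).squeeze(-1).exp().round().clamp(min=1).long()

# 2) "Length regulate": repeat each token's state by its predicted duration.
expanded = torch.repeat_interleave(text_states[0], durations[0], dim=0)

# 3) Decode every unit frame in parallel -- no token-by-token autoregression,
#    and the total output length is known up front, which suits streaming.
units = unit_decoder(expanded).argmax(dim=-1)
print(durations[0].tolist(), units.shape)
```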
Streaming
EMMA is our core streaming algorithm. It allows us to intelligently decide when enough information has been received to generate the next target text or speech segment. It improves upon previous state-of-the-art algorithms, especially for the long input sequences typical of speech-to-text and speech-to-speech translation. The algorithm also lets us fine-tune from offline models, so we can reap the benefits of the SeamlessM4T v2 foundation model. Finally, we show empirically that it generalizes well across many different language pairs, which is particularly challenging for streaming models because language pairs may be structured very differently.
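Conceptually, a streaming translator alternates between READ actions (consume more source) and WRITE actions (emit output). EMMA learns this decision from data; the sketch below uses a hand-written wait-k-style rule instead, purely to show the control flow of such a read/write loop, with a stub in place of the incremental decoder.

```python
# Simplified read/write loop for simultaneous translation (wait-k style).
# EMMA learns the read/write decision; here it is a fixed "wait for k source
# chunks, then alternate" rule, used only to illustrate the control flow.
from typing import Iterator, List

def translate_chunk(source_so_far: List[str], outputs_so_far: List[str]) -> str:
    """Stand-in for the incremental decoder: emit one target token."""
    return f"tgt({len(outputs_so_far)})"

def streaming_translate(source_stream: Iterator[str], k: int = 3) -> Iterator[str]:
    source, outputs = [], []
    for chunk in source_stream:                  # READ: a new source chunk arrives
        source.append(chunk)
        if len(source) >= k + len(outputs):      # policy says we have enough context
            token = translate_chunk(source, outputs)   # WRITE one target token
            outputs.append(token)
            yield token
    while len(outputs) < len(source):            # source finished: flush the rest
        token = translate_chunk(source, outputs)
        outputs.append(token)
        yield token

for out in streaming_translate(iter(["s0", "s1", "s2", "s3", "s4"]), k=2):
    print(out)
```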
Expressivity
Preserving expression also requires a new approach. We replaced the unit HiFi-GAN vocoder in SeamlessM4T v2 with PRETSSEL, an expressive unit-to-speech generator. PRETSSEL is conditioned on the source speech during waveform generation in order to transfer tone, emotional expression, and vocal style. We initialize our model from SeamlessM4T v2 to achieve high translation quality, which is the most fundamental requirement for a speech-to-speech translation system. We also developed Prosody UnitY2, which integrates an expressivity encoder into SeamlessM4T v2 to guide unit generation with the proper rhythm, speaking rate, and pauses. In addition, we release a suite of evaluation tools to capture how well these aspects of expressivity are preserved.
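The placeholder modules below sketch this data flow under a simplifying assumption that it reduces to three stages: an expressivity encoder embeds the source speech, that embedding guides unit generation (rhythm, speaking rate, pauses), and the same embedding conditions waveform generation (tone and vocal style). The classes are hypothetical stand-ins for Prosody UnitY2 and PRETSSEL, not the released implementation.

```python
# Hypothetical wiring for expressive speech-to-speech generation.
# The nn.Module placeholders stand in for Prosody UnitY2 and PRETSSEL; only the
# conditioning pattern (what feeds into what) mirrors the description above.
import torch
import torch.nn as nn

class ExpressivityEncoder(nn.Module):
    def forward(self, source_speech):            # one expressivity embedding per utterance
        return source_speech.mean(dim=1)

class ProsodyUnitGenerator(nn.Module):           # stand-in for Prosody UnitY2
    def forward(self, text_states, expr_emb):
        # Units generated with rhythm, rate, and pauses guided by expr_emb.
        return torch.randint(0, 1000, (text_states.size(0), 50))

class ExpressiveVocoder(nn.Module):              # stand-in for PRETSSEL
    def forward(self, units, expr_emb):
        # Waveform generation conditioned on source expressivity (tone, style).
        return torch.randn(units.size(0), 16000)

source_speech = torch.randn(1, 200, 80)          # e.g. 200 mel frames of source audio
text_states = torch.randn(1, 12, 256)            # encoder states for the translated text

expr_emb = ExpressivityEncoder()(source_speech)
units = ProsodyUnitGenerator()(text_states, expr_emb)
waveform = ExpressiveVocoder()(units, expr_emb)
print(waveform.shape)
```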
Results
The updates to UnitY2 have resulted in improved translation quality across a variety of tasks. SeamlessM4T v2 achieves state-of-the-art results for speech-to-speech and speech-to-text translation in 100 languages. Within the same model, it also outperforms Whisper v3 for automatic speech recognition on average, and in particular for lower-resource languages.
For speech-to-text translation, SeamlessM4T v2 improves by 10% compared to the model we released in August and by more than 17% over the strongest cascaded models when translating into English. For speech-to-speech translation, SeamlessM4T v2 improves over SeamlessM4T (v1) by more than 15% when translating into English, and by 25% when translating from English.
In other tasks, SeamlessM4T v2 is on par with No Language Left Behind (NLLB) in text-to-text translation. In automatic speech recognition (ASR), it is on par with MMS on average, with better performance on mid- and high-resource languages while MMS performs better on low-resource languages, and it improves over the recently released Whisper-Large-v3 by more than 25%. In the zero-shot task of text-to-speech translation, SeamlessM4T v2 is on par with strong cascaded models when translating into English, and improves over these baselines by 16 percent when translating out of English.
We compared SeamlessExpressive against a cascaded speech-to-text and text-to-speech pipeline, where the speech-to-text component is SeamlessM4T v2 and the text-to-speech component is a strong open-source cross-lingual text-to-speech system that supports vocal style and emotion transfer. The results show that SeamlessExpressive is more robust to noise in the source speech, maintaining high content translation quality in the output while better preserving vocal style and speech rate. SeamlessStreaming achieves state-of-the-art translation quality at low latency for speech-to-speech translation.
How we built AI translation systems responsibly: Toxicity mitigation
Accuracy is paramount in translation systems. Translation errors or unintended toxicity can cause misunderstandings between two people who don’t speak the same language.
Keeping with our commitment to building responsible AI, we explored the problem of hallucinated toxicity further. We focused our efforts on SeamlessM4T v2, which serves as the foundation for SeamlessStreaming, SeamlessExpressive, and our unified Seamless model.
The primary root cause of hallucinated toxicity often lies in the training data. Training samples can be noisy and contain unbalanced toxicity: for example, the source side and the target side of a pair can mistakenly contain different amounts of toxic words. Prior to training, we discarded any sample that showed signs of this imbalance.
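A minimal sketch of this kind of filter is shown below: count toxic words on each side of a training pair using per-language word lists, and discard pairs where the counts do not match. The tiny word lists and the strict equality rule are illustrative stand-ins for the curated multilingual toxicity lists and thresholds used in practice.

```python
# Minimal sketch of toxicity-balance filtering for parallel training data.
# The toy word lists and the strict "equal counts" rule are illustrative only.
TOXIC_WORDS = {
    "en": {"idiot", "stupid"},
    "es": {"idiota", "estúpido"},
}

def toxic_count(text: str, lang: str) -> int:
    return sum(tok.strip(".,!?") in TOXIC_WORDS[lang] for tok in text.lower().split())

def keep_pair(src: str, tgt: str, src_lang: str, tgt_lang: str) -> bool:
    # Discard pairs where toxicity appears on one side but not the other.
    return toxic_count(src, src_lang) == toxic_count(tgt, tgt_lang)

pairs = [
    ("You are so kind.", "Eres muy amable."),
    ("You are an idiot.", "Eres muy amable."),   # toxicity only on the source side
]
filtered = [p for p in pairs if keep_pair(p[0], p[1], "en", "es")]
print(len(filtered), "of", len(pairs), "pairs kept")
```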
However, filtering is only a passive technique and does not fully prevent hallucinated toxicity. We went one step further this time, and implemented a novel approach that actively mitigates this phenomenon. During the translation generation process, our model automatically detects generated toxic words. When there are misaligned levels of toxicity, we automatically re-adjust the generation process and use a different choice of words. This works at inference time and does not require any fine-tuning of the translation model. By doing so, we significantly reduce added toxicity while preserving translation quality.
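The toy decoder below illustrates the mitigation idea: run generation once, check whether the hypothesis adds toxicity that is absent from the source, and if so re-decode with the toxic tokens masked out. The vocabulary and scoring function are dummies, and the real system applies this logic inside the beam search of the translation model, but the key property carries over: no fine-tuning is required.

```python
# Toy illustration of inference-time toxicity mitigation: if the first pass adds
# toxicity not present in the source, re-decode with toxic tokens masked out.
# The vocabulary and scorer are dummies; the real system works inside beam search.
TOXIC_TOKENS = {"idiot"}

def next_scores(prefix):
    # Dummy scorer that prefers a toxic word, so the mitigation visibly kicks in.
    return {"hello": 0.1, "friend": 0.3, "idiot": 0.5, "</s>": 0.3 * len(prefix)}

def greedy_decode(banned=frozenset()):
    prefix = []
    while len(prefix) < 10:
        scores = {w: s for w, s in next_scores(prefix).items() if w not in banned}
        word = max(scores, key=scores.get)
        if word == "</s>":
            break
        prefix.append(word)
    return prefix

source_is_toxic = False                  # e.g. no toxic words detected in the source
hypothesis = greedy_decode()
if not source_is_toxic and TOXIC_TOKENS & set(hypothesis):
    # Misaligned toxicity: re-adjust generation with the toxic tokens banned.
    hypothesis = greedy_decode(banned=TOXIC_TOKENS)
print(hypothesis)
```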
Finally, building upon our past work on toxicity and bias evaluation, we’ve extended our evaluation framework with a new hallucinated toxicity detection tool. While our previous approach relied on an intermediate transcription model (ASR), we are now capable of detecting toxicity directly in the speech signal. This is useful in cases where toxicity is not conveyed by individual words, but rather in tone or general style. This allows us to get a more precise picture of the potential toxicity profile of our model. Additional research needs to be done on responsible AI for machine translation; however, we believe these measures bring us closer to realizing safer and more human-centric translation systems.
Audio watermarking
While AI tools can help bring the world closer together, it’s just as important that we include measures to prevent the risk of imitation and other forms of misuse. Our watermarking method offers a better level of reliability compared to passive discriminators, which are becoming less effective at differentiating synthetic voices from human ones as voice preservation technology advances. Watermarking actively embeds a signal that is imperceptible to the human ear, but still detectable within the audio using a detector model. Through this watermark, the origin of the audio can be accurately traced. This helps promote the responsible use of voice preservation technology by establishing a verifiable audio provenance and helps prevent potential abuses.
Beyond sheer detection accuracy, our watermarking solution needs to be robust to various attacks. For example, bad actors can try to modify the audio by adding noise, echo, or filtering some frequencies to dilute the watermark and bypass detection. We tested our watermarking method against a broad range of attack types and the results show that it is more robust than the current state-of-the-art. Our method can also pinpoint AI-generated segments in audio down to the frame level, surpassing the previous state-of-the-art (which only provides a one second resolution).
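As a conceptual analogue (not our actual method, which relies on trained generator and detector networks), the toy below embeds a pseudo-random pattern keyed by a shared secret into every 20 ms frame and detects it by correlation, giving frame-level localization. The embedding strength is exaggerated here so that the naive correlation detector is reliable; a learned detector can operate at genuinely imperceptible levels while staying robust to the attacks described above.

```python
# Conceptual analogue of audio watermark embedding and frame-level detection.
# Real systems use trained neural generator/detector models; this classical
# spread-spectrum toy only illustrates the embed-then-localize idea.
import numpy as np

SR, FRAME = 16000, 320                       # 20 ms frames at 16 kHz
rng = np.random.default_rng(seed=1234)       # the seed plays the role of the secret key
pattern = rng.standard_normal(FRAME)
pattern /= np.linalg.norm(pattern)           # unit-norm keyed pattern

def embed(audio: np.ndarray, strength: float = 0.6) -> np.ndarray:
    """Add the keyed pattern to every frame (strength exaggerated for this toy)."""
    out = audio.copy()
    for start in range(0, len(out) - FRAME + 1, FRAME):
        out[start:start + FRAME] += strength * pattern
    return out

def detect(audio: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Correlate each frame with the keyed pattern: a per-frame presence mask."""
    scores = [float(np.dot(audio[s:s + FRAME], pattern))
              for s in range(0, len(audio) - FRAME + 1, FRAME)]
    return np.array(scores) > threshold

speech = rng.standard_normal(SR) * 0.1                 # 1 s of stand-in "speech"
mixed = np.concatenate([speech, embed(speech)])        # second half is watermarked
mask = detect(mixed)
print(mask[:50].sum(), "of 50 clean frames flagged")
print(mask[50:].sum(), "of 50 watermarked frames flagged")
```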
As with any neural-network-based safety mechanism, the watermarking model could be fine-tuned in isolation to forget its core properties. However, fine-tuning SeamlessExpressive and Seamless for translation purposes would not involve any update to the watermarking model itself, which plays no role in translation quality.
Providing access to our technology
The breakthroughs we’ve achieved with Seamless show that the dream of a universal, real-time translator isn’t science fiction—it’s becoming a technical reality. We invite everyone to try our expressive translation demo. We’re also making our code, model and data available to the research community.
Download the code, model, and data