Moshi by Kyutai
Kyutai unveils today the very first voice-enabled AI openly accessible to all
An open speech-to-speech model, released before the closed GPT-4o.

Architecture
1. 7B multimodal LM (speech in, speech out).
2. Two-channel I/O: the streaming LM constantly generates text tokens as well as audio codec tokens (tunable); see the sketch after this summary.
3. Achieves 160 ms latency (with a real-time factor of 2).
4. The base text language model is a 7B trained from scratch: Helium 7B.
5. Helium 7B is then jointly trained on text and audio codec tokens.
6. The speech codec is based on Mimi, their in-house audio compression model.
7. Mimi is a VQ-VAE capable of a 300x compression factor, trained on both semantic and acoustic information.
8. The text-to-speech engine supports 70 different emotions and styles, such as whispering, accents, personas, etc.

Training / RLHF
1. The model is fine-tuned on 100K transcripts generated by Helium itself.
2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational.
3. The text-to-speech engine is further fine-tuned on 20 hours of licensed audio recorded by Alice.
4. The model can be fine-tuned with less than 30 minutes of audio.
5. Safety: generated audio is watermarked (possibly with AudioSeal) and indexed in a database.
6. Trained on a Scaleway cluster of 1,000 H100 GPUs.

Inference
1. The deployed demo model runs at batch size 2 (bs=2) within 24 GB of VRAM (hosted on Scaleway and Hugging Face).
2. The model supports 4-bit and 8-bit quantisation; see the arithmetic after this summary.
3. Works across backends: CUDA, Metal, CPU.
4. Inference code optimised with Rust.
5. Further savings to be made with better KV caching, prompt caching, etc.

Future plans
1. Short term: a technical report and open model releases.
2. The open releases would include the inference codebase, the 7B model, the audio codec and the full optimised stack.
3. Scale and refine the model based on feedback; expect Moshi 1.1, 1.2, 2.0.
4. Licenses as permissive as they can be (yet to be decided).

Just 8 team members put all of this together!

After using it IRL, it feels magical to have such a quick response. It opens so many avenues: research assistance, brainstorming/steelman discussion points, language learning, and, more importantly, it's on-device with the flexibility to use it however you want!
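To make the two-channel streaming idea from the Architecture notes concrete, here is a minimal, illustrative sketch (not Kyutai's code): at every audio frame the model emits one text token alongside a set of audio codec tokens, and a real-time factor of 2 means each frame is generated in roughly half its duration, which is what keeps latency in the ~160 ms range. The frame duration, number of codebooks, and all class and function names below are assumptions made for the example.

```python
import time
from dataclasses import dataclass

FRAME_MS = 80     # assumed codec frame duration (illustrative)
CODEBOOKS = 8     # assumed number of codec tokens per frame (illustrative)

@dataclass
class Frame:
    text_token: int          # text channel: one token per frame
    audio_tokens: list[int]  # audio channel: one token per codebook

class DummyMoshiLM:
    """Stand-in for the 7B multimodal LM; a real model would condition on the
    user's incoming audio tokens and on its own past output."""
    def step(self, incoming_audio: list[int]) -> Frame:
        return Frame(text_token=0, audio_tokens=[0] * CODEBOOKS)

def stream(model: DummyMoshiLM, mic_frames: list[list[int]], rtf: float = 2.0):
    """Consume user audio frame by frame and yield generated frames.
    With a real-time factor of 2, the compute budget per 80 ms frame is ~40 ms."""
    budget = FRAME_MS / 1000 / rtf
    for mic_tokens in mic_frames:
        start = time.perf_counter()
        frame = model.step(mic_tokens)            # text + audio emitted together
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, budget - elapsed))    # simulate the per-frame budget
        yield frame

if __name__ == "__main__":
    silence = [[0] * CODEBOOKS for _ in range(5)]  # five frames of dummy input
    for f in stream(DummyMoshiLM(), silence):
        print(f.text_token, f.audio_tokens)
```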
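And a quick back-of-the-envelope check on two of the figures quoted above, under assumptions that are mine rather than Kyutai's: 24 kHz 16-bit mono PCM as the reference audio format for the 300x compression factor, and a 7B-parameter model for the quantised-weight footprints.

```python
# Reference raw-audio bitrate (assumed 24 kHz, 16-bit, mono PCM).
PCM_BITRATE = 24_000 * 16                 # 384,000 bit/s
codec_bitrate = PCM_BITRATE / 300         # implied by a 300x compression factor
print(f"Codec bitrate at 300x compression: {codec_bitrate / 1000:.2f} kbit/s")  # ~1.28

# Weight memory for a 7B-parameter model at different precisions
# (weights only, before KV cache and activations).
PARAMS = 7e9
for bits, name in [(16, "fp16/bf16"), (8, "8-bit"), (4, "4-bit")]:
    print(f"{name:>9} weights: ~{PARAMS * bits / 8 / 1e9:.1f} GB")
```

At 16-bit precision the 7B weights alone already take roughly 14 GB, so the 8-bit and 4-bit paths are what leave headroom for the KV cache and a second conversation stream within 24 GB of VRAM.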
In just 6 months, with a team of 8, the Kyutai research lab developed from scratch an artificial intelligence (AI) model with unprecedented vocal capabilities, called Moshi. The team publicly unveiled its experimental prototype today in Paris. At the end of the presentation, the participants – researchers, developers, entrepreneurs, investors and journalists – were themselves able to interact with Moshi.
The interactive demo of the AI will be accessible from the Kyutai website at the end of the day. It can therefore be freely tested online as of today, which constitutes a world first for a generative voice AI. This new type of technology makes it possible for the first time to communicate in a smooth, natural and expressive way with an AI. During the presentation, the Kyutai team interacted with Moshi to illustrate its potential as a coach or companion, for example, and its creativity through the incarnation of characters in roleplays.
More broadly, Moshi has the potential to revolutionize the use of speech in the digital world. For instance, its text-to-speech capabilities are exceptional in terms of emotion and interaction between multiple voices. Being compact, Moshi can also be installed locally and can therefore run safely on an unconnected device. With Moshi, Kyutai intends to contribute to open research in AI and to the development of the entire ecosystem.
The code and weights of the models will soon be freely shared, which is also unprecedented for such technology. They will be useful both to researchers in the field and to developers working on voice-based products and services. This technology can therefore be studied in depth, modified, extended or specialized according to needs. The community will in particular be able to extend Moshi's knowledge base and factuality, which are currently deliberately limited in such a lightweight model, while exploiting its unparalleled voice interaction capabilities.
-----------------------------
About Kyutai
Kyutai is a non-profit laboratory dedicated to open research in AI, founded in November 2023 by the iliad Group, CMA CGM and Schmidt Sciences.
Launched with an initial team of six leading scientists, who have all worked with Big Tech labs in the USA, Kyutai continues to recruit at the highest level, and also offers internships to research Master’s degree students. Now comprising a dozen members, the team will launch its first PhD theses at the end of the year.
The research undertaken explores new general-purpose models with high capabilities. The lab is currently working in particular on multimodality, i.e., the possibility for a model to exploit different types of content (text, sound, images, etc.) both for learning and for inference. All the models developed are intended to be freely shared, as are the software and know-how that enabled their creation.
To carry out its work and train its models, Kyutai relies in particular for its compute on the Nabu 23 superpod made available by Scaleway, a subsidiary of the iliad Group.
Follow us on: www.kyutai.org X: @kyutai_labs
Contacts
For any requests for interviews and/or photos of the Kyutai team, please send an email to presse@kyutai.org