Whisper AI (OpenAI)

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. We are open-sourcing models and inference code to serve as a foundation for building useful applications and for further research on robust speech processing.

[Figure: Summary of the ASR model architecture]

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
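The chunking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Whisper's actual preprocessing code: the 16 kHz sample rate and 10 ms spectrogram stride are assumptions taken from the Whisper paper rather than from this page, and the function names are hypothetical.

```python
SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio, as reported in the Whisper paper
CHUNK_SECONDS = 30     # chunk length stated in the text above
HOP_SECONDS = 0.010    # assumption: 10 ms stride between spectrogram frames


def split_into_chunks(samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Split a sequence of samples into fixed-length chunks.

    The final chunk is zero-padded (silence) so every chunk has the same length.
    """
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = list(samples[start:start + chunk_len])
        chunk.extend([0.0] * (chunk_len - len(chunk)))  # pad the last chunk
        chunks.append(chunk)
    return chunks


def frames_per_chunk(chunk_seconds=CHUNK_SECONDS, hop_seconds=HOP_SECONDS):
    """Number of spectrogram frames the encoder would see for one chunk."""
    return round(chunk_seconds / hop_seconds)


if __name__ == "__main__":
    audio = [0.1] * (70 * SAMPLE_RATE)   # 70 seconds of dummy audio
    chunks = split_into_chunks(audio)
    print(len(chunks))                   # 3 chunks: 30 s + 30 s + 10 s padded to 30 s
    print(frames_per_chunk())            # 3000 frames per chunk under the assumed stride
```

Fixed-length chunks keep the encoder input shape constant, which is why the last, shorter chunk is padded rather than passed through at its natural length.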

[Figure: Diagram detailing how ASR models are trained]

Other existing approaches frequently use smaller, more closely paired audio-text training datasets,1,2,3 or use broad but unsupervised audio pretraining.4,5,6 Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition. However, when we measure Whisper’s zero-shot performance across many diverse datasets, we find it is much more robust and makes 50% fewer errors than those models.
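To make "fewer errors" concrete: speech recognition errors are commonly measured with word error rate (WER), the word-level edit distance between a reference transcript and the model output, divided by the reference length. The text above does not name the metric, so treating it as WER is an assumption. The sketch below uses hypothetical function names and made-up numbers for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which the candidate reduces errors relative to the baseline."""
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Illustrative numbers only: a drop from 20% to 10% WER is a 50% reduction.
    print(relative_error_reduction(0.20, 0.10))
```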

About a third of Whisper’s audio dataset is non-English, and it is alternately given the task of transcribing in the original language or translating to English. We find this approach is particularly effective at learning speech to text translation and outperforms the supervised SOTA on CoVoST2 to English translation zero-shot.
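The choice between transcribing and translating is signalled to the decoder through the special tokens mentioned earlier. The sketch below shows how such a token prefix might be assembled. The token spellings follow the format described in the Whisper paper and open-source tokenizer, which is an assumption relative to this page, and `build_decoder_prompt` is a hypothetical helper, not part of any library.

```python
from typing import List


def build_decoder_prompt(language: str, task: str = "transcribe",
                         timestamps: bool = True) -> List[str]:
    """Assemble the special-token prefix that tells the decoder what to do.

    language: short language code such as "en" or "fr" (assumed token format).
    task: "transcribe" keeps the source language, "translate" outputs English.
    timestamps: when False, a token is added asking the model to skip timestamps.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


if __name__ == "__main__":
    # French audio, transcribed in French with timestamps.
    print(build_decoder_prompt("fr"))
    # French audio, translated to English without timestamps.
    print(build_decoder_prompt("fr", task="translate", timestamps=False))
```

Because the task is selected by a token rather than by a separate model, the same weights serve both transcription and translation.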

[Figure: ASR training data inputs and outputs]

Whisper’s high accuracy and ease of use will allow developers to add voice interfaces to a much wider set of applications.

Check out the paper, model card, and code to learn more and to try out Whisper.

GitHub
Examples

Whisper is a general-purpose automatic speech recognition (ASR) system developed by OpenAI. It is trained on 680,000 hours of multilingual and multitask supervised data collected from the web, and can perform a variety of tasks, including:

  • Multilingual speech recognition

  • Speech translation

  • Language identification

  • Voice activity detection

  • Phrase-level timestamps

Whisper does not beat models that specialize in a single benchmark such as LibriSpeech, but in zero-shot evaluation across many diverse datasets it has been shown to be considerably more robust than those models. It is also notable for its ability to handle challenging audio inputs, such as accented speech, background noise, and technical language.

Whisper is available as open-source models and inference code, and also through the OpenAI API. It can be used to transcribe audio files or to translate speech into English. Whisper can also be integrated into other applications, such as video conferencing software or dictation tools.
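As a rough illustration of what integration can look like, the sketch below runs the open-sourced model locally. It assumes the third-party `openai-whisper` package (and ffmpeg) is installed. `whisper.load_model` and `model.transcribe` are calls from that package, while `build_decode_options` and `transcribe_file` are hypothetical wrappers introduced here. The model name `"base"` and the file name are placeholders.

```python
def build_decode_options(task="transcribe", language=None):
    """Validate the task and collect keyword arguments for the transcribe call."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unsupported task: {task!r}")
    options = {"task": task}
    if language is not None:
        options["language"] = language  # e.g. "ja"; omit to let the model detect it
    return options


def transcribe_file(path, model_name="base", task="transcribe", language=None):
    """Transcribe (or translate to English) one audio file and return the text."""
    # Imported lazily so this module can be loaded without the package installed.
    import whisper  # assumption: installed via `pip install openai-whisper`

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **build_decode_options(task, language))
    return result["text"]


if __name__ == "__main__":
    # "meeting.mp3" is a placeholder path, not a file shipped with anything.
    print(transcribe_file("meeting.mp3", task="translate"))
```

Keeping option handling separate from the model call makes the wrapper easy to test without downloading model weights.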

Here are some potential applications of Whisper AI:

  • Subtitling and transcription: Whisper can be used to generate subtitles for videos or transcribe audio recordings for accessibility purposes.

  • Real-time translation: Whisper can be used to translate audio in real time, making it a valuable tool for communication and collaboration across language barriers.

  • Voice search and dictation: Whisper can be used to improve the accuracy of voice search and dictation tools.

  • Educational applications: Whisper can be used to create educational tools that help students learn new languages or improve their listening comprehension skills.

  • Customer service: Whisper can be used to improve the customer service experience by making it easier for customers to communicate with support representatives in their preferred language.
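For the subtitling use case in the list above, timestamped segments can be turned into a subtitle file with a small amount of glue code. The sketch below writes the SubRip (SRT) format. It assumes each segment is a dictionary with `start` and `end` times in seconds and a `text` string, which is how the open-source Whisper package reports segments. The helper names are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render timestamped segments as SRT text, numbering cues from 1."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = srt_timestamp(segment["start"])
        end = srt_timestamp(segment["end"])
        text = segment["text"].strip()
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)  # blank line between consecutive cues


if __name__ == "__main__":
    # Hand-written example segments, not real model output.
    example = [
        {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
        {"start": 2.5, "end": 5.0, "text": " Let's get started."},
    ]
    print(segments_to_srt(example))
```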

Overall, Whisper AI is a powerful tool that has the potential to revolutionize the way we interact with computers and with each other.
