The Massively Multilingual Speech (MMS) project is building a single multilingual speech recognition model, expanding speech technology to support over 1,100 languages (more than ten times as many as before), language identification models able to identify over 4,000 languages (more than forty times as many as before), pre-trained models supporting over 1,400 languages, and text-to-speech models for over 1,100 languages. MMS is a Meta project that aims to make it easier for people to access information and use their devices in their preferred language. Meta made MMS available for free on May 22, 2023, open-sourcing the code and model weights under the CC-BY-NC 4.0 license.
Many languages are in danger of disappearing, and existing speech recognition and generation technology struggles to cover them. Through MMS, Meta hopes to make a small contribution to preserving the world's language diversity, helping academics, researchers, and activists document and preserve languages. MMS also has a range of practical use cases:
- Creating and converting books and tutorials into audiobooks
- Preparing documentation and converting audio or video recordings into structured documentation
- Analyzing audio files to identify their main topics
- Generating closed captioning for videos and audio content
Previous speech recognition models covered only roughly 100 languages, a fraction of the 7,000+ known languages spoken around the world. Upon release, MMS supports speech-to-text and text-to-speech for 1,107 languages and language identification for over 4,000 languages. The map below shows the geographic origin of MMS's language coverage.
Meta's results show MMS outperforms existing models. In the future, Meta plans to increase coverage, supporting more languages and taking dialects into account.
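As a concrete illustration of the released speech-to-text support, the sketch below loads the public multilingual checkpoint through the Hugging Face transformers library. The model ID facebook/mms-1b-all and the adapter-switching calls follow the published model card, but the exact API may differ between library versions.

```python
# Sketch: transcribing speech with the released MMS checkpoint via Hugging Face
# transformers. Model ID and adapter calls follow the public model card;
# exact APIs may differ between library versions.
import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"                 # 1,100+ language ASR checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

# Switch the tokenizer and language adapter to a specific ISO code, e.g. French.
processor.tokenizer.set_target_lang("fra")
model.load_adapter("fra")

def transcribe(waveform):
    """waveform: 16 kHz mono audio as a 1-D float array or tensor."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # (batch, frames, vocab)
    ids = torch.argmax(logits, dim=-1)           # greedy CTC decoding
    return processor.batch_decode(ids)[0]
```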
MMS combines Meta's self-supervised wav2vec 2.0 model with a new dataset containing labeled data for 1,100 languages and unlabeled data for nearly 4,000 languages. Some of these languages, such as Tatuyo, have only a few hundred speakers and no prior speech technology coverage.
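The sketch below shows how a pretrained, self-supervised MMS wav2vec 2.0 encoder can be used as a feature extractor before any task-specific fine-tuning. The checkpoint name facebook/mms-300m and the usage shown are assumptions based on the public releases and the transformers API, not code from the MMS paper.

```python
# Sketch: using a pretrained, self-supervised MMS wav2vec 2.0 encoder as a
# feature extractor. "facebook/mms-300m" is one of the released pretrained
# checkpoints; the usage below is an assumption based on the transformers API.
import torch
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/mms-300m")
encoder.eval()

waveform = torch.randn(1, 16_000)   # placeholder: 1 second of 16 kHz mono audio
with torch.no_grad():
    # In practice the waveform should first be normalized with the model's
    # feature extractor rather than passed in raw.
    features = encoder(waveform).last_hidden_state   # (1, frames, hidden_dim)
```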
To collect audio data for thousands of languages, the project turned to religious texts, such as the Bible, that have been translated and recorded in many different languages. MMS includes readings of the New Testament in over 1,100 languages, providing roughly thirty-two hours of data per language. Adding unlabeled recordings of various other Christian religious readings increased the number of languages to over 4,000. Although the recordings are most often read by male speakers and contain religious content, the model performs equally well for male and female voices and does not show a bias toward religious language. Meta attributes this to its connectionist temporal classification (CTC) approach, which is far more constrained than large language models or sequence-to-sequence speech models.
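The CTC objective trains the acoustic model to emit a character (or a blank symbol) for every audio frame and sums the probability of all frame-level paths that collapse to the target transcript. The PyTorch sketch below shows the loss on placeholder tensors; the shapes and vocabulary size are illustrative and not taken from MMS.

```python
# Sketch: the connectionist temporal classification (CTC) loss in PyTorch.
# Shapes and vocabulary size are placeholders, not MMS configuration values.
import torch

frames, vocab, target_len = 50, 32, 12           # vocab includes blank at index 0
log_probs = torch.randn(frames, 1, vocab, requires_grad=True).log_softmax(-1)
targets = torch.randint(1, vocab, (1, target_len))   # character IDs, no blanks
input_lengths = torch.tensor([frames])
target_lengths = torch.tensor([target_len])

ctc = torch.nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # in real training, gradients flow into the acoustic model
```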
The data was preprocessed to improve its quality and make it usable by machine learning algorithms. An alignment model trained on existing data in over one hundred languages was used to align the recordings with their text, followed by a final cross-validation filtering step based on model accuracy to remove potentially misaligned data. To enable other researchers to create new speech datasets, Meta added the alignment algorithm to PyTorch and released the alignment model. Thirty-two hours of data per language is not enough to train conventional supervised speech recognition models, so MMS builds on wav2vec 2.0 to reduce the amount of labeled data needed to train useful systems. Meta trained self-supervised models on about 500,000 hours of speech data in over 1,400 languages, nearly five times more languages than prior work. The resulting models were then fine-tuned for a specific speech task, such as multilingual speech recognition or language identification.
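For the alignment step described above, torchaudio exposes a forced-alignment function that matches a known transcript to the frame-level emissions of a CTC model. The sketch below assumes the torchaudio.functional.forced_align API from recent torchaudio releases; the exact signature may vary by version, and all tensors here are placeholders.

```python
# Sketch: forced alignment as exposed in torchaudio (functional.forced_align).
# Availability and exact signature depend on the torchaudio version; tensors
# below are placeholders standing in for real emissions and transcripts.
import torch
import torchaudio.functional as F

emissions = torch.randn(1, 200, 32).log_softmax(-1)          # (1, frames, tokens)
transcript = torch.randint(1, 32, (1, 20), dtype=torch.int32) # token IDs of known text

# Returns, for every frame, the aligned token label and its score; frames
# labelled with the blank token (index 0) carry no transcript character.
frame_labels, frame_scores = F.forced_align(emissions, transcript, blank=0)
```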
MMS performance has been evaluated on existing benchmark datasets such as FLEURS. As the number of languages increases, performance decreases only slightly: moving from 61 to 1,107 languages increases the character error rate by only about 0.4 percent while increasing language coverage by more than eighteen times. Compared with OpenAI's Whisper, Meta found that models trained on MMS data achieved half the word error rate while covering eleven times as many languages.
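For reference, character error rate is the character-level edit distance between the hypothesis and the reference transcript, divided by the reference length. The sketch below implements this standard definition; it is not code from the MMS release.

```python
# Sketch: character error rate (CER) as character-level edit distance divided
# by the reference length. Standard definition; not code from the MMS release.
def cer(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ch in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_ch in enumerate(hypothesis, start=1):
            cost = 0 if ref_ch == hyp_ch else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("hello world", "helo world"))   # one missing character -> ~0.09
```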