Voicebox is a speech generative model built upon Meta’s non-autoregressive flow matching model, trained to infill speech given audio context and text. Two versions of the model are under development: an English-only version trained on 60K hours of data and a multilingual version trained on 50K hours of data covering six languages (English, French, German, Spanish, Polish, and Portuguese). Voicebox can perform a range of tasks, including speech synthesis across six languages, removal of transient noise, content editing, the transfer of audio style within and across languages, and the generation of diverse speech samples.
Voicebox can generalize to various speech-generation tasks that it was not specifically trained for. Previously, generative AI models for speech required task-specific training using carefully prepared training data. These inputs, known as monotonic, clean data, are difficult to produce and exist only in limited quantities, and they result in outputs that sound monotonous. In contrast, Voicebox can learn from raw audio and an accompanying transcription.
Upon the release of Voicebox, Meta AI provided a series of demos of the model's capabilities; these included the following:
June 16, 2023
Meta AI announced Voicebox on June 16, 2023, sharing audio samples from the model and a research paper detailing its methodology. However, citing the potential for misuse of the Voicebox model, Meta AI chose not to make the code publicly available, stating:
While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility.
The research paper accompanying the model's release, titled "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale," also details a classifier that can distinguish between authentic speech and audio generated with Voicebox, intended to help mitigate future misuse of the model.
Building on Meta's flow matching model means Voicebox can learn a highly non-deterministic mapping between text and speech. This enables Voicebox to learn from varied speech data without the variations having to be carefully labeled, so it can train on more diverse data at a much larger scale. Voicebox was trained on recorded speech and transcripts from public-domain audiobooks. Meta states that its flow matching method improves on auto-regressive models: on zero-shot text-to-speech, Voicebox outperforms VALL-E in intelligibility (word error rate of 1.9% for Voicebox vs. 5.9% for VALL-E) and audio similarity (0.681 vs. 0.580) while being as much as 20 times faster. For cross-lingual style transfer, Voicebox outperforms YourTTS, reducing the average word error rate from 10.9% to 5.2% and improving audio similarity from 0.335 to 0.481.
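The intelligibility figures above are word error rates (WER). As a rough illustration of how such a number is computed, the following minimal Python sketch takes a reference transcript and a hypothesis transcript and divides the word-level edit distance (substitutions, deletions, and insertions) by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for this example; it is not Meta's evaluation code, and in practice the hypothesis would typically be produced by running a speech recognizer on the generated audio.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization is a lowercase whitespace split,
    which is an assumption of this sketch, not part of any official
    Voicebox evaluation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"
    # One substituted word out of six reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Under this definition, a WER of 1.9% corresponds to roughly two word errors per hundred reference words, and lower values indicate more intelligible speech.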