Voicebox is a speech generative model built upon Meta’s non-autoregressive flow matching model, trained to infill speech given audio context and text.