AudioCraft is an open-source PyTorch library for audio processing and generation with deep learning, developed by Meta AI. AudioCraft offers users a range of generative audio capabilities (music, sound effects, and compression after training on raw audio signals) in a single code base. It consists of three models:
- MusicGen—text-to-music model.
- AudioGen—text-to-sound model.
- EnCodec—neural audio codec.
Both MusicGen and AudioGen consist of a single autoregressive Language Model (LM) operating over streams of compressed discrete audio representations (tokens). Meta AI introduced an approach leveraging the internal structure of the parallel streams of tokens, showing that a token interleaving pattern can efficiently model audio sequences while also capturing long-term dependencies in the audio. These models leverage EnCodec to learn the discrete audio tokens from raw waveforms. The codec maps audio signals to one or several parallel streams of discrete tokens. Then a single autoregressive language model recursively models the tokens from EnCodec. Generated tokens are then fed to the EnCodec decoder to map them back to the audio space, obtaining an output waveform. Different types of conditioning models can control the generation, including a pretrained text encoder for text-to-audio applications.
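A minimal sketch of this text-to-audio pipeline, using the high-level MusicGen wrapper exposed by the AudioCraft repository (checkpoint names and exact function signatures may differ between releases):

```python
# Sketch of the EnCodec-token pipeline via AudioCraft's MusicGen wrapper.
# The text prompt conditions an autoregressive LM over EnCodec tokens,
# which are then decoded back to a waveform.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load a pretrained text-to-music checkpoint (the small 300M-parameter variant).
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Length of audio to generate, in seconds.
model.set_generation_params(duration=8)

# One waveform is generated per text description.
descriptions = ['lo-fi hip hop beat with mellow piano']
wav = model.generate(descriptions)

# Save each result with loudness normalization.
for idx, one_wav in enumerate(wav):
    audio_write(f'sample_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```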
AudioCraft was released on August 2, 2023. Meta AI chose to open-source the AudioCraft models, allowing users to train their own models on their own datasets. The AudioCraft code is released under the MIT license, and the model weights are released under the CC-BY-NC 4.0 license. Meta has released demos of the models demonstrating samples of audio generated from both the text-to-sound and text-to-music models. The company is aiming for the AudioCraft models to be used as tools for musicians and sound designers, helping users brainstorm new ideas or iterate on their existing compositions in new ways. Meta has also suggested that MusicGen could become a new type of instrument, much as synthesizers were when they first appeared.
The MusicGen model was first described in a paper released in June 2023 titled "Simple and Controllable Music Generation." The model was developed by the FAIR team at Meta AI and trained between April 2023 and May 2023. The training dataset consisted of roughly 400,000 recordings along with text descriptions and metadata, amounting to 20,000 hours of music owned by Meta or licensed from the following sources: the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.
MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. MusicGen is available in three sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music generation and melody-guided music generation). The model was evaluated using standard music benchmarks, including those below:
- Fréchet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
- Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
- CLAP Score between audio and text embeddings extracted from a pre-trained CLAP model (see the sketch following the evaluation criteria below)
Additional qualitative studies with human participants were used to evaluate the performance of the model based on the following criteria:
- Overall quality of the music samples
- Text relevance to the provided text input
- Adherence to the melody for melody-guided music generation
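The CLAP score mentioned above is, in essence, the cosine similarity between the embedding of the generated audio and the embedding of the conditioning text. A hedged sketch of that computation, assuming the embeddings have already been extracted from a pre-trained CLAP model (the helper name and dimensions are illustrative):

```python
import torch
import torch.nn.functional as F

def clap_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between audio and text embeddings.
    Hypothetical helper; assumes both were produced by a pre-trained CLAP model."""
    return F.cosine_similarity(audio_emb, text_emb, dim=-1)

# Placeholder embeddings of dimension 512 for a batch of 4 generated samples.
audio_emb = torch.randn(4, 512)
text_emb = torch.randn(4, 512)
print(clap_score(audio_emb, text_emb).mean())  # average score over the batch
```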
AudioGen was also developed by the FAIR team at Meta AI. A paper describing version one of the model was released in September 2022, titled "AudioGen: Textually Guided Audio Generation."
Version two of AudioGen released as part of AudioCraft, was trained between July 2023 and August 2023 on a range of public data sources, including the following:
- A subset of AudioSet
- BBC sound effects
- AudioCaps
- Clotho v2
- VGG-Sound
- FSD50K
- Free To Use Sounds
- Sonniss Game Effects
- WeSoundEffects
- Paramount Motion - Odeon Cinematic Sound Effects
AudioGen consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for audio modeling. Version 2 improved on version 1 by training on 10-second samples instead of 5-second samples, using an EnCodec model retrained on environmental sound data, and dropping the audio mixing augmentations. Version 2 has 1.5 billion parameters. AudioGen was evaluated using:
- Fréchet Audio Distance
- Kullback-Leibler Divergence
As with MusicGen, qualitative studies with human participants were also conducted.
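AudioCraft exposes a high-level wrapper for AudioGen analogous to the MusicGen one. A minimal text-to-sound sketch (the checkpoint name and signatures may differ between releases):

```python
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

# Load the 1.5B-parameter AudioGen checkpoint released with AudioCraft.
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # seconds of audio to generate

# Environmental-sound prompts; one waveform is generated per description.
descriptions = ['dog barking in the distance', 'sirens passing by on a wet road']
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f'audiogen_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```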
EnCodec was first released by Meta AI on October 25, 2022. The model was described in a paper titled "High Fidelity Neural Audio Compression." EnCodec consists of three parts:
- The encoder—takes uncompressed data and transforms it into a higher dimensional and lower frame rate representation.
- The quantizer—compresses this representation to the targeted size. The quantizer is trained to give the desired size (or set of sizes) while retaining the most important information to rebuild the original signal. This compressed representation is stored on disk or sent through the network.
- The decoder—turns the compressed signal back into a waveform that is as similar as possible to the original. Discriminators are used to improve the perceptual quality of the generated samples by trying to differentiate between real samples and reconstructed samples.
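A minimal sketch of that encode/quantize/decode round trip using the standalone encodec Python package (API details may vary between versions; the 24 kHz model and 6 kbps target bandwidth are illustrative choices):

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Load the pretrained 24 kHz EnCodec model and pick a target bitrate;
# the quantizer produces as many codebook streams as the bandwidth allows.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # kbps

# Load an input file and resample it to the model's sample rate and channel count.
wav, sr = torchaudio.load('input.wav')
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    # Encoder + quantizer: compressed streams of discrete tokens.
    encoded_frames = model.encode(wav)
    # Decoder: reconstruct a waveform from the discrete tokens.
    reconstructed = model.decode(encoded_frames)
```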