BLOOM is a transformer-based autoregressive multilingual large language model (LLM) developed by BigScience. BLOOM is trained to continue text from a prompt using vast amounts of text data and industrial-scale computational resources. It can output coherent text in forty-six natural languages and thirteen programming languages, and it can also perform text tasks it hasn't been explicitly trained for by casting them as text generation tasks. With 176 billion parameters, BLOOM was, for most of these languages, the first language model with over 100 billion parameters ever created. At release, BLOOM was the world's largest open-science, open-access multilingual LLM.
BigScience is an open science project composed of hundreds of researchers around the world. It is not structured under a centralized legal entity, although there are plans to create one for data governance and community purposes. BigScience is an open collaboration bootstrapped by HuggingFace (a company and platform aiming to advance and democratize AI), GENCI (Grand Equipement National de Calcul Intensif), and IDRIS (Institute for Development and Resources in Intensive Scientific Computing). Organized as a research workshop, BigScience gathers academic, industrial, and independent researchers from many affiliations; there is no formal relationship between the affiliated entities of the workshop's participants.
BLOOM is the result of over a year's work involving over 1,000 researchers from 70+ countries and 250+ institutions. It was released on July 12, 2022, allowing anyone to download and run the model to investigate its performance or build on top of it, provided they agree to BigScience's Responsible AI License (RAIL). BLOOM is accessible through the HuggingFace hub. BigScience plans to improve BLOOM by making it more instructable, adding more languages, compressing the model while maintaining performance, and using it as a starting point for more complex architectures. Four months after the release of BLOOM, BigScience researchers released a paper titled "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model," going into detail on the model. The paper had almost 400 authors.
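Because the weights are hosted on the HuggingFace hub under the `bigscience/bloom` model ID, the standard `transformers` loading path applies. The snippet below is a minimal sketch of downloading a checkpoint and continuing text from a prompt; it uses the smaller released `bigscience/bloom-560m` variant so it runs on a single machine, since the full 176B checkpoint occupies on the order of hundreds of gigabytes of memory. The prompt and generation settings are illustrative assumptions, not part of the original release.

```python
# Minimal sketch: load a BLOOM checkpoint from the HuggingFace hub and
# continue text from a prompt. Swap in "bigscience/bloom" for the full
# 176B-parameter model if you have the hardware for it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"  # smaller released variant for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "BigScience is an open science project that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```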
BLOOM utilizes an architecture modified from Megatron-LM GPT-2. A decoder-only model, BLOOM applies layer normalization to the word embedding layer and uses ALiBi positional encodings with GELU activation functions. The entire model contains a total of 176,247,271,424 parameters (a sketch after the list below shows how these figures fit together):
- 3,596,615,680 embedding parameters
- 70 layers, 112 attention heads
- 14,336-dimensional hidden layers
- Sequence length of 2048 tokens
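These figures are internally consistent if one assumes a standard decoder block: a fused QKV projection plus an attention output projection (4h² weights), a GELU MLP with a 4h intermediate size (8h² weights), per-layer biases, and two LayerNorms per block, with the embedding matrix and two extra LayerNorms accounting for the rest. The sketch below is a back-of-the-envelope check under those assumptions (the 4h MLP width and the padded vocabulary size derived from the embedding count are inferences, not figures stated above); it reproduces the 176,247,271,424 total exactly.

```python
# Back-of-the-envelope parameter accounting for BLOOM, assuming a standard
# decoder block with hidden size h: 12*h*h weight parameters per layer
# (QKV, attention output, and two MLP matrices), 13*h bias/LayerNorm biases
# per layer, plus the embedding matrix and two additional LayerNorms.
h, layers = 14_336, 70
emb_params = 3_596_615_680            # word embedding matrix (from the list above)
padded_vocab = emb_params // h        # 250,880 embedding rows (vocab padded for parallelism)

per_layer = (
    12 * h * h    # QKV (3h*h), attention output (h*h), MLP up (4h*h), MLP down (4h*h)
    + 13 * h      # biases: 3h (QKV) + h (attn out) + 5h (MLP) + 4h (two LayerNorms)
)
total = emb_params + layers * per_layer + 4 * h   # + embedding & final LayerNorms
print(padded_vocab, total)            # 250880 176247271424
```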
BLOOM was trained on the ROOTS corpus, a composite collection of 498 Hugging Face datasets containing 1.61 terabytes of text spanning forty-six natural languages and thirteen programming languages.
The 13 programming languages present are:
- Java
- PHP
- C++
- Python
- JavaScript
- C#
- Ruby
- Lua
- Go
- TypeScript
- C
- Scala
- Rust
The 1.6TB of pre-processed text was converted into 350B unique tokens using the BLOOM tokenizer, a learned subword tokenizer trained using a byte-level Byte Pair Encoding (BPE) algorithm. The model was trained for 117 days (March 11, 2022 to July 6, 2022) on the Jean Zay supercomputer, located just south of Paris, France. The training was funded by a compute grant worth an estimated €3 million from the French research agencies CNRS (Centre National de la Recherche Scientifique) and GENCI (Grand Equipement National de Calcul Intensif).
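The BLOOM tokenizer itself is published on the HuggingFace hub, so the byte-level BPE behaviour described above can be inspected directly. The snippet below is a small illustrative sketch (the sample sentence and printed fields are assumptions for demonstration, not part of the training setup); it loads the tokenizer and shows how a multilingual string is split into subword pieces.

```python
# Illustrative sketch: inspect BLOOM's byte-level BPE tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bigscience/bloom")  # tokenizer only, no model weights

text = "BigScience a entraîné BLOOM sur le supercalculateur Jean Zay."
ids = tok(text)["input_ids"]
print(len(ids), "tokens")               # number of subword tokens for this sentence
print(tok.convert_ids_to_tokens(ids))   # the byte-level BPE pieces themselves
print(tok.vocab_size)                   # size of the learned vocabulary
```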