Mixture of experts (MoE) is a machine learning technique in which multiple models, or experts, are trained to specialize in different parts of the input space. The input space is partitioned into regions, each handled by an expert trained for that region. A gating network determines the weight given to each expert's prediction, and the weighted predictions are combined to produce the final output, allowing the model to leverage the strengths of each expert with the aim of improving overall performance.
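As a concrete illustration, the following is a minimal sketch of a dense MoE, assuming PyTorch; the class name, layer sizes, and number of experts are illustrative and not taken from any particular system. A gating network produces a softmax weight for each expert, and the experts' outputs are combined using those weights.

```python
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts):
        super().__init__()
        # Each expert is a small feed-forward network that can specialize in part of the input space.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, output_dim),
            )
            for _ in range(num_experts)
        ])
        # The gating network scores the experts for each input.
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        # Softmax turns the gate's scores into weights that sum to 1 across experts.
        weights = torch.softmax(self.gate(x), dim=-1)                      # (batch, num_experts)
        expert_outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, output_dim)
        # The final output is the weighted combination of the experts' predictions.
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

moe = DenseMoE(input_dim=16, hidden_dim=32, output_dim=4, num_experts=4)
y = moe(torch.randn(8, 16))  # y has shape (8, 4)
```

In this dense formulation every expert runs on every input; the sparse variants used in large language models route each token to only a few experts, as described below.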
MoE models can capture a wide range of patterns and relationships, making them particularly effective when the input space is large and complex. Typical applications of MoE models include image recognition, natural language processing, and recommendation systems.
One of the most important factors determining a model's quality is its scale. For a fixed compute budget, it is better to train a larger model for fewer steps than a smaller model for more steps. MoE enables artificial intelligence (AI) models to be pretrained with far less compute, so the model or dataset can be scaled up within the same compute budget as a dense model. MoE also offers faster inference than a dense model with the same total number of parameters, because only a fraction of the parameters is used for each input.
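A back-of-the-envelope calculation makes the compute argument concrete. The numbers below are purely illustrative and not taken from any specific model: with eight experts per layer and two active per token, each token uses only a fraction of the total parameter count, which is why a sparse MoE is cheaper to pretrain and serve than a dense model of the same total size, even though all experts must still fit in memory.

```python
# Illustrative sizes only (hypothetical, not from a real model).
num_experts, top_k = 8, 2        # experts per MoE layer, experts active per token
expert_params = 7e9              # parameters per expert
shared_params = 1.5e9            # attention, embeddings, etc. shared by all tokens

total_params = shared_params + num_experts * expert_params   # must be loaded in memory
active_params = shared_params + top_k * expert_params        # actually used per token

print(f"total:  {total_params / 1e9:.1f}B parameters in memory")
print(f"active: {active_params / 1e9:.1f}B parameters per token")
# A dense model of the same total size would use all of its parameters for every
# token, so the MoE spends roughly top_k / num_experts of the FFN compute per token.
```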
In the context of transformer models, MoE consists of two main elements:
- Sparse MoE layers—used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts,” where each expert is a neural network. In practice, the experts are often FFNs, but they can also be more complex networks or even MoEs themselves, leading to hierarchical MoEs.
- Gate network (router)—determines which experts each token is sent to. A token can also be sent to more than one expert. The router is composed of learned parameters and is pretrained at the same time as the rest of the network. A minimal sketch of a sparse MoE layer follows this list.
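The following sketch shows how these two elements fit together, again assuming PyTorch; the dimensions, number of experts, and top-k value are illustrative. The router scores every expert for each token and keeps only the top-k, so only those experts run on that token, which is what makes the layer sparse.

```python
import torch
import torch.nn as nn

class SparseMoELayer(nn.Module):
    """Sparse MoE layer used in place of a transformer's dense FFN block (illustrative)."""

    def __init__(self, d_model, d_ff, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is the kind of FFN a dense transformer block would use once.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        # The router (gate network) is trained jointly with the rest of the model.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                                   # x: (num_tokens, d_model)
        gate_probs = torch.softmax(self.router(x), dim=-1)  # (num_tokens, num_experts)
        weights, indices = torch.topk(gate_probs, self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts; all other experts stay idle.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoELayer(d_model=64, d_ff=256)
y = layer(torch.randn(10, 64))  # (10, 64): same shape as a dense FFN would return
```

Production implementations avoid the explicit Python loops by batching tokens per expert, and they typically add an auxiliary load-balancing loss so the router does not collapse onto a few experts.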
MoEs come with trade-offs: they are difficult to fine-tune and prone to overfitting, and they require a large amount of VRAM (video RAM) because all experts must be loaded in memory, even though only a few are active for any given token.
MoEs date back to the 1991 paper "Adaptive Mixtures of Local Experts" by Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. The original idea was a supervised procedure for a system composed of separate networks, each handling a different subset of the training cases. Between 2010 and 2015, two different research areas contributed to later MoE advancement:
- Experts as components—work by Eigen, Ranzato, and Sutskever explored MoEs as components of deeper networks. This allows MoEs to be used as layers in a multilayer network, making it possible for the model to be both large and efficient at the same time.
- Conditional computation—Yoshua Bengio researched approaches to dynamically activate or deactivate components based on the input token.
This work led to the exploration of MoE for natural language processing, with Shazeer et al. publishing the paper "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" in 2017, in which they scaled the idea to a 137B-parameter LSTM (long short-term memory) model by introducing sparsity.
Open-source projects to train MoEs include the following:
- Megablocks
- Fairseq
- OpenMoE
Open-access MoEs that have been released include those below:
- Switch Transformers (Google)—A collection of T5-based MoEs going from 8 to 2048 experts. The largest model has 1.6 trillion parameters.
- NLLB MoE (Meta)—A MoE variant of the NLLB translation model.
- Mixtral 8x7B (Mistral)—A high-quality MoE that outperforms Llama 2 70B and has much faster inference. An instruct-tuned version has also been released. Read more about it in the announcement blog post.
- OpenMoE—A community effort that has released Llama-based MoEs.