Transformers are a type of neural network architecture used to transduce, or transform, input sequences into output sequences in deep learning applications such as natural language processing (NLP) and computer vision (CV). Transformer networks learn context by tracking relationships in sequential data, such as the words in a sentence. They apply a mathematical technique called attention, or self-attention, which weighs the importance of each part of the input differently and detects subtle ways that even distant data elements in a series influence or depend on each other.
Transformers were invented and open-sourced by researchers at Google and first described in a paper published on June 12, 2017, titled "Attention Is All You Need." Replacing previous techniques such as convolutional and recurrent neural networks (CNNs and RNNs), transformers have become the basis of leading generative AI models. For example, the acronym GPT used in various OpenAI models, including ChatGPT, stands for "generative pre-trained transformer." Studies show that 70 percent of arXiv papers on AI posted between 2020 and 2022 mention transformers.
In comparison to previous neural network architectures, transformers are better able to apply context and can make use of parallel computation for higher throughput. For example, when processing natural language, the meaning of a word or phrase can change depending on the context in which it is used. Self-attention mechanisms allow transformers to apply context in a data-driven way, focusing on the most relevant parts of the input for each output.
Transformer architecture uses an encoder-decoder structure that no longer requires recurrence or convolution to generate an output. The encoder receives an input and builds a representation of it and its features. The decoder takes this representation, along with other inputs, to generate a target sequence. These two components can be used independently or in combination, depending on the task (a short code sketch follows the list below):
- Encoder-only models—tasks that require the understanding of an input, such as entity recognition or sentence classification
- Decoder-only models—generative tasks such as text generation
- Encoder-decoder models—tasks that generate based on an input, such as translation or summarization
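As a rough illustration of these three configurations, the sketch below uses the Hugging Face `transformers` library (an assumption made for demonstration purposes; it is not mentioned above) to run one representative task for each model type with its default pretrained model:

```python
# A minimal sketch, assuming the Hugging Face `transformers` package is installed;
# each pipeline downloads a default pretrained model on first use.
from transformers import pipeline

# Encoder-only (BERT-style): understanding an input, e.g. named-entity recognition
ner = pipeline("ner")
print(ner("Transformers were introduced by researchers at Google in 2017."))

# Decoder-only (GPT-style): open-ended text generation
generator = pipeline("text-generation")
print(generator("Transformers are", max_length=20))

# Encoder-decoder (T5/BART-style): generating an output conditioned on an input
summarizer = pipeline("summarization")
print(summarizer("Transformers replaced recurrent networks for many NLP tasks "
                 "because attention lets them use context and run in parallel."))
```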
The architecture first converts the input data into an n-dimensional embedding, which is fed into the encoder. The encoder and decoder consist of modules stacked on top of each other, which include feed-forward and multi-head attention layers. The encoder maps an input sequence into a series of continuous representations. The decoder receives this output from the encoder, along with the decoder's own output from the previous time step, to generate an output sequence.
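The following is a minimal PyTorch sketch of that flow, with illustrative sizes rather than the original paper's configuration (positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512        # illustrative sizes, not the paper's exact configuration

embed = nn.Embedding(vocab_size, d_model)            # input tokens -> n-dimensional embeddings
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       dim_feedforward=2048, batch_first=True)

src = torch.randint(0, vocab_size, (1, 10))          # source sequence of 10 token ids
tgt = torch.randint(0, vocab_size, (1, 7))           # target tokens produced so far

# The encoder maps the embedded source into continuous representations; the decoder
# attends to those representations plus its own previous outputs to produce the next step.
out = model(embed(src), embed(tgt))                  # shape: (1, 7, 512)
print(out.shape)
```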
Self-attention is a mechanism that allows a neural network to contextualize different parts of an input, for example, paying attention to other words that add context to a body of text when analyzing its meaning. This could include distinguishing between homonyms based on the context in which a word is used.
For each word, transformer models generate a vector called the query and another vector called the key. When the query from one word matches the key of another, that word adds relevant context for understanding its meaning. To convey this context between words, a third vector (the value) is generated; when combined with the first word's representation, it offers a new, more contextualized meaning. Transformers apply multiple attention mechanisms in parallel while reducing the size of the vectors each one works on. This allows the network to make multiple attempts at capturing different kinds of context and to combine context across larger phrases.
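In code, this query/key/value interaction is commonly implemented as scaled dot-product attention split across several heads. The PyTorch sketch below uses illustrative sizes and random weights purely to show the shapes and steps involved:

```python
import torch
import torch.nn.functional as F

seq_len, d_model, n_heads = 5, 64, 4          # illustrative sizes
d_head = d_model // n_heads                   # each head works on a reduced-size vector

x = torch.randn(seq_len, d_model)             # one embedding per word

# Learned projections produce a query, key, and value vector for every word
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Split into heads so several attention patterns are computed in parallel
q = q.view(seq_len, n_heads, d_head).transpose(0, 1)    # (heads, seq, d_head)
k = k.view(seq_len, n_heads, d_head).transpose(0, 1)
v = v.view(seq_len, n_heads, d_head).transpose(0, 1)

# How well each word's query matches every other word's key
scores = q @ k.transpose(-2, -1) / d_head ** 0.5         # (heads, seq, seq)
weights = F.softmax(scores, dim=-1)                       # attention weights per word pair

# Each word's new representation is a weighted mix of the value vectors
context = weights @ v                                     # (heads, seq, d_head)
context = context.transpose(0, 1).reshape(seq_len, d_model)   # recombine the heads
```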
Machine learning models that process text must consider not only every word but also how the words appear in sequence and relate to each other, i.e., how the words in a sentence change meaning when used together. Before transformers, RNNs were the main solution for NLP. An RNN processes the first word and feeds the result back into the layer that processes the next word, which allows it to keep track of an entire sentence rather than processing each word separately.
However, this approach is slow and unable to take advantage of parallel computing hardware such as GPUs. RNNs also cannot handle long sequences of text due to a problem known as "vanishing gradients": as the network works its way through a sequence, the influence of the earliest words gradually fades, causing issues when two linked words are far apart in the text. Additionally, an RNN can only relate a word to the text preceding it, when in reality the meaning of a word depends on the words that come both before and after it.
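Both limitations can be seen in a toy PyTorch recurrence (arbitrary sizes and weight scales, for illustration only): each step depends on the previous hidden state, so the loop cannot be parallelized, and after many steps the gradient flowing back to the first word is vanishingly small:

```python
import torch

torch.manual_seed(0)
dim, seq_len = 16, 200
W_x = torch.randn(dim, dim) * 0.1       # input-to-hidden weights (toy values)
W_h = torch.randn(dim, dim) * 0.1       # hidden-to-hidden weights

words = [torch.randn(dim, requires_grad=True) for _ in range(seq_len)]
h = torch.zeros(dim)
for x in words:                          # strictly serial: step t needs the result of step t-1
    h = torch.tanh(W_x @ x + W_h @ h)

h.sum().backward()
print(words[0].grad.norm().item())       # influence of the first word: effectively zero
print(words[-1].grad.norm().item())      # influence of the last word: much larger
```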
Long short-term memory (LSTM) networks, a successor to RNNs, solved the problem of vanishing gradients and could handle longer sequences. However, they were even slower to train than RNNs and still could not take advantage of parallel computing, relying on the serial processing of text.
Transformers were first described in a 2017 NeurIPS paper from a research team at Google. The name "transformer" was coined by Jakob Uszkoreit, a senior software engineer on the team. The team trained its model in only 3.5 days on eight GPUs, a fraction of the time and cost required to train previous models, using datasets with up to a billion pairs of words.
- The first pre-trained transformer model, GPT, was released by OpenAI in June 2018 and fine-tuned on a range of NLP tasks.
- Google released BERT, a large pre-trained transformer model designed to understand and summarize sentences, in October 2018.