The Retentive network (RetNet) is a foundational architecture for large language models (LLMs), proposed as an alternative to transformers. RetNet was introduced by researchers at Microsoft Research and Tsinghua University, Beijing, in a paper submitted on July 17, 2023. The paper, titled "Retentive Network: A Successor to Transformer for Large Language Models", was authored by Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Alongside the paper, the researchers released code on GitHub allowing users to develop their own RetNet models. The code is available through TorchScale, a PyTorch library of foundation architectures.
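As a rough illustration of how the released code can be used, a RetNet decoder could be constructed along the following lines. This is a minimal sketch based on the example usage in the TorchScale repository; the module paths and class names (RetNetConfig, RetNetDecoder) are assumptions that may differ between library versions.

```python
import torch

# Module paths and class names follow the TorchScale examples; treat them as
# assumptions that may change across versions of the library.
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder

# Build a RetNet decoder with a given vocabulary size (other hyperparameters
# are left at the library defaults for this sketch).
config = RetNetConfig(vocab_size=64000)
retnet = RetNetDecoder(config)

# Inspect the resulting stack of retention-based decoder layers.
print(retnet)
```

In practice, the decoder would be paired with a token embedding and an output projection and trained like any other autoregressive language model.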
RetNet derives a connection between recurrence and attention (a key concept in the transformer architecture), proposing the retention mechanism for sequence modeling, which supports three computation paradigms: parallel, recurrent, and chunkwise recurrent.
In particular, the parallel representation allows for training parallelism. The recurrent representation enables low-cost O(1) inference, improving decoding throughput, latency, and GPU memory usage. The chunkwise recurrent representation allows for efficient long-sequence modeling with linear complexity, with each chunk encoded in parallel while the chunks are summarized recurrently.
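The parallel and recurrent forms compute the same outputs, which is what lets RetNet train like a transformer while decoding with a fixed-size state like an RNN. The sketch below shows a single retention head in both forms and checks that they agree; it is a simplified illustration that omits the paper's xPos-style rotation, per-head (multi-scale) decay, group normalization, and the chunkwise form, and the function names are illustrative rather than taken from the released code.

```python
import torch

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T, masked by a causal decay matrix D) applied to V."""
    T = Q.shape[0]
    n = torch.arange(T).unsqueeze(1)   # query positions, shape (T, 1)
    m = torch.arange(T).unsqueeze(0)   # key positions, shape (1, T)
    # D[n, m] = gamma^(n - m) if n >= m else 0 (causal mask with exponential decay)
    D = (gamma ** (n - m).clamp(min=0)) * (n >= m)
    return (Q @ K.transpose(-1, -2) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: a fixed-size state S is updated per step, so decoding is O(1)."""
    d_k, d_v = Q.shape[-1], V.shape[-1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(Q.shape[0]):
        # S_t = gamma * S_{t-1} + K_t^T V_t (decayed accumulation of key-value outer products)
        S = gamma * S + K[t].unsqueeze(1) @ V[t].unsqueeze(0)
        # o_t = Q_t S_t
        outputs.append(Q[t].unsqueeze(0) @ S)
    return torch.cat(outputs, dim=0)

# The two computation paths produce the same result for the same inputs.
T, d = 16, 8
Q, K, V = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
gamma = 0.9
assert torch.allclose(retention_parallel(Q, K, V, gamma),
                      retention_recurrent(Q, K, V, gamma), atol=1e-4)
```

During training, the parallel form processes the whole sequence at once; during inference, the recurrent form replaces the growing key-value cache of attention with the single state matrix S.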
Transformers have become the primary architecture for LLMs, but their training parallelism comes at the cost of inefficient inference: decoding has O(N) complexity per step and relies on a memory-bound key-value cache. With growing sequence lengths, this deficiency increases GPU memory consumption and latency while reducing inference speed. RetNet is a potential next-generation architecture aiming to retain the training parallelism and competitive performance of transformers while improving inference efficiency.
In their paper, the team at Microsoft Research and Tsinghua University conducted a series of experiments showing that RetNet is competitive with transformers and their variants in terms of both scaling curves and in-context learning. The paper also states that the inference cost of RetNet is length-invariant. For a 7B-parameter model and an 8k sequence length, RetNet decoded 8.4x faster and saved 70% of the memory compared to transformers with key-value caches. During training, RetNet achieves 25-50% memory savings and 7x acceleration compared to standard transformers.