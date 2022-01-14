



Posted by Sneha Kudugunta, Research Software Engineer, and Orhan Firat, Research Scientist, Google Research

Scaling large language models has significantly improved quality in natural language understanding (T5), generation (GPT-3), and multilingual neural machine translation (M4). One common approach to building a larger model is simply to enlarge the existing dimensions of the network, increasing its depth (number of layers) and width (layer dimensionality). Such dense models take an input sequence (divided into smaller components called tokens) and pass every token through the full network, activating every layer and parameter. These large, dense models have achieved state-of-the-art results on multiple natural language processing (NLP) tasks, but their training cost grows in proportion to model size.

An alternative and increasingly popular approach is to build sparsely activated models based on a mixture of experts (MoE) (e.g., GShard-M4 or GLaM), where each token passed to the network follows a separate subnetwork by skipping some of the model parameters. The choice of how to distribute the input tokens to each subnetwork (the "experts") is made by small router networks that are trained together with the rest of the network. This allows researchers to increase the size (and thus the performance) of the model without a proportional increase in training cost.
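To make the routing idea concrete, here is a minimal sketch of token-level top-k routing in plain Python. It is not the GShard or GLaM implementation: the names `route_token` and `router_weights`, the random weights, and the sizes (8 experts, 4-dimensional tokens) are all assumptions chosen for illustration. A real router is a learned layer inside the model.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token_vec, router_weights, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    router_weights has one row per expert (hypothetical: a single linear layer).
    Gate weights are renormalized over the selected experts so they sum to 1.
    """
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Toy setup: random numbers stand in for trained router weights and token vectors.
rng = random.Random(0)
NUM_EXPERTS, DIM = 8, 4
router_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]

for position, tok in enumerate(tokens):
    print(position, route_token(tok, router_weights))
```

Each token only runs through the experts it is routed to, which is why the total parameter count can grow without the per-token compute growing at the same rate.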

While this is an effective strategy at training time, sending the tokens of a long sequence to multiple experts makes inference computationally expensive, because the experts have to be distributed across a large number of accelerators. For example, serving the 1.2T-parameter GLaM model requires 256 TPU-v3 chips. As with dense models, the number of processors needed to serve an MoE model still scales linearly with model size, which increases compute requirements and also adds significant communication overhead and engineering complexity.

In "Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference", we introduce a method called Task-level Mixture-of-Experts (TaskMoE) that takes advantage of the quality gains of model scaling while remaining efficient to serve. Our solution is to train a large multi-task model and then extract from it smaller, stand-alone per-task subnetworks that are suitable for inference, with significantly reduced inference latency and no loss in model quality. We demonstrate the effectiveness of this method for multilingual neural machine translation (NMT), comparing it with other mixture-of-experts models and with models compressed using knowledge distillation.

Training Large Sparsely Activated Models with Task Information

We train a sparsely activated model in which router networks learn to send the tokens of each task-specific input to the subnetworks of the model associated with the task in question. For example, in multilingual NMT, every token of a given language is routed to the same subnetwork. This differs from other recent approaches, such as sparsely gated mixture-of-experts models (e.g., TokenMoE), where router networks learn to send different tokens of an input to different subnetworks regardless of the task.
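The contrast with token-level routing can be sketched in a few lines. In this hypothetical example, the routing decision depends only on the task identity (here, a language code), so every token of a sentence lands on the same experts. The table `task_router_logits`, the language codes, and the function names are illustrative assumptions; in the actual model the router is learned jointly with the rest of the network.

```python
import random

NUM_EXPERTS, TOP_K = 8, 2
rng = random.Random(0)

# Hypothetical stand-in for a learned task-level router:
# one row of expert scores per task (language).
task_router_logits = {
    lang: [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
    for lang in ("fr", "de", "ja")
}

def experts_for_task(task_id, k=TOP_K):
    """Pick the k highest-scoring experts for a task; token content is ignored."""
    logits = task_router_logits[task_id]
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:k]
    return sorted(top)

def route_sentence(tokens, task_id):
    """Every token of the sentence is sent to the same task-specific experts."""
    chosen = experts_for_task(task_id)
    return [(tok, chosen) for tok in tokens]

for tok, experts in route_sentence(["Bonjour", "le", "monde"], "fr"):
    print(tok, experts)
```

Because the expert set is fixed per task, it is known before any input arrives, which is what later makes it possible to serve a task without the unused experts.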

Inference: Bypassing Distillation by Extracting Subnetworks

A consequence of this difference in training between TaskMoE and models like TokenMoE is how we approach inference. Because TokenMoE follows the practice of distributing tokens of the same task to many experts at both training and inference time, it remains computationally expensive at inference.

For TaskMoE, we dedicate a smaller subnetwork to a single task identity during both training and inference. At inference time, we extract a subnetwork for each task by discarding the experts that task does not use. TaskMoE and its variants let us train a single large multi-task network and then use a separate subnetwork for each task at inference time, without applying any additional compression method after training. The process of training a TaskMoE network and then extracting per-task subnetworks for inference is illustrated below.

During training, tokens of the same language are routed to the same experts based on language information (source, target, or both) in task-based MoE. Later, during inference, we extract the subnetwork for each task and discard the unused experts.
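The extraction step described above can be sketched as a simple filtering operation. This is a toy illustration under stated assumptions, not the authors' code: experts are represented by string labels rather than weight tensors, and the `task_routes` table (which experts each task uses in each MoE layer) is invented for the example.

```python
NUM_LAYERS, NUM_EXPERTS = 3, 32

# Toy stand-in for trained MoE layers: each expert is just a label here.
moe_layers = [
    [f"layer{layer}_expert{e}" for e in range(NUM_EXPERTS)]
    for layer in range(NUM_LAYERS)
]

# Hypothetical learned routing: per task, the expert indices chosen in each MoE layer.
task_routes = {
    "en-fr": [(3, 17), (0, 29), (8, 9)],
    "en-de": [(3, 5), (12, 29), (1, 30)],
}

def extract_subnetwork(layers, routes):
    """Keep only the experts a task is routed to; everything else is discarded."""
    return [[experts[i] for i in keep] for experts, keep in zip(layers, routes)]

subnetworks = {
    task: extract_subnetwork(moe_layers, routes)
    for task, routes in task_routes.items()
}

kept = sum(len(layer) for layer in subnetworks["en-fr"])
total = NUM_LAYERS * NUM_EXPERTS
print(f"en-fr keeps {kept} of {total} experts")  # en-fr keeps 6 of 96 experts
```

No retraining is involved: the per-task model is a subset of the weights that already exist, which is the sense in which this approach bypasses distillation.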

To demonstrate this approach, we train models based on the Transformer architecture. Similar to GShard-M4 and GLaM, we replace the feedforward network of every other Transformer layer with a Mixture-of-Experts (MoE) layer, which consists of multiple identical feedforward networks, the "experts". For each task, the routing network, trained along with the rest of the model, keeps track of the task identity of all input tokens and selects a fixed number of experts per layer (two in this case) to form the task-specific subnetwork. The baseline dense Transformer model has 143M parameters and 6 layers on both the encoder and the decoder. The TaskMoE and TokenMoE models that we train are also 6 layers deep, but every MoE layer has 32 experts, for a total of 533M parameters. We train our models on publicly available WMT datasets, with over 431M sentences across 30 language pairs from different language families and scripts. We point the reader to the full paper for further details.
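As a rough back-of-the-envelope check on how the parameter count grows from 143M to 533M, the sketch below counts only the extra expert weights. The model dimensions `D_MODEL = 512` and `D_FF = 2048` are assumptions on our part (they are not stated in this post), biases are ignored, and "every other layer" is taken to mean 6 of the 12 encoder and decoder layers. Treat it as an illustration of the accounting, not as the published configuration.

```python
# Assumed dimensions (NOT given in the post); biases and layer norms are ignored.
D_MODEL, D_FF = 512, 2048
ffn_params = 2 * D_MODEL * D_FF  # two weight matrices per feedforward network

TOTAL_LAYERS = 12                # 6 encoder + 6 decoder layers (from the post)
MOE_LAYERS = TOTAL_LAYERS // 2   # every other layer becomes an MoE layer
NUM_EXPERTS = 32                 # experts per MoE layer (from the post)
DENSE_PARAMS = 143_000_000       # dense baseline size (from the post)

# Each MoE layer replaces 1 feedforward network with NUM_EXPERTS of them,
# so it adds (NUM_EXPERTS - 1) extra copies on top of the dense model.
extra_params = (NUM_EXPERTS - 1) * MOE_LAYERS * ffn_params
total_params = DENSE_PARAMS + extra_params

print(f"extra expert parameters: {extra_params / 1e6:.0f}M")
print(f"estimated MoE total:     {total_params / 1e6:.0f}M")
```

Under these assumed dimensions the estimate comes out close to the 533M reported above, which suggests the extra capacity sits almost entirely in the expert feedforward networks. The paper remains the authoritative source for the actual configuration.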

To demonstrate the advantage of using TaskMoE at inference time, we compare the throughput, that is, the number of tokens decoded per second, of the TaskMoE, TokenMoE, and baseline dense models. Once the subnetwork for each task is extracted, TaskMoE is 7x smaller than the 533M-parameter TokenMoE model and can be served on a single TPUv3 core instead of the 64 cores required for TokenMoE. The peak throughput of TaskMoE is twice that of the TokenMoE model. In addition, inspecting the TokenMoE model shows that 25% of its inference time is spent on inter-device communication, while TaskMoE spends virtually no time on communication.

Throughput of TaskMoE and TokenMoE compared across different batch sizes. The maximum batch size for TokenMoE is 1024, as opposed to 4096 for TaskMoE and the dense baseline model. Here, TokenMoE has one instance distributed across 64 TPUv3 cores, while TaskMoE and the baseline model have one instance on each of the 64 cores.

A popular approach to building a smaller network that still performs well is knowledge distillation, in which a large teacher model trains a smaller student model with the goal of matching the teacher's performance. However, this method comes at the cost of the additional computation needed to train the student from the teacher. We therefore also compare TaskMoE with a baseline TokenMoE model that we compress using knowledge distillation. The compressed TokenMoE model is comparable in size to the per-task subnetworks extracted from TaskMoE.
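For readers unfamiliar with distillation, the following sketch shows one common form of the training signal: the student is penalized by the cross-entropy between its output distribution and the teacher's temperature-softened distribution. The post does not say which distillation variant was used, so this soft-target formulation, the temperature value, and the function names are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [2.0, 1.0, 0.1]          # toy logits over a 3-word vocabulary
good_student = [1.9, 1.1, 0.2]     # close to the teacher
poor_student = [0.1, 1.0, 2.0]     # ranks the words in the opposite order

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

Minimizing this loss requires running the teacher and training the student over the data again, which is the additional computation that extracting subnetworks from TaskMoE avoids.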

We find that, in addition to being a simpler method that needs no additional training, TaskMoE improves on the distilled TokenMoE model by 2.1 BLEU on average across all languages in our multilingual translation model. Distillation retains 43% of the performance gains achieved by scaling a dense multilingual model to a TokenMoE, whereas extracting the smaller subnetwork from the TaskMoE model results in no loss of quality.

BLEU scores (higher is better) comparing a distilled TokenMoE model with the TaskMoE and TokenMoE models with 12 layers (6 on the encoder and 6 on the decoder) and 32 experts. Both approaches improve on the multilingual dense baseline, but TaskMoE improves on it by 3.1 BLEU on average, while distilling from TokenMoE improves on it by 1.0 BLEU on average.

Next Steps

The quality improvements often seen when scaling machine learning models have motivated the research community to advance scaling technology so that large models can be trained efficiently. The emerging need to train models that generalize to multiple tasks and modalities only increases the need to scale models further. However, the practicality of serving these large models remains a major challenge. Efficiently deploying large models is an important direction of research, and we believe TaskMoE is a promising step towards more inference-friendly algorithms that retain the quality gains of scaling.

Acknowledgments

We would first like to thank our co-authors Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin and Minh-Thang Luong. We would also like to thank Wolfgang Macherey, Yuanzhong Xu, Zhifeng Chen, and Macduff Richard Hughes for their helpful feedback. Special thanks to the Translate and Brain teams for their useful input and discussions, and to the entire GShard development team for their foundational contributions to this project. We would also like to thank Tom Small for creating the animations for this blog post.

