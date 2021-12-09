



Posted by Yuanzhong Xu and Yanping Huang, Software Engineers, Google Research, Brain Team

Scaling neural networks, whether in the amount of training data, the model size, or the computation used, has been critical for improving model quality in many real-world machine learning applications, such as computer vision, language understanding, and neural machine translation. This, in turn, has motivated recent research into the factors that play a key role in the success of scaling a neural model. Increasing model capacity can be a sound way to improve model quality, but it presents a number of systems and software engineering challenges that must be overcome. For example, to train a large model that exceeds the memory capacity of an accelerator, the weights and the computation of the model must be partitioned across multiple accelerators. This parallelization increases network communication overhead and can lead to device under-utilization. Moreover, a given parallelization algorithm typically requires significant engineering effort, yet may not work with different model architectures.

To address these scaling challenges, we present "GSPMD: General and Scalable Parallelization for ML Computation Graphs", in which we describe an open-source automatic parallelization system based on the XLA compiler. GSPMD can scale most deep learning network architectures and has already been applied to many deep learning models across several domains, such as GShard-M4, LaMDA, BigSSL, ViT, and MetNet-2. GSPMD has also been integrated into multiple ML frameworks, including TensorFlow and JAX, which use XLA as a shared compiler.

Overview

GSPMD separates the task of programming an ML model from the challenge of parallelization. It allows model developers to write programs as if they were running on a single device with very high memory and computation capacity. The user simply adds a few lines of annotation code to a subset of the critical tensors in the model code to indicate how to partition those tensors. For example, to train a large model-parallel Transformer, one may only need to annotate fewer than 10 tensors (less than 1% of all tensors in the entire computation graph), with one additional line of code per tensor. GSPMD then runs a compiler pass that determines the parallelization plan for the entire graph and transforms it into a mathematically equivalent, parallelized computation that can be executed on each device. This allows users to focus on building the model rather than implementing parallelization, and makes it easy to port existing single-device programs to run at a much larger scale.
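To make the idea of a per-tensor annotation concrete, here is a minimal sketch in plain Python. It does not use the real annotation API of GSPMD or any framework: the mesh, the `shard_shape` helper, and the tensor names are all hypothetical, and the sketch only shows how a one-line partition spec determines the slice each device would hold.

```python
# Hypothetical logical device mesh: axis name -> number of devices along it.
MESH = {"x": 2, "y": 2}

def shard_shape(global_shape, spec, mesh=MESH):
    """Return the per-device shape of a tensor partitioned according to spec.

    spec has one entry per tensor dimension: the name of the mesh axis that
    dimension is split across, or None to leave it unsplit (replicated).
    """
    if len(global_shape) != len(spec):
        raise ValueError("spec needs one entry per tensor dimension")
    local = []
    for size, axis in zip(global_shape, spec):
        parts = 1 if axis is None else mesh[axis]
        if size % parts:
            raise ValueError(f"dimension {size} is not divisible by {parts}")
        local.append(size // parts)
    return tuple(local)

# One "annotation" (global shape plus partition spec) per important tensor.
# Tensor names and sizes are made up for illustration.
annotations = {
    "activations [batch, model]": ((8, 16), ("x", "y")),
    "ffn_weight_in [model, hidden]": ((16, 64), ("x", "y")),
    "layer_norm_scale [model]": ((16,), (None,)),
}

local_shapes = {
    name: shard_shape(shape, spec) for name, (shape, spec) in annotations.items()
}

for name, shape in local_shapes.items():
    print(f"{name}: per-device shape {shape}")
```

In the real system the compiler, not the user, works out the partitioning of every unannotated tensor from a handful of such specs; this sketch covers only the annotated ones.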

Separating model programming from parallelism also lets developers minimize code duplication. With GSPMD, developers can adopt different parallelism algorithms for different use cases without having to reimplement the model. For example, the model code that powered the GShard-M4 and LaMDA models can apply a variety of parallelization strategies, appropriate for different models and cluster sizes, with the same model implementation. Similarly, by applying GSPMD, the large BigSSL speech models can share the same implementation as previous, smaller models.

Generality and Flexibility

Because different model architectures may be better suited to different parallelization strategies, GSPMD is designed to support a wide variety of parallelism algorithms appropriate for different use cases. For example, for smaller models that fit within the memory of a single accelerator, data parallelism is preferred, in which devices train the same model using different input data. In contrast, models that are larger than the memory capacity of a single accelerator are better suited to a pipelining algorithm (such as that used by GPipe), which partitions the model into multiple sequential stages, or to operator-level parallelism (such as Mesh-TensorFlow), in which the individual computation operators in the model are split into smaller, parallel operators.
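As a small illustration of data parallelism, the sketch below splits a batch across two simulated devices, computes a local gradient on each, and averages the results, which stands in for an AllReduce. The one-parameter model, the data, and the helper name `grad` are hypothetical and chosen only so that the arithmetic is exact.

```python
from fractions import Fraction

def grad(w, batch):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / Fraction(len(batch))

# Hypothetical training batch of (input, target) pairs and a starting weight.
batch = [(1, 2), (2, 3), (3, 7), (4, 8)]
w = Fraction(1)

# Every simulated device holds the same weight but a different slice of the batch.
num_devices = 2
per_device = len(batch) // num_devices
shards = [batch[d * per_device:(d + 1) * per_device] for d in range(num_devices)]

# Each device computes a gradient on its own slice.
local_grads = [grad(w, shard) for shard in shards]

# Averaging the local gradients plays the role of an AllReduce.
all_reduced = sum(local_grads) / num_devices

# With equally sized slices, the result matches the single-device gradient.
assert all_reduced == grad(w, batch)
print("data-parallel gradient:", all_reduced)
```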

GSPMD supports all of the above parallelization algorithms with a uniform abstraction and implementation. In addition, GSPMD supports nested patterns of parallelism. For example, it can be used to partition a model into individual pipeline stages, each of which can be further partitioned using operator-level parallelism.

GSPMD also facilitates innovation in parallelism algorithms by allowing performance experts to focus on algorithms that make the best use of the hardware, rather than on implementations that involve a great deal of cross-device communication. For example, for large Transformer models, we found a novel operator-level parallelism algorithm that partitions multiple dimensions of tensors on a 2D mesh of devices. It reduces peak accelerator memory usage linearly with the number of training devices, while its balanced data distribution across multiple dimensions maintains high utilization of accelerator compute.
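The memory scaling can be seen with a little arithmetic. The sketch below is a simplified illustration, not a model of real peak memory: it counts only the bytes of one weight matrix, and the matrix shape, element size, and helper name `per_device_bytes` are hypothetical.

```python
def per_device_bytes(shape, mesh_shape, bytes_per_element=4):
    """Bytes held by one device when tensor dimension k is split over mesh axis k."""
    elements = 1
    for size, parts in zip(shape, mesh_shape):
        if size % parts:
            raise ValueError(f"dimension {size} is not divisible by {parts}")
        elements *= size // parts
    return elements * bytes_per_element

# Hypothetical [model, hidden] feedforward weight stored as 4-byte floats.
WEIGHT_SHAPE = (8192, 32768)

for mesh_shape in [(1, 1), (2, 2), (4, 4), (8, 8)]:
    devices = mesh_shape[0] * mesh_shape[1]
    mib = per_device_bytes(WEIGHT_SHAPE, mesh_shape) / 2**20
    print(f"{devices:3d} devices: {mib:8.1f} MiB per device")
```

Because both dimensions are split, the per-device footprint of this tensor falls in proportion to the total number of devices in the mesh, not just the size of one mesh axis.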

To illustrate this, consider a simplified feedforward layer of a Transformer model that has been annotated in the above way. To execute the first matrix multiplication on fully partitioned input data, GSPMD applies an MPI-style AllGather communication operator to partially merge the local data with the partitioned data from another device. It then executes the matrix multiplication locally and produces a partitioned result. Before the second matrix multiplication, GSPMD adds another AllGather on the right-hand side input and executes the matrix multiplication locally, yielding intermediate results that need to be combined and partitioned. For this, GSPMD adds an MPI-style ReduceScatter communication operator that accumulates and partitions these intermediate results. Although the tensors produced by the AllGather operator at each stage are larger than the original partition size, they are short-lived and the corresponding memory buffers are freed after use, so they do not affect peak memory usage in training.
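The sequence of AllGather, local matrix multiplication, and ReduceScatter can be simulated on a single machine. The sketch below uses plain Python lists for a 2×2 mesh and checks the partitioned result against an unpartitioned reference. The tensor layout is an assumption (batch split over mesh axis x, the model dimension of the activations over axis y, and both weights split over both axes), as are the tiny integer matrices and the ReLU between the two multiplications. In this sketch the first step also gathers the weight; the collectives are emulated with list concatenation, not real communication.

```python
X_PARTS, Y_PARTS = 2, 2  # logical 2x2 device mesh with axes x and y
DEVICES = [(i, j) for i in range(X_PARTS) for j in range(Y_PARTS)]

def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(v, 0) for v in row] for row in m]

def madd(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(m, r, r_parts, c, c_parts):
    """Block (r, c) of m when rows are cut into r_parts and columns into c_parts."""
    rows, cols = len(m) // r_parts, len(m[0]) // c_parts
    return [row[c * cols:(c + 1) * cols] for row in m[r * rows:(r + 1) * rows]]

def hcat(blocks):  # concatenate matrices along columns
    return [sum(rows, []) for rows in zip(*blocks)]

def vcat(blocks):  # concatenate matrices along rows
    return [row for b in blocks for row in b]

# Hypothetical global tensors: x is [batch, model], w1 is [model, hidden],
# w2 is [hidden, model].
x = [[1, -2, 3, 0], [0, 1, -1, 2], [2, 0, 1, -1], [-1, 3, 0, 1]]
w1 = [[1, 0, -1, 2], [0, 1, 1, -1], [2, -1, 0, 1], [-1, 2, 1, 0]]
w2 = [[1, 1, 0, -1], [0, -1, 2, 1], [1, 0, -1, 0], [-2, 1, 0, 1]]

# Single-device reference result.
reference = matmul(relu(matmul(x, w1)), w2)

# Assumed partitioning; device (i, j) sits at position i on axis x, j on axis y.
x_shard = {(i, j): block(x, i, X_PARTS, j, Y_PARTS) for i, j in DEVICES}   # batch/x, model/y
w1_shard = {(i, j): block(w1, i, X_PARTS, j, Y_PARTS) for i, j in DEVICES}  # model/x, hidden/y
w2_shard = {(i, j): block(w2, j, Y_PARTS, i, X_PARTS) for i, j in DEVICES}  # hidden/y, model/x

partial = {}
for i, j in DEVICES:
    # AllGather along y: recover the full model dimension of the activations.
    x_rows = hcat([x_shard[(i, k)] for k in range(Y_PARTS)])
    # AllGather along x: recover the full model dimension of the first weight.
    w1_cols = vcat([w1_shard[(k, j)] for k in range(X_PARTS)])
    hidden = relu(matmul(x_rows, w1_cols))          # local result [batch/x, hidden/y]
    # AllGather along x on the right-hand side input of the second matmul.
    w2_rows = hcat([w2_shard[(k, j)] for k in range(X_PARTS)])
    partial[(i, j)] = matmul(hidden, w2_rows)       # partial sums [batch/x, model]

out_shard = {}
for i, j in DEVICES:
    # ReduceScatter along y: sum the partial results, keep one column block each.
    total = partial[(i, 0)]
    for k in range(1, Y_PARTS):
        total = madd(total, partial[(i, k)])
    out_shard[(i, j)] = block(total, 0, 1, j, Y_PARTS)

# Reassemble the partitions and compare with the single-device computation.
result = vcat([hcat([out_shard[(i, j)] for j in range(Y_PARTS)]) for i in range(X_PARTS)])
assert result == reference
```

The gathered tensors (`x_rows`, `w1_cols`, `w2_rows`) exist only inside one loop iteration, mirroring the short-lived buffers described above, while every tensor kept between steps stays at the partitioned size.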

Left: A simplified feedforward layer of a Transformer model. The blue rectangles represent tensors, and the overlaid dashed red and blue lines represent the desired partitioning across a 2×2 mesh of devices. Right: A single partition, after GSPMD has been applied.

A Transformer Example with Nested Parallelism

As a shared, robust mechanism for different parallelism modes, GSPMD allows users to switch easily between modes in different parts of a model. This is especially valuable for models whose components have distinct performance characteristics, such as multimodal models that process both images and audio. Consider a model with the Transformer encoder-decoder architecture, which has an embedding layer, an encoder stack with Mixture-of-Expert layers, a decoder stack with dense feedforward layers, and a final softmax layer. In GSPMD, a complex combination of several parallelism modes that treats each layer separately can be achieved with simple configurations.

The following figure shows a partitioning strategy across 16 devices organized as a logical 4×4 mesh. Blue represents partitioning along the first mesh dimension, X, and yellow represents partitioning along the second mesh dimension, Y. X and Y are repurposed for different model components to achieve different parallelism modes. For example, the X dimension is used for data parallelism in the embedding and softmax layers, but for pipeline parallelism in the encoder and decoder. The Y dimension is also used in different ways, to partition the vocabulary, batch, or model expert dimensions.
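The way one mesh axis takes on different roles can be sketched as a small lookup. The code below covers only the X axis, whose roles are stated above; the row-major device numbering and the names `mesh_coords`, `X_ROLE`, and `describe` are assumptions made for illustration and are not part of any GSPMD configuration format.

```python
MESH_X, MESH_Y = 4, 4  # 16 devices organized as a logical 4x4 mesh

def mesh_coords(device_id):
    """Map a flat device id to (x, y), assuming row-major device numbering."""
    if not 0 <= device_id < MESH_X * MESH_Y:
        raise ValueError("device id outside the mesh")
    return divmod(device_id, MESH_Y)

# What mesh axis X is used for in each model component, as described in the text.
X_ROLE = {
    "embedding": "data",
    "encoder": "pipeline",
    "decoder": "pipeline",
    "softmax": "data",
}

def describe(device_id, component):
    """Say what the X coordinate of a device selects for a given component."""
    x, _ = mesh_coords(device_id)
    if X_ROLE[component] == "data":
        return f"batch shard {x} of {MESH_X}"
    return f"pipeline stage {x} of {MESH_X}"

for component in X_ROLE:
    print(f"device 6, {component}: {describe(6, component)}")
```

The same physical device therefore acts as a batch shard in one component and as a pipeline stage in another, without any change to the mesh itself.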

Computational Efficiency

GSPMD provides industry-leading performance in training large models. Parallel models require extra communication to coordinate multiple devices during computation, so the efficiency of a parallel model can be estimated from the fraction of time spent on communication overhead: the higher the utilization and the less time spent on communication, the better. In the recent MLPerf set of performance benchmarks, a BERT-like encoder-only model with about 500 billion parameters, parallelized with GSPMD over 2048 TPU-V4 chips, yielded highly competitive results (see table below), utilizing up to 63% of the peak FLOPS that the TPU-V4s offer. The table below also provides efficiency benchmarks for some representative large models. These example model configurations are open-sourced in the Lingvo framework, along with instructions for running them on Google Cloud. More benchmark results can be found in our paper.

| Model Family | Parameter Count | % of Model Activated* | Number of Experts** | Number of Layers | Number of TPUs | FLOPS Utilization |
|---|---|---|---|---|---|---|
| Dense Decoder (LaMDA) | 137B | 100% | 1 | 64 | 1024 TPUv3 | 56.5% |
| Dense Encoder (MLPerf-Bert) | 480B | 100% | 1 | 64 | 2048 TPUv4 | 63% |
| Sparsely Activated Encoder-Decoder (GShard-M4) | 577B | 0.25% | 2048 | 32 | 1024 TPUv3 | 46.8% |
| Sparsely Activated Decoder | 1.2T | 8% | 64 | 64 | 1024 TPUv3 | 53.8% |

* The percentage of the model activated during inference, which is a measure of model sparsity.
** The number of experts in the Mixture of Experts layer. A value of 1 corresponds to a standard Transformer without a Mixture of Experts layer.
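One rough way to read the sparsity column is to multiply the parameter count by the percentage of the model activated. The sketch below does this for the four rows of the table. The multiplication itself is an interpretation, not a figure reported in the source, and the helper name `activated_parameters` is hypothetical.

```python
from fractions import Fraction

# (model, total parameters, percent of model activated), copied from the table.
ROWS = [
    ("LaMDA", 137 * 10**9, Fraction(100)),
    ("MLPerf-Bert", 480 * 10**9, Fraction(100)),
    ("GShard-M4", 577 * 10**9, Fraction(25, 100)),
    ("Sparsely activated decoder", 12 * 10**11, Fraction(8)),
]

def activated_parameters(total, percent):
    """Parameters active during inference, as total * percent / 100."""
    return total * percent / 100

for name, total, percent in ROWS:
    active = activated_parameters(total, percent)
    print(f"{name}: about {float(active) / 1e9:.2f}B of {total / 1e9:.0f}B parameters active")
```

Read this way, the sparsely activated models have the largest total parameter counts in the table while using only a small share of those parameters during inference.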

Conclusion

The continued development and success of many useful machine learning applications, such as NLP, speech recognition, machine translation, and autonomous driving, depend on achieving the highest possible accuracy. Because this often requires building larger and more complex models, we are pleased to share the GSPMD paper and the corresponding open-source library with the broader research community, and we hope they will be useful for efficient training of large-scale deep neural networks.

Acknowledgments

We wish to thank Claire Cui, Zhifeng Chen, Yonghui Wu, Naveen Kumar, Macduff Hughes, Zoubin Ghahramani, and Jeff Dean for their support and invaluable input. We would also like to thank our collaborators Dmitry Lepikhin, HyoukJoong Lee, Dehao Chen, Orhan Firat, Maxim Krikun, Blake Hechtman, Rahul Joshi, Andy Li, Tao Wang, Marcello Maggioni, David Majnemer, Noam Shazeer, Ankur Bapna, Sneha Kudugunta, Quoc Le, Mia Chen, Shibo Wang, Jinliang Wei, Ruoming Pang, Zongwei Zhou, David So, Yanqi Zhou, Ben Lee, Jonathan Shen, James Qin, Yu Zhang, Wei Han, Anmol Gulati, Laurent El Shafey, Andrew Dai, Kun Zhang, Nan Du, James Bradbury, Matthew Johnson, Anselm Levskaya, Skye Wanderman-Milne, and Qiao Zhang for useful discussions and inspiration.

