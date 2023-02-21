



This repository contains JAX, TensorFlow, and PyTorch implementations of the Lion optimizer, which was discovered by symbolic program search in the “Symbolic Discovery of Optimization Algorithms” paper. Lion is also available on Praxis.

Simple, memory efficient, fast execution time

Compared to AdamW and various adaptive optimizers, which need to store both the first and second moments, Lion only needs the momentum, halving the additional memory footprint. This is useful when training large models or using large batch sizes. For example, to train ViT-B/16 with image size 224 and batch size 4,096, AdamW needs at least 16 TPU v4 chips, while Lion only needs 8. Another practical advantage is that, thanks to its simplicity, Lion runs faster (steps/second) in our experiments, typically giving a 2-15% speedup over AdamW and Adafactor depending on the task, codebase, and hardware.
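To make the memory argument concrete, here is a minimal pure-Python sketch of one Lion step as described in the paper. It is an illustration only, not the code shipped in this repository: the function name `lion_step` and the list-of-floats parameter layout are assumptions chosen for readability. Note that the only optimizer state carried between steps is the momentum, one value per parameter.

```python
from typing import List, Tuple


def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(
    params: List[float],
    grads: List[float],
    momentum: List[float],
    lr: float = 1e-4,
    beta1: float = 0.9,
    beta2: float = 0.99,
    weight_decay: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """One Lion update (hypothetical helper, illustration only)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient with beta1, then keep only the sign.
        update = sign(beta1 * m + (1.0 - beta1) * g)
        # Decoupled weight decay: the effective decay strength is lr * weight_decay.
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum itself is tracked with a separate coefficient, beta2.
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


params, momentum = [1.0, -2.0], [0.0, 0.0]
params, momentum = lion_step(params, [0.5, -0.1], momentum, lr=0.1)
print(params, momentum)
```

Every parameter moves by exactly `lr` in magnitude (before weight decay), regardless of how large its gradient is, which is why the learning rate and weight decay need to be rescaled relative to AdamW.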

Outstanding Performance on Various Architectures, Tasks, and Domains

Image Classification

Lion outperforms AdamW on various architectures trained from scratch on ImageNet or pre-trained on ImageNet-21K.

Lion saves up to 5x the pre-training cost on JFT-300M.

Results after fine-tuning with higher resolution and Polyak averaging. The obtained ViT-L/16 matches the previous ViT-H/14 result trained by AdamW while being 2x smaller. Our ViT-G/14 further achieves 90.71% accuracy on ImageNet.

In vision-language contrastive training with LiT, Lion beats AdamW on zero-shot image classification and image-text retrieval.

On BASIC-L, Lion achieves 88.3% zero-shot and 91.1% fine-tuning ImageNet accuracy, beating the previous best results by 2% and 0.1% respectively.

Diffusion Model

On diffusion models, Lion outperforms AdamW in terms of FID score and saves up to 2.3x the training compute. Left to right: 64×64, 128×128, 256×256 image generation trained on ImageNet.

Language Modeling

Lion saves up to 2x the compute needed to reach a given validation perplexity on language modeling tasks (left: on Wiki-40B, right: on PG-19). Lion achieves larger gains on larger Transformers.

Lion achieves better average in-context learning ability than Adafactor when training LLMs.

Lion is also better when fine-tuning T5 on GLUE.

Choosing Hyperparameters and Batch Size

Lion is simple and has fewer hyperparameters than AdamW and Adafactor because it does not require $\epsilon$ or factorization-related hyperparameters. To ensure a fair comparison, we tune the peak learning rate $lr$ and the decoupled weight decay $\lambda$ on a logarithmic scale for both AdamW (Adafactor) and Lion. In AdamW, the default values for $\beta_1$ and $\beta_2$ are 0.9 and 0.999 respectively, with $\epsilon = 1e-8$. In Lion, the default values for $\beta_1$ and $\beta_2$ were found through the program search process and are 0.9 and 0.99 respectively. We tune these hyperparameters only for the language tasks, where $\beta_1=0.9$, $\beta_2=0.99$ in AdamW and $\beta_1=0.95$, $\beta_2=0.98$ in Lion. In addition, AdamW’s $\epsilon$ is set to $1e-6$ instead of the default $1e-8$ because this improves stability in our experiments, similar to the observations reported for RoBERTa.

The updates produced by Lion have a larger norm than those produced by other optimizers because the sign operation makes every element of the update $\pm 1$. Based on our experience, a suitable learning rate for Lion is typically 1/10th of that for AdamW. Since the effective weight decay is $lr * \lambda$, the value of $\lambda$ used for Lion is ten times larger than that for AdamW to maintain a similar strength. For example:

$lr=1e-4$, $\lambda=10.0$ in Lion and $lr=1e-3$, $\lambda=1.0$ in AdamW when training ViT-B/16 on ImageNet with strong augmentations.

$lr=3e-5$, $\lambda=0.1$ in Lion and $lr=3e-4$, $\lambda=0.01$ in AdamW for diffusion models.

$lr=1e-4$, $\lambda=0.01$ in Lion and $lr=1e-3$, $\lambda=0.001$ in Adafactor for 7.5B language modeling.
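The rule of thumb above can be written as a tiny conversion helper. This is a sketch for picking a starting point before tuning, not a substitute for a hyperparameter search; the function name `adamw_to_lion` and the `factor` argument are hypothetical and are not part of this repository.

```python
from typing import Tuple


def adamw_to_lion(lr: float, weight_decay: float, factor: float = 10.0) -> Tuple[float, float]:
    """Map AdamW (lr, lambda) to a Lion starting point (hypothetical helper).

    The learning rate is divided by `factor` and the decoupled weight decay
    is multiplied by it, so the effective decay lr * lambda is unchanged.
    """
    return lr / factor, weight_decay * factor


# AdamW settings quoted above for ViT-B/16 on ImageNet and for diffusion models.
for name, lr, wd in [("ViT-B/16", 1e-3, 1.0), ("diffusion", 3e-4, 0.01)]:
    lion_lr, lion_wd = adamw_to_lion(lr, wd)
    print(f"{name}: Lion lr={lion_lr:.0e}, lambda={lion_wd:g}, lr*lambda={lion_lr * lion_wd:.0e}")
```

Keeping the product $lr * \lambda$ fixed is the design choice that matters here: it preserves the per-step shrinkage of the weights while the sign-based update gets the smaller step size it needs.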

See the paper for all hyperparameters.

Apart from peak performance, sensitivity to hyperparameters and the difficulty of tuning them are also critical when adopting an optimizer in practice. In the figure below, we vary both $lr$ and $\lambda$ when training ViT-B/16 from scratch on ImageNet. As the heatmaps suggest, Lion is more robust to different hyperparameter choices than AdamW.

Given the extra noise introduced by the sign operation, one might ask whether Lion needs a large batch size to determine the update direction accurately. To address this question, we train a ViT-B/16 model on ImageNet using various batch sizes while keeping the total number of training epochs at 300 and incorporating the RandAug and Mixup techniques. As shown in the figure below, the optimal batch size is 256 for AdamW and 4,096 for Lion. This shows that Lion does prefer a larger batch size, but its performance remains robust even with a batch size as small as 64. Furthermore, when the batch size is increased to 32K, which leaves only 11K training steps, Lion achieves a significant 2.5% accuracy improvement over AdamW (77.9% vs. 75.4%), demonstrating its effectiveness in the large-batch training setting.

Left: Ablation of the effect of batch size. Lion prefers a larger batch size than AdamW. Middle and right: ImageNet accuracy of ViT-B/16 trained from scratch when varying $lr$ and $\lambda$ for AdamW (middle) and Lion (right). Lion is more robust to different hyperparameter choices.
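The step count quoted for the 32K batch size follows from the fixed 300-epoch budget. The sketch below reproduces that arithmetic under two assumptions that are not stated in this README: the ImageNet-1K training set has 1,281,167 images, and "32K" means 32,768. The helper name `total_steps` is hypothetical.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # assumption: standard ImageNet-1K training set size
EPOCHS = 300  # total training epochs stated in the text


def total_steps(batch_size: int, epochs: int = EPOCHS,
                dataset_size: int = IMAGENET_TRAIN_IMAGES) -> int:
    """Number of full optimizer steps that fit into a fixed epoch budget."""
    return epochs * dataset_size // batch_size


# Batch sizes mentioned in the text; 32K is taken to be 32,768 (assumption).
for batch_size in (64, 256, 4096, 32768):
    print(f"batch size {batch_size:>6}: {total_steps(batch_size):>9,} steps")
```

With these assumptions the 32K setting comes out at a little under 12K steps, in line with the roughly 11K steps mentioned in the text, while the batch-size-256 run takes over a hundred times as many updates.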

Citation

If you find this work useful, please cite:

@misc{chen2023symbolic,
  title={Symbolic Discovery of Optimization Algorithms},
  author={Xiangning Chen and Chen Liang and Da Huang and Esteban Real and Kaiyuan Wang and Yao Liu and Hieu Pham and Xuanyi Dong and Thang Luong and Cho-Jui Hsieh and Yifeng Lu and Quoc V. Le},
  year={2023},
  eprint={2302.06675},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

