Date: 2021-06-22



Researchers at Google Brain have announced a deep-learning computer vision (CV) model containing 2 billion parameters. The model was trained on 3 billion images and achieved a top-1 accuracy of 90.45% on ImageNet, setting a new state-of-the-art record.

The team described the model and experiments in a paper published on arXiv. The model, called ViT-G/14, is based on Google’s recent work on Vision Transformers (ViT). ViT-G/14 outperformed previous state-of-the-art solutions on several benchmarks, including ImageNet, ImageNet-v2, and VTAB-1k. On the few-shot image recognition task, the accuracy improvement was more than 5 percentage points. The researchers also trained several smaller versions of the model to investigate scaling laws for the architecture, noting that performance follows a power-law function, similar to Transformer models used for natural language processing (NLP) tasks.

First described by Google researchers in 2017, the Transformer architecture has become the leading design for NLP deep-learning models, with OpenAI’s GPT-3 being one of the most famous. Last year, OpenAI published a paper describing scaling laws for these models. By training many similar models of different sizes and varying the amount of training data and compute, OpenAI determined a power-law function for estimating a model’s accuracy. In addition, OpenAI found that larger models not only perform better but are also more compute-efficient.

In contrast to NLP models, most state-of-the-art CV deep-learning models use a convolutional neural network (CNN) architecture. First described in 1989, the architecture became dominant after a CNN model won the ImageNet Challenge in 2012. With the recent success of Transformers in the NLP space, researchers have begun investigating their performance on vision tasks. For example, OpenAI recently developed an image generation system based on GPT-3. Google has been especially active in this area, using its proprietary JFT-300M dataset to train a 600M-parameter ViT model in late 2020.

The new ViT-G/14 model is pre-trained on JFT-3B, an updated version of the dataset containing nearly 3 billion images. The research team made several improvements to the ViT architecture that improve memory utilization and allow the model to fit on a single TPU v3 core. To assess the performance of ViT-G/14 and the smaller models, the team performed both few-shot and fine-tuning transfer learning on the pre-trained models. The team used the results to formulate scaling laws similar to those for NLP:

Scaling up compute, model size, and data together improves accuracy according to a power-law function. Accuracy can be bottlenecked in smaller models. Larger models benefit from larger datasets.
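To make the first of these rules concrete, the sketch below fits a plain power law, error = a * compute^(-b), by least squares in log-log space. It is only an illustration: the functional form, the function names `fit_power_law` and `predict_error`, and the data points are assumptions made for this example, and the numbers are synthetic rather than taken from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares in log-log space.

    Returns (a, b). Assumes every x and y is strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope and intercept of the best-fit line through (log x, log y).
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


def predict_error(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)


# Synthetic points generated from a known power law (NOT results from the paper).
compute = [1.0, 10.0, 100.0, 1000.0]
error = [0.5 * c ** (-0.3) for c in compute]

a, b = fit_power_law(compute, error)
print(f"a={a:.3f}, b={b:.3f}")
print(f"extrapolated error at compute=10000: {predict_error(a, b, 10000.0):.4f}")
```

On a log-log plot a power law appears as a straight line, which is why a simple linear fit recovers the exponent. Real scaling curves typically also flatten out at very large or very small scales, which this simple form does not capture.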

The ImageNet leaderboard currently lists ViT-G/14 in first place. The next eight highest-scoring models were also developed by Google researchers, and the tenth-ranked model was developed by Facebook. In a discussion on Twitter, users asked whether Google plans to release the ViT-G/14 code and model weights. Lucas Beyer, a member of the research team, replied:

Weights, definitely not; it is trained on internal data! Code, good question. We did not plan to, because it is so close to the original published ViT code, but adding the new parts could be a good idea.

Google has released the code and weights for last year’s 600M-parameter ViT model on GitHub.

