Researchers at Google Brain announced a deep-learning computer vision (CV) model containing two billion parameters. The model was trained on three billion images and achieved 90.45% top-1 accuracy on ImageNet, setting a new state-of-the-art record.
The team described the model and experiments in a paper published on arXiv. The model, dubbed ViT-G/14, is based on Google’s recent work on Vision Transformers (ViT). ViT-G/14 outperformed previous state-of-the-art solutions on several benchmarks, including ImageNet, ImageNet-v2, and VTAB-1k. On the few-shot image recognition task, the accuracy improvement was more than five percentage points. The researchers also trained several smaller versions of the model to investigate a scaling law for the architecture, noting that performance follows a power-law function, similar to Transformer models used for natural language processing (NLP) tasks.
First described by Google researchers in 2017, the Transformer architecture has become the leading design for NLP deep-learning models, with OpenAI’s GPT-3 being one of the most famous. Last year, OpenAI published a paper describing scaling laws for these models. By training many similar models of different sizes and varying the amount of training data and computing power, OpenAI determined a power-law function for estimating a model’s accuracy. In addition, OpenAI found that not only do large models perform better, they are also more compute-efficient.
In contrast to NLP models, most state-of-the-art CV deep-learning models use a convolutional neural network (CNN) architecture. First described in 1989, the architecture gained dominance after a CNN model won the ImageNet challenge in 2012. With the recent success of Transformers in the NLP space, researchers have begun investigating their performance on vision tasks; for example, OpenAI recently developed an image-generation system based on GPT-3. Google in particular has been active in this area, using its proprietary JFT-300M dataset to train a 600M-parameter ViT model in late 2020.
The new ViT-G/14 model is pre-trained on an updated version of the dataset, JFT-3B, which contains nearly three billion images. The research team made several improvements to the ViT architecture, improving memory utilization to allow the model to fit into a single TPUv3 core. To evaluate the performance of ViT-G/14 and the smaller models, the team performed both few-shot and fine-tuning transfer learning on the pre-trained models. The team used the results to formulate scaling laws, similar to the NLP laws:
– Scaling up compute, model size, and data improves accuracy, according to a power-law function
– Accuracy can be bottlenecked in smaller models
– Large models benefit from larger datasets
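To make the power-law relationship concrete, the following Python sketch fits a curve of the form error ≈ a · compute^(−b) by ordinary least squares in log-log space. It is an illustration only, not the researchers' fitting procedure: the pure power-law form (with no offset or saturation term), the function names `fit_power_law` and `predict_error`, and the synthetic data points are all assumptions made for this example and are not taken from the paper.

```python
import math


def fit_power_law(compute, error):
    """Fit error ~= a * compute**(-b) by least squares in log-log space.

    Taking logs gives log(error) = log(a) - b * log(compute), a straight
    line, so a simple linear regression recovers a and b.
    """
    if len(compute) != len(error) or len(compute) < 2:
        raise ValueError("need at least two (compute, error) pairs of equal length")

    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("compute values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def predict_error(a, b, compute):
    """Extrapolate the fitted curve to a new compute budget."""
    return a * compute ** (-b)


# Synthetic, made-up points generated from a = 0.5, b = 0.25 (not real results).
synthetic_compute = [1.0, 10.0, 100.0, 1000.0]
synthetic_error = [0.5 * c ** -0.25 for c in synthetic_compute]

a, b = fit_power_law(synthetic_compute, synthetic_error)
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating values of `a` and `b`. Real measurements would be noisy, and a curve fitted this way says nothing about the bottleneck effects described in the second and third bullets.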
The ImageNet leaderboard currently lists ViT-G/14’s score in first place. The next eight highest-scoring models were also developed by Google researchers, while the tenth-place model was developed by Facebook. In a discussion on Twitter, a user asked whether Google planned to release the code and model weights for ViT-G/14. Research team member Lucas Beyer replied,
The weights definitely not, it's trained on internal data! The code, good question. We did not plan to as it's really very close to the original ViT code that is public, but maybe adding the new pieces there would be a good idea.
Lucas Beyer, Google Brain