ConvNets Match Vision Transformers at Scale — A Summary 🚀

Aashi Dutt
4 min read · Oct 28, 2023


Summary:

  • Big Idea: While many believe Vision Transformers (ViTs) outperform Convolutional Neural Networks (ConvNets) on large datasets, researchers at DeepMind found that it’s not all about the architecture: the compute and data available for training are just as crucial!
  • Study in Spotlight: Researchers evaluated a high-performing ConvNet (specifically, the NFNet model family) pre-trained on the massive JFT-4B dataset.
  • Big Findings:
      • ConvNets can match the performance of ViTs when both are given comparable computational resources.
      • The best NFNet model reached a Top-1 accuracy of 90.4% after fine-tuning on ImageNet, comparable to ViTs pre-trained with a similar compute budget.

A Flashback to The Dawn of Deep Learning! 🚀

Long before Taylor Swift shook the world with her tunes, ConvNets were making their own kind of music in the deep learning universe! Credited with many of the foundational successes of deep learning, these networks saw a massive surge in popularity after AlexNet’s success in the 2012 ImageNet challenge.

However, with the evolution of technology, new stars emerged. Vision Transformers (ViTs) began to steal the limelight. But a question lingers:

Do these new contenders truly outperform our old champions, the ConvNets?

Decoding The Debate 🕵️

Researchers often argue in favor of ViTs, but there’s a catch. Most studies tend to compare modern ViTs to older ConvNet architectures (like the original ResNet), which might not be a fair comparison. It’s like comparing the latest iPhone model to one from 5 years ago. Obviously, the new one’s going to shine!

Experiments: A Balancing Act ⚖️

Using the gigantic JFT-4B dataset (think of it as the Netflix library, but for images!), researchers at DeepMind pre-trained a family of NFNet variants of different depths and widths, each for a range of epoch budgets. These pure ConvNets have previously set records on ImageNet benchmarks.

NFNet Model Variants in Red. Source: https://arxiv.org/pdf/2102.06171.pdf
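To make the word “variants” a little more concrete, here is a minimal sketch (not from the paper) that instantiates a few NFNets and compares their parameter counts. The dm_nfnet_f0 / dm_nfnet_f1 / dm_nfnet_f3 names are an assumption about the timm library’s model registry, standing in for the exact configurations DeepMind trained:

```python
# Minimal sketch: instantiate a few NFNet variants via timm and compare sizes.
# The model names below are assumed to exist in timm's registry; they are
# stand-ins for the configurations DeepMind actually trained on JFT-4B.
import timm

for name in ["dm_nfnet_f0", "dm_nfnet_f1", "dm_nfnet_f3"]:
    model = timm.create_model(name, pretrained=False, num_classes=1000)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```

The takeaway is simply that the family spans a wide range of model sizes, which is exactly what a compute-scaling study needs.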

💡 Fun Fact: The JFT-4B dataset houses about 4 billion labeled images. Imagine if each image were a pancake; the stack would reach outer space!


The results? As they increased the computational budget, the performance of ConvNets soared, echoing the performance trends we see with ViTs. Like having a more powerful engine in a race car, giving these models more computational juice allows them to race at breakneck speeds.
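Concretely, the paper reports a log-log scaling law between held-out loss and pre-training compute. The sketch below shows the shape of that analysis; the compute and loss values are invented for illustration and are not the paper’s measurements:

```python
# Illustrative only: fit a straight line to held-out loss vs. compute in
# log-log space. The numbers below are made up, not taken from the paper.
import numpy as np

compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])   # hypothetical pre-training FLOPs
val_loss = np.array([2.10, 1.95, 1.82, 1.70, 1.60])  # hypothetical held-out loss

slope, intercept = np.polyfit(np.log10(compute), np.log10(val_loss), deg=1)
print(f"loss ~ compute^{slope:.3f}")  # negative slope: loss falls as a power law of compute
```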

Surprising Scaling Laws 📈

The experiments unveiled a fascinating trend: as the compute budget grows, the optimal model size and the optimal number of training epochs (rounds) increase at roughly the same rate. Think of it as matching the size of the tires to the power of the car engine. The bigger the engine, the bigger the tires you need to keep the car stable and speedy.
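Here is a toy illustration of what “scale both together” means for budgeting. The exponent and base values are assumptions chosen only to make the arithmetic concrete; they are not the paper’s fitted numbers:

```python
# Toy illustration: if the optimal model size and the optimal epoch budget
# each grow as compute**0.5 (an assumed exponent, not the paper's fit),
# then 4x the compute means roughly 2x the parameters AND 2x the epochs.

def optimal_allocation(compute, base_params=70e6, base_epochs=2.0, exponent=0.5):
    """Split a (relative) compute budget under equal, assumed power-law exponents."""
    growth = compute ** exponent
    return base_params * growth, base_epochs * growth

for c in [1.0, 4.0, 16.0]:
    params, epochs = optimal_allocation(c)
    print(f"compute x{c:>4.0f}: ~{params / 1e6:.0f}M params, ~{epochs:.1f} epochs")
```

In other words, extra compute is best spent on a bigger model and longer training at the same time, not on just one of the two.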


Going Head-to-Head: ConvNets vs ViTs 🥊

Upon fine-tuning these pre-trained models on ImageNet, the ConvNets demonstrated stellar results that rivaled their ViT counterparts. It’s as if, in a race between a seasoned marathon runner (ConvNets) and a sprinter (ViTs), both crossed the finish line at nearly the same time when given the right training and conditions.
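For readers who want to see the mechanics of “fine-tune a pre-trained ConvNet on ImageNet,” here is a hedged sketch. It is not the paper’s recipe: the JFT-4B pre-trained checkpoints are not publicly available, so this substitutes timm’s ImageNet-pretrained dm_nfnet_f0 (an assumption about timm’s registry), a hypothetical local ImageFolder path, and a plain single-pass SGD loop:

```python
# Hedged sketch of ConvNet fine-tuning, not the paper's actual recipe.
import timm
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
# Stand-in checkpoint: the paper starts from JFT-4B pre-training instead.
model = timm.create_model("dm_nfnet_f0", pretrained=True, num_classes=1000).to(device)

preprocess = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# "data/imagenet/train" is a hypothetical local path to ImageNet-style folders.
train_set = datasets.ImageFolder("data/imagenet/train", transform=preprocess)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # a single pass; the paper fine-tunes for longer
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```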

The Bitter Lesson 🍋

These experiments lead us to a sobering realization. When it comes to model performance, it’s not always about the bells and whistles of the architecture. The core ingredients remain the sheer compute power and the amount of data at our disposal. It’s like baking a cake: no matter how fancy your oven is, if you don’t have the right ingredients in the right amounts, the cake’s not going to taste great.

Concluding Notes 🎶

While ViTs have shown fantastic successes and might have specific advantages in certain areas, these experiments suggest that we shouldn’t discount ConvNets just yet. Both have their strengths, and when given equal opportunities, they can produce marvelously similar results.


Until the next deep dive, keep experimenting and challenging the norms! 🚀

References

  1. Smith et al., “ConvNets Match Vision Transformers at Scale.” https://arxiv.org/pdf/2310.16764.pdf
  2. Brock et al., “High-Performance Large-Scale Image Recognition Without Normalization.” https://arxiv.org/pdf/2102.06171.pdf
