Deep Dive into the Vision Transformers Paper (oxen.ai)
40 points by gschoeni on Dec 1, 2023 | 8 comments


We have a reading club every Friday where we go over the fundamentals of a lot of the state-of-the-art techniques used in machine learning today. Last week we dove into the "Vision Transformers" paper from 2021, where the Google Brain team benchmarked training large-scale transformers against ResNets.

Though it's not groundbreaking research as of this week, I think with the pace of AI it's important to dive deep into past work and see what others have tried! It's nice to take a step back and learn the fundamentals as well as keep up with the latest and greatest.

Posted the notes and recap here if anyone finds it helpful:

https://blog.oxen.ai/arxiv-dives-vision-transformers-vit/

Also would love to have anyone join us live on Fridays! We've got a pretty consistent and fun group of 300+ engineers and researchers showing up.


Sounds good, how do you join the reading club?



I wonder if overlapping the patches would improve accuracy further, as a way to kind of anti-alias the data learned / inferred. In other words, if position 0 is (0,0)-(16,16) and position 1 is (16,0)-(32,16), we could instead use (12,0)-(28,16) for position 1 so it overlaps 4 pixels of the previous position. You'd have more patches / it would be more expensive compute-wise, but it might de-alias any artificial aliasing that the patches create during both training and inference.
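
Roughly what I mean, as a quick PyTorch sketch (the 16-pixel patches and 12-pixel stride are just illustrative numbers I picked, not anything from the paper):

    import torch
    import torch.nn as nn

    img = torch.randn(1, 3, 224, 224)   # dummy image batch

    # Standard ViT-style patching: stride == patch size, so patches don't overlap.
    non_overlap = nn.Unfold(kernel_size=16, stride=16)
    print(non_overlap(img).shape)        # (1, 3*16*16, 14*14) = (1, 768, 196)

    # Overlapping patching: stride < patch size, so neighbouring patches share
    # a 4-pixel band; more tokens and more compute, but smoother coverage.
    overlap = nn.Unfold(kernel_size=16, stride=12)
    print(overlap(img).shape)            # (1, 768, 18*18) = (1, 768, 324)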


Logged into my personal account for this one! I'm a lead author on a paper that explored exactly this. It does enable faster training and smaller model sizes. For reference, you can get 80% accuracy on CIFAR-10 in ~30 minutes on a CPU (without using crazy optimizations). There are open questions about scaling, but at the time we did not have access to big compute (we really still don't), and our goals were focused on addressing the original ViT's claims about data constraints and the necessity of pretraining for smaller datasets (spoiler: augmentation + overlapping patches play a huge role). Basically, we wanted to make a network that allowed people to train transformers from scratch for their data projects, because pretrained models aren't always the best or most practical solution.

Paper: https://arxiv.org/abs/2104.05704

Blog: https://medium.com/pytorch/training-compact-transformers-fro...

CPU compute: https://twitter.com/WaltonStevenj/status/1382045610283397120

Crazy optimizations (no affiliation): 94% on CIFAR-10 in <6.3 seconds on a single A100: https://github.com/tysam-code/hlb-CIFAR10

I also want to give some better information about ViTs in general. Lucas Beyer is a good source and has some lectures, as do Hila Chefer and Sayak Paul with their tutorials. Also, just follow Ross Wightman; the man is a beast.

Lucas Beyer: https://twitter.com/giffmana/status/1570152923233144832

Chefer & Paul's All Things ViT: https://all-things-vits.github.io/atv/

Ross Wightman : https://twitter.com/wightmanr

His very famous timm package: https://github.com/huggingface/pytorch-image-models


Thanks for all the good work and all the pointers! Awesome stuff. Let me know if you would want to join us live on a Friday and go over some of your newer work or any recent papers you find interesting. Feel free to reach out at [email protected] if so :)


I very cursorily skimmed your paper but I didn’t spot where it discusses overlapping the patches. Is it the section about using the hybrid model with a convolutional step which de facto accomplishes it (maybe?) instead of overlapping patches?


Yeah, I can see how that might be confusing. Sometimes code is clearer. In the vanilla transformer you do a patch and then an embed operation, right? A quick way to do that is actually with non-overlapping convolutions: your strides are the same size as your kernel sizes. Look closely at Figure 2 (you can also see a visual representation in Figure 1, but I'll admit there's some artistic liberty there because we wanted to stress the combined patch-and-embed operation; those are real outputs though). Basically, yeah: change the stride so the patches overlap. That creates the patches, then you embed. So we don't really call it a hybrid, in the same way you might call a 1x1 conv a channel-wise linear (which is the same as permute, linear, then permute).
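
If it helps, here's a rough PyTorch sketch of the two tokenizers; the kernel/stride/embedding sizes and variable names are just illustrative, not the exact values from the paper or the repos below:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 32, 32)        # e.g. a CIFAR-sized image
    embed_dim = 128

    # ViT-style patch + embed in one shot: conv stride == kernel size,
    # so the patches are non-overlapping.
    vit_tokenizer = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)
    vit_tokens = vit_tokenizer(x).flatten(2).transpose(1, 2)           # (1, 64, 128)

    # Overlapping tokenizer: stride < kernel size, so adjacent patches share pixels.
    overlap_tokenizer = nn.Conv2d(3, embed_dim, kernel_size=4, stride=2, padding=1)
    overlap_tokens = overlap_tokenizer(x).flatten(2).transpose(1, 2)   # (1, 256, 128)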

ViT: https://github.com/SHI-Labs/Compact-Transformers/blob/main/s...

CCT: https://github.com/SHI-Labs/Compact-Transformers/blob/main/s...

Edit: Actually, here's a third-party version doing the permute, then linear, then reshape operation:

https://github.com/lucidrains/vit-pytorch/blob/main/vit_pyto...

But the original implementation uses Conv: https://github.com/google-research/vision_transformer/blob/m...



