
Yeah, so I can see how that might be confusing. Sometimes code is clearer. In the vanilla transformer you do a patch-then-embed operation, right? A quick way to do that is actually with non-overlapping convolutions: your strides are the same size as your kernel sizes. Look closely at Figure 2 (you can also see a visual representation in Figure 1, though I'll admit there is some artistic liberty there because we wanted to stress the combined patch-and-embed operation; those are real outputs though). But basically, yeah, change the stride so the windows overlap: those create the patches, then you embed. So we don't really call it a hybrid, in the same way you might call a 1x1 conv a channel-wise linear (which is the same as permute, then linear, then permute back).
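To make that last equivalence concrete, here's a quick NumPy sketch (toy shapes and names are mine, not from the repo): a 1x1 conv mixes channels at each spatial location, which is exactly permute → linear → permute back.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out, H, W = 3, 8, 4, 4
x = rng.standard_normal((C_in, H, W))
weight = rng.standard_normal((C_out, C_in))  # a 1x1 kernel with the spatial dims squeezed out

# 1x1 conv: at every (h, w) location, mix the channels with `weight`
conv_out = np.einsum('oc,chw->ohw', weight, x)

# permute to channels-last, apply the linear map, permute back
lin_out = (x.transpose(1, 2, 0) @ weight.T).transpose(2, 0, 1)

assert np.allclose(conv_out, lin_out)  # identical up to float error
```

Same weights, same output; the only difference is which axis the framework treats as the "feature" axis.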

ViT https://github.com/SHI-Labs/Compact-Transformers/blob/main/s...

CCT: https://github.com/SHI-Labs/Compact-Transformers/blob/main/s...

Edit: Actually, here's a third-party version doing the permute, then linear, then reshape operation:

https://github.com/lucidrains/vit-pytorch/blob/main/vit_pyto...

But the original implementation uses Conv: https://github.com/google-research/vision_transformer/blob/m...
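To see why the conv version and the reshape-then-linear version are interchangeable for patch embedding, here's a minimal NumPy sketch (my own toy shapes, not the repos' code): a conv whose stride equals its kernel size touches each non-overlapping patch exactly once, so it computes the same thing as cutting the image into patches and applying a linear projection.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, P, D = 3, 8, 8, 4, 16  # channels, height, width, patch size, embed dim
x = rng.standard_normal((C, H, W))
weight = rng.standard_normal((D, C, P, P))  # conv kernel with kernel_size == stride == P

# Non-overlapping conv: one output vector per P x P patch
conv_out = np.stack([
    np.stack([
        np.einsum('dcij,cij->d', weight, x[:, i*P:(i+1)*P, j*P:(j+1)*P])
        for j in range(W // P)
    ]) for i in range(H // P)
])  # shape (H/P, W/P, D)

# Explicit patches + linear, in the style of the permute/linear implementation
patches = x.reshape(C, H // P, P, W // P, P).transpose(1, 3, 0, 2, 4)
patches = patches.reshape(H // P, W // P, C * P * P)
lin_out = patches @ weight.reshape(D, -1).T

assert np.allclose(conv_out, lin_out)  # same embedding either way
```

Shrink the stride below the kernel size and the windows start to overlap, which is the CCT-style change described above; the reshape trick no longer applies then, which is one reason the conv formulation is convenient.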
