Where are you getting 2 billion from? The original CLIP paper says:
> We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. [1]
OpenCLIP was trained on more images, but datasets like LAION-2B are kind of low-quality in terms of labeling; I find it plausible that a better dataset could outperform it. I'm pretty sure the stock images Adobe is drawing from have better labeling already.
I agree that this is likely to backfire on artists, but part of that is that I expect the outcome to be that large corporations will license private datasets and open research will starve.
The 400M images in the paper yield the ~40% zero-shot ImageNet accuracy shown in the chart they publish.
That level of performance is generally not good enough for text conditioning of DDIMs.
The published CLIP checkpoints, and the numbers they discuss later in the paper, show performance that is almost twice as good: 76.2%. That data point, notably, does not appear in the chart. So the published checkpoints, and the performance they report later in the paper, were clearly trained on way more data.
How much data? Let's take a guess. I took the data points from the chart they publish and fit y = a·log_b(c + d·x) + K to them:
a≈12.31
b≈0.18
c≈24.16
d≈0.81
K≈−10.47
Solving the fitted curve for 76% gives about 7.55B images. The fit has R² = 0.993; I don't have any good intuition for why it's so high, but it could very well be real, and there's no reason to be anchored on "7.55B is a lot higher than LAION-4B": they could simply concatenate a 3B-image social media dataset with LAION-4B and, boom, there's 7B.
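If you want to redo the exercise, here's roughly what that fit looks like in scipy. This is a minimal sketch, not exactly what I ran: you'd have to read the (dataset size, accuracy) points off the paper's chart yourself, and the function names and starting guess are just for illustration.

```python
# Sketch of the fit described above, assuming scipy and hand-digitized chart points.
import numpy as np
from scipy.optimize import curve_fit


def scaling_curve(x, a, b, c, d, K):
    # y = a * log_b(c + d*x) + K, written via the change-of-base identity
    return a * np.log(c + d * x) / np.log(b) + K


def fit_scaling_curve(x_pts, y_pts):
    # x_pts: dataset sizes (millions of image-text pairs), read off the chart
    # y_pts: zero-shot ImageNet accuracy in percent
    p0 = (10.0, 0.5, 10.0, 1.0, 0.0)  # arbitrary starting guess
    lower = [-np.inf, 1e-6, 1e-6, 1e-6, -np.inf]  # keep the log and its argument sane
    params, _ = curve_fit(scaling_curve, x_pts, y_pts, p0=p0, bounds=(lower, np.inf))
    return params


def images_for_accuracy(y_target, params):
    # Invert y = a*log_b(c + d*x) + K for x, i.e. "how many images for this accuracy?"
    a, b, c, d, K = params
    return (b ** ((y_target - K) / a) - c) / d
```

Inverting the fitted curve at the 76% target is where the ~7.55B figure above comes from.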
OpenCLIP did reproduce this work with 2B images and got 79.5%. But e.g. Flux and SD3 do not use OpenCLIP's checkpoints, so that one figure doesn't tell you how bad OpenCLIP's checkpoints are versus how good OpenAI's are. Whatever OpenAI actually trained on isn't straightforward to fit, but it's way more than 400M.
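If you'd rather sanity-check a specific checkpoint yourself than trust any of these headline numbers, a minimal zero-shot scoring sketch with open_clip looks something like this (the model name, pretrained tag, captions, and image path are just examples; swap in whichever checkpoint you care about):

```python
# Minimal zero-shot scoring sketch with open_clip; tags here are examples only.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("test.png")).unsqueeze(0)  # any local image
text = tokenizer(["a photo of a dog", "a photo of a cat", "a diagram"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # how confidently the checkpoint matches the image to each caption
```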
Another observation: there are plenty of Hugging Face spaces with crappy ResNet or crappy small-dataset, trained-from-scratch CLIP conditioning to try. Sometimes the output actually looks as crappy as Adobe's does, so there's a little bit of a chance that Adobe tried and failed to create its own CLIP checkpoint on the crappy amount of data it had.
[1] https://arxiv.org/abs/2103.00020