Ish. The catch is we spend a ton of effort on teaching these models to recognize...

Ish. The catch is we spend a ton of effort on teaching these models to recognize specific things in pictures. Then we ask it to not do that task, but instead count something on the picture. Which, we oddly don't spend a lot of time training the model to do.

It is a lot like the experiment where you ask people to say what color some text is. With the trick where some of the text is the name of another color. Can be surprisingly hard for people that are good at reading.