
This paper explores a different aspect of the limitations of VLMs than the paper VLMs are Blind (https://vlmsareblind.github.io). On the VLMs are Blind tasks, o3 achieved 90% accuracy (https://openai.com/index/thinking-with-images), but on similarly easy tasks using the counterfactual images from VLMs are Biased, it reached only 18.5%.

This may indicate that while VLMs might possess the necessary capability, their strong biases can cause them to overlook important cues, and their overconfidence in their own knowledge can lead to incorrect answers.
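For anyone who wants to poke at this themselves, here is a minimal sketch (not the paper's evaluation harness) of probing a vision model with a counterfactual counting question via the OpenAI Python SDK. The model name, image URL, and prompt below are placeholders you would swap for a real counterfactual image and whatever vision-capable model you're testing:

    # Minimal sketch, assuming a valid OPENAI_API_KEY and a reachable image URL.
    from openai import OpenAI

    client = OpenAI()

    response = client.chat.completions.create(
        model="o3",  # placeholder: any vision-capable model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Count the legs on the animal in this image. "
                             "Answer with a number only."},
                    {"type": "image_url",
                     "image_url": {"url": "https://example.com/counterfactual_dog.png"}},  # placeholder URL
                ],
            }
        ],
    )

    print(response.choices[0].message.content)
    # A biased model will often answer from prior knowledge (e.g. "4")
    # even when the edited image shows a different count.

The interesting failure mode is exactly the one described above: the model answers from what it knows about the subject rather than from what is actually in the image.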





