Hypothetically, could this be fixed by changing the input method? For instance, I just quickly looked up how humans process imagery.
"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."
So, potentially, if we added a pre-processing step to extract more features beforehand, we would see different results in the output.
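For the sake of illustration, here's a rough sketch of the kind of pre-processing I have in mind (assuming OpenCV is available; v1_like_features, the filter sizes, and the Gabor parameters are all placeholders, not a tuned recipe):

    import cv2
    import numpy as np

    def v1_like_features(path, n_orientations=4):
        # Stack edge and orientation maps onto the grayscale image,
        # loosely mimicking the low-level features V1 is said to extract.
        img = cv2.imread(path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0

        channels = [gray]

        # Gradient magnitude as a crude "edge" channel.
        gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
        gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
        channels.append(np.sqrt(gx * gx + gy * gy))

        # A small bank of Gabor filters for orientation selectivity.
        # Args: kernel size, sigma, theta, wavelength, aspect ratio, phase.
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations
            kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0)
            channels.append(cv2.filter2D(gray, cv2.CV_32F, kernel))

        # Result: (H, W, 2 + n_orientations); feed alongside (or instead of) RGB.
        return np.stack(channels, axis=-1)

The point wouldn't be these specific filters; it's whether handing the model explicit edge/orientation channels changes the output at all.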
You are in rarefied air: Walter Pitts believed this too, until the 1959 paper "What the Frog's Eye Tells the Frog's Brain" contributed to his decline.
Even in fly eyes, dendritic compartmentalization in neurons and variable spike trains are incompatible with our current perceptron-based models.
Remember that while the value of MLPs for useful work is unquestionable IMHO, be mindful of the map-territory relation. MLPs are inspired by, and in some cases useful for modeling, biological minds, but they aren't equivalent to them.
Be careful about confusing the map for the territory; that confusion is just as likely to limit what opportunities you find as it is to lead you astray, IMHO.
There are enough features fed into a VLM to solve the task.
The way to fix this is simpler: ensure counterfactuals are present in the training data; then the VLM will learn not to depend on its language priors/knowledge.
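As a toy illustration of what I mean (the record format, the segmentation mask, and the recoloring trick are all stand-ins; in practice you'd use real masks, edited images, or synthetic renders):

    import random
    import numpy as np

    def recolor_with_mask(image, mask, new_rgb):
        # Paint the masked object a prior-violating color (e.g. a blue banana),
        # so the correct answer can only be read off the pixels.
        out = image.copy()
        out[mask] = new_rgb
        return out

    def make_counterfactual(example, mask, new_rgb=(40, 90, 200), new_answer="blue"):
        # example: {"image": HxWx3 uint8 array, "question": str, "answer": str}
        return {
            "image": recolor_with_mask(example["image"], mask,
                                       np.array(new_rgb, dtype=np.uint8)),
            "question": example["question"],  # e.g. "What color is the banana?"
            "answer": new_answer,             # the answer the edited image supports
        }

    def mix_in_counterfactuals(examples, counterfactuals, ratio=0.5, seed=0):
        # Keep both the original and the prior-violating versions in the training
        # set so guessing from language priors alone stops being rewarded.
        rng = random.Random(seed)
        k = min(int(len(examples) * ratio), len(counterfactuals))
        mixed = list(examples) + rng.sample(list(counterfactuals), k)
        rng.shuffle(mixed)
        return mixed

Whether the edits come from masks, image editing, or synthetic rendering matters less than making sure the prior-consistent answer is wrong often enough that the prior stops paying off.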
"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."
So, potentially if we did a pre-processing step to get more features out beforehand we would see different results in the output.