From a quick glance it seems to be about spatial reasoning problems. I think the...

From a quick glance it seems to be about spatial reasoning problems. I think there is good reasons for why it's tricky to become extremely good at these from being trained on text and static images. Future models being further multimodally trained with video and then physics simulators should deal with this much better I think.

There's a recent talk about this by Jim Fan from Nvidia https://youtu.be/_2NijXqBESI