Honestly this is not very surprising. Standardised testing is... well, standardised. You have huge model that learns the textual patterns in hundreds of thousands of test question/answer pairs. It would be surprising if it didn't perform as well as a human student with orders of magnitude less memory.
You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).
You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).