Still below 3.5, and, like most top entries, it discloses no clear training details, which usually means the model was fine-tuned on the benchmark itself.
-
ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literally grade-school multiple-choice science questions, pitched at the level of an 8th grader.
When you sit people down in front of an LLM and have them interact with it, ARC correlates by far the most closely with how "generally smart" the model feels to them.
https://paperswithcode.com/sota/common-sense-reasoning-on-ar...
Any idea how an instruction-fine-tuned version of the LLaMA models performs? I can't seem to find non-aggregated performance figures on ARC.
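If nobody has published per-split numbers, you can measure them yourself. Here's a minimal sketch assuming EleutherAI's lm-evaluation-harness (pip install lm-eval, v0.4+ API; flags differ between versions). The checkpoint name is a placeholder, not a real model:

    # Sketch: per-split ARC scores via lm-evaluation-harness (v0.4+ API assumed)
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",  # HuggingFace causal-LM backend
        model_args="pretrained=your-org/llama-7b-instruct",  # placeholder checkpoint
        tasks=["arc_easy", "arc_challenge"],  # the two ARC splits, unaggregated
        num_fewshot=25,  # a common few-shot setting for ARC leaderboards
        batch_size=8,
    )

    # Prints per-task metrics: acc and the length-normalized acc_norm
    for task, metrics in results["results"].items():
        print(task, metrics)

Note that acc_norm (length-normalized accuracy) is usually what leaderboards report for arc_challenge, so compare against that rather than plain acc.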