
I am trying to find support for your last argument.

https://paperswithcode.com/sota/common-sense-reasoning-on-ar...

  GPT-3       53.2
  GPT-3.5     85.2
  LLaMa-65B   56.0
Any idea of the performance of an instruction-fine-tuned version of LLaMa models? I can't seem to find non-aggregated performance figures on ARC.
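In case it helps anyone reproduce these numbers: ARC scores are usually computed by picking the answer choice with the highest log-likelihood under the model, not by free-form generation. A rough zero-shot sketch against the ai2_arc dataset on the Hugging Face hub (the checkpoint name is just a placeholder, the prompt format is illustrative, and the prompt/continuation token alignment is approximate):

  import torch
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "huggyllama/llama-7b"  # placeholder; swap in the checkpoint you want to test
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name, torch_dtype=torch.float16, device_map="auto"
  )

  ds = load_dataset("ai2_arc", "ARC-Challenge", split="test")

  def choice_logprob(question, choice):
      # Score the answer continuation given the question prompt.
      prompt = f"Question: {question}\nAnswer:"
      prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
      full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
      with torch.no_grad():
          logits = model(full_ids).logits
      # Sum log-probs of the choice tokens only (assumes the prompt tokens are a prefix
      # of the full sequence, which is close enough for a sketch).
      logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
      target = full_ids[0, 1:]
      start = prompt_ids.shape[1] - 1
      return logprobs[start:, :].gather(1, target[start:, None]).sum().item()

  correct = 0
  for ex in ds:
      scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
      pred = ex["choices"]["label"][scores.index(max(scores))]
      correct += pred == ex["answerKey"]
  print(correct / len(ds))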


I'm not sure these ARC scores are fully comparable to the ones above, but if they are: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

TigerResearch/tigerbot-70b-chat: ARC (76.79)

Still below GPT-3.5, and, like most of the top entries, it gives no clear account of how it was trained, which usually means it was fine-tuned on the benchmark itself.

-

ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literally grade-school multiple choice, written to the level of an 8th grader.

When you sit people down in front of an LLM and have them interact with it, ARC correlates by far the most closely with how "generally smart" the model feels.
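If you haven't looked at one, an ARC item really is just a short science question with four lettered choices. You can print one from the copy hosted on the Hugging Face hub:

  from datasets import load_dataset

  # ARC-Challenge is the harder split of the AI2 Reasoning Challenge.
  ex = load_dataset("ai2_arc", "ARC-Challenge", split="test")[0]
  print(ex["question"])
  for label, text in zip(ex["choices"]["label"], ex["choices"]["text"]):
      print(f"  {label}. {text}")
  print("Answer:", ex["answerKey"])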


Wow, the top 14 are all proprietary models (except for 2 that aren't commonly used in the open-source community anyway).

And surprisingly, Llama 33B performs _better_ than Llama 65B!


That table is comparing few-shot prompting for GPT-4 with zero-shot LLaMA.
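Which matters a lot for the headline numbers. Roughly, the difference in the prompts looks like this (the format is illustrative, not the exact harness either eval used):

  def zero_shot_prompt(question, choices):
      # The model sees only the question it has to answer.
      opts = "\n".join(f"{lab}. {txt}" for lab, txt in zip(choices["label"], choices["text"]))
      return f"Question: {question}\n{opts}\nAnswer:"

  def few_shot_prompt(question, choices, examples, k=25):
      # The model first sees k already-solved questions, then the real one.
      # examples: a list of solved items shaped like ai2_arc rows.
      # (The GPT-4 report used 25-shot prompting for its ARC number.)
      shots = [
          zero_shot_prompt(ex["question"], ex["choices"]) + " " + ex["answerKey"]
          for ex in examples[:k]
      ]
      return "\n\n".join(shots) + "\n\n" + zero_shot_prompt(question, choices)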



