
I am trying to find support for your last argument.

https://paperswithcode.com/sota/common-sense-reasoning-on-ar...

  GPT-3       53.2
  GPT-3.5     85.2
  LLaMa-65B   56.0
Any idea of the performance of an instruction-fine-tuned version of LLaMa models? I can't seem to find non-aggregated performance figures on ARC.
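In case it helps anyone reproduce these numbers: ARC scores are usually computed by picking the answer choice with the highest log-likelihood under the model, not by free-form generation. A rough zero-shot sketch against the ai2_arc dataset on the Hugging Face hub (the checkpoint name is just a placeholder, the prompt format is illustrative, and the prompt/continuation token alignment is approximate):

  import torch
  from datasets import load_dataset
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_name = "huggyllama/llama-7b"  # placeholder; swap in the checkpoint you want to test
  tok = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(
      model_name, torch_dtype=torch.float16, device_map="auto"
  )

  ds = load_dataset("ai2_arc", "ARC-Challenge", split="test")

  def choice_logprob(question, choice):
      # Score the answer continuation given the question prompt.
      prompt = f"Question: {question}\nAnswer:"
      prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
      full_ids = tok(prompt + " " + choice, return_tensors="pt").input_ids.to(model.device)
      with torch.no_grad():
          logits = model(full_ids).logits
      # Sum log-probs of the choice tokens only (assumes the prompt tokens are a prefix
      # of the full sequence, which is close enough for a sketch).
      logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
      target = full_ids[0, 1:]
      start = prompt_ids.shape[1] - 1
      return logprobs[start:, :].gather(1, target[start:, None]).sum().item()

  correct = 0
  for ex in ds:
      scores = [choice_logprob(ex["question"], c) for c in ex["choices"]["text"]]
      pred = ex["choices"]["label"][scores.index(max(scores))]
      correct += pred == ex["answerKey"]
  print(correct / len(ds))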


I'm not sure these ARC scores are fully comparable to the ones above, but if they are: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

TigerResearch/tigerbot-70b-chat: ARC (76.79)

Still below GPT-3.5, and, like most of the top entries, it gives no clear account of how it was trained, which usually means it was fine-tuned on the benchmark itself.

-

ARC is easily the most important benchmark right now for widespread adoption of LLMs by laypeople: it's literally grade-school multiple choice, written to the level of an 8th grader.

When you sit people down in front of an LLM and have them interact with it, ARC correlates by far the most closely with how "generally smart" the model feels.
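If you haven't looked at one, an ARC item really is just a short science question with four lettered choices. You can print one from the copy hosted on the Hugging Face hub:

  from datasets import load_dataset

  # ARC-Challenge is the harder split of the AI2 Reasoning Challenge.
  ex = load_dataset("ai2_arc", "ARC-Challenge", split="test")[0]
  print(ex["question"])
  for label, text in zip(ex["choices"]["label"], ex["choices"]["text"]):
      print(f"  {label}. {text}")
  print("Answer:", ex["answerKey"])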


Wow, the top 14 are all proprietary models (except for 2 that aren't commonly used in the open-source community anyway).

And surprisingly, Llama 33B performs _better_ than Llama 65B!


That table is comparing few-shot prompting for GPT-4 with zero-shot LLaMA.
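Which matters a lot for the headline numbers. Roughly, the difference in the prompts looks like this (the format is illustrative, not the exact harness either eval used):

  def zero_shot_prompt(question, choices):
      # The model sees only the question it has to answer.
      opts = "\n".join(f"{lab}. {txt}" for lab, txt in zip(choices["label"], choices["text"]))
      return f"Question: {question}\n{opts}\nAnswer:"

  def few_shot_prompt(question, choices, examples, k=25):
      # The model first sees k already-solved questions, then the real one.
      # examples: a list of solved items shaped like ai2_arc rows.
      # (The GPT-4 report used 25-shot prompting for its ARC number.)
      shots = [
          zero_shot_prompt(ex["question"], ex["choices"]) + " " + ex["answerKey"]
          for ex in examples[:k]
      ]
      return "\n\n".join(shots) + "\n\n" + zero_shot_prompt(question, choices)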



