Hacker News

The best benchmark is the community vibe in the weeks following a release.

Claude benchmarks poorly but vibes well. Gemini benchmarks well and vibes well. Grok benchmarks well but vibes poorly.

(Yes, I know you are gushing with anecdotes; the vibes are simply the approximate shade of gray that emerges from countless black-and-white remarks.)



> The best benchmark is the community vibe in the weeks following a release.

True, just be careful which community you use as a vibe check. Most of the mainstream/big ones around AI and LLMs basically have influence campaigns run against them and are made up of giant hive-minds that all think alike, so you need to carefully assess whether anything you're reading is true. Votes tend to make it even worse.


I generally check LM Arena, as well as which models have had the most weekly tokens on OpenRouter.


The vibes are just a collection of anecdotes.


"qual"



