
Even without tools, it also outperforms Gemini 2.5 Pro and o3: 25.4% compared to 21.6% and 21.0%, respectively. Although I wonder whether any of the exam leaked into the training set, or whether it was specifically trained to be good at benchmarks, Llama 4 style.

