
It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

That being said, the ARC-AGI test is mostly a visual test, and in my opinion it would be much easier to beat once these models are truly multimodal (not just a separate vision encoder appended after training).

I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.



> I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

Potentially not great.

If you look at the AIME accuracy graph on the OpenAI page [1], you will notice that the x-axis is logarithmic. That is a problem because (a) compute in general has never scaled that well and (b) semiconductor fabrication will inevitably get harder as we approach smaller feature sizes.

So it looks like unless there is some ground-breaking research in the pipeline the current transformer architecture will likely start to stall out.
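
To make the log-axis point concrete, here is a rough sketch. The numbers are made up for illustration and are not read off the OpenAI chart; the only assumption is that accuracy grows linearly in log10(compute), which is what a straight line on a log x-axis would mean.

```python
# Hypothetical fit (NOT taken from the OpenAI chart): accuracy rises by
# `slope` points for every 10x increase in compute.
def compute_needed(target_acc, base_acc=20.0, slope=20.0, base_compute=1.0):
    """Invert acc = base_acc + slope * log10(compute / base_compute)."""
    return base_compute * 10 ** ((target_acc - base_acc) / slope)

for acc in (20, 40, 60, 80):
    print(f"{acc}% accuracy -> {compute_needed(acc):,.0f}x compute")
```

Under those assumed constants, every additional 20 points of accuracy costs another 10x in compute, so the bill grows exponentially while the score grows linearly.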

[1] https://openai.com/index/learning-to-reason-with-llms/


It's not a problem, because where we are on the logarithmic curve is the only thing that matters. No one in their right mind ever expected anything linear, because that would imply that creating a perfect oracle is possible.

More compute hasn't been the driving factor behind the latest developments; distillation and synthetic data have. Since we've seen massive success with that, I really struggle to understand why people continue to doomsay the transformer. I hear these same arguments year after year and people never learn.


I'm very optimistic about it because native multimodal LLMs have hardly been explored.

Also, in general, I have yet to see these models plateau. Claude 3.5 Sonnet is a night-and-day difference compared to previous models.



