
This was my first thought when I saw the results:

https://news.ycombinator.com/item?id=42473470



Insightful comment. What's extremely frustrating is the amount of energy poured into this conversation around benchmarks. At least some people operate on a fundamental assumption of honesty and integrity in the benchmarking process. But when the dataset is compromised and generation N+1 shows miraculous performance gains, how can we see this as anything other than a ploy to pump up valuations? Some people have millions of dollars at stake here, and they don't care about the naysayers in the peanut gallery like us.


It's sadly inevitable that when billions in funding and industry hype are tied to performance on a handful of benchmarks, scores will somehow, magically, continue to go up.

Needless to say, it doesn't bring us any closer to AGI.

The only solution I see here is people crafting their own private benchmarks that the big players don't care about enough to train on. That, at least, gives you a clearer view of the field.
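
For what it's worth, a private eval doesn't have to be elaborate. Here's a minimal sketch in Python; the question-file format, the model-calling callable, and the exact-match grading are all my own assumptions, not anyone's published harness:

    import json
    import statistics

    def load_private_eval(path):
        # One JSON object per line: {"prompt": ..., "answer": ...}.
        # Keep this file off the public internet so it can't end up in a crawl.
        with open(path) as f:
            return [json.loads(line) for line in f]

    def grade(model_answer, reference):
        # Exact-match grading; swap in whatever fits your task.
        return 1.0 if model_answer.strip().lower() == reference.strip().lower() else 0.0

    def run_eval(ask_model, items):
        # ask_model is any callable prompt -> answer wrapping your model API.
        return statistics.mean(grade(ask_model(x["prompt"]), x["answer"]) for x in items)

The point isn't the harness, it's the secrecy: nothing here is published, so there's nothing to train on, and the score stays meaningful.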


Not sure why your comment was downvoted, but the downvotes themselves show the pressure on people who point out fundamental flaws. This is pushing us towards "AVI" rather than AGI: "Artificially Valued Intelligence". The objective function being optimized here is the market.

I'm being completely serious. You are correct, despite the downvotes: this cannot be pushing us towards AGI, because if the dataset has leaked, you can't claim the G, generalizability.

The point of the benchmark is to lead us to believe that this is a substantial breakthrough. But a reasonable person would be forced to conclude that the results are misleading due to optimization around the training data.
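
To make "optimization around the training data" concrete: contamination is commonly screened for with simple n-gram overlap between benchmark items and the training corpus. A rough sketch; the 8-gram window and whitespace tokenization are assumptions, not any lab's published method:

    def ngrams(text, n=8):
        # Whitespace tokens, case-folded so trivial edits don't hide a match.
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def is_contaminated(item, corpus_docs, n=8):
        # Flag a benchmark item if any n-gram appears verbatim in the corpus.
        grams = ngrams(item, n)
        return any(grams & ngrams(doc, n) for doc in corpus_docs)

A hit doesn't prove cheating, but comparing scores on flagged items versus clean ones tells you how much of a "breakthrough" survives decontamination.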




