I did test this once: I gave the same task to a model right after release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about software engineers' jobs.
The second attempt was a nightmare: like a butcher posing as a junior developer, performing surgery on a horse.
Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.
But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
It's not non-empirical. He was careful to run the same experiment twice. The dependent variable is his judgment, sure, but why shouldn't we trust that if he's an experienced SWE?
Unless he was able to sample with temperature 0 (and get fully deterministic results both times), this can just be random chance. And experience as SWE doesn't imply experience with statistics and experiment design.
> But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.
Well, if we look at it that way, this is true of Anthropic's benchmarks as well.
Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”
So what I described is the exact definition of empirical.
I don't really find this a helpful line to pursue. By this line of inquiry, most things in software are psychological:
Whether something is a bug or feature.
Whether the right thing was built.
Whether the thing is behaving correctly in general.
Whether it's better at the very moment that the thing occasionally works for a whole range of stuff or that it works perfectly for a small subset.
Whether fast results are more important than absolutely correct results for a given context.
Yes, all the things above are also related to each other.
The most we have for LLMs is tallying up each user's experience with an LLM over a period of time, across a wide range of "compelling" use cases (the pairings of their prompts and results are empirical, though, right?).
This should be no surprise, as humans often can't agree on an end-all-be-all intelligence test for humans either.
No. I'm saying that if you take the same exact LLM on the same exact set of hardware and serve it to the same exact humans, a sizeable amount of them will still complain about "model nerfs".
The same prompt producing totally different results is not user evaluation. Nor psychological. As a developer, you cannot tell the customer you're working for: hey, the first time it did what you asked, the second time it ruined everything, but look, here is the benchmark from Anthropic, and according to it there is nothing wrong.
The only thing that matters and that can evaluate performance is the end result.
But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models at any time. Why don't they?
The models are non-deterministic. You can't just assume that because it did better before that it was on average better than before. And the variance is quite large.
This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
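The n-times protocol is easy to sketch. In the snippet below, `run_task` is a hypothetical stand-in for actually running an agent on a fixed task and checking the result; the 0.6 success probability is invented for illustration.

```python
import random

random.seed(0)

# Hypothetical stand-in for one agent run on a fixed task; a real
# harness would invoke the model and verify the output instead.
def run_task(success_prob=0.6):
    return random.random() < success_prob

def pass_rate(n):
    """Estimate the success rate by repeating the same task n times."""
    return sum(run_task() for _ in range(n)) / n

print("one run:", "pass" if run_task() else "fail")  # tells you almost nothing
print(f"50 runs: estimated success rate {pass_rate(50):.2f}")
```

A single run is one draw from a Bernoulli distribution; only the repeated-trial estimate says anything about the model's actual reliability on the task.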
I did not say that I only ran the prompt once per attempt. When I say the second time it failed, I mean that I spent hours restarting, clearing context, and giving hints, doing everything to help the model produce something that works.
I never claimed this was a scientific study. It was an observation repeated over time. That is empirical in the plain meaning of the word.
Criticizing it for “not being scientific” is irrelevant, I didn’t present it as science. Are people only allowed to share experiences here if they come wrapped in a peer-reviewed paper?
If you want to debate the substance of the observation, happy to. But don’t rewrite what I said into a claim I never made.
I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long-duration tasks; they're using the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs, where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.
> On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.
> For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
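The fitting step that quote describes can be sketched in a few lines. Everything below is invented for illustration: the task outcomes, the plain gradient-ascent fit, and the learning rate; it just shows the shape of the computation, i.e. fit p(success) = sigmoid(a + b·log2(human_minutes)) and solve for where the curve crosses 50%.

```python
import math

# (human_minutes, model_succeeded) -- made-up data for the example
tasks = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 0),
    (30, 1), (60, 0), (120, 0), (240, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit a, b by plain gradient ascent on the logistic log-likelihood.
a, b = 0.0, 0.0
for _ in range(20_000):
    ga = gb = 0.0
    for minutes, y in tasks:
        x = math.log2(minutes)
        p = sigmoid(a + b * x)
        ga += y - p
        gb += (y - p) * x
    a += 0.02 * ga
    b += 0.02 * gb

# 50% time horizon: sigmoid(a + b*log2(t)) = 0.5  =>  log2(t) = -a/b
horizon = 2 ** (-a / b)
print(f"50% time horizon: about {horizon:.0f} human-minutes")
```

Note the input to the curve is the human completion time, not the model's wall-clock time, which is exactly the point being discussed below.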
It makes perfect sense to use human times as a baseline. Because otherwise, the test would be biased towards models with slower inference.
If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal.
But it doesn't evaluate the area that I am most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, problem solving and ambiguity for a long duration without a human steering them back in the right direction before they hit a wall or start causing damage.
If it takes me 8 hours to create a pleasant-looking to-do app, and Gemini 3 can one-shot that in 5 minutes, that's certainly impressive, but it doesn't help me evaluate whether I could drop an agent into my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc., for 30 minutes to an hour without going off the rails.
It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing, whereas I had previously hoped it was measuring something much closer to my real-world usage of LLM agents.
Improvements in long-duration, multi-turn unattended development would save me a lot of babysitting and frustrating back and forth with Claude Code/Codex, which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.
I've been working on a hard problem recently and have been keeping my `/model` setting pegged to "high".
Why in the world, if I'm paying the loss leader price for "unlimited" usage of these models, would any of these companies literally respect my preference to have unfettered access to the most expensive inference?
Especially when one of the hallmark features of GPT-5 was a fancy router system that decides automatically when to use more/less inference resources, I'm very wary of those `/model` settings.
Because intentionally fucking over their customers would be an impossible secret to keep, and when it inevitably leaks would trigger severe backlash, if not investigations for fraud. The game theoretic model you’re positing only really makes sense if there’s only one iteration of the game, which isn’t the case.
That is unfortunately not true. It's pretty easy to mess with your customers when your whole product is as opaque as LLMs. I mean they don't even understand how they work internally.
1) x% of users have an exceptional first experience by chance. Nobody who has a meh first experience bothers to try a second time.
2) Of those, only x% also have an exceptional second experience by chance (x² of all users, with x as a fraction)
3) So a lot of people with a great first experience think the model started off great and got suddenly worse
Suppose it's 25% that have a really great first experience. 25% of those have a great second experience too, but the other 75% see a sudden decline in quality and decide that it must be intentional. After the third experience, this population gets bigger again.
So by pure chance and sampling bias, you end up convincing a bunch of people that the model used to be great but has gotten worse, and only a much smaller group of people who thought it was terrible but got better, because most of those gave up early.
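The arithmetic above is easy to check with a toy simulation (the 25% per-run chance of an exceptional result is, of course, an invented number):

```python
import random

random.seed(1)
p, users = 0.25, 100_000  # assumed per-run chance of an exceptional result

# Every run succeeds independently with probability p; users with a meh
# first run churn, so only first-run winners stick around to be disappointed.
great_first = sum(random.random() < p for _ in range(users))
great_second = sum(random.random() < p for _ in range(great_first))

print(f"great first run:            {great_first / users:.1%}")   # ~25%
print(f"great both runs:            {great_second / users:.1%}")  # ~6%
print(f"'the model got worse' camp: {(great_first - great_second) / users:.1%}")  # ~19%
```

Nothing about the model changed between the two runs, yet roughly a fifth of all users in this toy world watch quality "decline".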
This is not in their heads: they really did see declining success. But they experienced it without any changes to the model at all.
I think this is pretty easy to explain psychologically.
The first time you see a dog that can make pancakes, you’re really focused on the fact that a dog is making pancakes.
After a few weeks of having them for breakfast, you start to notice that the pancakes are actually kind of overcooked and don’t taste that good. Sure it’s impressive that a dog made them, but what use are sub-par pancakes? You’re naturally more focused on what it can’t do than what it can.
> I did test this once: I gave the same task to a model right after release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about software engineers' jobs. The second attempt was a nightmare: like a butcher posing as a junior developer, performing surgery on a horse.
Is this empirical evidence?
And this is not only my experience.
Calling this psychological is gaslighting.