
It's an interesting review but I really dislike this type of techno-utopian determinism: "When models inevitably improve..." Says who? How is it inevitable? What if they've actually reached their limits by now?



Models are improving every day. People are figuring out thousands of different optimizations to training and to hardware efficiency. The idea that right now in early June 2025 is when improvement stops beggars belief. We might be approaching a limit, but that's going to be a sigmoid curve, not a sudden halt in advancement.


I think at this point we're getting more incremental updates, which can score higher on some benchmarks but simultaneously behave worse with real-world prompts, especially ones that were prompt-engineered for a specific model. I recall Google updating their Flash model on their API with no way to revert to the old one, and a lot of people complained that everything they'd built no longer worked because the model simply behaved differently than when they wrote all their prompts.


Isn't it quite possible they replaced that Flash model with a distilled version, saving money rather than increasing quality? This just speaks to the value of open-weights more than anything.


5 years ago a person would be blown away by today’s LLMs. But people today will merely say “cool” at whatever LLMs are in use 5 years from now. Or maybe not even that.


For most of the developers I know personally who have been radicalized by coding agents, it happened within the past 9 months. It does not feel like we are in a phase of predictable, boring improvement.


Radicalized? Going with the flow and wishes of the people who are driving AI is the opposite of that.

To have their minds changed drastically, sure.


Sorry I have no idea what you're trying to say here.


> very different from the usual or traditional

https://www.merriam-webster.com/dictionary/radical

Going from deciding that AI is going nowhere to suddenly deciding that coding agents are how they will work going forward is a radical change. That is what they meant.



Can you explain exactly what you meant by your second paragraph? The ambiguity is why you got that reply.

If your second paragraph makes that reply irrelevant, are you saying the meaning was "Your use of 'radicalized' is technically correct but I still think you shouldn't have used it here"?


5 years ago GPT-2 was already outputting largely coherent text; there's been progress, but it's not all that shocking.


Bold prediction…


It is copium to think that it will suddenly stop and the world they knew before will return.

ChatGPT came out in Nov 2022. "Attention Is All You Need" was 2017, so we were already 5 years behind, with 5 years of research to catch up on. And from 2022 to now, papers and research have been increasing exponentially. Even if SOTA models were frozen, we'd still have years of research to apply and optimize in various ways.


I think it's equally copium that people keep assuming we're just going to compound our way into intelligence that generalizes enough to stop us from handholding the AI, as much as I'd genuinely enjoy that future.

Lately I spend all day post-training models for my product, and I want to say 99% of the research specific to LLMs doesn't reproduce and/or matter once you actually dig in.

We're getting exponentially more papers on the topic, and they're getting worse on average.

Every day there's a new paper claiming an X% gain by post-training some ancient 8B parameter model and comparing it to a bunch of other ancient models after they've overfitted on the public dataset of a given benchmark and given the model a best of 5.

And benchmarks won't ever show it, but even GPT-3.5 Turbo has better general world knowledge than a lot of models people consider "frontier" models today, because post-training makes it easy to cover up those gaps with very impressive one-prompt outputs and strong benchmark scores.

-

It feels like things are getting stuck in a local maximum: we are making forward progress, and the models are useful and getting more useful, but the future people are envisioning requires reaching a completely different goalpost that I'm not at all convinced we're making exponential progress towards.

There may be an exponential number of techniques claiming to be groundbreaking, but what has actually unlocked new capabilities that can't just as easily be attributed to how much more focused post-training has become on coding and math?

Test-time compute feels like the only one, and we're already seeing cracks form in terms of its effect on hallucinations; there's also a clear ceiling on the performance the current iteration unlocks, as all these models are converging on pretty similar performance after just a few model releases.


The copium, I think, is that many people got comfortable post-financial-crisis with nothing much changing or happening. Many people really liked a decade-long stretch with not much more than web framework updates and smartphone versioning.

We are just back on track.

I just read Oracular Programming: A Modular Foundation for Building LLM-Enabled Software the other day.

We don't even have a new paradigm yet. I would be shocked if, in 10 years, I don't look back at this time of writing a prompt into a chatbot and then pasting the code into an IDE as completely comical.

The most shocking thing to me is we are right back on track to what I would have expected in 2000 for 2025. In 2019 those expectations seemed like science fiction delusions after nothing happening for so long.


Reading the Oracular paper now, https://news.ycombinator.com/edit?id=44211588

It feels a bit like Halide, where the goal and the strategy are separated so that each can be optimized independently.

Those new paradigms are being discovered by hordes of vibecoders, myself included. I am having wonderful results with TDD and AI assisted design.

IDEs are now mostly browsers for code, and I no longer copy and paste with a chatbot.

Curious what you think about the Oracular paper. One area that I have been working on for the last couple weeks is extracting ToT for the domain and then using the LLM to generate an ensemble of exploration strategies over that tree.
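
Roughly the shape I have in mind, as a sketch (the llm() helper and all names here are made up, not a real library):

    from dataclasses import dataclass, field

    def llm(prompt: str) -> str:
        # Stand-in for whatever model client you use.
        raise NotImplementedError

    @dataclass
    class ThoughtNode:
        claim: str  # a subgoal or design decision extracted for the domain
        children: list["ThoughtNode"] = field(default_factory=list)

    def render(node: ThoughtNode, depth: int = 0) -> str:
        lines = ["  " * depth + "- " + node.claim]
        for child in node.children:
            lines.append(render(child, depth + 1))
        return "\n".join(lines)

    def propose_strategies(root: ThoughtNode, n: int = 3) -> list[str]:
        # Ask the model for several distinct ways to explore the tree.
        outline = render(root)
        return [
            llm(
                f"Here is a tree of thoughts for the domain:\n{outline}\n"
                f"Propose exploration strategy #{i + 1}, distinct from the others, "
                "as a short ordered plan of which branches to expand first."
            )
            for i in range(n)
        ]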


It is "inevitable" in the sense that in 99% of the cases, tomorrow is just like yesterday.

LLMs have been continually improving for years now. The surprising thing would be them not improving further. And if you follow the research even remotely, you know they'll improve for a while, because not all of the breakthroughs have landed in commercial models yet.

It's not "techno-utopian determinism". It's a clearly visible trajectory.

Meanwhile, if they didn't improve, it wouldn't make a significant change to the overall observations. It's picking a minor nit.

The observation that strict prompt adherence plus prompt archival could shift how we program is both true and a phenomenon we've observed several times in the past. Nobody keeps the assembly output from the compiler around anymore, either.

There's definitely valid criticism of the passage: it's overly optimistic, in that most non-trivial prompts are still underspecified and have multiple possible implementations, not all of them correct. That's a more useful criticism, and it's not tied to LLM improvements at all.


Are there places that follow the research that speak to the layperson?


What's ironic is that if we buy into the theory that AI will write the majority of code in the next 5-10 years, what is it going to train on after that? ITSELF? It seems this theoretical trajectory of "will inevitably get better" is only true if humans are producing quality training data. The quality of the code LLMs create is very much proportional to how mature and ubiquitous the languages/projects are.


I think you neatly summarise why the current pre-trained LLM paradigm is a dead end. If these models were really capable of artificial reasoning and learning, they wouldn’t need more training data at all. If they could learn like a human junior does, and actually progress to being a senior, then I really could believe that we’ll all be out of a job—but they just do not.


More compute means faster processing and more context.


Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for pushing 3 years.


This is just people talking past each other.

If you want a model that's getting better at helping you as a tool (which for the record, I do), then you'd say in the last 3 months things got better between Gemini's long context performance, the return of Claude Opus, etc.

But if your goal post is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years even.

In the last year the only real foundational breakthrough would be RL-based reasoning w/ test-time compute delivering real results, but what that does to hallucinations, plus even DeepSeek catching up with just a few months of post-training, shows that in its current form the technique doesn't blow past barriers the way people were originally touting it.

Overall models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're breaking unsolved problems at a substantially accelerated rate (yet.)


For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.


Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates. You don't even see it unless you make the mistake of looking closely.


What on earth are you talking about??

If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong, the agent has no way to figure out that it's wrong, it's not as if the LLM produces at the same time tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong.

You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed?? I have no idea how you came to this belief but it certainly doesn't match my experience.


No, what's happening here is we're talking past each other.

An agent lints and compiles code. The LLM is stochastic and unreliable. The agent is ~200 lines of Python code that checks the exit code of the compiler and relays it back to the LLM. You can easily fool an LLM. You can't fool the compiler.
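
For the curious, a minimal sketch of the kind of loop I mean, with a hypothetical llm() helper standing in for whatever API client you use (everything here is illustrative, not the actual ~200 lines):

    import subprocess

    def llm(prompt: str) -> str:
        # Stand-in for a real model call; swap in your actual client here.
        raise NotImplementedError

    def agent_loop(task: str, path: str = "generated.py", max_iters: int = 5) -> bool:
        prompt = task
        for _ in range(max_iters):
            code = llm(prompt)
            with open(path, "w") as f:
                f.write(code)
            # The compiler is the ground truth: run it and check the exit code.
            result = subprocess.run(
                ["python", "-m", "py_compile", path],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return True  # compiles clean; hand off to tests and human review
            # Relay the error output back to the model and try again.
            prompt = (
                f"{task}\n\nYour last attempt failed to compile:\n"
                f"{result.stderr}\nFix it and return the full file."
            )
        return False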

I didn't say anything about whether code needs to be reviewed line-by-line by humans. I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible. But none of it includes hallucinated API calls.

Also, where did this "you seem to have a fundamental belief" stuff come from? You had like 35 words to go on.


> If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong, the agent has no way to figure out that it's wrong, it's not as if the LLM produces at the same time tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong

The LLM can easily hallucinate code that will satisfy the agent and the compiler but will still fail the actual intent of the user.

> I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible.

Indeed most code that LLMs generate compiles clean and is nevertheless horrible! I'm happy that you recognize this truth, but the fact that you review that LLM-generated code line-by-line makes you an extraordinary exception vs. the normal user, who generates LLM code and absolutely does not review it line-by-line.

> But none of [the LLM generated code] includes hallucinated API calls.

Hallucinated API calls are just one of many, many possible kinds of hallucinated code that an LLM can generate; by no means does "hallucinated code" describe only "hallucinated API calls"!


When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

I think at this point our respective points have been made, and we can wrap it up here.


> When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

There is an obvious and categorical difference between the "bugs" that an LLM produces as part of its generated code, and the "bugs" that I produce as part of the code that I write. You don't get to conflate these two classes of bugs as though they are equivalent, or even comparable. They aren't.


They obviously are.


I get that you think this is the case, but it really very much isn't. Take that feedback/signal as you like.


Hallucination is a fancy word?

The parent seems to be, in part, referring to "reward hacking", which tends to be used as a supercategory for what many refer to as slop, hallucination, cheating, and so on.

https://courses.physics.illinois.edu/ece448/sp2025/slides/le...


You seem to be using "hallucinate" to mean "makes mistakes".

That's not how I use it. I see hallucination as a very specific kind of mistake: one where the LLM outputs something that is entirely fabricated, like a class method that doesn't exist.

The agent compiler/linter loop can entirely eradicate those. That doesn't mean the LLM won't make plenty of other mistakes that don't qualify as hallucinations by the definition I use!
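
To make the distinction concrete, a toy illustration (hypothetical code, assuming a type checker like mypy or pyright is in the loop):

    def total_hallucinated(xs: list[int]) -> int:
        xs.push(99)  # hallucination: list has no .push(); the type checker flags it
                     # statically, and it raises AttributeError the moment it runs
        return sum(xs)

    def total_buggy(xs: list[int]) -> int:
        return sum(xs[1:])  # ordinary bug: silently drops the first element, yet it
                            # compiles, lints, and type-checks without complaint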

It's newts and salamanders. Every newt is a salamander, not every salamander is a newt. Every hallucination is a mistake, not every mistake is a hallucination.

https://simonwillison.net/2025/Mar/2/hallucinations-in-code/


I'm not using "hallucinate" to mean "makes mistakes". I'm using it to mean "code that is syntactically correct and passes tests but is semantically incoherent". Which is the same thing that "hallucination" normally means in the context of a typical user LLM chat session.


Why would you merge code that was "semantically incoherent"? And how does the answer to that question, about "hallucinations" that matter in practice, allow you to then distinguish between "hallucinations" and "bugs"?


Linting isn't verification of correctness, and yes, you can fool the compiler, linters, etc. Work with some human interns, they are great at it. Agents will do crazy things to get around linting errors, including removing functionality.


have you no tests?


Irrelevant, really. Tests establish a minimum threshold of acceptability, they don't (and can't) guarantee anything like overall correctness.


Just checking off the list of things you've determined to be irrelevant. Compiler? Nope. Linter? Nope. Test suite? Nope. How about TLA+ specifications?


TLA+ specs don’t verify code. They verify design. The design can be expressed in whatever notation you like, including pseudocode (think the algorithm notation in textbooks). Then you write the TLA+ specs that judge whether invariants are truly respected. Once you’re sure of the design, you can go and implement it, but there are no hard constraints like a type system.


At what level of formal methods verification does the argument against AI-generated code fall apart? My expectation is that the answer is "never".

The subtext is pretty obvious, I think: that standards, on message boards, are being set for LLM-generated code that are ludicrously higher than would be set for people-generated code.


My guy didn't you spend like half your life in the field where your job was to sift through code that compiled but nonetheless had bugs that you tried to exploit? How can you possibly have this belief about AI generated code?


I don't understand this question. Yes, I spent about 20 years learning the lesson that code is profoundly knowable; to start with, you just read it. What challenge do you believe AI-generated code presents to me?


My point is that code review or running things through a compiler is not sufficient to find bugs. At least when a person writes it you can ask them what they were thinking. An AI doesn't really let you do that.


> You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed??

This is a mistaken understanding. The person you responded to has written on these thoughts already and they used memorable words in response to this proposal:

> Are you a vibe coding Youtuber? Can you not read code? If so: astute point. Otherwise: what the fuck is wrong with you?

It should be obvious that one would read and verify the code before they commit it. Especially if one works on a team.

https://fly.io/blog/youre-all-nuts/


We should go one step past this and come up with an industry practice where we get someone other than the author to read the code before we merge it.


I don’t understand your point. Are you saying that it sounds like that wouldn’t happen?


I’m being sarcastic. The person you are responding to is implying that reading code carefully before merging it is some daunting or new challenge. In fact it’s been standard practice in our industry for 2 or more people to do that as a matter of course.


All the benchmarks would disagree with you


The benchmarks also claim random 32B parameter models beat Claude 4 at coding, so we know just how much they matter.

It should be obvious to anyone with even a cursory interest in model training that you can't trust benchmarks unless they're fully private black boxes.

If you can get even a hint of the shape of the questions on a benchmark, it's trivial to synthesize massive amounts of data that help you beat the benchmark. And given the nature of funding right now, you're almost silly not to do it: it's not cheating, it's "demonstrably improving your performance at the downstream task"
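
As a toy illustration of how low the bar is (hypothetical llm() helper and made-up names, just to show the shape of it):

    def llm(prompt: str) -> str:
        # Stand-in for a real model call.
        raise NotImplementedError

    def synthesize_lookalikes(sample_question: str, n: int = 1000) -> list[dict]:
        # Generate training pairs shaped like the benchmark's questions.
        pairs = []
        for _ in range(n):
            q = llm("Write a new problem in the same style and difficulty as:\n"
                    + sample_question)
            a = llm("Solve this problem step by step:\n" + q)
            pairs.append({"prompt": q, "response": a})  # feeds straight into post-training
        return pairs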


Today’s public benchmarks are yesterday’s training data.




