
They could dynamically update the system prompt with ad content on a per-request basis. Lots of options.

Models have improved significantly over the last 3 months. Yet people have been saying 'What if they've actually reached their limits by now?' for going on 3 years.

This is just people talking past each other.

If you want a model that's getting better at helping you as a tool (which, for the record, I do), then you'd say things got better in the last 3 months, between Gemini's long-context performance, the return of Claude Opus, etc.

But if your goalpost is replacing SWEs entirely... then it's not hard to argue we definitely didn't overcome any new foundational issues in the last 3 months, and not too many were solved in the last 3 years either.

In the last year the only real foundational breakthrough has been RL-based reasoning with test-time compute delivering real results. But what that does to hallucinations, plus DeepSeek catching up with just a few months of post-training, shows that in its current form the technique doesn't completely knock down the barriers that were already standing, the way people originally touted it would.

Overall, models are getting better at things we can trivially post-train and synthesize examples for, but it doesn't feel like we're cracking unsolved problems at a substantially accelerated rate (yet).


For me, improvement means no hallucination, but that only seems to have gotten worse and I'm interested to find out whether it's actually solvable at all.

Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates. You don't even see it unless you make the mistake of looking closely.

What on earth are you talking about??

If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong; the agent has no way to figure out that it's wrong; it's not as if the LLM also produces tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong.

You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed?? I have no idea how you came to this belief but it certainly doesn't match my experience.


No, what's happening here is we're talking past each other.

An agent lints and compiles code. The LLM is stochastic and unreliable. The agent is ~200 lines of Python code that checks the exit code of the compiler and relays it back to the LLM. You can easily fool an LLM. You can't fool the compiler.
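A minimal sketch of that kind of loop, in Python; the llm client, the file name, and the build command here are placeholders rather than anyone's actual tooling:

    import subprocess

    def agent_loop(llm, task, max_iters=10):
        prompt = task
        for _ in range(max_iters):
            code = llm.generate(prompt)              # hypothetical LLM client
            with open("main.go", "w") as f:
                f.write(code)
            result = subprocess.run(["go", "build", "./..."],
                                    capture_output=True, text=True)
            if result.returncode == 0:               # the compiler is the ground truth
                return code
            # relay the compiler's complaints back to the model and try again
            prompt = f"{task}\n\nThe compiler reported:\n{result.stderr}\nFix the code."
        raise RuntimeError("no compiling version within the iteration budget")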

I didn't say anything about whether code needs to be reviewed line-by-line by humans. I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible. But none of it includes hallucinated API calls.

Also, where did this "you seem to have a fundamental belief" stuff come from? You had like 35 words to go on.


> If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong, the agent has no way to figure out that it's wrong, it's not as if the LLM produces at the same time tests that identify that hallucinated code as being wrong. The only way that this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong

The LLM can easily hallucinate code that will satisfy the agent and the compiler but will still fail the actual intent of the user.

> I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible.

Indeed most code that LLMs generate compiles clean and is nevertheless horrible! I'm happy that you recognize this truth, but the fact that you review that LLM-generated code line-by-line makes you an extraordinary exception vs. the normal user, who generates LLM code and absolutely does not review it line-by-line.

> But none of [the LLM generated code] includes hallucinated API calls.

Hallucinated API calls are just one of many, many possible kinds of hallucinated code that an LLM can generate; by no means does "hallucinated code" describe only "hallucinated API calls"!


When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

I think at this point our respective points have been made, and we can wrap it up here.


> When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

There is an obvious and categorical difference between the "bugs" that an LLM produces as part of its generated code, and the "bugs" that I produce as part of the code that I write. You don't get to conflate these two classes of bugs as though they are equivalent, or even comparable. They aren't.


They obviously are.

I get that you think this is the case, but it really very much isn't. Take that feedback/signal as you like.

Hallucination is a fancy word?

The parent seems to be, in part, referring to "reward hacking", which tends to be used as an umbrella term for what many refer to as slop, hallucination, cheating, and so on.

https://courses.physics.illinois.edu/ece448/sp2025/slides/le...


You seem to be using "hallucinate" to mean "makes mistakes".

That's not how I use it. I see hallucination as a very specific kind of mistake: one where the LLM outputs something that is entirely fabricated, like a class method that doesn't exist.

The agent compiler/linter loop can entirely eradicate those. That doesn't mean the LLM won't make plenty of other mistakes that don't qualify as hallucinations by the definition I use!
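For illustration (my example, not a quote): the kind of fabrication meant here is something like the commented-out line below, which a type checker or a single run flags immediately, and which the loop can therefore catch:

    from datetime import date, timedelta

    d = date.today()
    # Hallucinated API: 'date' has no 'add_days' method, so this line fails
    # the moment it is type-checked or executed.
    # next_week = d.add_days(7)
    next_week = d + timedelta(days=7)  # the real API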

It's newts and salamanders. Every newt is a salamander, not every salamander is a newt. Every hallucination is a mistake, not every mistake is a hallucination.

https://simonwillison.net/2025/Mar/2/hallucinations-in-code/


I'm not using "hallucinate" to mean "makes mistakes". I'm using it to mean "code that is syntactically correct and passes tests but is semantically incoherent". Which is the same thing that "hallucination" normally means in the context of a typical user LLM chat session.

Why would you merge code that was "semantically incoherent"? And how does the answer to that question, about "hallucinations" that matter in practice, allow you to then distinguish between "hallucinations" and "bugs"?

Linting isn't verification of correctness, and yes, you can fool the compiler, linters, etc. Work with some human interns; they are great at it. Agents will do crazy things to get around linting errors, including removing functionality.

have you no tests?

Irrelevant, really. Tests establish a minimum threshold of acceptability, they don't (and can't) guarantee anything like overall correctness.

Just checking off the list of things you've determined to be irrelevant. Compiler? Nope. Linter? Nope. Test suite? Nope. How about TLA+ specifications?

TLA+ specs don't verify code. They verify design. Such a design can be expressed in whatever you like, including pseudocode (think of the algorithm notation in textbooks). Then you write the TLA+ specs that judge whether the invariants are truly respected. Once you're sure of the design, you can go and implement it, but there are no hard constraints tying the implementation to the spec the way a type system would.

At what level of formal methods verification does the argument against AI-generated code fall apart? My expectation is that the answer is "never".

The subtext is pretty obvious, I think: that standards, on message boards, are being set for LLM-generated code that are ludicrously higher than would be set for people-generated code.


My guy, didn't you spend like half your life in a field where your job was to sift through code that compiled but nonetheless had bugs that you tried to exploit? How can you possibly have this belief about AI-generated code?

I don't understand this question. Yes, I spent about 20 years learning the lesson that code is profoundly knowable; to start with, you just read it. What challenge do you believe AI-generated code presents to me?

> You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed??

This is a mistaken understanding. The person you responded to has already written up their thoughts on this, and used some memorable words in response to exactly this proposal:

> Are you a vibe coding Youtuber? Can you not read code? If so: astute point. Otherwise: what the fuck is wrong with you?

It should be obvious that one would read and verify the code before they commit it. Especially if one works on a team.

https://fly.io/blog/youre-all-nuts/


We should go one step past this and come up with an industry practice where we get someone other than the author to read the code before we merge it.

I don’t understand your point. Are you saying that it sounds like that wouldn’t happen?

I’m being sarcastic. The person you are responding to is implying that reading code carefully before merging it is some daunting or new challenge. In fact it’s been standard practice in our industry for 2 or more people to do that as a matter of course.

All the benchmarks would disagree with you

The benchmarks also claim random 32B parameter models beat Claude 4 at coding, so we know just how much they matter.

It should be obvious to anyone with even a cursory interest in model training that you can't trust benchmarks unless they're fully private black boxes.

If you can get even a hint of the shape of the questions on a benchmark, it's trivial to synthesize massive amounts of data that help you beat the benchmark. And given the nature of funding right now, you're almost silly not to do it: it's not cheating, it's "demonstrably improving your performance at the downstream task"


Today’s public benchmarks are yesterday’s training data.


There are lots of reasons not to do it. But if LLMs get good enough that it works consistently, people will do it anyway.

What will people call it when coders rely on vibes even more than vibe coding?

Writing specs

Exactly my thought. This is just natural language as a specification language.

...as an ambiguous and inadequately-specified specification language.

In the end, every specification is specified via natural language, this is just where the buck stops. All math books are written in natural language, even the ones about specification languages.

Huh? Is ABNF a "natural language"? Is the Go language spec a "natural language"?

How is ABNF itself specified? Yes, via natural language. And the Go language spec is written in natural language, too, you can check for yourself: https://go.dev/ref/spec

ABNF itself is specified with a well-defined grammar and syntax...

Yes, but the spec is still done in natural language: https://www.rfc-editor.org/rfc/rfc5234

Just like any other RFC.


Haruspicy?

They used ArduPilot.

Storing fission waste products is a solved problem. You can either reprocess them, as is done in France, or you can store them forever. Neither approach is difficult or poorly understood. We can store an essentially unlimited amount of fission waste products in the ocean, underground, or in the mantle.


In theory, sure. In practice, complex technological and political issues remain, as is apparent from the fact that no country has fully solved the issue yet.

Your apparently stable salt mines start leaking. Locals don't like having toxic stuff buried below them. Other countries dislike that you dump nuclear waste in the middle of the Atlantic. Digging deep becomes too expensive.


Apple doesn't have different software platforms for low-cost vs. high-cost phones. Why is a car different? It doesn't even have as much functionality.


I think it's fair to say that the software in a modern car contains a lot more functionality than an average smartphone's. Drivers just aren't aware of how much is happening in their car each second.


As someone who knows nothing about car SW architecture, I find that surprising. I would have expected a number of control loops for things like fuel injection, ABS brake control, drive-by-wire, EV battery charge and discharge, etc., each running on its own processor due to real-time safety considerations. These I would expect to be different implementations and parameterizations of the same control-theory maths.

On top of this comes some functionality to control windshield wipers, lighting, AC, seat heating, etc. Stuff which is probably not top-tier safety critical, but still important. I would expect that stuff to run on one, maybe two processors.

Then comes the infotainment system, running on its own processor.

Sensors are supplying data to all processors through some kind of modernized CAN bus and some sort of publisher/subscriber protocol. Maybe some safety critical sensors have dedicated wiring to the relevant processor.

A lot of variations on this seem possible with the same SW platform, tuned and parameterized properly. The real-time safety-critical stuff would need care, but is doable.

Am I completely off the mark? Can you give some examples of where I am going wrong?


I am also not deeply into this stuff. But there is more going on in a car than what you list.

One probably surprising thing is that an LCD dashboard is usually driven by multiple rendering stacks. One is for the complex graphics and eye candy. The other one is responsible for brake and engine warning lights etc. and is considered safety critical. The second one is very basic and often partitioned off by a hypervisor.

A lot of these controllers are running more than just control loops. They are also actively monitoring their system for failures. The number of possible failure conditions and responses is quite large. I had instances where, e.g., the engine warning light came on because the ECU detected that the brake light switch was faulty. In another instance, power steering turned itself off during a drive because it had developed a fault. These kinds of behaviors are the result of dedicated algorithms that are watching just about every component of safety-critical systems that can possibly be monitored.

All of these software systems are provided by different vendors who develop the application software based on either their own stack or operating systems and middleware provided by other upstream suppliers. I don't think it's uncommon for a car to contain multiple copies of 3 or 4 different RTOS stacks. Nobody at the car manufacturers is enforcing uniformity in the software stacks that the suppliers deliver. The manufacturers tend to want finished, self-contained hardware units that they can plug in, configure, and turn on.


I mean there is just no way that that can be true.


Because a low-cost and a high-cost phone do essentially the same thing, whereas a high-trim car will do things like steering assistance in a way the low trim does not do at all.

And to support the differences, the high trim will have different sensors and differently distributed compute.

This means that the infotainment system will be running in different places on different cars.


This is really a question of slack. How long can it fail for? For a 100-year mission you likely want at least a month of slack in your air supply. Things are going to break. You build in redundancy.


It's useful because it allows you to build generic tools for an LLM to call. If you want your customer service chatbot to be able to reset people's modems, you could build an MCP tool for that. The LLM could recognize the tool and trigger it, and then a call would be sent through MCP to your server, which could call whatever API is needed to reset modems.
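A rough sketch of what the server side of that could look like, in Python; the tool schema follows the general shape MCP uses, but reset_modem and reboot_modem are made-up names standing in for your real backend:

    # Hypothetical tool definition the chatbot's LLM would be offered.
    RESET_MODEM_TOOL = {
        "name": "reset_modem",
        "description": "Reboot the modem for a given customer account.",
        "inputSchema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    }

    def reboot_modem(customer_id: str) -> None:
        # Placeholder for whatever internal API actually reboots the modem.
        print(f"rebooting modem for customer {customer_id}")

    def handle_tool_call(name: str, arguments: dict) -> str:
        # The MCP server routes the LLM's tool call to ordinary backend code.
        if name == "reset_modem":
            reboot_modem(arguments["customer_id"])
            return "Modem reset initiated."
        raise ValueError(f"unknown tool: {name}")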


For every program in production there are thousands of other programs that produce exactly the same output despite having a different hash.
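A trivial illustration in Python: two different source texts, two different hashes, one behavior:

    import hashlib

    prog_a = "print(sum(range(101)))"
    prog_b = "total = 0\nfor i in range(101):\n    total += i\nprint(total)"

    # Different source text, hence different hashes...
    print(hashlib.sha256(prog_a.encode()).hexdigest())
    print(hashlib.sha256(prog_b.encode()).hexdigest())

    # ...but both print 5050 when run.
    exec(prog_a)
    exec(prog_b)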


I wouldn't take that too literally, since that is the halting problem.

I suppose AI can provide a heuristic useful in some cases.


Your course sounds like it covers AI-assisted coding. What does it have to do with agentic AI?


It is exactly the opposite. Participants collect all the pieces of knowledge to build an agent like my Claudine at the end of the workshop:

https://github.com/xemantic/claudine/

