
Why do you care about hallucination for coding problems? You're in an agent loop; the compiler is ground truth. If the LLM hallucinates, the agent just iterates. You don't even see it unless you make the mistake of looking closely.


What on earth are you talking about??

If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong; the agent has no way to figure out that it's wrong, and it's not as if the LLM also produces tests that identify that hallucinated code as wrong. The only way this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong.

You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed?? I have no idea how you came to this belief but it certainly doesn't match my experience.


No, what's happening here is we're talking past each other.

An agent lints and compiles code. The LLM is stochastic and unreliable. The agent is ~200 lines of Python code that checks the exit code of the compiler and relays it back to the LLM. You can easily fool an LLM. You can't fool the compiler.
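To make the "~200 lines of Python" concrete, here's a stripped-down sketch of that loop. `call_llm` is a stand-in for whatever model API you're using, and the compile step is just Python's own byte-compiler; a real agent swaps in your project's compiler or linter:

    import subprocess

    def agent_loop(prompt, call_llm, max_iters=10):
        # Ask the LLM for code, compile it, and relay any errors back.
        feedback = ""
        for _ in range(max_iters):
            code = call_llm(prompt + feedback)          # LLM proposes code (stochastic)
            with open("candidate.py", "w") as f:
                f.write(code)
            result = subprocess.run(                    # the compiler is ground truth
                ["python", "-m", "py_compile", "candidate.py"],
                capture_output=True, text=True,
            )
            if result.returncode == 0:                  # exit code 0: it compiles
                return code
            feedback = "\n\nThe compiler said:\n" + result.stderr   # relay the error
        raise RuntimeError("no compilable code after max_iters attempts")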

I didn't say anything about whether code needs to be reviewed line-by-line by humans. I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible. But none of it includes hallucinated API calls.

Also, where did this "you seem to have a fundamental belief" stuff come from? You had like 35 words to go on.


> If the LLM hallucinates, then the code it produces is wrong. That wrong code isn't obviously or programmatically determinable as wrong; the agent has no way to figure out that it's wrong, and it's not as if the LLM also produces tests that identify that hallucinated code as wrong. The only way this wrong code can be identified as wrong is by the human user "looking closely" and figuring out that it is wrong

The LLM can easily hallucinate code that will satisfy the agent and the compiler but will still fail the actual intent of the user.

> I review LLM code line-by-line. Lots of code that compiles clean is nonetheless horrible.

Indeed, most code that LLMs generate compiles clean and is nevertheless horrible! I'm happy that you recognize this, but the fact that you review that LLM-generated code line-by-line makes you an extraordinary exception compared to the typical user, who generates LLM code and absolutely does not review it line-by-line.

> But none of [the LLM generated code] includes hallucinated API calls.

Hallucinated API calls are just one of many possible kinds of hallucinated code that an LLM can generate; "hallucinated code" by no means describes only "hallucinated API calls"!


When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

I think at this point our respective points have been made, and we can wrap it up here.


> When you say "the LLM can easily hallucinate code that will satisfy the compiler but still fail the actual intent of the user", all you are saying is that the code will have bugs. My code has bugs. So does yours. You don't get to use the fancy word "hallucination" for reasonable-looking, readable code that compiles and lints but has bugs.

There is an obvious and categorical difference between the "bugs" that an LLM produces as part of its generated code, and the "bugs" that I produce as part of the code that I write. You don't get to conflate these two classes of bugs as though they are equivalent, or even comparable. They aren't.


They obviously are.

Hallucination is a fancy word?

The parent seems to be referring, in part, to "reward hacking", which tends to be used as an umbrella term for what many refer to as slop, hallucination, cheating, and so on.

https://courses.physics.illinois.edu/ece448/sp2025/slides/le...


You seem to be using "hallucinate" to mean "makes mistakes".

That's not how I use it. I see hallucination as a very specific kind of mistake: one where the LLM outputs something that is entirely fabricated, like a class method that doesn't exist.

The agent compiler/linter loop can entirely eradicate those. That doesn't mean the LLM won't make plenty of other mistakes that don't qualify as hallucinations by the definition I use!
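To make that concrete, here's the kind of fabrication I mean (a hypothetical example -- the method name is invented, not part of any real API):

    from pathlib import Path

    def load_lines(path: str) -> list[str]:
        # Hallucinated API call: pathlib.Path has no read_lines() method;
        # the real methods are read_text() and read_bytes().
        return Path(path).read_lines()

A type checker flags the nonexistent attribute statically, and at runtime the call raises AttributeError, so an agent loop catches it and feeds the error straight back to the model. A subtle logic bug, by contrast, sails right through those same checks.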

It's newts and salamanders. Every newt is a salamander, not every salamander is a newt. Every hallucination is a mistake, not every mistake is a hallucination.

https://simonwillison.net/2025/Mar/2/hallucinations-in-code/


I'm not using "hallucinate" to mean "makes mistakes". I'm using it to mean "code that is syntactically correct and passes tests but is semantically incoherent", which is the same thing "hallucination" normally means in the context of a typical user's LLM chat session.

Linting isn't verification of correctness, and yes, you can fool the compiler, linters, etc. Work with some human interns; they are great at it. Agents will do crazy things to get around linting errors, including removing functionality.

Have you no tests?

Irrelevant, really. Tests establish a minimum threshold of acceptability, they don't (and can't) guarantee anything like overall correctness.

Just checking off the list of things you've determined to be irrelevant. Compiler? Nope. Linter? Nope. Test suite? Nope. How about TLA+ specifications?

TLA+ specs don't verify code. They verify design. That design can be expressed in whatever notation you like, including pseudocode (think of the algorithm notation in textbooks). Then you write the TLA+ specs that judge whether the invariants are truly respected. Once you're sure of the design, you can go and implement it, but there are no hard constraints the way a type system gives you.

At what level of formal methods verification does the argument against AI-generated code fall apart? My expectation is that the answer is "never".

The subtext is pretty obvious, I think: that standards, on message boards, are being set for LLM-generated code that are ludicrously higher than would be set for people-generated code.


My guy, didn't you spend like half your life in a field where your job was to sift through code that compiled but nonetheless had bugs that you tried to exploit? How can you possibly hold this belief about AI-generated code?

I don't understand this question. Yes, I spent about 20 years learning the lesson that code is profoundly knowable; to start with, you just read it. What challenge do you believe AI-generated code presents to me?

> You seem to have this fundamental belief that the code that's produced by your LLM is valid and doesn't need to be evaluated, line-by-line, by a human, before it can be committed??

This is a mistaken understanding. The person you responded to has already written about this, and used memorable words in response to exactly that proposal:

> Are you a vibe coding Youtuber? Can you not read code? If so: astute point. Otherwise: what the fuck is wrong with you?

It should be obvious that one would read and verify the code before they commit it. Especially if one works on a team.

https://fly.io/blog/youre-all-nuts/


We should go one step past this and come up with an industry practice where we get someone other than the author to read the code before we merge it.

I don’t understand your point. Are you saying that it sounds like that wouldn’t happen?

I’m being sarcastic. The person you are responding to is implying that reading code carefully before merging it is some daunting or new challenge. In fact it’s been standard practice in our industry for 2 or more people to do that as a matter of course.


