> What this interaction shows is how much knowledge you need to bring when you interact with an LLM. The “one big flaw” Claude produced in the middle would probably not have been spotted by someone less experienced with crypto code than this engineer obviously is. And likewise, many people would probably not have questioned the weird choice to move to PBKDF2 as a response
For me this is the key takeaway. You gain proper efficiency using LLMs when you are a competent reviewer, and for lack of a better word, leader. If you don't know the subject matter as well as the LLM, you better be doing something non-critical, or have the time to not trust it and verify everything.
LLMs make learning new material easier than ever. I use them a lot and I am learning new things at an insane pace in different domains.
The maximalists and skeptics both are confusing the debate by setting up this straw man that people will be delegating to LLMs blindly.
The idea that someone clueless about OAuth should develop an OAuth lib with LLM support without learning a lot about the topic is... Just wrong. Don't do that.
But if you're willing to learn, this is rocket fuel.
Learning from LLMs is akin to learning from Joe Rogan.
You are getting a stylised view of a topic from an entity that lacks the deep understanding needed to fully distill the information. But it gives you enough knowledge to feel confident, which is still valuable but also dangerous.
And I assure you that many, many people are delegating to LLMs blindly; e.g. it's a huge problem in the UK legal system right now because of all the invented case-law references.
On the flip side, I wanted to see what common 8 layer PCB stackups were yesterday. ChatGPT wasn't giving me an answer that really made sense. After googling a bit, I realized almost all of the top results were AI generated, and also had very little in the way of real experience or advice.
It's going to be like the pre-internet dark ages, but worse. Back then, you simply didn't find the information. Now, you find unlimited information, but it is all wrong.
I don't know, this sounds a lot like in the late 90s when we heard a lot about how anyone could put information on the internet and that you shouldn't trust what you read online.
Well it turns out you can manage just fine.
You shouldn't blindly trust anything. Not what you read, not what people say.
Using LLMs effectively is a skill too, and that does involve deciding when and how to verify information.
The difference is in scale. Back then, only humans were sometimes putting up false information, and other humans had a chance to correct it. Now, machines are writing infinitely more garbage than humans can ever read. Search engines like Google are already effectively unusable.
You missed the full context: back then the claim was that you would never be able to trust a bunch of amateur randos self-policing their content. It turns out it's not perfect, but it's better than a very small set of professionals; usually there's enough expertise out there, it's just widely distributed. The challenge this time is 1. the scale, 2. the rate of growth, 3. the decline in expertise.
>> Using LLMs effectively is a skill too, and that does involve deciding when and how to verify information.
How do you verify when ALL the sources share the same AI-generated root, and ALL of the independent (i.e. human) experts have aged out and no longer exist?
>> LLMs make learning new material easier than ever.
Feels like there's a logical flaw here, when the issue is that LLMs are presenting the wrong information or missing it altogether. The person trying to learn from them will experience Donald Rumsfeld's "unknown unknowns".
I would not be surprised if we experience an even more dramatic "COBOL moment" a generation from now, but unlike that one, thankfully I won't be around to experience it.
> LLMs make learning new material easier than ever. I use them a lot and I am learning new things at an insane pace in different domains.
With learning, aren’t you exposed to the same risks? Such that if there was a typical blind spot for the LLM, it would show up in the learning assistance and in the development assistance, thus canceling out (i.e unknown unknowns)?
If you trust everything the LLM tells you, and you learn from code, then yes the same exact risks apply. But this is not how you use (or should use) LLMs when you’re learning a topic. Instead you should use high quality sources, then ask the LLM to summarize them for you to start with (NotebookLM does this very well for instance, but so can others). Then you ask it to build you a study plan, with quizzes and exercises covering what you’ve learnt. Then you ask it to setup a spaced repetition worksheet that covers the topic thoroughly. At the end of this you will know the topic as well as if you’d taken a semester-long course.
One big technique it sounds like the authors of the OAuth library missed is that LLMs are very good at generating tests. A good development process for today’s coding agents is to 1) prompt with or create a PRD, 2) break this down into relatively simple tasks, 3) build a plan for how to tackle each task, with listed-out conditions that should be tested, 4) write the tests, so that things are broken, TDD style, and finally 5) write the implementation. The LLM can do all of this, but you can’t one-shot it these days, you have to be a human in the loop at every step, correcting when things go off track. It’s faster, but it’s not a 10x speed-up like you might imagine if you think the LLM is just asynchronously taking a PRD some PM wrote and building it all. We still have jobs for a reason.
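To make the test-first step concrete, here is a minimal sketch of what I mean. Nothing here is from the actual repo; the function name, import path, and test framework (vitest) are just illustrative:

```typescript
// Minimal sketch of the test-first step. Names are hypothetical:
// validateRedirectUri doesn't exist yet; the point is that this fails first.
import { describe, it, expect } from "vitest";
import { validateRedirectUri } from "../src/clients"; // to be implemented next

describe("redirect_uri validation", () => {
  it("accepts an exact match against a registered URI", () => {
    expect(validateRedirectUri("https://app.example/cb", ["https://app.example/cb"])).toBe(true);
  });

  it("rejects a URI that merely shares a prefix with a registered one", () => {
    // Prefix matching is a classic open-redirect footgun.
    expect(validateRedirectUri("https://app.example/cb/../evil", ["https://app.example/cb"])).toBe(false);
  });
});
```

Once these are red, the agent gets the task of making them green, and you review both sides.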
> Instead you should use high quality sources, then ask the LLM to summarize them for you to start with (NotebookLM does this very well for instance, but so can others).
How do you determine if the LLM accurately reflects what the high-quality source contains, if you haven't read the source? When learning from humans, we put trust on them to teach us based on a web-of-trust. How do you determine the level of trust with an LLM?
When I'm exploring a topic, I make sure to ask for links to references, and will do a quick keyword search in there or ask for an excerpt to confirm key facts.
This does mean that there's a reliance on me being able to determine what are key facts and when I should be asking for a source though. I have not experienced any significant drawbacks when compared to a classic research workflow though, so in my view it's a net speed boost.
However, this does mean that a huge variety of things remain out of reach for me to accomplish, even with LLM "assistance". So there's a decent chance even the speed boost is only perceptual. If nothing else, it does take a significant amount of drudgery out of it all though.
> With learning, aren’t you exposed to the same risks? Such that if there was a typical blind spot for the LLM, it would show up in the learning assistance and in the development assistance, thus canceling out (i.e unknown unknowns)?
I don't think that's how things work. In learning tasks, LLMs are sparring partners. You present them with scenarios, and they output a response. Sometimes they hallucinate completely, but they can also update their context to reflect new information. Their output matches what you input.
how do you gain anything useful from a sycophantic tutor that agrees with everything you say, having been trained to behave as if the sun shines out of your rear end?
making mistakes is how we learn, and if they are never pointed out...
It's a bit of a skill. Gaining an incorrect understanding of some topic is a risk any way you learn, and I don't feel it's greater with LLMs than with many of the alternatives.
Sure, having access to legit experts who can tutor you privately on a range of topics would be better, but that's not realistic.
What I find is that if I need to explore some new domain within a field I'm broadly familiar with, just thinking through what the LLM is saying is sufficient for verification, since I can look for internal consistency and check against things I know already.
When exploring a new topic, often times my questions are superficial enough for me to be confident that the answers are very common in the training data.
When exploring a new topic that's also somewhat niche or goes into a lot of detail, I use the LLM first to get a broad overview and then drill down by asking for specific sources and using the LLM as an assistant to consume authoritative material.
And you are saying human teachers or online materials won't lie to you once or twice for every 20 facts, no matter how small? Did you do any comparison?
> LLMs will tell you 1 or 2 lies for each 20 facts. Its a hard way to learn.
That was my experience growing up in school too, except you got punished one way or another for speaking up/trying to correct the teacher. If I speak up with the LLM, they either explain why what they said is true or correct themselves, zero emotions involved.
You are ignoring the fact that the types of mistakes or lies are of a different nature.
If you are in class and you incorrectly argue that there is a mistake in an explanation of derivatives or physics, but you are the one in error, your teacher will hopefully not say: "Oh, I am sorry, you are absolutely correct. Thank you for your advice."
Yeah, no, of course if I'm wrong I don't expect the teacher to agree with me; what kind of argument is that? I thought it was clear, but the base premise of my previous comment is that the teacher is incorrect and refuses corrections...
This, for me, has been the question since the beginning. I’m yet to see anyone talk/think about the issue head on too. And whenever I’ve asked someone about it, they’ve not had any substantial thoughts.
Engineers will still exist and people will vibe code all kinds of things into existence. Some will break in spectacular ways, some of those projects will die, some will hire a real engineer to fix things.
I cannot see us living in a world of ignorance where there are literally zero engineers and no one on the planet understands what's been generated. Weirdly we could end up in a place where engineering skills are niche and extremely lucrative.
To use your own example, though, many of these core skills are declining, mechanized, or viewed through a historical lens rather than applied. I don't know if this is net good or bad, but it is very different. Maybe humans will think as you say, but it feels like there will be significantly less diversity of thought. If you look at the front page of HN as a snapshot of "where's tech these days", it is very homogeneous compared to the past. Same goes for the general internet, and the AI content continues to grow. IMO published works are a precursor to future human discovery, forming the basis of knowledge, community and growth.
No, but it'll become a hobby or artistic pursuit, just like running, playing chess, or blacksmithing. But I personally think it's going to take longer than 30 years.
In a few years hopefully the AI reviewers will be far more reliable than even the best human experts. This is generally how competency progresses in AI...
For example, at one point a human + computer would have been the strongest combo in chess; now you'd be insane to allow a human to critique a chess bot because they're so unlikely to add value, and statistically a human in the loop would be far more likely to introduce error. Similar things can be said in fields like machine vision, etc.
Software is about to become much higher quality and be written at much, much lower cost.
My prediction is that for that to happen we’ll need to figure out a way to measure software quality in the way we can measure a chess game, so that we can use synthetic data to continue improving the models.
I don’t think we are anywhere close to doing that.
Not really... If you're an average company you're not concerned about producing perfect software, but optimising for some balance between cost and quality. At some point companies via capitalist forces will naturally realise that it's more productive to not have humans in the loop.
A good analogy might be how machines gradually replaced textile workers in the 19th century. Were the machines better? Was there a way to quantitatively measure the quality of their output? No. But at the end of the day, companies which embraced the technology were more productive than those which didn't, and the quality didn't decrease enough (if it did at all) that customers would no longer do business with them – so these companies won out.
The same will naturally happen in software over the next few years. You'd be a moron to hire a human expert for $200,000 to critique a cybersecurity-optimised model which costs maybe a hundredth of the cost of employing a human... And this would likely be true even if we assume the human will catch the odd thing the model wouldn't, because there's no such thing as perfect security – it's always a trade-off between cost and acceptable risk.
Bookmark this and come back in a few years. I made similar predictions when ChatGPT first came out that within a few years agents would be picking up tickets and raising PRs. Everyone said LLMs were just stochastic parrots and this would not happen, well now it has and increasingly companies are writing more and more code with AI. At my company it's a little over 50% at the mo, but this is increasing every month.
I’m puzzled when I hear people say ‘oh, I only use LLMs for things I don’t understand well. If I’m an expert, I’d rather do it myself.’
In addition to the ability to review output effectively, I find the more closely I’m able to describe what I want in the way another expert in that domain would, the better the LLM output. Which isn’t really that surprising for a statistical text generation engine.
I guess it depends. In some cases, you don't have to understand the black box code it gives you, just that it works within your requirements.
For example, I'm horrible at math, always been, so writing math-heavy code is difficult for me, I'll confess to not understanding math well enough. If I'm coding with an LLM and making it write math-heavy code, I write a bunch of unit tests to describe what I expect the function to return, write a short description and give it to the LLM. Once the function is written, run the tests and if it passes, great.
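To make that concrete, here is roughly the shape of the "spec by tests" I hand over. The function name, import path, and test framework are hypothetical, just for illustration:

```typescript
// Sketch of specifying a math-heavy function purely by its expected outputs.
// lineSegmentsIntersect is a made-up name; the LLM writes the implementation,
// I only own (and understand) the tests.
import { describe, it, expect } from "vitest";
import { lineSegmentsIntersect } from "./geometry"; // to be written by the LLM

describe("lineSegmentsIntersect", () => {
  it("detects a simple crossing", () => {
    // (0,0)-(2,2) and (0,2)-(2,0) cross at (1,1)
    expect(lineSegmentsIntersect([0, 0], [2, 2], [0, 2], [2, 0])).toBe(true);
  });

  it("returns false for parallel, non-touching segments", () => {
    expect(lineSegmentsIntersect([0, 0], [2, 0], [0, 1], [2, 1])).toBe(false);
  });

  it("returns false when segments are far apart", () => {
    expect(lineSegmentsIntersect([0, 0], [1, 0], [5, 5], [6, 5])).toBe(false);
  });
});
```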
I might not 100% understand what the function does internally, and it's not used for any life-preserving stuff either (typically end up having to deal with math for games), but I do understand what it outputs, and what I need to input, and in many cases that's good enough. Working in a company/with people smarter than you tends to make you end up in this situation anyways, LLMs or not.
Though if in the future I end up needing to change the math-heavy stuff in the function, I'm kind of locked into using LLMs for understanding and changing it, which obviously feels less good. But the alternative is not doing it at all, so another tradeoff I suppose.
I still wouldn't use this approach for essential/"important" stuff, but more like utility functions.
That's why we outsource most other things in our lives though; why would it be different with LLMs?
People don't learn how a car works before buying one, they just take it to a mechanic when it breaks. Most people don't know how to build a house, they have someone else build it and assume it was done well.
I fully expect people to similarly have LLMs do what the person doesn't know how and assume the machine knew what to do.
I've found LLMs are very quick to add defaults, fallbacks, and rescues, which all make it very easy for code to look like it is working when it is not or will not. I call this out in three different places in my CLAUDE.md trying to adjust for this, and still occasionally get it.
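A contrived sketch of the pattern I mean (not from any real codebase; loadConfig is a stand-in for whatever real config loading exists):

```typescript
// The failure mode: a catch-and-default makes the code look like it works in
// every demo while silently hiding real misconfiguration.
declare function loadConfig(): Promise<{ apiBaseUrl: string }>;

async function getApiBaseUrl(): Promise<string> {
  try {
    const config = await loadConfig();
    return config.apiBaseUrl;
  } catch {
    // The LLM-added rescue: the error vanishes and a plausible default takes
    // over, so nothing fails loudly until traffic hits the wrong endpoint.
    return "http://localhost:3000";
  }
}
```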
I've been using an llm to do much of a k8s deployment for me. It's quick to get something working but I've had to constantly remind it to use secrets instead of committing credentials in clear text. A dangerous way to fail. I wonder if in my case this is caused by the training data having lots of examples from online tutorials that omit security concerns to focus on the basics.
> my case this is caused by the training data having
I think it's caused by you not having a strong enough system prompt. Once you've built up a somewhat reusable system prompt for coding or for infra work, adding to it bit by bit while using a specific model (since different models respond differently to prompts), you end up getting better and better responses.
So if you notice it putting plaintext credentials in the code, add to the system prompt to not do that. With LLMs you really get what you ask for, and if you miss to specify anything, the LLM will do whatever the probabilities tells it to, but you can steer this by being more specific.
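For example, something along these lines in the system prompt (the exact wording is only illustrative):

```
Never put credentials, tokens, or API keys in plain text in code or manifests.
Use Kubernetes Secrets (or an external secret manager) and reference them via
secretKeyRef / envFrom. If a required secret does not exist yet, stop and ask.
```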
Imagine you're talking to a very literal and pedantic engineer who argues a lot on HN, and that you have to be very precise with your words, and you're like 80% of the way there :)
Yes, you are definitely right on that. I still find it a concerning failure mode. That said, maybe it's no worse than a junior copying from online examples without reading all the text around the code, which of course has been very common also.
> It's quick to get something working but I've had to constantly remind it to use secrets instead of committing credentials in clear text.
This is going to be a powerful feedback loop which you might call regression to the intellectual mean.
On any task, most training data is going to represent the middle (or beginning) of knowledge about a topic. Most k8s examples will skip best practices, most react apps will be from people just learning react, etc.
If you want the LLM to do best practices in every knowledge domain (assuming best practices can be consistently well defined), then you have to push it away from the mean of every knowledge domain simultaneously (or else work with specialized fine tuned models).
As you continue to add training data it will tend to regress toward the middle because that's where most people are on most topics.
You will always trust domain experts at some junction; you can't build a company otherwise. The question is: Can LLMs provide that domain expertise? I would argue, yes, clearly, given the development of the past 2 years, but obviously not on a straight line.
I just finished writing a Kafka consumer to migrate data with heavy AI help. This was basically a best-case scenario for AI. It’s throwaway greenfield code in a language I know pretty well (Go) but haven’t used daily in a decade.
For complicated reasons the whole database is coming through on 1 topic, so I’m doing some fairly complicated parallelization to squeeze out enough performance.
I’d say overall the AI was close to a 2x speed up. It mostly saved me time when I forgot the go syntax for something vs looking it up.
However, there were at least 4 subtle bugs (and many more unsubtle ones) that I think anyone who wasn’t very familiar with Kafka or multithreaded programming would have pushed to prod. As it is, they took me a while to uncover.
On larger longer lived codebases, I’ve seen something closer to a 10-20% improvement.
All of this is using the latest models.
Overall this is at best the kind of productivity boost we got from moving to memory managed languages. Definitely not something that is going to replace engineers with PMs vibe coding anytime soon (based on rate of change I’ve seen over the last 3 years).
My real worry is that this is going to make mid level technical tornadoes, who in my experience are the most damaging kind of programmer, 10x as productive because they won’t know how to spot or care about stopping subtle bugs.
I don’t see how senior and staff engineers are going to be able to keep up with the inevitable flood of reviews.
I also worry about the junior-to-senior pipeline in a world where it’s even easier to get something up that mostly works. We already have this problem today with copy-paste programmers, but we’ve just made copy-paste programming even easier.
I think the market will eventually sort this all out, but I worry that it could take decades.
> I’ve seen something closer to a 10-20% improvement.
That seems to match my experience in "important" work too; a real increase, but not one that changes the essence of software development. Brooks's "No Silver Bullet" strikes again...
Yeah, the AI-generated bugs are really insidious. I also pushed a couple subtle bugs in some multi-threaded code I had AI write, because I didn't think it through enough. Reviews and tests don't replace the level of scrutiny something gets when it's hand-written. For now, you have to be really careful with what you let AI write, and make sure any bugs will be low impact since there will probably be more than usual.
I agree with the last paragraph about doing this yourself. Humans have a tendency to take shortcuts while thinking. If you see something resembling what you expect for the end product, you will be much less critical of it. The look/aesthetics matter a lot in finding problems in a piece of code you are reading. You can verify this by injecting bugs into your code changes and seeing if reviewers can find them.
On the other hand, when you have to write something yourself you drop down to a slow, thinking state where you pay attention to details a lot more. This means that you will catch bugs you wouldn't otherwise think of. That's why people recommend writing toy versions of the tools you are using: writing it yourself teaches a lot better than just reading materials about it. This is related to how our cognition works.
I suggest they freeze a branch of it, then spawn some AIs to introduce and attempt to hide vulnerabilities, and another to spot and fix them. Every commit is a move, and try to model the human evolution of chess.
This is why I have multiple LLMS review and critique my specifications document, iteratively and repeatedly so, before I have my preferred LLM code it for me. I address all important points of feedback in the specifications document. This really fixes 80% of the critical expertise issues.
Moreover, after developing the code, I have multiple LLMs critique the code, file by file or even method by method.
Part of me thinks this "written by LLM" angle has been a way to get attention on the codebase and plenty of free reviews by domain-expert skeptics, among the other goals (pushing AI efficiency to investors, experimenting, etc.).
> Many of these same mistakes can be found in popular Stack Overflow answers, which is probably where Claude learnt them from too.
This is what keeps me up at night. Not that security holes will inevitably be introduced, or that the models will make mistakes, but that the knowledge and information we have as a society is basically going to get frozen in time to what was popular on the internet before LLMs.
An approach I don't see discussed here is having different agents using different models critique architecture and test coverage, and author tests to vet the other model's work, including reviewing commits. Certainly no replacement for a human in the loop, but it will catch a lot of goofy "you said to only check in when all the tests pass, so I disabled testing because I couldn't figure out how to fix the tests".
> At ForgeRock, we had hundreds of security bugs in our OAuth implementation, and that was despite having 100s of thousands of automated tests run on every commit, threat modelling, top-flight SAST/DAST, and extremely careful security review by experts.
Wow. Anecdotally it's my understanding that OAuth is ... tricky ... but wow.
Some would say it's a dumpster fire. I've never read the spec or implemented it.
This was before LLMs. It was a combination of unit and end-to-end tests and tests written to comprehensively test every combination of parameters (eg test this security property holds for every single JWT algorithm we support etc). Also bear in mind that the product did a lot more than just OAuth.
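Roughly the shape of those parameterised tests, for anyone curious. The helpers and the algorithm list below are illustrative stand-ins, not the actual product code:

```typescript
// Sketch of testing that a security property holds for every supported
// algorithm (signToken/verifyToken are hypothetical helpers).
import { describe, it, expect } from "vitest";
import { signToken, verifyToken } from "./jwt"; // hypothetical helpers

const ALGORITHMS = ["RS256", "RS384", "RS512", "ES256", "ES384", "ES512", "PS256"];

describe.each(ALGORITHMS)("JWT security properties with %s", (alg) => {
  it("rejects a token whose signature has been stripped", async () => {
    const token = await signToken({ sub: "alice" }, alg);
    const unsigned = token.split(".").slice(0, 2).join(".") + "."; // drop the signature
    await expect(verifyToken(unsigned)).rejects.toThrow();
  });

  it("rejects a token signed with a key the server does not trust", async () => {
    const forged = await signToken({ sub: "alice" }, alg, { key: "untrusted" });
    await expect(verifyToken(forged)).rejects.toThrow();
  });
});
```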
Really interesting breakdown. What jumped out to me wasn’t just the bugs (CORS wide open, incorrect Basic auth, weak token randomness), but how much the human devs seemed to lean on Claude’s output even when it was clearly off base. That “implicit grant for public clients” bit is wild; it’s deprecated in OAuth 2.1, and Claude just tossed it in like it was fine, and then it stuck.
"...A more serious bug is that the code that generates token IDs is not sound: it generates biased output. This is a classic bug when people naively try to generate random strings, and the LLM spat it out in the very first commit as far as I can see. I don’t think it’s exploitable: it reduces the entropy of the tokens, but not far enough to be brute-forceable. But it somewhat gives the lie to the idea that experienced security professionals reviewed every line of AI-generated code...."
In the Github repo Cloudflare says:
"...Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards..."
Admittedly I have done some cryptographic string generation based on different alphabet sizes and characteristics a few years ago, which is pretty specifically relevant, and I’m competent at cryptographic and security concerns for a layman, but I certainly hope security reviewers will be more skilled at these things than me.
I’m very confident I would have noticed this bias in a first pass of reviewing the code. The very first thing you do in a security review is look at where you use `crypto`, what its inputs are, and what you do with its outputs, very carefully. On seeing that %, I would have checked characters.length and found it to be 62, not a factor of 256; so you need to mess around with base conversion, or change the alphabet, or some other such trick.
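Concretely, the shape of the bug and the usual rejection-sampling fix look something like this (a paraphrase of the class of bug being described, not the library's actual code):

```typescript
// The classic bug: 256 % 62 === 8, so mapping raw bytes with % over-represents
// the first 8 characters of the alphabet.
const ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; // 62 chars

function biasedToken(length: number): string {
  const bytes = crypto.getRandomValues(new Uint8Array(length));
  let out = "";
  for (const b of bytes) out += ALPHABET[b % ALPHABET.length]; // biased: 62 does not divide 256
  return out;
}

// One standard fix: rejection sampling. Discard byte values >= 248 so the
// remaining range (0..247) divides evenly by 62.
function unbiasedToken(length: number): string {
  const limit = 256 - (256 % ALPHABET.length); // 248
  let out = "";
  while (out.length < length) {
    for (const b of crypto.getRandomValues(new Uint8Array(length))) {
      if (b < limit && out.length < length) out += ALPHABET[b % ALPHABET.length];
    }
  }
  return out;
}
```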
This bothers me and makes me lose confidence in the review performed.
Mostly a good writeup, but I think there's some serious shifting the goalposts of what "vibe coded" means in a disingenuous way towards the end:
'Yes, this does come across as a bit “vibe-coded”, despite what the README says, but so does a lot of code I see written by humans. LLM or not, we have to give a shit.'
If what most people do is "vibe coding" in general, the current definition of vibe coding is essentially meaningless. Instead, the author is making the distinction between "interim workable" and "stainless/battle tested" which is another dimension of code entirely. To describe that as vibe coding causes me to view the author's intent with suspicion.
I find "vibe coding" to be one of the, if not the, concepts in this business to lose its meaning the fastest. Similar to how everything all of a sudden was "cloud", now everything is "vibe coded", even though reading the original tweet really narrows it down thoroughly.
A very good piece that clearly illustrates one of the dangers with LLMs: responsibility for code quality is blindly offloaded onto the automatic system.
> There are some tests, and they are OK, but they are woefully inadequate for what I would expect of a critical auth service. Testing every MUST and MUST NOT in the spec is a bare minimum, not to mention as many abuse cases as you can think of, but none of that is here from what I can see: just basic functionality tests.
and
> There are some odd choices in the code, and things that lead me to believe that the people involved are not actually familiar with the OAuth specs at all. For example, this commit adds support for public clients, but does so by implementing the deprecated “implicit” grant (removed in OAuth 2.1).
As Madden concludes "LLM or not, we have to give a shit."
> A very good piece that clearly illustrates one of the dangers with LLMs: responsibility for code quality is blindly offloaded onto the automatic system
It does not illustrate that at all.
> Claude's output was thoroughly reviewed by Cloudflare engineers with careful attention paid to security and compliance with standards.
> To emphasize, *this is not "vibe coded"*. Every line was thoroughly reviewed and cross-referenced with relevant RFCs, by security experts with previous experience with those RFCs.
The humans who worked on it very, very clearly took responsibility for code quality. That they didn’t get it 100% right does not mean that they “blindly offloaded responsibility”.
Perhaps you can level that accusation at other people doing different things, but Cloudflare explicitly placed the responsibility for this on the humans.
Note that this has very little to do with AI assisted coding; the authors of the library explicitly approved/vetted the code. So this comes down to different coders having different thoughts about what constitutes good and bad code, with some flaunting of credentials to support POVs, and nothing about that is particularly new.
The whole point of this is that people will generally put as little effort into work as they think they can get away with, and LLMs will accelerate that force. This is the future of how code will be "vetted".
It's not important whose responsibility led to the mistakes; what's important is to understand we're creating a responsibility gap.