What people don't like about LLM PRs is typically:
a. The person proposing the PR usually lacks adequate context, which makes communication and feedback, both essential to review, difficult if not impossible. They often cannot even explain the reasoning behind the changes they are proposing.
b. The volume/scale is often unreasonable for human reviewers to contend with.
c. The PR may not be in response to an issue but just the realization of some "idea" the author or LLM had, making it even harder to contextualize.
d. The cost asymmetry is, generally speaking, highly unfavorable to the maintainers.
At the moment, LLM-driven PRs have these qualities so frequently that people use LLM bans as a shorthand. Writing out a lengthy policy redescribing the basic tenets of participation in software development is tedious and shouldn't be necessary, but here we are in 2025, when everyone has seemingly decided to abandon those principles in favor of lazily generating endless reams of pointless code just because they can.
But to determine its merit a maintainer must first donate their time and read through the PR.
LLMs reduce the effort to create a plausible PR down to virtually zero. Requiring a human to write the code is a good indicator that A. the PR has at least some technical merit and B. the human cares enough about the code to bother writing a PR in the first place.
It's absolutely possible to use an LLM to generate code, carefully review, iterate and test it and produce something that works and is maintainable.
The vast majority of LLM-generated code that gets submitted in PRs on public GitHub projects is not that; see the examples they gave.
Reviewing all of that code on its merits alone in order to dismiss it would take an inordinate amount of time and effort that would be much better spent improving the project. The alternative is a blanket LLM generated code ban, which is a lot less effort to enforce because it doesn't involve needing to read piles and piles of nonsense.
Usually I hate quoting "laws," but think about it. I agree it would be awesome if we could scrutinize 10k+ lines of code to bring in big changes, but it's not really feasible, is it?
Also, there have been increasing reports of open source maintainers dealing with LLM-generated PRs: https://news.ycombinator.com/item?id=46039274. GitHub seems perfectly positioned to help manage that issue, but in all likelihood will do nothing about it: '"Either you have to embrace the AI, or you get out of your career," Dohmke wrote, citing one of the developers who GitHub interviewed.'
I used to help maintain a popular open source library and I do not envy what open source maintainers are now up against.
> GitHub seems perfectly positioned to help manage that issue, but in all likelihood will do nothing about it
I genuinely don't understand this position. Is this not what GitHub issue bots were made for? No matter where your repo is hosted, the onus of moderating it is on you.
Downtime is an issue, which is why I jokingly mentioned it. Beyond that I'm without gripe. Make GitHub a high-nines service and I'll keep using it until the wheels fall off.
AFAICT, kimi k2 was the first to apply this technique [1]. I wonder if Anthropic came up with it independently or if they trained a model in 5 months after seeing kimi’s performance.
The real issue is that there is no reliable system currently in place for the end user (other than being willing to burn the cash and run your own benchmarks regularly) to detect changes in performance.
It feels to me like a perfect storm. A combination of high cost of inference, extreme competition, and the statistical nature of LLMs make it very tempting for a provider to tune their infrastructure in order to squeeze more volume from their hardware. I don't mean to imply bad faith actors: things are moving at breakneck speed and people are trying anything that sticks. But the problem persists, people are building on systems that are in constant flux (for better or for worse).
Again, I'm not claiming malicious intent. But model performance depends on a number of factors and the end-user just sees benchmarks for a specific configuration. For me to have a high degree of confidence in a provider I would need to see open and continuous benchmarking of the end-user API.
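To make that concrete, here's a minimal sketch of the kind of continuous end-user benchmarking I mean: run a fixed eval set against the provider on a schedule and flag drops against the running baseline. `call_model` is a hypothetical stand-in for your provider's API client, and the drift threshold is an arbitrary choice.

```python
import statistics

def run_eval(call_model, cases):
    # cases: list of (prompt, expected-substring) pairs.
    # call_model is a placeholder: prompt -> completion text.
    passed = sum(1 for prompt, expected in cases if expected in call_model(prompt))
    return passed / len(cases)

def drift_detected(history, latest, threshold=0.05):
    # Flag a regression when the latest pass rate drops more than
    # `threshold` below the mean of earlier runs.
    if not history:
        return False
    return statistics.mean(history) - latest > threshold
```

Even something this crude, run daily, would give you a public record of whether the hosted model's behavior shifted after release.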
All those are completely irrelevant. Quantization is just a cost optimization.
People are claiming that Anthropic et al. change the quality of the model after its initial release, which is entirely different and which the industry as a whole has denied. When a model is released under a certain version, the model doesn't change.
The only people who believe this are in the vibe coding community, believing that there’s some kind of big conspiracy, but any time you mention “but benchmarks show the performance stays consistent” you’re told you’re licking corporate ass.
I might be misunderstanding your point, but quantization can have a dramatic impact on the quality of the model's output.
For example, in diffusion, there are some models where a Q8 quant dramatically changes what you can achieve compared to fp16. (I'm thinking of the Wan video models.) The point I'm trying to make is that it's a noticeable model change, and can be make-or-break.
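As a toy illustration of why a quant is a real model change: symmetric int8 quantization snaps every weight to one of 255 levels, bounding each weight's round-trip error by half the scale step, and those errors compound across billions of weights. A minimal sketch in plain Python (not a real quantizer, just the arithmetic):

```python
def quantize_int8(xs):
    # Symmetric int8: map floats onto integer levels in [-127, 127].
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    # Reconstruct approximate floats; error per weight <= scale / 2.
    return [q * scale for q in qs]
```

Q8 keeps that step small; more aggressive quants widen it, which is exactly the kind of change that can be make-or-break for a model like Wan.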
Of course, no one is debating that. What's being debated is whether this is done after a model's initial release, e.g. whether Anthropic will secretly change the new Opus model a few weeks in to perform worse but be more cost-efficient.
That's not the point. Tweaking your system to improve resource utilization and performance is just a day in the life of ops, and with LLMs it can cause bugs you don't expect. It's a lot easier to monitor performance in a deterministic system; it's much harder to see the true impact a change has on an LLM.
Thanks for sharing. I hear people make extraordinary claims about LLMs (not saying that's what you're doing), but it's hard to evaluate exactly what they mean without seeing the results. I've been working on a similar project (a static analysis tool) and I've been using Sonnet 4.5 to help me build it. On cursory review it produces acceptable results, but closer inspection reveals obvious performance or architectural mistakes. In its current state, one-shotted LLM code feels like wood filler: very useful in many cases, but I would not trust it to be load-bearing.
I'd agree with that, yeah. If this was anything more important, I'd give it much more guidance, lay down the core architectural primitives myself, take over the reins more in general, etc - but for what this is, it's perfect.
Sure, we are closer to alchemy than materials science, but it's still early days. Consider this blog post that was on the front page today: https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompt.... The table at the bottom shows a generally steady increase in performance just from iterating on prompts. It feels like we are on the path to true engineering.
Engineers usually have at least some sense as to why their efforts work though. Does anybody who iterates on prompts have even the fuzziest idea why they work? Or what the improvement might be? I do not.
If there is ANY relationship to engineering here, maybe it's like reverse engineering a BIOS in a clean room, where you poke away and see what happens. The missing part is the use of anything resembling the scientific method (hypotheses, experiment design, observations guiding actions) and the deep knowledge that would let you understand WHY something might be happening based on the inputs. "Prompt engineering" seems about as close to this as probing for land mines in a battlefield, only with no experience and your eyes closed.
> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.
They also found improvements from augmenting the chunks with Haiku by having it add a summary based on extra context.
That seems to benefit both the keyword search and the embeddings by acting as keyword expansion. (Though it's unclear to me if they tried actual keyword expansion and how that would fare.)
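For anyone unfamiliar with the baseline they landed on, here is a minimal 1-gram TF-IDF vectorizer in plain Python. It's a sketch of what the features are, not their pipeline: they'd presumably use something like sklearn's TfidfVectorizer (whose smoothing differs slightly) and feed the vectors to an XGBoost classifier, both omitted here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Tokenize naively on whitespace (1-grams only).
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vocab = sorted(df)
    # Rare terms get a higher inverse-document-frequency weight.
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vecs
```

The appeal for imbalanced classification is that it's cheap, deterministic, and the features are inspectable, unlike embeddings.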
---
Anyway what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.
But then the whole thing somehow works really well together (~1% fail rate on most benchmarks. Worse for code retrieval.)
I have to wonder how this would look if it wasn't a bunch of existing solutions taped together, but actually a full integrated system.
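For what it's worth, the "fusion" step in these taped-together pipelines is often nothing fancier than reciprocal rank fusion: each retriever contributes 1/(k + rank) per document, and you sort by the sum. A sketch (k=60 is the conventional constant; the doc ids are illustrative):

```python
def rrf(rankings, k=60):
    # rankings: several ranked lists of doc ids, e.g. one from
    # keyword search and one from embedding search.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Docs surfaced by multiple retrievers accumulate more score.
    return sorted(scores, key=scores.get, reverse=True)
```

That such a simple heuristic is competitive is part of why the whole assembly feels like a Rube Goldberg machine that nonetheless works.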
Thanks for sharing! I am working on a rag engine and that document provides great guidance.
And, agreed, each individual technique seems marginal, but they really add up. What seems to be missing is an automated layer that determines the best way to chunk documents into embeddings. My use case is mostly normalized, technical documents, so I have a pretty clear idea of how to chunk to preserve semantics, but I imagine it's a lot trickier for generalized documents.
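In case it's useful, a sketch of the heading-based chunking I mean for normalized technical documents. This is a hypothetical helper, not anything from the guide; a real pipeline would add overlap and a sentence-level fallback for oversized sections.

```python
def chunk_by_heading(text, max_chars=1200):
    # Split a markdown-ish document at headings so each section
    # stays semantically whole, then pack sections into chunks
    # of at most max_chars.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = (buf + "\n" + sec) if buf else sec
    if buf:
        chunks.append(buf)
    return chunks
```

For generalized documents without reliable headings, this is exactly where the "automated layer" would have to get much smarter.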