What people don't like about LLM PRs is typically:
a. The person proposing the PR usually lacks adequate context, which makes communication and feedback, both essential to review, difficult if not impossible. They often cannot even explain the reasoning behind the changes they are proposing.
b. The volume/scale is often unreasonable for human reviewers to contend with.
c. The PR may not be in response to an issue but just the realization of some "idea" the author or LLM had, making it even harder to contextualize.
d. The cost asymmetry is, generally speaking, highly unfavorable to the maintainers.
At the moment, LLM-driven PRs have these qualities so frequently that people use LLM bans as a shorthand. Writing out a lengthy policy redescribing the basic tenets of participation in software development is tedious and shouldn't be necessary, but here we are in 2025, when everyone has seemingly decided to abandon those principles in favor of lazily generating endless reams of pointless code just because they can.
But to determine its merit a maintainer must first donate their time and read through the PR.
LLMs reduce the effort to create a plausible PR down to virtually zero. Requiring a human to write the code is a good indicator that A. the PR has at least some technical merit and B. the human cares enough about the code to bother writing a PR in the first place.
It's absolutely possible to use an LLM to generate code, carefully review, iterate and test it and produce something that works and is maintainable.
The vast majority of LLM-generated code that gets submitted in PRs on public GitHub projects is not that; see the examples they gave.
Reviewing all of that code on its merits alone in order to dismiss it would take an inordinate amount of time and effort that would be much better spent improving the project. The alternative is a blanket LLM generated code ban, which is a lot less effort to enforce because it doesn't involve needing to read piles and piles of nonsense.
Usually I hate quoting "laws," but think about it. I agree it would be awesome if we could scrutinize 10k+ lines of code to bring in big changes, but it's not really feasible, is it?
Also, there have been increasing reports of open source maintainers dealing with LLM-generated PRs: https://news.ycombinator.com/item?id=46039274. GitHub seems perfectly positioned to help manage that issue, but in all likelihood will do nothing about it: '"Either you have to embrace the AI, or you get out of your career," Dohmke wrote, citing one of the developers who GitHub interviewed.'
I used to help maintain a popular open source library and I do not envy what open source maintainers are now up against.
> GitHub seems perfectly positioned to help manage that issue, but in all likelihood will do nothing about it
I genuinely don't understand this position. Is this not what GitHub issue bots were made for? No matter where your repo is hosted, the onus of moderating it is on you.
Downtime is an issue, which is why I jokingly mentioned it. Beyond that I'm without gripe. Make GitHub a high-nines service and I'll keep using it until the wheels fall off.
AFAICT, kimi k2 was the first to apply this technique [1]. I wonder if Anthropic came up with it independently or if they trained a model in 5 months after seeing kimi’s performance.
The real issue is that there is no reliable system currently in place for the end user (other than being willing to burn the cash and run your own benchmarks regularly) to detect changes in performance.
It feels to me like a perfect storm. A combination of high cost of inference, extreme competition, and the statistical nature of LLMs make it very tempting for a provider to tune their infrastructure in order to squeeze more volume from their hardware. I don't mean to imply bad faith actors: things are moving at breakneck speed and people are trying anything that sticks. But the problem persists, people are building on systems that are in constant flux (for better or for worse).
Again, I'm not claiming malicious intent. But model performance depends on a number of factors and the end-user just sees benchmarks for a specific configuration. For me to have a high degree of confidence in a provider I would need to see open and continuous benchmarking of the end-user API.
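To make that concrete, here's a minimal sketch of the kind of continuous end-user benchmarking I mean: run a fixed eval set against the provider on a schedule and flag drops against the running baseline. `call_model` is a hypothetical stand-in for your provider's API client, and the drift threshold is an arbitrary choice.

```python
import statistics

def run_eval(call_model, cases):
    # cases: list of (prompt, expected-substring) pairs.
    # call_model is a placeholder: prompt -> completion text.
    passed = sum(1 for prompt, expected in cases if expected in call_model(prompt))
    return passed / len(cases)

def drift_detected(history, latest, threshold=0.05):
    # Flag a regression when the latest pass rate drops more than
    # `threshold` below the mean of earlier runs.
    if not history:
        return False
    return statistics.mean(history) - latest > threshold
```

Even something this crude, run daily, would give you a public record of whether the hosted model's behavior shifted after release.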
All those are completely irrelevant. Quantization is just a cost optimization.
People are claiming that Anthropic et al. change the quality of the model after its initial release, which is entirely different and which the industry as a whole has denied. When a model is released under a certain version, the model doesn't change.
The only people who believe this are in the vibe coding community, believing that there’s some kind of big conspiracy, but any time you mention “but benchmarks show the performance stays consistent” you’re told you’re licking corporate ass.
I might be misunderstanding your point, but quantization can have a dramatic impact on the quality of the model's output.
For example, in diffusion, there are some models where a Q8 quant dramatically changes what you can achieve compared to fp16. (I'm thinking of the Wan video models.) The point I'm trying to make is that it's a noticeable model change, and can be make-or-break.
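As a toy illustration of why a quant is a real model change: symmetric int8 quantization snaps every weight to one of 255 levels, bounding each weight's round-trip error by half the scale step, and those errors compound across billions of weights. A minimal sketch in plain Python (not a real quantizer, just the arithmetic):

```python
def quantize_int8(xs):
    # Symmetric int8: map floats onto integer levels in [-127, 127].
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    # Reconstruct approximate floats; error per weight <= scale / 2.
    return [q * scale for q in qs]
```

Q8 keeps that step small; more aggressive quants widen it, which is exactly the kind of change that can be make-or-break for a model like Wan.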
Of course, no one is debating that. What's being debated is whether this is done after a model's initial release, e.g. whether Anthropic will secretly change the new Opus model a few weeks in to perform worse but be more cost-efficient.
That's not the point. Tweaking your system to improve resource utilization and performance is just a day in the life of ops, and with LLMs it can cause bugs you don't expect. It's a lot easier to monitor performance in a deterministic system; it's much harder to see the true impact a change has on an LLM.
Thanks for sharing. I hear people make extraordinary claims about LLMs (not saying that's what you're doing), but it's hard to evaluate exactly what they mean without seeing the results. I've been working on a similar project (a static analysis tool) and I've been using Sonnet 4.5 to help me build it. On cursory review it produces acceptable results, but closer inspection reveals obvious performance or architectural mistakes. In its current state, one-shotted LLM code feels like wood filler: very useful in many cases, but I would not trust it to be load-bearing.
I'd agree with that, yeah. If this was anything more important, I'd give it much more guidance, lay down the core architectural primitives myself, take over the reins more in general, etc - but for what this is, it's perfect.
Sure, we are closer to alchemy than materials science, but it's still early days. Consider this blog post that was on the front page today: https://www.levs.fyi/blog/2-years-of-ml-vs-1-month-of-prompt.... The table at the bottom shows a generally steady increase in performance just from iterating on prompts. It feels like we are on the path to true engineering.
Engineers usually have at least some sense as to why their efforts work though. Does anybody who iterates on prompts have even the fuzziest idea why they work? Or what the improvement might be? I do not.
If there is ANY relationship to engineering here, maybe it's like reverse engineering a BIOS in a clean room, where you poke away and see what happens. The missing part is the use of anything resembling the scientific method (hypotheses, experiment design, observations guiding actions) and the deep knowledge that would let you understand WHY something might be happening based on the inputs. "Prompt engineering" seems about as close to this as probing for land mines in a battlefield, only with no experience and your eyes closed.
> We tried multiple vectorization and classification approaches. Our data was heavily imbalanced and skewed towards negative cases. We found that TF-IDF with 1-gram features paired with XGBoost consistently emerged as the winner.
They also found improvements from augmenting the chunks with Haiku by having it add a summary based on extra context.
That seems to benefit both the keyword search and the embeddings by acting as keyword expansion. (Though it's unclear to me if they tried actual keyword expansion and how that would fare.)
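For anyone unfamiliar with the baseline they landed on, here is a minimal 1-gram TF-IDF vectorizer in plain Python. It's a sketch of what the features are, not their pipeline: they'd presumably use something like sklearn's TfidfVectorizer (whose smoothing differs slightly) and feed the vectors to an XGBoost classifier, both omitted here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Tokenize naively on whitespace (1-grams only).
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many docs does each term appear?
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vocab = sorted(df)
    # Rare terms get a higher inverse-document-frequency weight.
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append([tf[t] / len(toks) * idf[t] for t in vocab])
    return vocab, vecs
```

The appeal for imbalanced classification is that it's cheap, deterministic, and the features are inspectable, unlike embeddings.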
---
Anyway what stands out to me most here is what a Rube Goldberg machine it is. Embeddings, keywords, fusion, contextual augmentation, reranking... each adding marginal gains.
But then the whole thing somehow works really well together (~1% fail rate on most benchmarks. Worse for code retrieval.)
I have to wonder how this would look if it wasn't a bunch of existing solutions taped together, but actually a full integrated system.
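For what it's worth, the "fusion" step in these taped-together pipelines is often nothing fancier than reciprocal rank fusion: each retriever contributes 1/(k + rank) per document, and you sort by the sum. A sketch (k=60 is the conventional constant; the doc ids are illustrative):

```python
def rrf(rankings, k=60):
    # rankings: several ranked lists of doc ids, e.g. one from
    # keyword search and one from embedding search.
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Docs surfaced by multiple retrievers accumulate more score.
    return sorted(scores, key=scores.get, reverse=True)
```

That such a simple heuristic is competitive is part of why the whole assembly feels like a Rube Goldberg machine that nonetheless works.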
Thanks for sharing! I am working on a rag engine and that document provides great guidance.
And, agreed, each individual technique seems marginal, but they really add up. What seems to be missing is an automated layer that determines the best way to chunk documents into embeddings. My use case is mostly normalized, technical documents, so I have a pretty clear idea of how to chunk to preserve semantics, but I imagine it's a lot trickier for generalized documents.
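In case it's useful, a sketch of the heading-based chunking I mean for normalized technical documents. This is a hypothetical helper, not anything from the guide; a real pipeline would add overlap and a sentence-level fallback for oversized sections.

```python
def chunk_by_heading(text, max_chars=1200):
    # Split a markdown-ish document at headings so each section
    # stays semantically whole, then pack sections into chunks
    # of at most max_chars.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = sec
        else:
            buf = (buf + "\n" + sec) if buf else sec
    if buf:
        chunks.append(buf)
    return chunks
```

For generalized documents without reliable headings, this is exactly where the "automated layer" would have to get much smarter.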