
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you seem to suggest. I've worked with smart engineers who put a lot of time into this problem only to get middling results.


...and I'm not confident at all that Claude can do anything at that level.
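
For context, here is a minimal sketch (in Python, with made-up paths and test names, not from the article) of the static change-to-test mapping teams typically start with. The hard part is everything a map like this silently misses: shared libraries, config, feature flags, cross-service effects.

    # Hypothetical sketch: statically map changed paths to the E2E tests that
    # exercise them. All paths and test names are invented for illustration.
    CHANGE_TO_TESTS = {
        "billing/": ["test_checkout_flow", "test_invoice_generation"],
        "auth/": ["test_login", "test_password_reset"],
        "search/": ["test_search_results"],
    }

    def select_tests(changed_files):
        """Pick every E2E test whose mapped prefix matches a changed file."""
        selected = set()
        for path in changed_files:
            for prefix, tests in CHANGE_TO_TESTS.items():
                if path.startswith(prefix):
                    selected.update(tests)
        return selected

    # "shared/utils.py" matches nothing, so any test it can break is skipped.
    print(select_tests(["billing/stripe_client.py", "shared/utils.py"]))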


How does that reconcile with the article, which states:

> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.

If you have some particular issue with the author's methodology, you should state that.


Well, since it never broke for some rando on the internet, surely that means it will always work for everyone.


If you have some particular issue with the article, you should state that. Otherwise, the most charitable interpretation of your position I can come up with is "the article is wrong for some reason I refuse to specify", which doesn't lead to a productive dialogue.


I think you're the one being uncharitable here. The meaning of what he's saying is very clear. You can't say this probabilistic method (using LLMs to decide your e2e test plan) works if you only have a single example of it working.


It's really not clear. Using probabilistic methods to determine your e2e test plan is already best practice at large tech shops, and to be quite honest, the heuristics they used to use were pretty poor and arbitrary.
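
For illustration, a hedged sketch of that probabilistic style of selection: rank tests by how often they historically failed when a given file changed, then run everything above a cutoff. The history table and threshold below are invented, not anyone's production system.

    from collections import defaultdict

    # (changed file, test) -> historical failure rate when that file changed.
    # The numbers and the 0.05 cutoff are made up for illustration.
    FAILURE_HISTORY = {
        ("billing/stripe_client.py", "test_checkout_flow"): 0.42,
        ("billing/stripe_client.py", "test_search_results"): 0.01,
        ("auth/session.py", "test_login"): 0.35,
    }

    def select_tests(changed_files, threshold=0.05):
        """Run every test whose failure rate for some changed file clears the cutoff."""
        scores = defaultdict(float)
        for path in changed_files:
            for (hist_file, test), rate in FAILURE_HISTORY.items():
                if hist_file == path:
                    scores[test] = max(scores[test], rate)
        return {test for test, score in scores.items() if score >= threshold}

    print(select_tests(["billing/stripe_client.py"]))  # {'test_checkout_flow'}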


The author said they used Claude to decide which E2E tests to run and "Claude never missed a relevant E2E test."

How many times did they conduct this experiment? Over how long a period? How did they determine which tests were relevant and that Claude didn't miss them? Did they try it on more than one project?

My point was that none of this tells me it will work in general.


If the author can keep the whole function code_change -> relevant E2E_TESTS in his head, it seems to be a trivial application.

We don't know the methodology, since the author does not state how he verified that function or how he would verify the function for a large code base.
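
To be concrete, one way a "never missed a relevant test" claim could be checked (the article doesn't say this is what the author did) is to run the full suite anyway and confirm that every test that actually failed was in the selected subset. A hypothetical sketch:

    # Hypothetical check: run the full suite anyway and diff its failures
    # against the LLM-selected subset. Names are stand-ins, not the author's setup.
    def missed_tests(selected, full_suite_results):
        """Tests that failed in the full run but were not in the selected subset."""
        failing = {name for name, passed in full_suite_results.items() if not passed}
        return failing - selected

    results = {"test_checkout_flow": False, "test_login": True, "test_search_results": True}
    print(missed_tests({"test_checkout_flow", "test_invoice_generation"}, results))  # set(): nothing missed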


Easy. The article asks us to believe.

There's a handy list to check against the article here: https://dmitriid.com/everything-around-llms-is-still-magical... starting at "For every description of how LLMs work or don't work we know only some, but not all of the following"


It seems to me like we have the answers to all those questions.

- Do we know which projects people work on?

It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.

- Do we know which codebases (greenfield, mature, proprietary, etc.) people work on?

The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.

- Do we know the level of expertise the people have?

It seems like they work on nontrivial production apps.

- How much additional work did they have reviewing, fixing, deploying, finishing etc.?

The article says very little.


> The article says very little.

And that's the crux, isn't it. Because that checklist really is just the tip of the iceberg.

Some people have completely opposite experiences: https://news.ycombinator.com/item?id=45152139

Others question the validity of the approach entirely: https://news.ycombinator.com/item?id=45152668

Oh, don't get me wrong: I like the idea. I would trust LLMs with this idea about as far as I could throw them.



