I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you suggest. I've worked with smart engineers who put a lot of time into this problem, only to get middling results.
How does that reconcile with the article, which states:
> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.
If you have some particular issue with the article or the author's methodology, you should state it. Otherwise, the most charitable interpretation of your position I can come up with is "the article is wrong for some reason I refuse to specify", which doesn't lead to a productive dialogue.
I think you're the one being uncharitable here. The meaning of what he's saying is very clear. You can't say this probabilistic method (using LLMs to decide your e2e test plan) works if you only have a single example of it working.
It's really not clear. Using probabilistic methods to determine your e2e test plan is already best practice at large tech shops, and frankly, the heuristics those shops relied on before were pretty poor and arbitrary.
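To make "probabilistic methods" concrete, here is a minimal sketch of one common flavor: predictive test selection, which estimates P(test fails | file changed) from historical CI runs and runs every test that clears a threshold. All names, the data shape, and the threshold are hypothetical simplifications, not any particular company's system.

```python
# Minimal sketch of probabilistic test selection: estimate, from past CI runs,
# how often each test failed when a given file changed, then run every test
# whose estimated failure probability clears a threshold.
from collections import defaultdict

def build_failure_stats(history):
    """history: iterable of (changed_files, failed_tests) pairs from past CI runs."""
    co_failures = defaultdict(lambda: defaultdict(int))  # file -> test -> co-failure count
    change_counts = defaultdict(int)                     # file -> number of changes seen
    for changed_files, failed_tests in history:
        for f in changed_files:
            change_counts[f] += 1
            for t in failed_tests:
                co_failures[f][t] += 1
    return co_failures, change_counts

def select_tests(changed_files, co_failures, change_counts, threshold=0.05):
    """Select every test whose estimated P(failure | file changed) >= threshold."""
    selected = set()
    for f in changed_files:
        n = change_counts.get(f, 0)
        for test, fails in co_failures[f].items():
            if n and fails / n >= threshold:
                selected.add(test)
    return selected
```

Note how arbitrary the threshold is; that arbitrariness is exactly the "pretty poor heuristics" complaint.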
The author said they used Claude to decide which E2E tests to run and "Claude never missed a relevant E2E test."
How many times did they run this experiment, and over what period? How did they determine which tests were relevant and confirm that Claude didn't miss any? Did they try it on more than one project?
My point was that none of this tells me it will work in general.
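For reference, the article doesn't show the author's prompt or tooling, so the following is only a generic sketch of the technique under debate: hand the model the diff plus the list of available E2E tests and ask it to pick the affected ones. It uses the Anthropic Python SDK; the model id and prompt wording are assumptions.

```python
# Generic sketch (not the author's method): ask an LLM which E2E tests a
# diff could affect, then keep only answers that match known test names.
import anthropic

def pick_e2e_tests(diff: str, test_names: list[str]) -> list[str]:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        "Given this code change, which of the following E2E tests could be "
        "affected? Reply with one test name per line, nothing else.\n\n"
        "Tests:\n" + "\n".join(test_names) + "\n\nDiff:\n" + diff
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model id
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    reply = response.content[0].text
    valid = set(test_names)
    return [line.strip() for line in reply.splitlines() if line.strip() in valid]
```

Nothing in a sketch like this guarantees recall; whether it "never misses a relevant test" is an empirical property you'd have to measure per codebase.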
If the author can keep the whole function code_change -> relevant E2E_TESTS in his head, it seems like a trivial application. We don't know the methodology: the author doesn't state how he verified that function, or how it would be verified for a large codebase.
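For what it's worth, verifying that function doesn't have to stay in anyone's head. Here is a hedged sketch of an evaluation harness, assuming you can hand-label the truly relevant tests for a sample of changes; the data shape and names are my assumptions, not the author's method.

```python
# Sketch: verify a code_change -> relevant-E2E-tests selector by comparing
# its picks against hand-labeled ground truth and tracking misses and
# over-selection across a sample of changes.
def evaluate_selection(cases):
    """cases: iterable of (selected_tests, relevant_tests) pairs of sets."""
    missed = extra = total_relevant = 0
    for selected, relevant in cases:
        missed += len(relevant - selected)   # relevant tests the selector skipped
        extra += len(selected - relevant)    # tests run unnecessarily ("better safe than sorry")
        total_relevant += len(relevant)
    recall = 1.0 if total_relevant == 0 else 1 - missed / total_relevant
    return {"recall": recall, "missed": missed, "over_selected": extra}
```

On this framing, "Claude never missed a relevant E2E test" just means recall == 1.0 over however many cases were sampled, which is exactly the sample-size question raised above.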
It seems to me like we have the answers to all those questions.
- Do we know which projects people work on?
It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.
- Do we know which codebases (greenfield, mature, proprietary, etc.) people work on?
The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.
- Do we know the level of expertise the people have?
It seems like they work on nontrivial production apps.
- How much additional work did they have reviewing, fixing, deploying, finishing, etc.?