
I have worked at large, competent companies, and the problem of "which e2e tests to execute" is significantly more complicated than you seem to suggest. I've worked with smart engineers who put a lot of time into this problem only to get middling results.


...and I'm not confident at all that Claude can do anything at that level.
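
For context, here is a minimal sketch (in Python, with made-up paths and test names, not from the article) of the static change-to-test mapping teams typically start with. The hard part is everything a map like this silently misses: shared libraries, config, feature flags, cross-service effects.

    # Hypothetical sketch: statically map changed paths to the E2E tests that
    # exercise them. All paths and test names are invented for illustration.
    CHANGE_TO_TESTS = {
        "billing/": ["test_checkout_flow", "test_invoice_generation"],
        "auth/": ["test_login", "test_password_reset"],
        "search/": ["test_search_results"],
    }

    def select_tests(changed_files):
        """Pick every E2E test whose mapped prefix matches a changed file."""
        selected = set()
        for path in changed_files:
            for prefix, tests in CHANGE_TO_TESTS.items():
                if path.startswith(prefix):
                    selected.update(tests)
        return selected

    # "shared/utils.py" matches nothing, so any test it can break is skipped.
    print(select_tests(["billing/stripe_client.py", "shared/utils.py"]))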


How does that reconcile with the article, which states:

> Did Claude catch all the edge cases? Yes, and I'm not exaggerating. Claude never missed a relevant E2E test. But it tends to run more tests than needed, which is fine - better safe than sorry.

If you have some particular issue with the author's methodology, you should state that.


Well, since it never broke for some rando on the internet, surely that means it will always work for everyone.


If you have some particular issue with the article, you should state that. Otherwise, the most charitable interpretation of your position I can come up with is "the article is wrong for some reason I refuse to specify", which doesn't lead to a productive dialogue.


I think you're the one being uncharitable here. The meaning of what he's saying is very clear. You can't say this probabilistic method (using LLMs to decide your e2e test plan) works if you only have a single example of it working.


It's really not clear. Using probabilistic methods to determine your e2e test plan is already best practice at large tech shops, and to be quite honest, the heuristics they used to use were pretty poor and arbitrary.
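
For illustration, a hedged sketch of that probabilistic style of selection: rank tests by how often they historically failed when a given file changed, then run everything above a cutoff. The history table and threshold below are invented, not anyone's production system.

    from collections import defaultdict

    # (changed file, test) -> historical failure rate when that file changed.
    # The numbers and the 0.05 cutoff are made up for illustration.
    FAILURE_HISTORY = {
        ("billing/stripe_client.py", "test_checkout_flow"): 0.42,
        ("billing/stripe_client.py", "test_search_results"): 0.01,
        ("auth/session.py", "test_login"): 0.35,
    }

    def select_tests(changed_files, threshold=0.05):
        """Run every test whose failure rate for some changed file clears the cutoff."""
        scores = defaultdict(float)
        for path in changed_files:
            for (hist_file, test), rate in FAILURE_HISTORY.items():
                if hist_file == path:
                    scores[test] = max(scores[test], rate)
        return {test for test, score in scores.items() if score >= threshold}

    print(select_tests(["billing/stripe_client.py"]))  # {'test_checkout_flow'}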


The author said they used Claude to decide which E2E tests to run and "Claude never missed a relevant E2E test."

How many times did they conduct this experiment? Over how long a period? How did they determine which tests were relevant and that Claude didn't miss them? Did they try it on more than one project?

My point was that none of this tells me it will work in general.


If the author can keep the whole function code_change -> relevant E2E_TESTS in his head, it seems to be a trivial application.

We don't know the methodology, since the author does not state how he verified that function or how he would verify the function for a large code base.
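
To be concrete, one way a "never missed a relevant test" claim could be checked (the article doesn't say this is what the author did) is to run the full suite anyway and confirm that every test that actually failed was in the selected subset. A hypothetical sketch:

    # Hypothetical check: run the full suite anyway and diff its failures
    # against the LLM-selected subset. Names are stand-ins, not the author's setup.
    def missed_tests(selected, full_suite_results):
        """Tests that failed in the full run but were not in the selected subset."""
        failing = {name for name, passed in full_suite_results.items() if not passed}
        return failing - selected

    results = {"test_checkout_flow": False, "test_login": True, "test_search_results": True}
    print(missed_tests({"test_checkout_flow", "test_invoice_generation"}, results))  # set(): nothing missed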


Easy. The article asks us to believe.

There's a handy list to check against the article here: https://dmitriid.com/everything-around-llms-is-still-magical... starting at "For every description of how LLMs work or don't work we know only some, but not all of the following"


It seems to me like we have the answers to all those questions.

- Do we know which projects people work on?

It's pretty easy to discover that OP works on https://livox.com.br/en/, a tool that uses AI to let people with disabilities speak. That sounds like a reasonable project to me.

- Do we know which codebases (greenfield, mature, proprietary, etc.) people work on?

The e2e tests took 2 hours to run and the website quotes ~40M words. That is not greenfield.

- Do we know the level of expertise the people have?

It seems like they work on nontrivial production apps.

- How much additional work did they have reviewing, fixing, deploying, finishing etc.?

The article says very little.


> The article says very little.

And that's the crux, isn't it. Because that checklist really is just the tip of the iceberg.

Some people have completely opposite experiences: https://news.ycombinator.com/item?id=45152139

Others question the validity of the approach entirely: https://news.ycombinator.com/item?id=45152668

Oh, don't get me wrong: I like the idea. I would trust LLMs with this idea about as far as I could throw them.



