While your example of "site adds a modal" is true, with browser-based regression tests the brittleness I find much more often comes from minor changes to the DOM which don't have a large impact on what the user sees. (Sometimes this is due to refactoring the DOM, sometimes it's because the DOM doesn't have good class names to use as handles, and sometimes it's because I want to reuse a step on 2 different pages where the rendered DOM is slightly different.)
My mental model of how I think about drift (which might be totally worthless from an LLM perspective):
1) I give a prompt
2) a snapshot of the DOM is taken
3) gpt looks at that snapshot DOM and implements some solution which works
4) that solution is transformed into a more concrete implementation
4i) whether this "transformation" is any of the following isn't SUPER important, and to be honest I'd love to see all 3: a) selenium/playwright code written by Taxy, b) a hybrid of explicit code and a gpt prompt, or c) developer can override with fully custom code (a rough sketch of what (b) might look like follows this list)
5) it runs correctly for X amount of time
6) the application DOM changes
7) taxy notices the step is failing
8) taxy takes a new snapshot of the DOM
9) taxy runs a (reinforcement?) algorithm against the new snapshot and confirms it finds the "same" DOM element as the one from the old snapshot.
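For concreteness, here is a minimal Playwright/TypeScript sketch of option (b) from step 4i. It assumes some LLM-backed helper exists (a hypothetical reResolveSelector, not anything Taxy actually ships) that takes the original prompt plus a fresh DOM snapshot and hands back a new selector for the "same" element:

import { test, Page } from '@playwright/test';

// Hypothetical helper (an assumption for this sketch, not a real Taxy API):
// send the original prompt plus a fresh DOM snapshot to the model and get
// back a new selector for the "same" element (steps 8-9 above). Stubbed so
// the sketch type-checks.
async function reResolveSelector(page: Page, originalPrompt: string, staleSelector: string): Promise<string> {
  const snapshot = await page.content();   // step 8: take a new DOM snapshot
  throw new Error(`TODO: ask the model to re-locate "${staleSelector}" for "${originalPrompt}" against ${snapshot.length} chars of DOM`);
}

// Option (b) from step 4i: explicit, reviewable Playwright code, with the
// original prompt kept around as a fallback for when the DOM drifts.
async function clickWithHealing(page: Page, prompt: string, selector: string): Promise<string> {
  const target = page.locator(selector);
  if (await target.count() > 0) {
    await target.first().click();          // step 5: the concrete code still works
    return selector;
  }
  // steps 6-7: the DOM changed and the step is failing, so try to heal it
  const healed = await reResolveSelector(page, prompt, selector);
  await page.locator(healed).first().click();
  return healed;                           // could be written back into the test file
}

test('retweet still works after the DOM drifts', async ({ page }) => {
  await page.goto('https://example.com/timeline');   // placeholder URL
  await clickWithHealing(page, 'retweet the sandwich tweet', '.tweet button:has-text("Retweet")');
});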
Unrelated: here's the other thing I've found very hard to program into my browser tests (and it makes the code hard to interpret), so I'm curious how gpt/taxy could help.
Given I have this DOM:
<ul>
  <li>
    <div class="tweet">
      <h1>Check out my sandwich!</h1>
      <button>Retweet</button>
    </div>
  </li>
  <li>
    <div class="tweet">
      <h1>Check out my shoes!</h1>
      <button>Retweet</button>
    </div>
  </li>
</ul>
I want to write test code which is:
1) When I load the page, I should see a tweet "Check out my sandwich!"
2) I can retweet that tweet.
Currently, I need to do either:
a) a DOM traversal: find(text: "Check out my sandwich", css: ".tweet h1").parents(".tweet").find(text: "retweet"). It's that `parents(".tweet")` part which becomes awkward at scale and incentivizes developers to only create 1 tweet in the test database... (a Playwright-only workaround is sketched below)
b) use Page Objects, which I love, but they add overhead/training for the team
I would love it if gpt could figure out that these 2 elements are "related". :)
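For what it's worth, even without gpt, Playwright's scoped locators can already express that relationship: scope to the .tweet container that has the text, then act inside it, so the heading and the button are tied together without the parents(".tweet") hop. A minimal sketch (the URL is a placeholder):

import { test, expect } from '@playwright/test';

test('I can retweet the sandwich tweet', async ({ page }) => {
  await page.goto('https://example.com/timeline');   // placeholder URL

  // Scope to the .tweet container that mentions the sandwich, then act
  // inside it: the container is what makes the <h1> and the <button> "related".
  const tweet = page.locator('.tweet', { hasText: 'Check out my sandwich!' });
  await expect(tweet.locator('h1')).toBeVisible();              // requirement 1
  await tweet.getByRole('button', { name: 'Retweet' }).click(); // requirement 2
});

The obvious cost is that the test is now coupled to the .tweet container class, which is exactly the kind of handle the model could presumably infer (or re-infer after drift) on its own.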