More RVV questionable optimization cases:

- broadcasting a loaded value: a stride-0 load can be used for this, and could be faster than going through a GPR load & vmv.v.x, but could also be much slower.

- reversing: could use vrgather (could do high LMUL everywhere and split into multiple LMUL=1 vrgathers), could use a stride -1 load or store.

- early-exit loops: It's feasible to vectorize such, even with loads via fault-only-first. But if vl=vlmax is used for it, it might end up doing a ton of unnecessary computation, esp. on high-VLEN hardware. Though there's the "fun" solution of hardware intentionally lowering vl on fault-onlt-first to what it considers reasonable as there aren't strict requirements for it.