If every medium becomes editable like text, I don't see why it should be any easier to watermark images or video than text.
Images have the aliasing problem, which is NP-hard, but you can get the aliasing close to 100% correct after editing an image: just cut out shapes, then throw the result into an image generator to create a new image with 99% similarity. In Stable Diffusion XL it needs about 70% similarity or something like that. The new image will be very similar to the old one, with correct aliasing, but edited as much as you like.
When you generate text with an LLM, you always have some choice. So you can sample in a way that is very likely under your watermark scheme and unlikely otherwise.
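To make that concrete, here is a rough sketch of one way such a scheme could look: a toy "green list" sampler. Everything in it is an assumption for illustration (the toy vocabulary `VOCAB`, the `is_green` hash rule, the key, and the random candidates standing in for an LLM's top choices); it is not any particular published scheme.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(50)]  # hypothetical toy vocabulary


def is_green(prev_token, token, key="secret"):
    """Keyed hash that marks roughly half of all (prev, token) pairs as 'green'."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def generate(n_tokens, watermark, seed=0):
    """Sample tokens; with watermark=True, prefer green candidates when any exist."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n_tokens):
        # Stand-in for the model's choice: a few equally plausible candidates.
        candidates = rng.sample(VOCAB, 4)
        if watermark:
            green = [t for t in candidates if is_green(tokens[-1], t)]
            candidates = green or candidates  # fall back if nothing is green
        tokens.append(rng.choice(candidates))
    return tokens[1:]


def z_score(tokens):
    """How far the green count sits above the 50% expected without a watermark."""
    prev, hits = "<s>", 0
    for t in tokens:
        hits += is_green(prev, t)
        prev = t
    n = len(tokens)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)
```

Someone holding the key can compare `z_score(generate(200, watermark=True))` against `z_score(generate(200, watermark=False))`: the watermarked sequence should score far above what chance would give, while the unwatermarked one should hover near zero.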
But that's what I'm saying: when you ask for exclamation marks after each word, that must change the next-token likelihoods by quite a bit. You then remove the marks, which hides the fact that you just changed every word without loss of meaning.
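A toy sketch of that attack, under heavy assumptions: the watermark is a hypothetical "green list" keyed only on the previous token, the vocabulary `VOCAB` and the `is_green` rule are made up, and random candidates stand in for a real LLM. It only shows why stripping the marks can matter for a scheme of that shape; real schemes may behave differently.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(50)]  # hypothetical toy vocabulary


def is_green(prev_token, token, key="secret"):
    """Keyed hash that marks roughly half of all (prev, token) pairs as 'green'."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0


def generate_with_marks(n_words, seed=0):
    """Watermarked sampling where the prompt forces a '!' after every word."""
    rng = random.Random(seed)
    tokens = ["<s>"]
    for _ in range(n_words):
        candidates = rng.sample(VOCAB, 4)
        green = [t for t in candidates if is_green(tokens[-1], t)]
        tokens.append(rng.choice(green or candidates))
        tokens.append("!")  # forced by the instruction, not picked by the sampler
    return tokens[1:]


def z_score(tokens):
    """How far the green count sits above the 50% expected without a watermark."""
    prev, hits = "<s>", 0
    for t in tokens:
        hits += is_green(prev, t)
        prev = t
    n = len(tokens)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


marked = generate_with_marks(200)
stripped = [t for t in marked if t != "!"]  # the attacker deletes the marks
```

In this toy, every word was chosen to be green relative to the "!" before it. Once the marks are gone, the detector checks each word against the previous word instead, so `z_score(marked)` should be high while `z_score(stripped)` should look like chance.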