Absolutely. I want a "Cat on the Counter" detector, but a) the hardware needs to be cheap, and b) it can't take more than a few seconds to analyze a frame.
It's especially doable since the owner can probably collect lots of pictures of their own cat in different poses and lighting conditions, and deliberately overfit on that one cat instead of detecting cats in general.
Btw, it doesn't really sound like the problem needs video as input to the LLM. Feels like sending a single image is enough, which should make it less demanding, I think.
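Rough sketch of what I mean by "one image at a time", in case it helps. Everything here is made up for illustration: `watch_counter`, `grab_frame`, `classify`, and `on_cat` are hypothetical names, and the classifier is just a callable you'd swap for whatever cheap model or API you end up using.

```python
import time
from typing import Any, Callable, Optional


def watch_counter(
    grab_frame: Callable[[], Any],        # hypothetical: returns one still image
    classify: Callable[[Any], bool],      # hypothetical: True if the cat is on the counter
    on_cat: Callable[[Any], None],        # hypothetical: alert hook (buzzer, push message, ...)
    interval_s: float = 3.0,              # assumed budget: one frame every few seconds
    max_frames: Optional[int] = None,     # None means run forever
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll single frames instead of streaming video; return the number of detections."""
    hits = 0
    seen = 0
    while max_frames is None or seen < max_frames:
        frame = grab_frame()
        if classify(frame):
            hits += 1
            on_cat(frame)
        seen += 1
        # Wait before grabbing the next still; no frames are processed in between.
        sleep(interval_s)
    return hits
```

The point is just that the model only ever sees one still per interval, so the per-frame analysis time is the only thing the hardware has to keep up with.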