It's a good test of whether the replier is human, but it isn't a good test of whether the reply is a decent answer to the question posed. Can you definitively sit here and say that flipping a plate necessitates the antecedent? Replace "cupcake", "plate" and "flip" with technical libraries and terminology and you can easily re-create the ambiguity. The other thing is, there is that 1-in-100 human who would assume you mean just the plate. I think there's a lot of nuance being glossed over in this test.
Being able to aggregate and understand nuance is what's supposed to make ChatGPT an improvement over the logic programming, with its axioms and rules, of the 80s and 90s. If you have to enumerate every single subject, object, and predicate, and how they exist in the world, then we've burned hundreds of millions of dollars of compute power to recreate Prolog.