He didn’t misstate the puzzle; the whole point is to give an alternative version of the puzzle, and GPT-4 doesn’t notice the alteration. It’s exactly as difficult as the standard version as long as you are doing the logic instead of pattern-matching the puzzle’s form to text.
Ah, I had missed that interpretation. Although that may explain why GPT-4 got it wrong: there's so much context in its training data about the relationship between lions and humans, and about this puzzle specifically, that, like this human, its response was swayed...
But I think that's the whole point of the exercise? That GPT-4 is leaning on stringing tokens together in a reply rather than reasoning through the problem itself, which, I would think, would be "required" for AGI (though we may end up finding out that well-trained language models in specific domains eliminate the need for generalized cognition).
In any case, it's an interesting exercise regardless of your opinion/stance on the matter!
But the human (in the comment chain) here made exactly the same mistake!
In that sense this test doesn’t seem to be a good fit for probing reasoning capabilities, since it’s also easy for humans to get wrong (and humans don’t always reason about everything from first principles, especially if they already have similar answers cached in their memory).
It seems you would need novel puzzles that aren’t common (even in kind) and don’t sound similar to existing puzzles to get a handle on its reasoning capabilities.
The human recognized that they made the mistake and fixed it. As mentioned in the original comment, GPT failed to recognize the mistake even after being told. That's the key here that indicates it can't "reason."
There are open questions about whether or not it really needs to reason given sufficient training, but that seems to be the gap here between the human and the machine.
Rewrite the problem in a way that doesn't bias it toward common priors and it reasons it out just fine:
"Suppose I have an object of type C, an object of type G and an object of type L, and I need to get them across a river. I have a boat that can only carry myself and a single other object. I am not allowed to leave the objects of type C and L alone together, and I am not allowed to leave the objects of type L and G alone together. How can I safely get all three across?"
"This is a classic puzzle known as the "river crossing problem". Here's how you can safely get all three objects across the river:
First, take the object of type L across the river. Return alone to the original side of the river. Take the object of type C across the river. Return with the object of type L to the original side of the river. Take the object of type G across the river. Return alone to the original side of the river. Finally, take the object of type L across the river again. By following these steps, you can safely transport all three objects across the river without violating the restrictions given in the problem. Note that you have to start with the object of type L because it cannot be left alone with either of the other two objects. Once L is transported to the other side, you can use it to ensure that C and G are never left alone together."
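For what it's worth, the quoted solution does check out. Here's a minimal Python sketch (not from the thread; the helper names and move encoding are my own) that replays that crossing sequence and verifies neither forbidden pair (C with L, or L with G) is ever left unattended on a bank:

```python
# Sanity check: replay the quoted C/G/L solution and confirm no forbidden
# pair is ever left alone on a bank without the ferryman.

FORBIDDEN = [frozenset("CL"), frozenset("LG")]  # C+L alone, L+G alone

def violates(bank):
    """True if the unattended bank contains a forbidden pair."""
    return any(pair <= bank for pair in FORBIDDEN)

def replay(moves):
    left, right = {"C", "G", "L"}, set()
    ferryman_on_left = True
    for cargo in moves:  # each move is one boat trip; cargo is an item or None
        src, dst = (left, right) if ferryman_on_left else (right, left)
        if cargo is not None:
            src.remove(cargo)
            dst.add(cargo)
        ferryman_on_left = not ferryman_on_left      # ferryman always crosses
        unattended = left if not ferryman_on_left else right
        if violates(unattended):
            return False, f"violation: {sorted(unattended)} left alone"
    return not left, "all items across" if not left else "items remain on start bank"

# The quoted sequence: L over, back empty, C over, L back, G over, back empty, L over.
ok, msg = replay(["L", None, "C", "L", "G", None, "L"])
print(ok, msg)  # expected: True all items across
```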
Or with Bing, you don't even need to tell it what it assumed wrong. I just told it that it's not quite the same as the classic puzzle, and it responded by correctly identifying the difference and asking me if that's what I meant, but it forgot that the lion still eats the goat. When I pointed that out, it solved the puzzle correctly.
Again, this is not about being able to write the prompt in a way that allows GPT to find the answer. I’m not doubting its ability to do so. It’s that a human can reason through why the answer should be different, despite any common priors, and arrive at the correct judgment.
It indicates that there’s still something a human does that the machine doesn’t, even if we’re not able to place what it is. This is neither an argument for nor against progress towards AGI, just an observation. It’s interesting regardless (to me).
It can do that, though..? That's kind of the point of the Bing example. I told it it was making a wrong assumption with its original answer (I didn't tell it what was wrong) and it figured it out.
This here is why it's not fair to criticize GPT-4 so quickly on this question.
For the record, I made the same mistake as nonfamous at first; I almost commented "but it's correct" before going back to double-check what I was missing.
I simply skimmed the problem, recognized it as a common word problem, and totally missed the unusual constraints in the question. I just didn't pay attention to the whole question.
Which, to be fair, is how most people read that problem the first time. I wonder what would happen if you then told GPT, "No, it's the lion that can't be with the cabbage, not the goat, try again."
It's even better. You can tell it that it's not quite the classic puzzle, and then it will actually figure out the differences and summarize them. From there it can solve it.