Claude 3.7 feels overtuned. I asked it to write tests. It replicated my code int...

Claude 3.7 feels overtuned. I asked it to write tests. It replicated my code into mocks and "monkey-patched" it to pass tests. Basically if it didn't pass tests, it will rewrite the mock to pass.

It seems that cursor-thinking will come up with 3 options and pick the dumbest one. Leading to much worse performance than non-thinking sometimes.

A big part of it seems to be increased focus on following instructions vs 3.5. If you don't tell it to not cheat, it cheats lol. Sometimes it's even aware of this, saying things like "I should be careful to not delete this" but deleting it is the fastest path to solving the question as asked, so it deletes.