It's not about generation; it's about verification.
Changing my tests from the strings I was actually interested in to common words of four or more letters _did_ improve reasoning LLMs' ability to get the right answer, at the cost of the context exploding to thousands of tokens.
Unfortunately I can't tell you by how much, because the couple of dozen tests I ran after reading your post ate the $50 I keep in an account for this sort of thing.
The following question burned through 8k thinking tokens before Claude 3.7 Sonnet (extended thinking) got the right answer:
---
Given the following grammar:
Is the following sentence valid: Rome Paris Rome end_path Rome London end_path end_company
---
Incidentally, it got the right answer no fewer than four times in the thinking token stream. I hadn't seen this model behave like that before.
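(The grammar itself got dropped from the paste above. As a sketch, a grammar of roughly this shape would accept that sentence; treat the productions and the validator below as my reconstruction from the sentence's structure, not the exact grammar from the test.)

```python
# Hypothetical grammar sketch, inferred from the shape of the test
# sentence (cities grouped into paths, paths grouped into a company).
# This is NOT necessarily the grammar used in the original test.
#
#   company := path+ "end_company"
#   path    := city+ "end_path"
#   city    := "Rome" | "Paris" | "London"

CITIES = {"Rome", "Paris", "London"}

def is_valid(sentence: str) -> bool:
    """Validate a sentence against the toy grammar above."""
    tokens = sentence.split()
    i = 0
    paths = 0
    while i < len(tokens) and tokens[i] in CITIES:
        # Consume one path: one or more cities, then "end_path".
        while i < len(tokens) and tokens[i] in CITIES:
            i += 1
        if i == len(tokens) or tokens[i] != "end_path":
            return False
        i += 1
        paths += 1
    # After at least one path, exactly "end_company" must remain.
    return paths >= 1 and tokens[i:] == ["end_company"]

print(is_valid("Rome Paris Rome end_path Rome London end_path end_company"))  # True
print(is_valid("Rome Paris end_company"))  # False: path never closed with end_path
```

A dozen lines of deterministic code settle in microseconds what the model spent 8k thinking tokens on, which is the verification point above.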