As a test, I asked Gemini 2.5 Flash and Gemini 2.5 Pro to decode a single Base64 string.

Flash answered correctly in at most ~2 seconds. Pro gave a very wrong answer after thinking and elaborating for ~5 minutes.

In the past, Flash also gave a wrong answer for the same string, but it has since improved.

Prompt was the same: "Hey, can you decode $BASE64_string?"
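
For anyone who wants to reproduce the test, here's a rough sketch. It assumes the google-genai Python SDK and the public "gemini-2.5-flash" / "gemini-2.5-pro" model IDs; the secret string and the containment check are illustrative choices of mine, not part of the original test:

  import base64

  from google import genai  # assumes the google-genai Python SDK is installed

  SECRET = "hello world"
  PROMPT = f"Hey, can you decode {base64.b64encode(SECRET.encode()).decode()}?"

  client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

  for model_id in ("gemini-2.5-flash", "gemini-2.5-pro"):
      reply = client.models.generate_content(model=model_id, contents=PROMPT)
      verdict = "decoded it" if SECRET in (reply.text or "").lower() else "missed it"
      print(f"{model_id}: {verdict}")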

I have no further comments.




Well, that's not a very convincing argument. That's just a failure to recognize when a tool (a Base64 decoder) is needed, not a reasoning problem at all, right?


A moderately smart human who understands how Base64 works can decode it by hand without external tools other than pen and paper. Coming up with the exact steps to perform is a reasoning problem.
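
To make those "exact steps" concrete, here's a sketch of the pen-and-paper procedure in Python (the example string is mine, and the standard base64 module would of course do this in one call):

  # Manual Base64 decode, mirroring the by-hand steps:
  # look up each character's 6-bit value, concatenate the bits,
  # then re-slice the bit string into 8-bit bytes.
  ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

  def decode_by_hand(s: str) -> bytes:
      s = s.rstrip("=")                   # padding carries no data
      bits = "".join(format(ALPHABET.index(c), "06b") for c in s)
      usable = len(bits) - len(bits) % 8  # drop the leftover zero bits
      return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))

  print(decode_by_hand("SGVsbG8sIHdvcmxkIQ=="))  # b'Hello, world!'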


That's not really a cop-out here: both models had access to the same tools.

Realistically, there are many problems that non-reasoning models do better on, especially when the answer can't be reached by a thought process, like recalling internal knowledge.

You can try to teach the model the concept of a problem where thinking will likely steer it away from the right answer, but at some point it becomes like the halting problem... how does the model reliably think its way to the realization that a given problem is too complex to be thought out?


Translating to Base64 is a good test of how well a model works as a language translator without changing things, because it's the same skill for an AI model.

If the model changes things, it means it didn't really capture the translation patterns for Base64, and then who knows what it will miss when translating between natural languages if it can't even do Base64?
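
One way to make that testable: Base64 has exactly one correct decoding, so a model's answer can be checked for drift against the ground truth. A minimal sketch (the helper name and sample strings are just illustrative):

  import base64

  def matches_ground_truth(encoded: str, model_answer: str) -> bool:
      # There is exactly one right answer; any "small" change is a miss.
      truth = base64.b64decode(encoded).decode("utf-8", errors="replace")
      return model_answer.strip() == truth

  print(matches_ground_truth("aGVsbG8gd29ybGQ=", "hello world"))   # True
  print(matches_ground_truth("aGVsbG8gd29ybGQ=", "hello, world"))  # False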


If the reasoning model were truly reasoning while the Flash model was not, then by definition shouldn't it be better at knowing when to use the tool than the non-reasoning model? Otherwise it's not really "smarter" as claimed, which seems to line up perfectly with the paper's conclusion.


I don't know whether Flash uses a tool or not, but it answers pretty quickly. Pro, however, opts to use its own reasoning rather than a tool. When I look at the reasoning trace, it keeps pulling in knowledge endlessly, refining it and drifting further away.



