If the reasoning model was truly reasoning while the flash model was not, then by definition shouldn’t it be better than the non-reasoning model at knowing when to use the tool? Otherwise it’s not really “smarter” as claimed, which seems to line up perfectly with the paper’s conclusion.