> Rather than standard benchmarks (e.g., math problems), we adopt controllable puzzle environments that let us vary complexity systematically
Very clever, I must say. Kudos to folks who made this particular choice.
> we identify three performance regimes: (1) low complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
This is fascinating! We need more "mapping" of regimes like this!
What I would love to see (not sure if anyone on here has seen anything to this effect) is how these complexity regimes map to the economic value of the task.
For that, the eval needs to go beyond puzzles, but the complexity of the tasks still needs to be controllable.
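To make that concrete, here is a minimal sketch of the kind of controllable-complexity harness the paper's approach suggests, using Tower of Hanoi (one of the puzzles in the Apple paper) with disk count as the complexity knob. `query_model`, `make_prompt`, and `evaluate` are hypothetical names for illustration, not the paper's actual harness:

```python
# Minimal sketch: a puzzle environment whose complexity is a single knob
# (number of disks), so accuracy can be measured as a function of it.
# `query_model` is a hypothetical stand-in for whatever model API you use;
# answer parsing is elided for brevity.

def hanoi_solution(n, src=0, dst=2, aux=1, moves=None):
    """Generate the optimal move sequence for n-disk Tower of Hanoi."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi_solution(n - 1, src, aux, dst, moves)  # move n-1 disks out of the way
    moves.append((src, dst))                     # move the largest disk
    hanoi_solution(n - 1, aux, dst, src, moves)  # move n-1 disks back on top
    return moves

def make_prompt(n):
    return (
        f"Solve Tower of Hanoi with {n} disks on pegs 0, 1, 2. "
        "All disks start on peg 0 and must end on peg 2. "
        "Answer as a list of (from_peg, to_peg) moves."
    )

def evaluate(query_model, max_disks=12):
    """Sweep complexity and record exact-match accuracy at each level."""
    results = {}
    for n in range(1, max_disks + 1):
        predicted = query_model(make_prompt(n))  # hypothetical model call
        results[n] = (predicted == hanoi_solution(n))
    return results
```

The appeal is that `n` is a single scalar you can sweep and the ground truth is exactly checkable -- which is precisely what standard math benchmarks don't give you. Mapping each `n` to an economically meaningful task is the open part.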
Is (1) that surprising? If I ask someone a simple question but tell them to "think really hard about it", they'll be more likely to treat it as a trick question and look for a non-obvious answer. Overthinking it, basically.
It is hard to compare models with humans, so I'm not sure how to answer that for both. :)
But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure. They write extensively about how they took DeepSeek-V3-Base and made R1 with it. [1]
If post-training results in lower performance on simpler tasks, that ought to inspire more research on how to avoid it -- i.e., with more training (of any kind), we should be gaining capabilities, not losing them. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well: some weight changes can be destructive, so it has to be done with a _lot_ of care. And I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)
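One family of fixes, for what it's worth, is to explicitly constrain how far fine-tuning can move the weights. Here is a minimal PyTorch sketch of that idea -- an L2 penalty pulling each parameter back toward its pretrained value, in the spirit of L2-SP. `model`, `loader`, and `loss_fn` are hypothetical placeholders, not anything from the paper:

```python
import torch

# Minimal sketch of one mitigation for destructive fine-tuning:
# penalize drift from the pretrained weights, so post-training gains
# new behavior without wandering far from the weights that encode
# the base model's existing capabilities.

def finetune_with_anchor(model, loader, loss_fn,
                         lr=1e-5, anchor_weight=1e-3, epochs=1):
    # Snapshot the pretrained weights to anchor against.
    anchor = {name: p.detach().clone() for name, p in model.named_parameters()}
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            task_loss = loss_fn(model(x), y)
            # L2 penalty pulling each parameter toward its pretrained value.
            drift = sum(((p - anchor[name]) ** 2).sum()
                        for name, p in model.named_parameters())
            loss = task_loss + anchor_weight * drift
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Methods like EWC refine this by weighting the penalty per-parameter by estimated importance; the above is just the simplest version of the idea.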