Is (1) that surprising? If I ask someone a simple question but tell them to "think really hard about it", they'll be more likely to treat it as a trick question and look for a non-obvious answer. Overthinking it, basically.
It is hard to compare models with humans, so I'm not sure how to answer it for both. :)
But, for models, this is an interesting finding because a lot of LRMs are LLMs with a _bunch_ of post-training done on top. We know this about DeepSeek R1 (one of the models evaluated in the Apple paper) for sure: they write extensively about how they took DeepSeek-V3-Base and built R1 from it. [1]
If post-training results in lower performance on simpler tasks, that ought to inspire more research on how to prevent it -- i.e., more training (of any kind) should only add capabilities, not take them away. This has been a problem with DNNs historically, btw. We had these issues when fine-tuning text/image classifiers as well: some weight updates can be destructive (catastrophic forgetting). So, it has to be done with a _lot_ of care. And, I am sure folks are working on it, to be honest. Maybe some of them will say something here. :-)
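As a minimal sketch of what "care" looks like for the classifier case (assuming PyTorch/torchvision; the model choice and learning rate here are just illustrative, not anything from the paper): freeze the pretrained weights and only train a new head, so fine-tuning can't overwrite what the base model already knows.

  import torch
  from torchvision import models

  # Load a pretrained backbone.
  model = models.resnet18(weights="IMAGENET1K_V1")

  # Freeze all pretrained weights so fine-tuning can't destructively update them.
  for p in model.parameters():
      p.requires_grad = False

  # Replace the classifier head for the new task; only these weights will train.
  model.fc = torch.nn.Linear(model.fc.in_features, 10)

  # Optimize only the new head (or, alternatively, unfreeze the rest with a much smaller lr).
  optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

Post-training an LLM into an LRM is obviously a lot more involved than this, but the underlying tension is the same: you want the new capability without clobbering the old ones.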