I think this might be part of the reason Apple is “behind” on generative AI … LLMs have not really proven to be useful outside of relatively niche areas such as coding assistants, legal boilerplate and research, and maybe some data science/analysis, which I’m less familiar with.
Other “end user”-facing use cases have so far been comically bad or possibly harmful, and they just don’t meet the quality bar for inclusion in Apple products, which, as much as some people like to doo doo on them and say they have gotten worse, still come with very high expectations of quality and UX from customers.
Apple prides itself on building high-quality user experiences. One can argue over whether that’s true anymore, or ever was, but it’s very clear they pride themselves on it. This is why Apple tends to “be late” on many features. I think, like you’re saying, it’s becoming even clearer that they aren’t seeing a UX they’re willing to ship to customers. We’ll see at WWDC whether they found something in the last year to make it better, but this paper seems to indicate they haven’t.
I have friends who do pretty disparate things (e.g. education consulting, grant writing, solar project planning, etc.). They all use LLMs in aspects of their jobs: tasks like rewording emails for tone, extracting themes from brainstorming sessions, rough-drafting project plans, etc.
None of them are doing the equivalent of “vibe-coding”, but they use LLMs to get 20-50% done, then take over from there.
Apple likes to deliver products that are polished. Right now the user has to do the polishing of LLM output. But that doesn’t mean it isn’t useful today.
No matter how much computing power you give them, they can't solve harder problems.
Why would anyone ever expect otherwise?
These models are inherently handicapped, and always will be, in terms of real-world experience. They have no real grasp of things people understand intuitively: time, money, truth ... or even death.
The only *reality* they have to work from is a flawed statistical model built from their training data.
I agree, obviously, but half the internet is still running around claiming we're on the verge of a singularity, so demonstrating the actual limitations of these systems concretely is important.
No matter how much computing power you give them, they can't solve harder problems.
This research suggests we're not as close to AGI as the hype implies.
Current "reasoning" breakthroughs may be hitting fundamental walls that can't be solved by just adding more data or compute.
Apple's researchers used controllable puzzle environments specifically because:
• They avoid data contamination
• They require pure logical reasoning
• They can scale complexity precisely
• They reveal where models actually break
Models could handle 100+ moves in Tower of Hanoi puzzles but failed after just 4 moves in River Crossing puzzles.
This suggests they memorized Tower of Hanoi solutions during training but can't actually reason.
https://x.com/RubenHssd/status/1931389580105925115
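For context on that Tower of Hanoi number: the puzzle has a single well-known recursive solution whose optimal length is 2^n − 1 moves, so a long correct trace can come from reproducing a fixed pattern rather than from any real search. A minimal sketch (Python is my choice here; none of this code is from the paper or the thread):

```python
# Tower of Hanoi is generated by one fixed recursion, so emitting a long
# correct move sequence doesn't require planning -- just following the pattern.

def hanoi(n, src="A", aux="B", dst="C", moves=None):
    """Return the optimal move list for n disks (always 2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, dst, aux, moves)   # park the top n-1 disks on the spare peg
    moves.append((src, dst))             # move the largest disk to the target peg
    hanoi(n - 1, aux, src, dst, moves)   # restack the n-1 disks on top of it
    return moves

for n in range(3, 11):
    print(n, "disks ->", len(hanoi(n)), "moves")  # 7 disks is already 127 moves
```

So “100+ moves” corresponds to only about 7 disks of this mechanical pattern, whereas river-crossing puzzles force a constraint check at every step, which is consistent with the memorization reading above.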