It's more complex than just "ship everything to an LLM and use tool calls", but the payoff - perfect behavior, along with offline support, for your most common inputs - is worth it, I think.
I disagree about things being less consistent. Let's imagine a 100% LLM world - in this world, you pour a bunch of training into getting the LLM to match, for common inputs, the responses you'd otherwise hardcode. Get that training exactly right and you hit 100% accuracy on those inputs - and in that world, nobody complains about consistency! So why not just hardcode that behavior directly?
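For what it's worth, the hybrid both comments describe is easy to sketch. Here's a minimal Python illustration - the command table, handler names, and llm_fallback callable are all made up for the example, not anyone's actual implementation. Common utterances hit hardcoded handlers and never leave the device; everything else falls through to the LLM:

    import re
    from typing import Callable

    # Hardcoded handlers for the most common inputs: deterministic,
    # instant, and fully offline. Matched after basic normalization.
    COMMON_COMMANDS: list[tuple[re.Pattern, Callable]] = [
        (re.compile(r"^set (?:a )?timer for (\d+) minutes?$"),
         lambda m: f"Timer set for {m.group(1)} minutes."),
        (re.compile(r"^turn (on|off) the lights$"),
         lambda m: f"Lights turned {m.group(1)}."),
    ]

    def normalize(utterance: str) -> str:
        return " ".join(utterance.lower().strip("?!. ").split())

    def handle(utterance: str, llm_fallback: Callable[[str], str]) -> str:
        """Hardcoded path first; the LLM only sees the long tail."""
        text = normalize(utterance)
        for pattern, handler in COMMON_COMMANDS:
            match = pattern.match(text)
            if match:
                return handler(match)  # 100% consistent, works offline
        # Anything unrecognized goes to the LLM (tool calls, network, etc.).
        return llm_fallback(utterance)

So handle("Set a timer for 5 minutes!", my_llm) answers instantly with no network call, while anything off-script still gets the full LLM treatment.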
The whole benefit of LLMs is that humans are not consistent enough. Or at least Apple, Amazon, Google, and Microsoft all believe normies won't speak even the most common inputs the same way every time - the kind of consistency that would allow much simpler and more efficient approaches to voice input, like the ones that worked offline on a regular PC 15+ years ago.
LLMs are actually the only reason I'd consider processing voice in the cloud to be a good idea. Alas, knowing how the aforementioned companies designed their assistants in the past, I'm certain they'll find a way to degrade the experience and strip away most of the benefits of having LLMs in the loop. After all, past experience shows you can't have an assistant that lets you operate commercial products and services without speaking the brand names out loud. That's unthinkable.