I think you are right. Even if I believe next-token prediction can work, I don't think it can happen in this autoregressive way where we fully collapse the distribution into a single token to feed back in. Can you imagine how much is lost with each torch.multinomial call?
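To make the "collapse" concrete, here is a minimal sketch (with a made-up toy distribution, not real model outputs) of how much information a single sampling step throws away: the model emits a full probability distribution over the vocabulary, but sampling keeps only one token id.

```python
import math

# Hypothetical next-token distribution over a tiny 5-token vocabulary.
# A real LLM produces one of these over ~50k-200k tokens at every step.
probs = [0.4, 0.3, 0.15, 0.1, 0.05]

# Shannon entropy of the full distribution, in bits: the information
# carried by the distribution's shape before we collapse it.
entropy_bits = -sum(p * math.log2(p) for p in probs)

# Sampling (what torch.multinomial effectively does) reduces all of this
# to a single integer index; the runner-up candidates and the relative
# confidences are discarded before the next autoregressive step.
print(f"entropy of full distribution: {entropy_bits:.2f} bits")
```

This is only an illustration of the point, not a claim about any particular model: everything except the one sampled id is lost to the next forward pass.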
Maybe the way forward is LCM, or going the JEPA route. Otherwise, as this Apple paper suggests, we will just keep pushing the "pattern matching" further; maybe we get some sort of phase transition at some point, or maybe we have to switch architectures. We will see. It could be that things change when we get physical multimodality and real-world experience, I don't know.
Take language out of the equation, and drawing circles, triangles, and letters is just statistical physics. We can capture, in energy models stored in an online state, the statistical physics relative to the machine, i.e. its electromagnetic geometry: https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...
Our language doesn’t exist without humans. It’s not an immutable property of physics. It’s obfuscation and mind viruses. It’s story mode.
The computer acting as a web server or an LLM has an inherent energy model. New models of those patterns will be refined to a statefulness that strips away unnecessary language constructs in the system, like the large share of software that most people never use, only developers.
I look forward to continuing my work in the hardware world: further compressing and reducing the useless state of past systems of thought that we copy-paste around to serve developers, reducing the context to sort through, and improving model quality: https://arxiv.org/abs/2309.10668