> how to build systems where the whole is bigger than the sum of its parts
A bit tangential, but I look at programming as inherently being that. Every task, I try to break down into smaller tasks that together accomplish something more. That leads me to think that, if you structure the process of programming right, you will only end up solving small, minimally intertwined problems. Might sound far-fetched, but I think it's doable to create such a workflow. And even the dumber LLMs would slot naturally into such a process, I imagine.
> And, even the dumber LLMs would slot in naturally into such a process
That is what I am struggling with: at the moment it is really easy to slot an LLM in and make everything worse, mainly because its output comes out of torch.multinomial, filtered through all kinds of speculative decoding, quantization, and so on.
But I am convinced it is possible, just not the way I am doing it right now; that's why I am spending most of my time studying.
For studying? Mainly watching and re-watching Karpathy's 'Zero To Hero'[1] and Stanford's 'Introduction to Convolutional Neural Networks for Visual Recognition'[2], plus a lot of transformers-from-scratch videos like Umar Jamil's[3], and I also study backwards to McCulloch and Pitts. Reading the 30 papers at https://punkx.org/jackdoe/30.html, and so on.
And of course Yannic Kilcher[4], and also listening in on the paper discussions they do on discord.
Practicing a lot with doing backpropagation by hand and making toy models by hand to get intuition for the signal flow, and building all kinds of smallish systems, e.g. how far can you push whisper, a small qwen3, and kokoro to control your computer with voice?
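To make "backpropagation by hand" concrete, here is a minimal sketch of the kind of toy exercise meant above: a single tanh neuron with a squared-error loss, gradients derived manually via the chain rule and checked against finite differences. The function names and the specific numbers are my own, made up purely for illustration.

```python
import math

# Toy "backprop by hand": y = tanh(w*x + b), loss L = (y - t)^2.
# Gradients are written out manually via the chain rule, then
# verified with a finite-difference check.

def forward(w, b, x, t):
    z = w * x + b
    y = math.tanh(z)
    L = (y - t) ** 2
    return z, y, L

def backward(w, b, x, t):
    _, y, _ = forward(w, b, x, t)
    dL_dy = 2 * (y - t)      # d/dy of (y - t)^2
    dy_dz = 1 - y * y        # d/dz of tanh(z)
    dL_dz = dL_dy * dy_dz
    dL_dw = dL_dz * x        # z = w*x + b, so dz/dw = x
    dL_db = dL_dz * 1.0      # dz/db = 1
    return dL_dw, dL_db

def numeric_grad(w, b, x, t, eps=1e-6):
    # Central finite differences as a sanity check on the hand math.
    dw = (forward(w + eps, b, x, t)[2] - forward(w - eps, b, x, t)[2]) / (2 * eps)
    db = (forward(w, b + eps, x, t)[2] - forward(w, b - eps, x, t)[2]) / (2 * eps)
    return dw, db

w, b, x, t = 0.5, -0.3, 1.2, 0.8
gw, gb = backward(w, b, x, t)
nw, nb = numeric_grad(w, b, x, t)
print(gw, nw)  # the two columns should agree closely
print(gb, nb)
```

Scaling this exercise up (more neurons, a layer, a tiny MLP) while still writing every derivative yourself is exactly the kind of drill that builds intuition for the signal flow.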
People think that deepseek/mistral/meta etc. are democratizing AI, but it's actually Karpathy who teaches us :) so we can understand them and make our own.
I think you are right. Even though I believe next-token prediction can work, I don't think it can happen in this autoregressive way where we fully collapse the token before feeding it back in. Can you imagine how much is lost at each torch.multinomial?
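To illustrate what that collapse discards: the model emits a full probability distribution over the vocabulary, but autoregressive decoding keeps only one sampled index. A stdlib-only sketch (the logits here are made up, and random.choices stands in for torch.multinomial conceptually):

```python
import math
import random

# Hypothetical next-token logits over a tiny 5-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]

def softmax(xs):
    m = max(xs)                     # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

probs = softmax(logits)

# Information carried by the full distribution: its Shannon entropy, in bits.
entropy = -sum(p * math.log2(p) for p in probs)

# What actually gets fed back into the model: one sampled index.
# Everything else about the distribution is thrown away at this step.
token = random.choices(range(len(probs)), weights=probs)[0]

print(f"distribution entropy: {entropy:.2f} bits")
print(f"token fed back in: {token}")
```

Whatever the entropy of that distribution, the next step of the loop sees only the single id, which is the information loss being pointed at here (and one motivation for latent-space approaches like the ones mentioned below).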
Maybe the way forward is LCM, or to go JEPA; otherwise, as this Apple paper suggests, we will just keep pushing the "pattern matching" further. Maybe we get some sort of phase transition at some point, or maybe we have to switch architecture; we will see. It could be that things change when we get physical multimodality and real-world experience, I don't know.
Take language out of the equation, and drawing a circle, triangles, or letters is just statistical physics. We can capture that in energy models stored in an online state: statistical physics relative to the machine, its electromagnetic geometry: https://iopscience.iop.org/article/10.1088/1742-6596/2987/1/...
Our language doesn’t exist without humans. It’s not an immutable property of physics. It’s obfuscation and mind viruses. It’s story mode.
The computer acting as a web server or an LLM has an inherent energy model to it. New models of those patterns will be refined to a statefulness that strips away unnecessary language constructs in the system, like the large amount of software that most people never use, only developers.
I look forward to continuing my work in the hardware world to further compress and reduce the useless state of past systems of thought that we copy-paste around to serve developers, to reduce the context to sort through, and to improve model quality: https://arxiv.org/abs/2309.10668