I don't understand how people feel comfortable writing 'LLMs are done improving; this plateau is it' when we haven't even gone an entire calendar year without seeing improvements to LLM-based AI.
As someone on that side of the argument, I have to say I have yet to see LLMs fundamentally improve, as opposed to being benchmaxxed on each new set of common "trick questions" and giving off the illusion of reasoning.
Add an extra leg to any animal in a picture, then ask the vision LLM how many legs it sees. It will answer with the number a healthy animal would have, because it's not actually reasoning, it's not perceiving anything, it's pattern matching. It sees dog, it answers 4 legs.
Maybe sometime in the future it won't do that, because they will add this kind of trick to their benchmaxxing set (training LLMs specifically on pictures that have fewer or more legs than the animal should), as they do every time there's a new generation of these illusory things. But that won't fix the fundamental problem: these things DO NOT REASON.
Training LLMs on sets of thousands and thousands of the reasoning trick questions people ask on LM Arena is borderline scamming people about the true nature of this technology. If we lived in a sane regulatory environment, OAI would have a lot to answer for.
Someone is going to build a new OS based around LLMs soon. It might end up being mobile-first, but desktop could work too. Voice and tool calling are close to being good enough.
It's still too early, but at some point we're going to start seeing infra and frameworks designed to be easier for LLMs to use: a version of Terraform intended for AI, or an edition of the AWS API for LLMs.
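To make that concrete, here's a rough sketch of what an "LLM edition" of an infra API might look like. The endpoint, types, and field names are all invented for illustration; the idea is flat inputs, a strict schema, and errors verbose enough that the model can repair its own next call:

    package main

    import (
        "encoding/json"
        "log"
        "net/http"
    )

    // ProvisionRequest is a made-up example: flat fields, no nesting,
    // so a model can fill it in reliably from a tool-call schema.
    type ProvisionRequest struct {
        Service string `json:"service"` // e.g. "postgres"
        SizeGB  int    `json:"size_gb"`
        Region  string `json:"region"`
    }

    // ToolError is returned instead of a bare status code so the model
    // can self-correct on its next call instead of guessing.
    type ToolError struct {
        Field      string `json:"field"`
        Problem    string `json:"problem"`
        Suggestion string `json:"suggestion"`
    }

    func provision(w http.ResponseWriter, r *http.Request) {
        var req ProvisionRequest
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            w.WriteHeader(http.StatusBadRequest)
            json.NewEncoder(w).Encode(ToolError{"body", "not valid JSON", "resend a JSON object matching ProvisionRequest"})
            return
        }
        if req.SizeGB <= 0 {
            w.WriteHeader(http.StatusBadRequest)
            json.NewEncoder(w).Encode(ToolError{"size_gb", "must be a positive integer", "retry with size_gb between 1 and 1024"})
            return
        }
        w.WriteHeader(http.StatusAccepted)
        json.NewEncoder(w).Encode(map[string]string{"status": "provisioning", "service": req.Service})
    }

    func main() {
        http.HandleFunc("/v1/provision", provision)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

Contrast that with today's Terraform or the raw AWS APIs, where agents tend to trip over HCL syntax and deeply nested configs.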
For factory optimization they didn't ask people to track their hours in 7-minute intervals. They put a camera on them, analyzed what they did over an entire shift, and then redesigned the factory line to optimize things. Having people manually track time has far more overhead and is less accurate in general. In practice I've never seen a software org take estimates or time tracking seriously beyond "track your hours in 15-minute increments against tickets".
I agree. It's a lot faster to tell it what I want and work on something else in the meantime. You end up reading code diffs more than writing code, but it saves time.
Yes, that, and AI Explained [1] on YouTube - no BS hype, strong yet approachable analysis, and an in-depth private community if you want a deeper dive.
I recently used Claude Code and Go to build a metric-receiving service for a weather station project. I had it build all the code for this: the SQL statements, the handlers, the deployment scripts, the systemd service files, the tests, and a utility to generate API keys.
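For a sense of scale, the core of a service like that is a handler along these lines. This is my own minimal sketch, not the generated code, and the payload, table, and field names are guesses:

    package main

    import (
        "database/sql"
        "encoding/json"
        "log"
        "net/http"
        "time"

        _ "github.com/mattn/go-sqlite3"
    )

    // Reading is a guess at the payload shape for a weather station.
    type Reading struct {
        StationID string    `json:"station_id"`
        TempC     float64   `json:"temp_c"`
        Humidity  float64   `json:"humidity"`
        TakenAt   time.Time `json:"taken_at"`
    }

    func main() {
        db, err := sql.Open("sqlite3", "weather.db")
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
            // A real service would hash the key and compare it against
            // stored keys; this presence check keeps the sketch short.
            if r.Header.Get("X-API-Key") == "" {
                http.Error(w, "missing api key", http.StatusUnauthorized)
                return
            }
            var m Reading
            if err := json.NewDecoder(r.Body).Decode(&m); err != nil {
                http.Error(w, "bad payload", http.StatusBadRequest)
                return
            }
            if _, err := db.Exec(
                `INSERT INTO readings (station_id, temp_c, humidity, taken_at) VALUES (?, ?, ?, ?)`,
                m.StationID, m.TempC, m.Humidity, m.TakenAt,
            ); err != nil {
                http.Error(w, "db error", http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusAccepted)
        })
        log.Fatal(http.ListenAndServe(":9090", nil))
    }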
I think you're using the wrong language, to be honest. LLMs are best at languages like Python, JavaScript, and Go: relatively simple structures and huge amounts of reference code. Rust is a less common language that is much harder to write.
Did you give Claude Code tests and the ability to compile in a loop? It's pretty good, in Go at least, at debugging and fixing issues when allowed to loop.
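For anyone who hasn't tried this: you hand it a test file and let it run `go build` and `go test ./...` after each edit until everything is green. A toy example of the kind of test I mean, with placeholder names:

    package metrics

    import "testing"

    // ClampHumidity stands in for whatever behavior you're pinning down;
    // the agent edits the implementation until the test passes.
    func ClampHumidity(v float64) float64 {
        if v < 0 {
            return 0
        }
        if v > 100 {
            return 100
        }
        return v
    }

    func TestClampHumidity(t *testing.T) {
        cases := []struct{ in, want float64 }{
            {-5, 0},
            {42, 42},
            {130, 100},
        }
        for _, c := range cases {
            if got := ClampHumidity(c.in); got != c.want {
                t.Errorf("ClampHumidity(%v) = %v, want %v", c.in, c.want, got)
            }
        }
    }

With that in place the agent has an objective signal to loop against instead of guessing whether its fix worked.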
Where? I have candidates solve a real closed-ended problem in the space we're working in. I also give them a lot of source code to read, respond to, and find issues with.