You focused on writing software, but the real problem is the spec used to produce the software. LLMs will happily hallucinate reasonable but unintended specs, and the checker won't save you because, after all, the software created is correct w.r.t. the spec.
Also, tests and proof checkers only catch what they're asked to check; if the LLM misunderstands intent but produces a consistent implementation+proof, everything "passes" and is still wrong.
This is why every one of my coding agent sessions starts with "... write a detailed spec in spec.md and wait for me to approve it". Then I review the spec, then I tell it "implement with red/green TDD".
Users don't want changes that rapidly. There's not enough people on the product team to design 20x more features. 20x more features means 400x more cross-team coordination. There's only positive marginal ROI for maybe 1.5-2x even if development is very cheap.
The premise is still playing out. We are only at the beginning of the fourth year of this hype phase, and we haven't even reached AGI yet. It's obviously not perfect, and maybe it never will be, but we are not at the point yet where we can conclude which future is true. The singularity hasn't happened yet, so we are still moving at (LLM-enhanced) human speed at the moment, meaning things need time.
Maybe, but you're responding to a thread about why AI might or might not be able to replace an entire engineering team:
> Ultimately I think over the next two years or so, Anthropic and OpenAI will evolve their product from "coding assistant" to "engineering team replacement", which will include standard tools and frameworks that they each specialize in (vendor lock in, perhaps), but also ways to plug in other tech as well.
This is the context of how this thread started, and this is the context in which DrammBA was saying that the spec problem is very hard to fix [without an engineering team].
Might be good to define the (legacy) engineering team. Instead of thinking 0/1 (ugh, almost nothing happens this way), the traditional engineering team may be replaced by something different. A team mostly of product, spec writers, and testers. IDK.
The job of AI is to do what we tell it to do. It can't "create a spec" on its own. If it did and then implemented that spec, it wouldn't accomplish what we want it to accomplish. Therefore we the humans must come up with that spec. And when you talk about a software application, the totality of its spec written out, can be very complex, very complicated. To write and understand, and evolve and fix such a spec takes engineers, or what used to be called "system analysts".
To repeat: specifying what a "system" we want to create does is a highly complicated task, which can only be done by human engineers who understand the requirements for the system, how parts of those requirements/specs interact with other parts of the spec, and what the consequences of one (part of the) spec are for other parts of it. We must not write "impossible specs" like "draw me a round square". Maybe the AI can check whether the spec is impossible or not, but I'm not so sure of that.
So I expect that software engineers will still be in high demand, but they will be much more productive with AI than without it. This means there will be much more software because it will be cheaper to produce. And the quality of the software will be higher in terms of doing what humans need it to do. Usability. Correctness. Evolvability. In a sense the natural language-spec we give the AI is really something written in a very high-level programming-language - the language of engineers.
BTW. As I write this I realize there is no spell-checker integrated into Hacker News. (Or is there?). Why? Because it takes developers to specify and implement such a system - which must be integrated into the current HN implementation. If AI can do that for HN, it can be done, because it will be cheap enough to do it -- if HN can exactly spell out what kind of system it wants. So we do need more software, better software, cheaper software, and AI will help us do that.
A 2nd factor is that we don't really know if a spec is "correct" until we test the implemented system with real users. At that point we typically find many problems with the spec. So somebody must fix the problems with the spec, evolve the spec, and rinse and repeat the testing with real users -- the developers who understand the current spec and why it is not good enough.
AI can write my personal scripts for me surely. But writing a spec for a system to be used by thousands of humans, still takes a lot of (human) work. The spec must work for ALL users. That makes it complicated and difficult to get right.
Same, and similarly something like a "create a holistic design with all existing functionality you see in tests and docs plus new feature X, from scratch", then "compare that to the existing implementation and identify opportunities for improvement, ranked by impact, and a plan to implement them" when the code starts getting too branchy. (aka "first make the change easy, then make the easy change"). Just prompting "clean this code up" rarely gets beyond dumb mechanical changes.
Given so much of the work of managing these systems has become so rote now, my only conclusion is that all that's left (before getting to 95+% engineer replacement) is an "agent engineering" problem, not an AI research problem.
I guess that's arguable; a memory leak can make a system unpleasant to use, although I accept it can be solved by repeatedly restarting the offending app.
Without getting into your specific injury or sport, what was the biggest change compared to the trainer’s program?
Was it something unexpected like "exercise this seemingly unrelated muscle group that has nothing do with your injury but just happens to reduce pain by 75% for some inexplicable reason"?
Or was it something more mundane like "instead of exercising this muscle every day, do it every other day to give it time to rest"?
I'm not entirely sure, but here is my educated guess.
The biggest change was that I spent a lot of time vetting each exercise for my specific injury points and asking whether this was really the best way to work that muscle group. I ended up replacing 60% of the workout with new exercises that allow me to lift more weight or target different muscle groups, while taking pressure off those injury points.
I think I had grown to use more weight with a few exercises that, on paper, shouldn't cause a problem, but were causing more stress on my injury and the supporting muscles. I found ways to isolate those muscles without putting as much tension on that area. I also added more core-strength exercises, including some for the hip flexors, which might be helping support as well. I was likely doing planks for too long, and switched to hardstyle, etc.
Last year, I was pain-free 90% of the year, and most years I run around 95% to 98%. Last year just felt different, and the rehab wasn't working the way it used to. Since switching to this workout about 8 weeks ago, I've been 100% pain-free in a way that is hard to describe. My back has just felt light and happy, and I can jump up on boxes and back down with no worries.
This is on the back of 10 years of rehab, 10 years of education, 10 years of learning about my injury and body, etc. AI is not some magic button to all the people who might jump on this thread :), it's a tool, and I want to stress that. But I've tried to do this in years past, and I couldn't do it. This was a game-changer. I tried with ChatGPT3 and it was useless at the time as well.
Funny thing: I went down a rabbit hole, because I first scanned the open PRs and saw a PR to enable universal builds to support Intel Macs, but the whole thing was pure AI slop, and someone commented that codexbar already supports Intel; sure enough, v.15 added it (the AI slop PR completely missed that). I then looked into the cask script, and it has a hardcoded dependency on arm which prevents brew from installing v.17 even though it has been a universal binary since v.15.
> Today at CES, Intel unveiled Intel Core Ultra Series 3 processors, the first AI PC platform built on Intel 18A process technology that was designed and manufactured in the United States. Powering over 200 designs from leading, global partners, Series 3 will be the most broadly adopted and globally available AI PC platform Intel has ever delivered.
What in the world is this disaster of an opening paragraph? From the weird "AI PC platform" (not sure what that is) to the "will be the most broadly adopted and globally available AI PC platform" (is that a promise? a prediction? a threat?).
And you just gotta love the processor names "Intel Core Ultra Series 3 Mobile X9/X7"
I think I have given up on chip naming. I honestly can't tell anymore, there are so many modifiers on the names these days. I assume 9 is better than 7, right? Right?
Oh, the number of times I’ve heard someone assume their five- or ten-year-old machine must be powerful because it’s an i7… no, the i3-14100 (released two years ago) is uniformly significantly superior to the i7-9700 (released five years before that), and only falls behind the i9-9900 in multithreaded performance.
Within the same product family and generation, I expect 9 is better than 7, but honestly it wouldn’t surprise me to find counterexamples.
>>Within the same product family and generation, I expect 9 is better than 7
Ah the good old Dell laptop engineering, where the i9 is better on paper, but in reality it throttles within 5 seconds of starting any significant load and the cpu nerfs itself below even i5 performance. Classic Dell move.
Apple had the same problem before they launched the M1. Unless your workloads are extremely bursty the i9 MacBook is almost guaranteed to be slower than the base i7.
The latest iPhone base model performs better than the iPhone Air despite the latter having a Pro chip, because that Pro is so badly throttled due to the device form factor.
Are they throttling with the fan off? Because I don't recall ever hearing the fan on my M3 Max 14" (granted, no heavy deliberate computation beyond regular dev work).
AFAIK it’s only something that happens under sustained heavy load. The 14” Max should still outperform the Pro for shorter tasks but I’d reckon few people buy the most expensive machine for such use cases.
Personally I think that Apple should not even be selling the 14” Max when it has this defect.
Within the same family and generation, I don’t think this should happen any more. But especially in the past, some laptops were configurable with processors of different generations or families (M, Q, QM, U, so many possibilities) so that the i7 option might have worse real-world performance than the i5 (due to more slower cores).
It's been a cooling problem on a lot of i9 laptops... the CPU will hit thermal peaks, then throttle down, which has an incredibly janky feel as a user... then it spins back up, and down... the performance curve is just wacky in general.
Today is almost worse, as the thermal limits will be set entirely different between laptop vendors on the same chips, so you can't even have apples to apples performance expectations from different vendors.
Same for the later generation Intel Macbook Pros... The i9 was so bad, and the throttling made it practically unusable for me. If it weren't a work issued laptop, I'd have either returned it, or at least under-volted and under-clocked it so it didn't hiccup every time I did anything at all.
I had an X1 Carbon like this, only it'd crash for no apparent reason. The internet consensus that Lenovo wouldn't own up to was that the i7 CPUs were overpowered for the cooling, so your best bet is either underclocking them or getting an i5.
Yeah, putting an i9 in any laptop that's not an XL gaming rig with big fans is very nearly always a waste of money (there might exist a few rare exceptions for some oddball workloads). Manufacturers selling i9s in thin & light laptops at an ultra price premium may fall just short of the legal definition of fraud but it's as unconscionable as snake-oil audiophile companies selling $500 USB cables.
Tbf 2 jobs ago I had a Dell enterprise workstation laptop, an absolute behemoth of a thing, it was like 3.5kg, it was the thicker variant of the two available with extra cooling, specifically sold to companies like ours needing that extra firepower, and it had a 20 core i9, 128GB of DDR5 CAMM ram, and a 3080Ti - I think the market price of that thing was around £14k, it was insane. And it had exactly that kind of behaviour I described - I would start compiling something in Visual Studio, I would briefly see all cores jump to 4GHz and then immediately throttle down to 1.2GHz, to a point where the entire laptop was unresponsive while the compilation was ongoing. It was a joke of a machine - I think that's more of a fraud than what you described, because companies like ours were literally buying hundreds of these from Dell and they were literally unsuitable for their advertised use.
(to add insult to the injury - that 3080Ti was literally pointless as the second you started playing any game the entire system would throttle so hard you had extreme stuttering in any game, it was like driving a lamborghini with a 5 second fuel reserve. And given that I worked at a games studio that was kinda an essential feature).
That's still assigning too much significance to the "i9" naming. Sometimes, the only difference between the i9 part and the top i7 part was something like 200MHz of single-core boost frequency, with the core counts and cache sizes and maximum power limit all being equal. So quite often, the i7 has stood to gain just as much from a higher-power form factor as the i9.
A machine learning model can place a CPU on the versioning manifold but I'm not confident that it could translate it to human speech in a way that was significantly more useful than what we have now.
At best, 14700KF-Intel+AMD might yield relevant results.
AI PC has been in the buzz for more than 2 years now (despite being a near-useless concept), and Intel has something like 75% market share in laptops. Both of those are well within the norm for an Intel marketing piece.
It's not really meant for consumers. Who would even visit newsroom.intel.com?
An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud.
https://newsroom.intel.com/artificial-intelligence/what-is-a...
It'd be interesting to see some market survey data showing the number of AI laptops sold & the number of users that actively use the acceleration capabilities for any task, even once.
Remove background from an image. Summarize some text. OCR to select text or click links in a screenshot. Relighting and centering you in your webcam. Semantic search for images and files.
A lot of that is in the first party Mac and Windows apps.
> Are ZBooks good or do I want an OmniBook or ProBook? Within ZBook, is Ultra or Fury better? Do I want a G1a or a G1i? Oh you sell ZBook Firefly G11, I liked that TV show, is that one good?
Apple is very consistent. You have the MacBook Air (lighter, more portable variant) and the MacBook Pro (more expensive and powerful variant). They don’t mess around with model numbers.
Apple is so "consistent" the way to know which kind of an Air or Pro it is, is to find the tiny print on the bottom that's a jumble of letters like "MGNE3" and google it.
And depending on what you're trying to use it for, you need to map it to a string like "MacBookAir10,1" or "A2337" or "Macbook Air Late 2022".
Oh also the Macbook Air (2020) is a different processor architecture than Macbook Air (2020).
The canonical way if you need a version number is the "about this Mac" dialog (here it says Mac Studio 2022).
If you need to be technical, System Information says Mac13,1 and these identifiers have been extremely consistent for about 30 years.
Your product number encodes much more information than that, and about the only time when it is actually required is to see whether it is eligible for a recall.
> Oh also the Macbook Air (2020) is a different processor architecture than Macbook Air (2020).
Right, except that one is MacBook Air (Retina, 2020), MacBookAir9,1, and the other is MacBook Air (M1, 2020), MacBookAir10,1. It happens occasionally, but the fact that you had to go back 5 years to a period in which the lineup underwent a double transition speaks volumes.
> Apple is very consistent. You have the MacBook Air (lighter, more portable variant) and the MacBook Pro (more expensive and powerful variant).
What about the iBook? That wasn’t tidy. Ebooks or laptops?
Or the iPhone 9? That didn’t exist.
Or MacOS? Versioning got a bit weird after 10.9, due to the X thing.
They do mess around with model numbers and have just done it again with the change to year numbers. I don’t particularly care but they aren’t all clean and pure.
> What about the iBook? That wasn’t tidy. Ebooks or laptops?
Back then, there were iBooks (entry-level) and PowerBooks (professional, high performance and expensive). There had been PowerBooks since way back in 1991, well before any ebook reader. I am not sure what your gripe is.
> Or the iPhone 9? That didn’t exist.
There’s a hole in the series. In what way is it a problem, and how on earth is it similar to the situation described in the parent?
> Or MacOS? Versioning got a bit weird after 10.9, due the X thing.
It never got weird. After 10.9.5 came 10.10.0. Version numbers are not decimals.
Seriously, do you have a point apart from "Apple bad"?
I’m not sure I hear people call MacOS X 10.10 “ten ten ten”. I think I remember them calling it “ten ten” verbally.
So you’d say “MacOS ten ten”.
At least that’s what I’m used to, it is entirely possible that’s what other people said and you would write it that way. No one wrote “MacOS X.10” or “MacOS X .10” but they would write “MacOS X 10.10”.
So yeah it's all a bit of a mess. There's a reason people often use the name of the release, like Snow Leopard or Tahoe, instead of the numbers.
"iBook" referred to a laptop from 1999 to 2006. "iBooks" referred to the eBook reader app and store from 2010 to 2019. I'll grant that there is some possibility for confusion, but only if the context of the conversation spans multiple decades but doesn't make it clear whether you're talking about hardware or software.
Back when there were MacBooks, it was MacBook (standard model), MacBook Air (lighter variant), and MacBook Pro (more expensive, high-performance variant). Sure, 3 is more complicated than 2, but come on.
If you really want to complain, you can go back to the first unibody MacBook, which did not fit that pattern, or the interim period when high-DPI displays were being rolled out progressively, but let's be serious. The fact is that even at the worst of times their range could be described in 2 sentences. Now, try to do that for any other computer brand. To my knowledge, the only other one with an understandable lineup was Microsoft, before they lost interest.
> The fact is that even at the worst of times their range could be described in 2 sentences.
It’s a good time to buy one. They are all good.
It would be interesting to know how many SKUs are hidden behind the simple purchase interface on their site. With the various storage and colour options, it must be over 30.
Loads, I assume. But those are things like "MacBook Pro M1 Max with a 1TB SSD and a matte screen coating" versus "MacBook Pro M1 with a 256GB SSD and a standard screen". The granularity of say Dell’s product numbers is not enough for that either, and you still need a long product number when searching their knowledge base.
Intel marketing isn’t the best but I am struggling to understand what issue you’re taking with this.
It’s an AI PC platform. It can do AI. It has an NPU and integrated GPU. That’s pretty straightforward. Competitors include Apple silicon and AMD Ryzen AI.
They’re predicting it’ll sell well, and they have a huge distribution network with a large number of partner products launching. Basically they’re saying every laptop and similar device manufacturer out there is going to stuff these chips in their systems. I think they just have some well-placed confidence in the laptop segment, because it’s supposed to combine the strong efficiency of the 200 series with the kind of strong performance that can keep up with or exceed competition from AMD’s current laptop product lineup.
Their naming sucks but nobody’s really a saint on that.
I can't believe we're still putting NPUs into new designs.
Silicon taken up that could've been used for a few more compute units on the GPU, which is often faster at inference anyway and way more useful/flexible/programmable/documented.
You can thank Microsoft for that. Intel architects in fact did not want to waste area on an NPU. That caused Microsoft to launch their AI-whatever branded PCs with Qualcomm, who were happy to throw in whatever Microsoft wanted in order to be the launch partner. After that, Intel had to follow suit to make Microsoft happy.
That doesn’t explain why Apple “wastes” die area on their NPU.
The thing is, when you get an Apple product and you take a picture, those devices are performing ML tasks while sipping battery life.
Microsoft maybe shouldn’t be chasing Apple especially since they don’t actually have any marketshare in tablets or phones, but I see where they’re getting at: they are probably tired of their OS living on devices that get half the battery life of their main competition.
And here’s the thing, Qualcomm’s solution blows Intel out of the water. The only reason not to use it is because Microsoft can’t provide the level of architecture transition that Apple does. Apple can get 100% of their users to switch architecture in about 7 years whenever they want.
Bingo. Maybe Microsoft shouldn’t even be chasing them but I think they have a point to try and stay competitive. They can’t just have their OS getting half the battery life of their main competitor.
When you use an Apple device, it’s performing ML tasks while barely using any battery life. That’s the whole point of the NPU. It’s not there to outperform the GPU.
Every modern chip needs some percentage dedicated to dark silicon. There is no cheating the thermal reality. You could add more compute units in the GPU, but you then have to make up for it somewhere else. It’s a balancing act.
The Core Ultra lineup is supposed to be low-power, low-heat, right? If you want more compute power, pick something from a different product series.
> Every modern chip needs some percentage dedicated to dark silicon. There is no cheating the thermal reality. You could add more compute units in the GPU, but you then have to make up for it somewhere else. It’s a balancing act.
I think that "dark silicon" mentality is mostly lingering trauma from when the industry first hit a wall with the end of Dennard scaling. These days, it's quite clear that you can have a chip that's more or less fully utilized, certainly with no "dark" blocks that are as large as a NPU. You just need to have the ability to run the chip at lower clock speeds to stay within power and thermal constraints—something that was not well-developed in 2005's processors. For the kind of parallel compute that GPUs and NPUs tackle, adding more cores but running them at lower clock speeds and lower voltages usually does result in better efficiency in practice.
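To put rough numbers on that wide-and-slow point, here is a back-of-the-envelope sketch using the usual dynamic-power approximation (the 20% voltage drop at half clock is a made-up illustrative figure, not a measured one):

    P_dynamic ≈ C · V^2 · f

    1 unit  at f,   V:      throughput ∝ f            power ∝ 1.00 · C·V^2·f
    2 units at f/2, 0.8·V:  throughput ∝ 2·(f/2) = f  power ∝ 2 · (0.8)^2 · (1/2) = 0.64 · C·V^2·f

Same throughput for roughly a third less dynamic power, which is why adding units and dropping clocks/voltage tends to win on efficiency (ignoring leakage and the extra die area).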
The real answer to the GPU vs NPU question isn't that the GPU couldn't grow, but that the NPU has a drastically different architecture making very different power vs performance tradeoffs that theoretically give it a niche of use cases where the NPU is a better choice than the GPU for some inference tasks.
It's... the launch vehicle for a new process. Literally the opposite of "cost cutting", they went through the trouble of tooling up a whole fab over multiple years to do this.
Will 18A beat TSMC and save the company? We don't know. But they put down a huge bet that it would, and this is the hand that got dealt. It's important, not something to be dismissed.
Lunar Lake integrated DRAM on the package, which was faster and more power efficient, this reverts that. They also replaced part of the chip from being sourced from TSMC to from themselves. And if their foundry is competitive, they should be shaking other foundry customers down the way TSMC is.
If they have actually mostly caught up to TSMC, props, but also, I wish they hadn't given up on EUV for so long. Instead they decided to ship chips overclocked so high they burn out in months.
> Lunar Lake integrated DRAM on the package, which was faster and more power efficient, this reverts that.
On-package memory is slightly more power efficient, but it isn't any faster; it still uses industry-standard LPDDR. And Panther Lake supports faster LPDDR than Lunar Lake, so it's definitely not a regression.
I don't see how any of that substantiates "Panther Lake and 18A are just cost cutting efforts vs. Lunar Lake". It mostly just sounds like another boring platform flame.
Again, you're talking about the design of Panther Lake, the CPU IC. No one cares, it's a CPU. The news here is the launch of the Intel 18A semiconductor process and the discussion as to if and how it narrows or closes the gap with TSMC.
Trying to play this news off as "only cost cutting" is, to be blunt, insane. That's not what's happening at all.
I'm not GP, but I think that it really doesn't matter if Intel is able to sell this process to other companies. But if they're only producing their own chips on it, that's quite a valid criticism.
And for the fourth time, it may be a valid "criticism" in the sense of "Does Intel Suck or Rule?". It does not validate the idea that this product release, which introduces the most competitive process from this company in over a decade, is merely a "cost reduction" change.
It's only as exciting as a cost reduction because they're playing catch-up by trying to not need to outsource their highest performance silicon. Let me know when Intel gets perf/watt to be high enough to be of interest to Apple, gamers, or anyone who isn't just buying a basic PC because their old one died, or an Intel server because that's what they've always had.
Every single performance figure in TFA is compared to their own older generations, not to competitors.
Putting on my CISO hat, if they release the source, someone else could then create an app, but this time maliciously with said exfiltration of information, and publish it on play with paid ad time.
You seem to be confused about your terms: both SSR and SSG can rehydrate and become interactive. You only need SSR if you have personalized content that must be fetched on an actual user request, and with frameworks like Astro introducing the islands concept, you can even mix SSG and SSR content on a single page.
That depends on how you interpret "static render".
I did not interpret that as React SSG. SSG is the default behavior of NextJS unless you dynamically fetch data, turning it into SSR automatically.
What I thought of is React's "renderToString()" at build time which will produce static HTML with event handlers stripped, in preparation for a later "hydrateRoot()" on the client side.
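For anyone who hasn't used those APIs, a minimal sketch of that split (App and the "root" element id are just placeholders, not anything from the thread):

    // build.ts - run at build time: produce static HTML, event handlers stripped
    import { renderToString } from "react-dom/server";
    import { createElement } from "react";
    import { App } from "./App"; // placeholder component
    const html = renderToString(createElement(App));
    // write `html` into a static page template, e.g. inside <div id="root">

    // client.tsx - runs in the browser: attach event handlers to that markup
    import { hydrateRoot } from "react-dom/client";
    import { createElement } from "react";
    import { App } from "./App";
    hydrateRoot(document.getElementById("root")!, createElement(App));

NextJS's SSG/SSR machinery is obviously more involved, but conceptually it's this same split: render to HTML ahead of time (or per request, for SSR), then hydrate on the client.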
What's concerning to many of us is that you (and others) have said this same thing s/Opus 4.5/some other model/
That feels more like chasing than a clear line of improvement. It reads very differently from something like "my habits have changed quite a bit since reading The Art of Computer Programming". They're categorically different.
It's because the models keep getting better! What you could do with GPT-4 was more impressive than what you could do with GPT 3.5. What you could do with Sonnet 3.5 was more impressive yet, and Sonnet 4, and Sonnet 4.5.
Some of these improvements have been minor, some of them have been big enough to feel like step changes. Sonnet 3.7 + Claude Code (they came out at the same time) was a big step change; Opus 4.5 similarly feels like a big step change.
If you're sincerely trying these models out with the intention of seeing if you can make them work for you, and doing all the things you should do in those cases, then even if you're getting negative results somehow, you need to keep trying, because there will come a point where the negative turns positive for you.
If you're someone who's been using them productively for a while now, you need to keep changing how you use them, because what used to work is no longer optimal.
Models keep getting better but the argument I'm critiquing stays the same.
So does the comment I critiqued in the sibling comment to yours. I don't know why it's so hard to believe we just haven't tried. I have a Claude subscription. I'm an ML researcher myself. Trust me, I do try.
But that last part also makes me keenly aware of their limitations and failures. Frankly I don't trust experts who aren't critiquing their field. Leave the selling points to the marketing team. The engineer and researcher's job is to be critical. To find problems. I mean how the hell do you solve problems if you're unable to identify them lol. Let the marketing team lead development direction instead? Sounds like a bad way to solve problems
> benchmark shows huge improvements
Benchmarks are often difficult to interpret. It is really problematic that they got incorporated into marketing. If you don't understand what a benchmark measures, and more importantly, what it doesn't measure, then I promise you that you're misunderstanding what those numbers mean.
For METR I think they say a lot right here (emphasis my own) that reinforces my point
> Current frontier AIs are vastly better than humans at text prediction and knowledge tasks. They outperform experts on most *exam-style problems* for a fraction of the cost. ... And yet the best AI agents are not currently able to carry out substantive projects by themselves or directly substitute for human labor. *They are unable to reliably handle even relatively low-skill*, computer-based work like remote executive assistance. It is clear that capabilities are increasing very rapidly in some sense, but it is unclear how this corresponds to real-world impact.
So make sure you're really careful to understand what is being measured. What improvement actually means. To understand the bounds.
It's great that they include longer tasks but also notice the biases and distribution in the human workers. This is important in properly evaluating.
Also remember what exactly I quoted. For a long time we've all known that being good at leetcode doesn't make one a good engineer. But it's an easy thing to test and the test correlates with other skills that are likely to be learned to be good at those tests (despite being able to metric hack). We're talking about massive compression machines. That pattern match. Pattern matching tends to get much more difficult as task time increases but this is not a necessary condition.
Treat every benchmark adversarially. If you can't figure out how to metric hack it, then you don't know what the benchmark is measuring (and just because you know what can hack it doesn't mean you understand it, nor that that's what is being measured).
I think you should ask yourself: If it were true that 1) these things do in fact work, 2) these things are in fact getting better... what would people be saying?
The answer is: Exactly what we are saying. This is also why people keep suggesting that you need to try them out with a more open mind, or with different techniques: Because we know with absolute first-person iron-clad certainty what is possible, and if you don't think it's possible, you're missing something.
It seems to be "people keep saying the models are good"?
That's true. They are.
And the reason people keep saying it is because the frontier of what they do keeps getting pushed back.
Actual, working, useful code completion in the GPT 4 days? Amazing! It could automatically write entire functions for me!
The ability to write whole classes and utility programs in the Claude 3.5 days? Amazing! This is like having a junior programmer!
And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!
But now we are beginning to see that programming in 6 months' time might look very different to now because these AI systems code very differently to us. That's exactly the point.
So what is it you are arguing against?
I think you said you didn't like that people are saying the same thing, but in this post it seems more complicated?
> And now, with Opus 4.5 or Codex Max or Gemini 3 Pro we can write substantial programs one-shot from a single prompt and they work. Amazing!
People have been doing this parlor trick with various "substantial" programs [1] since GPT 3. And no, the models aren't better today, unless you're talking about being better at the same kinds of programs.
[1] If I have to see one more half-baked demo of a running game or a flight sim...
It’s a vague statement that I obviously cannot defend in all interpretations, but what I mean is: the performance of models at making non-trivial applications end-to-end, today, is not practically better than it was a few years ago. They’re (probably) better at making toys or one-shotting simple stuff, and they can definitely (sometimes) crank out shitty code for bigger apps that “works”, but they’re just as terrible as ever if you actually understand what quality looks like and care to keep your code from descending into entropy.
I think "substantial" is doing a lot of heavy lifting in the sentence I quoted. For example, I’m not going to argue that aspects of the process haven’t improved, or that Claude 4.5 isn't better than GPT 4 at coding, but I still can’t trust any of the things to work on any modestly complex codebase without close supervision, and that is what I understood the broad argument to be about. It's completely irrelevant to me if they slay the benchmarks or make killer one-shot N-body demos, and it's marginally relevant that they have better context windows or now hallucinate 10% less often (in that they're more useful as tools, which I don't dispute at all), but if you want to claim that they're suddenly super-capable robot engineers that I can throw at any "substantial" problem, you have to bring evidence, because that's a claim that defies my day-to-day experience. They're just constantly so full of shit, and that hasn't changed, at all.
FWIW, this line of argument usually turns into a motte-and-bailey fallacy, where someone makes an outrageous claim (e.g. "models have recently gained the ability to operate independently as a senior engineer!"), and when challenged on the hyperbole, retreats to a more reasonable position ("Claude 4.5 is clearly better than GPT 3!"), but with the speculative caveat that "we don't know where things will be in N years". I'm not interested in that kind of speculation.
Have you spent much time with Codex 5.1 or 5.2 in OpenAI Codex, or Claude Opus 4.5 in Claude Code, over the last ~6 weeks?
I think they represent a meaningful step change in what models can build. For me they are the moment we went from building relatively trivial things unassisted to building quite large and complex systems that take multiple hours, often still triggered by a single prompt.
- A WebAssembly runtime in Python which I haven't yet published
The above projects all took multiple prompts, but were still mostly built by prompting Claude Code for web on my iPhone in between Christmas family things.
I'm not confident any of these projects would have worked with the coding agents and models we had had four months ago. There is no chance they would've worked with the January 2025 available models.
I’ve used Sonnet 4.5 and Codex 5 and 5.1, but not in their native environment [1].
Setting aside the fact that your examples are mostly “replicate this existing thing in language X” [2], again, I’m not saying that the models haven’t gotten better at crapping out code, or that they’re not useful tools. I use them every day. They're great tools, when someone actually intelligent is using them. I also freely concede that they're better tools than a year ago.
The devil is (as always) in the details: how many prompts did it take? what exactly did you have to prompt for? how closely did you look at the code? how closely did you test the end result? Remember that I can, with some amount of prompting, generate perfectly acceptable code for a complex, real-world app, using only GPT 4. But even the newest models generate absolute bullshit on a fairly regular basis. So telling me that you did something complex with an unspecified amount of additional prompting is fine, but not particularly responsive to the original claim.
[1] Copilot, with a liberal sprinkling of ChatGPT in the web UI. Please don’t engage in “you’re holding it wrong” or "you didn't use the right model" with me - I use enough frontier models on a regular basis to have a good sense of their common failings and happy paths. Also, I am trying to do something other than experiment with models, so if I have to switch environments every day, I’m not doing it. If I have to pay for multiple $200 memberships, I’m not doing it. If they require an exact setup to make them “work”, I am unlikely to do it. Finally, if your entire argument here hinges on a point release of a specific model in the last six weeks…yeah. Not gonna take that seriously, because it's the same exact argument, every six weeks. </caveats>
[2] Nothing really wrong with this -- most programming is an iterative exercise of replicating pre-existing things with minor tweaks -- but we're pretty far into the bailey now, I think. The original argument was that you can one-shot a complex application. Now we're in "I can replicate a large pre-existing thing with repeated hand-holding". Fine, and completely within my own envelope for model performance, but not really the original claim.
I know you said don't engage in "you're holding it wrong"... but have you tried these models running in a coding agent tool loop with automatic approvals turned on?
Copilot style autocomplete or chatting with a model directly is an entirely different experience from letting the model spend half an hour writing code, running that code and iterating on the result uninterrupted.
Here's an example where I sent a prompt at 2:38pm and it churned away for 7 minutes (executing 17 bash commands), then I gave it another prompt and it churned for half an hour and shipped 7 commits with 160 passing tests: https://static.simonwillison.net/static/2025/claude-code-mic...
> I know you said don't engage in "you're holding it wrong"... but have you tried these models running in a coding agent tool loop with automatic approvals turned on?
edit: I wrote a different response here, then I realized we might be talking about different things.
Are you asking if I let the agents use tools without my prior approval? I do that for a certain subset of tools (e.g. run tests, do requests, run queries, certain shell commands, even use the browser if possible), but I do not let the agents do branch merges, deploys, etc. I find that the best models are just barely good enough to produce a bad first draft of a multi-file feature (e.g. adding an entirely new controller+view to a web app), and I would never ever consider YOLOing their output to production unless I didn't care at all. I try to get to tests passing clean before even looking at the code.
Also, while I am happy to let Copilot burn tokens in this manner and will regularly do it for refactors or initial drafts of new features, I'm honestly not sure if the juice is worth the squeeze -- I still typically have to spend substantial time reworking whatever they create, and the revision time required scales with the amount of time they spend spinning. If I had to pay per token, I'd be much more circumspect about this approach.
Yes, that's what I meant. I wasn't sure if you meant classic tab-based autocomplete or the tool-based Copilot agent.
Letting it burn tokens on running tests and refactors (but not letting it merge branches or deploy) is the thing that feels like a huge leap forward to me. We are talking about the same set of capabilities.
For me it is something I can describe in a single casual prompt.
For example I wrote a fully working version of https://tools.nicklothian.com/llm_comparator.html in a single prompt. I refined it and added features with more prompts, but it worked from the start.
Good question. No strict line, and it's always going to be subjective and a little bit silly to categorize, but when I'm debating this argument I'm thinking: a product that does not exist today (obviously many parts of even a novel product will be completely derivative, and that's fine), with multiple views, controllers, and models, and a non-trivial amount of domain-specific business logic. Likely 50k+ lines of code, but obviously that's very hand-wavy and not how I'd differentiate.
Think: SaaS application that solves some domain-specific problem in corporate accounting, versus "in-browser spreadsheet", or "first-person shooter video game with AI, multi-player support, editable levels, networking and high-resolution 3D graphics" vs "flappy bird clone".
When you're working on a product of this size, you're probably solving problems like the ones cited by simonw multiple times a week, if not daily.
But re-reading your statement, you seem to be claiming that there are no 50k-line SaaS apps that are built even using multi-shot techniques (i.e., building a feature at a time).
- It's 45K of python code
- It isn't a duplicate of another program (indeed, the reason it isn't finished is because it is stuck between ISO Prolog and SWI Prolog and I need to think about how to resolve this, but I don't know enough Prolog!)
- Not a *single* line of code is hand written.
Ironically this doesn't really prove that the current frontier models are better because large amounts of code were written with non-frontier models (You can sort of get an idea of what models were used with the labels on https://github.com/nlothian/Vibe-Prolog/pulls?q=is%3Apr+is%3...)
But - importantly - this project is what convinced me that the frontier models are much better than the previous generation. There were numerous times I tried the same thing in a non-Frontier model which couldn't do it, and then I'd try it in Claude, Codex or Gemini and it would succeed.
Is there an endpoint for AI improvement? If we can go from functions to classes to substantial programs then it seems like just a few more steps to rewriting whole software products and putting a lot of existing companies out of business.
"AI, I don't like paying for my SAP license, make me a clone with just the features I need".
- Models keep getting better[0]
- Models since GPT 3 are able to replace junior developers
It's true that both of these can be true at the same time, but they are still in tension. We're not seeing agents ready to replace mid-level engineers, and quite frankly I've yet to see a model actually ready to replace juniors. Possibly low-end interns, but the major utility of interns is a trial run for employment. Frankly, it still seems like interns and juniors are advancing faster than these models in the type of skills that matter for companies (not to mention that institutional knowledge is quite valuable). But there are interns who started when GPT 3.5 came out who are seniors now.
The problem is we've been promised that these employees would be replaced[1] any day now, yet that's not happening.
People forget, it is harder to advance when you're already skilled. It's not hard to go from non-programmer to junior level. Hard to go from junior to senior. And even harder to advance to staff. The difficulty level only increases. This is true for most skills, and this is where there's a lot of naivety. We can be advancing faster while the actual capabilities begin to crawl forward rather than leap.
[0] Implication is not just at coding test style questions but also in more general coding development.
[1] Which has another problem in the pipeline. If you don't have junior devs and are unable to replace both mid and seniors by the time that a junior would advance to a senior then you have built a bubble. There's a lot of big bets being made that this will happen yet the evidence is not pointing that way.
Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.
Why do you use the word "chasing" to describe this? I don't understand. Maybe you should try it and compare it to earlier models to see what people mean.
> Why do you use the word "chasing" to describe this?
I think you'll get the answer to this if you read my comment and your response to understand why you didn't address mine.
Btw, I have tried it. It's annoying that people think the problem is not trying. It was getting old when GPT 3.5 came out. Let's update the argument...
Looking forward to hearing about how you're using Opus 4.5, from my experience and what I've heard from others, it's been able to overcome many obstacles that previous iterations stumbled on
Please do. I'm trying to help other devs in my company get more out of agentic coding, and I've noticed that not everyone is defaulting to Opus 4.5 or even Codex 5.2, and I'm not always able to give good examples to them for why they should. It would be great to have a blog post to point to…
> Opus 4.5 is categorically a much better model from benchmarks and personal experience than Opus 4.1 & Sonnet models. The reason you're seeing a lot of people wax about O4.5 is that it was a real step change in reliable performance. It crossed for me a critical threshold in being able to solve problems by approaching things in systematic ways.
Reality is we went from LLMs as chatbots editing a couple of files per request with decent results, to running multiple coding agents in parallel to implement major features based on a spec document and some clarifying questions - in a year.
Even IF LLMs don't get any better, there is a mountain of lemons left to squeeze in their current state.