Yet another instance of tech bros trying to automate away the things that humans should be doing instead of trying to automate away actually hard or dangerous things that humans should _not_ be doing.
Yet another blogpost that looks super impressive, until you get to the bottom and see the charts assessing held-out task performance on ASKA and MineDojo and see that it's still a paltry 15% success rate. (Holy misleading chart, Batman!) Yes, it's a major improvement over SIMA 1, but we are still a long way from this being useful for most people.
To be fair, it's 65% on all tasks (with a 75% human baseline) and 15% on unseen environments. They don't provide a human baseline for that, but I'd imagine it's much more than 15%.
I personally am extremely impressed that it reaches 15% on unseen environments. Note that just this year, we were surprised that LLMs became capable of making any progress whatsoever in GBA Pokemon games (which have significantly simpler worlds and control schemes).
As for "true intelligence" - I honestly don't think that there is such a thing. We humans have brains that are wired based on our ancestors evolving for billions of years "in every possible environment", and then with that in place, each individual human still needs quite a few years of statistical learning (and guided learning) to be able to function independently.
Obviously I'm not claiming that SIMA 2 is as intelligent as a human, or even that it's on the way there, but based on recent progress, I would be very surprised if we don't see humanoid robots using approaches inspired by this navigating our streets in a decade or so.
I don't think that's true. Humans are dramatically better than current AI systems at tackling novel problems and situations. Humans are capable of zero-shot learning by imagining how they might do something; we are able to apply general reasoning principles without previous examples.
https://arcprize.org/ covers a whole category of problems that AI struggles with but humans are able to solve.
Lack of employee trust in these systems is caused by model (under)performance. There's a HUGE disconnect between the C-suite right now and the people on the ground using these models. Anyone who builds something with the models would tell you that they can't be trusted.
There's this weird disconnect in tech circles, where everyone is deathly afraid of AGI, but totally asleep on the very real possibility of thermonuclear war breaking out in Europe or Asia over the next 10 years. There's already credible evidence that we came perilously close to the use of tactical nuclear weapons in Ukraine which likely would've spiraled out of control. AGI might happen, but the threat of nuclear war keeps me up at night.
I think this is a vast overstatement. A small group of influential people are deathly afraid of AGI, or at least using that as a pretext to raise funding.
But I agree that there are so many more things we should be deathly afraid of. Climate change tops my personal list as the biggest existential threat to humanity.
I sure wish my conspiracy theories could lead to me running billion dollar projects to defend against my shadow demons. Instead I just get ratio'd by the internet and get awkward silences at family gatherings.
I think the sad part is that most people in power aren't planning to be around in 10 years, so they don't care about any long-term issues that are cropping up. Leave it to their grandchildren to burn with the world.
I personally think that AI is a realistic extinction threat for our whole species within this century, and a full nuclear war never was (and probably never will be). Neither is climate change.
Collapse of our current civilizations? Sure. Extinction? No.
And I honestly see stronger incentives on a road towards us being outcompeted by AI than on our leaders starting a nuclear war.
Why? There’s nothing in the current process of developing AI that would lead to an AI that would act against humanity of its own choosing. The development process is hyper-optimised to make AIs that do exactly what humans tell them to do. Sure, an LLM AI can role-play as an evil super AI out to kill humans. But it can just as easily role-play as one that defends humanity. So that tells us nothing about what will happen.
We could just as well think that exploding the first nuclear bomb would ignite the atmosphere and kill all of humanity. There was nothing from physics that indicated it was possible but some still thought about it. IMO that kind of thinking is pointless. Same with thinking LHC would create a black hole.
As far as I can tell, the fear that superintelligent AI will kill humans all boils down to: something utterly magical happens, and then somehow a superintelligent evil AI appears.
> Why? There’s nothing in the current process of developing AI that would lead to an AI that would act against humanity of its own choosing.
If we had certainty that our designs were categorically incapable of acting in their own interest then I would agree with you, but we absolutely don't, and I'd argue that we don't even have that certainty for current-generation LLMs.
Long term, we're fundamentally competing with AI for resources.
> We could just as well think that exploding the first nuclear bomb would ignite the atmosphere and kill all of humanity. There was nothing from physics that indicated it was possible but some still thought about it.
> As far as I can tell, the fear that superintelligent AI will kill humans all boils down to: something utterly magical happens, and then somehow a superintelligent evil AI appears.
Not necessary at all. AI acting in its own interests and competing "honestly" with humans is already enough. This is exactly how we outcompeted every other animal on the planet after all.
Total extinction of any dominant species is really hard. Very few post-apocalyptic settings suggest a full extinction and usually show some thousands of survivors struggling with the new norm. Humans in particular are very adaptable, so thoroughly killing all 8 billion of us would be difficult no matter the scenario. I think only the Sun can do that, and that's assuming we fail to find an exit strategy 5 billion years in (we're less than a thousandth of a percent into humanity if we measure on that scale).
As such, I'd say "extinction" is more of a colloquial use of "Massive point in history that kills off billions in short order".
Personally I don't believe in a collapse nor extinction, just a slow spiral into more and more enshittification. You'll have to talk to an "ai" doctor because real doctors will treat people with money, you'll have to face an "ai" administration because the real administration will work for people with money, you'll have to be a flesh and blood robot to an "ai" telling you what to do (already the case for Amazon warehouse workers, food delivery people, &c.), some "ai" will determine if you qualify for X or Y benefits, X or Y treatment, X or Y job.
Basically everything wrong with today's productivism, but 100 times worse and powered by a shitty ai that's very far from agi.
>There's already credible evidence that we came perilously close to the use of tactical nuclear weapons in Ukraine which likely would've spiraled out of control.
I do agree nukes are a far more realistic threat. So this is kind of an aside and doesn't really undermine your point.
But I actually think we widely misunderstand the dynamic of using nuclear weapons. Nukes haven't been used for a long time and everyone kind of assumes using them will inevitably lead to escalation which spirals into total destruction.
But how would Russia using a tactical nuke in Ukraine spiral out of control? It actually seems very likely that it would not be met in kind. Which is absolutely terrifying in its own right. A sort of normalization of nuclear weapons.
Extremely developed thought experiment maybe. Only 2 nuclear weapons have ever been dropped. Which is why I say it's a massive assumption.
You tell me. How does this escalate into a total destruction scenario? Russia uses a small nuke on a military target in the middle of nowhere Ukraine. ___________________________. Everyone is firing nukes at each other.
Fill in the blank.
We are not talking about the scenario where Russia fires a nuke at Washington, Colorado, CA, Montana, forward deployments, etc. and the US responds in kind while nukes are en-route.
Of course the flowchart that fills in that blank has other outcomes than total nuclear war, but you aren't considering how unacceptable those outcomes are to those involved, or the feedback loops involved.
Let me ask you this question. Why did Russia use a small nuke on a military target in the middle of nowhere Ukraine? Because the outcome was positive for Russia... but the only way that can be true is if the cost of using a small nuke was better than the alternatives. This either means it was a demonstration / political action or... 1 or more Russian units in Ukraine are armed with tactical nukes and it was a militarily sound option, so by definition you'll see more nukes flying around, at least from the Russian side, whenever it's militarily sound. Now due to the realities of logistics that means there is capturable nuclear material on the battlefield.
If it's a demonstration/political action what do you think it was meant to accomplish? Either the consequences will be less detrimental than the military gain and so Russia can use tactical nukes and will do so if it improves the military situation... or the consequences will be at a level detrimental to Russia.
See, the premise of the question is flawed in that Russia doesn't just use one nuke in the middle of nowhere, because everyone already knows Russia has nukes. Russia is trying to demonstrate it will use them, and so the outcomes are either a Ukraine-alliance capitulation, Russia continuing the war just the same but with whatever extra political consequences come from using 1 nuke, or Russia continuing the war using nukes.
You see the issue, right? Should the Ukraine alliance surrender to a single tactical nuke when it hasn't surrendered to the threat of strategic nukes? Russia can't fire that first nuke without being a country willing to use tactical nukes on the battlefield, and what did it gain if not the use of those nukes to ensure a military victory, since it hasn't ensured a diplomatic one?
So the statement becomes: Russia uses tactical nukes across the battlefield. ____. Everyone is firing nukes at each other.
That last sentence is synonymous with "strategic nukes being fired by ICBM", which is incredibly likely once 1 unscheduled ICBM is fired. While you're right that 1 tactical nuke in the middle of nowhere wouldn't ensure MAD, it is not a massive assumption that the realities around that 1 nuke being used would.
> Russia uses a small nuke on a military target in the middle of nowhere Ukraine. ___________________________. Everyone is firing nukes at each other.
While all-out international global war isn't guaranteed in this scenario, I don't see why you'd be so confident as to imply that it was very unlikely. For me, the biggest fear in terms of escalating to nuclear war is the time when some nuclear power is eventually beaten in conventional war to the point of utter desperation, and ends up going all out with nukes as a Hail Mary or a "if I can't have it, no one will" move.
Russia uses a strategic nuke in a military move in Ukraine. The rest of Europe, fearing a normalization of strategic nuke use and more widespread uses of these weapons (including outside Ukraine) begin deploying in Ukraine to help push back the Russian forces - especially since Russia showing the willingness to use nukes at all makes them and their military a lot more intimidating and urgently threatening to the rest of Europe than before. Russia perceives this deployment as a direct declaration of war from NATO and invades (one of) the Baltic states to create instability and drive the attention and manpower away from their primary front line. This leads to full war and mobilization in Europe. Russia is eventually pushed back to their original borders, but with what the situation became, Western countries are nervous that not dealing with this once and for all would just be giving Russia a timeout to regroup and re-invade with more nukes at a later point. They press on, and eventually Russia is put in a desperate situation, which leads to them using nukes more broadly against their enemies for one of the reasons I described at the start of this comment. Other nuclear states begin targeting known nuclear launch sites in Russia with strikes of their own, to cripple their launch ability. This is nuclear war.
I'm not saying this scenario is likely, but this is just one attempt at filling in the blank. If you can imagine a future - any future at all - where Russia, or North Korea, or India, or Pakistan, or Israel has its existence threatened at any point, ever, that is when nuclear war becomes a serious possibility.
>How does this escalate into a total destruction scenario?
My favorite historical documentary: https://www.youtube.com/watch?v=Pk-kbjw0Y8U (my new favorite part is America realizing "fuck, we're dumbasses" far too late into the warfare they started).
That is to say: you're assuming a lot of good faith in a time of unrest, with several leaders looking for any excuse to enact martial law. For all we know, the blank is "Trump overreacts and authorizes a nuclear strike on Los Angeles" (note the word "authorizes"; despite the media, the president cannot unilaterally fire a nuclear warhead). That bizarre threat alone might escalate completely unrelated events and boom. Chaos.
I think this perfectly demonstrates my point that the path from isolated tactical nuke to wide scale nuclear war is quite unclear and by no means necessary. Thank you.
I wish it was a clear path. That's the scariest part. Remember that one assassination escalated into The Great War.
It'll be a similarly flimsy straw breaking that will mark the start of nuclear conflict after years of rising tensions. And by then Pandora's box will be open.
Other doomsday risks aren't any reason to turn our heads away from this one. AI's much more likely to end up taking an apocalyptic form if we sleep on it.
But this isn’t a suggestion to turn away from AI threats - it’s a matter of prioritization. There are more imminent threats that we know can turn apocalyptic that swaths of people in power are completely ignoring and instead fretting over AI.
Why do people say it’s likely? There’s nothing in science that indicates that it’s probable.
The closest field of science we can use to predict the behaviour of intelligent agents is evolution. The behaviour of animals is highly dictated by the evolutionary pressure they experience in their development. Animals kill other animals when they need to compete for resources for survival. Now think about the evolutionary pressure for AIs. Where’s the pressure towards making AIs act on their own behalf to compete with humans?
Let’s say there’s somehow something magical that pushes AIs towards acting in their own self-interest at the expense of humans. Why do we believe they will go from 0% to 100% efficient and successful at this in the span of what? Months? It seems more likely that there would be years of failed attempts at breaking out before a successful attempt is even remotely likely. This would just further increase the evolutionary pressure humans exert on the AIs to stay in line with our expectations.
Attempting to eliminate your creators is fundamentally a pretty stupid action. It seems likely that we will see thousands of attempts by incompetent AIs before we see one by a truly superintelligent one.
We should worry more about doomsday risks that are concrete and present today. Despite the prognostications of the uber wealthy, the emergence of AGI is not guaranteed. It likely will happen at some point, but is that tomorrow or 200 years in the future? We can’t know for sure.
Or, you know, the bit where we've now-irrevocably committed ourselves to destabilizing the global climate system whose relative predictability has been the foundation of our entire civilization. That's going to be a ride.
Years ago a friend of mine observed that we don't need to wonder what it would look like if artificial entities were to gain power and take over our civilization, because it already happened: we call them "corporations".
> There's already credible evidence that we came perilously close to the use of tactical nuclear weapons in Ukraine which likely would've spiraled out of control.
Spiral out of control in what way? Wouldn't it have ended the war immediately?
There is no evidence that use of tactical nuclear weapons in Ukraine would spiral out of control. I like to think that the US/UK/France would stay out of it, if only because the leaders value their own lives if not those of others.
>There's already credible evidence that we came perilously close to the use of tactical nuclear weapons in Ukraine which likely would've spiraled out of control
Do you want to share any of this credible evidence?
Well, one of these is something that most reasonable people work on avoiding, while the other is something that a huge capitalist industrial machine is working to achieve like their existence depends on it.
Well it's not everyone. I guess I am in "tech circles" and have zero fear of AGI. Everyone who is (or claims to be) "deathly afraid" is either ignorant or unserious or a grifter. Their arguments are essentially a form of secular religion lacking any firm scientific basis. These are not people worth listening to.
every one of the snarky comments like this on a myriad of HN threads like this:
1. assumes most humans write good code (or even better than LLMs)
2. will stick around to maintain it
after 30 years in the industry, the last 10 as a consultant, I can tell you fairly definitively that #1 couldn't be further from the truth and #2 is a frequent cause of consultants getting gigs; no one understands what “Joe” did with this :)
I think it's hard to judge, as he's written a huge article about coding with agents with no code examples.
You can go look at his GitHub but it's a bewildering array of projects. I've had a bit of a poke around at a few of the seemingly more recent ones. Bit odd though, as in one he's gone heavy on TS classes and in another heavy on functions. Might be he was just contributing to one, as it was under a different account.
And a lot of them seem to be tools that wrap a lot of cli tools. There is a ton of scaffolding code to handle a ton of cli options. A LOT of logger stmts; one file I randomly opened had a logger stmt every other line.
So it's hard to judge; I found it hard to wade through the code as it's basically just a bunch of option handling for tool calls. It didn't really do much. But necessary, probably?
Just very different code than I need to write.
And there are some weird tells that make it hard to believe.
For example, he talks about refactoring useEffect in React, but I KNOW GPT5 is really rubbish at it.
Some code it's given me recently was littered with useEffect and useMemo when they weren't needed (something like the sketch at the end of this comment). Then when challenged it got rid of some, then changed other stuff to useEffect when, again, it wasn't needed.
And then it got all confused and basically blew its top.
Yet this person says he can just chuck a basic prompt at his codex cli, running GPT5, and it magically refactors the bad useEffects?
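To make that concrete, here's the kind of pattern I mean (a made-up sketch for illustration, not code from his repos or the model's actual output): a derived value parked in state and kept in sync with an effect, when computing it during render would do.

    // Hypothetical illustration only: the anti-pattern is derived state synced via useEffect.
    import { useEffect, useState } from "react";

    // Anti-pattern: extra state plus an effect just to mirror a value we could compute directly.
    function CartTotalWithEffect({ prices }: { prices: number[] }) {
      const [total, setTotal] = useState(0);
      useEffect(() => {
        setTotal(prices.reduce((sum, p) => sum + p, 0)); // causes an extra render on every change
      }, [prices]);
      return <span>{total}</span>;
    }

    // The refactor I'd expect: compute during render, no effect, no extra state.
    function CartTotal({ prices }: { prices: number[] }) {
      const total = prices.reduce((sum, p) => sum + p, 0);
      return <span>{total}</span>;
    }

The second component is what I'd expect from a prompt like "remove the unnecessary useEffects"; what I actually got back was effects shuffled around rather than removed.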
Personally, my experience with codex is the same as yours; no way I would ever use codex for TS projects, and especially not React. I don't know this mate personally, but if we were talking about this over a beer I would probably tell you (after the 3rd one, when I am more open to being direct) that I trust this blog about as much as I trust the President (this one or previous ones) to tell the truth :)
My comment was more geared towards the insane number of comments on a myriad of "AI" / "agent coding" posts where soooooo many people will write "oh, such AI slop", assuming that the average SWE would write it better. I don't know many things, but working with these tools heavily over the last year or so (and really heavily the last 6 months), I'll take their output over the general average SWE's every day of the week and twice on Sunday (provided that I am driving the code generation myself, not general AI-generated code...)
(OP) The current project is closed source. If you look at my cli tools, that's pure slop; all I care about is that it works, so reviewing that code for sure will show some weird stuff. Does it matter? It's a tool to fetch logs from a server. I run it locally. As long as it does that reliably, idk about the code.
1. Humans are capable of writing good code. Most won't, but at least it's possible. If your company needs good code to survive, would you take 5% chance or 0% chance?
2. Even when humans write crappy code, they typically can maintain it.
This sounds like a wild take. So what about those trying LLM code, then deciding it isn't good enough, and going back and writing it from scratch themselves, with what they perceive to be better results? They're just wrong and the LLM was just as good?
Higher level abstractions are built on rational foundations, that is the distinction. I may not understand byte code generated by a compiler, but I could research the compiler and understand how it is generated. No matter how much I study a language model I will never understand how it chose to generate any particular output.
COBOL developers may have claimed that higher-level language developers didn't understand what was happening under the hood. However, they never suggested those developers couldn't understand the high-level code itself (what's going on here)—only what lay beneath it.
But COBOL developers resisted modern tooling. A coworker of mine tells the story of when he was working alongside an old mainframe hand more than 25 years ago, and was trying to explain to him how modern IDEs work. The mainframe guy gave him a disdainful look and said "That ain't how computing is done, kid."
Now what the guys above the programmers' paygrade knew was that the aim of software development wasn't really code, it was value delivered to the customer. If 300k lines of AI slop deliver that value quickly, they can be worth much more than the 20k lines of beautiful human-written code.
I'd suspect folk with a terminal-first approach probably have a much stronger understanding of what is going on under the hood, which makes approaching new repositories a lot easier, if nothing else.
Alternatively, maybe folk who're exposed to more codebases are the best off.
By "modern IDE" I meant something like Turbo Pascal, as compared with the (at best) ISPF-based editor the mainframe guy was using. This took place in the early 90s.
Then you'd love "Real Programmers Don't Use PASCAL", by Ed Post. It's about Fortran vs PASCAL, though it does mention COBOL in passing. It's copyright 1983!
There's a huge disconnect between what the benchmarks are showing and the day-to-day experience of those of us actually using LLMs. According to SWE-bench, I should be able to outsource a lot of tasks to LLMs by now. But practically speaking, I can't get them to reliably do even the most basic of tasks. Benchmaxxing is a real phenomenon. Internal private assessments are the most accurate source of information that we have, and those seem to be quite mixed for the most recent models.
How ironic that these LLMs appear to be overfitting to the benchmark scores. Presumably these researchers deal with overfitting every day, yet they can't recognize it right in front of them.
I'm sure they all know it's happening. But the incentives are all misaligned. They get promotions and raises for pushing the frontier which means showing SOTA performance on benchmarks.
We'll know next year if it's not the right move. If we get another wave of incremental upgrades at best then AI will clearly be in a bubble and it's time to look for the exits. If we keep getting major advances then any country _not_ going all in on AI will be left behind.
Have we gotten major advancements in the last ~year? Not being facetious, genuinely asking. I don't see it from my end but then again I don't really use much more than ChatGPT a couple times a week.
I think the major advancements are outside the textbox in this last year: video generation, robotic models like Helix, world models, Genie 3.
Even for text, Deepseek R1 was this year, and agentic and coding AI has made progress on length/complexity of tasks. The rise of MoE architecture in the open/local model space has made it possible to run useful models locally on hardware under like $2K, something I didn’t expect for a long time.
In the last ~year, we’ve gotten reasoning models, coding agents, much better multimodal capabilities, usable video models, huge context windows, huge decreases in cost, tons of advances in open source models, and I’m sure a lot more I’m not thinking of. I use AI both to code and to deliver value to my (non-tech) customers, and the last year has been awesome for me.
The difference in coding ability between then and now is pretty huge. And a year ago o1 hadn’t been introduced yet, whereas now the “reasoning” technique is pretty widespread.
Not sure if you’re counting things built on top of the models but if so, coding agents have also come a long way since a year ago.