> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context: apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built one using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
Interestingly the internet is full of "slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.
While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.
This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, and then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something average devs can do by buying some API tokens and kicking off a frontier model.
Yes, but you need to set up quite a bit of tooling to provide feedback loops.
It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.
For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8, and I built several DSLs that give it proper authorised sessions in a headless browser with basic HTML parsing tooling, so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it at this scale... will give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried were able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
Btw I have upgraded the app and am taking it to alpha testing now, so it is possible.
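To make the idea concrete: the setup described above is a set of Rails DSLs, but here's a rough TypeScript/Playwright sketch of the same "let the agent see the page as a logged-in user" feedback tool. The routes, selectors and credentials are made up for illustration, not taken from the actual project.

```typescript
import { chromium } from 'playwright';

// Visit a page as a logged-in user and return what an agent needs to judge
// whether its latest fix actually worked: HTTP status, console/page errors,
// and the rendered text content.
async function checkPage(path: string) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const errors: string[] = [];
  page.on('console', msg => { if (msg.type() === 'error') errors.push(msg.text()); });
  page.on('pageerror', err => errors.push(String(err)));

  // Authorised session first, so the agent sees real pages instead of a login wall.
  await page.goto('http://localhost:3000/login');   // hypothetical route
  await page.fill('#email', 'agent@example.com');   // hypothetical selectors/credentials
  await page.fill('#password', 'password');
  await page.click('button[type="submit"]');

  const response = await page.goto(`http://localhost:3000${path}`);
  const text = await page.locator('body').innerText();
  await browser.close();

  // Truncate so the result fits comfortably in the model's context.
  return { status: response?.status() ?? 0, errors, text: text.slice(0, 2000) };
}

// The agent runs this after every change, e.g.:
// console.log(JSON.stringify(await checkPage('/orders/42'), null, 2));
```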
I've tried asking it to log every request and response to a project_log.md, but it routinely ignores that.
I've also tried using Playwright for testing in a headless browser and taking screenshots for a blog that can effectively act as a log, but it just seems like too tall an order for it.
It sounds like you're streets ahead of where I am. Could you give me some pointers on getting started with a feedback loop, please?
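One pointer that generalises from the problems above: don't ask the model to keep the log, have the harness append every exchange itself, deterministically. A minimal sketch of that idea; `callModel` and the log format are placeholders, not any particular agent framework's API:

```typescript
import { appendFile } from 'node:fs/promises';

// callModel stands in for however you invoke the model; it is not a real API.
type CallModel = (prompt: string) => Promise<string>;

// Wrap the call so every exchange is appended to project_log.md by the
// harness itself, instead of hoping the model follows a
// "please log everything" instruction.
function withLog(callModel: CallModel, logPath = 'project_log.md'): CallModel {
  return async (prompt) => {
    const response = await callModel(prompt);
    const entry = [
      `## ${new Date().toISOString()}`,
      '### Request',
      prompt,
      '### Response',
      response,
      '',
    ].join('\n\n');
    await appendFile(logPath, entry);
    return response;
  };
}
```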
But then that goes back to the original question, considering my own experiences observing the amount of damage CC or Codex can do in a working code base with a couple tiny initial mistakes or confusion about intent while being left unattended for ten minutes, let alone 30 hours....
If you had used any of those, you'd know they clearly don't work well enough for such long tasks. We're not yet at the point where we have general purpose fire-and-forget frameworks. But there have been a few research examples from constrained environments with a complex custom setup.
That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update, without being accused of lying. Half of them on ketamine, the other on 5-MeO-DMT. Bat country. 2 months of 007 work.
What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points they can check. It's similar to 'lab conditions': you won't get that output in real-world situations.
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench numbers are barely an improvement, yet the model is supposedly a much bigger improvement on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was tools and scaffolding rather than the model itself.
“30 hours of unattended work” is totally vague and doesn’t mean anything on its own. It depends, at the very least, on the number of tokens you were able to process.
Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed, 30 hours gets you roughly 30 tokens: approximately one sentence.
(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)
Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?
Also, of course, it is just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "but it worked on my machine!", or of "scientific" papers proving anti-gravity with superconductors or perpetual-motion infinite energy that only worked in a small shed where some supposed physics professor lives.
Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
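For a rough sense of scale, a back-of-the-envelope in TypeScript; every number here is an assumption for illustration, not anything from the announcement:

```typescript
// Back-of-the-envelope only; every number here is an assumption.
const lines = 11_000;
const tokensPerLine = 10;                    // rough average for code
const outputTokens = lines * tokensPerLine;  // ~110k tokens if written straight through
const tokensPerSecond = 50;                  // a plausible decode speed
const hours = outputTokens / tokensPerSecond / 3600;

console.log({ outputTokens, hours });        // ~0.6 hours of pure generation
// Everything beyond that slice of the 30 hours is reading context, planning,
// running tools and rewriting, and that is exactly the part we know nothing about.
```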
> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code
It's going to be an issue, I think. Now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.
Sooner rather than later, once full clones of popular apps start popping out of coding tools, the software world may find it has more in common with book authors than it thought. It will be interesting to see if this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.
It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.
What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.
There's a marked difference between running a Twitter-like application that scales to even a few hundred thousand users, and one that is a global scale application.
You may quickly find that, network effects aside, you would be crushed under the weight and unexpected bottlenecks of the very network you desire.
Agreed entirely, but I'm not sure that's relevant to what I'm replying to.
> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it
That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
You can use the built-in task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into their own subagent, run them sequentially”.
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.
Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place, though; e.g. changes to the admin API code won't conflict with changes to submission frontend code, so that's the limited human mechanism I'm using for that.
I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and have it work in separate directories, but I haven't tried that (or needed to) yet, so I won't comment on how well it would actually do.
Have they released the code for this? Does it work? Or are there X number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass for saying stuff like this.