> Practically speaking, we’ve observed it maintaining focus for more than 30 hours on complex, multi-step tasks.
Really curious about this since people keep bringing it up on Twitter. They mention it pretty much off-handedly in their press release, and it doesn't show up at all in their system card. It's only through an article on The Verge that we get more context: apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built one using 11,000 lines of code (https://www.theverge.com/ai-artificial-intelligence/787524/a...)
I have very low expectations around what would happen if you took an LLM and let it run unattended for 30 hours on a task, so I have a lot of questions as to the quality of the output
Interestingly the internet is full of "slack clone" dev tutorials. I used to work for a company that provides chat backend/frontend components as a service. It was one of their go-to examples, and the same is true for their competitors.
While it's impressive that you can now just have an llm build this, I wouldn't be surprised if the result of these 30 hours is essentially just a re-hash of one of those example Slack clones. Especially since all of these models have internet access nowadays; I honestly think 30 hours isn't even that fast for something like this, where you can realistically follow a tutorial and have it done.
This is obviously much more than just taking an LLM and letting it run for 30 hours. You have to build a whole environment together with external tool integration and context management, and then tune the prompts and perhaps even set up a multi-agent system. I believe that if someone puts a ton of work into this you can have an LLM run for that long and still produce sellable outputs, but let's not pretend this is something average devs can do by buying some API tokens and kicking off a frontier model.
Yes, but you need to set up quite a bit of tooling to provide feedback loops.
It's one thing to get an LLM to do something unattended for long durations; it's another to give it the means of verification.
For example, I'm busy upgrading a 500k LoC Rails 1 codebase to Rails 8, and I built several DSLs that give it proper authorised sessions in a headless browser with basic HTML parsing tooling, so it can "see" what effect its fixes have. Then you somehow need to also give it a reliable way to keep track of the past and its own learnings, which sounds simple, but I have yet to see any tool or model solve it at this scale... will give Sonnet 4.5 a try this weekend, but yeah, none of the models I tried were able to produce meaningful results over long periods on this upgrade task without good tooling and strong feedback loops.
Btw I have upgraded the app and am taking it to alpha testing now, so it is possible.
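To make the idea concrete: the setup described above is a set of Rails DSLs, but here's a rough TypeScript/Playwright sketch of the same "let the agent see the page as a logged-in user" feedback tool. The routes, selectors and credentials are made up for illustration, not taken from the actual project.

```typescript
import { chromium } from 'playwright';

// Visit a page as a logged-in user and return what an agent needs to judge
// whether its latest fix actually worked: HTTP status, console/page errors,
// and the rendered text content.
async function checkPage(path: string) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const errors: string[] = [];
  page.on('console', msg => { if (msg.type() === 'error') errors.push(msg.text()); });
  page.on('pageerror', err => errors.push(String(err)));

  // Authorised session first, so the agent sees real pages instead of a login wall.
  await page.goto('http://localhost:3000/login');   // hypothetical route
  await page.fill('#email', 'agent@example.com');   // hypothetical selectors/credentials
  await page.fill('#password', 'password');
  await page.click('button[type="submit"]');

  const response = await page.goto(`http://localhost:3000${path}`);
  const text = await page.locator('body').innerText();
  await browser.close();

  // Truncate so the result fits comfortably in the model's context.
  return { status: response?.status() ?? 0, errors, text: text.slice(0, 2000) };
}

// The agent runs this after every change, e.g.:
// console.log(JSON.stringify(await checkPage('/orders/42'), null, 2));
```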
I've tried asking it to log every request and response to a project_log.md, but it routinely ignores that.
I've also tried using Playwright for testing in a headless browser and taking screenshots for a blog that can effectively act as a log, but it just seems like too tall an order for it.
It sounds like you're streets ahead of where I am. Could you give me some pointers on getting started with a feedback loop, please?
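One pointer that generalises from the problems above: don't ask the model to keep the log, have the harness append every exchange itself, deterministically. A minimal sketch of that idea; `callModel` and the log format are placeholders, not any particular agent framework's API:

```typescript
import { appendFile } from 'node:fs/promises';

// callModel stands in for however you invoke the model; it is not a real API.
type CallModel = (prompt: string) => Promise<string>;

// Wrap the call so every exchange is appended to project_log.md by the
// harness itself, instead of hoping the model follows a
// "please log everything" instruction.
function withLog(callModel: CallModel, logPath = 'project_log.md'): CallModel {
  return async (prompt) => {
    const response = await callModel(prompt);
    const entry = [
      `## ${new Date().toISOString()}`,
      '### Request',
      prompt,
      '### Response',
      response,
      '',
    ].join('\n\n');
    await appendFile(logPath, entry);
    return response;
  };
}
```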
But then that goes back to the original question, considering my own experiences observing the amount of damage CC or Codex can do in a working code base with a couple tiny initial mistakes or confusion about intent while being left unattended for ten minutes, let alone 30 hours....
If you had used any of those, you'd know they clearly don't work well enough for such long tasks. We're not yet at the point where we have general purpose fire-and-forget frameworks. But there have been a few research examples from constrained environments with a complex custom setup.
That sounds to me like a full room of guys trying to figure out the most outrageous thing they can say about the update, without being accused of lying. Half of them on ketamine, the other on 5-MeO-DMT. Bat country. 2 months of 007 work.
What they don't mention is all the tooling, MCPs and other stuff they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with a lot of validated plans, checklists and verification points they can check. It's similar to 'lab conditions': you won't get that output in real-world situations.
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't make sense that the SWE-bench numbers are barely an improvement, yet the model is supposedly a much bigger improvement on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was tools and scaffolding rather than the model itself.
“30 hours of unattended work” is totally vague and doesn’t mean anything on its own. It depends, at the very least, on the number of tokens you were able to process.
Just to illustrate, say you are running on a slow machine that outputs 1 token per hour. At that speed, 30 hours gets you roughly 30 tokens: approximately one sentence.
(First of all: Why would anyone in their right mind want a Slack clone? Slack is a cancer. The only people who want it are non-technical people, who inflict it upon their employees.)
Is it just a chat with a group or 1on1 chat? Or does it have threads, emojis, voice chat calls, pinning of messages, all the CSS styling (which probably already is 11k lines or more for the real Slack), web hooks/apps?
Also, of course, it is just a BS announcement, without honesty, if they don't publish a reproducible setup that leads to the same outcome they had. It's the equivalent of "but it worked on my machine!", or of "scientific" papers proving anti-gravity with superconductors or perpetual-motion infinite energy that only worked in a small shed where some supposed physics professor lives.
Their point still stands though? They said the 1 tok/hr example was illustrative only. 11,000 LoC could be generated line-by-line in one shot, taking not much more than 11,000 * avg_tokens_per_line tokens. Or the model could be embedded in an agent and spend a million tokens contemplating every line.
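For a rough sense of scale, a back-of-the-envelope in TypeScript; every number here is an assumption for illustration, not anything from the announcement:

```typescript
// Back-of-the-envelope only; every number here is an assumption.
const lines = 11_000;
const tokensPerLine = 10;                    // rough average for code
const outputTokens = lines * tokensPerLine;  // ~110k tokens if written straight through
const tokensPerSecond = 50;                  // a plausible decode speed
const hours = outputTokens / tokensPerSecond / 3600;

console.log({ outputTokens, hours });        // ~0.6 hours of pure generation
// Everything beyond that slice of the 30 hours is reading context, planning,
// running tools and rewriting, and that is exactly the part we know nothing about.
```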
> Apparently they told it to build a Slack clone and left it unattended for 30 hours, and it built a Slack clone using 11,000 lines of code
It's going to be an issue, I think. Now that lots of these agents support computer use, we are at the point where you can install an app, tell the agent you want something that works exactly the same, and just let it run until it produces it.
Sooner rather than later, once full clones of popular apps start popping out of coding tools, the software world may find it has more in common with book authors than it thought. It will be interesting to see if this results in a war of attrition, with countermeasures and strict ToU that prohibit use by AI agents, etc.
It has been trivial to build a clone of most popular services for years, even before LLMs. One of my first projects was Miguel Grinberg's Flask tutorial, in which a total noob can build a Twitter clone in an afternoon.
What keeps people in are network effects and some dark patterns like vendor lock-in and data unportability.
There's a marked difference between running a Twitter-like application that scales to even a few hundred thousand users, and one that is a global scale application.
You may quickly find that, network effects aside, you would be crushed under the weight and unexpected bottlenecks of the very network you desire.
Agreed entirely, but I'm not sure that's relevant to what I'm replying to.
> we are at the point where you can install an app, tell the agent you want something that works exactly the same and just let it run until it produces it
That won't produce a global-scale application infrastructure either, it'll just reproduce the functionality available to the user.
Curious about this too – does it use the standard context management tools that ship with Claude Code? At 200K context size (or 1M for the beta version), I'm really interested in the techniques used to run it for 30 hours.
You can use the built-in task agent. When you have a plan and are ready for Claude to implement it, just say something along the lines of “begin implementation, split each step into their own subagent, run them sequentially”.
Subagents are where Claude Code shines and Codex still lags behind. Claude Code can do some things in parallel within a single session with subagents; Codex cannot.
Yeah, in parallel. They don't call it yolo mode for nothing! I have Claude configured to commit units of work to git, and after reviewing the commits by hand, they're cleanly separated by file. The todos don't conflict in the first place, though; e.g. changes to the admin API code won't conflict with changes to submission frontend code, so that's the limited human mechanism I'm using for that.
I'll admit it's a bit insane to have it make changes in the same directory simultaneously. I'm sure I could ask it to use git worktrees and have it work in separate directories, but I haven't tried that (or needed to) yet, so I won't comment on how well it would actually do.
Have they released the code for this? Does it work? Or are there X number of caveats and excuses? I'm kind of sick of them (and others) getting a free pass for saying stuff like this.