For long-running, expensive processes that do a lot of waiting, a downside is th...

abelanger · 2025-06-09T18:27:18 1749493638

OP here - this type of "checkpoint-based state machine" is exactly what durable execution platforms like Hatchet (https://hatchet.run/) and Temporal (https://temporal.io/) offer. Disclaimer: I'm a founder of Hatchet.

These platforms store an event history of the functions which have run as part of the same workflow, and automatically replay those when your function gets interrupted.

I imagine synchronizing memory contents at the language level would be much more overhead than synchronizing at the output level.
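
To make the replay idea concrete, here is a minimal Go sketch. It is not Hatchet's or Temporal's actual API: `History`, `Step`, and `workflow` are hypothetical names, and the in-memory map stands in for a persisted event log. Each step's output is recorded once; when the function runs again after an interruption, completed steps return their recorded output instead of executing again.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// History is a hypothetical log of step outputs keyed by step name.
// A real platform would persist this; here it is an in-memory map.
type History struct {
	results map[string]json.RawMessage
}

func NewHistory() *History {
	return &History{results: make(map[string]json.RawMessage)}
}

// Step runs fn only if no output for name has been recorded yet.
// Otherwise it decodes and returns the recorded output (the "replay" path).
func Step[T any](h *History, name string, fn func() (T, error)) (T, error) {
	var out T
	if raw, ok := h.results[name]; ok {
		err := json.Unmarshal(raw, &out)
		return out, err
	}
	out, err := fn()
	if err != nil {
		// Failures are not recorded, so the step runs again on the next attempt.
		return out, err
	}
	raw, err := json.Marshal(out)
	if err != nil {
		return out, err
	}
	h.results[name] = raw
	return out, nil
}

// calls counts how many times the "fetch" step really executes.
var calls int

// workflow is a two-step function; interrupt simulates a crash between steps.
func workflow(h *History, interrupt bool) (string, error) {
	n, err := Step(h, "fetch", func() (int, error) {
		calls++
		return 21, nil
	})
	if err != nil {
		return "", err
	}
	if interrupt {
		return "", errors.New("simulated crash")
	}
	return Step(h, "format", func() (string, error) {
		return fmt.Sprintf("result=%d", n*2), nil
	})
}

func main() {
	h := NewHistory()
	_, err := workflow(h, true) // interrupted after "fetch" was recorded
	fmt.Println("first run:", err)
	out, err := workflow(h, false) // "fetch" is replayed from history
	fmt.Println("second run:", out, err, "fetch executions:", calls)
}
```

Only step outputs are synchronized here, not memory contents, which is the trade-off described above. It also means the function body has to be deterministic between steps for the replay to be valid.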

tptacek · 2025-06-09T18:31:04 1749493864

This is also how our orchestrator (written in Go) is structured. JP describes it pretty well here (it's a durable log implemented with BoltDB).

https://fly.io/blog/the-exit-interview-jp/

abelanger · 2025-06-09T19:09:26 1749496166

Nice! It makes a lot of sense for orchestrating infra deployments -- we also started exploring Temporal at my previous startup for many of the same reasons, though at one level higher to orchestrate deployment into cloud providers.

skybrian · 2025-06-09T18:48:35 1749494915

Yep, though I haven’t used them, I’m vaguely aware that such things exist. I think they have a long way to go to become mainstream, though? Typical Go code isn’t written to be replayable like that.

abelanger · 2025-06-09T18:59:45 1749495585

I think there's a gap between people familiar with durable execution and those who use it in practice; it comes with a lot of overhead.

Adding a durable boundary (via a task queue) in between steps is typically the first step, because you at least get persistence and retries, and for a lot of apps that's enough. It's usually where we recommend people start with Hatchet, since it's just a matter of adding a simple wrapper or declaration on top of the existing code.

Durable execution is often the third evolution of your system (after the first pass with no durability, then adding a durable boundary).

lifty · 2025-06-10T06:35:51 1749537351

What are the main differences between temporal and hatchet?

abelanger · 2025-06-10T13:03:05 1749560585

The primary difference is that Hatchet is an all-purpose platform for async jobs, so while durable execution is a pattern that we support, we have a lot of other features like concurrency and fairness control, event ingestion, custom queues, dynamic rate limiting, streaming from a background job, monitoring, alerting, DAG-based executions, etc. There's a bit more on this/our architecture here: https://news.ycombinator.com/item?id=43572733.

I started working on Hatchet because I'm a huge advocate of durable execution but didn't enjoy using Temporal. So we try to make the development experience as good as possible.

On the underlying durable execution layer, it's the exact same core feature set.

sorentwo · 2025-06-09T18:25:46 1749493546

That's the issue with goroutines, threads, or any long-running chain of processes. The tasks must be broken up into atomic chunks, and the state has to be serialized in some way. That allows failures to be retried, errors to be examined, results to be referenced later, and the whole thing to be distributed between multiple nodes.

They must in my view, at least, since that's how Oban (https://github.com/oban-bg/oban) in Elixir models this kind of problem. Full disclosure, I'm an author and maintainer of the project.

It's Elixir specific, but this article emphasizes the importance of async task persistence: https://oban.pro/articles/oban-starts-where-tasks-end
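
For readers more at home in Go, the "atomic chunks with serialized state" idea can be sketched roughly as below. This is not Oban's API: the `Job` fields, the state names, `RunOnce`, and `demo` are all hypothetical, and a JSON round-trip stands in for writing to and reading from a durable store between attempts.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Job is a hypothetical persisted unit of work. Everything needed to run it,
// retry it, or inspect it later lives in serializable fields.
type Job struct {
	ID       int             `json:"id"`
	Kind     string          `json:"kind"`
	Args     json.RawMessage `json:"args"`
	Attempt  int             `json:"attempt"`
	MaxTries int             `json:"max_tries"`
	State    string          `json:"state"` // "available", "completed", or "discarded"
	Errors   []string        `json:"errors"`
}

// Handler performs one atomic chunk of work from serialized arguments.
type Handler func(args json.RawMessage) error

// RunOnce executes a single attempt and records the outcome on the job.
func RunOnce(j *Job, h Handler) {
	j.Attempt++
	err := h(j.Args)
	switch {
	case err == nil:
		j.State = "completed"
	case j.Attempt >= j.MaxTries:
		j.Errors = append(j.Errors, err.Error())
		j.State = "discarded" // out of retries; errors are kept for inspection
	default:
		j.Errors = append(j.Errors, err.Error())
		j.State = "available" // eligible for another attempt, possibly on another node
	}
}

// demo runs a job whose handler fails once and then succeeds.
func demo() *Job {
	args, _ := json.Marshal(map[string]string{"to": "user@example.com"})
	job := &Job{ID: 1, Kind: "send_email", Args: args, MaxTries: 3, State: "available"}

	failures := 1
	handler := func(raw json.RawMessage) error {
		if failures > 0 {
			failures--
			return errors.New("temporary failure")
		}
		return nil
	}

	for job.State == "available" {
		RunOnce(job, handler)
		// Round-trip through JSON to stand in for a durable store:
		// nothing survives between attempts except the serialized job.
		stored, err := json.Marshal(job)
		if err != nil {
			panic(err)
		}
		job = &Job{}
		if err := json.Unmarshal(stored, job); err != nil {
			panic(err)
		}
	}
	return job
}

func main() {
	job := demo()
	fmt.Println(job.State, job.Attempt, job.Errors)
}
```

Because the job is rebuilt from its serialized form on every attempt, the retry could just as well happen in a different process, which is the property a bare goroutine does not give you.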

carsoon · 2025-06-09T19:28:09 1749497289

I'm actually working on an agent library in Go, and this is exactly the thought process I've come up with. If we have comprehensive logging, we can actually reconstruct the agent's state at any position, allowing for replays, etc. You just need the timestamp (endpoint) and the parent run, and you can build child/branched runs after that.

Through the use of both a map that holds a context tree and a database we can purge old sessions and then reconstruct them from the database when needed (for instance an async agent session with user input required).

We also don't have to hold individual objects for the agents/workflows/tools; we just keep them stateless in a map and can reference the pointers through an ID as needed. Then we have a stateful object that holds the previous actions/steps/"context".

To make sure the agents/workflows are consistent, we can hash the output agent/workflow (as these are serializable in my system).
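
One possible reading of the hashing step, as a minimal Go sketch: `AgentDef` and `Fingerprint` are hypothetical names, and JSON plus SHA-256 are assumptions rather than what the library actually uses. The fingerprint stored with a run can be compared against the current definition before replaying it.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// AgentDef is a hypothetical serializable agent definition.
type AgentDef struct {
	Name   string   `json:"name"`
	Prompt string   `json:"prompt"`
	Tools  []string `json:"tools"`
}

// Fingerprint returns the hex-encoded SHA-256 of the JSON encoding of def.
// encoding/json emits struct fields in declaration order, so the same
// definition always produces the same bytes and therefore the same hash.
func Fingerprint(def AgentDef) (string, error) {
	raw, err := json.Marshal(def)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	recorded := AgentDef{Name: "researcher", Prompt: "Summarize the page.", Tools: []string{"fetch"}}
	current := recorded
	current.Tools = []string{"fetch", "search"} // the definition changed after the run was logged

	h1, _ := Fingerprint(recorded)
	h2, _ := Fingerprint(current)
	fmt.Println("safe to replay:", h1 == h2)
}
```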

I have only implemented basic Agent/tools though and the logging/reconstruction/cancellation logic has not actually been done yet.

jpk · 2025-06-10T05:09:57 1749532197

Just a drive-by thought, but: What you're describing sounds a lot like Temporal.io. I guess the difference is the "workflow" of an agent might take different paths depending on what it was asked to accomplish and the approach it ends up taking to get there, and that's what you're interested in persisting, replaying, etc. Whereas a Temporal workflow is typically a more rigid thing, akin to writing a state machine that models a business process -- but all the challenges around persistence, replay, etc, sound similar.

Edit: Heh, I noticed after writing this that some sibling comments also mention Temporal.

Karrot_Kream · 2025-06-09T18:20:51 1749493251

Temporal is pretty decent at checkpointing long-running processes and is language agnostic.

trevinhofmann · 2025-06-10T00:23:11 1749514991

I've been considering good ways to use a task queue for this, and might just settle for a rudimentary one in a Postgres table.

The upside is that agent subtasks can be load balanced among servers, tasks won't be dropped if the process is killed, and better observability comes along with it.

The downside is definitely complexity. I'm having a hard time planning out an architecture that doesn't significantly increase the complexity of my agent code.

ashishb · 2025-06-09T18:26:00 1749493560

> For long-running, expensive processes that do a lot of waiting, a downside is that if you kill the goroutine, you lose all your work.

This is true regardless of the language. I always keep the work done in a single goroutine to a reasonable amount (milliseconds up to a few seconds). Anything more and your web service is not as stateless as it should be.