I'll take the other side of all this hate. We use celery for quite a bit, and I struggle to sympathize with TFA.
Much of what's written comes back to using old python2 libraries or just complaining that celery doesn't have defaults that better suit you. The very first issue mentioned, about not manually setting the concurrency level...this is basic stuff. It always boggles my mind when people deploy something to prod with the copy/paste "get started" instructions from github. Why do people deploy critical infra layers without spending a lot of time to understand the thing they're depending on?
There are genuine bugs in Celery, and task queues are particularly hard to get right, especially with pluggable backends. But if you RTFD and follow good devops practice, you will have a perfectly fine time.
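For the record, the knobs in question are one-liners. A rough sketch of an explicit config using the Celery 4+ setting names (the values are illustrative, not recommendations):

    # celeryconfig.py -- values are illustrative, tune for your workload
    worker_concurrency = 4             # don't let Celery default to one process per CPU core
    worker_max_tasks_per_child = 200   # recycle workers to bound memory growth
    worker_prefetch_multiplier = 1     # stop workers hoarding tasks they haven't started
    task_acks_late = True              # re-deliver tasks if a worker dies mid-task
    task_time_limit = 600              # hard kill for runaway tasks (seconds)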
The moment I saw mention of Python 2 I was like "well, yeah". I'm sure that won't solve all, or even many of the ills, but if you're not reasonably up to date with your language and libraries then yeah, you're probably gonna have a harder time than you need to.
> Supervisor should've killed celery's forked workers once the parent dies. Either it is not reliable or we haven't been able to make this happen in at least a few attempts now.
Use systemd instead of supervisord for misbehaving forking daemons like this. If you set "KillMode=mixed" you get a nice compromise between "signal the parent and give things a configurable grace period to shut themselves down" and "zero risk of orphaned child (or stuck parent) processes being left around after service stop/restart".
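For reference, the relevant bits of a unit file look roughly like this (paths and names are placeholders, a sketch rather than a drop-in config):

    # /etc/systemd/system/celery-worker.service (sketch)
    [Service]
    Type=simple
    ExecStart=/opt/app/venv/bin/celery -A proj worker --concurrency=4
    # SIGTERM goes to the main process only; anything still alive when the
    # stop timeout expires gets SIGKILL, so no orphaned forks survive a restart.
    KillMode=mixed
    TimeoutStopSec=60
    Restart=on-failure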
Without weighing in on the systemd-vs-other-init debate in general, I do strongly believe that systemd fully obsoletes supervisord. The (slightly) tighter coupling supervisord offers with Python isn't worth the deficient feature set.
I find supervisor's event listener functionality useful for implementing watchdog processes. I have the supervised process write an event whenever it successfully completes a task. Meanwhile, a separate event listener process has a watchdog timer which asks supervisor to restart the supervised process if it hasn't checked in recently enough.
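Roughly the shape of that listener, using supervisor's childutils; the program name and timeout are made up, and a real listener would also check the processname in the event payload:

    # watchdog_listener.py -- a sketch of the pattern, not my actual listener.
    # Runs under [eventlistener:watchdog] with events=PROCESS_COMMUNICATION,TICK_60.
    import os
    import sys
    import time
    from supervisor import childutils

    WATCHED = "myworker"   # hypothetical program name
    MAX_SILENCE = 300      # seconds without a check-in before restarting it

    def main():
        last_seen = time.time()
        rpc = childutils.getRPCInterface(os.environ)
        while True:
            headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
            if headers["eventname"].startswith("PROCESS_COMMUNICATION"):
                last_seen = time.time()            # the worker checked in
            elif headers["eventname"].startswith("TICK"):
                if time.time() - last_seen > MAX_SILENCE:
                    rpc.supervisor.stopProcess(WATCHED)
                    rpc.supervisor.startProcess(WATCHED)
                    last_seen = time.time()
            childutils.listener.ok(sys.stdout)

    if __name__ == "__main__":
        main()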
Does systemd have anything like supervisor events?
supervisord still has a place in scenarios where you can't or don't want to use the init system. WSL2 and docker containers running multiple processes come to mind.
In most linux environments, this is a non-issue. And some will argue you are using containers wrong if you are running multiple services in a single container, but containers as a reproducible environment and isolation unit is a valid use case (for test/dev environments especially).
Just want to point out there are options other than supervisord. You can even use systemd in a container, though I would avoid it. s6 and runit are both options.
Using something Python-based for this sort of thing makes me slightly nervous.
I ran Celery in many projects in production over the past 10 years; I would not recommend it. It is mostly a constant fight even though it is the first thing most people grab when they have a Django stack.
I've been running four separate installs of Celery since 2014 or so, processing about 10-20 million jobs a day, using Redis as a broker, and with about 100 worker instances per install. It just works, but I don't use Celery results or any esoteric features.
I totally agree, I would not recommend using Celery. The last time I used Celery was in 2019 and it was a mess. The memory leak issue is still open https://github.com/celery/celery/issues/4843 to this day.
IMO using Python at all for this kind of setup with lots of workers is a bit crazy. Since there is no usable threading, each worker needs its own process, and as the article says, that's a couple hundred MB of RAM per worker. In other languages you could have hundreds of threads doing the same work with the same resources.
If you're backing onto something like SQS or Rabbit (Redis is a bit trickier), it's not the worst thing to hand-roll either. Then you understand exactly what trade-offs you're making.
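To put some weight behind "not the worst thing to hand-roll", a sketch of an SQS consumer with boto3 (queue URL and handler are placeholders):

    # Minimal at-least-once SQS worker -- a sketch, not production code.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

    def handle(job):
        ...  # your actual work goes here

    def main():
        sqs = boto3.client("sqs")
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,      # long polling
            )
            for msg in resp.get("Messages", []):
                handle(json.loads(msg["Body"]))
                # Delete only after the work succeeded; a crash before this
                # line means SQS redelivers after the visibility timeout.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        main()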
I had a good time with dramatiq + apscheduler. It's much more bare-bones and less out-of-the-box, but when something went wrong it was pretty easy to understand and work around.
It has its own set of problems (mostly from starting with the premise of "we will remove all of Celery's features" and going from there); in particular its logging policies are extremely frustrating and we have had to work around them. But the code is so simple in general that the trade-off works out in our favour.
I personally would have liked to use dq instead but that's mostly an aesthetics thing.
I can't say I've encountered the same things and I've been using Celery since around 2015.
In one deployment I routinely throw work at it. Its main job is to contact third-party APIs and parse up to 600 MB of XML responses. It's not high traffic, but it's active enough to be running jobs often.
This box has been really stable, often going six months without a restart of Flask or Celery because it just hums along without issues. The service itself is fairly stable too, so it isn't getting new versions rolled out (just security patches as needed). It's not a legacy system either; it's just feature complete.
In the past I've had other sites where Celery was doing around a million jobs a month to massage a small amount of data and write it to a Postgres database; it also had no issues and was super stable.
I mainly use standard Celery features backed by Redis on a lowish-end VPS. I would like to know more details about what folks are doing where they're encountering all of these problems.
Because its default configuration and most naive setups offer no redundancy? Because it's almost entirely in-memory with no easy eviction controls? And if you end up snapshotting on every change because the work absolutely has to happen, a purpose-built queue like RabbitMQ is going to perform better, especially during failures.
I've had mixed experiences with RQ. It works great most of the time, but I've observed that you can get into weird states if you OOM your system or otherwise throw curveballs at it. Overall I think Python makes it easy enough to handroll a queue management system that I would generally recommend that over RQ.
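For a sense of what "handroll" means here, the core of a Redis-list queue is tiny; a sketch (naming is mine, and this skips retries, timeouts, and dead-lettering):

    # Tiny Redis-list job queue -- illustrative only.
    import json
    import redis

    r = redis.Redis()

    def enqueue(func_name, **kwargs):
        r.lpush("jobs", json.dumps({"func": func_name, "kwargs": kwargs}))

    def worker(handlers):
        while True:
            _, raw = r.brpop("jobs")        # blocks until a job arrives
            job = json.loads(raw)
            handlers[job["func"]](**job["kwargs"])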
I suspect in many cases people use Celery only to send emails or for simple queueing type problems.
At risk of being overly negative, my experience with celery has been that it is complex and hard to configure and debug.
I chose a simple queueing mechanism instead of using Celery, and for sending emails I was so frustrated by Celery that I wrote the Arnie SMTP buffer server (in 100 lines of Python): https://github.com/bootrino/arniesmtpbufferserver
We switched our Python task processing from Celery to Pub/Sub (on GCP), backed by Redis for message storage, and it's much better. Celery is full of bugs and quirks; nobody ever really knew what it was doing when it failed. For simple task processing, we find it helpful to have a task queue that you can fully understand: just a hundred lines of code for the subscriber and publisher logic, which we wrote ourselves. If you can get by without workflow and orchestration, Pub/Sub works well at high volume and it's cheap. Something to consider!
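For a sense of scale, the subscriber half looks roughly like this (project, subscription, and handler names are placeholders, not our production code):

    # Minimal Pub/Sub subscriber -- a sketch only.
    import json
    from google.cloud import pubsub_v1

    def handle(task):
        ...  # route the task to the right function

    def callback(message):
        handle(json.loads(message.data))
        message.ack()                      # only ack after the work succeeded

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "tasks-sub")   # placeholders
    future = subscriber.subscribe(sub_path, callback=callback)
    future.result()                        # block and process messages forever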
I've been using Celery on a small scale for years now and generally haven't had any serious issues except one: reliably cancelling scheduled (future) tasks. Even using persistent revokes has been unreliable in my experience: https://docs.celeryproject.org/en/latest/userguide/workers.h...
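For anyone unfamiliar, "persistent revokes" boils down to roughly this combination (task id and path are placeholders):

    # Revoke a scheduled task by id -- only workers that are up hear the revoke,
    # and it only survives worker restarts when workers run with --statedb, e.g.
    #   celery -A proj worker --statedb=/var/run/celery/worker.state
    from proj.celery import app        # hypothetical app module

    app.control.revoke("d9078da5-9915-40a0-bfa1-392c7bde42ed")   # placeholder id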
Fwiw though, the alternatives (huey, dramatiq, rq) don't even seem to offer this feature.
I've been using Celery and it has been working okay for our needs, but it is not very friendly to changes, depending on how you write tasks. I'm fine when a task is a thin wrapper around other functions that you can run and test with any tools you'd like. But we have some tasks that go all in, binding all kinds of data and functionality to the task itself that isn't guaranteed to be there unless you launch it with .delay or similar, so you can't just spin up a REPL or debugger, call the function, and have it work.
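The shape that does work well for us is a thin shim over a plain function, roughly:

    # Thin-shim pattern: the task only forwards to a plain function, so a REPL,
    # debugger, or test can call generate_report() directly without any Celery setup.
    from proj.celery import app        # hypothetical app module

    def generate_report(account_id, since):
        ...                            # all the real logic lives here, no Celery imports

    @app.task
    def generate_report_task(account_id, since):
        return generate_report(account_id, since)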
I would put celery in that class of software that's good enough to be useful, but enough trouble that you'll never be truly happy with it.
We got a lot of mileage out of it (in a past life) before finally moving our job scheduling to a custom solution built on RQ. Celery caused more than a few headaches -- which is why we ditched it -- but its flexibility probably helped us scale up the service to begin with...
Celery is not great. At my place of work we are begrudgingly using it right now, with plans to replace it with something else this year. I am at the very least amazed we got this much use out of it. We're considering replacing it with a much simpler setup that just uses Google Cloud Pub/Sub and some simple Python routing of messages to functions.
My advice is to just pay someone to run your Celery setup for you.
Celery is hard to maintain, but the benefits are definitely worth it, and it’s the most mature cross language stack. I can have a celery job sent or received from any major language.
I think the only reason people pick another option is that they start off by only using Python in the first place.
I have been using it for four months. Its configuration is really complicated and many default values do not make sense.
For instance, when you are using the solo pool, there should be a default hard task timeout smaller than the effective heartbeat value; otherwise it will fail and retry the same task forever.
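A sketch of the kind of settings that end up having to be pinned by hand for that setup, assuming a Redis or SQS broker (values illustrative):

    # celeryconfig.py -- with -P solo one long task blocks the whole worker, so the
    # broker's redelivery window has to be longer than your longest task.
    task_acks_late = True
    broker_transport_options = {
        "visibility_timeout": 3600,   # Redis/SQS: seconds before an unacked task is redelivered
    }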
Consider where you see physical queues of people in the real world: e.g. people lining up to be served at a supermarket, to board an aircraft, to enter a concert. The purpose of the queue is to provide an orderly way to manage situations where there is more demand for a service than can be immediately processed. If the system tried to serve all the demand simultaneously, that could lead to chaos and no one getting served.
Similar situations happen in computer systems; in some of those cases, queues can be used to manage load.
E.g. when you order something online, there is a good chance a task queue (or something very similar) is part of the machinery making that happen. When you click "purchase", your order is persisted in some queue or DB before any attempt is made to fulfill it. The UI can give you immediate feedback telling you that your order was received: here's your order ID, sit back and relax, we'll send you an update. When some machine (or person) has free capacity and no higher-priority orders to process, it will grab your order and start processing it.
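In Celery terms, that "persist first, respond immediately" step looks roughly like this (names are invented for illustration):

    # Enqueue on purchase, respond right away -- names invented for illustration.
    from proj.celery import app        # hypothetical app module

    @app.task
    def fulfill_order(order_id):
        ...                            # charge the card, notify the warehouse, email the customer

    def purchase_view(request):
        order = create_order(request)          # persist the order first (hypothetical helper)
        fulfill_order.delay(order.id)          # hand the slow part to a worker
        return {"order_id": order.id, "status": "received"}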
Some worker queue implementations offer additional benefits in terms of robustness. E.g. if worker node A begins processing task T and then explodes before T is complete, the task may be returned to the queue after some timeout, and a surviving worker B may be able to grab T and process it successfully. This can be very valuable when you want to guarantee that task T is done correctly rather than half-done or forgotten -- e.g. processing and provisioning customer orders. It gets considerably more challenging to implement correctly if task T performs side effects on the real world (transferring money, dispatching vehicles) and the failure occurs after some, but not all, of those side effects have already happened.
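In Celery, that redelivery behaviour hinges on acknowledging the message only after the task finishes; a minimal sketch:

    # With late acks, a task whose worker dies mid-run goes back to the broker and
    # another worker can pick it up -- which is why the task body must be idempotent.
    from proj.celery import app        # hypothetical app module

    @app.task(acks_late=True)
    def provision_customer(order_id):
        ...                            # safe to run twice for the same order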
Queues also introduce new failure modes: if something stops workers from processing tasks and the queue gets longer and longer, how do you find out and fix it? The queue needs to be stored somewhere and has some maximum capacity; what happens when that is exceeded -- does the queue drop all the tasks on the floor, blow up, or something else?
Threading is local to a machine; task queues are generally a solution to distributing a large number of tasks across many distinct compute nodes by using some remote network service (redis, rmq, etc) to track the progress of jobs.
Even if Python had real threading, Celery has features that make it convenient for scheduling tasks, chaining them in different ways, and so on. You could write your own thing, of course. But it also operates across different processes, and those processes don't even have to be on the same machine.