I'll take the other side of all this hate. We use celery for quite a bit, and I struggle to sympathize with TFA.
Much of what's written comes back to using old python2 libraries or just complaining that celery doesn't have defaults that better suit you. The very first issue mentioned, about not manually setting the concurrency level...this is basic stuff. It always boggles my mind when people deploy something to prod with the copy/paste "get started" instructions from github. Why do people deploy critical infra layers without spending a lot of time to understand the thing they're depending on?
There are genuine bugs in Celery, and task queues are particularly hard to get right, especially with pluggable backends. But if you RTFD and follow good devops practice, you will have a perfectly fine time.
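For the record, the knobs in question are one-liners. A rough sketch of an explicit config using the Celery 4+ setting names (the values are illustrative, not recommendations):

    # celeryconfig.py -- values are illustrative, tune for your workload
    worker_concurrency = 4             # don't let Celery default to one process per CPU core
    worker_max_tasks_per_child = 200   # recycle workers to bound memory growth
    worker_prefetch_multiplier = 1     # stop workers hoarding tasks they haven't started
    task_acks_late = True              # re-deliver tasks if a worker dies mid-task
    task_time_limit = 600              # hard kill for runaway tasks (seconds)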
The moment I saw mention of Python 2 I was like "well, yeah". I'm sure that won't solve all, or even many of the ills, but if you're not reasonably up to date with your language and libraries then yeah, you're probably gonna have a harder time than you need to.
> Supervisor should've killed celery's forked workers once the parent dies. Either it is not reliable or we haven't been able to make this happen in at least a few attempts now.
Use systemd instead of supervisord for misbehaving forking daemons like this. If you set "KillMode=mixed" you get a nice compromise between "signal the parent and give things a configurable grace period to shut themselves down" and "zero risk of orphaned child (or stuck parent) processes being left around after service stop/restart".
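For reference, the relevant bits of a unit file look roughly like this (paths and names are placeholders, a sketch rather than a drop-in config):

    # /etc/systemd/system/celery-worker.service (sketch)
    [Service]
    Type=simple
    ExecStart=/opt/app/venv/bin/celery -A proj worker --concurrency=4
    # SIGTERM goes to the main process only; anything still alive when the
    # stop timeout expires gets SIGKILL, so no orphaned forks survive a restart.
    KillMode=mixed
    TimeoutStopSec=60
    Restart=on-failure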
Without weighing in on the systemd-vs-other-init debate in general, I do strongly believe that systemd fully obsoletes supervisord. The (slightly) tighter coupling supervisord offers with Python isn't worth the deficient feature set.
I find supervisor's event listener functionality useful for implementing watchdog processes. I have the supervised process write an event whenever it successfully completes a task. Meanwhile, a separate event listener process has a watchdog timer which asks supervisor to restart the supervised process if it hasn't checked in recently enough.
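Roughly the shape of that listener, using supervisor's childutils; the program name and timeout are made up, and a real listener would also check the processname in the event payload:

    # watchdog_listener.py -- a sketch of the pattern, not my actual listener.
    # Runs under [eventlistener:watchdog] with events=PROCESS_COMMUNICATION,TICK_60.
    import os
    import sys
    import time
    from supervisor import childutils

    WATCHED = "myworker"   # hypothetical program name
    MAX_SILENCE = 300      # seconds without a check-in before restarting it

    def main():
        last_seen = time.time()
        rpc = childutils.getRPCInterface(os.environ)
        while True:
            headers, payload = childutils.listener.wait(sys.stdin, sys.stdout)
            if headers["eventname"].startswith("PROCESS_COMMUNICATION"):
                last_seen = time.time()            # the worker checked in
            elif headers["eventname"].startswith("TICK"):
                if time.time() - last_seen > MAX_SILENCE:
                    rpc.supervisor.stopProcess(WATCHED)
                    rpc.supervisor.startProcess(WATCHED)
                    last_seen = time.time()
            childutils.listener.ok(sys.stdout)

    if __name__ == "__main__":
        main()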
Does systemd have anything like supervisor events?
supervisord still has a place in scenarios where you can't or don't want to use the init system. WSL2 and docker containers running multiple processes come to mind.
In most linux environments, this is a non-issue. And some will argue you are using containers wrong if you are running multiple services in a single container, but containers as a reproducible environment and isolation unit is a valid use case (for test/dev environments especially).
Just want to point out there are options other than supervisord. You can even use systemd in a container, though I would avoid it. s6 and runit are both options.
Using something Python-based for this sort of thing makes me slightly nervous.
I ran Celery in many projects in production over the past 10 years; I would not recommend it. It is mostly a constant fight even though it is the first thing most people grab when they have a Django stack.
I've been running four separate installs of Celery since 2014 or so, processing about 10-20 million jobs a day, using Redis as a broker, and with about 100 worker instances per install. It just works, but I don't use Celery results or any esoteric features.
I totally agree, I would not recommend using Celery. The last time I used Celery was in 2019 and it was a mess. The memory leak issue is still open https://github.com/celery/celery/issues/4843 to this day.
IMO using Python at all for this kind of setup with lots of workers is a bit crazy. Since there is no usable threading, each worker needs its own process, and as the article says, that's a couple hundred MB of RAM per worker. In other languages you could have hundreds of threads doing the same work with the same resources.
If you're backing onto something like SQS or Rabbit (Redis is a bit trickier), it's not the worst thing to hand-roll either. Then you understand exactly what trade-offs you're making.
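To put some weight behind "not the worst thing to hand-roll", a sketch of an SQS consumer with boto3 (queue URL and handler are placeholders):

    # Minimal at-least-once SQS worker -- a sketch, not production code.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder

    def handle(job):
        ...  # your actual work goes here

    def main():
        sqs = boto3.client("sqs")
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL,
                MaxNumberOfMessages=10,
                WaitTimeSeconds=20,      # long polling
            )
            for msg in resp.get("Messages", []):
                handle(json.loads(msg["Body"]))
                # Delete only after the work succeeded; a crash before this
                # line means SQS redelivers after the visibility timeout.
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        main()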
I had a good time with dramatiq + apscheduler. It's much more bare-bones and less out-of-the-box, but when something went wrong it was pretty easy to understand and work around.
It has its own set of problems (mostly from starting with the premise of "we will remove all of Celery's features" and going from there); in particular its logging policies are extremely frustrating and we have had to work around them. But the code is so simple in general that the trade-off works out in our favour.
I personally would have liked to use dq instead but that's mostly an aesthetics thing.
I can't say I've encountered the same things and I've been using Celery since around 2015.
In one deployment I routinely throw work at it. Its main job is to contact third-party APIs and parse up to 600 MB of XML responses. It's not high traffic, but it's active enough to be running jobs often.
This box has been really stable, often going six months without a restart of Flask or Celery because it just hums along without issues. The service itself is fairly stable too, so it isn't getting new versions rolled out (just security patches as needed). It's not a legacy system either; it's just feature complete.
In the past I've had other sites where Celery was doing around a million jobs a month to massage a small amount of data and write it to a Postgres database; it also had no issues and was super stable.
I mainly use standard Celery features backed by Redis on a lowish-end VPS. I would like to know more details about what folks are doing where they're encountering all of these problems.
Because its default configuration and most naive setups offer no redundancy? Because it's almost entirely in-memory with no easy eviction controls? And if you end up snapshotting on every change because the work absolutely has to happen, a purpose-built queue like RabbitMQ is going to perform better, especially during failures.
I've had mixed experiences with RQ. It works great most of the time, but I've observed that you can get into weird states if you OOM your system or otherwise throw curveballs at it. Overall I think Python makes it easy enough to handroll a queue management system that I would generally recommend that over RQ.
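For a sense of what "handroll" means here, the core of a Redis-list queue is tiny; a sketch (naming is mine, and this skips retries, timeouts, and dead-lettering):

    # Tiny Redis-list job queue -- illustrative only.
    import json
    import redis

    r = redis.Redis()

    def enqueue(func_name, **kwargs):
        r.lpush("jobs", json.dumps({"func": func_name, "kwargs": kwargs}))

    def worker(handlers):
        while True:
            _, raw = r.brpop("jobs")        # blocks until a job arrives
            job = json.loads(raw)
            handlers[job["func"]](**job["kwargs"])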
I suspect in many cases people use Celery only to send emails or for simple queueing type problems.
At risk of being overly negative, my experience with celery has been that it is complex and hard to configure and debug.
I chose a simple queueing mechanism instead of using Celery, and for sending emails I was so frustrated by Celery that I wrote the Arnie SMTP buffer server (in 100 lines of Python): https://github.com/bootrino/arniesmtpbufferserver
We switched our Python task processing from Celery to Pub/Sub (on GCP), backed by Redis for message storage, and it's much better. Celery is full of bugs and quirks; nobody ever really knew what it was doing when it failed. For simple task processing, we find it helpful to have a task queue that you can fully understand: just a hundred lines of code for the subscriber and publisher logic, which we wrote ourselves. If you can get by without workflow and orchestration, Pub/Sub works well at high volume and it's cheap. Something to consider!
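For a sense of scale, the subscriber half looks roughly like this (project, subscription, and handler names are placeholders, not our production code):

    # Minimal Pub/Sub subscriber -- a sketch only.
    import json
    from google.cloud import pubsub_v1

    def handle(task):
        ...  # route the task to the right function

    def callback(message):
        handle(json.loads(message.data))
        message.ack()                      # only ack after the work succeeded

    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path("my-project", "tasks-sub")   # placeholders
    future = subscriber.subscribe(sub_path, callback=callback)
    future.result()                        # block and process messages forever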
I've been using Celery on a small scale for years now and generally haven't had any serious issues except one: reliably cancelling scheduled (future) tasks. Even using persistent revokes has been unreliable in my experience: https://docs.celeryproject.org/en/latest/userguide/workers.h...
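For anyone unfamiliar, "persistent revokes" boils down to roughly this combination (task id and path are placeholders):

    # Revoke a scheduled task by id -- only workers that are up hear the revoke,
    # and it only survives worker restarts when workers run with --statedb, e.g.
    #   celery -A proj worker --statedb=/var/run/celery/worker.state
    from proj.celery import app        # hypothetical app module

    app.control.revoke("d9078da5-9915-40a0-bfa1-392c7bde42ed")   # placeholder id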
Fwiw though, the alternatives (huey, dramatiq, rq) don't even seem to offer this feature.
I've been using Celery and it has been working okay for our needs, but it is not very friendly to changes, depending on how you write tasks. I'm fine when a task is a thin wrapper around other functions that you can run and test with any tools you'd like. But we have some tasks that go all in, binding all kinds of data and functionality to the task itself that isn't guaranteed to be there unless you launch it with .delay or similar, so you can't just spin up a REPL or debugger, call the function, and have it work.
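The shape that does work well for us is a thin shim over a plain function, roughly:

    # Thin-shim pattern: the task only forwards to a plain function, so a REPL,
    # debugger, or test can call generate_report() directly without any Celery setup.
    from proj.celery import app        # hypothetical app module

    def generate_report(account_id, since):
        ...                            # all the real logic lives here, no Celery imports

    @app.task
    def generate_report_task(account_id, since):
        return generate_report(account_id, since)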
I would put celery in that class of software that's good enough to be useful, but enough trouble that you'll never be truly happy with it.
We got a lot of mileage out of it (in a past life) before finally moving our job scheduling to a custom solution built on RQ. Celery caused more than a few headaches -- which is why we ditched it -- but its flexibility probably helped us scale up the service to begin with...
Celery is not great. At my place of work we are begrudgingly using it right now, with plans to replace it with something else this year. I am at the very least amazed we got this much use out of it. We're considering replacing it with a much simpler setup that just uses Google Cloud Pub/Sub and some simple Python routing of messages to functions.
My advice is to just pay someone to run your Celery setup for you.
Celery is hard to maintain, but the benefits are definitely worth it, and it’s the most mature cross language stack. I can have a celery job sent or received from any major language.
I think the only reason people pick another option is that they start off by only using Python in the first place.
I have been using it for four months. Its configuration is really complicated and many default values do not make sense.
For instance, when you are using the solo pool, there should be a default hard task timeout smaller than the effective heartbeat value; otherwise it will fail and retry the same task forever.
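A sketch of the kind of settings that end up having to be pinned by hand for that setup, assuming a Redis or SQS broker (values illustrative):

    # celeryconfig.py -- with -P solo one long task blocks the whole worker, so the
    # broker's redelivery window has to be longer than your longest task.
    task_acks_late = True
    broker_transport_options = {
        "visibility_timeout": 3600,   # Redis/SQS: seconds before an unacked task is redelivered
    }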
Consider where you see physical queues of people in the real world: e.g. people lining up to be served at a supermarket, to board an aircraft, to enter a concert. The purpose of the queue is to provide an orderly way to manage situations where there is more demand for a service than can be immediately processed. If the system tried to serve all the demand simultaneously, that could lead to chaos and no one getting served.
Similar situations happen in computer systems; in some of those cases, queues can be used to manage load.
E.g. when you order something online, there is a good chance a task queue (or something very similar) is part of the machinery making that happen. When you click "purchase", your order is persisted in some queue or DB before any attempt is made to fulfill it. The UI can give you immediate feedback telling you that your order was received: here's your order ID, sit back and relax, we'll send you an update. When some machine (or person) has free capacity and no higher-priority orders to process, it will grab your order and start processing it.
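In Celery terms, that "persist first, respond immediately" step looks roughly like this (names are invented for illustration):

    # Enqueue on purchase, respond right away -- names invented for illustration.
    from proj.celery import app        # hypothetical app module

    @app.task
    def fulfill_order(order_id):
        ...                            # charge the card, notify the warehouse, email the customer

    def purchase_view(request):
        order = create_order(request)          # persist the order first (hypothetical helper)
        fulfill_order.delay(order.id)          # hand the slow part to a worker
        return {"order_id": order.id, "status": "received"}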
Some worker queue implementations offer additional benefits in terms of robustness. E.g. if worker node A begins processing task T and then explodes before T is complete, the task may be returned to the queue after some timeout, and a surviving worker B may be able to grab T and process it successfully. This can be very valuable when you want to guarantee that task T is done correctly rather than half-done or forgotten -- e.g. processing and provisioning customer orders. It gets considerably more challenging to implement correctly if task T performs side effects on the real world (transferring money, dispatching vehicles) and the failure occurs after some, but not all, of those side effects have already happened.
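In Celery, that redelivery behaviour hinges on acknowledging the message only after the task finishes; a minimal sketch:

    # With late acks, a task whose worker dies mid-run goes back to the broker and
    # another worker can pick it up -- which is why the task body must be idempotent.
    from proj.celery import app        # hypothetical app module

    @app.task(acks_late=True)
    def provision_customer(order_id):
        ...                            # safe to run twice for the same order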
Queues also introduce new failure modes: if something stops workers from processing tasks and the queue gets longer and longer, how do you find out and fix it? The queue needs to be stored somewhere and has some maximum capacity; what happens when that is exceeded -- does the queue drop all the tasks on the floor, blow up, or something else?
Threading is local to a machine; task queues are generally a solution to distributing a large number of tasks across many distinct compute nodes by using some remote network service (redis, rmq, etc) to track the progress of jobs.
Even if Python had real threading, Celery has features that make it convenient for scheduling tasks, chaining them in different ways, and so on. You could write your own thing, of course. But it also operates across different processes, and those processes don't even have to be on the same machine.