We tried to set up Airflow on our team in the past. The big problem we encountered is that its unit of management (I believe it's called a "job," but I'm rusty on this) is too low-level. Our pipeline processes a lot of data and we run millions of jobs per day. Once Airflow has a (planned or unplanned) outage, tens of thousands of jobs start piling up, and it never recovers from that.
In the end we replaced our data orchestration with a stateless lambda that, for a configured time interval, 1/ looks at what output data is missing, 2/ cross-references that with running jobs (in AWS Batch), and 3/ submits jobs for missing data that has no job. The jobs themselves are essentially stateless. They are never restarted and we don't even look at their status. If one fails, we notice because there will be a hole in the output, and we therefore submit a new one. Some safety precautions are added to prevent a job from repeatedly failing, but that's the exception.
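The three steps above amount to a simple reconciliation loop. Here is a minimal sketch of that idea (not the actual code; `reconcile` and its inputs are hypothetical stand-ins for the S3 listing and AWS Batch lookups a real lambda would do):

```python
# Sketch of the stateless reconciliation loop: submit a job for every
# expected output that neither exists nor has a job already running.
# All names here are illustrative, not from the original system.

def reconcile(expected_keys, existing_outputs, running_jobs, submit):
    """Return the keys for which a new job was submitted."""
    submitted = []
    for key in expected_keys:
        if key in existing_outputs:
            continue  # 1/ output already present, nothing to do
        if key in running_jobs:
            continue  # 2/ a job is already producing this output
        submit(key)   # 3/ fill the hole with a fresh job
        submitted.append(key)
    return submitted

# Example: four hourly outputs expected; hour 01 exists, hour 02 is in flight.
expected = ["2024-01-01T00", "2024-01-01T01", "2024-01-01T02", "2024-01-01T03"]
outputs = {"2024-01-01T01"}
running = {"2024-01-01T02"}
submitted = reconcile(expected, outputs, running, submit=lambda k: None)
# submitted == ["2024-01-01T00", "2024-01-01T03"]
```

Because the loop keys everything off the desired output rather than job state, a failed or lost job needs no special handling: its missing output simply shows up again on the next tick.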
Maybe Airflow has moved on from when we last tried it. But this was our experience.
> The big problem we encountered is that its unit of management (I believe it's called a "job," but I'm rusty on this) is too low-level. Our pipeline processes a lot of data and we run millions of jobs per day. Once Airflow has a (planned or unplanned) outage, tens of thousands of jobs start piling up, and it never recovers from that.
That sounds more like an architecture-at-scale problem than something that is Airflow's 'fault.' Airflow may never have been the right tool for the job but it's getting all the blame.
We notice repeated failures because we have metrics on our "up-to-dateness", and those metrics will stall. We also send logs to CloudWatch Logs and alarm on a certain threshold of errors. Once an alarm fires, we investigate manually to see why the job is failing. This happens occasionally, but not too often. While we are investigating, we are spinning up repeat jobs with some frequency, but this hasn't proved to be a problem.
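The "safety precaution" mentioned earlier can be as small as capping resubmissions per missing output. A hypothetical sketch (the threshold and function name are assumptions, not the original implementation):

```python
# Hypothetical safety valve: stop resubmitting a job for the same missing
# output after a fixed number of attempts, leaving it for a human to
# investigate once the alarm fires.

MAX_ATTEMPTS = 3  # assumed threshold, not from the original system

def should_resubmit(key, attempt_counts, max_attempts=MAX_ATTEMPTS):
    """Return True if another job may be submitted for this output.
    Increments the per-key attempt counter as a side effect."""
    attempts = attempt_counts.get(key, 0)
    if attempts >= max_attempts:
        return False  # give up; wait for manual investigation
    attempt_counts[key] = attempts + 1
    return True

counts = {}
results = [should_resubmit("2024-01-01T05", counts) for _ in range(5)]
# results == [True, True, True, False, False]
```

The attempt counts would need to live somewhere durable (e.g. DynamoDB) since the lambda itself is stateless, but the logic is this simple.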