We tried to set up Airflow on our team in the past. The big problem we encountered is that its unit of management (I believe it's called a "job," but I'm rusty on this) is too low-level. Our pipeline processes a lot of data and we run millions of jobs per day. Once Airflow has a (planned or unplanned) outage, tens of thousands of jobs start piling up, and it never recovers from that.
In the end we replaced our data orchestration with a stateless lambda that, for a configured time interval, 1/ looks at what output data is missing, 2/ cross-references that with running jobs (in AWS Batch), and 3/ submits jobs for missing data that has no job. The jobs themselves are essentially stateless. They are never restarted and we don't even look at their status. If one fails, we notice because there will be a hole in the output, and we therefore submit a new one. Some safety precautions are added to prevent a job from repeatedly failing, but that's the exception.
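The three steps above amount to a simple reconciliation loop. Here is a minimal sketch of that idea (not the actual code; `reconcile` and its inputs are hypothetical stand-ins for the S3 listing and AWS Batch lookups a real lambda would do):

```python
# Sketch of the stateless reconciliation loop: submit a job for every
# expected output that neither exists nor has a job already running.
# All names here are illustrative, not from the original system.

def reconcile(expected_keys, existing_outputs, running_jobs, submit):
    """Return the keys for which a new job was submitted."""
    submitted = []
    for key in expected_keys:
        if key in existing_outputs:
            continue  # 1/ output already present, nothing to do
        if key in running_jobs:
            continue  # 2/ a job is already producing this output
        submit(key)   # 3/ fill the hole with a fresh job
        submitted.append(key)
    return submitted

# Example: four hourly outputs expected; hour 01 exists, hour 02 is in flight.
expected = ["2024-01-01T00", "2024-01-01T01", "2024-01-01T02", "2024-01-01T03"]
outputs = {"2024-01-01T01"}
running = {"2024-01-01T02"}
submitted = reconcile(expected, outputs, running, submit=lambda k: None)
# submitted == ["2024-01-01T00", "2024-01-01T03"]
```

Because the loop keys everything off the desired output rather than job state, a failed or lost job needs no special handling: its missing output simply shows up again on the next tick.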
Maybe Airflow has moved on from when we last tried it. But this was our experience.
> The big problem we encountered is that its unit of management (I believe it's called a "job," but I'm rusty on this) is too low-level. Our pipeline processes a lot of data and we run millions of jobs per day. Once Airflow has a (planned or unplanned) outage, tens of thousands of jobs start piling up, and it never recovers from that.
That sounds more like an architecture-at-scale problem than something that is Airflow's 'fault.' Airflow may never have been the right tool for the job but it's getting all the blame.
We notice repeated failures because we have metrics on our "up-to-dateness", and those metrics will stall. We also send logs to CloudWatch Logs and alarm on a certain threshold of errors. Once an alarm fires, we investigate manually to see why the job is failing. This happens occasionally, but not too often. While we are investigating, we are spinning up repeat jobs with some frequency, but this hasn't proved to be a problem.
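The "safety precaution" mentioned earlier can be as small as capping resubmissions per missing output. A hypothetical sketch (the threshold and function name are assumptions, not the original implementation):

```python
# Hypothetical safety valve: stop resubmitting a job for the same missing
# output after a fixed number of attempts, leaving it for a human to
# investigate once the alarm fires.

MAX_ATTEMPTS = 3  # assumed threshold, not from the original system

def should_resubmit(key, attempt_counts, max_attempts=MAX_ATTEMPTS):
    """Return True if another job may be submitted for this output.
    Increments the per-key attempt counter as a side effect."""
    attempts = attempt_counts.get(key, 0)
    if attempts >= max_attempts:
        return False  # give up; wait for manual investigation
    attempt_counts[key] = attempts + 1
    return True

counts = {}
results = [should_resubmit("2024-01-01T05", counts) for _ in range(5)]
# results == [True, True, True, False, False]
```

The attempt counts would need to live somewhere durable (e.g. DynamoDB) since the lambda itself is stateless, but the logic is this simple.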