
I really like using make for data pipelines as you suggest, and thanks for pointing out your package.

In this pipeline use case, you have base data and a series of transformations that massage it into usable results. You are always revising the pipeline, usually at the output end (but not always), so you want to skip as many preprocessing steps as possible. Make automates all that.
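For concreteness, here's a minimal sketch of that pattern as a Makefile (the file and script names are hypothetical, and recipe lines must be indented with a tab):

    # Two-stage pipeline: raw.csv -> clean.csv -> results.csv.
    # A target is rebuilt only when a prerequisite is newer, so editing
    # summarize.py reruns just the last step and skips the cleaning step.
    results.csv: clean.csv summarize.py
            python summarize.py clean.csv > $@

    clean.csv: raw.csv clean.py
            python clean.py raw.csv > $@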

This works great for image processing pipelines, science data pipelines, and physical simulators, to name a few examples.

There have been a few blog posts and ensuing HN discussions about this use pattern for make. Alas, the discussion generally gets mixed up with make’s use as a build system for code.



Back in my early days of data engineering for ML pipelines, I stumbled onto Drake, and it opened my eyes to managing pipelines as DAGs. This pattern is supremely effective, and I try to teach it to anyone who might benefit.

https://www.factual.com/blog/introducing-drake-a-kind-of-mak...


Drake looks interesting; the visualization looks fun.

I don't see any evidence that it handles rerunning of steps when transitively depended-on code is changed, though. E.g., if script BBB.py includes CC, and CC is changed, then all steps transitively depending on BBB should be rerun. My make-booster specifically deals with that case.
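In plain make you can approximate that by hand-listing the code a script transitively depends on as extra prerequisites, which is exactly the bookkeeping that's easy to forget (hypothetical file names; make-booster automates this):

    # out.csv must be rebuilt when BBB.py *or* CC.py (which BBB.py
    # imports) changes. Forgetting to list CC.py here is the silent
    # staleness bug that automatic dependency tracking guards against.
    out.csv: input.csv BBB.py CC.py
            python BBB.py input.csv > $@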

I also expect Drake to have a slow startup, which slows development.


Thanks, and I agree, make can work well for data pipelines.

When you're integrating many different data sources with a complicated set of scripts, it's important to automate what you can. The easy but impractical thing to do is rerun everything. Make, properly used, will rerun everything that needs to be run in the correct order... and nothing else. GNU make is also awesome for running things in parallel.
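For example, when independent branches of the pipeline exist, one flag runs them concurrently:

    # GNU make: run up to 4 independent steps at once.
    make -j4 results.csv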



