
I really like using make for data pipelines as you suggest, and thanks for pointing out your package.

In this pipeline use case, you have base data and a series of transformations that massage it into usable results. You are always revising the pipeline, usually at the output end (but not always), so you want to skip as many preprocessing steps as possible. Make automates all that.
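For concreteness, here's a minimal sketch of that pattern as a Makefile (the file and script names are hypothetical, and recipe lines must be indented with a tab):

    # Two-stage pipeline: raw.csv -> clean.csv -> results.csv.
    # A target is rebuilt only when a prerequisite is newer, so editing
    # summarize.py reruns just the last step and skips the cleaning step.
    results.csv: clean.csv summarize.py
            python summarize.py clean.csv > $@

    clean.csv: raw.csv clean.py
            python clean.py raw.csv > $@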

This works great for image processing pipelines, science data pipelines, and physical simulators, to name a few examples.

There have been a few blog posts and ensuing HN discussions about this use pattern for make. Alas, the discussion generally gets mixed up with make’s use as a build system for code.



Back in my early days of data engineering for ML pipelines, I stumbled onto Drake, and it opened my eyes to managing pipelines as DAGs. This pattern is supremely effective, and I try to teach it to anyone who might benefit.

https://www.factual.com/blog/introducing-drake-a-kind-of-mak...


Drake looks interesting; the visualization looks fun.

I don't see any evidence that it handles rerunning of steps when transitively depended-on code is changed, though. E.g., if script BBB.py includes CC, and CC is changed, then all steps transitively depending on BBB should be rerun. My make-booster specifically deals with that case.
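In plain make you can approximate that by hand-listing the code a script transitively depends on as extra prerequisites, which is exactly the bookkeeping that's easy to forget (hypothetical file names; make-booster automates this):

    # out.csv must be rebuilt when BBB.py *or* CC.py (which BBB.py
    # imports) changes. Forgetting to list CC.py here is the silent
    # staleness bug that automatic dependency tracking guards against.
    out.csv: input.csv BBB.py CC.py
            python BBB.py input.csv > $@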

I also expect Drake to have a slow startup, which slows development.


Thanks, and I agree, make can work well for data pipelines.

When you're integrating many different data sources with a complicated set of scripts, it's important to automate what you can. The easy but impractical thing to do is rerun everything. Make, properly used, will rerun everything that needs to be run in the correct order... and nothing else. GNU make is also awesome for running things in parallel.
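For example, when independent branches of the pipeline exist, one flag runs them concurrently:

    # GNU make: run up to 4 independent steps at once.
    make -j4 results.csv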



