Ask HN: Lightweight data analytics using SQLite, Bash and DuckDB – too simple?
4 points by hilti on Dec 28, 2023 | 3 comments
Over the last 12 months, in parallel to using Google BigQuery, I have built my own processing pipeline using SQLite and DuckDB.

What amazes me is that it works surprisingly well and costs much less than using BigQuery.

Roughly speaking, here's what I do: a SQLite database receives IoT sensor data via a very simple PHP function. I currently use the FlightPHP framework for this. The data is written to a table in the SQLite database (with WAL mode enabled), and the machines' states are updated via triggers.
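For context, here is a minimal sketch of the kind of schema this implies; the column names are taken from the trigger below, while the types and the primary key are illustrative assumptions, not the actual definitions:

    -- Enable write-ahead logging so reads don't block the ingest path
    PRAGMA journal_mode=WAL;

    -- Raw sensor messages as they arrive from the PHP endpoint
    CREATE TABLE messages (
      id              INTEGER,   -- machine id
      status          TEXT,
      time_stamp      TEXT,
      current_present REAL,
      voltage_present REAL
    );

    -- Latest known state per machine, kept up to date by the trigger
    CREATE TABLE states (
      id              INTEGER PRIMARY KEY,
      status          TEXT,
      time_stamp      TEXT,
      current_present REAL,
      voltage_present REAL
    );

With id as the primary key of states, the INSERT OR REPLACE in the trigger acts as an upsert, so states always holds at most one row per machine.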

Example of a trigger

    CREATE TRIGGER message_added AFTER INSERT ON messages
    BEGIN
      INSERT OR REPLACE INTO states
      VALUES (new.id, new.status, new.time_stamp, new.current_present, new.voltage_present);
    END;

This allows me to query the current status of a machine in real time. To do this, I again use a simple PHP function that provides the data via SSE. In the frontend, a simple JavaScript handler (plain vanilla JS) receives the JSON data and updates the HTML in real time.

    const source_realtime = new EventSource("https://myapi/sse_realtime_json");
    source_realtime.onmessage = function(event) {
        const json = JSON.parse(event.data);
        // ...then update the relevant DOM elements with the new values
    };
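The query behind such an endpoint is little more than a lookup into the states table; a sketch (column names as in the trigger above, the parameter name is illustrative):

    -- Current state of a single machine
    SELECT id, status, time_stamp, current_present, voltage_present
    FROM states
    WHERE id = :machine_id;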
For a historical analysis - for example over 24 months - I create a CSV export from the SQLite database and convert the CSV files into Parquet format.

I use a simple Bash script that I run regularly via a cron job.

Here is an excerpt

# Loop through the arrays and export each table to a CSV, then convert it
# to a Parquet file and load it into the DuckDB database
for (( i=0; i<${arrayLength}; i++ )); do
  db=${databases[$i]}
  table=${tables[$i]}

  echo "Processing $db - $table"
  
  # Export the SQLite table to a CSV file
  sqlite3 -header -csv $db "SELECT * FROM $table;" > parquet/$table.csv
  
  # Convert the CSV file to a Parquet file using DuckDB
  $duckdb_executable $duckdb_database <<EOF

 -- Set configurations
 SET memory_limit='2GB';
 SET threads TO 2;
 SET enable_progress_bar=true;

 COPY (SELECT * FROM read_csv_auto('parquet/$table.csv', header=True)) TO 'parquet/$table.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
 CREATE TABLE $table AS SELECT * FROM read_parquet('parquet/$table.parquet');
EOF

done
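Once the Parquet files exist, a historical analysis over 24 months boils down to a single DuckDB query against them. A sketch of what that can look like (the file name follows the $table naming above; the monthly aggregation and the cast of time_stamp to a timestamp are illustrative):

    -- Monthly event counts from the exported messages table
    SELECT date_trunc('month', CAST(time_stamp AS TIMESTAMP)) AS month,
           count(*) AS events
    FROM read_parquet('parquet/messages.parquet')
    GROUP BY 1
    ORDER BY 1;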

Now finally my question: Am I overlooking something? This little system currently works well for 15 million events per month. No outages, nothing like that. I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...

Somehow my system feels "too simple". So I'm sharing it with you in the hope of getting feedback.



You've discovered a little secret of the industry - most ultra-scalable big data solutions are complete overkill for many scenarios. I've used a technique similar to yours for 20 years and it's only fallen out of fashion in the last 5 years. Not because the workloads are too big but because the industry has willingly chosen to make it more complex (for reasons). Personal computers with local disk are big enough and powerful enough to handle analytics on even medium-sized workloads (into the tens of billions of rows / 100s of GB scale). There's nothing stopping you from doing the simple approach except for dogma and fashion.

The problem is team dynamics. Who executes that bash script and when? What happens if something goes wrong? How do you run it when you're offline, or traveling, or need to use your laptop for something else? How do you and your team track the progress and view the results? And since this is all running in the cloud, how do you track access, security, and costs? And the bus factor - what if you leave and your knowledge leaves with you? What about all the junior developers who want to use FancyNewThing for their resume, can we incorporate that somehow? You need a complex system to support the dozens of people who want to stick their hands in the pipeline; a perfect reflection of Conway's law. These are the organizational problems that "fancy" cloud data pipelines deal with. If you don't need such things, you can (and should) reduce the complexity by orders of magnitude.


Thank you for your great feedback! My solution currently runs on a cheap server at Hetzner. Regarding the bus factor: I comment my own code a lot, especially for myself, because after a few months I tend to forget why simple one-liners work ;-) But you're right: most junior developers want to use the latest FancyNewThing, downloading NPM packages and other dependencies to solve the simplest problems.

Is your technique still in use, or has it been replaced by something else?


Depends. I don't work much on data analytics these days so only a few of the systems I've put in place (using SQLite + bash scripts on a workstation) are still in use. Most have been shut down for organizational/business reasons. They did their job and typically the entire project was put out to pasture, not upgraded to other technology. Further reaffirming my suspicion that investing in such complexity up front was not warranted.

The biggest risk is that your software is successful and you're stuck with a ball of hacks that no one except you knows how to run or debug! Good documentation and automation can keep it alive, but not indefinitely. If your project is successful, you can expect it to get more complex by default.



