Ask HN: Lightweight data analytics using SQLite, Bash and DuckDB – too simple?
4 points by hilti on Dec 28, 2023 | 3 comments
Over the last 12 months, in parallel to using Google BigQuery, I have built my own processing pipeline using SQLite and DuckDB.

What amazes me is that it works surprisingly well and costs much less than using BigQuery.

Roughly speaking, here's what I do: a SQLite database receives IoT sensor data via a very simple PHP function. I currently use the FlightPHP framework for this. The data is written to a table in the SQLite database (with WAL mode enabled), and the machines' states are updated via triggers.
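For context, here is a minimal sketch of the kind of schema this implies; the column names are taken from the trigger below, while the types and the primary key are illustrative assumptions, not the actual definitions:

    -- Enable write-ahead logging so reads don't block the ingest path
    PRAGMA journal_mode=WAL;

    -- Raw sensor messages as they arrive from the PHP endpoint
    CREATE TABLE messages (
      id              INTEGER,   -- machine id
      status          TEXT,
      time_stamp      TEXT,
      current_present REAL,
      voltage_present REAL
    );

    -- Latest known state per machine, kept up to date by the trigger
    CREATE TABLE states (
      id              INTEGER PRIMARY KEY,
      status          TEXT,
      time_stamp      TEXT,
      current_present REAL,
      voltage_present REAL
    );

With id as the primary key of states, the INSERT OR REPLACE in the trigger acts as an upsert, so states always holds at most one row per machine.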

Example of a trigger

    CREATE TRIGGER message_added AFTER INSERT ON messages
    BEGIN
      INSERT OR REPLACE INTO states
      VALUES (new.id, new.status, new.time_stamp, new.current_present, new.voltage_present);
    END;

This allows me to query the current status of a machine in real time. To do this, I again use a simple PHP function that provides the data via SSE. In the frontend, a simple JavaScript handler (plain vanilla JS) receives the JSON data and updates the HTML in real time.

    const source_realtime = new EventSource("https://myapi/sse_realtime_json");
    source_realtime.onmessage = function(event) {
        const json = JSON.parse(event.data);
        // ...then update the relevant DOM elements with the new values
    };
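The query behind such an endpoint is little more than a lookup into the states table; a sketch (column names as in the trigger above, the parameter name is illustrative):

    -- Current state of a single machine
    SELECT id, status, time_stamp, current_present, voltage_present
    FROM states
    WHERE id = :machine_id;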
For a historical analysis - for example over 24 months - I create a CSV export from the SQLite database and convert the CSV files into Parquet format.

I use a simple Bash script that I run regularly via a cron job.

Here is an excerpt

# Loop through the arrays and export each table to a CSV, then convert it
# to a Parquet file and load it into the DuckDB database
for (( i=0; i<${arrayLength}; i++ )); do
  db=${databases[$i]}
  table=${tables[$i]}

  echo "Processing $db - $table"
  
  # Export the SQLite table to a CSV file
  sqlite3 -header -csv $db "SELECT * FROM $table;" > parquet/$table.csv
  
  # Convert the CSV file to a Parquet file using DuckDB
  $duckdb_executable $duckdb_database <<EOF

 -- Set configurations
 SET memory_limit='2GB';
 SET threads TO 2;
 SET enable_progress_bar=true;

 COPY (SELECT * FROM read_csv_auto('parquet/$table.csv', header=True)) TO 'parquet/$table.parquet' (FORMAT 'PARQUET', CODEC 'ZSTD');
 CREATE TABLE $table AS SELECT * FROM read_parquet('parquet/$table.parquet');
EOF

done
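Once the Parquet files exist, a historical analysis over 24 months boils down to a single DuckDB query against them. A sketch of what that can look like (the file name follows the $table naming above; the monthly aggregation and the cast of time_stamp to a timestamp are illustrative):

    -- Monthly event counts from the exported messages table
    SELECT date_trunc('month', CAST(time_stamp AS TIMESTAMP)) AS month,
           count(*) AS events
    FROM read_parquet('parquet/messages.parquet')
    GROUP BY 1
    ORDER BY 1;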

Now finally my question: Am I overlooking something? This little system currently works well for 15 million events per month. No outages, nothing like that. I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...

Somehow my system feels "too simple". So I'm sharing it with you in the hope of getting feedback.



You've discovered a little secret of the industry - most ultra-scalable big data solutions are complete overkill for many scenarios. I've used a technique similar to yours for 20 years and it's only fallen out of fashion in the last 5 years. Not because the workloads are too big but because the industry has willingly chosen to make it more complex (for reasons). Personal computers with local disk are big enough and powerful enough to handle analytics on even medium-sized workloads (into the tens of billions of rows / 100s of GB scale). There's nothing stopping you from doing the simple approach except for dogma and fashion.

The problem is team dynamics. Who executes that bash script and when? What happens if something goes wrong? How do you run it when you're offline, or traveling, or need to use your laptop for something else? How do you and your team track the progress and view the results? And since this is all running in the cloud, how do you track access, security, and costs? And the bus factor - what if you leave and your knowledge leaves with you? What about all the junior developers who want to use FancyNewThing for their resume, can we incorporate that somehow? You need a complex system to support the dozens of people who want to stick their hands in the pipeline; a perfect reflection of Conway's law. These are the organizational problems that "fancy" cloud data pipelines deal with. If you don't need such things, you can (and should) reduce the complexity by orders of magnitude.


Thank you for your great feedback! My solution currently runs on a cheap server at Hetzner. Regarding the bus factor: I comment my own code a lot, especially for myself, because after a few months I tend to forget why simple one-liners work ;-) But you're right: most junior developers want to use the latest FancyNewThing, downloading NPM packages and other dependencies to solve the simplest problems.

Is your technique still in use, or has it been replaced by something else?


Depends. I don't work much on data analytics these days so only a few of the systems I've put in place (using SQLite + bash scripts on a workstation) are still in use. Most have been shut down for organizational/business reasons. They did their job and typically the entire project was put out to pasture, not upgraded to other technology. Further reaffirming my suspicion that investing in such complexity up front was not warranted.

The biggest risk is that your software is successful and you're stuck with a ball of hacks that no one except you knows how to run or debug! Good documentation and automation can keep it alive, but not indefinitely. If your project is successful, you can expect it to get more complex by default.



