I love SQLite, and I'm in no way trying to devalue it; the author's method is an excellent approach to getting analytical speed out of SQLite. But I've been loving DuckDB for similar analytical workloads, since it's built for exactly those tasks. Like SQLite, DuckDB reads from a single file, and it processes large data sets at extreme speed. On my MacBook M2 I've been working with about 20 million records, and it is fast, very fast.
Loading data into DuckDB is super easy; I was surprised:
SELECT
avg(sale_price),
count(DISTINCT customer_id)
FROM '/my-data-lake/sales/2024/*.json';
You can also load raw JSON into a JSON-typed column and use Postgres-style operators to pull out fields:
col->>'$.key'
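As a minimal sketch of that in DuckDB's Python API (the table, column, and key names here are made up for illustration):

import duckdb

con = duckdb.connect()  # in-memory database

# Hypothetical table with a JSON-typed column
con.execute("CREATE TABLE events (payload JSON)")
con.execute("""INSERT INTO events VALUES ('{"key": "a", "amount": 42}'), ('{"key": "b", "amount": 7}')""")

# ->> extracts as text, -> extracts as JSON; both accept '$.key'-style paths
print(con.sql("SELECT payload->>'$.key' AS key, payload->'$.amount' AS amount FROM events").fetchall())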
DuckDB is super fast for analytics tasks, especially when you use it with a visual EDA tool like pygwalker. It lets you explore and visualize millions of rows in seconds.
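As a rough sketch of that workflow (assuming pygwalker is installed in a notebook environment; the glob path is borrowed from the comment above), you can run the heavy aggregation in DuckDB and hand only the result to pygwalker:

import duckdb
import pygwalker as pyg

# Aggregate in DuckDB first, so only the summarised result goes into pandas
df = duckdb.sql(
    "SELECT customer_id, avg(sale_price) AS avg_price, count(*) AS n_sales "
    "FROM '/my-data-lake/sales/2024/*.json' "
    "GROUP BY customer_id"
).df()

pyg.walk(df)  # opens the drag-and-drop EDA UI in the notebook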
That said, comparing DuckDB and SQLite is a little unfair. I would still use SQLite to build systems in most cases and reach for DuckDB only for analytics; it's hard to get a smooth deployment across many platforms if your app bundles DuckDB.
Depending on the size and needs of the distributed system or application, I'm really excited about Postgres + pg_lake. Postgres has blown my mind with how well it handles concurrent writes, at least for the kinds of things I build and support for my org, and the pg_lake extension then adds the ability to work like a data-lake-style analytics engine: it decides whether a transaction goes down the normal query path or gets handed to DuckDB, which brings huge aggregation-style queries over massive datasets into reach.
Someone should smush SQLite + DuckDB together and do that kind of switching depending on query type.
It's not an index, it's just (probably parallel) file reads
That being said, it would be trivial to tweak the above script into two steps, one reading data into a DuckDB database table, and the second one reading from that table.
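Roughly, that two-step version could look like this (a sketch using the same hypothetical path; the table name is made up):

import duckdb

con = duckdb.connect('sales.duckdb')  # persistent single-file database

# Step 1: materialise the JSON files into a native DuckDB table once
con.execute("CREATE OR REPLACE TABLE sales AS SELECT * FROM '/my-data-lake/sales/2024/*.json'")

# Step 2: later queries read the local table instead of re-scanning the files
print(con.sql("SELECT avg(sale_price), count(DISTINCT customer_id) FROM sales").fetchone())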