Hydra: Column-Oriented Postgres (hydra.so)
116 points by hankchinaski on Aug 25, 2023 | 73 comments


Nobody mentioned this yet, but the design on this site really feels unique and crafted with passion. Good job!


It is beautiful and unique, but I have a hard time making the connections between a database and the ancient elements.


Column based database and Ancient Greek columns?


Also the mythical Hydra


It is! I noticed a lot of startups recently are building on Webflow!


An intro article that explains the what and why: https://www.hydra.so/blog-posts/2022-12-13-how-we-built-fast...


Can somebody explain in the most basic way possible what column oriented databases can do for me and some example use cases?


Say you have a billion records in your table. If you want to filter the records on a column which has values that are on average 30 bytes, then an index on that column would be 30GB (uncompressed). It becomes extremely impractical to hold those indices in memory if you have several / dozens / hundreds.

The problem with row based database structures is that the data between any two rows can vary in length, so if scanning across rows on disk for a column's value, you need metadata that points at the start of each column. That means the process of scanning across the data on disk involves a _lot_ of jumping around; less of a problem on SSDs than it was on spinning platters, but it's still work.

The alternate approach is to store columns together on disk. That means that when performing a query that filters based on values in a given column, you can scan sequential areas on disk, which is much much faster. Of course, if you now want to assemble rows based on your filter, you need to use the metadata about where related values are in each other column in order to assemble those rows, but hopefully you have a much smaller set of records that you need to do this over. (It's cheaper to assemble 10k matching rows across a set of columns than it is to jump across a billion unaligned columns to filter them in the first place.)
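
A toy sketch of the difference in Python, with made-up data: filter one contiguous column first, then assemble full rows only for the matches.

    # Hypothetical tiny table stored two ways.
    rows = [(1, "Fizz", "US"), (2, "Buzz", "CA"), (3, "Fizz", "CA")]          # row store
    ids, products, regions = [1, 2, 3], ["Fizz", "Buzz", "Fizz"], ["US", "CA", "CA"]  # column store

    # Row store: the filter has to walk every (variable-length) row.
    matches_row = [r for r in rows if r[1] == "Fizz"]

    # Column store: scan one contiguous column, then assemble only the hits.
    hits = [i for i, p in enumerate(products) if p == "Fizz"]
    matches_col = [(ids[i], products[i], regions[i]) for i in hits]

    assert matches_row == matches_col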


One thing I've seen about other columnar databases (e.g. Clickhouse) is that they're bad at k/v style lookups. Why is that? Your explanation would make me think otherwise, as we would be able to quickly scan the columns using our sequential reads, ideally.


Think of columnar data as being run-length compressed.

(If you have 100 rows of "1 Fizz" followed by 200 rows of "1 Buzz", followed by one "1 Fizz" you can store that as "100:1:Fizz;200:1:Buzz;1:1:Fizz".)

Then, a mere 30 bytes of IO + a quick sum tells you that you sold 101 Fizz and 200 Buzz from the widget factory this quarter. Maybe the first Fizz is how many you sold in the USA and the second Fizz is how many you sold in Canada, so it becomes easy to do by-region reporting.

However, unpacking the column that has all your individual transaction IDs and linking it to the individual single sale of one Fizz takes a lot of lookups. So untangling individual transactions becomes slow (though not impossible).
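
A rough Python sketch of that run-length idea (the encoding here is invented purely for illustration):

    from collections import defaultdict

    # "100:1:Fizz;200:1:Buzz;1:1:Fizz" -> (run_length, qty, product) runs
    runs = [(100, 1, "Fizz"), (200, 1, "Buzz"), (1, 1, "Fizz")]

    totals = defaultdict(int)
    for run_length, qty, product in runs:
        totals[product] += run_length * qty   # aggregate without expanding the runs

    print(dict(totals))   # {'Fizz': 101, 'Buzz': 200}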


1) If they're only using on-disk indices, it'll be slower to look up any one record than it would be if the index was in memory (particularly true if a good chunk of the data was in memory also)

2) The operation of retrieving a full row of data is more expensive in a column based db than in a row based db, because in row based the data for the row is contiguous on disk. So row based is likely optimal if your data size works with that architecture. (And, to be clear, row based _can_ work with billions of rows, if you're thoughtful about what kind of querying you'll be doing against such tables, and how you maintain them.)


Instead of storing your data grouped by record, column databases store data grouped by column. This means that if your query is only interested in one or a few columns but from a large number of records, e.g. to aggregate it to calculate an average, it needs to read fewer database blocks from disk, so lower IO costs. The same goes for in-memory or cached data: the data you're interested in is more easily accessible at a lower performance cost.

This is especially useful for reporting/analytical purposes where you want to aggregate over data from many records.


Thanks. So if I understand correctly, this is basically like ignoring all the other columns (that you're not interested in)?

And do you run them in conjunction with your existing Postgres, install it as a plugin, or switch your entire db to something like Hydra?


Yes, the downside being that if you need all the columns of one specific record (row), it will be spread over many column blocks.

As a rule of thumb, if most of your queries are aggregations over few columns of many records (OLAP), column based would be a better choice - whereas if you need to access every column/field of a small number of records (OLTP), record/row based would be a better choice.


Usually you choose the type of database based on your use case. Reporting databases, like data warehouses, benefit from using a column oriented database, whereas transactional use cases use a row oriented database. Modern databases allow you to choose per table.


And in addition, each column, or even each partition of a column, can store statistics about itself like min, max, avg, count, etc. That way, if you look up max(column), you only have to read the metadata on the partitions of that column, which is very fast. And for analytics this is pretty common.
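
A toy sketch of that kind of per-partition metadata in Python (sometimes called zone maps; the numbers are made up):

    # Each partition of a column keeps tiny stats next to the compressed data.
    partitions = [
        {"min": 3,  "max": 17, "count": 1_000_000},
        {"min": 12, "max": 42, "count": 1_000_000},
    ]

    # max(column) only needs the metadata, never the raw values.
    col_max = max(p["max"] for p in partitions)

    # A filter like "WHERE col > 50" can skip both partitions entirely.
    to_scan = [p for p in partitions if p["max"] > 50]

    print(col_max, len(to_scan))   # 42 0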


Not an expert on traditional databases.

Do they also have a benefit similar to the compression that comes from record re-ordering in column databases?

It's not just that you need to read a smaller section of disk to scan the same amount of interesting data - it's that the data also takes up considerably less space than it otherwise would.


Hi, Hydra cofounder here. Catching up on the thread since we were submitted by the community!

Yes, it depends on the data, but we've observed customers with 10X data compression over Postgres (heap) tables, although ~5-7X is more common.


In row oriented DBs data is laid out row by row. In order to read a column of data, the DB has to read through the entire row to get to that column's value in the next row. When computing an analytical query like the sum of a single column, the DB wastes cycles skipping through all the other columns in each row.

In columnar databases a batch of column data is stored sequentially followed by the next column and so on. So if you need to sum a column you can read it sequentially very quickly and sum it up without needing to touch the data of any other column.

Columnar is also better at filtering and grouping for similar reasons.

As a general rule if your workload looks like “select sum(x) where y group by X” a columnar db will be a great fit.
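
As a toy illustration in Python (column names and data invented), that query shape is just a pass over a few contiguous arrays:

    from collections import defaultdict

    # Column store: x, the filter column, and the group key are separate arrays.
    x     = [10, 20, 30, 40]
    y     = [True, True, False, True]    # result of the WHERE predicate
    group = ["a", "b", "a", "b"]

    # SELECT group, sum(x) ... WHERE y GROUP BY group
    sums = defaultdict(int)
    for xi, keep, g in zip(x, y, group):
        if keep:
            sums[g] += xi

    print(dict(sums))   # {'a': 10, 'b': 60}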

The trade-off is that columnar databases don't like being updated and work best when you only insert data. This is because when you update existing data you now need to rewrite several pages on disk, because the columns are all over the place.

Because of this there are very few areas where a pure columnar DB makes sense. They're great for timeseries metrics like sensor data, click event logs, etc.: places where you are just recording a fact and don't need to update past events.


Thanks. So they are specifically for aggregate queries?


Pretty much. Analytical aggregates are where they shine.


The keyword you want to look for is: SIMD (Single Instruction Multiple Data) optimization at hardware level.

This is how many aggregate functions can run real fast.

To take advantage of that, data has to be structured a certain way: in a columnar way (which can then be vectorized and pushed to SIMD).
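
A rough illustration with NumPy standing in for the vectorized kernels a real engine runs in C/SIMD (data is made up):

    import numpy as np

    # Row-ish layout: a list of dicts, summed with an interpreted loop.
    rows = [{"id": i, "amount": i % 7, "note": "..."} for i in range(1_000_000)]
    slow_sum = sum(r["amount"] for r in rows)

    # Columnar layout: one contiguous int array, summed with a tight vectorized loop.
    amounts = np.fromiter((i % 7 for i in range(1_000_000)), dtype=np.int64)
    fast_sum = int(amounts.sum())

    assert slow_sum == fast_sum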


My understanding has always been:

1. No indexes or primary keys

2. Insert and delete only, no updates

3. Automatically normalized behind the scenes leading to data compression in the 10x-100x range

4. Very fast analytical queries

BigQuery, RedShift, etc are columnar databases (for example).


Microsoft SQL Server ColumnStore and SAP HANA both support indexes, point updates, etc...

You can convert a table in SQL Server to "Clustered ColumnStore" storage, it'll become compressed, but everything will mostly work the same.

The implementation in SQL Server uses a row "delta store" to keep recent changes. Once a certain threshold is reached, the delta store is converted to columnar format and merged with the underlying table.
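
A toy Python sketch of that delta-store pattern (threshold and structures invented; SQL Server's real implementation is far more involved):

    COLUMNS = ("id", "amount")
    THRESHOLD = 3                                 # real systems use ~1M rows

    column_store = {c: [] for c in COLUMNS}       # "compressed" columnar segments
    delta_store = []                              # recent changes, kept row-wise

    def insert(row):
        delta_store.append(row)
        if len(delta_store) >= THRESHOLD:         # threshold reached: convert the
            for r in delta_store:                 # delta store to columnar format
                for c in COLUMNS:                 # and merge it into the table
                    column_store[c].append(r[c])
            delta_store.clear()

    for i in range(4):
        insert({"id": i, "amount": i * 10})

    print(column_store["amount"], delta_store)    # [0, 10, 20] [{'id': 3, 'amount': 30}]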


When data is stored in columns, you can add it up very fast. The traditional use case is analytical reporting.


I'm not sure I even understand what "data stored in columns" actually means. Can you explain it to me like I'm 6?


Database tables are stored as an "array of structures". If a row has someone's ID, name, and age, then on disk it'll be laid out like so:

[#001][John Smith][25];[#002][Jane Doe][32];[#003][Kane Citizen][62];[...

Columnar databases store the data as a "structure of arrays", keeping each type of value together:

[#001][#002][#003][...;[John Smith][Jane Doe][Kane Citizen][...;[25][32][62][...

With row-based storage, the database can look up everything it needs to show a form to the user with a single disk I/O operation: just read a block of data starting at the location where the record row is.

With column-based storage, the database has to do an I/O per column, which seems worse... but these days on SSD storage the overhead is negligible when displaying one form at a time.

Where columnar wins massively is reporting workloads. If you want to know the average age of employees, then with row storage you pretty much have to read everything in, including their names and ID numbers, even though you don't need that data. This is because all disk storage comes as "block devices" that read a minimum of 512 bytes, but typically more like 64KB at a time. You get more than you want, and then you have to throw it away. With column storage, you can read just the age data to compute the average age.

With column storage there are also compression techniques that work a lot better than with row storage. Run-length encoding, lookup tables, etc... provide higher compression ratios because the data is much more uniform for much larger chunks.

The typical end-result is 100x less I/O to run the same query.
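
Back-of-the-envelope numbers (every size here is an assumption, just to make the ratio concrete):

    n_employees = 10_000_000
    row_bytes   = 200          # id + name + age + everything else in the record
    age_bytes   = 4            # one small int per employee

    row_store_io = n_employees * row_bytes        # must drag whole rows off disk
    col_store_io = n_employees * age_bytes        # read only the age column
    col_store_io_compressed = col_store_io // 10  # plus, say, 10x compression

    print(row_store_io // col_store_io)            # 50x less I/O before compression
    print(row_store_io // col_store_io_compressed) # 500x with compression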


Each cell in a column is stored one after another in memory, so you have contiguous memory for one column. Effectively, think of having a text file for each column, with each cell on its own line; to reconstruct a row you have to get its line from every column file. To sum a column you only have to scan the one file for that column.

In row oriented storage, each cell of each row is next to the others in memory. Think traditional CSV files, where each row is a line in the file. To sum a column you have to scan the whole file and all the data for all columns.


Assuming you knew C and CPU arch terms like SIMD at age 6, imagine an array of complex structs compared to an int[]. The latter can be read directly from disk and aggregated (i.e. summed) very fast via SIMD instructions, whereas a dynamically sized structure for rows (like how traditional databases store them) requires a lot of branch and dereference logic.


> most basic way possible

It's like a database engine that uses grep under the hood.

Data is in flat files. Queries are executed by scanning those files (as quickly as possible).

For aggregate queries (which is what these things are used for), substitute awk for grep.


And a bit further down the rabbit hole if you need it: the structure of a heap table in PostgreSQL, with pictures. https://medium.com/quadcode-life/structure-of-heap-table-in-...


Dishonest for them to use the default "Postgres" for the ClickBench comparison instead of "Postgres (tuned)", which uses similar config settings to Hydra and tells a very different story.

https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...


Thanks, that was useful to know. It feels like software markets have been moving from the best solution to those with either distribution (MS/AWS/GOOG) or marketing. It's posts like this that keep the market "honest".


“Tuned” how?


The default "Postgres" uses:

    shared_buffers = 128MB
    max_parallel_workers = 8
    max_parallel_workers_per_gather = 2
    work_mem = 4MB
"Postgres (tuned)" uses:

    shared_buffers = 8GB (64x the default)
    max_parallel_workers = 16
    max_parallel_workers_per_gather = 8
    work_mem = 4MB
"Hydra" uses:

    shared_buffers = 8GB
    max_parallel_workers = 16
    max_worker_processes = 32 (default 8)
    work_mem = 64MB
The default settings are not realistic when running on a c6a.4xlarge instance, and the "Postgres (tuned)" settings more closely match what "Hydra" is configured to use.

I consider it dishonest because I do not believe a team working this close to Postgres internals wouldn't know this.


Yes, the settings between Hydra and Postgres (tuned) are somewhat aligned. But Postgres (tuned) also creates a bunch of indexes (https://github.com/ClickHouse/ClickBench/blob/main/postgresq...), which add ~50GB of extra storage.


> Yes, the settings between Hydra and Postgres (tuned) are somewhat aligned.

Only somewhat, since Hydra is essentially allowed to use 4 times as many CPU cores.


Equivalent settings make the comparison more useful, and storage-conscious readers will pick up that Postgres uses 4-6x more storage.


Agreed, but also, IMO, using indexes is also what gives a performance boost. Just looking at the queries that use string search, they have the best results overall in the complete test. Postgres (tuned) is designed for this specific ClickBench table, and there's nothing wrong with that.


https://github.com/ClickHouse/ClickBench/blob/main/postgresq...

Basically larger buffers and more concurrent workers.


Show HN: Hydra 1.0 – open-source column-oriented Postgres - https://news.ycombinator.com/item?id=36987920 - Aug 2023 (80 comments)


And just like last time, watch out for the misleading GitHub license detector: it's not entirely Apache as the GitHub summary claims; rather, *some* of it is Apache, and buried in the interior is some AGPL stuff: https://github.com/hydradatabase/hydra#license


How far we’ve come for MSFT to have a project (Citus, which Hydra forks from) released under the AGPL.


I thought Microsoft bought Citus. If so, the AGPL code might've existed prior to the acquisition.


All of Citus is AGPL, so either MSFT bought it like that, or it made it like that. Either way, embracing the AGPL was not on my MSFT bingo card.


Thanks. Is there a list somewhere of Postgres-based columnar datastores (and which are MPP/clustered vs single/SMP-node)?

* Citus (cstore_fdw) extension

* Azure CosmosDB for PostgreSQL

* GCP AlloyDB for PostgreSQL

* AWS Redshift (but very old Postgres base)


Note that cstore_fdw ≈ Citus columnar (modernized using table access method) ≈ Azure Cosmos DB for PostgreSQL (managed Citus service) ≈ Hydra (a Citus columnar fork)

Fwiw, the main use case we had in mind when developing the Citus columnar table access method was compression of old data in a time-partitioned table. Citus is commonly used for real-time analytics on time series data. That involves more materialization than typical OLAP reporting use cases. However, you might still want to keep the source data in the database for ad-hoc queries or future materialization and it's preferable to keep the source data in compressed form.
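
For reference, creating such a columnar table looks roughly like this (sketched here through psycopg2; the table and columns are invented, and check the exact syntax against the Citus/Hydra docs for your version):

    import psycopg2

    conn = psycopg2.connect("dbname=metrics")   # assumed connection string
    cur = conn.cursor()

    # Columnar table access method as exposed by Citus columnar / Hydra.
    cur.execute("""
        CREATE TABLE sensor_readings (
            ts     timestamptz,
            device int,
            value  double precision
        ) USING columnar;
    """)
    conn.commit()

    # For old partitions of an existing heap table, Citus also ships a helper
    # (alter_table_set_access_method) to switch them to columnar; see the docs.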

Redshift is specifically architected for ad-hoc OLAP queries. AlloyDB I'm not sure about.


Not Postgres-based (but wire- and mostly syntax-compatible): CockroachDB using column families is much like a columnar MPP. Yugabyte is PG-based and MPP but not columnar.


The presence and use of column families is only half of the puzzle - it doesn't strictly imply that the execution engine is capable of working in a vectorized columnar style (which is necessary for competitive OLAP).


Was unable to edit my previous comment. It does use vectorization: https://www.cockroachlabs.com/docs/stable/vectorized-executi...


Thanks for sharing that - TIL! These blog posts elaborate with more detail:

https://www.cockroachlabs.com/blog/vectorized-hash-joiner/

https://www.cockroachlabs.com/blog/vectorizing-the-merge-joi...

...it seems the distinction here is that the vectorization is only present in the execution layer and not the storage layer as well. I would guess that from a storage perspective, even with column families in play, everything is being streamed out of a sorted LSM engine regardless. So there isn't additionally some highly-tuned buffer pool serving up batches of compressed column files, etc.


Indeed. As I commented elsewhere, this is just about the general design. It is not targeting OLAP in this case (even though I do believe Cockroach employs vectorization for reads).


> CockroachDB using column families is much like a columnar MPP.

I am wondering why they are saying it is not for OLAP workloads.


They don't optimize for it, and I suppose the data distribution is primarily aimed at parallel OLTP rather than OLAP. Just wanted to mention that design-wise it is similar, but that's indeed not all there is to it. I'd be hesitant to store large volumes of data on a single PG instance; I don't see how a single-writer, filesystem-based database is suitable at all for data that is large enough to warrant columnar storage.


So, what would be your DB choice for OLAP?


You can also go HTAP with TiDB which has TiKV for OLTP and TiFlash (Raft-based columnar replicas) for OLAP.


I am more interested in actual OLAP than HTAP, and I don't see a strong OSS OLAP offering on the market right now; my rants are in a previous discussion: https://news.ycombinator.com/item?id=36992039

But I should look at TiDB; it looks like an interesting and relatively mature project.


https://www.starrocks.io/ is on my shortlist for OLAP


It's also on my list, but it doesn't look easy to try; the setup procedure is complicated.


No expert, but I'd say you'd rather be looking at BigQuery, Redshift, ClickHouse, Snowflake, etc.


Note that column families have nothing to do with columnar storage.

Another example: Cassandra is not column-oriented.


Thank you for the correction. Indeed it is not entirely the same thing. Though I'd expect that at least the benefit of not having to read columns that aren't in the family would still help (haven't tried in earnest). I suppose compression is not an option though.


For sure there are GitHub lists somewhere. I recently had to research Postgres-compatible options for OLAP workloads, and I know the major ones: Redshift, ClickHouse, Hydra.


TimescaleDB?


I consider Timescale to be more of a hybrid than a "pure" columnar DB. IIRC the latest chunks are row-wise, but it goes columnar for older chunks to compress them. You can do a lot of OLAPish stuff, but it's arguably OLTP-first.


Not affiliated, just sharing as I thought it would be interesting.


Any thoughts on Databend? I was really impressed by the minimal time to set it up and the 100x faster inserts with parquet upload. But it doesn't seem to be picking up HN hype.


It doesn't look fast compared to others in ClickBench: https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIC...


ClickHouse isn't at the same level of compatibility as Postgres/Hydra, so the comparison isn't really fair, but yes, ClickHouse is kind of the gold standard for this.




Whenever I see DuckDB, it looks like the perfect fit for OLAP queries in terms of performance and efficiency.

One thing I couldn't wrap my head around: why is there no standalone version?

What would be a typical data pipeline/architecture to use DuckDB as an in-process OLAP engine? How would it handle storage of data (large data volumes)?


> One thing I couldn't wrap my head around: why is there no standalone version?

Standalone version of what? DuckDB? If you mean DuckDB, there is:

https://duckdb.org/docs/archive/0.8.1/installation/index

> How would it handle storage of data (large data volumes)?

- There are so many ways you can do this that you end up going down many rabbit holes...

- Locally on your own machine's local storage.

- On AWS you could use EBS, EFS, expandable /tmp, S3...

- On a Data Lake you could use parquet/json/csv/etc, hive, and in version 0.9.0 (Sept 11th) you could use Apache Iceberg.

And probably so many other options I haven't even thought or read about...
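
For instance, a minimal in-process sketch with the DuckDB Python package, assuming a local directory of Parquet files called events/ with a ts column:

    import duckdb

    # In-process OLAP: no server; storage is just files on disk (or ":memory:").
    con = duckdb.connect("analytics.duckdb")

    # Query Parquet straight off local storage (or s3:// with the httpfs extension).
    con.sql("""
        SELECT date_trunc('day', ts) AS day, count(*) AS events
        FROM 'events/*.parquet'
        GROUP BY day
        ORDER BY day
    """).show()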



I think DuckDB is more of a columnar SQLite.




