Say you have a billion records in your table. If you want to filter on a column whose values average 30 bytes, an index on that column would be roughly 30GB (uncompressed). It becomes extremely impractical to hold those indices in memory if you have several, dozens, or hundreds of them.
The problem with row-based database structures is that the data between any two rows can vary in length, so when scanning across rows on disk for a single column's value, you need metadata pointing at the start of each column within each row. That means the process of scanning across the data on disk involves a _lot_ of jumping around; less of a problem on SSDs than it was on spinning platters, but still work.
The alternative approach is to store columns together on disk. That means that when performing a query that filters on values in a given column, you can scan sequential areas of disk, which is much, much faster. Of course, if you now want to assemble rows based on your filter, you need the metadata about where the related values sit in each of the other columns in order to assemble those rows, but hopefully you have a much smaller set of records to do this over. (It's cheaper to assemble 10k matching rows across a set of columns than it is to jump through a billion variable-length rows to filter them in the first place.)
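To make that concrete, here's a minimal Python sketch (columns and values are made up, standing in for on-disk column storage) of that filter-then-assemble flow:

```python
# Toy in-memory columns (made-up data) standing in for on-disk column storage.
ids     = [1, 2, 3, 4, 5]
regions = ["US", "CA", "US", "MX", "US"]
amounts = [10, 25, 7, 12, 40]

# Step 1: sequential scan over a single column to find matching positions.
match = [i for i, region in enumerate(regions) if region == "US"]

# Step 2: assemble full rows only for the (hopefully much smaller) match set.
rows = [(ids[i], regions[i], amounts[i]) for i in match]
print(rows)  # [(1, 'US', 10), (3, 'US', 7), (5, 'US', 40)]
```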
One thing I've seen said about other columnar databases (e.g. ClickHouse) is that they're bad at k/v-style lookups. Why is that? Your explanation would make me think otherwise, since ideally we'd be able to quickly scan the columns using those sequential reads.
Think of columnar data as being run-length compressed.
(If you have 100 rows of "1 Fizz" followed by 200 rows of "1 Buzz", followed by one "1 Fizz", you can store that as "100:1:Fizz;200:1:Buzz;1:1:Fizz".)
Then a mere 30 bytes of I/O plus a quick sum tells you that you sold 101 Fizz and 200 Buzz from the widget factory this quarter. Maybe the first Fizz run is how many you sold in the USA and the second is how many you sold in Canada, so it becomes easy to do by-region reporting.
However, unpacking the column that has all your individual transaction IDs and linking it to the individual single sale of one Fizz takes a lot of lookups. So untangling individual transactions becomes slow (though not impossible).
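For a toy illustration of the run-length idea (purely illustrative Python, not any engine's actual format), using the Fizz/Buzz numbers above:

```python
from itertools import groupby

# The "100:1:Fizz;200:1:Buzz;1:1:Fizz" example as (quantity, product) rows.
rows = [(1, "Fizz")] * 100 + [(1, "Buzz")] * 200 + [(1, "Fizz")]

# Run-length encode into (run_length, quantity, product) triples.
runs = [(len(list(group)), qty, product)
        for (qty, product), group in groupby(rows)]
print(runs)  # [(100, 1, 'Fizz'), (200, 1, 'Buzz'), (1, 1, 'Fizz')]

# Aggregating over the runs never touches the individual rows.
totals = {}
for run_len, qty, product in runs:
    totals[product] = totals.get(product, 0) + run_len * qty
print(totals)  # {'Fizz': 101, 'Buzz': 200}
```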
1) If they're only using on-disk indices, it'll be slower to look up any one record than it would be if the index were in memory (particularly true if a good chunk of the data were in memory as well).
2) Retrieving a full row of data is more expensive in a column-based DB than in a row-based DB, because in a row-based DB the data for the row is contiguous on disk. So row-based is likely optimal if your data size works with that architecture. (And, to be clear, row-based _can_ work with billions of rows, if you're thoughtful about what kind of querying you'll do against such tables, and how you maintain them.)
Instead of storing your data grouped by record, column databases store data grouped by column. This means that if your query only touches one or a few columns, but across a large number of records, e.g. to aggregate them into an average, it needs to read fewer database blocks from disk, so lower I/O costs. The same goes for in-memory or cached data: the data you're interested in is more easily accessible at a lower performance cost.
This is especially useful for reporting/analytical purposes where you want to aggregate over data from many records.
Yes, the downside being that if you need all the columns of one specific record (row), its data will be spread over many column blocks.
As a rule of thumb, if most of your queries are aggregations over a few columns of many records (OLAP), column-based is the better choice, whereas if you need to access every column/field of a small number of records (OLTP), record/row-based is the better choice.
Usually you choose the type of database based on your use case. Reporting databases like data warehouses benefit from a column-oriented database, whereas transactional use cases use a row-oriented database. Modern databases let you choose per table.
And in addition, each column, or even each partition of a column, can store statistics about itself: min, max, avg, count, etc. That way, if you look up max(column), you only have to read the metadata for that column's partitions, which is very fast. And for analytics this kind of query is pretty common.
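Something like this, with a hypothetical metadata structure and made-up numbers (real engines keep comparable per-segment stats, often called zone maps):

```python
# Hypothetical per-partition statistics for one column; real engines keep
# comparable min/max/count metadata per segment or "zone map".
partitions = [
    {"min": 3,  "max": 87,  "count": 1_000_000},
    {"min": 12, "max": 954, "count": 1_000_000},
    {"min": 1,  "max": 402, "count": 1_000_000},
]

# max(column) and count(*) need only the metadata, never the column data.
print(max(p["max"] for p in partitions))    # 954
print(sum(p["count"] for p in partitions))  # 3000000
```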
Do they also have a benefit similar to the compression that comes from record re-ordering in column databases?
It's not just that you need to read a smaller section of disk to scan the same amount of interesting data; it's that the data also takes up considerably less space than it otherwise would.
In row-oriented DBs data is laid out row by row. In order to read one column's value the DB has to read through the entire row to get to the next one. When computing an analytical query like the sum of a single column, the DB wastes cycles skipping over all the other columns in each row.
In columnar databases a batch of column data is stored sequentially, followed by the next column, and so on. So if you need to sum a column you can read it sequentially very quickly and sum it up without touching the data of any other column.
Columnar is also better at filtering and grouping for similar reasons.
As a general rule, if your workload looks like “select sum(x) where y group by z”, a columnar DB will be a great fit.
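Sketching that rule of thumb in Python (toy layouts, nothing engine-specific), the same query touches very different amounts of data in each layout:

```python
# The same tiny table laid out both ways (purely illustrative; real engines
# work on compressed disk pages, not Python lists).
row_store = [
    {"region": "US", "amount": 10, "name": "a", "notes": "..."},
    {"region": "CA", "amount": 25, "name": "b", "notes": "..."},
    {"region": "US", "amount": 7,  "name": "c", "notes": "..."},
]
col_store = {
    "region": ["US", "CA", "US"],
    "amount": [10, 25, 7],
    "name":   ["a", "b", "c"],
    "notes":  ["...", "...", "..."],
}

# "select sum(amount) where region = 'US'":
# the row layout drags every record (with all its columns) past the query,
row_total = sum(r["amount"] for r in row_store if r["region"] == "US")

# while the column layout touches only the two columns the query mentions.
col_total = sum(amount
                for region, amount in zip(col_store["region"], col_store["amount"])
                if region == "US")
print(row_total, col_total)  # 17 17
```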
The trade-off is that columnar databases don’t like being updated and work best when you only insert data. This is because updating existing data means rewriting several pages on disk, since a single row's values are spread across many column segments.
Because of this there are very few areas where a pure columnar DB makes sense. They’re great for time-series metrics like sensor data, click event logs, etc.: places where you are just recording a fact and don’t need to update past events.
Microsoft SQL Server ColumnStore and SAP HANA both support indexes, point updates, etc...
You can convert a table in SQL Server to "Clustered ColumnStore" storage, it'll become compressed, but everything will mostly work the same.
The implementation in SQL Server uses a row "delta store" to keep recent changes. Once a certain threshold is reached, the delta store is converted to columnar format and merged with the underlying table.
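In generic terms, the delta-store pattern looks something like the sketch below (just the shape of the idea, not SQL Server's actual implementation; names and the threshold are made up):

```python
# Row-wise buffer for recent writes plus columnar "base" segments; once the
# buffer hits a threshold, it's converted to columnar form and merged in.
THRESHOLD = 3
delta_store = []                       # recent rows, kept row-wise
columns = {"id": [], "amount": []}     # columnar base storage

def insert(row):
    delta_store.append(row)
    if len(delta_store) >= THRESHOLD:
        for r in delta_store:          # convert the buffered rows to columns
            columns["id"].append(r["id"])
            columns["amount"].append(r["amount"])
        delta_store.clear()

for i in range(5):
    insert({"id": i, "amount": i * 10})

print(columns)      # {'id': [0, 1, 2], 'amount': [0, 10, 20]}
print(delta_store)  # [{'id': 3, 'amount': 30}, {'id': 4, 'amount': 40}]
```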
With row-based storage, the database can look up everything it needs to show a form to the user with a single disk I/O operation: just read a block of data starting at the location where the record row is.
With column-based storage, the database has to do an I/O per column, which seems worse... but these days on SSD storage the overhead is negligible when displaying one form at a time.
Where columnar wins massively is reporting workloads. If you want to know the average age of employees, then with row storage you pretty much have to read everything in, including their names and ID numbers, even though you don't need that data. This is because all disk storage comes as "block devices" that read a minimum of 512 bytes, but typically more like 64KB at a time. You get more than you want, and then you have to throw it away. With column storage, you can read just the age data to compute the average age.
With column storage there are also compression techniques that work a lot better than with row storage. Run-length encoding, lookup tables, etc... provide higher compression ratios because the data is much more uniform for much larger chunks.
The typical end-result is 100x less I/O to run the same query.
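Back-of-the-envelope, with made-up numbers, that looks something like:

```python
# Made-up numbers, purely for illustration.
rows      = 10_000_000
row_bytes = 200        # whole employee record: name, ID, address, age, ...
age_bytes = 2          # just the age column

row_store_io = rows * row_bytes   # ~2 GB read to answer "average age"
col_store_io = rows * age_bytes   # ~20 MB read for the age column alone
print(row_store_io // col_store_io)  # 100 -- before compression helps further
```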
Each cell in a column is stored one after another in memory, so you have contiguous memory for one column. Effectively, think of having a text file for each column, with each cell on its own line; to reconstruct a row you have to get its line from every column file. To sum a column you only have to scan the one file for that column.
In row-oriented storage, each cell of each row is next to the others in memory. Think traditional CSV files, where each row is a line in the file. To sum a column you have to scan the whole file and all the data for all columns.
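Here's that analogy in toy Python form (in-memory "files" with made-up data):

```python
import csv, io

# One "file" per column, kept in memory for the example.
column_files = {
    "name": "alice\nbob\ncarol\n",
    "age":  "34\n29\n41\n",
}

# Summing one column only reads that column's file.
print(sum(int(line) for line in column_files["age"].splitlines()))  # 104

# Reconstructing row 1 means pulling line 1 from every column file.
print({col: data.splitlines()[1] for col, data in column_files.items()})
# {'name': 'bob', 'age': '29'}

# The row-oriented equivalent: a CSV where each line is a whole record, so
# summing "age" means scanning every line and every field.
csv_data = "alice,34\nbob,29\ncarol,41\n"
print(sum(int(record[1]) for record in csv.reader(io.StringIO(csv_data))))  # 104
```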
Assuming you knew C and CPU architecture terms like SIMD at age 6: imagine an array of complex structs compared to an int[]. The latter can be read straight from disk and aggregated (i.e. summed) very fast via SIMD instructions, whereas a dynamically sized structure for rows (how traditional databases store them) requires a lot of branching and dereference logic.
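Roughly the same contrast in Python instead of C (illustrative only; `array` gives the contiguous fixed-width layout that SIMD/vectorized engines want, though plain `sum()` here won't itself use SIMD):

```python
from array import array

# Row-ish shape: a list of variable-sized records that must be walked and
# dereferenced one by one to pull out a single field.
rows = [{"id": i, "name": f"widget-{i}", "qty": i % 7} for i in range(1000)]
row_sum = sum(r["qty"] for r in rows)

# Column-ish shape: one contiguous block of fixed-width ints -- the layout a
# vectorized engine can chew through sequentially.
qty_column = array("i", (i % 7 for i in range(1000)))
col_sum = sum(qty_column)

print(row_sum == col_sum)  # True
```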
Dishonest for them to use the default "Postgres" for the ClickBench comparison instead of "Postgres (tuned)", which uses similar config settings to Hydra and tells a very different story.
Thanks, that was useful to know. It feels like software markets have been moving from the best solution toward those with either distribution (MS/AWS/GOOG) or marketing. It's posts like this that keep the market "honest".
The default settings are not realistic when running on a c6a.4xlarge instance, and the "Postgres (tuned)" settings more closely match what "Hydra" is configured to use.
I consider it dishonest because I do not believe a team working this close to Postgres internals wouldn't know this.
Agree, but also, IMO, using indexes is part of what gives the performance boost. Just looking at the queries that use string search: they have the best results overall in the complete test. Postgres (tuned) is configured for this specific ClickBench table, and there's nothing wrong with that.
And just like last time, watch out for the misleading GitHub license detector because it's not entirely Apache as the GitHub summary claims but rather *some* is Apache and buried in the interior is some AGPL stuff: https://github.com/hydradatabase/hydra#license
Note that cstore_fdw ≈ Citus columnar (modernized using table access method) ≈ Azure Cosmos DB for PostgreSQL (managed Citus service) ≈ Hydra (a Citus columnar fork)
Fwiw, the main use case we had in mind when developing the Citus columnar table access method was compression of old data in a time-partitioned table. Citus is commonly used for real-time analytics on time series data. That involves more materialization than typical OLAP reporting use cases. However, you might still want to keep the source data in the database for ad-hoc queries or future materialization and it's preferable to keep the source data in compressed form.
Redshift is specifically architected for ad-hoc OLAP queries. AlloyDB I'm not sure.
Not Postgres-based (but wire- and mostly syntax-compatible): CockroachDB with column families behaves much like a columnar MPP.
Yugabyte is PG-based and MPP but not columnar.
The presence and use of column families is only half of the puzzle - it doesn't strictly imply that the execution engine is capable of working in a vectorized columnar style (which is necessary for competitive OLAP).
...it seems the distinction here is that the vectorization is only present in the execution layer and not the storage layer as well. I would guess that from a storage perspective, even with column families in play, everything is being streamed out of a sorted LSM engine regardless. So there isn't additionally some highly-tuned buffer pool serving up batches of compressed column files, etc.
Indeed. As I commented elsewhere, this is just about the general design. It is not targeting OLAP in this case (even though I do believe Cockroach employs vectorization for reads).
They don't optimize for it and I suppose the data distribution is primarily aimed at parallel OLTP rather than OLAP. Just wanted to mention that design-wise it is similar but that's indeed not all there is to it.
I'd be hesitant to store large volumes of data on a single PG instance; I don't see how a single-writer, filesystem-based database is suitable at all for data that is large enough to warrant columnar storage.
I am more interested in actual OLAP than HTAP, and don't see a strong OSS OLAP offering on the market right now; my rants from the previous discussion: https://news.ycombinator.com/item?id=36992039
But I should look at TiDB; it looks like an interesting and relatively mature project.
Thank you for the correction. Indeed it is not entirely the same thing. Though I'd expect that at least the benefit of not having to read columns that aren't in the family would still help (haven't tried in earnest). I suppose compression is not an option though.
For sure there are GitHub lists somewhere. I recently had to research Postgres-compatible options for OLAP workloads, and the major ones I know are Redshift, ClickHouse, and Hydra.
I consider Timescale to be more of a hybrid than a "pure" columnar DB. IIRC the latest chunks are row-wise, but it goes columnar for older chunks to compress them. You can do a lot of OLAPish stuff, but it's arguably OLTP-first.
Any thoughts on Databend? I was really impressed by the minimal time to set it up and the 100x faster inserts with parquet upload. But it doesn't seem to be picking up HN hype.
ClickHouse isn’t at the same level of Postgres compatibility as Hydra, so the comparison isn’t really fair. But yes, ClickHouse is kind of the gold standard for this.