> Basically we started to collect detailed analytics and thus had a rapidly growing table of around 2B records of user events during their sessions. As it grew past 500 million records it turned out to be impossible to query this table in anything close to real-time
Analytics isn't typically something that needs real-time capabilities, for one.
> a rapidly growing table [emphasis mine]
I think I see part of the problem here. If you had a single table, that means it's completely denormalized, so your schema probably looked something like this (or wider):
CREATE TABLE UserEvent (
  id UUID PRIMARY KEY,
  user_id UUID NOT NULL,
  user_ip_address TEXT NOT NULL,
  user_agent TEXT NOT NULL,
  event_data JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  event_type TEXT
);
CREATE INDEX UserEvent_user_id_idx ON UserEvent (user_id);
CREATE INDEX UserEvent_created_at_idx ON UserEvent (created_at);
The JSON blob might be anywhere from a few hundred bytes to well over a kilobyte, and probably duplicates data already present as a scalar, like IP address, user agent string, timestamp, etc. I'll use the middle ground and say the JSONB objects are on average 500 bytes when stored. Now, the rest.
A UUID, if stored as its native type (or BINARY(16) in MySQL - don't sleep on this, MySQL folks; it makes a huge difference at scale) is 16 bytes. That's double the size of a BIGINT, and quadruple the size of an INT4. Also, unless you're using UUIDv7 (or UUIDv1, but no one does), it's not k-sortable. Since Postgres doesn't cluster tuples around the PK [yes, I know all indices in Postgres are technically secondary] like MySQL/InnoDB does, this doesn't immediately thrash the B+tree in the same way, but it does thrash the visibility map, and it does bloat the WAL. There are various arguments for why you shouldn't use a monotonic integer as a surrogate key, but IMO they're largely overblown, and there are workarounds to not publicly disclose it.
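As a sketch of that workaround (table and column names are mine, not the OP's): keep a BIGINT surrogate key for storage and joins, and hand out a random UUID only at the API boundary. gen_random_uuid() is built in on PG 13+; older versions need pgcrypto.

CREATE TABLE user_event_sketch (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- 8 bytes, naturally ordered
  public_id UUID NOT NULL DEFAULT gen_random_uuid(),  -- opaque ID handed to clients
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Look rows up by public_id only at the API edge; JOIN on the BIGINT internally.
CREATE UNIQUE INDEX user_event_sketch_public_id_idx ON user_event_sketch (public_id);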
IPv4 addresses, stored in dotted-quad form as a string, are a maximum of 15 characters, which takes up to 16 bytes as TEXT or VARCHAR. If stored instead in the Postgres native INET type, that drops to 7 bytes, plus you get built-in validation. If you had INT4 UNSIGNED available (as MySQL does natively), you could even store them in their numeric form and save another 3 bytes, though you lose the validation.
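A quick sketch of what INET buys you; the ALTER at the end assumes the hypothetical UserEvent schema above, and note that it rewrites the table:

-- INET stores an IPv4 address in 7 bytes and validates it on input.
CREATE TABLE ip_sketch (user_ip INET NOT NULL);
INSERT INTO ip_sketch VALUES ('203.0.113.42');  -- accepted
INSERT INTO ip_sketch VALUES ('999.0.113.42');  -- rejected: invalid input syntax for type inet
-- Converting an existing TEXT column in place (takes an exclusive lock and rewrites the table):
ALTER TABLE UserEvent ALTER COLUMN user_ip_address TYPE INET USING user_ip_address::inet;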
User Agent strings are huge, usually over 100 bytes. They're also not that unique, relatively speaking. Even if you need to know the patch version of the browser, anyone with a browser doing automatic updates is going to stay more or less in sync. The point is this could easily be a lookup table, with either a SMALLINT (2^15 - 1 maximum values, or 2^16 - 1 if you use unsigned values; possible with an extension in Postgres) or an INT (2^31 - 1 maximum values) as the PK.
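A sketch of what that lookup table could look like (the UA string here is invented):

CREATE TABLE user_agent (
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  ua TEXT NOT NULL UNIQUE
);
-- On write, upsert the UA string and keep only the 4-byte id on the event row.
-- (DO UPDATE is a common trick so RETURNING also returns the id of an existing row.)
INSERT INTO user_agent (ua)
VALUES ('Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0')
ON CONFLICT (ua) DO UPDATE SET ua = EXCLUDED.ua
RETURNING id;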
Not going to touch on JSON objects because the things you might want to know are endless. TOAST and de-TOAST can be slow; if you need low latency, you should normalize your data.
The OP's real table may or may not also have scalars extracted from the JSON into their own columns, which can be beneficial during queries. Again, though, lookup tables (or even native ENUM types, if the values are limited) are crucial at scale.
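For event_type specifically, if the set of values is small and known up front, a native ENUM is stored as 4 bytes instead of a repeated string. A sketch, with invented values:

CREATE TYPE event_kind AS ENUM ('page_view', 'click', 'form_submit');
-- Values can still be appended later without rewriting the table:
ALTER TYPE event_kind ADD VALUE 'purchase';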
As it stands, the table will have an average row size of 664 bytes (assuming an average of 12 bytes stored for the IP, 100 bytes stored for the UA, 500 bytes stored for the JSONB, and 12 bytes stored for the event type). That's 332 GB for 500,000,000 rows. You could shave a couple of bytes off by aligning columns [0], which saves 1 GB. If the IP addresses and UA strings were lookup tables, each with an INT4, that saves 104 bytes per row. If you made the PK for the table a BIGINT, that saves another 8 bytes per row. The total savings between column alignment and basic normalization is 114 bytes per row, or 57 GB.
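If you'd rather measure than trust my back-of-the-envelope numbers, Postgres will tell you directly; this is rough, since it ignores per-tuple headers and TOAST details:

-- Average materialized row width, on a 1% sample so it doesn't scan all 500M rows:
SELECT avg(pg_column_size(ue.*)) AS avg_row_bytes
FROM UserEvent AS ue TABLESAMPLE SYSTEM (1);
-- Total on-disk size, including indices and TOAST:
SELECT pg_size_pretty(pg_total_relation_size('UserEvent')) AS total_size;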
This doesn't touch on the indices, either. If you're using PG 13+, you get B+tree de-duplication [1] for free, which can help with some denormalized data, but not if you have anything with high cardinality, like a timestamp, or a UUID. With lookup tables, you would of course need to index those FKs (whether or not you're enforcing constraints), which adds some size, but is still a huge net gain.
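Pulling the pieces above together, a normalized sketch (again, my names, building on the user_agent table and event_kind enum from earlier) might look like this:

-- Fixed-width 8-byte columns first, then 4-byte, then variable-length, per [0].
CREATE TABLE user_event_v2 (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  user_id BIGINT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  user_agent_id INT NOT NULL REFERENCES user_agent (id),
  event_type event_kind NOT NULL,
  user_ip INET NOT NULL,
  event_data JSONB
);
-- Postgres does not index FK columns automatically, so do it yourself:
CREATE INDEX user_event_v2_user_id_idx ON user_event_v2 (user_id);
CREATE INDEX user_event_v2_created_at_idx ON user_event_v2 (created_at);
CREATE INDEX user_event_v2_user_agent_id_idx ON user_event_v2 (user_agent_id);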
> I know I could have used some type of daily aggregation combined with a weekly aggregation, etc to roll up the data incrementally. A dev tried this and yeah, it hid the slow queries but then it became inflexible in terms of reporting. And writing and maintaining these cronjobs is a lot of work.
And shifting your entire analytics workload isn't a lot of work? Between ROLLUP [2] and MATERIALIZED VIEW [3], which can be refreshed on a schedule from cron, this doesn't seem that burdensome.
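Concretely, a sketch of the materialized-view version of a daily rollup, built on the hypothetical normalized table from above:

-- A daily rollup that stays queryable for ad-hoc reporting.
CREATE MATERIALIZED VIEW daily_event_counts AS
SELECT date_trunc('day', created_at) AS day,
       event_type,
       count(*) AS events
FROM user_event_v2
GROUP BY 1, 2;
-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX daily_event_counts_key ON daily_event_counts (day, event_type);
-- Run from cron, or schedule in-database if pg_cron is installed:
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_event_counts;
-- SELECT cron.schedule('daily-rollup', '5 0 * * *',
--   'REFRESH MATERIALIZED VIEW CONCURRENTLY daily_event_counts');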
> Also BigQuery bill for https://web3dsurvey.com is like $0.25 month and it is dealing with millions of records in its 3 month window of stored data.
Then you're in the free tier (<= 1 TiB/month of processed data), because after that it's $6.25/TiB. Also worth noting there is a massive difference between millions of rows and billions of rows. The former can be handled by practically any RDBMS on any hardware, with a completely unoptimized schema. The latter requires some thought if you want it to be performant.
This isn't at all to say that specialized DBs don't have their place, because they absolutely do. If you need a KV store, use a KV store, not an RDBMS. If you need OLAP, use something designed for OLAP. The difference is scale. At startup or side project scale, you can easily do everything (including pub/sub) with an RDBMS, and if you put thought into its design and usage, you can take it a lot farther than you'd think. Eventually, you may hit a point where it's counter-productive to do so, and then you should look into breaking tasks out.
The issue I see happening time and time again is devs have little to no expertise in DBs of any kind, but since everyone says "Postgres is all you need," they decide to use it for everything, except they don't know what they're doing. If you do that, yeah, you're gonna have problems fairly early on, and then you'll either throw your hands up and decide you really need a bevy of specialized DBs, caches, and message queues (which introduces a lot of complexity), or you'll vertically scale the DB. If you choose the latter, by the time you hit scaling limits, you're easily spending $25K/month on the DB alone. If you opt to hire someone with DB expertise at this point, you'll spend about that if not more in personnel costs, and not only will it take them weeks if not months to unravel everything, your devs will be constantly complaining that queries are now "too complex" because they have to do some JOINs, and they're being told to stop chucking everything into JSON.
If instead you took at most a week to learn some RDBMS basics, by (a) reading its manual front-to-back and (b) getting hands-on experience trying things out, you could almost certainly get much farther on much less.
[0]: https://www.enterprisedb.com/blog/rocks-and-sand
[1]: https://www.postgresql.org/docs/current/btree.html#BTREE-DED...
[2]: https://www.postgresql.org/docs/current/queries-table-expres...
[3]: https://www.postgresql.org/docs/current/rules-materializedvi...