What I’ve found works well, in keeping with the author’s “everything is a log” idea, is append-only tables in PG with date ranges on them.
So you have a pet table with an ID, birth date, type, name, whatever, and ‘valid_range’.
That last column is a range type (daterange or tstzrange in Postgres). Combined with the ID, it serves as a unique key for the table. Records are inserted with a range from now() to infinity.
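A minimal sketch of that shape (names are illustrative, not a real schema; the EXCLUDE constraint is a stronger form of the unique key that also forbids overlapping versions of the same row):

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets EXCLUDE mix = with &&

CREATE TABLE pet (
    id          BIGINT    NOT NULL,  -- stable business key, not unique on its own
    name        TEXT      NOT NULL,
    birth_date  DATE,
    pet_type    TEXT,
    -- [now, infinity) for the current version; unbounded upper bound = still current
    valid_range tstzrange NOT NULL DEFAULT tstzrange(now(), NULL, '[)'),
    PRIMARY KEY (id, valid_range),
    -- no two versions of the same pet may overlap in time
    EXCLUDE USING gist (id WITH =, valid_range WITH &&)
);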
To update a record, you call a stored procedure. It inserts the new version with that same now()-to-infinity range, and updates the old record to be valid up to (but not including) now(). The SP ensures the process is done correctly.
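A sketch of what that procedure can look like, against the same hypothetical schema. Within one transaction now() is frozen, so the close and the open line up exactly, with no gap and no overlap:

CREATE OR REPLACE FUNCTION update_pet_name(p_id BIGINT, p_new_name TEXT)
RETURNS void
LANGUAGE plpgsql AS $$
DECLARE
    prev pet%ROWTYPE;
BEGIN
    -- close the open version: valid up to, but not including, now()
    UPDATE pet
       SET valid_range = tstzrange(lower(valid_range), now(), '[)')
     WHERE id = p_id
       AND upper_inf(valid_range)   -- only the currently-open row
    RETURNING * INTO prev;

    IF prev.id IS NULL THEN
        RAISE EXCEPTION 'pet % has no current version', p_id;
    END IF;

    -- open the new version from now() to infinity, carrying unchanged columns over
    INSERT INTO pet (id, name, birth_date, pet_type, valid_range)
    VALUES (p_id, p_new_name, prev.birth_date, prev.pet_type,
            tstzrange(now(), NULL, '[)'));
END;
$$;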
You can use the same date range in join tables for the same reason.
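For example (still with made-up names), an ownership join table gets its own valid_range, and a point-in-time join just requires both ranges to cover the same instant:

CREATE TABLE pet_owner (
    pet_id      BIGINT    NOT NULL,
    owner_id    BIGINT    NOT NULL,
    valid_range tstzrange NOT NULL DEFAULT tstzrange(now(), NULL, '[)'),
    PRIMARY KEY (pet_id, owner_id, valid_range)
);

-- who owned pet 42 at that moment, and what was the pet called then?
SELECT po.owner_id, p.name
FROM pet_owner po
JOIN pet p
  ON p.id = po.pet_id
 AND p.valid_range @> '2024-01-01 00:00+00'::timestamptz
WHERE po.pet_id = 42
  AND po.valid_range @> '2024-01-01 00:00+00'::timestamptz;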
This makes it possible to see the full state of any record kept this way at any point in time, and to see when it was created or last changed. An audit table records who changed it by holding the ID and timestamp of the change. There is no real deletion; you’d do soft deletion by setting a status.
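Created and last-changed fall straight out of the range bounds (same sketch table):

-- when was pet 42 created, and when did it last change?
SELECT min(lower(valid_range)) AS created_at,
       max(lower(valid_range)) AS last_changed_at
FROM pet
WHERE id = 42;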
I suspect this wouldn’t work well for very high-volume tables without sharding or something. But for CRUD tables that don’t change a lot it’s fantastic.
The only thing that’s not smooth is later schema changes. If you need a new non-null column, it ends up added to all records, historical ones included. So you can either set a default and just accept that it’s now set on all old records, leave it nullable and enforce non-null in code, or enforce it only on insert in a trigger or the SP I described (sketched below).
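The third option can look like this (favorite_food is a hypothetical column; history keeps its NULLs, new versions must supply a value):

ALTER TABLE pet ADD COLUMN favorite_food TEXT;  -- nullable, so old rows are untouched

CREATE FUNCTION pet_check_favorite_food() RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- reject only NEW rows that omit the value; existing rows are never revisited
    IF NEW.favorite_food IS NULL THEN
        RAISE EXCEPTION 'favorite_food is required on new pet versions';
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER pet_favorite_food_required
BEFORE INSERT ON pet
FOR EACH ROW EXECUTE FUNCTION pet_check_favorite_food();

The update SP then needs to carry the new column forward along with the others, of course.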
I’ve found it much easier to use than some sort of ‘updates’ table storing JSON changes or EAV style updates or whatever.
This is a really good description of more or less exactly how our current approach works! Below is a daily-granularity variant we’re testing at the moment, to eliminate the flip-flops that occur over the course of a business day. The v1 implementation was down to the second; this one is daily.
Here is the core of it:
CREATE TABLE time_travel_daily (
    domain TEXT NOT NULL,
    valid_range tstzrange NOT NULL,
    -- scalar copies of the range bounds, handy for plain comparisons and btree indexes
    valid_from timestamptz GENERATED ALWAYS AS (lower(valid_range)) STORED,
    valid_to timestamptz GENERATED ALWAYS AS (upper(valid_range)) STORED,
    tld TEXT,
    owned BOOLEAN,
    acquired_at timestamptz,
    released_at timestamptz,
    registrar TEXT,
    updated_at timestamptz,
    accounting_uuid TEXT,
    offer_received_date timestamptz,
    payment_received_date timestamptz,
    sold_at timestamptz,
    sold_channel TEXT,
    last_renewed_at timestamptz,
    expires_at timestamptz,
    transfer_started_at timestamptz,
    transfer_completed_at timestamptz,
    transfer_eligible_at timestamptz,
    snapshot_json JSONB NOT NULL,
    inserted_at timestamptz DEFAULT NOW() NOT NULL,
    source_data_change_id INT,
    PRIMARY KEY (domain, valid_range)
);
CREATE INDEX ttd_domain_idx ON time_travel_daily(domain);
CREATE INDEX ttd_gist_valid_range_idx ON time_travel_daily USING gist(valid_range);
CREATE INDEX ttd_owned_valid_range_idx ON time_travel_daily USING gist(valid_range) WHERE owned = TRUE;
CREATE INDEX ttd_registrar_idx ON time_travel_daily(registrar) WHERE registrar IS NOT NULL;
CREATE INDEX ttd_source_data_change_id_idx ON time_travel_daily(source_data_change_id) WHERE source_data_change_id IS NOT NULL;
And then here is a piece of our update trigger, which "closes" previous entries and opens a new one:
UPDATE time_travel_daily
SET valid_range = tstzrange(lower(valid_range), target_date::timestamptz, '[)')
WHERE domain IN (
        SELECT DISTINCT dc.domain
        FROM data_changes dc
        WHERE dc.invalidated IS NULL
          AND dc.after IS NOT NULL
          AND dc.modified_at::date = target_date
      )
  AND upper(valid_range) IS NULL                      -- only close open ranges
  AND lower(valid_range) < target_date::timestamptz;  -- don't close ranges that started today
A trigger to ‘close’ the old record is a great idea. My stored procedure also does some additional validation (validate at every layer = fewer bugs), so what I’ve got works well enough for me.
I'm not entirely sure what the valid_range is doing. Besides updating it, do you use this index for anything else? I agree the performance doesn't seem like it would be great.
I do something like 4,000 inserts a second but maybe only a few queries a minute, so I use an "invalidated_by" column which (eventually) points to the newer record, and I update it on query instead of on insert (when the multiple NULLs are discovered and become relevant).
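Roughly, the query-time fix-up looks like this (a simplified sketch; "events", "id", "domain", and "inserted_at" are stand-in names, not the real schema):

UPDATE events older
SET invalidated_by = newer.id
FROM events newer
WHERE older.invalidated_by IS NULL
  AND newer.domain = older.domain
  AND newer.inserted_at > older.inserted_at
  -- "newer" must be the *immediate* successor
  -- (assumes inserted_at is unique per domain)
  AND NOT EXISTS (
        SELECT 1
        FROM events mid
        WHERE mid.domain = older.domain
          AND mid.inserted_at > older.inserted_at
          AND mid.inserted_at < newer.inserted_at
      );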
SELECT COUNT(DISTINCT domain)
FROM time_travel
WHERE (CURRENT_DATE - INTERVAL '90 days')::timestamptz <@ valid_range
AND owned;
This is asking, "how many domains did we own 90 days ago?"
Instead of finding records where the start is less than and the end is greater than the timestamp, you can just ask for rows whose range covers that point in time. The GiST index on valid_range does the heavy lifting.
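To make that concrete, these two queries ask the same question, but only the containment form lets the GiST index do the work (the planner flips elem <@ range into range @> elem, which the index supports):

-- containment form: answered from the GiST index on valid_range
SELECT COUNT(DISTINCT domain)
FROM time_travel
WHERE (CURRENT_DATE - INTERVAL '90 days')::timestamptz <@ valid_range;

-- the equivalent hand-rolled predicate for a '[)' range with an open upper bound
SELECT COUNT(DISTINCT domain)
FROM time_travel
WHERE lower(valid_range) <= (CURRENT_DATE - INTERVAL '90 days')::timestamptz
  AND (upper(valid_range) > (CURRENT_DATE - INTERVAL '90 days')::timestamptz
       OR upper(valid_range) IS NULL);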