And this is why I much prefer Suurogate values for primary keys over natural values. And why I've gravitated to using UUID values for surrogates, not integer identities.
A theme running through the article is "this value is unique " and "this value does not change". And of course those are both wrong.
So when designing databases now I assume "everything changes, nothing is unique " (even when the domain "expert" professes it is.)
This approach solves so many problems and saves something time later on when it turns out that that "absolutely, positively, unique for ever" natural key, isn't.
The tradeoff you’re making is performance, sometimes a lot depending on your RDBMS and table size. For smaller tables, under 10,000,000 rows or so, you won’t really notice much, but in the hundreds of millions or billions, you definitely do.
A UUID is at best 2x larger than even a BIGINT, thus the index size is 2x larger. If you aren’t using v1 or v7, it’s also not k-sortable. But most importantly for MySQL (and optionally SQL Server) if the table contains things related to a common entity, like a user’s purchases, the rows are now scattered around the clustering index’s B+tree. That incurs a huge amount of I/O on large tables, and short of a covering index and/or partitioning (which only masks the problem by shrinking the search space), there is no way to improve it. If instead the PK was (user_id, some_other_identifier), all records for a given user are physically co-located.
Size is in play, yes, but the 8 extra bytes per row is likely negligible compared to the row size.
Is there a case where the dise matters? Sure. But you can't discuss the space cost for 10 billion rows without comparing to the space cost of 10 billion rows.
SQL server let's you cluster by any index, do if your child record table will benefit by clustering by ParentGuid then go for it.
MySQL stores a copy of the PK in every secondary index, so it can start adding up quite a bit. I agree that the overall size of 10 billion rows would dwarf that, but since you're presumably doing some decent indexing on a table of that size, index size matters more IMO.
For any RDBMS (I assume... I don't know a lot about SQL Server or Oracle), the binpacking for pages also impacts query speed for queries where there are many results.
But it doesn't help much, as the surrogate only lives in your system.
So now some information comes in from outside the system that something happened with a plane, and you still have to find which surrogate id that plane has in your system.
You may decide two things happened to two different planes whereas another system consider it the same plane both times, and vice versa.
The uuid keys make it easy to change some value, but won’t solve the issue of keeping a record of historical changes.
UUID keys PLUS some form of versioning with creation dates will let you change an airport name and let you know what the airport name was on some arbitrary date in the past. Useful for backfills and debugging
You don’t need all that; any candidate key (even natural) with the addition of a datetime would work. What was the definition of Airport X before Datetime Y? And after? Etc.
> But if your natural key is the thing that changed you’d never know that airport x was renamed to airport y.
Correct, which is why you need the addition of a DATETIME to indicate when that identifier is valid.
> And when you renamed the airport, you’d need to add new entries in all the other tables that used airport name as a foreign key
No, because you wouldn't use the name in the key, you'd use a code like ICAO, though there are pseudo-ICAO codes for some aerodromes, so whether or not you want to be pedantic about naming is a personal choice. Then use FK constraints. Example:
CREATE TABLE airport_physical (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
coordinates POINT NOT NULL,
country_code_alpha2 CHARACTER(2) NOT NULL, -- ideally this would be a FK to an ISO3166-2 table
opened_at DATE NOT NULL,
closed_at DATE DEFAULT 'infinity'
);
CREATE TABLE airport_code (
icao TEXT NOT NULL CHECK (length(icao) <= 16), -- reasonable length, but can easily be changed
iata CHAR(3) DEFAULT NULL,
airport_physical_id INTEGER NOT NULL REFERENCES airport_physical(id) ON UPDATE CASCADE ON DELETE RESTRICT,
effective_date DATE NOT NULL,
end_date DATE NOT NULL DEFAULT 'infinity',
PRIMARY KEY (icao, effective_date)
);
CREATE TABLE airport_name (
airport_physical_id INTEGER NOT NULL REFERENCES airport_physical(id) ON UPDATE CASCADE ON DELETE RESTRICT,
name TEXT NOT NULL CHECK (length(name) <= 126),
effective_date DATE NOT NULL,
end_date DATE DEFAULT 'infinity',
PRIMARY KEY (airport_physical_id, effective_date, name)
);
This would let you model edge cases like John F. Kennedy International Airport, née Idlewild Airport, which had the ICAO "KIDL" from its opening (I mean, probably before that as well, but for the sake of argument assume you care when it was operating) on 1948-07-01 until 1964-01-01, but its name was changed to John F. Kennedy on 1963-12-24. It also allows you to model the reuse of ICAO codes, since Indianoloa Municipal Airport received the ICAO code "KIDL" following its release by JFK.
Is this easier to do than surrogate keys? Not really, no, but IMO it's easier for a human to understand when presented with temporal changes, allows for edge cases like an airport's designator or name changing while flights are enroute, and for flights (which would be the largest table), they can use `icao` and their departure/arrival datetime (which the table would need to model anyway) to effectively link to the other tables.
A theme running through the article is "this value is unique " and "this value does not change". And of course those are both wrong.
So when designing databases now I assume "everything changes, nothing is unique " (even when the domain "expert" professes it is.)
This approach solves so many problems and saves something time later on when it turns out that that "absolutely, positively, unique for ever" natural key, isn't.