Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Funny how the common thread through many of these 'Falsehoods...' posts is that many programmers think that systems designed by humans, for humans, and kept running by humans will rigidly adhere to a set of rules and don't have edge cases.



The profession of programming is fundamentally about the interface between squishy human systems and rigid rule-based machines. No surprise that keeps coming up.


I think it's more, "you might think as a programmer writing software that models the world of aviation that you could assume the following things— but alas."

Software unfortunately follows rigid rules so the challenge is finding a set of rigid rules that can encompass reality. It would be pretty natural if you were writing a database schema that a flight would have a departing airport and an arriving airport— but alas.


Well what I'm speaking to more is that most systems that you model, most of the model is already assumptions. So natural or not, that database schema is already invoking assumptions which may or may not be false. Especially when dealing with any system where humans are directly involved in it. For many things, there's no exhaustive list of rules that will cover all the cases. As they say, if you make something idiot-proof, they'll invent a better idiot.


And this is why I much prefer Suurogate values for primary keys over natural values. And why I've gravitated to using UUID values for surrogates, not integer identities.

A theme running through the article is "this value is unique " and "this value does not change". And of course those are both wrong.

So when designing databases now I assume "everything changes, nothing is unique " (even when the domain "expert" professes it is.)

This approach solves so many problems and saves something time later on when it turns out that that "absolutely, positively, unique for ever" natural key, isn't.


The tradeoff you’re making is performance, sometimes a lot depending on your RDBMS and table size. For smaller tables, under 10,000,000 rows or so, you won’t really notice much, but in the hundreds of millions or billions, you definitely do.

A UUID is at best 2x larger than even a BIGINT, thus the index size is 2x larger. If you aren’t using v1 or v7, it’s also not k-sortable. But most importantly for MySQL (and optionally SQL Server) if the table contains things related to a common entity, like a user’s purchases, the rows are now scattered around the clustering index’s B+tree. That incurs a huge amount of I/O on large tables, and short of a covering index and/or partitioning (which only masks the problem by shrinking the search space), there is no way to improve it. If instead the PK was (user_id, some_other_identifier), all records for a given user are physically co-located.


Size is in play, yes, but the 8 extra bytes per row is likely negligible compared to the row size. Is there a case where the dise matters? Sure. But you can't discuss the space cost for 10 billion rows without comparing to the space cost of 10 billion rows.

SQL server let's you cluster by any index, do if your child record table will benefit by clustering by ParentGuid then go for it.


MySQL stores a copy of the PK in every secondary index, so it can start adding up quite a bit. I agree that the overall size of 10 billion rows would dwarf that, but since you're presumably doing some decent indexing on a table of that size, index size matters more IMO.

For any RDBMS (I assume... I don't know a lot about SQL Server or Oracle), the binpacking for pages also impacts query speed for queries where there are many results.


But it doesn't help much, as the surrogate only lives in your system.

So now some information comes in from outside the system that something happened with a plane, and you still have to find which surrogate id that plane has in your system.

You may decide two things happened to two different planes whereas another system consider it the same plane both times, and vice versa.


The uuid keys make it easy to change some value, but won’t solve the issue of keeping a record of historical changes.

UUID keys PLUS some form of versioning with creation dates will let you change an airport name and let you know what the airport name was on some arbitrary date in the past. Useful for backfills and debugging


You don’t need all that; any candidate key (even natural) with the addition of a datetime would work. What was the definition of Airport X before Datetime Y? And after? Etc.


But if your natural key is the thing that changed you’d never know that airport x was renamed to airport y. You’d just have two different keys.

And when you renamed the airport, you’d need to add new entries in all the other tables that used airport name as a foreign key


> But if your natural key is the thing that changed you’d never know that airport x was renamed to airport y.

Correct, which is why you need the addition of a DATETIME to indicate when that identifier is valid.

> And when you renamed the airport, you’d need to add new entries in all the other tables that used airport name as a foreign key

No, because you wouldn't use the name in the key, you'd use a code like ICAO, though there are pseudo-ICAO codes for some aerodromes, so whether or not you want to be pedantic about naming is a personal choice. Then use FK constraints. Example:

    CREATE TABLE airport_physical (
        id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
        coordinates POINT NOT NULL,
        country_code_alpha2 CHARACTER(2) NOT NULL,  -- ideally this would be a FK to an ISO3166-2 table
        opened_at DATE NOT NULL,
        closed_at DATE DEFAULT 'infinity'
    );

    CREATE TABLE airport_code (
        icao TEXT NOT NULL CHECK (length(icao) <= 16),  -- reasonable length, but can easily be changed
        iata CHAR(3) DEFAULT NULL,
        airport_physical_id INTEGER NOT NULL REFERENCES airport_physical(id) ON UPDATE CASCADE ON DELETE RESTRICT,
        effective_date DATE NOT NULL,
        end_date DATE NOT NULL DEFAULT 'infinity',
        PRIMARY KEY (icao, effective_date)
    );

    CREATE TABLE airport_name (
        airport_physical_id INTEGER NOT NULL REFERENCES airport_physical(id) ON UPDATE CASCADE ON DELETE RESTRICT,
        name TEXT NOT NULL CHECK (length(name) <= 126),
        effective_date DATE NOT NULL,
        end_date DATE DEFAULT 'infinity',
        PRIMARY KEY (airport_physical_id, effective_date, name)
    );

This would let you model edge cases like John F. Kennedy International Airport, née Idlewild Airport, which had the ICAO "KIDL" from its opening (I mean, probably before that as well, but for the sake of argument assume you care when it was operating) on 1948-07-01 until 1964-01-01, but its name was changed to John F. Kennedy on 1963-12-24. It also allows you to model the reuse of ICAO codes, since Indianoloa Municipal Airport received the ICAO code "KIDL" following its release by JFK.

Is this easier to do than surrogate keys? Not really, no, but IMO it's easier for a human to understand when presented with temporal changes, allows for edge cases like an airport's designator or name changing while flights are enroute, and for flights (which would be the largest table), they can use `icao` and their departure/arrival datetime (which the table would need to model anyway) to effectively link to the other tables.


It is the classic scenario of confusing the map with the territory.

In the map everything is clear. It is clear what a "plane" is what "airports" are and what their relationship is. And transferring that into a computer program is straight forward.

In the territory everything is fuzzy. None of the definitions are without edge cases and the expected relationships are often violated in surprising ways.

Aviation isn't unique here, every system suffers from the distinction between its actual function and the abstract description of that system.


Us programmers like to distill everything down to rigid sets of rules because that's how our mind operates. The fewer probabilistic "analog" parameters, the better. Of course the real world doesn't work this way.


> because that's how our mind operates

It is by no mean specific to programmers. Ask to someone who learns French, for instance. Rules with too many arbitrary exceptions.

What is specific to programmers is that their tool performs at its best with simpler rules, so their job is to find the necessary and sufficient set of rules - and will dismiss most of the cases pointed by this article as unimportant exceptions the software won't handle.


> Ask to someone who learns French, for instance. Rules with too many arbitrary exceptions.

I took French in middle school, and it was always a running joke that the teacher spent the first 5 minutes on the rule, and the next 40 minutes on the exceptions.


I am French and my non-native-French wife often asks me "why do you say this and that".

Either there is a simple rule and well known exceptions we learn at school (she would also know) or we get into the area of "this is what this is, just learn it by heart".

And then suddenly, someday, I discover there is an obscure rule with complicated words that addresses the question.


That is still a good rule if its just 40 minutes of exceptions if it covers much more than that.


Natural languages are kinda weird about this because most people don't remember their rules as rules, they learn by example, by finding patterns and kinda extrapolating them.

English is a foreign language to me. But I somehow managed to learn it without learning the rules. I can say things correctly-ish without being able to explain why I used this particular grammar.


In the end the data has to fit into structures or tables that can be processed by some algorithms. If the system is not rigid to a certain degree it would become unmaintainable or full of bugs or both.


Not really. It's just that software by definition must create a model of the domain it attempts to handle. And a model is, in the end, a set of rules. With an absence of rules, the software can't really do anything, as would be pretty pointless for actually solving any problem. The alternative is to hand the users Notepad and say "knock yourselves out".

I'd argue that programmers are indeed much more aware of how many exceptions and edge cases most real world domains have. Ask a lay person about such a simple thing as leap seconds, for instance, and they'll often believe you're making shit up.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: