Lots of people got hung up on the example, which I thought would be helpful to the discussion, but it certainly should not replace the main point, which is:
Relations, attributes and tuples are logical. A PK is a combination of one or more attributes representing a name that uniquely identifies tuples and, thus, is logical too[2], while performance is determined exclusively at the physical level, by implementation.
So generating SKs for performance reasons (see, for example, "Natural versus Surrogate Keys: Performance and Usability" and "Performance of Surrogate Key vs Composite Keys") is logical-physical confusion (LPC)[3]. Performance can be considered in PK choice only when there is no logical reason for choosing one key over another.
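To make the logical/physical split concrete, here is a minimal sketch (PostgreSQL-flavoured; the table and column names are invented): the key declaration is the logical part, while the performance knobs live in separate physical statements that can change without touching it.

```sql
-- Logical design: the natural key names the tuple.
CREATE TABLE enrollment (
    student_no  varchar(10) NOT NULL,
    course_code varchar(8)  NOT NULL,
    term        varchar(6)  NOT NULL,
    grade       char(1),
    PRIMARY KEY (student_no, course_code, term)
);

-- Physical tuning: indexing and clustering are decided separately and can be
-- revisited for performance without altering the logical design.
CREATE INDEX enrollment_course_idx ON enrollment (course_code, term);
CLUSTER enrollment USING enrollment_pkey;
```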
> Trying to tamp out these ambiguities adds an unbounded number of data model epicycles that add a lot of complexity and performance loss
If you can talk about a business rule, you have a predicate. If you have a predicate, you can put it in fifth or sixth normal form, since all that means is that your relation expresses the predicate completely, and only the predicate.
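As a rough sketch of what that can look like (table and column names invented), split a relation whose predicate bundles several statements into one relation per irreducible predicate:

```sql
-- Predicate: "Employee E exists."
CREATE TABLE employee (emp_no int PRIMARY KEY);

-- Predicate: "Employee E is named N."
CREATE TABLE employee_name (
    emp_no int PRIMARY KEY REFERENCES employee,
    name   text NOT NULL
);

-- Predicate: "Employee E works in department D."
CREATE TABLE employee_dept (
    emp_no int PRIMARY KEY REFERENCES employee,
    dept   text NOT NULL
);
```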
It seems that your definition of normalization is not the one that I am using above. What is it?
Logs, metrics, data analytics, data sent to third parties as identifiers. Unless you have a system small enough that you, or a few people who can be trusted, can know every component, your IDs will leak out somewhere.
- Joins, lookups, indexes. Here data type can matter regarding performance and resource use.
I struggle to see a practical example.
> - Idempotency. Allowing a client to generate IDs can be a big help here (ie UUIDs)
Natural keys solve this.
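For example (a sketch in PostgreSQL syntax, with made-up names): retrying the same logical insert is a no-op because the natural key already identifies the fact, so the client never needs to mint a UUID just for idempotency.

```sql
CREATE TABLE payment (
    invoice_no varchar(20) PRIMARY KEY,  -- natural key
    paid_on    date        NOT NULL,
    amount     numeric     NOT NULL
);

-- Retrying this statement after a timeout or crash changes nothing:
INSERT INTO payment (invoice_no, paid_on, amount)
VALUES ('INV-2024-0001', '2024-05-01', 99.00)
ON CONFLICT (invoice_no) DO NOTHING;
```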
> - Sharing. You may want to share a URL to something that requires the key, but not expose domain data (a URL to a user’s profile image shouldn’t expose their national ID).
Then you have another piece of data, which you relate to the natural key. Something like `exposed-name`.
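One possible shape of that (PostgreSQL 13+ syntax, invented names): the natural key stays the key, and an opaque handle is related to it for anything that leaves the system.

```sql
CREATE TABLE person (
    national_id  varchar(20) PRIMARY KEY,                        -- natural key, internal use only
    exposed_name uuid NOT NULL UNIQUE DEFAULT gen_random_uuid()  -- opaque handle for URLs
);
-- URLs carry exposed_name; joins and integrity keep using national_id.
```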
> There is not one solution that handles all of these well
Natural keys solve these issues.
> Also, we all know that stakeholders will absolutely swear that there will never be two people with the same national ID. Oh, except unless someone died, then we may reuse their ID. Oh, and sometimes this remote territory has duplicate IDs with the mainland. Oh, and for people born during that revolution 50 years ago, we just kinda had to make stuff up for them.
If this happens, the designer made an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
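As one hedged example of what "extend the design" could mean for the cases quoted above (columns are illustrative): widen the key to carry the distinguishing facts instead of abandoning it.

```sql
CREATE TABLE citizen (
    national_id  varchar(20) NOT NULL,
    issuing_area varchar(10) NOT NULL,  -- mainland vs. remote territory
    issued_on    date        NOT NULL,  -- distinguishes a reused ID from the original
    PRIMARY KEY (national_id, issuing_area, issued_on)
);
```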
> Actually, the article is proposing a new principle
I'm putting it in words, but such knowledge has been common in the database community for ages, afaict.
> If this happens, the designer made an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
Errors in the initial design should be assumed as the default. Wise software engineering should make change easy.
Constraints on natural keys are business logic, not laws of mathematical truth. They are subject to change and often violated in the real world. The database as record-keeping engine should only enforce constraints on artificial keys that maintain its integrity for tracking record identity, history, and references.
Your database may not be primarily a record-keeping engine, and your tradeoffs may be different.
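A sketch of that stance (names invented): the artificial key is the only hard identity constraint, while the business identifier is just data that is tracked and indexed but never trusted as immutable.

```sql
CREATE TABLE person_record (
    record_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- enforced: record identity
    national_id varchar(20),       -- business data: may repeat, change, or be missing
    recorded_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX person_record_national_id_idx ON person_record (national_id);
```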
Memory, CPU, and eventually even storage: those are the main practical examples of where a key composed of something very small saves you space and, thus, time.
Say we compare a bigint key vs a VARCHAR(30). Depending on your key you might be talking about terabytes of additional data just to store it (1T rows at bigint = 8TB; 1T rows at 30 chars = 30TB...). The data is also going to constantly shuffle (random inserts).
If you want to define the PK as the natural key with no separate column, then you get to do comparisons on all the natural key columns themselves, so instead of one 4- or 8-byte column comparison you get to do what, five character-column comparisons?
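To illustrate the comparison cost (schema invented): a join on a composite natural key repeats every key column in the join predicate and compares all of them per row, versus a single fixed-width integer with a surrogate.

```sql
-- Composite natural key: three column comparisons per joined row.
SELECT o.qty
FROM order_line AS o
JOIN product AS p
  ON p.maker = o.maker
 AND p.model = o.model
 AND p.revision = o.revision;

-- Surrogate key: one narrow, fixed-width comparison per joined row.
SELECT o.qty
FROM order_line AS o
JOIN product AS p ON p.product_id = o.product_id;
```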
Having worked extensively in ETL - when a developer tells me "there's no duplication about this real world process" what they mean is "there's no duplication in my mental model about this real world process"
> Memory, CPU, and eventually even storage: those are the main practical examples of where a key composed of something very small saves you space and, thus, time.
> Say we compare a bigint key vs a VARCHAR(30). Depending on your key you might be talking about terabytes of additional data just to store it (1T rows at bigint = 8TB; 1T rows at 30 chars = 30TB...). The data is also going to constantly shuffle (random inserts).
>> Joins, lookups, indexes
I don't see how what you brought up has anything to do with these.
But the main point is being missed here because of a physical vs logical conflation anyhow.
It would be helpful if the article used a natural key. Instead, it uses a primary key that is neither natural nor guaranteed, and is mutable. It makes assumptions that are not true, and that is one of the big dangers in using natural keys.
> If this happens, the designer made an error in his design, and should extend the design to accommodate the facts that escaped him at design time.
This again is a dangerous assumption. The danger here is the assumption that facts don't change. Facts do change. And facts that are true at design time are not necessarily true 1 day later, 1 year later, or 1 decade later, or more.
Again, when the example can't even use a natural key to present its idea of using natural keys, we have a problem.
The problem with natural keys is that nobody ever says, "My bad, I should have spelled my name the same way on this form as I did when I registered for the service, I promise not to do it again." Instead they say, "No matter how I spell it, you should have still known it was my name!"