
Remember that 33 bits of entropy are enough to identify everyone. It may not be legally so, but any data with 33 bits of entropy is technically PII, and you should treat it as such.


That makes no sense, sorry.

Ok, 2^33 > world population, but that doesn't mean that the string "Hello world" is PII.


33 bits of entropy, not just 33 bits


That depends on the encoding, does it not? The binary sequence equal to ASCII "Hello world" might well be PII with many different encodings. By accident, of course, but nevertheless 33 bits of information would be enough.


Unless someone is actually called Hello World. Or perhaps Bobby Tables. ;)


> any data with 33 bits of entropy is technically PII,

Well yes, but actually no.

I can run uuidgen 33 times and put the output on pastebin, via Tor. That does not mean I have 128 people's worth of PII (33 UUIDs x 128 bits, divided by 33 bits per person). In fact I have ~0 bits of PII. If I were to paste one of those numbers in this comment, they would all be linked to kortex, and to however much entropy that username has (call it 32; this account is leaky). But you are still short 127 "PIIs". The rest of the entropy means nothing.

Conversely, a US phone number without its area code (7 digits) is <24 bits, but 100% PII.

It's not 33 bits of "entropy". It's 33 bits worth of a distribution which can be correlated to an identity. If it's not correlated, it's not PII.


What do you mean here? I am asking because this is potentially useful.


There are ~8,000,000,000 people in the world; that's a ten-digit number so that's the smallest size of number which could count out a unique number for everyone in the world, 9 digits doesn't have enough possible values. If the digit values are based on details about you, e.g. being in USA sets the second digit to 0/1/2, being in Canada and male sets it to 3, being in Canada and female sets it to 4, the last two digits are your height in inches, etc. etc. then you don't have to count out the numbers and give one to everyone, the ten digits become a kind of identifier of their own. 1,445,234,170 narrows down to (a woman in Canada 70 inches tall ... ) until it only matches one person. There are lots of people of the same height so perhaps it won't quite identify a single person, but it will be close. Maybe one or two more digits is enough to tiebreak and reduce it to one person.

Almost anything will do as a tie-break between two people - married, wears glasses, keeps snakes, once visited whitehouse.gov, walked past a Bluetooth advertising beacon on Main Street San Francisco. Starting from 8 billion people and making some yes/no tiebreaks that split people into two groups, a divide and conquer approach, split the group in two, split in two again, cheerful/miserable, speaks Spanish yes/no, once owned a dog yes/no, once had a Google account yes/no, once took a photo at a wedding yes/no, ever had a tooth filling yes/no, moved house more than once in a year yes/no, ever broke a bone yes/no, has a Steam account yes/no, anything which divides people you will "eventually" winnow down from 8 billion to 1 person and have a set of tiebreaks with enough information in them to uniquely identify individual people.

I say "eventually", but if you can find tiebreaks that split the groups perfectly in half each time then you only need 33 of them to reduce 8 billion down to 1. This is all another way of saying counting in binary: 101001011010100101101010010110101 is a 33-bit binary number, it can be seen as 33 yes/no tiebreaks, and it's long enough to count up past 8 billion. There are 2^33 such numbers: two possible values in each position, 33 times.
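A rough sketch of that arithmetic, assuming a round 8 billion people and splits that are always perfect halves (the function name is made up for illustration):

```python
import math

WORLD_POPULATION = 8_000_000_000  # assumed round figure, as in the comment above

def perfect_splits_needed(population: int) -> int:
    """Count how many perfect halvings reduce `population` candidates to 1."""
    splits = 0
    remaining = population
    while remaining > 1:
        remaining = (remaining + 1) // 2  # keep the larger half when the size is odd
        splits += 1
    return splits

print(perfect_splits_needed(WORLD_POPULATION))  # 33
print(math.ceil(math.log2(WORLD_POPULATION)))   # 33
print(2**32 < WORLD_POPULATION <= 2**33)        # True
```

32 tiebreaks only cover about 4.3 billion people, so 33 is the minimum, and only if every split is perfect.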

That means any collection of data about people which gets to 33 bits of information about each person is getting close to being enough data to have a risk of uniquely identifying people. If you end up gathering how quickly someone clicks a cookie banner, that has some information hiding in it about how familiar they are with cookie banners and how physically able they are, and that starts to divide people into groups. If you gather data about their web browser, that tells you what OS they run, what version, how up to date it is; those divide people into buckets. What time they opened your email with a marketing advert in it gives a clue to their timezone and work hours. Not very precise, but it only needs 33 bits before it approaches enough to start identifying individual people. Gather megabytes of this stuff about each person, and identities fall out - the person who searched X is the same person who stood for longer by the advert beacon and supports X political candidate and lives in this area and probably has an income in X range and ... can only be Joe Bloggs.


Related, and fun to think about:

https://www.gwern.net/Death-Note-Anonymity

Solving an anonymous murderer with a supernatural MO, but an ego that betrays him.


This isn’t as meaningful as you think because each bit has to have no correlation with the others, which is hard. Or to put it another way, each bit has to perfectly bisect the population, otherwise you’ll have collisions and a bunch of empty space.


Each bit has to perfectly bisect the population for 33 bits to be sufficient. But that's the minimum and it's meaningful because of how tiny an amount of data it is compared to how much we process all the time with computers.

The time someone takes to get rid of a cookie popup is not a clear signal, it doesn't bisect the population but it has signal in it. Faster suggests fast computer, good mouse or trackpad, youth, health, familiarity with the internet. Slower suggests low quality dirty mouse or trackpad, old age, poor health, unfamiliarity with the internet. There is some data there, towards some sub-groups and away from others. "Doesn't keep snakes" is low signal because that's most people. "Keeps snakes" is high signal because it narrows down to a small sub-group.
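The "keeps snakes" asymmetry can be put in numbers with self-information, -log2(p), where p is the fraction of people a fact is true for. This is only a sketch: the 1-in-1000 snake ownership rate is a made-up figure, and `bits_revealed` is a hypothetical helper name.

```python
import math

def bits_revealed(fraction_matching: float) -> float:
    """Self-information, in bits, of learning a fact true for this fraction of people."""
    return -math.log2(fraction_matching)

SNAKE_KEEPERS = 0.001  # hypothetical rate: 1 in 1000 people keeps snakes

print(round(bits_revealed(SNAKE_KEEPERS), 2))      # 9.97  ("keeps snakes")
print(round(bits_revealed(1 - SNAKE_KEEPERS), 4))  # 0.0014 ("doesn't keep snakes")
print(bits_revealed(0.5))                          # 1.0   (a perfect bisection)
```

Under that assumed rate, a "yes" is worth almost ten of the 33 bits on its own, while a "no" is worth next to nothing.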

At extreme best/worst case _33_ of them can single someone out, but think that we leak thousands and thousands of bits every day, day after day, to all kinds of information processors who sell and aggregate them, and some that we leak are very specific like GPS coordinates or nearby WiFi SSIDs or app login details, that's why I think it's meaningful.

Facebook patented an idea of identifying photos taken with the same camera by looking at dust, distortion and damage on the camera lens. Imagine you go into a shop, open their app and photograph a QR code for a discount and that picture ties the app user to every public online photo they ever took, and the app does some device fingerprinting too. Then at home you open the web browser and visit a site with some tracking JS, and that ties a probable connection between your home internet IP and the person who was in the store earlier. You open an app and your home GPS coordinates and nearby WiFi signals are tied to the same home internet IP, and public name and address records identify who was in the store earlier. Facebook buys that advertising data and makes a shadow profile of someone who lives here, shops there, took all these photos and probably knows some of the people who appear in the photos. That's a lot more than 33 bits, but it's not a lot to imagine happening for people casually using apps and the internet several times a day every day over the last decade. And no matter how hard you try to guard against it, the wrong 33 bits leaked is all it would take in the worst case to undo all your efforts.


Sure, but again, the 33 bits have to be a set of perfect dimensions. You’re severely downplaying how difficult it is to get them. It’s realistically not any different than saying, “if everyone had a unique ID, it would only take 33 bits to store that.” It’s not very meaningful.

In fact, it would be a major fucking deal if you could publish 33 dimensions that each bisect the population exactly.

“Keeps snakes” is a perfect example of something that isn’t good enough. It’s too infrequently true to be useful in the 33-bit uniqueness identifier.
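One way to quantify "not good enough" is binary entropy, the average number of bits a yes/no question yields per person. The sketch below assumes the questions are independent of each other (which real attributes rarely are) and again uses a made-up 1-in-1000 rate; both function names are hypothetical.

```python
import math

def expected_bits(p_yes: float) -> float:
    """Binary entropy: average bits learned per person from one yes/no question."""
    if p_yes in (0.0, 1.0):
        return 0.0  # a question everyone answers the same way reveals nothing
    p_no = 1.0 - p_yes
    return -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))

def questions_needed(p_yes: float, target_bits: float = 33.0) -> int:
    """Independent questions of this skew needed to average `target_bits` in total."""
    return math.ceil(target_bits / expected_bits(p_yes))

print(expected_bits(0.5))      # 1.0
print(questions_needed(0.5))   # 33
print(questions_needed(0.001))  # thousands of questions at a 1-in-1000 skew
```

This is an average, not a guarantee: a lopsided question is weak across the whole population, yet the rare "yes" still narrows things sharply for the individuals who give it.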


Jawdropping. So this is the 33 degrees I've heard people throw around. Thank you so much for elaborating in such a detailed and insightful way.


Another way to look at this is that 2^33 is about 8.6 billion:

    >>> pow(2, 33)
    8589934592

There are fewer than 8.6 billion people in the world, so you could just create a map from each person to a number in the set.


Thank you for gearing this explanation for a person versed in software but not data analytics. It helped a lot.


One of the best posts I've seen on HN



