It says GDPR compliant and no cookies on the project page. How are unique visitors calculated? And I'm assuming it can't link conversions to campaigns without some cookie-alternative?
No idea, but generally, a bloom filter would get you there without any identifying information being stored. The counts would merely be estimates at that point, not exact values.
> Instead of tagging users with cookies, we count the number of unique IP addresses that accessed your website. Counting IP addresses is an old-school method that was used before the modern age of JavaScript snippets and tracking cookies.
Since IP addresses are considered personal data under GDPR, we anonymize them using a one-way cryptographic hash function. This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. We never store IP addresses in our database or logs.
> Since IP addresses are considered personal data under GDPR, we anonymize them using a one-way cryptographic hash function.
Um... hashing IPv4 addresses, even with salt, does literally nothing to anonymise (assuming the output space is at least ~32 bits, which I think is safe to assume): they’ll still be PII. IPv6 addresses I’m not so confident about; maybe it would be sufficient for some parts, but it’s definitely inadequate for some concerns.
(For IPv4, enumerating all four billion inputs is so completely practical that “one-way” is nonsense.)
One way if you have a salt? Enumerating won’t help, you need to know the salt, which gets deleted.
That said, the whole IP thing is weird to me. Not only are we allowed to log IPs directly for security reasons, we even *have* to log IPs in certain cases (newsletter subscriptions).
> That said, the whole IP thing is weird to me. Not only are we allowed to log IPs directly for security reasons, we even have to log IPs in certain cases (newsletter subscriptions).
The point of designating something as PII isn't that we then _never_ store or use it, it's to carefully consider if we actually need it or not (and what protections we can add for the values we do need to store/use).
We're meant to stop the practice of just collecting and storing all data, without consideration for the harms that causes.
What I understand they’re doing is storing the salt in one place, a set of hashed IP addresses in another place, then daily trashing the lot after counting the number of elements in the set and storing that.
Information-theory-wise, this is no different to just storing the actual IP addresses (and deleting them daily after tallying, as before). It does mean that you need to obtain two things instead of just one, but if you get access to it all, it’s straightforward to reverse the lot (though computationally a little expensive), and easy to check a single value for a match.
The technique may be considered reasonable effort at protecting against casual abuse, but it’s not technically effective of itself, and it doesn’t stop the data from being PII. The important aspect is that the PII is deleted within 24 hours. My personal opinion is that the hashing part should probably be considered snake oil and whitewash, at least for what they’re claiming—I don’t say it’s useless, but it definitely doesn’t do what they’re touting it for.
Unless they’re actually keeping the hashed values for some reason after one day, and associating them with other records? In which case, disregard part of what I say, it’s obviously better than persisting IP addresses long-term! But also it’s extremely dubious to call that anonymisation as they do, because you can so often tie things together, behavioural patterns and such, to deanonymise. It’s frighteningly effective.
Refer to my understanding in the first paragraph—I don’t think they’re retaining the hashed values after a day either? If they are, sure, apply my last paragraph, you can’t do a single match any more. (But the whole thing would still definitely be susceptible to deanonymisation.) But at the very least, it’s easily reversible for up to 24 hours.
The “PII” label is taint that is probably impossible to dispel completely/perfectly, and difficult to dispel sufficiently (and deanonymising is an arms race).
Lossless techniques do nothing to dilute that taint.
Lossy techniques are necessary to get anywhere, such as disregarding certain bits of the address, or Bloom filters.
The problem, in general with hashing IP addresses (especially ipv4) is that there's not that many of them.
If I tell you the value is either 1 or 2, but I hashed it with sha256 to make it secure, that's bullshit, right? You can just hash both and see which it is.
Same concept applies regardless of the hash algo, and still applies if you have more than 2 possible values, 4 billion or so possible ipv4 addresses is _not_ that many values to a computer.
Other common places this problem occurs is with any other restricted set of values, eg phone numbers and email addresses (most are at like 5 domains and are easy to guess/know).
As a side consideration, according to the varying opinions in response to this question it’s not really clear what constitutes PII (personally identifying information).
When I researched this topic it was strange to me that no one seems to agree. Is it just arm-chair internet answers? Or is it actually that the letter of the law is actually ambiguous? What are the real world consequences of using this when it’s possible it violates GDPR? Or, what are the chances there would be consequences?
PII in the general public's definition is a name, address etc. The confusion in these discussions comes from European regulations defining browsing behaviour as personal data, which makes GDPR applicable to it. Even if that browsing behaviour data is in layman's terms anonymous "and thus not PII" it is considered personal data under EU rules.