It says GDPR compliant and no cookies on the project page. How are unique visito...

withinboredom · on Nov 29, 2024

No idea, but generally, a Bloom filter would get you there without any identifying information being stored. The counts would merely be estimates at that point, not exact values.
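
To make that concrete, here is a minimal sketch of counting uniques with a Bloom filter. Everything in it (the `BloomFilter` class, the bit-array size, the number of hash functions) is an assumption for illustration and not how any particular analytics product works. Only a fixed-size bit array is kept, and the count is estimated from how many bits are set.

```python
import hashlib
import math


class BloomFilter:
    """Hypothetical fixed-size bit array; no addresses or per-visitor hashes are kept."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.m = num_bits          # assumed to be a multiple of 8
        self.k = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str) -> list[int]:
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def estimate_count(self) -> float:
        # Standard estimate from the number of set bits X:
        #   n ~= -(m / k) * ln(1 - X / m)
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        if set_bits >= self.m:
            return math.inf  # filter is saturated, no estimate possible
        return -(self.m / self.k) * math.log(1 - set_bits / self.m)


bf = BloomFilter()
for visit in ["203.0.113.5", "203.0.113.5", "198.51.100.7"]:
    bf.add(visit)  # repeat visits set the same bits again
print(round(bf.estimate_count()))
```

Note that anyone holding the filter can still ask "was this address seen?" and get an answer with some false-positive rate, so how lossy it really is depends on how small the filter is relative to the traffic.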

beeb · on Nov 29, 2024

At least for Plausible, they state this (https://plausible.io/blog/google-analytics-cookies):

> Instead of tagging users with cookies, we count the number of unique IP addresses that accessed your website. Counting IP addresses is an old-school method that was used before the modern age of JavaScript snippets and tracking cookies.

> Since IP addresses are considered personal data under GDPR, we anonymize them using a one-way cryptographic hash function. This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. We never store IP addresses in our database or logs.
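
A rough sketch of what that description could look like in code, purely as an illustration: the `DailyUniqueCounter` class, the choice of SHA-256, and the 16-byte salt are all my assumptions, and the real implementation may well hash more than just the IP.

```python
import hashlib
import secrets


class DailyUniqueCounter:
    """Hypothetical daily-salted unique counter (not Plausible's actual code)."""

    def __init__(self) -> None:
        self.salt = secrets.token_bytes(16)  # random salt for the current day
        self.seen = set()                    # salted hashes seen today

    def record(self, ip: str) -> None:
        # Only the salted hash is stored, never the raw address.
        digest = hashlib.sha256(self.salt + ip.encode()).digest()
        self.seen.add(digest)

    def rollover(self) -> int:
        """End of day: return the count, then discard the salt and the hashes."""
        count = len(self.seen)
        self.salt = secrets.token_bytes(16)
        self.seen = set()
        return count


counter = DailyUniqueCounter()
for ip in ["203.0.113.5", "203.0.113.5", "198.51.100.7"]:
    counter.record(ip)
print(counter.rollover())  # two distinct addresses were recorded
```

Because the salt changes at rollover, the same address hashes to a different value the next day, which is what prevents linking visits across days.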

chrismorgan · on Nov 29, 2024

> Since IP addresses are considered personal data under GDPR, we anonymize them using a one-way cryptographic hash function.

Um... hashing IPv4 addresses, even with salt, does literally nothing to anonymise (assuming the output space is at least ~32 bits, which I think is safe to assume): they’ll still be PII. IPv6 addresses I’m not so confident about; maybe it would be sufficient for some parts, but it’s definitely inadequate for some concerns.

(For IPv4, enumerating all four billion inputs is so completely practical that “one-way” is nonsense.)

I’m almost certain this is legal theatre.

Semaphor · on Nov 29, 2024

One way if you have a salt? Enumerating won’t help, you need to know the salt, which gets deleted.

That said, the whole IP thing is weird to me. Not only are we allowed to log IPs directly for security reasons, we even *have* to log IPs in certain cases (newsletter subscriptions).

kadoban · on Nov 29, 2024

> That said, the whole IP thing is weird to me. Not only are we allowed to log IPs directly for security reasons, we even have to log IPs in certain cases (newsletter subscriptions).

The point of designating something as PII isn't that we then _never_ store or use it, it's to carefully consider if we actually need it or not (and what protections we can add for the values we do need to store/use).

We're meant to stop the practice of just collecting and storing all data, without consideration for the harms that causes.

alkonaut · on Dec 2, 2024

Couldn't this be done with a Bloom filter in such a way that (in exchange for a small error rate) you'd not keep any individual hashes?

kadoban · on Nov 29, 2024

If what they're doing is using a secure salt and then throwing the salt away once a day that _might_ be doing something.

chrismorgan · on Nov 29, 2024

What I understand they’re doing is storing the salt in one place, a set of hashed IP addresses in another place, then daily trashing the lot after counting the number of elements in the set and storing that.

Information-theory-wise, this is no different to just storing the actual IP addresses (and deleting them daily after tallying, as before). It does mean that you need to obtain two things instead of just one, but if you get access to it all, it’s straightforward to reverse the lot (though computationally a little expensive), and easy to check a single value for a match.
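
To illustrate the reversal, here is a small sketch under my assumed scheme (SHA-256 over salt plus the textual address; `hash_ip` and `reverse_hashes` are hypothetical names). With both the salt and the hash set in hand, you simply walk the address space and look each candidate up.

```python
import hashlib
import ipaddress


def hash_ip(salt: bytes, ip: str) -> bytes:
    # Assumed hashing scheme, for illustration only.
    return hashlib.sha256(salt + ip.encode()).digest()


def reverse_hashes(salt: bytes, hashed: set, network: str = "0.0.0.0/0") -> set:
    """Recover every address in `network` whose salted hash is in `hashed`.

    The default walks all of IPv4, which is slow in pure Python but
    entirely feasible; pass a smaller network to try it quickly.
    """
    recovered = set()
    for addr in ipaddress.ip_network(network):
        ip = str(addr)
        if hash_ip(salt, ip) in hashed:
            recovered.add(ip)
    return recovered


salt = b"leaked-daily-salt"
stored = {hash_ip(salt, "203.0.113.5"), hash_ip(salt, "203.0.113.77")}

# Checking a single suspected address is one hash and one lookup.
print(hash_ip(salt, "203.0.113.5") in stored)

# Enumerating a /24 recovers both stored addresses.
print(sorted(reverse_hashes(salt, stored, "203.0.113.0/24")))
```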

The technique may be considered reasonable effort at protecting against casual abuse, but it’s not technically effective of itself, and it doesn’t stop the data from being PII. The important aspect is that the PII is deleted within 24 hours. My personal opinion is that the hashing part should probably be considered snake oil and whitewash, at least for what they’re claiming—I don’t say it’s useless, but it definitely doesn’t do what they’re touting it for.

Unless they’re actually keeping the hashed values for some reason after one day, and associating them with other records? In which case, disregard part of what I say, it’s obviously better than persisting IP addresses long-term! But also it’s extremely dubious to call that anonymisation as they do, because you can so often tie things together, behavioural patterns and such, to deanonymise. It’s frighteningly effective.

tingletech · on Nov 29, 2024

If you throw away the daily random salt (but keep the obscured IP address), how can you check a single value for a match the next day?

chrismorgan · on Nov 30, 2024

Refer to my understanding in the first paragraph—I don’t think they’re retaining the hashed values after a day either? If they are, sure, apply my last paragraph, you can’t do a single match any more. (But the whole thing would still definitely be susceptible to deanonymisation.) But at the very least, it’s easily reversible for up to 24 hours.

jszymborski · on Nov 29, 2024

What Matomo does is mask parts of the IP address (you choose how much).
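
For anyone unfamiliar with the idea, masking can be sketched with Python's `ipaddress` module. This is not Matomo's code; the `mask_ip` helper and its default prefix lengths are arbitrary choices of mine.

```python
import ipaddress


def mask_ip(ip: str, keep_bits_v4: int = 16, keep_bits_v6: int = 48) -> str:
    """Zero out everything after the leading `keep_bits` of the address."""
    addr = ipaddress.ip_address(ip)
    keep = keep_bits_v4 if addr.version == 4 else keep_bits_v6
    # strict=False lets us pass a host address and get its enclosing network.
    network = ipaddress.ip_network(f"{ip}/{keep}", strict=False)
    return str(network.network_address)


print(mask_ip("203.0.113.77"))                   # 203.0.0.0
print(mask_ip("203.0.113.77", keep_bits_v4=24))  # 203.0.113.0
print(mask_ip("2001:db8:1234:5678::1"))          # 2001:db8:1234::
```

Unlike hashing, this is lossy: many distinct addresses collapse to the same masked value, and the more bits you drop, the larger that group becomes.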

gizzlon · on Nov 29, 2024

hm.. are you saying they need scrypt or something similar?

chrismorgan · on Nov 29, 2024

The “PII” label is taint that is probably impossible to dispel completely/perfectly, and difficult to dispel sufficiently (and deanonymising is an arms race).

Lossless techniques do nothing to dilute that taint.

Lossy techniques are necessary to get anywhere, such as disregarding certain bits of the address, or Bloom filters.

kadoban · on Nov 29, 2024

The problem, in general, with hashing IP addresses (especially IPv4) is that there aren't that many of them.

If I tell you the value is either 1 or 2, but I hashed it with sha256 to make it secure, that's bullshit, right? You can just hash both and see which it is.

The same concept applies regardless of the hash algorithm, and still applies if you have more than 2 possible values: 4 billion or so possible IPv4 addresses is _not_ that many values to a computer.

This problem also occurs with any other restricted set of values, e.g. phone numbers and email addresses (most are at like 5 domains and are easy to guess/know).

pdyc · on Nov 29, 2024

Most likely through one-way IP hashing bounded by a time window. If you have UTM parameters in your URL then it can track; otherwise probably not.

awongh · on Nov 30, 2024

As a side consideration, judging by the varying opinions in response to this question, it’s not really clear what constitutes PII (personally identifiable information).

When I researched this topic it was strange to me that no one seems to agree. Is it just armchair internet answers? Or is the letter of the law actually ambiguous? What are the real-world consequences of using this if it possibly violates GDPR? Or, what are the chances there would be consequences?

t0mas88 · on Dec 3, 2024

PII in the general public's definition is a name, address etc. The confusion in these discussions comes from European regulations defining browsing behaviour as personal data, which makes GDPR applicable to it. Even if that browsing behaviour data is in layman's terms anonymous "and thus not PII" it is considered personal data under EU rules.