You're throwing out buzzwords instead of addressing the response.
It's dimensionality reduction. You cannot recover the original object. It's like using a shadow to reconstruct the face of the person casting the shadow.
Note that this has nothing to do with the expressive power of a deep neural network. You are, by definition, trying to throw away the noisy aspects of the data and generalize a lower-dimensional manifold from a high-dimensional space. If it's not lossy, it won't generalize.
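To make the "shadow" point concrete, here's a minimal linear-algebra sketch. The 1000/50 dimensions and the random projection matrix are made up for illustration, not anything about the model in question; the point is that many different originals map to the same low-dimensional representation, so nothing picks out the one you started from.

```python
# A minimal sketch of why a low-dimensional projection can't be inverted.
# Dimensions are invented (1000-dim "raw data", 50-dim "representation").
import numpy as np

rng = np.random.default_rng(0)
n_raw, n_embed = 1000, 50

# An arbitrary projection matrix standing in for the "shadow" operation.
P = rng.standard_normal((n_embed, n_raw))

x = rng.standard_normal(n_raw)          # the original object
z = P @ x                                # its low-dimensional shadow

# Any direction in the null space of P casts no shadow at all.
_, _, Vt = np.linalg.svd(P)
null_direction = Vt[-1]                  # orthogonal to every row of P
x_other = x + 10.0 * null_direction      # a very different original...

print(np.allclose(P @ x_other, z))       # ...with an identical projection: True
print(np.linalg.norm(x - x_other))       # yet far from the original: ~10.0
```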
You're right that it's really just a form of dimensionality reduction. My point was just that it's a more powerful form of dimensionality reduction than PCA or NMDS.
[Edit: and that the salient characteristics are likely contained in the model.]
Precisely because it's more powerful, it doesn't encode the identifying information of the original data. Something like PCA likely would retain identifying characteristics (depending on how many of the low-variance components you drop).
Setting aside the fact that they have identities for all of the people whose data they acquired: yes, it would be harder to reconstruct individual people from it than from PCA, because PCA's output is directly interpretable.
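For comparison, here's a rough sketch of the PCA side of that claim. It uses scikit-learn's built-in digits set as a stand-in for face data (an assumption purely so the example runs offline): PCA comes with an explicit inverse back into pixel space, so the more components you keep, the more identifying detail survives.

```python
# How much of the original comes back from PCA depends on how many
# components you keep; the inverse map is built in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                    # 1797 images, 64 pixels each

for k in (2, 10, 40):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))   # back to pixel space
    err = np.mean((X - X_hat) ** 2)
    print(f"{k:>2} components kept -> mean reconstruction error {err:.2f}")

# The more components you keep, the closer the reconstruction gets to the
# original image; a learned embedding ships with no comparable inverse map.
```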
They claim to have deleted that data. If they haven't deleted the data, then of course it's still an invasion of privacy. But the ML model really has nothing to do with it.
I think the ML model has a lot to do with it in this case.
One of the arguments I expect to see is: "Oh, no! We removed all the data. It's gone. I mean, it was only a few hundred megabytes per person anyway, but we just calculate a few thousand numbers from it, save those in our system, then delete the data. That's less data per person than is needed to show a short cute cat GIF. What harm could we possibly do with that?"
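For scale, a quick back-of-the-envelope on that "few thousand numbers" framing (the 2048-float size and float32 storage are assumptions for illustration, not figures from the article):

```python
# Rough size of one stored feature vector under assumed parameters.
embedding_floats = 2048              # "a few thousand numbers"
bytes_per_float = 4                  # float32
kb_per_person = embedding_floats * bytes_per_float / 1024
print(f"~{kb_per_person:.0f} KB per person")   # ~8 KB, far less than a short GIF
```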
My point isn't that there is no harm here in them storing this model. It's also not that the data in their model is worthless. It's specifically that the way this article is talking about the issue is incorrect. The analogy they use would lead you to draw false conclusions about what's going on, and how to understand it.
There is a real issue here of whether or not they should be allowed to keep a model trained from ill-gotten data. But the way I would think about it is: If you steal a million dollars and invest it in the stock market, and make a 10% return, what happens to that 10% return if you then return the original million? That's a much better analogy for what's going on here. They stole an asset, and made something from it, and it's unclear who owns that thing or what to do with it.
The ML model might know more about me than I’m willing to admit about myself. I only find some — but not much — comfort in the proposition that it can’t conjure my PII.
Regardless, I still think that having the most relevant features already extracted is all they need to answer many of the questions they might want to ask. The point is that that's still quite bad.
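As a sketch of what "answering questions" with nothing but the extracted features could look like (the 512-dimensional embeddings, gallery size, and noise level are invented for illustration): matching a new photo against stored embeddings is just a nearest-neighbour lookup by cosine similarity, with no reconstruction needed at all.

```python
# Re-identification from embeddings alone, no raw data required.
import numpy as np

rng = np.random.default_rng(1)

# Stored embeddings, one unit-length vector per person (hypothetical 512-dim model).
gallery = rng.standard_normal((10_000, 512))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# A fresh photo of person #1234, perturbed to mimic different pose/lighting.
probe = gallery[1234] + 0.02 * rng.standard_normal(512)
probe /= np.linalg.norm(probe)

# Cosine similarity against everyone in the gallery; the best match re-identifies them.
scores = gallery @ probe
print(int(np.argmax(scores)))   # -> 1234
```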
I don't think you understand what dimensionality reduction means.
An SSN is a lookup key into the raw data. Dimensionality reduction is by definition lossy, since it's used in scenarios where:
rows of data = n ≪ m = number of features