You're throwing out buzzwords instead of addressing the response.
It's dimensionality reduction. You cannot recover the original object. It's like using a shadow to reconstruct the face of the person casting the shadow.
Note that this has nothing to do with the expressive power of a deep neural network. You are, by definition, trying to throw away the noisy aspects of the data and generalize a lower-dimensional manifold from a high-dimensional space. If it's not lossy, it won't generalize.
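To make the "shadow" point concrete, here's a minimal linear-algebra sketch. The 1000/50 dimensions and the random projection matrix are made up for illustration, not anything about the model in question; the point is that many different originals map to the same low-dimensional representation, so nothing picks out the one you started from.

```python
# A minimal sketch of why a low-dimensional projection can't be inverted.
# Dimensions are invented (1000-dim "raw data", 50-dim "representation").
import numpy as np

rng = np.random.default_rng(0)
n_raw, n_embed = 1000, 50

# An arbitrary projection matrix standing in for the "shadow" operation.
P = rng.standard_normal((n_embed, n_raw))

x = rng.standard_normal(n_raw)          # the original object
z = P @ x                                # its low-dimensional shadow

# Any direction in the null space of P casts no shadow at all.
_, _, Vt = np.linalg.svd(P)
null_direction = Vt[-1]                  # orthogonal to every row of P
x_other = x + 10.0 * null_direction      # a very different original...

print(np.allclose(P @ x_other, z))       # ...with an identical projection: True
print(np.linalg.norm(x - x_other))       # yet far from the original: ~10.0
```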
You're right that it's really just a form of dimensionality reduction. My point was just that it's a more powerful form of dimensionality reduction than PCA or NMDS.
[Edit: and that the salient characteristics are likely contained in the model.]
Precisely because it's more powerful, it doesn't encode the identifying information of the original data. Something like PCA likely would retain identifying characteristics (depending on how many of the low-variance components you drop).
Setting aside the fact that they have identities for all of the people whose data they acquired: yes, it would be harder to reconstruct individual people from it than from PCA, because PCA's output is directly interpretable.
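For comparison, here's a rough sketch of the PCA side of that claim. It uses scikit-learn's built-in digits set as a stand-in for face data (an assumption purely so the example runs offline): PCA comes with an explicit inverse back into pixel space, so the more components you keep, the more identifying detail survives.

```python
# How much of the original comes back from PCA depends on how many
# components you keep; the inverse map is built in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                    # 1797 images, 64 pixels each

for k in (2, 10, 40):
    pca = PCA(n_components=k).fit(X)
    X_hat = pca.inverse_transform(pca.transform(X))   # back to pixel space
    err = np.mean((X - X_hat) ** 2)
    print(f"{k:>2} components kept -> mean reconstruction error {err:.2f}")

# The more components you keep, the closer the reconstruction gets to the
# original image; a learned embedding ships with no comparable inverse map.
```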
They claim to have deleted that data. If they haven't deleted the data, then of course it's still an invasion of privacy. But the ML model really has nothing to do with it.
I think the ML model has a lot to do with it in this case.
One of the arguments I expect to see is: "Oh, no! We removed all the data. It's gone. I mean, it was only a few hundred megabytes per person anyway, but we just calculate a few thousand numbers from it, save those in our system, then delete the data. That's less data per person than is needed to show a short cute cat GIF. What harm could we possibly do with that?"
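For scale, a quick back-of-the-envelope on that "few thousand numbers" framing (the 2048-float size and float32 storage are assumptions for illustration, not figures from the article):

```python
# Rough size of one stored feature vector under assumed parameters.
embedding_floats = 2048              # "a few thousand numbers"
bytes_per_float = 4                  # float32
kb_per_person = embedding_floats * bytes_per_float / 1024
print(f"~{kb_per_person:.0f} KB per person")   # ~8 KB, far less than a short GIF
```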
My point isn't that there is no harm here in them storing this model. It's also not that the data in their model is worthless. It's specifically that the way this article is talking about the issue is incorrect. The analogy they use would lead you to draw false conclusions about what's going on, and how to understand it.
There is a real issue here of whether or not they should be allowed to keep a model trained from ill-gotten data. But the way I would think about it is: If you steal a million dollars and invest it in the stock market, and make a 10% return, what happens to that 10% return if you then return the original million? That's a much better analogy for what's going on here. They stole an asset, and made something from it, and it's unclear who owns that thing or what to do with it.
The ML model might know more about me than I’m willing to admit about myself. I only find some — but not much — comfort in the proposition that it can’t conjure my PII.
Regardless, I still think that having the most relevant features already extracted is all they need to answer many of the questions they might want to ask. The point is that that's still quite bad.
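As a sketch of what "answering questions" with nothing but the extracted features could look like (the 512-dimensional embeddings, gallery size, and noise level are invented for illustration): matching a new photo against stored embeddings is just a nearest-neighbour lookup by cosine similarity, with no reconstruction needed at all.

```python
# Re-identification from embeddings alone, no raw data required.
import numpy as np

rng = np.random.default_rng(1)

# Stored embeddings, one unit-length vector per person (hypothetical 512-dim model).
gallery = rng.standard_normal((10_000, 512))
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

# A fresh photo of person #1234, perturbed to mimic different pose/lighting.
probe = gallery[1234] + 0.02 * rng.standard_normal(512)
probe /= np.linalg.norm(probe)

# Cosine similarity against everyone in the gallery; the best match re-identifies them.
scores = gallery @ probe
print(int(np.argmax(scores)))   # -> 1234
```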
I don't think you understand what dimensionality reduction means.
An SSN is a lookup key into the raw data. Dimensionality reduction is by definition lossy, since it's used in scenarios where:
rows of data = n ≪ m = number of features