Anecdotally, the only time I've seen a truly anonymized database was at a European genetics research company, mainly due to the rightly stringent regulation in the medical field.
There was a whole separate legal entity, with its own board, that did the phenotype measurement gathering and stored the data in a big on-premises database. The link between those measurements and each individual's personally identifiable record was stored in a separate air-gapped database with cryptographic locks on both the data and physical access to the server. Accessing it required the physical presence of the privacy officers of both companies (the measurement lab and the research lab) and, what I found at the time to be the unique move, a representative from the state-run privacy watchdog.
To backtrack the data to a person, there was always going to be a need to go through the watchdog. Required by the setup itself, not just legally mandated.
All of the measurement data stored in the database came from very restricted input fields in custom software built on premises (no long-form text fields, for instance, where identifying data could accidentally end up), and a lot of thought went into designing the UI to limit the possibility of anyone putting identifiable data into a record.
For instance, numerical ranges for a specific phenotype were all prefilled in a dropdown, to keep keyboard input to a minimum. Much of the data also came from direct connections to the medical equipment (I wrote a serial connector for a Humphrey medical eye scanner that parsed the results straight into the software, skipping the human element altogether).
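To give a sense of what that kind of connector looks like, here is a minimal sketch in Python using pyserial. The port, baud rate, and record layout are assumptions for illustration; the actual Humphrey protocol and the original implementation were different.

    import re
    import serial  # pyserial

    # Assumed port and baud rate; the real device and wiring will differ.
    PORT = "/dev/ttyS0"
    BAUD = 9600

    # Only purely numeric fields are accepted; anything else is dropped,
    # so free text (and therefore accidental PII) never enters the record.
    NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")

    def read_measurement(ser):
        """Read one line from the instrument and return its numeric fields only."""
        raw = ser.readline().decode("ascii", errors="ignore").strip()
        fields = [f.strip() for f in raw.split(",")]
        return [float(f) for f in fields if NUMERIC.match(f)]

    if __name__ == "__main__":
        with serial.Serial(PORT, BAUD, timeout=5) as ser:
            values = read_measurement(ser)
            print(values)  # handed straight to the measurement software

The point of the design is the same as the dropdowns: the human never gets a chance to type anything, so there is nowhere for identifying data to sneak in.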
This didn't make for the nicest-looking software (endless dropdowns and scales), but it fulfilled its functional and privacy goals perfectly.
The measurement data would then go through many automatic filters and further anonymizing steps before being delivered through a dedicated network pipeline (configured with the local ISP to be unidirectional) to the research lab.
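A toy version of one such filter, just to illustrate the whitelist idea (the field names here are made up, not the lab's actual schema):

    # Only explicitly whitelisted, non-identifying fields may leave the measurement lab.
    ALLOWED_FIELDS = {"phenotype_code", "measurement_value", "measurement_unit"}

    def scrub(record: dict) -> dict:
        """Drop every field that is not on the whitelist before export."""
        return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

    raw = {
        "phenotype_code": "VF_MD",
        "measurement_value": -2.1,
        "measurement_unit": "dB",
        "operator_note": "patient seemed nervous",  # free text: never exported
    }
    print(scrub(raw))
    # {'phenotype_code': 'VF_MD', 'measurement_value': -2.1, 'measurement_unit': 'dB'}

Whitelisting rather than blacklisting is the safer default here: anything not explicitly known to be harmless simply never crosses the pipeline.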
Is this guaranteed to never leak any private information? No, nothing is 100%. It comes damn near close, but of course it would not work in most other normal business situations.
Yes, they had to, in case the person giving the data had opted to be notified about a severe medical condition or other revelations that showed up during the analysis process. For those cases the mappings were kept around, and using them did require going through the watchdog.
I did understand the point perfectly. The mechanism was there simply so that good actors could backtrace the data to the matching person; its purpose was never to play a part in making the data more anonymous.
If what you meant was the clearer statement that adversaries wouldn't need to go through it, then I'd have agreed with you.
Everything else I mentioned (the strict processes for determining what data could be stored and what, if anything, it exposed about the user, eliminating as much human input as possible, and the post-processing of the data before it left the measurement lab) is what achieved anonymity, as far as everyone believed it had been achieved.