The Anonymization Fallacy

I was asked by the local medical school to give an afternoon session on patient data security. The idea was to tell them how to properly anonymize their data so that the relevant patient data security laws can be followed.

I was planning on talking about the laws and then going through selection, generalization, perturbation, k-anonymity, etc. Maybe throw in a bit of cryptographic magic dust. Then I found a paper that shocked me.

There is no such thing as anonymous patient data.

The so-called patient identifying information, obvious stuff like name and address, is not the only thing useful for identifying people. You can be identified on the basis of birthdate, sex, and zip code alone. You can be identified on the basis of the search terms you typed into a search engine. You can be identified on the basis of the movies you reviewed.

Forget even things like semi-anonymous blogs.

I rooted around and found four disturbing papers:

  • Latanya Sweeney, Uniqueness of Simple Demographics in the U.S. Population, Laboratory for International Data Privacy Working Paper, LIDAP-WP4 (2000)

    She analyzed U.S. census data for 1990 and asserts that 87% of U.S. citizens are identifiable (1-anonymous) with just birthdate, sex, and zip code. She also irritated the hell out of Massachusetts Governor William Weld, who had released "anonymized" health data on state employees: she handily picked his complete medical history out of the data, using another set of data she purchased for $20.

    The working paper does not seem to be available online, but a preprint of her dissertation with the results can be found.

  • Arvind Narayanan and Vitaly Shmatikov, De-anonymizing Social Networks, IEEE Security & Privacy '09.

    Narayanan (who authors the blog 33bits on privacy questions) and Shmatikov took the Netflix challenge in a rather different way than intended: they compared its massive graph of unknown people rating known films with the IMDb database of known people rating known films. It turns out that even though subgraph isomorphism is NP-complete (really hard to calculate, for non-theoretical computer scientists), if you can identify certain nodes (in this case the film names on the nodes in one half of the bipartite graph), you can quickly find overlap. Unique overlap.

  • Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”, Preprint, University of Colorado Law Legal Studies Research Paper No. 2009-12

    This lawyer has given the whole privacy question such a thorough shakedown that it is left standing more naked than under a full-body scanner. There is no such thing as privacy any more. He insists that the ancient, creaking legal system get its act together and deal with technology. I'm not holding my breath, but I am shaken to the core. This paper is an excellent read, copiously footnoted.

  • Philippe Golle, Revisiting the Uniqueness of Simple Demographics in the US Population. In Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (Alexandria, Virginia, USA, October 30, 2006). WPES '06. ACM, New York, NY, 77-80. DOI: http://doi.acm.org/10.1145/1179601.1179615 (in the ACM Digital Library)

    Golle tries to revalidate Sweeney's results. He "only" gets 63% instead of 87% for the 1990 census, but he gives his methods and tests both the 1990 and 2000 census data. A fascinating - and scary - read.

So I decided to give the students a different lecture and help them understand what privacy means and why it is a problem.
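For the programmers in the audience, the linkage trick behind Sweeney's Weld re-identification can be sketched in a few lines. This is only a toy illustration, not her actual method or data: every record, name, and field below is invented, and the "public" list stands in for whatever named data set someone can buy.

```python
# Toy linkage attack. All records and names are invented for illustration.
QUASI_IDENTIFIERS = ("birthdate", "sex", "zip")

# "Anonymized" medical release: names removed, quasi-identifiers kept.
medical = [
    {"birthdate": "1950-03-14", "sex": "M", "zip": "02139", "diagnosis": "condition A"},
    {"birthdate": "1950-03-14", "sex": "F", "zip": "02139", "diagnosis": "condition B"},
    {"birthdate": "1962-11-02", "sex": "M", "zip": "02140", "diagnosis": "condition C"},
]

# A second data set that still carries names (hypothetical purchased list).
public = [
    {"name": "Person One", "birthdate": "1950-03-14", "sex": "M", "zip": "02139"},
    {"name": "Person Two", "birthdate": "1971-07-30", "sex": "F", "zip": "02141"},
]


def key(record):
    """Project a record onto its quasi-identifier values."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(public_records, medical_records):
    """Return (name, diagnosis) for each named person who matches
    exactly one record in the 'anonymized' release."""
    by_key = {}
    for rec in medical_records:
        by_key.setdefault(key(rec), []).append(rec)
    hits = []
    for person in public_records:
        matches = by_key.get(key(person), [])
        if len(matches) == 1:  # unique match: re-identified
            hits.append((person["name"], matches[0]["diagnosis"]))
    return hits


reidentified = link(public, medical)
print(reidentified)
```

Nothing here is clever. The join only works because the "anonymized" release kept the same three columns that the named list has.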

Right at the start some computer science types questioned my premise that there is no anonymity. They had heard lectures on this before. I discussed these papers, and then we did an experiment.

I passed out sheets of paper and had them put down their age, sex, country of birth, bachelor's degree program and the city they graduated in. I didn't trust my own instincts, but it turned out I could have: the last two were completely unnecessary in such a small group. They were also asked to make up a horrible disease that they had.

I collected the papers and had one student choose one at random. Then I had everyone stand up. A 27-year-old female from Germany was chosen. Even though there were 8 students from Germany in this population of 23 students, when I asked for all the non-27-year-olds to sit down, only 5 were left standing. And there was only one woman in this group. I had the men sit down and asked her if she had written the deadly disease X on her paper. She had, and was rather shaken.

If you can isolate equivalence classes and then rule out some of them, you very quickly can narrow down even an extremely large population.
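Here is a minimal sketch of that narrowing in code. It groups a population into equivalence classes over a chosen set of attributes and reports the smallest class size (the k in k-anonymity) and the share of people who are unique. The population is invented and is not my class list; the function and column names are mine.

```python
from collections import Counter

# Invented toy population; each tuple is (age, sex, country_of_birth).
people = [
    (27, "F", "Germany"),
    (27, "M", "Germany"),
    (27, "M", "Germany"),
    (27, "M", "Turkey"),
    (31, "F", "Germany"),
    (31, "F", "Germany"),
    (24, "M", "Poland"),
    (24, "M", "Poland"),
]
AGE, SEX, COUNTRY = 0, 1, 2


def equivalence_classes(records, columns):
    """Count how many records share each combination of the given columns."""
    return Counter(tuple(rec[c] for c in columns) for rec in records)


def k_anonymity(records, columns):
    """Size of the smallest equivalence class; 1 means someone is unique."""
    return min(equivalence_classes(records, columns).values())


def unique_fraction(records, columns):
    """Share of records that sit alone in their equivalence class."""
    classes = equivalence_classes(records, columns)
    alone = sum(1 for size in classes.values() if size == 1)
    return alone / len(records)


# Add one attribute at a time and watch the classes shrink.
for cols in [(AGE,), (AGE, SEX), (AGE, SEX, COUNTRY)]:
    print(cols, k_anonymity(people, cols), unique_fraction(people, cols))
```

In this toy data, age alone leaves everyone in a group of at least two. Adding sex already isolates one person, and adding country of birth isolates a second.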

We had a break, and after the break did some case studies in medical ethics. They had a good time with that, two students from different countries getting into a wonderful row about private companies obtaining data to deny people health insurance.

After class many came up to speak to me. One wanted assurance that Tor would keep him anonymous when surfing. Sorry, I said. Did you know that the versions of the plugins you use for Firefox, the installed fonts and your time zone pretty much identify you? The EFF has a site, Panopticlick, that will help you see this. And even the history links in your browser are readable and can identify you.

Of course, sometimes such guesses are wrong. According to this site:
Likelihood of you being FEMALE is 9%
Likelihood of you being MALE is 91%
I guess I surf like a guy.
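To put rough numbers on the browser fingerprinting point above: an attribute value shared by a fraction p of all users gives away -log2(p) bits about you, and it takes about 33 bits to single out one person on the planet. The sketch below just does that arithmetic. The shares are made-up assumptions, not measurements, the world population figure is a rough one, and adding the bits assumes the attributes are independent, which real ones are not.

```python
import math


def bits_needed(population):
    """Bits required to single out one individual in a population."""
    return math.log2(population)


def surprisal_bits(share):
    """Bits revealed by an attribute value held by this share of users."""
    return -math.log2(share)


# Made-up shares, for illustration only (assumption, not measured data).
assumed_shares = {
    "time zone": 1 / 20,
    "installed fonts": 1 / 5000,
    "plugin versions": 1 / 2000,
}

# Summing bits is only valid if the attributes are independent (assumption).
total_bits = sum(surprisal_bits(share) for share in assumed_shares.values())

world = 6_800_000_000  # rough world population, an assumption
print("bits to identify one person:", round(bits_needed(world), 1))
print("bits from assumed fingerprint:", round(total_bits, 1))
```

Whatever the real shares are, the bits add up quickly, and in a population the size of a university or a city far fewer than 33 are needed.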

The EFF has a few tips on staying anonymous. They pretty much boil down to only surfing with an iPhone that is not registered in your name.

I suppose we will have to realize that the whole world is watching what we do online. Literally.
