:: Jim McGaw's Blog

About five years ago, Netflix created a contest that would award $1,000,000 to any person or group of people who could improve their movie recommendation algorithm by 10%. To aid contestants, they released an enormous sample dataset of users, movie titles, and ratings. Each username was replaced with a numeric code. That way, users could be compared to one another, and the general rating habits of a single user could be analyzed, without divulging potentially identifiable information about the user.

The problem is that this didn't really work. Shortly after the data was released, a couple of computer scientists took the large Netflix dataset and compared it with other publicly available data sources, one of which was IMDb. By comparing the ratings of a user in the Netflix data with reviews and ratings posted in other places, they were able to de-anonymize the Netflix data and put actual names to the numeric codes.

Oops.

They didn't do this maliciously, but they were interested in demonstrating a point: there's enough data out there that it's possible to reverse engineer someone's identity from a dataset, even one that has been stripped of names and other personal details. It's almost asymptotic: as the amount of data stored in multiple locations about any one Internet user approaches infinity, the potential for actual privacy approaches zero.
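To make the matching idea concrete, here's a toy sketch in Python. To be clear, this is not the researchers' actual method, just my own illustration of the general approach: every name, movie, and rating below is made up, and the `link` function and its `min_overlap` threshold are hypothetical. It simply counts how many of a named reviewer's public ratings line up exactly with each numeric code's ratings, and claims a match when enough of them agree. I'd assume any real attack has to be far more tolerant of noise than this.

```python
from collections import defaultdict

# Made-up "anonymized" ratings: (numeric user code, movie, stars)
anonymized = [
    (1001, "Brazil", 5), (1001, "Heat", 4), (1001, "Gigli", 1),
    (1002, "Brazil", 2), (1002, "Heat", 4), (1002, "Clue", 5),
]

# Made-up public reviews posted under real names: (name, movie, stars)
public = [
    ("alice", "Brazil", 5), ("alice", "Gigli", 1),
    ("bob", "Clue", 5), ("bob", "Brazil", 2),
]


def profiles(rows):
    """Group (who, movie, stars) rows into {who: {movie: stars}}."""
    out = defaultdict(dict)
    for who, movie, stars in rows:
        out[who][movie] = stars
    return out


def link(anonymized_rows, public_rows, min_overlap=2):
    """Guess which numeric code belongs to each named reviewer.

    A code is matched to a name when at least `min_overlap` of that
    person's public ratings agree exactly with the code's ratings.
    """
    anon = profiles(anonymized_rows)
    named = profiles(public_rows)
    matches = {}
    for name, reviews in named.items():
        best_code, best_score = None, 0
        for code, ratings in anon.items():
            # Count movies where the public rating equals the "anonymous" one
            score = sum(1 for movie, stars in reviews.items()
                        if ratings.get(movie) == stars)
            if score > best_score:
                best_code, best_score = code, score
        if best_score >= min_overlap:
            matches[name] = best_code
    return matches


print(link(anonymized, public))
```

Two matching ratings are enough to tie each made-up name to a code here, which is the uncomfortable part: the more a person rates, the more their ratings act like a fingerprint.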

Maybe it doesn't matter for movie ratings, but there's an awful lot of data being compiled by hospitals and medical clinics around the country right now. There's not an IMDb for medical records, but it's not a stretch to say that such a dataset, if released to the public stripped of patient information, could be de-anonymized as well. Insurance providers would certainly have an incentive to do this.

The two computer scientists who matched up the Netflix users are working on an algorithm that will actually allow large datasets to be released in a way that prevents any reverse-engineering-of-identity hullabaloo. I think their research is very important; in the coming years, releasing datasets to the general public to solve problems is likely to become not only common but necessary. To do it in a way that protects privacy at the same time can only be of great benefit to us all.