The value of adding independent datasets

March 28th, 2008

I just discovered Datawocky, a blog by Anand Rajaraman, cofounder of Kosmix. Anand has an interesting post on attempts at the Netflix Prize by some of his students at Stanford. The most successful strategy, at least among this group, was to make use of a second, independent dataset: IMDb.

But the bigger point is that adding more, independent data usually beats designing ever-better algorithms to analyze an existing dataset. I’m often surprised that many people in the business, and even in academia, don’t realize this.

One Response to “The value of adding independent datasets”

  1. Ben Says:

    Have you read Schneier’s article on this?

    http://www.schneier.com/blog/archives/2007/12/anonymity_and_t_2.html

    “Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin, de-anonymized some of the Netflix data by comparing rankings and timestamps with public information in the Internet Movie Database, or IMDb.

    Their research (.pdf) illustrates some inherent security problems with anonymous data, but first it’s important to explain what they did and did not do.

    They did not reverse the anonymity of the entire Netflix dataset. What they did was reverse the anonymity of the Netflix dataset for those sampled users who also entered some movie rankings, under their own names, in the IMDb. (While IMDb’s records are public, crawling the site to get them is against the IMDb’s terms of service, so the researchers used a representative few to prove their algorithm.)

    The point of the research was to demonstrate how little information is required to de-anonymize information in the Netflix dataset.”
