Paul Ohm's blog

Netflix's Impending (But Still Avoidable) Multi-Million Dollar Privacy Blunder

In my last post, I had promised to say more about my article on the limits of anonymization and the power of reidentification. Although I haven't said anything for a few weeks, others have, and I especially appreciate posts by Susannah Fox, Seth Schoen, and Nate Anderson. Not only have these people summarized my article well, they have also added a lot of insightful commentary, and I commend these three posts to you.

Today brings news relating to one of the central examples in my paper: Netflix has announced plans to commit a privacy blunder that could cost it millions of dollars in fines and civil damages.

In my article, I focus on Netflix's 2006 decision to release millions of records containing the movie rating preferences of "anonymized" users to the public, in order to fuel a crowd-sourcing competition called the Netflix Prize. The Netflix Prize has been a huge win for Netflix's public relations, but it has also been a win for academics, who have used the data to improve the science of guessing human behavior from past preferences.

The Netflix Prize was also a watershed event for reidentification research because Arvind Narayanan and Vitaly Shmatikov of U. Texas revealed that they could reidentify some of the "anonymized" users with ease, proving that we are more uniquely tied to our movie rating preferences than intuition would suggest. In my paper, I argue that we should worry about this privacy breach even if we don't think movie ratings are terribly sensitive, because it can be used to enable other, more terrifying privacy breaches.

I never argue, however, that Netflix deserves punishment or sanction for having released this data. In my opinion, Netflix acted pretty responsibly. It consulted with computer scientists in a (failed) attempt to anonymize successfully. It tried perturbing the data in order to make reidentification harder. And other experts seem to have been surprised by how easy it was for Narayanan and Shmatikov to reidentify. Even with the benefit of hindsight, I find nothing to blame in how Netflix handled the privacy implications of what it did.

Although I give Netflix a pass for its past privacy breach, I am astonished to learn from the New York Times that the company plans a second act:

The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.

Netflix should cancel this new, irresponsible contest, which it has dubbed Netflix Prize 2. Researchers have known for more than a decade that gender plus ZIP code plus birthdate uniquely identifies a significant percentage of Americans (87%, according to Latanya Sweeney's famous study). True, Netflix plans to release age, not birthdate, but simple arithmetic shows that for many people in the country, gender plus ZIP code plus age will narrow their private movie preferences down to at most a few hundred people. Netflix needs to understand the concept of "information entropy": even if it is not revealing information tied to a single person, it is revealing information tied to so few that we should consider this a privacy breach.
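The "simple arithmetic" can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: the ZIP code populations, the figure of 80 distinct ages, and the assumption that residents are spread evenly across ages and genders are all hypothetical round numbers, not anything taken from Netflix's data or from Sweeney's study.

```python
import math

def anonymity_set_size(zip_population, n_ages=80, n_genders=2):
    """Expected number of people sharing one (ZIP, age, gender) combination,
    assuming residents are spread evenly across ages and genders."""
    return zip_population / (n_ages * n_genders)

def bits_revealed(n_ages=80, n_genders=2):
    """Bits of identifying information carried by age and gender together,
    under the same (assumed) uniform spread."""
    return math.log2(n_ages * n_genders)

# Hypothetical ZIP code populations, chosen only for illustration.
for population in (1_000, 30_000, 100_000):
    size = anonymity_set_size(population)
    print(f"ZIP with {population:,} residents: about {size:,.0f} people per (age, gender) cell")

print(f"Age plus gender contribute roughly {bits_revealed():.1f} bits")
```

Real populations are not uniform, so some cells will be far smaller than these averages, which is exactly where the risk concentrates.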

I have no doubt that researchers will be able to use the techniques of Narayanan and Shmatikov, together with databases revealing sex, ZIP code, and age, to tie many people directly to these supposedly anonymized new records.

Because of this, if it releases the data, Netflix might be breaking the law. The Video Privacy Protection Act (VPPA), 18 U.S.C. § 2710, prohibits a "video tape service provider" (a broadly defined term) from revealing "personally identifiable information" about its customers. Aggrieved customers can sue providers under the VPPA, and courts can order "not less than $2500" in damages for each violation. If somebody brings a class action lawsuit under this statute, Netflix might face millions of dollars in damages.

Additionally, the FTC might decide to fine Netflix for violating its privacy policy as an unfair business practice.

Either a lawsuit under the VPPA or an FTC investigation would turn, in large part, on one sentence in Netflix's privacy policy: "We may also disclose and otherwise use, on an anonymous basis, movie ratings, consumption habits, commentary, reviews and other non-personal information about customers." If sued or investigated, Netflix will surely argue that its acts are immunized by the policy, because the data is disclosed "on an anonymous basis." While this argument might have carried the day in 2006, before Narayanan and Shmatikov conducted their study, the argument is much weaker in 2009, now that Netflix has many reasons to know better, including, in part, my paper and the publicity surrounding it. A weak argument is made even weaker if Netflix includes the kind of data--ZIP code, age, and gender--that we have known for over a decade fails to anonymize.

The good news is Netflix has time to avoid this multi-million dollar privacy blunder. As far as I can tell, the Netflix Prize 2 has not yet been launched.

Dear Netflix executives: Don't do this to your customers, and don't do this to your shareholders. Cancel the Netflix Prize 2, while you still have the chance.

Anonymization FAIL! Privacy Law FAIL!

I have uploaded my latest draft article, entitled Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, to SSRN (look carefully for the download button, just above the title; it's a little buried). According to my abstract:

Computer scientists have recently undermined our faith in the privacy-protecting power of anonymization, the name for techniques for protecting the privacy of individuals in large databases by deleting information like names and social security numbers. These scientists have demonstrated they can often “reidentify” or “deanonymize” individuals hidden in anonymized data with astonishing ease. By understanding this research, we will realize we have made a mistake, labored beneath a fundamental misunderstanding, which has assured us much less privacy than we have assumed. This mistake pervades nearly every information privacy law, regulation, and debate, yet regulators and legal scholars have paid it scant attention. We must respond to the surprising failure of anonymization, and this Article provides the tools to do so.

I have labored over this article for a long time, and I am very happy to finally share it publicly. Over the next week or so, I will write a few blog posts here, summarizing the article's high points and perhaps expanding on what I couldn't get to in a mere 28,000 words.

Thanks to Ed, David, and everybody else at Princeton's CITP for helping me develop this article during my visit earlier this year.

Please let me know what you think, either in these comments or by direct email.

Did the Sanford E-Mail Tipster or the Newspaper Break the Law?

Part of me doesn't want to comment on the Mark Sanford news, because it's all so tawdry and inconsistent with the respectable, family-friendly tone of Freedom to Tinker. But since everybody from the Gray Lady on down is plastering the web with stories, and because all of this reporting is leaving unanalyzed some Internet law questions, let me offer this:

On Wednesday, after Sanford's confessional press conference, The State, the largest newspaper in South Carolina, posted email messages appearing to be love letters between the Governor and his mistress. (The paper obscured the name of the mistress, calling her only "Maria.") The paper explained in a related news story that they had received these messages from an anonymous tipster back in December, but until yesterday's unexpected corroboration of their likely authenticity, they had just sat on them.

Did the anonymous tipster break the law by obtaining or disclosing the email messages? Did the paper break the law by publishing them? After the jump, I'll offer my take on these questions.

FBI's Spyware Program

Note: I worked for the Department of Justice's Computer Crime and Intellectual Property Section (CCIPS) from 2001 to 2005. The documents discussed below mention a memo written by somebody at CCIPS during the time I worked there, but absolutely everything I say below reflects only my personal thoughts and impressions about the documents released to the public today.

Two years ago, Kevin Poulsen broke the news that the FBI had successfully deployed spyware to help catch a student sending death threats to his high school. The FBI calls the tool a CIPAV for "computer and internet protocol address verifier."

We learned today that Kevin filed a Freedom of Information Act request (along with EFF and CNet News) asking for other information about CIPAVs. The FBI has responded, Kevin made the 152 pages available, and I just spent the past half hour skimming them.

Here are some unorganized impressions:

  • The 152 pages don't take long to read, because they have been so heavily redacted. The vast majority of the pages have no substantive content at all.
  • Page one may be the most interesting page. Someone at CCIPS, my old unit, cautions that "While the technique is of indisputable value in certain kinds of cases, we are seeing indications that it is being used needlessly by some agencies, unnecessarily raising difficult legal questions (and a risk of suppression) without any countervailing benefit."
  • On page 152, the FBI's Cryptographic and Electronic Analysis Unit (CEAU) "advised Pittsburgh that they could assist with a wireless hack to obtain a file tree, but not the hard drive content." This is fascinating on several levels. First, what wireless hack? The spyware techniques described in Poulsen's reporting are deployed when a target is unlocatable, and the FBI tricks him or her into clicking a link. How does wireless enter the picture? Don't you need to be physically proximate to your target to hack them wirelessly? Second, why could CEAU "assist . . . to obtain a file tree, but not the hard drive content"? That smells like a legal constraint, not a technical one. Maybe some lawyer was making distinctions based on probable cause?
  • On page 86, the page summarizing the FBI's Special Technologies and Applications Office (STAO) response to the FOIA request, STAO responds that they have included an "electronic copy of 'Magic Quadrant for Information Access Technology'" on CD-ROM. Is that referring to this Gartner publication, and if so, what does this have to do with the FOIA request? I'm hoping one of the uber geeks reading this blog can tie FBI spyware to this phrase.
  • Pages 64-80 contain the affidavit written to justify the use of the CIPAV in the high school threat case. I had seen these back when Kevin first wrote about them, but if you haven't seen them yet, you should read them.
  • It definitely appears that the FBI is obtaining search warrants before installing CIPAVs. Although this is probably enough to justify grabbing IP addresses and information packed in a Windows registry, it probably is not enough alone to justify tracing IP addresses in real time. The FBI probably needs a pen register/trap and trace order in addition to the warrant to do that under 18 U.S.C. 3123. Although pen registers are mentioned a few times in these documents--particularly in the affidavit mentioned above--many of the documents simply say "warrant." This is probably not of great consequence, because if the FBI has probable cause to deploy one of these, they can almost certainly justify a pen register order, but why are they being so sloppy?

Two final notes: First, I twittered my present sense impressions while reading the documents, which was an interesting experiment for me, if not for those following me. If you want to follow me, visit my profile.

Second, if you see anything else in the documents that bears scrutiny, please leave it in the comments of this post.

Fascinating New Blog: ComputationalLegalStudies.com

I was inspired to post the essay I discussed in the prior post by the debut of the best new law blog I have seen in a long time, Computational Legal Studies, featuring the work of Daniel Katz and Michael Bommarito, both graduate students in the University of Michigan's political science department.

Every single post they have published has caused me to smack my head once for not having thought of the idea first, and a second time for not having their datasets and skillz. Their visualization of who has gotten TARP funds and how they're connected to legislators deserves to be printed on posters and hung up in newsrooms across the country (not to mention in offices on Capitol Hill). They've also shown good taste by building a bridge to this blog, linking favorably back to the great CITP work led by David Robinson on government openness.

I will have more to say about Dan and Mike's new blog in the weeks and months to come, but for now it is enough to welcome them to the blogosphere.

Computer Programming and the Law: A New Research Agenda

By my best estimate, at least twenty different law professors on the tenure track at American law schools once held a job as a professional computer programmer. I am proud to say that two of us work at my law school.

Most of these hyphenate lawprof-coders rarely write any code today, and this is a shame. There are many good reasons why the world would be a better place if we began to integrate computer programming into legal scholarship (and more generally, into law and policy).

Two years ago, I wrote a blog post for a lawprof blog exploring this idea. I promised a follow-up post, but never delivered. A year later, I expanded the idea into an essay, which the good people at the Villanova Law Review agreed to publish sometime later this year. With this post, I am releasing a slightly-outdated draft of the essay for the first time to the public. You can download it at SSRN.

In the abstract, I say:

This essay proposes a new interdisciplinary research agenda called Computer Programming and the Law. By harnessing the power of computer programming, legal scholars can develop better tools, data, and insights for advancing their research interests. This essay presents the case for this new research agenda, highlights some examples of those who have begun to blaze the trail, and includes code samples to demonstrate the power and potential of developing software for legal scholarship. The code samples in this essay can be run like a piece of software—thanks to a technique known as literate programming—making this the world’s first law review article that is also a working computer program.

If you have any interest in the intersection of technology and policy (in other words, if you read this blog), please read the essay and let me know what you think. Unlike many law review articles, this one is short. And how bad could it be? It contains 350 lines of perl! (Wait, don't answer that!)

Being Acquitted Versus Being Searched (YANAL)

With this post, I'm launching a new, (very) occasional series I'm calling YANAL, for "You Are Not A Lawyer." In this series, I will try to disabuse computer scientists and other technically minded people of some commonly held misconceptions about the law (and the legal system).

I start with something from criminal law. As you probably already know, in the American criminal law system, as in most others, a jury must find a defendant guilty "beyond a reasonable doubt" to convict. "Beyond a reasonable doubt" is a famously high standard, and many guilty people are free today only because the evidence against them does not meet this standard.

When techies think about criminal law, and in particular crimes committed online, they tend to fixate on this legal standard, dreaming up ways people can use technology to inject doubt into the evidence to avoid being convicted. I can't count how many conversations I have had with techies about things like the "open wireless access point defense," the "trojaned computer defense," the "NAT-ted firewall defense," and the "dynamic IP address defense." Many people have talked excitedly to me about tools like TrackMeNot or more exotic methods which promise, at least in part, to inject jail-springing reasonable doubt onto a hard drive or into a network.

People who place stock in these theories and tools are neglecting an important drawback. There is another set of legal standards--the legal standards governing search and seizure--you should worry about long before you ever get to "beyond a reasonable doubt". Omitting a lot of detail, the police, even without going to a judge first, can obtain your name, address, and credit card number from your ISP if they can show the information is relevant to a criminal investigation. They can obtain transaction logs (think Apache or sendmail logs) after convincing a judge the evidence is "relevant and material to an ongoing criminal investigation." If they have probable cause--another famous, but often misunderstood standard--they can read all of your stored email, rifle through your bedroom dresser drawers, and image your hard drive. If they jump through a few other hoops, they can wiretap your telephone. Some of these standards aren't easy to meet, but all of them are well below the "beyond a reasonable doubt" standard for guilt.

So by the time you've had your Perry Mason moment in front of the jurors, somehow convincing them that the fact that you don't enable WiFi authentication means your neighbor could've sent the death threat, your life will have been turned upside down in many ways: The police will have searched your home and seized all of your computers. They will have examined all of the files on your hard drives and read all of the messages in your inboxes. (And if you have a shred of kiddie porn stored anywhere, the alleged death threat will be the least of your worries. I know, I know, the virus on your computer raises doubt that the kiddie porn is yours!) They will have arrested you and possibly incarcerated you pending trial. Guys with guns will have interviewed you and many of your friends, co-workers, and neighbors.

In addition, you will have been assigned an overworked public defender who has no time for far-fetched technological defenses and prefers you take a plea bargain, or you will have paid thousands of dollars to a private attorney who knows less than the public defender about technology, but who is "excited to learn" on your dime. Maybe, maybe, maybe after all of this, your lawyer convinces the judge or the jury. You're free! Congratulations?

The police and prosecutors run into many legal standards, many of which are much easier to satisfy than "beyond a reasonable doubt" and most of which are met long before they see an access point or notice a virus infection. By meeting any of these standards, they can seriously disrupt your life, even if they never end up putting you away.

What Your Mailman Knows (Part 1 of 2)

A few days ago, National Public Radio (NPR) tried to offer some lighter fare to break up the death march of gloomier stories about economic calamity. You can listen to the story online. The story's reporter, Chana Joffe-Walt, followed a mail carrier named Andrea on her route around the streets of Seattle. The premise of the story is that Andrea can measure economic suffering along her mail route--and therefore in that mythical place, "Main Street"--by keeping tabs on the type of mail she delivers. I have two technology policy thoughts about this story, but because I have a lot to say, I will break this into two posts. In this post, I will share some general thoughts about privacy, and in the next post, I will tie this story to NebuAd and Phorm.

I was troubled by Andrea's and Joffe-Walt's cavalier approaches to privacy. In the course of the five minute story, Andrea reveals a lot of private, personal information about the people on her route. Only once does Joffe-Walt even hint at the creepiness of peering into people's private lives in this way, embracing a form of McNealy's "you have no privacy, get over it" declaration. In the first line of the story, Joffe-Walt says, "Okay before we can do this, I need to clear up one question: Yes, your mailman reads your postcards; she notices what magazines you get, which catalogs; she knows everything about you." The last line of the story is simply, "The government is just starting on its $700 billion plan. As it moves forward, Wall Street economists will be watching Wall Street; Fed economists will be watching Wall Street; Andrea will be watching the mail."

There are many privacy lessons I can draw from this: First, did the Postal Service approve Andrea's participation in the interview? If it did, did it weigh the privacy impact? If not, why not?

More broadly speaking, I bet all of the people who produced or authorized this story, from Andrea and Joffe-Walt to the Postal Service and NPR, if they thought about privacy at all, engaged in a cost-benefit balancing, and they evidently made the same types of mistakes on both sides of that balancing that people often make when they think about privacy.

First, what are the costs to privacy from this story? At first blush, they seem to be slight to non-existent because the reporter anonymized the data. Although most of the activity in the story appears to center on one city block in Seattle, we aren't told which city block. This is a lot like AOL arguing that it had anonymized its search queries by replacing IP addresses with unique identifiers or like Phorm arguing that it protects privacy by forgetting that you visited Orbitz.com and remembering instead only that you visited a travel-related website.
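To make the weakness of identifier swapping concrete, here is a minimal sketch. It is not AOL's actual process: the function names are hypothetical, and the log entries, addresses, and queries are made up for illustration. It shows only that when each IP address is replaced by a stable ID, every record from the same source remains linked together.

```python
from collections import defaultdict

def pseudonymize(log):
    """Swap each IP address for a stable numeric ID; queries are left untouched."""
    ids = {}
    out = []
    for ip, query in log:
        # The same IP always maps to the same ID, so records stay linkable.
        user_id = ids.setdefault(ip, len(ids) + 1)
        out.append((user_id, query))
    return out

def group_by_user(pseudonymized_log):
    """Rebuild each pseudonymous user's full query history."""
    histories = defaultdict(list)
    for user_id, query in pseudonymized_log:
        histories[user_id].append(query)
    return dict(histories)

# Made-up log entries; the addresses come from a range reserved for documentation.
raw_log = [
    ("192.0.2.10", "landscapers in my small town"),
    ("192.0.2.44", "cheap flights"),
    ("192.0.2.10", "people with my unusual last name"),
    ("192.0.2.10", "homes sold in my subdivision"),
]

histories = group_by_user(pseudonymize(raw_log))
for user_id, queries in histories.items():
    print(user_id, queries)
```

No IP address survives in the output, yet the first user's three queries still sit side by side, and it is the combination of such details that can point to one person.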

The NPR story exposes the flaw in this type of argument. Although a casual listener won't be able to place the street toured by Andrea, it probably wouldn't be very hard to pierce this cloak of privacy. In the story, we are told that the street is "three-quarters of a mile [north] of" Main Street. The particular block is "a wide residential block where section 8 housing butts against glassy, snazzy new chic condos that cost half-a-million dollars." Across the block are a couple of businesses, including a cafe "across the way." Does this describe more than a few possible locations in Seattle? [Insert joke about the number of cafes in Seattle here.]

It's probably even easier for someone who lives in Seattle to pinpoint the location, particularly if it is near where they live or work. Thanks to NPR, these people now know that in the Section 8 building live "a single mom with an affinity for black leather [who] is getting an overdraft notice" and a "minister . . . getting more late payment bills." The owner of the cafe has been outed as somebody who pays his bills only by applying for new credit cards. If you lived or worked on this particular block, wouldn't you have at least a hunch about the identities of the people tied to these potentially embarrassing facts?

Laboring under the mistaken belief that anonymization negated any costs to privacy, the creators of the story probably thought the costs were outweighed by the potential benefits. But these benefits seem to pale in comparison to the privacy risks, accurately understood. What does the listener gain by listening to this story? A small bit of anecdotal knowledge about the economic crisis? A reason to fear his mailman? The small thrill of voyeurism? A chance to think about the economic crisis while not seized by fear and dread? I'm not saying that these benefits are valueless, but I don't think they were justified when held against the costs.

Opting In (or Out) is Hard to Do

Thanks to Ed and his fellow bloggers for welcoming me to the blog. I'm thrilled to have this opportunity, because as a law professor who writes about software as a regulator of behavior (most often through the substantive lenses of information privacy, computer crime, and criminal procedure), I often need to vet my theories and test my technical understanding with computer scientists and other techies, and this will be a great place to do it.

This past summer, I wrote an article (available for download online) about ISP surveillance, arguing that recent moves by NebuAd/Charter, Phorm, AT&T, and Comcast augur a coming wave of unprecedented, invasive deep-packet inspection. I won't reargue the entire paper here (the thesis is no doubt much less surprising to the average Freedom to Tinker reader than to the average lawyer) but you can read two bloggy summaries I wrote here and here or listen to a summary I gave in a radio interview. (For summaries by others, see [1] [2] [3] [4]).

Two weeks ago, Verizon and AT&T told Congress that they would monitor for marketing purposes only users who had opted in. According to Verizon VP Tom Tauke, "[B]efore a company captures certain Internet-usage data for targeted or customized advertising purposes, it should obtain meaningful, affirmative consent from consumers."

I applaud this announcement, but I'm curious how the ISPs will implement this promise. It seems like there are two architectural puzzles here: how does the user convey consent, and how does the provider distinguish between the packets of consenting and nonconsenting users? For an ISP, neither step is nearly as straightforward as it is for a web provider like Google, which can simply set and check cookies. For the first piece, I suppose a user can click a check box on a web-based form or respond to an e-mail, letting the ISP know he would like to opt in. These solutions seem clumsy, however, and ISPs probably want a system that is as seamless and easy to use as possible, to maximize the number of people opting in.

Once ISPs have a "white list" of users who have opted in, how do they turn this into on-the-fly discretionary packet sniffing? Do they map white-listed users to IP addresses and add these to a filter, or is there a risk that things will get out of sync during DHCP lease renewals? Can they use cookies, perhaps redirecting every HTTP session to an ISP-run web server first using 301 HTTP status codes? (This seems to be the way Phorm implements opt-out, according to Richard Clayton's illuminating analysis.) Do any of these solutions scale for an ISP with hundreds of thousands of users?
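The synchronization worry can be pictured with a toy model. Everything here is hypothetical: the LeaseTable class, the function names, the subscriber names, and the address (taken from a range reserved for documentation) are assumptions made for illustration, and no real ISP system is being described. The sketch compares a precomputed IP filter with a lookup made at the moment a packet is seen.

```python
class LeaseTable:
    """Toy stand-in for DHCP state: which subscriber currently holds which IP."""

    def __init__(self):
        self._ip_to_subscriber = {}

    def assign(self, ip, subscriber):
        # A renewal or reassignment simply overwrites the previous holder.
        self._ip_to_subscriber[ip] = subscriber

    def holder(self, ip):
        return self._ip_to_subscriber.get(ip)

def build_ip_filter(opted_in, leases, ips):
    """Snapshot approach: precompute the set of IPs held by opted-in subscribers."""
    return {ip for ip in ips if leases.holder(ip) in opted_in}

def should_inspect(ip, opted_in, leases):
    """Lookup approach: resolve the IP to its current holder at decision time."""
    return leases.holder(ip) in opted_in

opted_in = {"alice"}
leases = LeaseTable()
leases.assign("198.51.100.7", "alice")
snapshot = build_ip_filter(opted_in, leases, ["198.51.100.7"])

# The lease changes hands: bob, who never opted in, now holds the address.
leases.assign("198.51.100.7", "bob")

stale_decision = "198.51.100.7" in snapshot                         # snapshot still says inspect
live_decision = should_inspect("198.51.100.7", opted_in, leases)    # lookup says do not
print(stale_decision, live_decision)
```

The snapshot filter keeps treating the address as opted in after it has changed hands, while the per-packet lookup does not; the price of the second approach is a lease lookup on every decision, which is where the scaling question comes in.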

And are things any easier if the ISP adopts an opt-out system instead?
