Though the Aaron Swartz tragedy has brought some much needed attention to the CFAA, I want to focus on a more recent CFAA event — one that has received much less attention but might actually touch many more people than the case against Swartz.
Andrew “Weev” Auernheimer (whom I will call AA for short) was recently convicted under the CFAA and sentenced to 41 months in prison and $73K in restitution. Orin Kerr is representing him before the Third Circuit. I am seriously considering filing an amicus brief on behalf of all academics. In short, this case scares me in a much more personal way than the cases discussed in my prior CFAA posts. More after the jump.
Here’s the basic story, as described by Orin Kerr:
When iPads were first released, iPad owners could sign up for Internet access using AT&T. When they signed up, they gave AT&T their e-mail addresses. AT&T decided to configure their webservers to “pre load” those e-mail addresses when it recognized the registered iPads that visited its website. When an iPad owner would visit the AT&T website, the browser would automatically visit a specific URL associated with its own ID number; when that URL was visited, the webserver would open a pop-up window that was preloaded with the e-mail address associated with that iPad. The basic idea was to make it easier for users to log in to AT&T’s website: The user’s e-mail address would automatically appear in the pop-up window, so users only needed to enter in their passwords to access their account. But this practice effectively published the e-mail addresses on the web. You just needed to visit the right publicly-available URL to see a particular user’s e-mail address. Spitler [AA’s alleged co-conspirator] realized this, and he wrote a script to visit AT&T’s website with the different URLs and thereby collect lots of different e-mail addresses of iPad owners. And they ended up collecting a lot of e-mail addresses — around 114,000 different addresses — that they then disclosed to a reporter. Importantly, however, only e-mail addresses were obtained. No names or passwords were obtained, and no accounts were actually accessed.
Let me paraphrase this: AA went to a publicly accessible website, using publicly accessible URLs, and saved the results that AT&T sent back in response to each URL. In other words, AA did what you do every time you load up a web page. The only difference is that AA did it for multiple URLs, using sequential guesses at what those URLs would be. There was no robots.txt file that I’m aware of (this file tells search engines which URLs their spiders should not crawl). There was no user notice or agreement that barred use of the web page in this manner. Note that I’m not saying such things should make the conduct illegal, but only that such things didn’t even exist here. It was just two people loading data from a website. Note that a commenter on my prior post asked this exact question, whether “link guessing” was illegal, and I was noncommittal. I guess now we have our answer.
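To make “sequential guesses” and the robots.txt point concrete, here is a minimal sketch in Python. It is not the script used in the case. The URL pattern, the ID values, and the function names are all assumptions made up for illustration, and nothing here touches the network. It only shows how a list of guessed URLs might be built and how a crawler could consult a robots.txt file, had one existed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical URL pattern (an assumption); the real AT&T endpoint is not reproduced here.
BASE_URL = "https://example.com/lookup/{device_id}"


def candidate_urls(start_id, count):
    """Build URLs for `count` sequential ID guesses beginning at `start_id`."""
    return [BASE_URL.format(device_id=start_id + i) for i in range(count)]


def allowed_by_robots(robots_txt, user_agent, url):
    """Report whether a robots.txt body permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


urls = candidate_urls(1000, 3)

# Made-up rules that a site wishing to discourage crawlers might publish.
sample_robots = "User-agent: *\nDisallow: /lookup/\n"

for url in urls:
    # Prints each guessed URL and whether the sample rules would permit it.
    print(url, allowed_by_robots(sample_robots, "research-bot", url))
```

Even where a robots.txt file exists, it is only a request to well-behaved crawlers; nothing in the sketch (or on the web generally) technically prevents the page from being loaded.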
The government’s indictment makes the activity sound far more nefarious, of course. It claims that AA “impersonated” an iPad. This allegation is a bit odd: the script impersonated an iPad in the same way that you might impersonate a cell phone by loading http://m.facebook.com to load the mobile version of Facebook. Go ahead, try it and you’ll see — Facebook will think you are a cell phone. Should you go to jail?
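For readers wondering what “impersonating” a device amounts to technically, a rough sketch follows. It assumes the impersonation means sending a User-Agent header that names an iPad; the header string and the `build_request` helper are illustrative inventions, not taken from the indictment. The request is built but never sent.

```python
from urllib.request import Request

# Illustrative User-Agent string only (an assumption); real devices send
# longer, versioned strings.
IPAD_USER_AGENT = "Mozilla/5.0 (iPad; CPU OS 3_2 like Mac OS X)"


def build_request(url, user_agent):
    """Create (but do not send) an HTTP request that announces `user_agent`."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("https://example.com/", IPAD_USER_AGENT)

# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

The header is just a line of text chosen by the client. Browsers let users change it, and many testing tools do so routinely.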
So, readers might say, what’s the problem here? AA should not have done what he did; he should have known that AT&T did not want him downloading those emails. Yeah, he probably did know that. But consider this: AA did not share the information with the world, as he could have. I am reasonably certain that if his intent was to harm users, we would never know that he did this; he would have obtained the addresses over an encrypted VPN and absconded. Instead, AA shared this flaw with the world. Whether through ignorance at best or hubris at worst, AT&T set up this ridiculously insecure system that allowed random web users to tie Apple IDs to email addresses. I don’t know if AA attempted to inform AT&T of the issue, but consider how far you got the last time you contacted tech support with a problem on an ISP website. AA got AT&T’s attention, and the problem got fixed with no (known) divulgence of the records.
Before I get to academia, let me add one more point. To the extent that AA should have known AT&T didn’t desire this particular access, the issue is one of degree, not of kind. And that is the real problem with the statute. There is nothing in the statute, absolutely nothing, that would help AA know whether he violated the law by testing this URL with one, five, ten, or ten thousand IDs. Here’s one to try: click here for a deep link to a concert web page, using a URL with a numerical code. Surely Ticketmaster can’t object to such deep linking, right? Well, it did, and sued Tickets.com over such behavior. It claimed, among other things, that each and every URL was copyrighted and thus infringed if linked to by another. It lost that argument, but today it could just say that such access was unwanted. For example, maybe Ticketmaster doesn’t like me pointing out its ridiculous argument in the Tickets.com case, making my link unauthorized. Or maybe I should have known because the Ticketmaster terms of service say that an express condition of my authorization to view the site is that I will not “Link to any portion of the Site other than the URL assigned to the home page of our site.” That’s right, Ticketmaster still thinks deep linking is unauthorized, and I suppose that means I risk criminal prosecution for linking to it. Imagine if I actually saved some of the data!
This is where academics come in. Many, many academics scrape. (Don’t stop reading here; I’ll get to non-scrapers below.) First, scraping is a key way to get data from online databases that are not easily downloadable. This includes, for example, scraping of the US Patent & Trademark Office site; although data is now available for mass download, that data is cumbersome, and scraper use is still common. That the PTO data is public does not help matters. In fact, it might make it worse, since “unauthorized” access to government servers might receive enhanced penalties!
Academics (and non-academics) in other disciplines scrape websites for research as well. How are these academics to know that such scraping is disallowed? What if there is no agreement barring them from doing so? What if there is a web-wrap notice as broad as Ticketmaster’s, purporting to bar such activities but with no consent by the user? The CFAA could send any academic to jail for ignoring such warnings or, worse, for not seeing them in the first place. Such a prosecution would be preposterous, skeptics might say. I hope the skeptics are right, but I’m not optimistic. Though I can’t find the original source, I recall Orin Kerr recounting how his prosecutor colleagues said the same thing 10 years ago when he argued the CFAA might apply to those who breach contracts, and now such prosecutions are commonplace.
Finally, non-scrapers are surely safe, right? Maybe it depends on whether they use Zotero. Thousands of people use it. How does Zotero get information about publications when the web site does not provide standardized citation data? You guessed it: a scraper. Indeed, a primary reason I don’t use Zotero is that the Lexis and Westlaw scrapers don’t work. But the PubMed importer scrapes. What if PubMed decided that it considered scraping of information unauthorized? Surely people should know this, right? If it wanted people to have this data, it would provide it in a Zotero-readable format. The fact that the information on those pages is publicly available is irrelevant; the statute makes no distinction. And if one does a lot of research, for example, checking 20 documents, downloading each, and scraping each page, the difference from AA is in degree only, not in kind.
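For those who have never looked inside one, a citation scraper can be very small. The sketch below is not Zotero’s code, and the page layout it expects (a title in an `<h1>` tag, authors in `<span class="author">` tags) is a hypothetical one chosen for illustration. It simply reads HTML that a server has already sent and pulls out the pieces a citation manager would want.

```python
from html.parser import HTMLParser


class CitationScraper(HTMLParser):
    """Collect <h1> text as the title and <span class="author"> text as authors.

    The page structure is an assumption made for this example.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.authors = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._field = "title"
        elif tag == "span" and ("class", "author") in attrs:
            self._field = "author"

    def handle_endtag(self, tag):
        if tag in ("h1", "span"):
            self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._field == "title":
            self.title += text
        elif self._field == "author":
            self.authors.append(text)


def scrape_citation(html):
    """Return the title and author list found in an HTML string."""
    scraper = CitationScraper()
    scraper.feed(html)
    scraper.close()
    return {"title": scraper.title, "authors": scraper.authors}


# Made-up page content standing in for a publisher's article page.
SAMPLE_PAGE = """
<html><body>
<h1>An Example Article</h1>
<span class="author">A. Author</span>
<span class="author">B. Writer</span>
</body></html>
"""

print(scrape_citation(SAMPLE_PAGE))
```

Run that over 20 article pages in a row and you have, mechanically speaking, the same kind of activity described above: load a public page, save part of what comes back, repeat.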
The irony of this case is that the core conviction is only tangentially a problem with the statute (there are some ancillary issues that are a problem with the statute). “Unauthorized access” and even “exceeds authorized access” should never have been interpreted to apply to publicly accessible data on publicly accessible web sites. Since they have been so interpreted, I am convinced that the statute is impermissibly broad and must be struck down. At the very least it must be rewritten.