Academics Go To Jail – CFAA Edition

Though the Aaron Swartz tragedy has brought some much needed attention to the CFAA, I want to focus on a more recent CFAA event — one that has received much less attention but might actually touch many more people than the case against Swartz.

Andrew “Weev” Auernheimer (whom I will call AA for short) was recently convicted under the CFAA and sentenced to 41 months and $73K restitution. Orin Kerr is representing him before the Third Circuit. I am seriously considering filing an amicus brief on behalf of all academics. In short, this case scares me in a much more personal way than the cases discussed in my prior CFAA posts.

Here’s the basic story, as described by Orin Kerr:

When iPads were first released, iPad owners could sign up for Internet access using AT&T. When they signed up, they gave AT&T their e-mail addresses. AT&T decided to configure their webservers to “pre load” those e-mail addresses when it recognized the registered iPads that visited its website. When an iPad owner would visit the AT&T website, the browser would automatically visit a specific URL associated with its own ID number; when that URL was visited, the webserver would open a pop-up window that was preloaded with the e-mail address associated with that iPad. The basic idea was to make it easier for users to log in to AT&T’s website: The user’s e-mail address would automatically appear in the pop-up window, so users only needed to enter in their passwords to access their account. But this practice effectively published the e-mail addresses on the web. You just needed to visit the right publicly-available URL to see a particular user’s e-mail address. Spitler [AA’s alleged co-conspirator] realized this, and he wrote a script to visit AT&T’s website with the different URLs and thereby collect lots of different e-mail addresses of iPad owners. And they ended up collecting a lot of e-mail addresses — around 114,000 different addresses — that they then disclosed to a reporter. Importantly, however, only e-mail addresses were obtained. No names or passwords were obtained, and no accounts were actually accessed.

Let me paraphrase this: AA went to a publicly accessible website, using publicly accessible URLs, and saved the results that AT&T sent back in response to each URL. In other words, AA did what you do every time you load up a web page. The only difference is that AA did it for multiple URLs, using sequential guesses at what those URLs would be. There was no robots.txt file that I’m aware of (this file tells search engine spiders which URLs they should not crawl). There was no user notice or agreement that barred use of the web page in this manner. Note that I’m not saying such things should make the conduct illegal, but only that such things didn’t even exist here. It was just two people loading data from a website. Note that a commenter on my prior post asked this exact question (whether “link guessing” was illegal), and I was noncommittal. I guess now we have our answer.
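To make the mechanics concrete, here is a minimal sketch of sequential URL guessing combined with a robots.txt check. Everything in it is an assumption for illustration: the base URLs, the `id` query parameter, the `example-bot` agent name, and the robots.txt content are hypothetical, and nothing here reflects AT&T's actual site or the script that was actually used. The code only builds and filters a list of URLs; it sends no requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (the post notes that no such file
# is known to have existed in the actual case).
ROBOTS_TXT = """\
User-agent: *
Disallow: /accounts/
"""


def allowed_urls(base, ids, robots_txt=ROBOTS_TXT, agent="example-bot"):
    """Generate one URL per sequential ID and keep those robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # "Link guessing": the same URL pattern with a different number each time.
    urls = [f"{base}?id={i}" for i in ids]
    return [u for u in urls if parser.can_fetch(agent, u)]


if __name__ == "__main__":
    print(allowed_urls("https://example.com/public/page", range(100, 103)))
    print(allowed_urls("https://example.com/accounts/login", range(3)))
```

A robots.txt file is only a request to well-behaved crawlers. Nothing in the protocol stops a client from fetching a disallowed URL, which is part of why its presence or absence is such an uncertain signal of "authorization."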

The government’s indictment makes the activity sound far more nefarious, of course. It claims that AA “impersonated” an iPad. This allegation is a bit odd: the script impersonated an iPad in the same way that you might impersonate a cell phone by loading http://m.facebook.com to load the mobile version of Facebook. Go ahead, try it and you’ll see — Facebook will think you are a cell phone. Should you go to jail?
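For readers curious what “impersonating” a device amounts to technically, the sketch below builds a web request that announces itself with a tablet-style User-Agent header. The User-Agent string is made up for illustration, and I am assuming (not asserting) that this is roughly the kind of header involved; the indictment's details are not reproduced here. The request is constructed but never sent.

```python
from urllib.request import Request

# Hypothetical User-Agent string, invented for this example.
MOBILE_UA = "Mozilla/5.0 (iPad; CPU OS 3_2 like Mac OS X) ExampleBrowser/1.0"


def build_request(url, user_agent=MOBILE_UA):
    """Build (but do not send) a request that identifies itself as user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request("http://m.facebook.com")
# urllib normalizes header names, so the key is stored as "User-agent".
print(req.get_header("User-agent"))
```

The header is just a line of text that the client chooses. Browsers, developer tools, and accessibility software change it routinely.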

So, readers might say, what’s the problem here? AA should not have done what he did — he should have known that AT&T did not want him downloading those emails. Yeah, he probably did know that. But consider this: AA did not share the information with the world, as he could have. I am reasonably certain that if his intent was to harm users, we would never know that he did this — he would have obtained the addresses over an encrypted VPN and absconded. Instead, AA shared this flaw with the world. AT&T set up this ridiculously insecure system that allowed random web users to tie Apple IDs to email addresses through ignorance at best or hubris at worst. I don’t know if AA attempted to inform AT&T of the issue, but consider how far you got last time you contacted tech support with a problem on an ISP website. AA got AT&T’s attention, and the problem got fixed with no (known) divulgence of the records.

Before I get to academia, let me add one more point. To the extent that AA should have known AT&T didn’t desire this particular access, the issue is one of degree, not of kind. And that is the real problem with the statute. There is nothing in the statute, absolutely nothing, that would help AA know whether he violated the law by testing this URL with one, five, ten, or ten thousand IDs. Here’s one to try: click here for a link to a concert web page deep link using a URL with a numerical code. Surely Ticketmaster can’t object to such deep linking, right? Well, it did, and sued Tickets.com over such behavior. It claimed, among other things, that each and every URL was copyrighted and thus infringed if linked to by another. It lost that argument, but today it could just say that such access was unwanted. For example, maybe Ticketmaster doesn’t like me pointing out its ridiculous argument in the Tickets.com case, making my link unauthorized. Or maybe I should have known because the Ticketmaster terms of service say that an express condition of my authorization to view the site is that I will not “Link to any portion of the Site other than the URL assigned to the home page of our site.” That’s right, Ticketmaster still thinks deep linking is unauthorized, and I suppose that means I risk criminal prosecution for linking to it. Imagine if I actually saved some of the data!

This is where academics come in. Many, many academics scrape. (Don’t stop reading here; I’ll get to non-scrapers below.) First, scraping is a key way to get data from online databases that are not easily downloadable. This includes, for example, scraping of the US Patent & Trademark Office site; although data is now available for mass download, that data is cumbersome, and scraper use is still common. That the PTO data is public does not help matters. In fact, it might make it worse, since “unauthorized” access to government servers might receive enhanced penalties!

Academics (and non-academics) in other disciplines scrape websites for research as well. How are these academics to know that such scraping is disallowed? What if there is no agreement barring them from doing so? What if there is a web-wrap notice as broad as Ticketmaster’s, purporting to bar such activities but with no consent by the user? The CFAA could send any academic to jail for ignoring such warnings or, worse, for not seeing them in the first place. Such a prosecution would be preposterous, skeptics might say. I hope the skeptics are right, but I’m not optimistic. Though I can’t find the original source, I recall Orin Kerr recounting how his prosecutor colleagues said the same thing 10 years ago when he argued the CFAA might apply to those who breach contracts, and now such prosecutions are commonplace.

Finally, non-scrapers are surely safe, right? Maybe it depends on whether they use Zotero. Thousands of people use it. How does Zotero get information about publications when the web site does not provide standardized citation data? You guessed it: a scraper. Indeed, a primary reason I don’t use Zotero is that the Lexis and Westlaw scrapers don’t work. But the PubMed importer scrapes. What if PubMed decided that it considered scraping of information unauthorized? Surely people should know this, right? If it wanted people to have this data, it would provide it in a Zotero-readable format. The fact that the information on those pages is publicly available is irrelevant; the statute makes no distinction. And if one does a lot of research, for example, checking 20 documents, downloading each, and scraping each page, the difference from AA is in degree only, not in kind.
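To show how ordinary this kind of scraping is, here is a minimal sketch of pulling citation metadata out of a page's HTML. It is not Zotero's code and makes no claim about how any particular Zotero translator works; the class name, function name, and sample page are hypothetical, and I am assuming the page exposes `citation_*` meta tags, a convention some publishers use.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Collect <meta name="citation_*" content="..."> tags from a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.startswith("citation_") and content:
            # A field such as citation_author may appear more than once.
            self.fields.setdefault(name, []).append(content)


def scrape_citation(html):
    """Return a dict mapping each citation field to its list of values."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.fields


# Hypothetical page, standing in for an article's abstract view.
SAMPLE_PAGE = """<html><head>
<meta name="citation_title" content="A Hypothetical Article">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_author" content="Roe, Richard">
</head><body>Abstract text</body></html>"""

if __name__ == "__main__":
    print(scrape_citation(SAMPLE_PAGE))
```

Run over 20 pages in a loop, this is the same activity as running it over one page. Only the count changes.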

The irony of this case is that the core conviction is only tangentially a problem with the statute (there are some ancillary issues that are a problem with the statute). “Unauthorized access” and even “exceeds authorized access” should never have been interpreted to apply to publicly accessible data on publicly accessible web sites. Since they have been, I am convinced that the statute is impermissibly broad and must be struck down. At the very least it must be rewritten.

11 thoughts on “Academics Go To Jail – CFAA Edition”

  1. I appreciate your general point, and indeed Zotero does scrape information.

    Two issues though
    1. IANAL, but I would argue that there is a difference between manually triggered scraping (which is what Zotero does) and automated scraping via a bot, which is what Aaron did and what is prohibited in most commercial sites’ terms of service. That certainly seems to be the site owners’ interpretation, as several major players such as Ovid and EBSCO have contributed code to Zotero that helps it scrape their sites.

    2. Your choice of example is unfortunate, because Zotero does in fact _not_ scrape data from PubMed (and nowhere in the thread you link to do I say that), but uses eutils, PubMed’s public API, to retrieve metadata for PubMed IDs.

  2. Well, I sure hope there’s a difference, but I don’t see it in the statute. Do you? And who said there was a terms of service on the AT&T page that AA went to? It was an email form, as far as I know.

    As for PubMed, if I’m wrong, I’m sorry about that. You don’t say it; I was going from another user: “Don’t go to the actual PDF for adding items to Zotero. Add them from the abstract or HTML full text view. Zotero scrapes metadata from the HTML page, which is not available in most cases when you are viewing PDF.” If that’s not how it works, that’s great. I would expect that you don’t want to scrape if you can help it, for technical reasons if nothing else. It may also be that the metadata is allowed under the terms of service.

  3. I should add that I am focusing on the scraping aspects, not on the article downloading aspects at issue in the Swartz case. I agree that the average Zotero user is far removed from the issues of the Swartz case, in a more obvious way, because the authorization to obtain articles, etc., is more clearly granted than the authorization for scraping data. Which is ironic, of course, since the articles are supposed to be the protected material, and the data is more “factual.”

  4. [This post responds to Orin Kerr’s 4/10 10:03pm comment on the thread following Michael’s cross-post on Prawfsblawg; I’m responding here because Prawfsblawg’s spam filter has gone all HAL on commenters.]

    Orin, that looks like a nice bright-line technological distinction, but I think it breaks down into a mushy social one on closer examination. First, the web is not really a publishing platform in the sense that everything put on it is public. There are networked computers, but some parts of the network are private and some are not. Pages on both private and public portions of the network are written in HTML and requested and transmitted using HTTP via a web browser. Merely knowing that something is on the network somewhere and retrievable by a web browser doesn’t really tell us whether access by general members of the public is authorized or not, even as a default.

    Your proposed distinction (pages retrievable by typing stuff in the address bar are public, pages that require typing something into a field on a page are not) strikes me as both too narrow and too broad. Too narrow because it’s possible to create a login page that transmits the login information entered in fields on the page (username and password) in the URL, via a “GET” request. That’s a password-type control that demands login credentials, just the same as any other login page, and I think most people would say that account pages retrieved by typing in the right username and password are not public. Sure, it’s *dreadfully insecure*, but whether access is authorized or not shouldn’t depend on the strength of the security measure, as I think you yourself have stated; what matters is the signal the security measure sends. And I don’t think that the particular portion of the page request where the password is transmitted to the site should matter either. For another example, how about a buffer overflow or SQL-injection attack that either retrieves restricted data or results in administrator access to the server? My understanding is that both can be accomplished through the URL portion of a page request. But certainly both are unauthorized access, even though any member of the public could type malformed URLs into their browser and achieve the same result.

    It’s also too broad. There are sites that require login and passwords, but where defeating that requirement seems questionable as unauthorized access. I’m thinking of sites that say, e.g., “No government agents allowed. If you are not a government agent, type “NO” to be allowed entry.” Typing NO lets you into the site. But it’s not really a password control that visitors understand keeps the pages restricted to only specific people previously designated by the site owner. The same goes for sites like newspaper sites that provide free access based on providing only an email address. Suppose someone finds a way around the login page for such a site (other than by typing something into the URL field of their browser). Is that unauthorized access? The site is essentially open to the public after a trivial hurdle. I can see a jury saying that bypassing that hurdle, like my “Type NO to proceed” or even just clicking a button, does not have the social significance necessary to make entry trespass, just as it might make that determination in a real-property type situation.

    I don’t see Pulte Homes as shedding much light on this. The union was sending emails — which I’ve argued is not even “access,” let alone “unauthorized access.” It was really a causing-damage-by-transmission claim, not an access claim. But even assuming it was access, the conventions for one-way communications are bound to be different. It’s hard to even imagine an email address that one should not send email to. It’s easy to imagine pages that one should not access — someone else’s bank account information, for example. The fact that no email addresses are unauthorized email addresses doesn’t help us determine which web pages are unauthorized web pages.

  5. Bruce, as a technical matter, you are absolutely right with respect to submissions. The difference here, though (which is what piqued my interest for academics), is the sequential nature of it. Even if the page submission format is the same (HTTP GET fields), a site protected by passwords that are moderately hidden from view, selected by the user or the site, and entered by the user or the site prior to entry is different, I think, from a site that uses sequential numbers to reference different database entries, especially where such numbers are visible to the user of the webpage and not selected by either the user or the site as a security measure. It is this latter fact that makes a good scraper work, and that’s what worries me. Plus, there is the overbreadth problem you mention: how are we supposed to tell the difference between the “OK” scraping and the “not OK” scraping?

  6. I don’t think there should be any world in which innocent scraping — scraping without having any intention or purpose of retrieving nonpublic pages — subjects the scraper to liability. I think that can be achieved by focusing on what the reasonable web user would believe. (Sure, that’s a test that’s fuzzy at the boundaries, but it’s clear enough for many purposes.) In the absence of any other evidence, web pages that are served up in response to a simple request with no clear indicia of private-ness — no fences, no walls of houses, no crops, no closed and locked doors on stores, no unmarked doors in shopping malls or airports — are authorized for retrieval, even if the website owner secretly believes otherwise. So randomly pulling up webpages one has no reason to believe are restricted would be fine. But the scraping equivalent of war-dialing is not — intentionally trying every random URL you can in order to get access to pages you know or reasonably should know are not being provided to the general public.

  7. I think the notion of “nonpublic” pages fails in a world of database driven URL queries. Consider Google Easter Eggs. You put a special query in, and get unexpected results. Is this nonpublic? It is, in the sense that it is not the expected behavior of the search engine. But it is obviously public, because people hear about them and test them. What about people who randomly put queries into Google to discover Easter Eggs? Are they accessing nonpublic pages? Only if they discover a new easter egg? Only if they discover a bug and get data they weren’t supposed to get?

    Indeed, I’m not even sure that the framework works for non-query pages. Consider the buried .htm file that’s not linked to by any other page on the site. If I go searching for such pages, or use wget to obtain all of the .htm files in a directory, am I violating the rule? What if one of the .htm files was left visible by accident and never meant to be seen? Reasonable people would assume that you can’t get it, but we’re not going to outlaw wget, are we?

    What the defendant did here was essentially a directory walk, but with ids rather than files. The data was publicly available, but just not “expected” to be seen by anyone but the user who had the matching id on their device. But that would mean that anyone faking browser headers to view a website that is otherwise turned off for their browser (often for technical rather than privacy reasons) is violating the law, too. I’m still not buying this as a distinction that can work in practice.

  8. “Nonpublic” was not supposed to be doing any work. I meant to use it just as shorthand for the social conventions of denoting restricted access that I’ve been describing. There’s no additional feature of “nonpublicness” that someone would have to figure out; the question is whether there were obvious indications prior to reaching the page that the reasonable person would understand to mark the page as restricted to a group that does not include the person. I really don’t think it’s that difficult in most cases to distinguish random content pages targeted at no one in particular from pages that are obviously meant to be seen only by particular customers, both in terms of their content and how they are typically reached. And I think it’s going to be a fairly rare case in which someone intentionally digs up pages that most people would believe to be appropriately and ex ante obviously limited in distribution but that to legal scholars seem like they should be freely available.

  9. There is a disingenuousness about all of this, even in the work of Orin Kerr, whom I respect, that is really troubling.

    To my knowledge, nobody has yet produced a case where CFAA was used for criminal prosecution solely for the innocent act of violating TOS rules for purposes of scholarship, general information, or regular website usage.

    Your account of the Weev case simply leaves out the most important facts that were established by the prosecutor and accepted by the jury and that are clearly stated in the indictment. Weev intended to use that data for his own profit, possibly to extort AT&T. This was shown both by other witnesses and examination of other communication by Weev at the time. That is the heart of the case, not in any way peripheral, and suggesting that the formal violations were the reason he was prosecuted is contrary to the actual trial history.

    Had Weev taken 10 addresses & sent them to AT&T security, there would have been no case, and that’s the example you are really describing. He took something like 120,000, and the prosecution had plenty of reason to think his purpose was mercenary. Weev has pretty much bragged openly about hacking for profit in the press.

    Find me a case where someone has been prosecuted for simply taking public information from a public website *without* allegations of a deeper bad act, and we’ll have something to talk about. And no, you can’t use the Swartz case, as the prosecutors there thought Swartz intended to put the nonprofit JSTOR out of business by distributing their entire database of publications for free. My point is that that’s what they alleged, and it’s far beyond “he took some information that he shouldn’t have.” I still don’t know of a single case where the prosecution tried to use the CFAA, let alone where a judge accepted its use, for a violation in which no other harm or misdeed was alleged.

  10. Oh, there’s always some other bad act in the case, until there isn’t. There was no case alleging a CFAA violation for violating a terms of service. Until there was. There was no case alleging a CFAA violation for surfing Facebook at work. Until there was.

    There was never a case alleging a CFAA violation for scraping a publicly available database that anyone could use for any purpose. Until there was. How about Register v. Verio? There the district court enjoined Verio for scraping a publicly available database, indeed a database that Register was REQUIRED to keep publicly available. The court of appeals reversed, but NOT because it was a ridiculous argument that this scraping could ever be a CFAA violation, but only because the $5000 loss could not be shown. If Register had done as AT&T did, and sent paper postal mail to every registrant in the database, the loss might well have exceeded $5000.

    Should Weev have gathered 100 addresses rather than 100,000 to prove his point? Maybe, but we’re just talking about scale at that point. Both would be illegal under the act.

    And that’s because there are no bad acts required in the statute. I would be happy to add that in to get some clarity. I would also like to see the evidence that THIS information was offered for sale: not the indictment’s assertion that Weev is a bad guy (a fact about which I have no opinion), but the evidence.

    On a side note, I don’t use Swartz as an example here. I tend to agree with Orin that that case (though perhaps not the potential penalty) was much more supported than people give it credit for.

  11. I’ll add that if there were actual extortion evidence, then there could have been an indictment and conviction under our extortion statutes. AND, that statute could have been used to bump the misdemeanor to a felony. But that’s not what happened here. The misdemeanor was bumped to a felony by piggybacking state CFAA-type laws, a sort of double-dipping. This is one reason why I am skeptical of the evidence.
