I have been in the middle of a major rethink of search engines’ efforts to digitize books. As it started I was ebulliently enthusiastic–I even wrote an article celebrating their potential to tame information overload. But major research librarians have been raising some important questions about search engines’ practices:
Several major research libraries have rebuffed offers from Google and Microsoft to scan their books into computer databases, saying they are put off by restrictions these companies want to place on the new digital collections. The research libraries, including a large consortium in the Boston area, are instead signing on with the Open Content Alliance [OCA], a nonprofit effort aimed at making their materials broadly available.
As the article notes, “many in the academic and nonprofit world are intent on pursuing a vision of the Web as a global repository of knowledge that is free of business interests or restrictions.”
As noble as I think this project is, I doubt it can ultimately compete with the monetary brawn of a Google. And why should delicate old books get scanned 3 or 4 times by duplicative efforts of Google, Microsoft, the OCA, and who knows what other private competitor? I also worry that a fragmented archiving system might create a library of Babel. So what is to be done?
My new position is: leverage current copyright challenges to Google’s book search program to guarantee that it serves the public interest. Here’s how that might work:
Google's plans to scan and index hundreds of thousands of copyrighted books have provoked extraordinary public controversy and private litigation. This project aims to archive and provide text-based indexing for an enormous number of books. Google's scanning of copyrighted books is prima facie infringement, but Google is presently asserting a fair use defense. The debate has largely centered on the rival property rights of Google and the owners of the copyrights of the books it would scan and edit.
Given Google's alliance with some of the leading libraries in the world, journalistic narratives have largely portrayed the Google Book Search project as an untrammeled advance in public access to knowledge. However, other libraries are beginning to question the restrictive terms of the contracts that Google strikes when it agrees to scan and create a digital database of a library's books. While each library is guaranteed access to the books it agrees to have scanned, it is not guaranteed access to the entire index of scanned works.
Those restrictive terms foreshadow potential future restrictions on and tiering of their book search services. Well-funded libraries may pay a premium to gain access to all sources; lesser institutions may be left to scrounge among digital scraps. If permitted to become prevalent, such tiered access to information would threaten to rigidify and reinforce existing inequalities in access to knowledge and life chances. Such tiering divides society into two groups: those who can afford to access the information, and those who cannot. To the extent that the latter group's relative poverty is not its own fault, information tiering inequitably subjects it to yet another disadvantage, whereby others' wealth can be leveraged into status, educational, or occupational advantage.
Given the diciness of the fair use case for projects like Google Book Search, courts should condition the legality of such archiving of copyrighted content on universal access to the contents of the resulting database. Landmark cases like Sony v. Universal have set a precedent for taking such broad public interests into account in the course of copyright litigation. Given the importance of "commerciality" in the first of the four fair use factors, suspicion of tiered access could also be figured into that prong of the test. A more ambitious (if less likely) solution would require Congress to set such terms in a legislative settlement of the issue.
However the matter is ultimately settled, any outcome in favor of dominant categorizers should be conditioned on their maintaining open access to search results. Such a condition would help assure that the type of "tiered access" common for legal resources would not further pervade the networked world. If Google's proposed extension of the fair use defense succeeds, such a holding should be limited to current versions of the services that conduce to a common informational infrastructure. To the extent it or other search engines limit access to parts of their index, their public-spirited defenses of their archiving and indexing projects are suspect.
PS: For more thoughts on the future of digital archiving, see Diane Leenheer Zimmerman’s Can Our Culture Be Saved?
PPS: This is crossposted from Co-Op, and is part of a series, which starts here.
Your suggestion deals with only one legal issue. Making the scanned results available to everyone rather than just Google or Microsoft could arguably make the use non-commercial, since someone's likely to offer the books for free. But that's the weakest exception to copyright law. The courts won't look with favor on someone who scans Harry Potter books and republishes them, even if he gives them away for free.
Keep in mind an important distinction. Traditional copyright infringement only robs the author once, taking the royalties he might have gotten. Making his book available online for free robs him twice. 1. No one is getting any money, so there's no money to pursue in court as damages. 2. No one is likely to want to bring back into print a book of little importance that is available for free online. The author not only gets nothing, he will never be able to get a penny for his labor. That's bad, very, very bad. Google really is robbing authors to enrich itself.
Google and others are missing another point of copyright law. The law says that to publish someone's book you need their permission. It doesn't say that you can make some vague announcement to the world that you're going to publish anyone's work you want, requiring them on their own time to find you and tell you no. Copyright is the right to determine who copies, not the right to search out copiers and tell them to stop. In the latter case, there are court decisions beyond number holding that infringement has already occurred.
Claims that searchable full-text databases are fair use can’t survive even the most cursory comparison with a technology that’s newer than books–movies, but nevertheless covered by copyright. A movie repackages the content in a different format, but otherwise builds on the original. Even if most of the content is removed due to time constraints and even though a lot of additional creativity is required to create a movie, it remains a derivative that must be licensed. That’s well-established law.
Searchable databases are even more closely linked to the original than a movie, and one can't argue for an exception based on nothing more than their mere newness. The content is identical, and the mechanism for scanning and OCRing requires no creativity. In fact, it's almost always done by machines. If OCRing broke a copyright by rendering the work different in a merely technical way, then so would using a copy machine, and the law says that isn't so.
And yes, there is an enormous problem with locating a copyright holder of a long out-of-print book and getting their permission. But that’s where Google and Microsoft should be using their considerable political muscle to get the problems of orphan works written into the law. They can’t just assume that because a problem exists, they have the right to impose a solution of their own choosing.
–Michael W. Perry, author of Untangling Tolkien
Google is interpreting fair use correctly. Search engines scan websites the same way.
Google Book Search is an enormous asset for consumers and may one day hold much of the collective knowledge of humanity: http://fishtrain.com/2007/11/30/google-book-search-wealth-of-giants/
We *already* have a tiered system for access to information. Less well-funded institutions, and individuals, had access to far fewer information resources before the advent of mass digitization, through Google and other sources. And tiering is not necessarily bad; in many cases, tiered pricing (tied to FTE or budget) enables smaller libraries to acquire what they otherwise couldn't, especially in the area of specialized databases and journals.