How digitization can harm research

A hole made by a worm in a 15th century manuscript from the Dubrovnik archives, via @EmirOFilipovic
[On Friday, 26 October 2012, I spoke on a panel about libraries at In Re Books, a conference on law and the future of books hosted by the New York Law School. My talk was about how the digitization of books harms research. I think that New York Law School will be posting video of the conference in a few days, but in the meantime, here's a version of my talk.]

Good afternoon. I’m here to play the role of Luddite. I’m going to talk about how the digitization of texts can slow and even harm research, which I believe is the traditional mission of libraries.

If digital texts were no more than supplements to printed texts, they wouldn’t threaten research at all. But in the real world, where budgets and resources are limited, the option of digital texts provokes change. I began to explore Luddism with a vengeance this past spring, when I found myself opposing the New York Public Library’s plan to consolidate its midtown Manhattan real estate by shipping most of the books in its research collection to New Jersey. I knew that offsite storage was a necessary evil, but evil, unfortunately, was the word. The New York Public Library’s books were supposed to be able to travel from New Jersey to Manhattan in two days, but they usually took three or four, and sometimes longer. The trustees, moreover, were proposing a proportion of offsite to onsite that was extreme. Of the 5 million books stored in the lion-adorned humanities library on 42nd Street, only 1.8 million were to remain. The library’s plan had been conceived in 2008, when the world was breathless with the promise of Google Books, but in 2011, Judge Denny Chin had killed a proposed settlement between Google Books and the Authors Guild. For the foreseeable future, copyright was going to keep most books out of digital circulation. A monkey, as he swings his way through the jungle, can see the next tree limb he needs to grab, but not always the tree limb after that. A scholar proceeds much the same way, finding in one book’s footnotes the name of the next book he needs to read. The New York Public Library’s consolidation plan threatened to slow scholars down to the speed of one tree limb every two or three days. That’s a slow monkey.

Scholars circulated a petition; journalists wrote articles for The Nation, The New York Times, and the journal n+1; I blogged so copiously in protest that the New Republic made fun of me; I was invited to join a library advisory committee; I was thrown off the advisory committee; library administrators debated critics at a New School panel; and passionate letters to the editor were printed in the New York Review of Books. Last month, the library’s administrators announced that an $8 million gift by trustee Abby Milstein and her husband Howard Milstein would make it possible for the library to store an additional 1.5 million books under Bryant Park. This compromise mitigates the damage of the consolidation plan considerably, and while I still have reservations, I am grateful for the Milsteins’ gift and for the willingness of the library’s trustees and administrators to respond to scholars’ concerns. (In the past six months, the library has also radically improved delivery from offsite storage, and currently books often arrive within twenty-four hours.)

Today, I’d like to talk in a more general way about the dangers that research faces in the digital age. Even a Luddite like me can acknowledge that digital proxies are wonderful as supplements. Many times they have helped me locate a quote that I remembered reading but couldn’t find again. The ability to search millions of old newspaper pages is invaluable to a historian. Nonetheless, it’s worth noting that the e-book medium, on its own, is inconsistent with the requirements of research. Its permanence, for one thing, is unproven. Printed books can burn, and the wood fibers in paper can grow brittle, problems that the old research-library system countered by storing multiple copies in multiple locations. What catastrophes are electronic texts subject to? Will they survive, say, unexpected power outages during a military conflict? What about steady change in file formats? I still have the digital files of essays that I wrote in college, twenty-five years ago, but I can’t read many of them, because no translators exist to open them in current word processing programs. Maybe electromagnetic storage will endure for centuries, and maybe file formats will be stable from now on, but we don’t yet know for sure. [During the Q&A period at the end of the panel, a fellow panelist told me that the files stored at the Internet Archive have to be replaced every four years or so; their hard drives are constantly churning, because digital preservation requires ceaseless effort.]

Then there’s copyright, which today controls, and effectively blocks, digital circulation of 26 million of the 32 million titles ever published, unless Congress and other legislative bodies intervene. Just a few major American publishers sell e-books to libraries, and one of them sells only a license that expires after twenty-six checkouts. A library full of self-erasing books is a researcher’s nightmare.

Digitization isn’t the first fever to break out among librarians, and many of the objections that Nicholson Baker made to microfilming, more than a decade ago, apply to digital proxies as well. Digitization has the effect of making a few scans of a book into master copies. What if these master copies are flawed? What if later, after the original books have become rare or difficult to access, someone wants to see the book at higher resolution? What if someone wants to see the book in color? What if the people making the scans choose copies that don’t include all of the issues, volumes, or editions of the work? Recently, while researching the War of 1812, I wanted to check Basil Hall’s Fragments of Voyages and Travels, a British sailor’s memoir so popular that it went through many editions. The New York Public Library has shipped all its editions of the book offsite, but its online catalog page for a nine-volume edition printed in Edinburgh has a link to Hathi Trust. Unfortunately, Hathi Trust only has a scan of one of the nine volumes. The trust does have scans made at other libraries, but one of these files is corrupt, another has lopped off the bottom two lines of every page, and still others are of an abridged American edition. On Hathi Trust, I could only find readable scans of three of the nine Edinburgh volumes, and only three more in Google Books. (I would have checked the Internet Archive last night, but the site was down.) The New York Public Library’s physical copy of volume 4, meanwhile, has gone missing. When I asked to see it, a note from a librarian advised me to look in Hathi Trust.

E-books, when available, are easier to fetch than printed books but harder to read. A historian of the early modern period recently explained to me her experience with a 1696 French history of the city of Lyon. Each of the book’s parts has its own pagination, she noted, some with Arabic numerals, and others with Roman. The parts aren’t in chronological order, and in digital form, it’s hard to find one’s way around. In my own experience, a digital proxy is fine if all I need is to confirm that a fact appears on a particular page, and very convenient if I’m trying to assess whether a book is worth further study. But if I’m reading a book carefully, I want to be able to flip to the map if I can’t remember whether a fort is located on Lake Ontario or Lake Erie, to the end notes in order to judge whether the author really has the documentary evidence to prove his claims, and to the index so as to be able to remind myself of the back story of a figure last mentioned a hundred pages before. That’s not to mention the information conveyed by a book’s paper quality, binding, and size, all lost to a digital proxy. Maybe the next generation of reading devices will overcome these drawbacks, but we’re not there yet.

Let me mention here a potential unintended consequence of digitization. The New York Public Library’s administrators hope to minimize the inconvenience of offsite storage by making it a priority to ship offsite books that have been digitized. The American Council of Learned Societies, meanwhile, has assembled high-quality digitizations of 3,500 books recognized by scholars to be among the most important in their fields, and the New York Public Library subscribes to the council’s database. Here’s the paradox: the 3,500 books in the council’s database are by scholarly consensus titles that a specialist is likely to want to sit down and read all the way through—and by virtue of that, they’re in danger of being prioritized to go offsite. At smaller research libraries, without the luxury of offsite storage, they’re in danger of being discarded altogether.

Nicholson Baker came in for ridicule when he lamented digitization’s first encroachment into the library: the displacement of the card catalog. Cataloging, after all, is the sort of function that a computer ought to be able to do better than notecards and a typewriter. So you would think, anyway. When first introduced in the 1970s, the New York Public Library’s online catalog didn’t index all of its books. Forty years later, it still doesn’t. I write a blog called “Steamboats are ruining everything.” The title comes from a line in The Trippings of Tom Pepper, an 1847 novel by a friend of Herman Melville’s. There’s no listing for the book in the New York Public Library’s online catalog. But the library does have it, the first volume, anyway; I checked it out on Tuesday to make sure it was still there. The handwritten entry for it is printed in the bound volumes of the library’s now-demolished card catalog. As is an entry for John Spencer Bassett’s seven-volume 1926 edition of the Correspondence of Andrew Jackson, which also fails to appear in the library’s online catalog, unless you are canny enough to search for it under its series title, “Carnegie Institution of Washington Publication number 371.” Which, even if you were a reference librarian, you might not know.

The library has had decades to improve its online catalog. Instead, a few years ago, it introduced a social-media update of the catalog, which the library’s own staff privately advise scholars to circumvent. A good online catalog is possible. Harvard’s is excellent, and I often use it in order to find things in New York Public by a sort of metadata triangulation. But trouble in online cataloging is widespread. If you’ve ever tried to search Google Books, Open Library, or the Internet Archive for a particular volume of a multivolume series (for a particular year of, say, Napoleon’s correspondence), you’ll have learned that these resources are incapable of distinguishing one volume from another, unless the original scanner was thoughtful enough to add the volume number to the end of the title as a kludge.

Research takes place even in small public libraries, which are obliged to throw books out in order to make room for new ones, in a process known as weeding. Thirty years ago, when I worked as a teenager in my town’s public library, the librarians weeded according to due dates. If there weren’t many stamped in the back of a book, or if all the due dates were old, out the book went. Not very surreptitiously, one of the librarians sabotaged the system, palming the date stamp and wandering the stacks in order to stamp her favorites to make it look as if they had been recently checked out. There were books she couldn’t bear discarding, she explained, even if they weren’t popular. While engaged in the NYPL controversy, I received emails from librarians worried that no such sabotage is possible today. The computers know when a book isn’t earning its shelf space, and there is no way for a merely human librarian to assert that a book ought to stay even if it’s challenging. In libraries that perfectly optimize usage, these librarians worry, best-sellers push out classics, and an earnest and curious reader may not be able to find anything but entertainment.

My larger point is that the new capacities and energies of the digital age are sometimes deployed for their own sake, and in libraries, they need to be deployed in the service of research. Right now, the digerati have the ear of power and money. If you say bulldoze, they bulldoze. Please be careful.