Project will help researchers explore big data in HathiTrust digitized library

Illinois English professor Ted Underwood wants to know how the language describing male and female characters in works of fiction has changed since the late eighteenth century. He’s using data mining tools to gather information from thousands of books to answer that question.

The problem, though, is that books published after 1922 are still under copyright protection and their content can’t be shared freely online.

“There are hundreds of thousands of books out there, and we don’t talk about them,” Underwood said. “That is a dark landscape after the wall of copyright comes down. We can read the books one by one, but we can’t make generalizing claims at all.”

A project of the HathiTrust Research Center (HTRC)—a collaboration between the University of Illinois and Indiana University—aims to get around that problem and allow scholars to analyze large numbers of books while still respecting copyright laws. The project is being funded by a two-year, $1.17 million grant from the Mellon Foundation.

J. Stephen Downie, the HTRC codirector and a professor in GSLIS, is the Illinois project lead. “Researchers at GSLIS and HTRC are interested in the intersection between the humanities and big data, and in finding ways to advance our use of computational tools to make large strides in humanities research. In this way, we can help scholars like Ted add to our cultural understanding,” Downie said.

HathiTrust is a consortium of more than 100 university and public libraries that has amassed a collection of digitized texts containing nearly 14 million volumes and 5 billion pages. About two-thirds of the works in the HathiTrust collection are still under copyright protection.

Until recently, scholars were limited in their research by how many volumes they could read. Now they can use big data to answer research questions through computational analysis that gathers information from huge numbers of books, a concept called “distant reading.”

For example, a scholar may want to find the pages in a book that contain a poem or an image of interest, or find connections between two authors or between an author and a place.

New tools under development at the HathiTrust Research Center will create metadata to better describe the works and search individual pieces to find required information. They will also allow scholars to visualize the data they get in response to a query—for example, how much information comes from a certain time period or geographic region.

But even when scholars know what they want to analyze, the copyrighted material can’t be freely distributed to them online. The Mellon grant will make possible further work on an initiative, the “HTRC Data Capsule,” that will allow researchers to analyze data without violating copyrights. The results of an analysis are released to the researcher without that person ever having access to the text itself, thus respecting the copyright protections. The model is called “nonconsumptive research.”

“The only thing that is really accessing the data is the algorithm,” Downie said. “It’s very important we remain good copyright citizens.”

The HathiTrust Research Center has already developed a prototype that can search smaller sets of data, on the order of thousands of volumes. Now researchers want to scale that up so the system will be able to run a search of hundreds of thousands or millions of volumes.

“The HathiTrust collection is big data in size. To step through all the nearly 14 million digitized books in 24 hours would require 14,000 computers running simultaneously,” said Beth Plale, a co-principal investigator for the project, codirector of HTRC and a professor of computing and informatics at Indiana University. “The funding received by the Mellon Foundation will allow us to extend the prototype to larger machines.”

It will also help build the secure computer environment necessary for dealing with copyrighted content, she said.
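The scale Plale describes can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope illustration, not HTRC's method: it uses the volume, page and machine counts quoted in this article and assumes, hypothetically, that the work divides evenly across machines with no overhead for data transfer or scheduling. The function name `per_machine_load` is made up for this example.

```python
# Figures quoted in the article; everything else is an illustrative assumption.
VOLUMES = 14_000_000       # "nearly 14 million" digitized volumes
PAGES = 5_000_000_000      # "5 billion pages"
MACHINES = 14_000          # computers quoted for a single 24-hour pass
SECONDS_PER_DAY = 24 * 60 * 60


def per_machine_load(volumes, pages, machines, seconds=SECONDS_PER_DAY):
    """Split the collection evenly across machines (a simplifying assumption)."""
    volumes_each = volumes / machines
    pages_each = pages / machines
    return {
        "volumes_per_machine": volumes_each,
        "pages_per_machine": pages_each,
        # Time budget per volume if one machine works through its share in a day.
        "seconds_per_volume": seconds / volumes_each,
    }


load = per_machine_load(VOLUMES, PAGES, MACHINES)
for name, value in load.items():
    print(f"{name}: {value:,.1f}")
```

Under these assumptions, each machine would handle about 1,000 volumes in a day, which leaves a little under a minute and a half per volume.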

Being able to analyze works in the vast HathiTrust collection means Underwood and other humanities researchers can ask broader questions and have a much larger, more diverse dataset to work with. The research results Underwood gets “could end up being a significantly different picture,” he said. “I’ll be much more confident getting results that reflect a diversity of literary traditions.”

University Librarian John Wilkin said, “By bringing together computation, tools and this remarkable body of text, we can facilitate new and more innovative approaches to solving big problems in a wide array of disciplines. That’s a remarkable difference maker.”

The HathiTrust Research Center was established with the help of seed money from both the University of Illinois and Indiana University (IU), as well as funding from the Eli Lilly Endowment. HathiTrust is providing $1 million over four years with equal collaborative funding from Illinois and IU. In addition to support from Mellon, HTRC activities have received funding from the Sloan Foundation, the Institute of Museum and Library Services, the Social Sciences and Humanities Research Council of Canada, and the National Endowment for the Humanities.