HTRC releases new dataset with features extracted from over 13 million volumes

J. Stephen Downie, Professor, Associate Dean for Research, and Co-Director of the HathiTrust Research Center
Ted Underwood, Professor

Unique in its sheer size and breadth, a new open dataset released by the HathiTrust Research Center (HTRC) will provide researchers with access to otherwise restricted information. The HTRC Extracted Features (EF) Dataset reports counts of words, lines, parts of speech, and other details extracted from each page of the more than thirteen million volumes found in the HathiTrust Digital Library.

An earlier release of the EF Dataset, drawn from a subset covering only the five million volumes in HathiTrust's public domain collection, has enabled novel research from scholars in economics, history, linguistics, literary studies, and sociology, among other fields. The new EF Dataset, released under a Creative Commons Attribution license, provides access to features drawn from the remaining eight million volumes that would otherwise be unavailable to scholars because of copyright restrictions.

"Right now, many arguments about literary history come to a stop in 1923. Most works after that date are still covered by copyright in the U.S., so it has been very difficult to organize digital collections that would allow us to survey a broad, representative sample of writing. We know there’s a big literary landscape out there, but we can only map it with a flashlight, looking at one book at a time," explained Professor Ted Underwood, one of the dataset's developers. "The extracted features released by HathiTrust Research Center are like turning on the moonlight. For the first time, literary historians will get to survey the whole landscape at once. In order to comply with copyright law, we use only limited data about the books. But there’s still an enormous amount we can learn about literary history."

Underwood has used the dataset to research why novelists spent less time discussing women from the middle of the nineteenth century to the middle of the twentieth. He found the decline was visible as a shift both in gendered pronouns and in personal names.

"If we expected to see growing gender equality in fiction, we’re actually seeing the reverse," said Underwood. "The overrepresentation of male characters in fiction gets worse all the way down to the 1960s. The trend is visible in works written by men and women alike. A lot more research will be required to understand this phenomenon, but without data from HathiTrust, literary scholars wouldn't even know it existed."

HTRC is a partnership between Indiana University, the University of Illinois at Urbana-Champaign, and the HathiTrust. Established in 2011, HTRC develops cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. 

"The HTRC EF Dataset creates opportunities for scholarship and teaching that were previously impossible," said J. Stephen Downie, co-director of the HTRC and professor and associate dean for research at the iSchool. "We look forward to learning how the scholarly community incorporates them into research, labs, and classrooms."