The HathiTrust Research Center is pleased to announce the release of its Extracted Features Dataset, a dataset derived from 4.8 million public domain volumes, totaling over 1.8 billion pages, currently available in the HathiTrust Digital Library collection. The dataset includes over 734 billion words in dozens of languages and spans multiple centuries.
The release of this dataset enables a single researcher to do analytical work at a scale that, until now, had been virtually impossible. Humanities scholars can analyze all or part of the data to make new discoveries and deepen understanding of history, culture, and language.
“Large views of our published literature are extremely valuable for observing historic, cultural, and linguistic trends. This dataset addresses and solves common problems that researchers face including access to that literature, the technical obstacles to processing it, and the copyright issues involved when working with consumable—that is, individually readable—books,” said Peter Organisciak, doctoral candidate at the Graduate School of Library and Information Science (GSLIS) at Illinois and researcher on the project.
Researchers from GSLIS, the Illinois Informatics Initiative, and the Department of English at Illinois contributed to the creation of the dataset. After writing the code that identifies the many facets of the text, the team processed the enormous amount of data using Blue Waters, one of the most powerful supercomputers in the world, located at the National Center for Supercomputing Applications (NCSA) on the Illinois campus.
In addition to its applications in digital humanities, the dataset is a useful resource for computer modeling and machine learning. Computer scientists can use it, for example, to build algorithms that determine whether a piece of text is written in English or French.
“This release gives scholars a powerful computational tool that can change the way text analysis is conducted. Now we can bring empirical data, potentially at a large scale, but definitely at a detailed, deep scale, to humanities scholarship. This dataset is a major accomplishment for our researchers and we will continue to develop new features and new modes of analysis for this unique world-class corpus,” said J. Stephen Downie, co-director of the HTRC and associate dean for research at GSLIS.
The project team includes Boris Capitanu, informatics developer; Ted Underwood, professor of English; Organisciak; Sayan Bhattacharyya, post-doctoral research associate at GSLIS; Loretta Auvil, informatics developer; Colleen Fallaw, research programmer; and Downie.
“The work being done by the HTRC is a wonderful convergence of information technology and the written word. The release of this dataset puts the HTRC and Illinois at the forefront of digital humanities and illustrates why university investment in large-scale humanities computing is important,” said Peter Schiffer, vice-chancellor for research at the Urbana campus of the University of Illinois.
The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. It helps researchers meet the technical challenges of dealing with massive amounts of digital text by developing cutting-edge software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge.
The Extracted Features Dataset is available as a free download at https://sharc.hathitrust.org/features.