HathiTrust Research Center receives NEH support for open research tools

J. Stephen Downie, Professor, Associate Dean for Research, and Co-Director of the HathiTrust Research Center

The HathiTrust Research Center (HTRC), cohosted by the iSchool at Illinois and the Luddy School of Informatics at Indiana University, has received a $325,000 Digital Humanities Advancement Grant from the National Endowment for the Humanities. One of 15 awarded nationwide, this grant will support the development of a new set of visualizations, analytical tools, and infrastructure to enable users to interact more directly with the rich data extracted from the HathiTrust Digital Library’s collection of more than 17.5 million digitized volumes.

The project, "Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction" (TORCHLITE), will be led jointly by Professor J. Stephen Downie, associate dean for research at the iSchool and co-director of HTRC, and John Walsh, HTRC director and associate professor of information and library science at Indiana University. HTRC staff at both universities will collaborate during the next two years to accomplish TORCHLITE's goals.

"TORCHLITE will enable us to increase dramatically, and to open more fully, public access to the massive, rich data that HTRC has created from the HathiTrust Digital Library corpus," said Downie. "We have already developed innovative ways to transform, enhance, and provide access to data created from the many millions of scanned books held by HathiTrust. With TORCHLITE, we'll create new methods for accessing this data, together with several easy-to-use tools to allow people to interact with it, analyze it, and visualize it in novel ways."

The data of interest is contained in HTRC's flagship "Extracted Features" (EF) dataset, which consists of rich metadata and statistical information inferred by algorithm from the digitized texts of the entire HathiTrust corpus. The dataset documents every word on every page, including the number of times the word appears, its part of speech, and other formal features of the language on the page. The EF dataset, and methods for computing over it, have enabled many forms of full-text analysis, even of copyrighted materials. It contains nearly three trillion tokens (roughly speaking, words) representing more than six billion pages of text, making it arguably the largest open dataset of its kind that is readily available to researchers around the world.
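To make the idea of per-page word counts concrete, here is a minimal sketch in Python of how such data might be rolled up to the volume level. The record layout, field names (`pages`, `seq`, `token_pos_count`), and the sample values are hypothetical simplifications for illustration only; they are not the actual EF schema or a TORCHLITE API.

```python
from collections import Counter

# Hypothetical, simplified volume record (NOT the real EF schema):
# each page maps a token to its counts broken down by part-of-speech tag.
volume = {
    "title": "Example Volume",
    "pages": [
        {"seq": 1, "token_pos_count": {"whale": {"NN": 2}, "the": {"DT": 3}}},
        {"seq": 2, "token_pos_count": {"whale": {"NN": 1}, "sail": {"VB": 1, "NN": 1}}},
    ],
}


def volume_token_counts(vol):
    """Sum per-page, per-part-of-speech counts into one total per token."""
    totals = Counter()
    for page in vol["pages"]:
        for token, pos_counts in page["token_pos_count"].items():
            totals[token] += sum(pos_counts.values())
    return totals


counts = volume_token_counts(volume)
```

Because only counts are stored, not the running text, this kind of aggregation can be carried out without access to the page text itself, which is consistent with the article's point that the dataset supports analysis of copyrighted materials.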

Downie emphasized that, in addition to creating new methods for working with this enormous dataset, and perhaps more importantly, TORCHLITE will develop a framework on which digital humanities scholars, digital librarians, data scientists, and anyone else interested in textual analysis can build and implement their own tools, and deploy them in their own environments, to access the HTRC data more directly and openly. TORCHLITE's tools and methods will enable the retrieval of standard volume-level descriptors (such as title, publisher, date of publication, genre, and page count) along with page- and word-level linguistic and statistical information.

In addition to creating interactive, easy-to-use tools and dashboards, TORCHLITE will promote broad community engagement through a workshop and a mentored hackathon in autumn 2023, in the hopes of encouraging individual researchers to develop their own tools using the project’s application programming interface (API).

HTRC is the official research arm of HathiTrust, a library consortium that hosts books owned and digitized by its member libraries, often in cooperation with the Google Books Project and other mass-digitization efforts. Its mission is to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge through data-intensive computational methods.

