New tools available to mine world's largest digital repository of books

J. Stephen Downie, Professor, Associate Dean for Research, and Co-Director of the HathiTrust Research Center

This week the HathiTrust Research Center (HTRC) announced the availability of data mining and analytics tools for the HathiTrust Digital Library, a collection of digital texts from over 70 research libraries around the world. The new tools provide a much-needed entry point to large-scale analysis of HathiTrust’s contents.

“All of us at Indiana University and the University of Illinois, who have been working toward this release for the last year, can be proud of enabling a first round of shared computation tools for the HathiTrust corpus,” said Beth Plale, professor in the IU School of Informatics and Computing and co-director of the HTRC. “Now we can share this framework for analytical (non-consumptive) research.”

Indiana University and the University of Illinois are the founding partners of the HTRC. The new infrastructure release follows an aggressive development path set forth by the HTRC Executive Management Team at the 2012 HTRC UnCamp, a gathering of HTRC developers, researchers, and librarians. Users can now apply sophisticated computational research methodology across the large-scale collection, leveraging metadata crafted over time by libraries.

In phase two of the HTRC (September 2012–March 2013), the HTRC Technical Working Group created production versions of the beta services previewed at the 2012 UnCamp event. The group is now working to open the resources to community testers who are part of the HTRC User Group Community. (For subscription details, join htrc-usergroup-l.)

“This represents a major step forward in understanding how new knowledge can be derived from one of the largest digital library collections in the world,” notes J. Stephen Downie, professor in the Graduate School of Library and Information Science at the University of Illinois and co-director of the HTRC.

The HTRC software and services provide the analytical entry point to the digital texts and are based on a completely new technical foundation. This foundation leverages existing analytics tools such as SEASR, digital library software such as Blacklight, and a services-oriented architecture application interface. The current production phase includes an HTRC Sandbox that is open to scholars for evaluation of the HTRC software and services as part of their experiments.

“This is a significant step forward in making the HathiTrust digital collection a valued source for creating new scholarship,” remarked Laine Farley, member of the HathiTrust Board of Governors and executive director of the California Digital Library.

Research Areas: