One of the world’s largest digital libraries opens doors to text-mining scholars

J. Stephen Downie, Professor, Associate Dean for Research, and Co-Director of the HathiTrust Research Center

Who influenced Charles Darwin when he was writing his pioneering theory of evolution, On the Origin of Species? Indiana University (IU) professor Colin Allen wants to know, and the HathiTrust Research Center may now hold the answer.

The HathiTrust Research Center (HTRC), a cooperative service of Indiana University, the University of Illinois, and HathiTrust, has expanded its services to support computational research on the entire HathiTrust collection, one of the world’s largest digital libraries. The collection comprises over 14 million digitized volumes, including more than 7 million books, more than 725,000 US federal government documents, and more than 350,000 serial publications. It is drawn from some of the largest research libraries in North America, including those of Indiana University and the University of Illinois.

Previously, the HathiTrust Research Center supported analysis of only the public domain subset of the HathiTrust collection. HTRC is now the only place where scholars like Allen can perform text mining on the entire collection. In other words, researchers can now run an algorithm against all 14 million volumes and make new connections and discoveries in the process.

Text mining is crucial to Allen’s research. As a member of the IU Department of History and Philosophy of Science and Medicine and IU’s cognitive science program, he is collaborating with informatics professor Simon DeDeo and graduate student Jaimie Murdock to research how what Darwin read influenced his theory of evolution. They can now use the HathiTrust collection, developing algorithms to analyze the books and journals Darwin himself read in the 1800s.

“We have only scratched the surface of what is possible,” said Allen. “Using advanced computing, scholars will be able to analyze patterns in millions of books and understand how individual authors, who are limited to selectively reading just a few thousand of them, nevertheless manage to make creative and innovative contributions that ripple throughout the entire culture.”  

“Supporting innovative uses of the collections we are preserving is a vital part of our mission,” said Mike Furlough, executive director of HathiTrust. “The HathiTrust Research Center is an essential part of the HathiTrust partnership. Its secure environment for computational analysis, coupled with the expanded services, is an absolute game changer for science and scholarship.”

Staff members of the Indiana University Pervasive Technology Institute (PTI) and the Data to Insight Center (D2I) have helped expand the service to support 14 million volumes. “The big data infrastructure of HTRC ensures that researchers will retain access to the collection even as it grows in size,” said Beth Plale, Indiana co-director of HTRC and professor of informatics and computing at IU. “A researcher carrying out text mining on millions of texts needs both tools and the help of HTRC experts in high performance mining techniques. HTRC research staff bridge the gap between the researcher and the data.”

At first, researchers will be able to access the HTRC collection through its Advanced Collaborative Services grants. This peer-reviewed grant process gives awardees dedicated HTRC staff time.  

HTRC expects to make the full collection available through its secure HTRC data capsules in spring 2017. A features data set, derived from the full collection at both volume level and page level, will be released in fall 2016. “The upcoming release of the extracted features data derived from the full collection will enable researchers to have hands-on access to HT materials, allowing scholars to refine their research questions for the corpus in the comfort of their own labs. Another game-changing breakthrough for HTRC,” said J. Stephen Downie, the Illinois co-director of HTRC and a professor at the Graduate School of Library and Information Science (GSLIS) at the University of Illinois at Urbana-Champaign.

“This step exemplifies how researchers combine computer science, informatics, humanities, and cyberinfrastructure in ways that enable new forms of scholarship,” said Brad Wheeler, IU vice president for information technology and interim dean of the IU School of Informatics and Computing. “IU is proud to be a co-founder, operator, and research partner in all that the HathiTrust has accomplished as one of the world’s foremost digital libraries.”

About the HathiTrust and its Research Center
HathiTrust, a partnership of academic and research institutions, works to ensure that the cultural record is preserved and accessible long into the future. HathiTrust was launched in 2008 as a joint effort of the consortium known as the Committee on Institutional Cooperation and the university libraries of the University of California. Since then, HathiTrust has grown to include over 110 members from around the world.

The HathiTrust Research Center is a partnership between Indiana University, the University of Illinois at Urbana-Champaign, and the HathiTrust. Established in 2011, HTRC develops cyberinfrastructure and cutting-edge software tools for applying advanced computational analytics to the growing digital record of human knowledge. The HTRC is staffed by an interdisciplinary team of personnel at the Indiana University Pervasive Technology Institute and the Data to Insight Center, the Graduate School of Library and Information Science at Illinois, and libraries at both Indiana and Illinois.