This week, Professor and Executive Associate Dean J. Stephen Downie was a guest speaker at the Herder Institute in Marburg and the University of Göttingen. Downie, who serves as co-director of the HathiTrust Research Center (HTRC), lectured on the HTRC's "Tools for Open Research and Computation with HathiTrust: Leveraging Intelligent Text Extraction" (TORCHLITE) project.
The HTRC facilitates nonprofit and educational uses of the HathiTrust Digital Library (HTDL) by enabling computational analysis of the library's 19 million volumes, of which around 10 million are under copyright restrictions. Funded by the National Endowment for the Humanities from 2022 through 2024, TORCHLITE created easy-to-use text analysis tools, dashboards, and application programming interfaces, all of which remain active and available, to facilitate open cultural analytics research using the uniquely valuable HTDL data.
The data of interest is contained in HTRC's flagship "Extracted Features" (EF) dataset, which consists of rich metadata and statistical information derived algorithmically from the digitized texts of the entire HathiTrust corpus. The dataset documents every word on every page, including the number of times the word appears, its part of speech, and other formal features of the language on the page. The EF dataset, and methods for computing over it, have enabled many forms of full-text analysis, even of copyrighted materials. It contains nearly 3 trillion tokens (roughly speaking, words) representing more than 6 billion pages of text, making it arguably the largest open dataset of its kind that is readily available to researchers around the world.
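To make the idea of page-level features concrete, the following is a minimal sketch of how a page might be reduced to word counts grouped by part of speech. It is an illustration only: the function name `extract_page_features`, the field names `token_count` and `token_pos_count`, and the hand-tagged sample page are all assumptions made for this example, not HTRC's actual schema or tooling.

```python
from collections import Counter

# Hypothetical, pre-tagged page: (token, part-of-speech) pairs in reading order.
# The tags are hand-assigned for illustration; this is not HTRC's real format.
page = [
    ("the", "DT"), ("library", "NN"), ("holds", "VBZ"),
    ("the", "DT"), ("books", "NNS"),
]


def extract_page_features(tagged_tokens):
    """Reduce a page to unordered counts of each (token, POS) pair."""
    counts = Counter(tagged_tokens)
    token_pos_count = {}
    for (token, pos), n in counts.items():
        # Group counts by token, then by part of speech.
        token_pos_count.setdefault(token, {})[pos] = n
    return {
        "token_count": sum(counts.values()),
        "token_pos_count": token_pos_count,
    }


features = extract_page_features(page)
print(features["token_count"])             # 5
print(features["token_pos_count"]["the"])  # {'DT': 2}
```

In a representation like this one, word order is discarded, so the running text of a page cannot be read back from its counts, yet word frequencies and grammatical patterns remain available for analysis.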
In his talk, Downie highlighted the motivations, challenges, and accomplishments of TORCHLITE to date. He also outlined its next steps, which envision the creation of an international consortium of similar groups, tentatively called the "Cultural Open Data Exchange (CODEx)." The consortium will promote and extend HTRC's EF model and methods, enabling other cultural heritage institutions to provide access to their otherwise closed collections.
Downie conducts work in digital libraries, digital humanities, and music information retrieval. He holds a bachelor's degree in music theory and composition, along with master's and doctoral degrees in library and information science, all from the University of Western Ontario.