Exploring the Billions and Billions of Words in the HathiTrust Corpus with Bookworm: HathiTrust + Bookworm Project

TIME FRAME
2014 – present
TOTAL FUNDING TO DATE
$504,373

The HathiTrust Research Center (HTRC) is partnering with the Cultural Observatory team, which developed the Google Books Ngram Viewer together with Google. The goal of this collaboration is to implement a greatly enhanced version of the Cultural Observatory’s open-source “Bookworm” text analysis and visualization tool. The enhanced tool, also open source, is designed to help scholars meet the challenges posed by the massive scale of the HathiTrust (HT) corpus. We call this multi-disciplinary, multi-institutional collaboration the HathiTrust + Bookworm (HT+BW) Project. Participating institutions include the University of Illinois, Indiana University, Northeastern University, Baylor College of Medicine, and Rice University.

Bookworm visualizes language usage trends in repositories of digitized texts in a simple yet powerful way. HT+BW is intended to provide more powerful visualizations than earlier efforts because it will allow multi-faceted “slicing and dicing” of the data by an enhanced set of content-based and metadata-based features. The new Bookworm will help scholars better build their research collections of texts, called worksets, and will make those worksets more useful by improving: a) discovery in HTRC, so that scholars can find new texts to add to their worksets; b) custom analysis, so that scholars can uncover novel patterns within their worksets; and c) exploration of the full HathiTrust corpus, so that scholars can view high-level trends while retaining the control to compare them by features such as subject classification, place of publication, genre, and language.
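
To make the idea of metadata-based “slicing and dicing” concrete, the following is a minimal sketch of a faceted usage-trend computation. It is not Bookworm's implementation or API: the record layout, the `TOY_CORPUS` data, and the `relative_frequency` function are all hypothetical and exist only for illustration. The sketch computes a word's share of all tokens per year, optionally restricted to one value of a metadata facet such as genre.

```python
from collections import defaultdict

# Hypothetical toy records: (publication year, genre facet, tokens).
# This is illustrative data, not drawn from the HathiTrust corpus.
TOY_CORPUS = [
    (1900, "fiction", ["the", "whale", "and", "the", "sea"]),
    (1900, "science", ["the", "cell", "divides"]),
    (1901, "fiction", ["a", "whale", "a", "whale"]),
]


def relative_frequency(corpus, word, facet=None):
    """Return {year: occurrences of `word` / total tokens in that year}.

    If `facet` is given, only records whose facet value matches are counted,
    which is the "slice" step; years left with no tokens are omitted.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, facet_value, tokens in corpus:
        if facet is not None and facet_value != facet:
            continue  # record falls outside the requested slice
        hits[year] += tokens.count(word)
        totals[year] += len(tokens)
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


if __name__ == "__main__":
    print(relative_frequency(TOY_CORPUS, "whale"))
    print(relative_frequency(TOY_CORPUS, "whale", facet="fiction"))
```

Comparing the unrestricted trend with the trend for a single facet value is the kind of side-by-side view the project describes; a real deployment would draw its facets from catalog metadata rather than a hand-written list.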

Personnel

HathiTrust + Bookworm Project
J. Stephen Downie
Principal Investigator (PI)

Funding Agencies

National Endowment for the Humanities — 2014 — $504,373

Research Areas

Digital Humanities, Digital Libraries, Information Retrieval