To make older texts searchable, queries in contemporary English need to be translated into the language of various historical periods. The project will develop software that lets people enter a query in contemporary English and search over English texts throughout history, from medieval times to the present day. Most of the work will involve training statistical models that assign a probability to each candidate translation of a word or phrase in the target historical variety of English. The project will also look at how to display results so that the user sees the most probable answer to the query.
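As a rough illustration of what "assigning probabilities to translations" can mean, the sketch below estimates P(historical word | modern word) by relative frequency from a handful of aligned word pairs and then lists the most probable historical variants for each query word. The function names, the `top_k` parameter, and the toy alignment data are all hypothetical; the project description does not specify its models, so this is a minimal sketch under those assumptions and not the project's method.

```python
from collections import Counter, defaultdict


def train_translation_table(aligned_pairs):
    """Estimate P(historical | modern) by relative frequency.

    aligned_pairs: iterable of (modern_word, historical_word) tuples.
    """
    counts = defaultdict(Counter)
    for modern, historical in aligned_pairs:
        counts[modern][historical] += 1
    table = {}
    for modern, variants in counts.items():
        total = sum(variants.values())
        table[modern] = {h: c / total for h, c in variants.items()}
    return table


def translate_query(query, table, top_k=2):
    """For each query word, return its top_k most probable historical variants."""
    expanded = []
    for word in query.lower().split():
        # Words never seen in training pass through unchanged with probability 1.
        variants = table.get(word, {word: 1.0})
        # Sort by descending probability, breaking ties alphabetically.
        ranked = sorted(variants.items(), key=lambda kv: (-kv[1], kv[0]))
        expanded.append(ranked[:top_k])
    return expanded


# Toy, invented alignment data (an assumption; not drawn from any real corpus).
pairs = [
    ("you", "thou"), ("you", "thou"), ("you", "ye"), ("you", "you"),
    ("has", "hath"), ("has", "hath"), ("has", "has"),
]
table = train_translation_table(pairs)
result = translate_query("you has", table)
for word, candidates in zip("you has".split(), result):
    print(word, candidates)
```

A search front end could then issue the ranked variants as alternative query terms and order results by the probability of the variant that matched.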

Time affects information retrieval in many ways. Collections of documents change as new items are indexed. The content of documents themselves may change. Users submit queries at particular moments in time. And perhaps most importantly, people’s assessment of a document’s relevance to a query is often time-dependent. For example, searchers of news archives might seek information on a past event where relevant documents cluster in a window of time. Users of social media services such as Twitter demand topically relevant information that is new. People who monitor particular topics in the news (for example, editors of Wikipedia) take action when they find information that is topically relevant and that changes current knowledge. The traces of information created by change in documents,...

The HathiTrust Research Center (HTRC) is partnering with the Cultural Observatory team that developed the Google Books Ngram Viewer together with Google. The goal of this collaboration is to implement a greatly enhanced open-source version of the Cultural Observatory’s “Bookworm” text analysis and visualization tool, designed to help scholars meet the challenges posed by the massive scale of the HT corpus. We are calling our multi-disciplinary, multi-institutional collaboration the HathiTrust + Bookworm (HT+BW) Project. Participating institutions include the University of Illinois, Indiana University, Northeastern University, Baylor College of Medicine, and Rice University.

Bookworm is a tool that visualizes language usage trends in repositories of...


This project aims to improve search engine effectiveness by using knowledge base (KB) entries to inform query expansion. While the intersection of KBs and information retrieval (IR) is a growing research area, this project proposes a novel approach to KB-based query modeling. In particular, this project proposes to let the structure that KB authors impose within individual KB entries guide the final query model. For instance, authors of Wikipedia pages divide individual entries into sections, subsections, bulleted lists, etc. The goal of this project is to use such intra-entity structure to derive highly focused query models. This project is a collaboration between Miles Efron, his Google sponsor, and a PhD student with the goal of advancing the state of the art in using structured...
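One way to picture structure-guided query modeling is sketched below: a KB entry is represented as a mapping from section titles to text, the section sharing the most words with the query is selected, and expansion terms are weighted by their relative frequency within that section alone. The function name, the `num_terms` parameter, the overlap heuristic, and the toy entry are all assumptions made for illustration; the project summary does not describe its actual model, and a realistic version would at least remove stopwords.

```python
from collections import Counter


def section_query_model(query, sections, num_terms=3):
    """Build a small expansion model from the KB section closest to the query.

    sections: dict mapping a section title to that section's text.
    Returns (chosen_title, {term: weight}) with weights as relative frequencies.
    """
    query_terms = set(query.lower().split())

    def overlap(title):
        # Number of distinct query words that also occur in the section.
        return len(query_terms & set(sections[title].lower().split()))

    best_title = max(sections, key=overlap)
    counts = Counter(
        w for w in sections[best_title].lower().split() if w not in query_terms
    )
    total = sum(counts.values())
    if total == 0:
        return best_title, {}
    # Keep the most frequent terms, breaking ties alphabetically.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms]
    return best_title, {term: c / total for term, c in top}


# Toy, invented KB entry (an assumption for illustration only).
entry = {
    "History": "the city was founded in 1850 and the city grew by rail",
    "Music": "jazz clubs and jazz festivals made the clubs famous for live music",
}
title, model = section_query_model("jazz music", entry, num_terms=1)
print(title, model)
```

Restricting the term statistics to one section, instead of the whole entry, is what keeps the expansion focused: terms from the "History" section never enter the model for a music query.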

This project, conducted collaboratively by the iSchool and the University Library, will further our understanding of four translational research questions:

  1. As compared to general collection catalog records, item-level metadata for digitized special collections are frequently more granular, richer in non-bibliographic entities, and expressed using custom vocabularies and schemas. What differences and additional challenges are encountered when transforming legacy special collections metadata records into LOD?
  2. Typically, interfaces used to discover and view digitized special collections are disconnected from the online public access catalogs and ancillary services used to provide user access to general library collections. Can LOD reconnect library special and...

Despite the ubiquity of search in many people’s daily lives, a lack of search literacy can make it difficult to find solutions to technical problems that arise in software-based tasks, such as troubleshooting program installations. iSchool Professor Michael Twidale and Assistant Professor Max Wilson of the University of Nottingham have received funding from Google for a project that aims to develop an understanding of search literacy, to recommend best practices for teaching technical search literacy, and to create tools in support of this kind of search.

Music prints and manuscripts created over the past thousand years sit on the shelves of libraries and museums around the globe. As these organizations digitize their collections, images of these scores are increasingly accessible online. However, the musical content remains difficult to search.

Google Books and HathiTrust have already made it possible to search the content of text documents through Optical Character Recognition (OCR), which transforms digital images of texts into a symbolic representation that can be searched by computers. For digital images of musical scores, the analogous technology is Optical Music Recognition (OMR).

The research team is working to improve OMR technology so that computers can recognize the musical symbols in these images, enabling us...


The HathiTrust has provided funding for the HathiTrust Research Center (HTRC), co-located at the University of Illinois and Indiana University, to serve as the research arm of the HathiTrust and create an agile, technology-rich service for researchers in the digital humanities, social sciences, natural sciences, and informatics. This service will help researchers conduct nonconsumptive research on the HathiTrust digital library database, a collection of just under 14 million digitized volumes, equating to 4.9 billion pages, 60% of which is under some copyright restriction. At the same time, center staff will develop and refine tools to aid in digital humanities and text mining research over large databases and will operate the secure, large-scale computation environment required by this...

Scholarly publications today are still mostly disconnected from the underlying data and code used to produce the published results and findings, despite an increasing recognition of the need to share all aspects of the research process. As data become more open and transportable, a second layer of research output has emerged, linking research publications to the associated data, possibly along with its provenance. This trend is rapidly being followed by a third layer: communicating the process of inquiry itself by sharing a complete computational narrative that links method descriptions with executable code and data, thereby introducing a new era of reproducible science and accelerated knowledge discovery. In the Whole Tale (WT) project, all of these components are linked and accessible...

This project builds upon, extends, and integrates two developmental research threads within the HathiTrust Research Center (HTRC). The first thread originates from work that was conducted in the Workset Collections for Scholarly Analysis (WCSA): Prototyping Project. The second thread continues the work of the Data Capsules (DC) project, previously supported by the Alfred P. Sloan Foundation (2011-2014). The primary objective of the WCSA+DC project is the seamless integration of the workset model and tools with the Data Capsule framework to provide non-consumptive research access to HathiTrust's massive corpus of data objects, securely and at...


Jul. 13, 2016

Associate Professor Miles Efron will participate in the 39th International Conference of the Association for Computing Machinery (ACM) Special Interest Group on Information Retrieval (SIGIR). The conference will be held July 17-21 in Pisa, Italy.

Efron and his doctoral students Craig Willis and Garrick Sherman will present the short paper “What Makes a Query Temporally Sensitive?”

From the abstract: This work takes an in-depth look at the factors that affect manual classifications of “temporally sensitive” information needs. We use qualitative and quantitative techniques to analyze 660 topics from the Text Retrieval Conference (TREC) previously used in the experimental evaluation of temporal retrieval models. Regression analysis is used to model previous manual classifications. We identify factors and potential problems with previous classifications, proposing principles and guidelines for future work on...

Mar. 1, 2016

Every month, Google alone fields billions of search requests. The staggering demand for information, coupled with the exponentially growing amount of information available, means that reliable search results are key to maneuvering a flooded information landscape.

Associate Professor Miles Efron is among the leading scholars investigating ways to improve search. With funded research projects supported by the National Science Foundation as well as by industry partners such as Google, he looks at the issue from a variety of angles, including questions of query representation and how temporal factors affect the relationship between queries and relevant information.

Though his research is thick with writing code and creating algorithms, Efron approaches his work through the lens of a humanist, incorporating his academic background in classics and medieval studies. “My goal is to translate familiar humanist concerns and see how they resonate in the kinds of domains that...

Feb. 23, 2016
Tim Cole, right, mathematics librarian, is helping develop tools so scholars like Ted Underwood, left, can use computational analysis to answer research questions. J. Stephen Downie, center, is the project lead. Photo by Joyce Seay-Knoblauch.

Illinois English professor Ted Underwood wants to know how the language describing male and female characters in works of fiction has changed since the late eighteenth century. He’s using data mining tools to gather information from thousands of books to answer that question.

The problem, though, is that books published after 1922 are still under copyright protection and their content can’t be shared freely online.

“There are hundreds of thousands of books out there, and we don’t talk about them,”...

Feb. 17, 2016

GSLIS master’s students Jessica Colbert and Annabella Irvine will participate in the Midwest Bisexual Lesbian Gay Transgender Ally College Conference (MBLGTACC) this week, where they will lead a workshop on locating LGBT materials in libraries and will represent the GSLIS student group, Queer Library Alliance (QLA). Held annually, the interdisciplinary conference is organized by students and is the largest event of its kind in the country. MBLGTACC 2016 will be held at Purdue University on February 19-21.

The workshop, “Finding Ourselves in the Library: Locating LGBT Materials in Libraries,” was developed by Colbert and fellow GSLIS MS/LIS student Brittany Craig.

Abstract: This workshop serves to assist students in pursuit of queer literatures, histories, and other LGBTQIA-related library materials. Libraries have been a hub for marginalized populations, particularly...

Jan. 25, 2016

Associate Professor Miles Efron has been named the GSLIS Centennial Scholar for 2015-2016. The Centennial Scholar award is endowed by alumni and friends of GSLIS and given in recognition of outstanding accomplishments and/or professional promise in the field of library and information science.

“This is a real honor. One of the things that makes GSLIS a great academic home is the excellence and intellectual diversity of our faculty. To be recognized in this way by colleagues whom I really admire is so gratifying. I give my strongest thanks to the GSLIS faculty for this recognition and support of my work,” Efron said.

“This award will help me to continue organizing GSLIS’s ongoing participation in the annual Text Retrieval Conference (TREC), hosted by the National Institute of Standards and Technology. It will also afford me a much-welcomed freedom to pursue a project in the digital humanities—analyzing data from the HathiTrust—that I have had on the back burner for a...

Dec. 21, 2015

Abbott's nutrition business has awarded GSLIS $25,000 for a project led by Assistant Professor Vetle Torvik titled, “Computer-assisted text-mining across biomedical papers and patents for competitive intelligence.” Working with Torvik on this year-long project is doctoral student Adam Kehoe (MS ’09).

With guidance from biomedical scientists at Abbott, Torvik and Kehoe will develop a new system for literature-based discovery that will draw on literature from both academia and industry. The goal is to create a system that will improve discovery in the realm of biomedical research by combining these two traditionally independent bodies of work. In addition to meeting Abbott’s specific research needs, the new tool will be made available online to external researchers.

“We are interested in what extra can we find when we put together these two separated literatures. Are there discoveries to be had when we put these things together?” said Torvik.


Dec. 11, 2015

J. Stephen Downie, GSLIS professor and associate director for research, gave a keynote at the 16th International Society for Music Information Retrieval Conference (ISMIR) held in Malaga, Spain, this fall.

Downie reflected on how far music information retrieval (MIR) research has come in making access to music resources quick and simple over the last three decades.

“I am constantly surprised by the totally groovy and jaw-droppingly spiffy new music applications developed by my fellow MIR researchers. This represents the extraordinary vibrancy of the MIR community,” said Downie.

This year marked the eleventh running of MIREX, the Music Information Retrieval Evaluation eXchange, which was held in conjunction with the ISMIR conference. MIREX is an evaluation campaign of music information retrieval systems and algorithms that takes place annually at the conference. Nearly 100 researchers...