Broadening Access to Text Analysis by Describing Uncertainty

A project to study errors and paratextual noise in optically transcribed digital library texts, and the consequences of these errors for historical and humanistic conclusions about trends across time.

The noise associated with digital transcription has become an important obstacle to humanistic research. While the errors in digital texts are easily observed, the downstream effects of error on scholarship are far from clear. Consequential problems for the humanities often spring less from the average level of error in a collection than from the uneven distribution of noise across different periods, genres, and social strata. Uncertainty about this problem undermines confidence in research and discourages some scholars from using digital libraries at all. To address these problems, we will 1) create paired libraries of clean, manually transcribed volumes and optically transcribed versions of the same volumes, with or without paratext; 2) conduct parallel experiments in these corpora to empirically measure the distortions affecting scholarship; and 3) construct a map of error and share resources that help scholars estimate levels of uncertainty in their work.

