Whole Tale enables new discovery by bringing 'life' to research articles

Posted: June 10, 2016

The National Science Foundation (NSF) has awarded $5M to the "Whole Tale" project, led by Professor Bertram Ludäscher (PI) along with CIRSS affiliate Matthew Turk (co-PI, NCSA) and Associate Professor Victoria Stodden (co-PI). The five-year NSF Data Infrastructure Building Blocks (DIBBs) project will create methods and tools for scientists to link executable code, data, and other information directly to online scholarly publications, with the aim of helping to ensure reproducibility and pave the way for new discoveries. Project partners include co-PIs at the University of Chicago, the Texas Advanced Computing Center, the University of California, Santa Barbara, and the University of Notre Dame.

Further details from the official press release follow.

Directions for a new piece of "some assembly required" furniture are only useful if the user has the parts listed in the instruction manual. With both in hand, those coffee tables and bookcases are relatively easy to put together, compared to designing and constructing your own from scratch.

Scientists at the National Center for Supercomputing Applications (NCSA) and the iSchool at the University of Illinois at Urbana-Champaign are hoping to do the same thing with computer code. "Whole Tale," a new, five-year, $5 million National Science Foundation-funded Data Infrastructure Building Blocks (DIBBs) project, aims to give researchers the same instructions and ingredients to help ensure reproducibility and pave the way for new discoveries.

Whole Tale will enable researchers to examine, transform, and then seamlessly republish research data, creating "living articles" that will enable new discovery by allowing researchers to construct representations and syntheses of data.

"It's almost expected nowadays that when you publish the paper you link the paper to data," explains co-PI Matthew Turk, a research scientist at NCSA. "Linking papers to code as well as data is becoming more common. Whole Tale will take that a step further and let other researchers replicate the experience of doing the research but in their own way."

"Whole Tale" alludes to both the "whole publication story" and the "long tail of science." The project will create methods and tools for scientists to link executable code, data, and other information directly to online scholarly publications, whether the resources used are small-scale computation or state-of-the-art high-performance computing.

"Whole Tale will support the full lifecycle of computational science, from writing code to conducting the experiment to publishing the results, by creating new methods that bring together existing tools and make them easier to use," says PI Bertram Ludäscher, a professor in the iSchool at Illinois where he is director of the Center for Informatics Research in Science and Scholarship (CIRSS), and also an NCSA researcher.

How will Whole Tale work? Through a web browser, a scientist will be able to seamlessly access research data and carry out analyses in the Whole Tale environment. Digital research objects, such as code, scripts, or data produced during the research, can be shared between collaborators. These will be bundled with the paper to produce a "living article," accessible by reviewers and the scientific community for in-depth pre- and post-publication peer review. Augmenting the traditional research publication with the full computational environment will enable discovery of underlying data and code, facilitating reproducibility and reuse. Whole Tale will provide an environment of multiple, independently developed frontends (e.g., Jupyter, RStudio, or Shiny) where data can be explored in myriad ways to yield better opportunities for understanding, use, and reuse of the data.

Researchers envision a three-part process:

1. Prepublication and the Research Environment: The Jupyter project provides a powerful and popular frontend for research, storing real-time information about the research pipeline during the research process. A federation of data repositories will be accessible uniformly through DataONE, and tools such as Globus data publication services will allow a group of researchers to share in the creation of datasets and metadata. Providing a personal, federated storage system for intermediate products using ownCloud and iRODS enables collaboration to take place nearest the data itself. Integrated tools such as BrownDog provide support in creating the appropriate metadata.

2. Publication Process, Peer Review, and Embedded Articles: Whole Tale will allow scientists to organize data, software, and workflows into curatable collections with assigned digital object identifiers (DOIs) as the paper is written, so that these collections will be ready for publication with the paper. The PIs are working with key publishers, including BioOne’s Elementa: Science of the Anthropocene, on the tools and services needed to realize the project.

3. Postpublication Access, Persistence, Reproducibility, and Reuse: Whole Tale will collaborate with initiatives developing new data-literature linking services. In this way, users can independently verify the published findings, as well as have the option to execute the codes on a different system.

Whole Tale is a highly collaborative project led by Illinois PI Bertram Ludäscher and co-PIs Matthew Turk of NCSA and Victoria Stodden, also a member of the GSLIS faculty and an NCSA researcher. Additional co-PIs include Kyle Chard, senior researcher and fellow in the Computation Institute at the University of Chicago and Argonne National Laboratory; Niall Gaffney, director of data intensive computing at the Texas Advanced Computing Center; Matt Jones, director of informatics research and development at the National Center for Ecological Analysis and Synthesis at the University of California, Santa Barbara; and Jarek Nabrzyski, director of the Center for Research Computing at the University of Notre Dame.

Follow the development of this five-year project at: http://wholetale.org/.
