Team Illinois Developing Advanced Provenance Tools for DataONE

Bertram Ludäscher, Professor and Director, Center for Informatics Research in Science and Scholarship

Provenance information describes the origin and history of artifacts. Because of the vital role played by data and workflow provenance in support of transparency and reproducibility in computational and data science, creating tools for capturing and using provenance information is an important yet challenging task.

Post-doctoral Research Associate Yang Cao and Professor Bertram Ludäscher recently presented joint work on data provenance at the Data Observation Network for Earth (DataONE) All Hands Meeting in Santa Ana Pueblo, New Mexico. In their poster and system demonstration, jointly authored by a team of University of Illinois students and staff as well as collaborators from the UK, Cao and Ludäscher demonstrated how the YesWorkflow tool is "Revealing the Detailed History of Script Outputs with Hybrid Provenance Queries."1

In an earlier article for the Winter 2015/6 issue of DataONE News, "Your Data has a History, too: Towards Transparency and Reproducibility through Provenance,"2 Ludäscher discussed how critical data provenance is for transparency, data quality, and computational reproducibility, and how difficult provenance information is to use without better tools. "Gathering provenance and then linking data, provenance, and software with each other and to publications is a complex and often labor-intensive, manual process. However, as more and more tools become 'provenance-aware' and allow scientists to record and share provenance information, there is hope that provenance management will become much easier and more seamless in the future," he said.

One such tool, YesWorkflow,3,4 is based on a simple annotation language for data analysis scripts. According to Ludäscher, "This language-independent, lightweight annotation approach not only yields an informative workflow model of a script, thus facilitating understanding and reuse of the script, but it can also be used to reconstruct runtime provenance information from script executions and link this information back to the scientist’s conceptual workflow. In this way, provenance can be the subject and driver of powerful queries against the scientist's own data, making provenance not only useful metadata for others, but letting scientists themselves immediately benefit from the provenance information they created."
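To make the annotation idea concrete, here is a minimal sketch in Python. It assumes that script steps are marked with comment tags such as `@begin`, `@in`, `@out`, and `@end`, in the style of YesWorkflow annotations. The toy script, the `extract_workflow` and `data_edges` helpers, and all step and data names are hypothetical illustrations, not the YesWorkflow toolkit's own code or API. Because the tags live in comments, the same approach could apply to scripts in other languages.

```python
import re

# A toy analysis script carrying YesWorkflow-style comment annotations.
# The step names and data names are made up for illustration.
ANNOTATED_SCRIPT = '''
# @begin clean_data
# @in raw_file
# @out cleaned_table
rows = [r for r in open("raw.csv") if r.strip()]
# @end clean_data

# @begin plot_results
# @in cleaned_table
# @out figure
print(len(rows))
# @end plot_results
'''

# Matches a comment tag followed by a single name, e.g. "# @in raw_file".
TAG = re.compile(r"#\s*@(begin|in|out|end)\s+(\S+)")


def extract_workflow(script_text):
    """Return {step_name: {"in": [...], "out": [...]}} from the annotations."""
    blocks = {}
    current = None
    for line in script_text.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code line, not an annotation
        tag, name = match.groups()
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks


def data_edges(blocks):
    """Link steps where one step's output is another step's input."""
    edges = []
    for producer, produced in blocks.items():
        for consumer, consumed in blocks.items():
            for name in produced["out"]:
                if name in consumed["in"]:
                    edges.append((producer, name, consumer))
    return edges


workflow = extract_workflow(ANNOTATED_SCRIPT)
edges = data_edges(workflow)
```

The resulting `workflow` dictionary and `edges` list form a small workflow model of the script: which steps exist, what each consumes and produces, and how data flows between them. This sketch covers only the model extracted from annotations; reconstructing runtime provenance from actual script executions, as described in the quote above, is beyond it.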

DataONE is supported by the National Science Foundation and was developed to ensure the preservation, access, and reuse of science data via a federation of member nodes and coordinating nodes, an investigator toolkit, and a broad education and outreach program.

Ludäscher, director of the iSchool's Center for Informatics Research in Science and Scholarship (CIRSS), is a leading figure in data and knowledge management, focusing on the modeling, design, and optimization of scientific workflows, provenance, data integration, and knowledge representation. He joined the iSchool faculty in 2014 and is a faculty affiliate at NCSA and the Department of Computer Science. 

References

1. Yang Cao, Duc Vu, Qiwen Wang, Qian Zhang, Priyaa Thavasimani, Timothy McPhillips, Paolo Missier, Bertram Ludäscher (2016). Revealing the Detailed History of Script Outputs with Hybrid Provenance. Poster and System Demonstration, DataONE All Hands Meeting, September 20-22, Santa Ana Pueblo, New Mexico.
2. Bertram Ludäscher (2016). "Your Data has a History, too: Towards Transparency and Reproducibility through Provenance," DataONE News 4(2), Winter 2015/6.
3. YesWorkflow toolkit.
4. T. McPhillips, S. Bowers, K. Belhajjame, B. Ludäscher (2015). Retrospective Provenance Without a Runtime Provenance Recorder. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).
