Team Illinois Developing Advanced Provenance Tools for DataONE

Bertram Ludäscher, Professor and Director, Center for Informatics Research in Science and Scholarship

Provenance information describes the origin and history of artifacts. Because of the vital role played by data and workflow provenance in support of transparency and reproducibility in computational and data science, creating tools for capturing and using provenance information is an important yet challenging task.

Post-doctoral Research Associate Yang Cao and Professor Bertram Ludäscher recently presented joint work on data provenance at the Data Observation Network for Earth (DataONE) All Hands Meeting in Santa Ana Pueblo, New Mexico. In their poster and system demonstration, jointly authored by a team of University of Illinois students and staff as well as collaborators from the UK, Cao and Ludäscher demonstrated how the YesWorkflow tool is "Revealing the Detailed History of Script Outputs with Hybrid Provenance Queries." [1]

In an earlier article for the Winter 2015/6 issue of DataONE News, "Your Data has a History, too: Towards Transparency and Reproducibility through Provenance," [2] Ludäscher discussed data provenance: how critical it is for transparency, data quality, and computational reproducibility, yet how difficult it is to make use of provenance information unless better tools are available. "Gathering provenance and then linking data, provenance, and software with each other and to publications is a complex and often labor-intensive, manual process. However, as more and more tools become 'provenance-aware' and allow scientists to record and share provenance information, there is hope that provenance management will become much easier and more seamless in the future," he said.

One such tool, YesWorkflow [3, 4], is based on a simple annotation language for data analysis scripts. According to Ludäscher, "This language-independent, lightweight annotation approach not only yields an informative workflow model of a script, thus facilitating understanding and reuse of the script, but it can also be used to reconstruct runtime provenance information from script executions and link this information back to the scientist's conceptual workflow. In this way, provenance can be the subject and driver of powerful queries against the scientist's own data, making provenance not only useful metadata for others, but letting scientists themselves immediately benefit from the provenance information they created."
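
To make the annotation idea concrete, here is a minimal sketch in Python. The comment tags (@begin, @end, @in, @out) are modeled on YesWorkflow's annotation keywords; treat the exact vocabulary as an assumption and consult the toolkit for the definitive syntax. The functions extract_blocks and data_edges are hypothetical helpers written for this illustration only. They are not part of YesWorkflow, and they show only how a workflow model could in principle be read out of script comments.

```python
import re

# An illustrative analysis script whose steps are marked with
# YesWorkflow-style comment annotations (tag names assumed, see lead-in).
ANNOTATED_SCRIPT = '''
# @begin clean_temperatures
# @in raw_readings
# @out cleaned_readings
def clean(raw_readings):
    return [r for r in raw_readings if r is not None]
# @end clean_temperatures

# @begin compute_mean
# @in cleaned_readings
# @out mean_temperature
def mean(cleaned_readings):
    return sum(cleaned_readings) / len(cleaned_readings)
# @end compute_mean
'''


def extract_blocks(source):
    """Toy extractor (hypothetical, not the YesWorkflow implementation):
    collect each @begin ... @end block with its declared inputs and outputs."""
    blocks = []
    current = None
    for line in source.splitlines():
        match = re.match(r"\s*#\s*@(begin|in|out|end)\s+(\S+)", line)
        if not match:
            continue  # ordinary code line, ignored by the extractor
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end" and current is not None:
            blocks.append(current)
            current = None
        elif tag in ("in", "out") and current is not None:
            current[tag].append(name)
    return blocks


def data_edges(blocks):
    """Link blocks wherever one block's declared output is another's input."""
    edges = []
    for producer in blocks:
        for consumer in blocks:
            for name in producer["out"]:
                if name in consumer["in"]:
                    edges.append((producer["name"], name, consumer["name"]))
    return edges


if __name__ == "__main__":
    for producer, data, consumer in data_edges(extract_blocks(ANNOTATED_SCRIPT)):
        print(f"{producer} --[{data}]--> {consumer}")
```

Because the annotations live in comments, the script itself runs unchanged, which is what makes the approach language-independent: only the comment syntax differs from one scripting language to another.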

DataONE is supported by the National Science Foundation and was developed to ensure the preservation, access, and reuse of science data via a federation of member nodes and coordinating nodes, an investigator toolkit, and a broad education and outreach program.

Ludäscher, director of the iSchool's Center for Informatics Research in Science and Scholarship (CIRSS), is a leading figure in data and knowledge management, focusing on the modeling, design, and optimization of scientific workflows, provenance, data integration, and knowledge representation. He joined the iSchool faculty in 2014 and is a faculty affiliate at NCSA and the Department of Computer Science. 

References

[1] Yang Cao, Duc Vu, Qiwen Wang, Qian Zhang, Priyaa Thavasimani, Timothy McPhillips, Paolo Missier, Bertram Ludäscher (2016). Revealing the Detailed History of Script Outputs with Hybrid Provenance. Poster and System Demonstration, DataONE All Hands Meeting, September 20-22, Santa Ana Pueblo, New Mexico.
[2] Bertram Ludäscher (2016). “Your Data has a History, too: Towards Transparency and Reproducibility through Provenance,” DataONE News 4(2), Winter 2015/6.
[3] YesWorkflow toolkit.
[4] T. McPhillips, S. Bowers, K. Belhajjame, B. Ludäscher (2015). Retrospective Provenance Without a Runtime Provenance Recorder. 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP'15).

