Nikolaus Parulian Dissertation Defense

Doctoral candidate Nikolaus Parulian will defend his dissertation, "A Conceptual Model for Transparent, Reusable, and Collaborative Data Cleaning."

Abstract:
Data cleaning is an essential component of data preparation in machine learning and other data science workflows. It is widely recognized as the most time-consuming and error-prone part of working with real-world data. The data cleaning operations and decisions made while preparing datasets can significantly affect the reliability and trustworthiness of any subsequent analysis. To ensure that data cleaning processes are transparent and auditable, data cleaning tools need to capture provenance information. However, transparent data cleaning requires not only that provenance (e.g., the operation history and value changes) be recorded, but also that those changes be easy to explore, query, and evaluate for the data scientists who want to use the cleaned data.

Existing provenance models, e.g., the W3C PROV standard or model extensions that add prospective provenance (i.e., workflow) elements to it, still fall short of a natural requirement of data cleaning: tracing and querying changes at different levels of granularity (e.g., operation level, column level, and cell level). To this end, we proposed a new conceptual model that captures fine-grained retrospective provenance for data cleaning steps. This initial model has been extended with prospective provenance information to represent the operations or workflows that change the datasets. The resulting hybrid provenance model for data cleaning can answer powerful queries that earlier models cannot. A prototype implementation of the model and corresponding provenance queries have been developed to demonstrate the effectiveness of the overall approach for advanced use cases such as auditing data cleaning workflows.

We further extended the model to focus on reusability and collaboration in data cleaning, specifically addressing scenarios where multiple users contribute changes to a dataset. We explored new use cases, particularly collaborative data cleaning, in which different data curators work together on a dataset's cleaning process. The conceptual model captures information about dataset changes and represents the steps and actions taken by different users during the data cleaning process. It enables users to track the changes made by each curator, identify the dependencies between different data cleaning operations, and facilitate collaboration among the curators.

By applying this model to an experimental case study, we analyzed and showcased the reusability of data cleaning workflows and how different users contribute to the cleaning process, and we evaluated the effectiveness of workflow collaboration in improving data quality.

The committee includes Professor Bertram Ludäscher, Chair & Director of Research; Professor J. Stephen Downie; Associate Professor Jana Diesner; and Assistant Professor Nigel Bosch.

Questions? Contact Nikolaus Parulian.

This event is sponsored by Professor Bertram Ludäscher (iSchool).