CFG Seminar by Nikolaus Parulian: Collaborative Data Cleaning

Data cleaning and preparation are essential parts of data curation lifecycles and scientific workflows. Exploratory data mining and data cleaning are often estimated to take up about 80% of the effort in a scientific research pipeline. For a single user, however, a data cleaning task can be tedious and error-prone: it involves a great deal of exploration and iteration, especially when the curator finds many different problems in the dataset.

Single-user data cleaning can also introduce bias, since the cleaning quality will only be as good as the curator's knowledge. Collaboration offers one response: we can define it as assigning a data-cleaning task to multiple data curators who work on the same dataset toward the same purpose. However, involving multiple users can introduce new problems, such as planning and dividing tasks, disagreements over data changes, and conflicting process dependencies.

Understanding these variations and analyzing the combined workflow are important for data curation, both to evolve the data cleaning workflow and to improve the dataset's quality. We hope that the model and framework we will discuss can help improve the data-cleaning pipeline through collaboration.

Questions? Contact Lan Li or Liri Fang.

This event is sponsored by Conceptual Foundation Groups.