Lan Li's Dissertation Defense
PhD candidate Lan Li will present her dissertation defense, “Transparent, Reusable, and Purpose-Driven Data Cleaning.” Li’s dissertation committee includes Professor Bertram Ludäscher (Chair), Professor Allen Renear, Professor Vetle I. Torvik, and Teaching Assistant Professor Craig Willis.
Abstract
High-quality data, defined as data that is fit for use, plays a crucial role in decision-making and supports reliable analysis. By fitness for use, we mean that data is well-structured, contextually relevant, and prepared to support a given data analysis purpose. However, real-world datasets frequently contain data quality problems, including inaccurate values (e.g., typos or outliers), missing entries, inconsistencies in representation or format, and duplicate records. Data cleaning, which involves detecting and correcting such issues during data preparation, significantly impacts the quality of downstream analysis and is therefore a vital part of the data curation process in data science.
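The kinds of data quality problems listed above can be made concrete with a small sketch. This toy example (not drawn from the dissertation) profiles a tiny table for duplicates, missing entries, and inconsistently formatted values; the record contents and function names are purely illustrative:

```python
# Toy records exhibiting the problems described above: a duplicate
# record, a missing entry, and an inconsistent year format.
records = [
    {"name": "Ada Lovelace", "year": "1815"},
    {"name": "Ada Lovelace", "year": "1815"},    # duplicate record
    {"name": "Alan Turing",  "year": None},      # missing entry
    {"name": "Grace Hopper", "year": "1906-12"}, # inconsistent format
]

def profile(rows):
    """Count duplicates, missing values, and malformed year strings."""
    seen, dupes, missing, malformed = set(), 0, 0, 0
    for r in rows:
        key = (r["name"], r["year"])
        if key in seen:
            dupes += 1
        seen.add(key)
        if r["year"] is None:
            missing += 1
        elif not r["year"].isdigit():
            malformed += 1
    return {"duplicates": dupes, "missing": missing, "malformed": malformed}
```

Profiling of this kind is typically the first step of data cleaning: issues must be detected before they can be corrected.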
Despite its importance, data cleaning remains one of the most time-consuming and error-prone stages of the data pipeline. If existing errors are left unresolved or new errors are introduced during this phase, they can propagate through the pipeline and directly affect the reliability of downstream analysis. A promising strategy to mitigate these risks is to enhance transparency and reusability by documenting data cleaning activities as workflows, which record the sequence of operations applied to a dataset. These workflows provide a structured foundation for understanding and auditing the process, while supporting the generalization and reuse of cleaning operations across datasets.
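The idea of documenting cleaning activities as a workflow can be sketched minimally: a "recipe" is an ordered list of operations recorded as data, so the same sequence can be replayed, audited, or reused on another dataset. The operation names below are illustrative and do not reflect any particular tool's operation vocabulary:

```python
# A recorded cleaning recipe: an ordered sequence of operations stored
# as data rather than performed ad hoc, so it can be inspected and
# reapplied. Operation names here are hypothetical.
recipe = [
    {"op": "trim_whitespace", "column": "name"},
    {"op": "to_lowercase",    "column": "name"},
]

OPS = {
    "trim_whitespace": lambda v: v.strip(),
    "to_lowercase":    lambda v: v.lower(),
}

def replay(rows, recipe):
    """Apply each recorded operation in sequence, returning new rows."""
    out = [dict(r) for r in rows]
    for step in recipe:
        fn, col = OPS[step["op"]], step["column"]
        for r in out:
            r[col] = fn(r[col])
    return out
```

Because the recipe is data, `replay` can be run unchanged on a different dataset with the same schema, which is exactly the reuse property the workflow representation is meant to enable.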
However, workflows produced by interactive data cleaning tools such as OpenRefine present several limitations. First, the provenance information recorded in these workflows is often incomplete, limiting transparency in understanding how data cleaning operations relate to and impact the dataset. Second, there is no principled framework to assist researchers in determining which components of a data cleaning recipe can be reused and how to apply them effectively.
This thesis proposes models, frameworks, and tools to advance transparency and reusability in human-curated data cleaning workflows, and demonstrates how these insights can inform and support automated data cleaning approaches. The goal is to automatically generate high-quality, purpose-driven workflows using foundation models such as pretrained language models (PLMs) and large language models (LLMs).
To improve transparency, we propose provenance models and tools that capture how operations transform data and enable reuse at the subworkflow level. These models provide insight into both the structural and semantic aspects of data cleaning workflows, forming a foundation for automation. In the context of PLM-based entity resolution, provenance tracking helps identify where and how injected knowledge influences the final matching results. As a result, the decision-making process becomes more transparent and explainable.
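One way to picture how provenance makes a workflow transparent is cell-level change logging: each applied operation records which cells it altered, so any final value can be traced back through the operations that produced it. This is a minimal sketch under assumed names, not the provenance models proposed in the thesis:

```python
# Hedged sketch of cell-level provenance capture: alongside the cleaned
# rows, build a log of (step, row, column, before, after) entries for
# every cell an operation changed. All names are illustrative.
def apply_with_provenance(rows, recipe, ops):
    out, log = [dict(r) for r in rows], []
    for i, step in enumerate(recipe):
        fn, col = ops[step["op"]], step["column"]
        for j, r in enumerate(out):
            old = r[col]
            new = fn(old)
            if new != old:
                log.append({"step": i, "row": j, "column": col,
                            "before": old, "after": new})
            r[col] = new
    return out, log

ops = {"trim": lambda v: v.strip()}
rows, log = apply_with_provenance(
    [{"name": " Ada"}, {"name": "Alan"}],
    [{"op": "trim", "column": "name"}],
    ops,
)
```

A log like this answers the auditing questions raised above: which operation touched which cell, and what the value was before and after.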
Reusability is examined at both the operation and purpose levels. At the operation level, a multilayered framework analyzes transformations across five dimensions: functionality, structure, semantics, schema, and scope. At the purpose level, a conceptual model links analytical goals to workflow structure, supporting a tool that leverages LLMs to automatically generate high-quality, purpose-driven workflows.
Finally, we present an end-to-end automated data cleaning framework that adapts the purpose-driven conceptual model to improve both transparency and reusability. More importantly, it addresses key open challenges in applying LLMs to data cleaning and contributes benchmark datasets to support future research.
Questions? Contact Lan Li.