Trustworthy Computational Science Speaker Series: George Alter

George Alter, research professor emeritus in the Institute for Social Research at the University of Michigan, will present "SDTL and SDTH: Machine Actionable Descriptions of Data Transformations."

Abstract: Realizing the promise of research transparency and the FAIR principles (Wilkinson et al., 2016) requires provenance metadata, i.e., documentation of the origins, contents, and meaning of data. SDTL (Structured Data Transformation Language) and SDTH (Structured Data Transformation History) provide machine-actionable metadata about scripts used to process and transform statistical data in languages such as SPSS, SAS, Stata, R, and Python. Unlike most data provenance models, SDTL and SDTH document individual commands within these scripts, rather than treating scripts as ‘black boxes’ described only by their inputs and outputs. SDTL was created to work with metadata standards, such as the Data Documentation Initiative (DDI) and the Ecological Metadata Language (EML), so that descriptions of data transformations can be integrated into data production workflows. Since SDTL is structured in formats like JSON and XML, it can also serve as an intermediate language for translation between other languages. SDTH extends the W3C PROV model to facilitate basic queries about the origins and effects of variables, dataframes, and files in data transformation scripts.

George Alter is research professor emeritus in the Institute for Social Research at the University of Michigan. His research integrates theory and methods from demography, economics, and family history with historical sources to understand demographic behaviors in the past. From 2007 to 2016, Alter was director of the Inter-university Consortium for Political and Social Research, the world's largest archive of social science data. He has been active in international efforts to promote research transparency, data sharing, and secure access to confidential research data, and he has worked on new metadata standards that improve the reusability and interoperability of research data. His current projects aim to automate the capture of metadata from statistical analysis software, compare fertility transitions in contemporary and historical populations, and create a FAIR vocabulary of terms used in population research.

This series, open to the public, is hosted by the Center for Informatics Research in Science and Scholarship (CIRSS). For the Spring 2024 schedule and access to previous talks, visit the Trustworthy Computational Science website. If you are interested in this speaker series, please subscribe to our speaker series calendar: Google Calendar or Outlook Calendar.

Questions? Contact Janet Eke.

This event is sponsored by the Center for Informatics Research in Science and Scholarship.