Stodden proposes guide for developing common data science approaches

The use of data science tools in research across campuses has exploded, from engineering and science to the humanities and social sciences. But there is no established data science discipline and no recognized way for academic fields to develop accepted data science processes and integrate them into research.

Associate Professor Victoria Stodden has proposed a framework for guiding researchers and curriculum development in data science and for aiding policy and funding decisions. She outlines the approach in the journal Communications of the ACM.

Stodden has studied issues of reproducibility of research findings for more than a decade. Now, the widespread use of computational tools for research has initiated discussions about transparency, bias, ethics and other topics. These ideas are broader than any particular field, and researchers from different fields need a common framework for how to approach and talk about them, she said.

Stodden said her approach will help define data science as a scientific discipline in its own right; provide a way to have a common conversation across disciplines; encourage the development of data-driven research methods and the training of researchers and scientists in them; help researchers agree on the most important issues in the emerging field of data science; and help consumers of computational research understand how the results were produced.

"I'm hoping it's a way to unify the conversations going on now–to help them evolve and share knowledge in a way to leverage and learn from what other people are doing–and talk about what's going on across different disciplines," Stodden said.

The framework helps identify which issues can be generalized across disciplines and which are specific to a discipline, she said.

Stodden's proposal builds on the concept of the data life cycle, which information scientists use to describe the various stages of a dataset. Her data science life cycle looks not only at datasets but also at the tools of computational research, such as computer code and software, as well as the research findings.

The data science life cycle would allow researchers to look at the computational research process from data collection through analysis, validation and dissemination to, ultimately, how the research findings are used in public policy discussions, she said. It would bring into the conversation concepts of transparency, reproducibility of results, how results are interpreted, potential bias and ethics.

An example of the data science life cycle, which describes the stages of data science research. Courtesy of Victoria Stodden.

"It's a framework for how to bring all these different topics together and think about what it means to have a field of data science," Stodden said. "With more strategic thinking about what data science means, and what it means to leverage these tools, we will be doing better science."

The data science life cycle recognizes the need for preserving data, software and computational information and making them widely available after results are published, allowing for reproducibility.

Her approach will also help guide the development of a data science curriculum, she said, providing a way to see where existing courses fit and where new courses may need to be developed.

"For a student seeking to do advanced coursework in data science, it can appear that statistics is not computational enough, computer science isn’t data inference-focused enough, information science is too broad, and the domain sciences don’t provide a broad enough pedagogical agenda in data science," she wrote.

