Data Analytics


Institute of Museum and Library Services

This project will create both a master’s and doctoral-level specialization in Socio-technical Data Analytics (SODA). Partnerships with local researchers and businesses who already work with large data sets will enable master's graduates to receive first-hand experience with both the social and technical implications of large digital data collections, and thus be well prepared for leadership roles in academic and corporate environments. Similarly, doctoral students will consider multiple stages of the information lifecycle, which will help to ensure that their research findings will generalize to a range of scholarly and business practices. Case studies from these partners will be incorporated into new courses that will initially be held on campus and will later be evolved to the School...

National Center for Supercomputing Applications

Assistant Professor Jana Diesner received a Faculty Fellowship and seed funding for her project, “Predictive Modeling for Impact Assessment,” from the National Center for Supercomputing Applications (NCSA). Diesner collaborates closely with NCSA scientists on the project, which builds on her work developing computational solutions to assess the impact of issue-focused information projects such as social justice documentaries and books. Her research team leverages big social data for this purpose and combines techniques from machine learning and natural language processing to identify a fine-grained set of impact factors from textual data sources such as news articles, reviews, and social media. This project aims to locate...

Ford Foundation

Films are produced, screened and perceived as part of a larger and continuously changing ecosystem that involves multiple stakeholders and themes. This project will measure the impact of social justice documentaries by capturing, modeling and analyzing the map of these stakeholders and themes in a systematic, scalable and analytically rigorous fashion. This solution will result in a validated, reusable and end-user-friendly methodology and technology that practitioners can use to assess the long-term impact of media productions beyond the number of people who have seen a screening or visited a webpage. Moreover, bringing the proposed computational methodology into a real-world application context can serve as a case study for demonstrating the usability of this cutting-edge solution...

Social Sciences and Humanities Research Council of Canada

Music prints and manuscripts created over the past thousand years sit on the shelves of libraries and museums around the globe. As these organizations digitize their collections, images of these scores are increasingly accessible online. However, the musical content remains difficult to search.

Google Books and HathiTrust have already made it possible to search the content of text documents through Optical Character Recognition (OCR), which transforms digital images of texts into a symbolic representation that can be searched by computers. For digital images of musical scores, the analogous technology is Optical Music Recognition (OMR).

The research team is working to improve OMR technology so that computers can recognize the musical symbols in these images, enabling us...


INDICATOR is a novel information system for collecting, integrating, and analyzing data from multiple sources to provide public health decision makers with real-time data on the health of their community. Data come from sources as varied as emergency department visits, school attendance, veterinary clinics, and social media postings, and together they have been used to change public policy in outbreak events.

Funding for this project was provided by the Carle Foundation, Centers for Disease Control and Prevention, and the U.S. Department of Agriculture.

National Science Foundation

Scholarly publications today are still mostly disconnected from the underlying data and code used to produce the published results and findings, despite an increasing recognition of the need to share all aspects of the research process. As data become more open and transportable, a second layer of research output has emerged, linking research publications to the associated data, possibly along with its provenance. This trend is rapidly followed by a new third layer: communicating the process of inquiry itself by sharing a complete computational narrative that links method descriptions with executable code and data, thereby introducing a new era of reproducible science and accelerated knowledge discovery. In the Whole Tale (WT) project, all of these components are linked and accessible...

National Science Foundation

Catastrophic events such as Fukushima and Katrina have made it clear that integrating physical and social causes of failure into a cohesive modeling framework is critical in order to prevent complex technological accidents and to maintain public safety and health. In this research, experts in Probabilistic Risk Assessment (PRA), Organizational Behavior and Information Science and Data Analytics disciplines collaborate to provide answers to the following key questions: what social and organizational factors affect technical system risk; how and why do these factors influence risk; and how much do they contribute to risk? In addition to scientific contributions to organizational science, PRA, and data analytics, this research provides regulatory and industry decision-makers with...

Korea Institute of Science and Technology Information

How do limitations and a lack of transparency in data quality and data provenance bias research outcomes, and how can we detect and mitigate these limitations? For example, we have been investigating the impact of entity resolution errors on network analysis results. We found that commonly reported network metrics and derived implications can strongly deviate from the truth (as established based on gold standard data or approximations thereof), depending on the effort dedicated to entity resolution.
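As a rough illustration of how entity resolution errors can distort network metrics, the following Python sketch builds a small co-authorship network twice: once treating every full name as a distinct person, and once merging names that share a first initial and surname. The names, the records, and the helper functions `build_degree`, `resolve_exact`, and `resolve_by_initial` are hypothetical and are not taken from the project; the sketch only shows the general effect under these assumptions.

```python
from collections import defaultdict


def build_degree(edges, resolve):
    """Return a node -> degree mapping after passing each raw name through `resolve`."""
    neighbors = defaultdict(set)
    for u, v in edges:
        ru, rv = resolve(u), resolve(v)
        if ru != rv:  # ignore self-loops that a merge might create
            neighbors[ru].add(rv)
            neighbors[rv].add(ru)
    return {node: len(adj) for node, adj in neighbors.items()}


def resolve_exact(name):
    """Assumed ground truth for this toy example: each full name is one person."""
    return name


def resolve_by_initial(name):
    """Error-prone resolution: collapse names to first initial plus surname."""
    first, last = name.split()
    return f"{first[0]}. {last}"


# Hypothetical co-authorship records (illustrative data only).
edges = [
    ("Jing Wang", "A. Smith"),
    ("Jun Wang", "B. Jones"),
    ("Jing Wang", "C. Lee"),
    ("Jun Wang", "D. Kim"),
]

gold_degree = build_degree(edges, resolve_exact)
merged_degree = build_degree(edges, resolve_by_initial)

print("nodes:", len(gold_degree), "vs", len(merged_degree))
print("max degree:", max(gold_degree.values()), "vs", max(merged_degree.values()))
```

In this toy case, wrongly merging two authors removes a node from the network and doubles the highest degree, which is the kind of deviation from gold standard data described above.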


How can we use user-generated content to construct, infer or refine network data? We have been tackling this problem by leveraging communication content produced and disseminated in social networks to enhance graph data. For example, we have used domain-adjusted sentiment analysis to label graphs with valence values in order to enable triadic balance assessment. The resulting method enables fast and systematic sign detection, eliminates the need for surveys or manual link labeling, and reduces issues with leveraging user-generated (meta)data.
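A minimal sketch of triadic balance assessment on a sign-labeled graph might look as follows. It assumes that a sentiment step has already produced a valence score per edge; the scores, the zero threshold, and the function names `sign_from_valence` and `triad_balance` are illustrative assumptions, not the project's actual method. A closed triad counts as balanced when the product of its three edge signs is positive.

```python
from itertools import combinations


def sign_from_valence(score, threshold=0.0):
    """Map a valence score to an edge sign (+1 or -1); the threshold is an assumption."""
    return 1 if score > threshold else -1


def triad_balance(signed_edges):
    """Split all closed triads of an undirected signed graph into balanced and unbalanced.

    `signed_edges` maps frozenset({u, v}) to +1 or -1.
    """
    nodes = sorted({node for edge in signed_edges for node in edge})
    balanced, unbalanced = [], []
    for a, b, c in combinations(nodes, 3):
        pairs = [frozenset((a, b)), frozenset((b, c)), frozenset((a, c))]
        if all(pair in signed_edges for pair in pairs):
            product = 1
            for pair in pairs:
                product *= signed_edges[pair]
            (balanced if product > 0 else unbalanced).append((a, b, c))
    return balanced, unbalanced


# Hypothetical valence scores per edge, standing in for sentiment analysis output.
valence = {
    ("ann", "bob"): 0.8,
    ("bob", "cat"): -0.6,
    ("ann", "cat"): -0.4,
    ("cat", "dan"): 0.5,
    ("bob", "dan"): 0.3,
}
signed_edges = {frozenset(pair): sign_from_valence(s) for pair, s in valence.items()}

balanced, unbalanced = triad_balance(signed_edges)
print("balanced:", balanced)
print("unbalanced:", unbalanced)
```

Enumerating all node triples is cubic in the number of nodes, so this form is only suitable for small graphs; a larger network would need a neighbor-based triangle enumeration.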

National Science Foundation

The yt project aims to produce an integrated science environment for collaboratively asking and answering astrophysical questions. To do so, it will encompass the creation of initial conditions, the execution of simulations, and the detailed exploration and visualization of the resultant data. It will also provide a standard framework, based on physical quantities, for interoperability between codes.

Development of yt is driven by a commitment to Open Science principles as manifested in participatory development, reproducibility, documented and approachable code, a friendly and helpful community of users and developers, and Free and Libre Open Source Software.

National Science Foundation

This project will develop a mobile sensor technology for performing detection and identification of viral and bacterial pathogens. By means of a smartphone-based detection instrument, the results are shared with a cloud-based data management service that will enable physicians to rapidly visualize the geographical and temporal spread of infectious disease. When deployed by a community of medical users (such as veterinarians or point-of-care clinicians), the PathTracker system will enable rapid determination and reporting of instances of infectious disease that can inform treatment and quarantine responses that are currently not possible with tests performed at central laboratory facilities. 

Immediate uses for the technology are for diagnosis of viral infection in human...

Korea Institute of Science and Technology Information

The project team will work on extracting key concepts from scholarly publications and explore techniques for building a taxonomy of extracted concepts by leveraging open knowledge bases (e.g., Wikipedia). The outcome of this process will be evaluated for various science and technology knowledge platform-based analysis services. The techniques, which reduce semantic ambiguity, will analyze conceptual novelty and expertise of researchers / research institutes across time, leading to a better understanding of the evolution of scientific domains in a scholarly community. The research will also lead to the development of open source tools to allow this research work to be replicated.


Jul. 13, 2018

Members of the Diesner research group will present a paper and posters at the 4th Annual International Conference on Computational Social Science (IC2S2), which will be held July 13-15 at Northwestern University. Assistant Professor Jana Diesner is a program committee co-chair for the conference. IC2S2 brings together academic researchers, industry experts, open data activists, and government agency workers to explore challenges, methods, and research questions in the field of computational social science.

Doctoral student Shubhanshu Mishra will present a poster, “Construction of Hierarchical Subject Headings for Computer Science and Their Application to Studying Temporal Trends in Scholarly Literature,” which he coauthored with Hyejin Lee of the Korea Institute of Science and Technology Information; Jinseok Kim (PhD ’17), research assistant professor at the University...

Jul. 10, 2018

Assistant Professor and PhD Program Director Jana Diesner has been named a 2018-2019 Linowes Fellow by the Cline Center for Advanced Social Research at the University of Illinois. The fellowship "provides exceptionally promising tenure-stream faculty with opportunities for innovation and discovery using the Cline Center's data holdings and/or analytic tools."

Diesner will work on her project, "Using Natural Language Processing to Measure and Understand the Description of Hurricanes Depending on the Gender of Storm Names and Geo-Location of Reporting," which is a collaboration with Sharon Shavitt, professor of business administration and psychology at Illinois; Kiju Jung, senior lecturer at the University of Sydney; Ly Dinh, PhD student at the iSchool; and the Cline Center. Previous work by Jung, Shavitt, Viswanathan, and Hilbe has shown that severe hurricanes assigned female names have caused more deaths than those assigned male names. Diesner's project involves using...

Jul. 9, 2018

Assistant Professor Matthew Turk will present the Whole Tale research project at the 17th annual Scientific Computing with Python conference (SciPy 2018), which will be held July 9-15 in Austin, Texas. The conference brings together participants from industry, academia, and government for tutorials, talks, and developer sprints.

Turk will give the talk, "Sneaking Data into Containers with the Whole Tale," with Kacper Kowalik, a research scientist at the National Center for Supercomputing Applications (NCSA). The goal of the Whole Tale research project is to enable researchers to examine, transform, and then seamlessly republish research data, creating "living articles" that will lead to new discovery by allowing researchers to construct representations and syntheses of data. 

"In this talk, we'll describe how the project leverages existing...

Jul. 6, 2018

Professor and Center for Informatics Research in Science and Scholarship (CIRSS) Director Bertram Ludäscher will be the keynote speaker for the 7th International Provenance and Annotation Workshop (IPAW) during ProvenanceWeek 2018, which will be held July 9-13 at King's College in London. While provenance information has long been recognized as crucial metadata in the information sciences, provenance research has become increasingly important in computer science as well.

During ProvenanceWeek, researchers from computer science and related disciplines will participate in the two main events, i.e., the biennial IPAW and the annual TaPP (Theory and Practice of Provenance) workshop, and in affiliated events that focus on novel directions for provenance.

In his opening keynote at IPAW, "From Workflows to Provenance and Reproducibility: Looking...

Jul. 3, 2018

Doctoral student Shubhanshu Mishra will present his research at the 29th ACM Conference on Hypertext and Social Media, which will be held July 9-12 in Baltimore, Maryland. The conference will focus on the role of links, linking, hypertext, and hyperlink theory on the web and beyond.

Mishra will give the talk, "Detecting the correlation between sentiment and user-level as well as text-level meta-data from benchmark corpora," which he coauthored with Assistant Professor Jana Diesner. Their study examined whether users with similar Twitter characteristics have similar sentiments and what meta-data features of tweets and users correlate with tweet sentiment. 

From the abstract: We address these two questions by analyzing six popular benchmark datasets where tweets are annotated with sentiment labels. We consider user-level as well as tweet-...

Jun. 25, 2018

Assistant Professor and PhD Program Director Jana Diesner is a sessions chair and presenter for the 38th Sunbelt Conference, which will be held on June 26-July 1 in Utrecht, the Netherlands. The conference provides a venue for social scientists, mathematicians, computer scientists, ethnologists, epidemiologists, public health experts, and others to present current work in the area of social networks.

Diesner co-organized the "Words and Networks" sessions, which are dedicated to innovative research at the intersection of text analysis (including discourse analysis, content analysis, text mining, and natural language processing) and network analysis. Previous research has shown that without considering the content of text data for certain types of network analysis—e.g., when studying communication networks and social media...

Jun. 1, 2018

PhD student Yingjun Guan will present his research at the Conference on Statistical Learning and Data Science / Nonparametric Statistics, which will be held June 4-6 at Columbia University. The conference brings together researchers in statistical machine learning and data mining from academia, industry, and government to discuss topics such as big data analytics, classification, learning theory, network analysis, and signal and image processing. The Department of Statistics at the University of Illinois is a cosponsor of the event.

Guan will present his poster, "IMDB Review Mining and Movie Recommendation," in which he examines how movie- and user-related information extracted from the Internet Movie Database (IMDB) can be used to predict movie ratings. 

"Our project plans to benefit both the movie production corporations and the audience," Guan explained. "Data mining and supervised machine...