Xiaoliang Jiang's Dissertation Defense
PhD candidate Xiaoliang Jiang will defend his dissertation, “Identifying Place Names in Scientific Writing Based on Language Models, Linked Data, and Metadata.” Jiang’s dissertation committee includes Prof. Vetle Torvik (Chair), Prof. Nigel Bosch, Prof. Stephen Downie, and Prof. Meicen Sun.
Abstract
Geographic information is essential for understanding patterns of health, disease, and scientific knowledge, yet its potential has so far been only partially realized. In biomedical literature, existing metadata sources such as author affiliations and MeSH terms provide useful but incomplete coverage, while extracting place names directly from text remains difficult to scale with high accuracy. This dissertation introduces a multi-stage framework to identify and disambiguate geographic named entities in PubMed abstracts, leveraging language embeddings, metadata, and linked data sources including MapAffil, GeoNames, and MeSH. The framework integrates large-scale candidate generation with probabilistic classification and hierarchical disambiguation. The result is a comprehensive dataset encompassing 18.8 million biomedical abstracts and over 25 million candidate geographic mentions, each linked to both GeoNames and MapAffil. Each mention is assigned two complementary probabilities: one estimating whether the term refers to a place based on sentence-level linguistic evidence, and another capturing its salience within context, reflecting consistency with metadata, linked data, and cited information. On a manually curated gold standard, combining these signals yields high performance (precision 93.2%, recall 92.3%), while the overall dataset achieves 99.4% coverage of geographic mentions in PubMed2018. The dataset serves as a benchmark for machine learning, providing training and annotation resources for geographic named entity recognition and related tasks, and further enables downstream applications in information retrieval, public health surveillance, and the science of science. The dataset is available from the Illinois Data Bank.
Questions? Contact Xiaoliang Jiang.