PhD candidate Xiaoliang Jiang successfully defended his dissertation, "Identifying Place Names in Scientific Writing Based on Language Models, Linked Data, and Metadata," on November 3.
Jiang's dissertation committee included Associate Professor Vetle Torvik (chair), Associate Professor Nigel Bosch, Professor J. Stephen Downie, and Assistant Professor Meicen Sun.
Abstract: Geographic information is crucial for understanding health, disease, and scientific activity, yet its potential has so far been only partially realized. Existing metadata, such as affiliations and MeSH terms, provide useful but incomplete coverage, and extracting place names directly from text remains difficult to scale with high accuracy. This dissertation presents a multi-stage framework for identifying and disambiguating geographic named entities in PubMed abstracts by integrating language embeddings, metadata, and linked external sources including cited metadata, MapAffil, and GeoNames. The system performs large-scale candidate generation, probabilistic classification, and hierarchical disambiguation to produce a dataset covering 18.8 million abstracts and over 25 million candidate mentions, each linked to MapAffil and GeoNames. Each mention receives three probabilities: one capturing linguistic evidence, one capturing metadata-based salience, and a final score that combines the two. On a manually curated gold standard, the combined score yields strong performance (precision 93.2%, recall 92.3%). The resulting dataset provides a benchmark for geographic NER and supports downstream applications in information retrieval, public health, and science-of-science research.