
Classification of Astrophysics Journal Articles with Machine Learning to Identify Data for NED

Chen, Tracy X. and Ebert, Rick and Mazzarella, Joseph M. and Frayer, Cren and Terek, Scott and Chan, Ben H. P. and Cook, David and Lo, Tak and Schmitz, Marion and Wu, Xiuqin (2022) Classification of Astrophysics Journal Articles with Machine Learning to Identify Data for NED. Publications of the Astronomical Society of the Pacific, 134 (1031). Art. No. 014501. ISSN 0004-6280. doi:10.1088/1538-3873/ac3c36.

PDF - Accepted Version
Creative Commons Attribution Non-commercial Share Alike.



The NASA/IPAC Extragalactic Database (NED) is a comprehensive online service that combines fundamental multi-wavelength information for known objects beyond the Milky Way and provides value-added, derived quantities and tools to search and access the data. The contents and relationships between measurements in the database are continuously augmented and revised to stay current with the astrophysics literature and new sky surveys. The conventional process of distilling and extracting data from the literature requires human experts to review journal articles and determine whether an article is extragalactic in nature and, if so, what types of data it contains. This is both labor-intensive and unsustainable, especially given the ever-increasing number of publications each year. We present here a machine learning (ML) approach developed and integrated into the NED production pipeline to help automate the classification of journal article topics and their data content for inclusion in NED. We show that this ML application can reproduce the classifications of a human expert with an accuracy of over 90% in a fraction of the time it takes a human, allowing us to focus human expertise on tasks that are more difficult to automate.
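
The abstract describes classifying articles by topic and measuring agreement with a human expert. The following is only an illustrative sketch of that general idea, not the authors' pipeline: the record lists Stanford Classifier v3.9.2 as the software used, whereas this example substitutes a small bag-of-words naive Bayes classifier written from scratch. The labels ("extragalactic", "other"), the toy titles, and all function names are assumptions made up for the illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase whitespace tokenization; a real system would do far more.
    return text.lower().split()


def train(examples):
    """Count documents per label and words per label (multinomial naive Bayes)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, labeled):
    """Fraction of items where the prediction matches the human-assigned label."""
    hits = sum(predict(model, text) == label for text, label in labeled)
    return hits / len(labeled)


# Hypothetical toy data standing in for expert-labeled articles.
TRAIN = [
    ("redshift survey of distant galaxies", "extragalactic"),
    ("quasar host galaxies and active nuclei", "extragalactic"),
    ("orbital dynamics of main belt asteroids", "other"),
    ("solar wind measurements near mercury", "other"),
]
HELD_OUT = [
    ("photometry of galaxies at high redshift", "extragalactic"),
    ("asteroids perturbed by the solar system planets", "other"),
]

model = train(TRAIN)
print(accuracy(model, HELD_OUT))
```

The accuracy here is the share of held-out items on which the classifier agrees with the human label, which is the sense in which the abstract reports a figure of over 90%. The toy data are far too small to say anything about real performance.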

Item Type:Article
ORCID:
Chen, Tracy X.: 0000-0001-9152-6224
Ebert, Rick: 0000-0002-9500-8587
Mazzarella, Joseph M.: 0000-0002-8204-8619
Cook, David: 0000-0002-6877-7655
Schmitz, Marion: 0000-0002-2055-7549
Wu, Xiuqin: 0000-0002-4788-9236
Additional Information:© 2022. The Astronomical Society of the Pacific. Received 2021 July 27; accepted 2021 November 22; published 2022 January 11. We want to acknowledge the past and continuing work of the Natural Language Processing Group at Stanford University, without which this application would have been much more difficult. We are also grateful to the American Astronomical Society Journals and IOP Publishing, the Oxford University Press, and EDP Sciences for their support in providing extensive and on-going access to their publications. We thank the anonymous referee for the critical review and comments which resulted in substantial improvements to the article. This work was funded by the National Aeronautics and Space Administration through a cooperative agreement with the California Institute of Technology. Facilities: ADS, NED. Software: Stanford Classifier v3.9.2 (The Stanford Natural Language Processing Group).
Group:Infrared Processing and Analysis Center (IPAC)
Subject Keywords:Astronomy databases – Classification
Issue or Number:1031
Record Number:CaltechAUTHORS:20220119-572400000
Persistent URL:
Official Citation:Tracy X. Chen et al 2022 PASP 134 014501
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:112984
Deposited By: George Porter
Deposited On:20 Jan 2022 16:07
Last Modified:20 Jan 2022 16:07
