
Automated generation of gene summaries at the Alliance of Genome Resources

Kishore, Ranjana and Arnaboldi, Valerio and Van Slyke, Ceri E. and Chan, Juancarlos and Nash, Robert S. and Urbano, Jose M. and Dolan, Mary E. and Engel, Stacia R. and Shimoyama, Mary and Sternberg, Paul W. (2020) Automated generation of gene summaries at the Alliance of Genome Resources. Database : The Journal of Biological Databases and Curation, 2020 . Art. No. baaa037. ISSN 1758-0463. PMCID PMC7304461. doi:10.1093/database/baaa037.

PDF - Published Version
Creative Commons Attribution.


Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
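
The grouping-and-template idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy ontology, the greedy selection rule, the `min_ic` and `max_terms` parameters, and the sentence wording are all assumptions made for illustration. Information content is taken here as the negative log of the fraction of ontology terms that fall under a given term, which is one common definition.

```python
import math

# Toy ontology (hypothetical terms, not real Gene Ontology data): child -> parents.
PARENTS = {
    "process": [],
    "signaling": ["process"],
    "metabolism": ["process"],
    "Wnt signaling": ["signaling"],
    "Notch signaling": ["signaling"],
    "lipid metabolism": ["metabolism"],
}

def ancestors(term):
    """Return the term itself plus every ancestor reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(term):
    """IC = -log(p), where p is the share of ontology terms at or below `term`."""
    covered = sum(1 for t in PARENTS if term in ancestors(t))
    return -math.log(covered / len(PARENTS))

def group_terms(terms, max_terms=2, min_ic=0.5):
    """Greedily replace terms with common ancestors that are still informative."""
    unique = sorted(set(terms))
    if len(unique) <= max_terms:
        return unique  # short lists are left untouched
    remaining = set(unique)
    chosen = []
    while remaining:
        # Map each admissible ancestor to the remaining terms it would cover.
        candidates = {}
        for t in remaining:
            for a in ancestors(t):
                if a == t or information_content(a) >= min_ic:
                    candidates.setdefault(a, set()).add(t)
        # Prefer wider coverage, then higher information content, then name.
        best = max(candidates,
                   key=lambda a: (len(candidates[a]), information_content(a), a))
        chosen.append(best)
        remaining -= candidates[best]
    return sorted(chosen)

def summarize(gene, terms):
    """Fill a hypothetical sentence template with the grouped terms."""
    grouped = group_terms(terms)
    if not grouped:
        return ""
    listing = grouped[0] if len(grouped) == 1 else (
        ", ".join(grouped[:-1]) + " and " + grouped[-1])
    return f"{gene} is involved in {listing}."

print(summarize("geneX", ["Wnt signaling", "Notch signaling", "lipid metabolism"]))
```

In this sketch the two signaling terms collapse into their shared parent, while the root term is excluded because its information content is zero. A production system would need real ontology graphs and annotation-based frequencies rather than the term counts used here.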

Item Type:Article
ORCID:
Arnaboldi, Valerio: 0000-0002-2563-5374
Van Slyke, Ceri E.: 0000-0002-2244-7917
Chan, Juancarlos: 0000-0002-7259-8107
Dolan, Mary E.: 0000-0001-7732-3295
Engel, Stacia R.: 0000-0001-5472-917X
Shimoyama, Mary: 0000-0003-1176-0796
Sternberg, Paul W.: 0000-0002-7699-0173
Additional Information:© The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Received 14 February 2020; Revised 6 April 2020; Accepted 29 April 2020. We thank Daniela Raciti, Constance M. Smith, Doug G. Howe and Sian Gramates for their suggestions for generating gene expression summaries. We thank Susan M. Bello, Yvonne M. Bradford and Jennifer Smith for their help in generating disease relevance summaries. We also thank Kimberly Van Auken, Peter D’Eustachio, Yvonne Bradford, Nomi Harris and Chris Mungall for discussions related to the project and for comments and suggestions on the manuscript. Special thanks to all the Alliance developers and technical staff for helping integrate the gene summaries software into the Alliance portal. Funding: National Institutes of Health/National Human Genome Research Institute grant (U24HG002223-19S1, 1U24HG010859); National Institutes of Health/National Human Genome Research Institute grants (P41HG002659 (ZFIN), U24HG002223 (WB), U41HG000739 (FB), U41HG001315 (SGD), HG000330 (MGD), U41HG002273 (GOC)); National Institutes of Health/National Heart, Lung and Blood Institute (HL64541 to R.G.D.); Medical Research Council-UK (MR/L001020/1 (WB)). Funding for open access charge: National Institutes of Health/National Human Genome Research Institute grant (1U24HG010859-01).
Funding Agency: Medical Research Council (UK)
Grant Number: MR/L001020/1
PubMed Central ID:PMC7304461
Record Number:CaltechAUTHORS:20200624-104212433
Persistent URL:
Official Citation:Ranjana Kishore, Valerio Arnaboldi, Ceri E Van Slyke, Juancarlos Chan, Robert S Nash, Jose M Urbano, Mary E Dolan, Stacia R Engel, Mary Shimoyama, Paul W Sternberg, the Alliance of Genome Resources, Automated generation of gene summaries at the Alliance of Genome Resources, Database, Volume 2020, 2020, baaa037.
Usage Policy:No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code:104002
Deposited By: George Porter
Deposited On:24 Jun 2020 19:40
Last Modified:16 Nov 2021 18:27
