Annotation-free prediction of microbial dioxygen utilization
Abstract
Aerobes require dioxygen (O2) to grow; anaerobes do not. However, nearly all microbes—aerobes, anaerobes, and facultative organisms alike—express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes encoding distinctive O2-utilizing enzymes, for example. These effects permit high-quality prediction of O2 utilization from annotated genome sequences, with several models displaying ≈80% accuracy on a ternary classification task for which blind guessing is only 33% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genomic sequence content—e.g., triplets of amino acids—perform as well as intensive annotation-based classifiers, enabling rapid processing of genomes. We further show that amino acid trimers are useful because they encode information about protein composition and phylogeny. To showcase the utility of rapid prediction, we estimated the prevalence of aerobes and anaerobes in diverse natural environments cataloged in the Earth Microbiome Project. Focusing on a well-studied O2 gradient in the Black Sea, we found quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We, therefore, suggest that statistical methods like ours might be used to estimate, or “sense,” pivotal features of the chemical environment using DNA sequencing data.
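The amino acid trimer features mentioned above can be illustrated with a minimal sketch. It assumes protein sequences are already available as plain strings and is not the authors' pipeline. The function name `trimer_frequencies` and the example sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def trimer_frequencies(proteins):
    """Return the relative frequency of each amino acid trimer (3-mer).

    `proteins` is an iterable of protein sequences given as strings.
    Overlapping trimers are counted with a sliding window of width 3.
    """
    counts = Counter()
    for seq in proteins:
        seq = seq.upper()
        # Slide a window of three residues along the sequence.
        for i in range(len(seq) - 2):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Normalize so that features are comparable across genomes of different size.
    return {trimer: n / total for trimer, n in counts.items()}


# Hypothetical toy input: one short peptide.
features = trimer_frequencies(["MKVLA"])
```

A frequency vector of this kind could then be passed to any standard classifier; the choice of model and of normalization here is an assumption, not something stated in the abstract.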
Copyright and License
© 2024 Flamholz et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.
Data Availability
Source code is available at github.com/flamholz/annotation_free_dioxygen_utilization.
A provided script automates retrieval of data from the figshare repository at https://figshare.com/articles/dataset/Annotation-free_prediction_of_microbial_dioxygen_utilization/26065345
Contributions
Avi I. Flamholz and Joshua E. Goldford contributed equally to this article. The author order was decided alphabetically.
Files
Checksum | Size |
---|---|
md5:8173cac54924b1ed3ee4567573428666 | 1.2 MB |
md5:fe3356d64f1ed27c73380bbb05862bd4 | 1.2 MB |
Additional details
- PMID: 39230322
- Jane Coffin Childs Memorial Fund for Medical Research
- Gordon and Betty Moore Foundation: Physics of Living Systems Fellows, GBMF4513
- National Aeronautics and Space Administration: Interdisciplinary Consortia for Astrobiology Research, 80NSSC23K1357
- California Institute of Technology: Schmidt Scholars in Software Engineering
- Howard Hughes Medical Institute: Hanna Gray Fellow, GT16787
- National Institutes of Health: UCSD FIRST
- Resnick Sustainability Institute
- California Institute of Technology: Caltech Center for Evolutionary Sciences
- National Science Foundation: Navigating the New Arctic (NNA), 2127442
- United States Army Research Office: W911NF-22-2-0210
- National Science Foundation: PHY-1748958
- Accepted: 2024-06-18
- Available: 2024-09-04 (published online)
- Caltech groups: Division of Biology and Biological Engineering
- Publication Status: In Press