Machine learning models can identify individuals based on a resident oral bacteriophage family
Gita Mahmoudabadi 1*, Kelsey Homyk 2, Adam B. Catching 3, Ana Mahmoudabadi 4, Helen Bermudez Foley 5, Arbel D. Tadmor 6 and Rob Phillips 7
1 Department of Bioengineering, Stanford University, Stanford, CA, United States
2 Genentech Inc., San Francisco, CA, United States
3 Biophysics, National Institute of Allergy and Infectious Diseases, Bethesda, MD, United States
4 WellStar Kennestone Hospital, Marietta, GA, United States
5 Department of Environmental Health, University of Southern California, Los Angeles, CA, United States
6 Personalized Computational Genomics, Translationale Onkologie an der Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Mainz, Germany
7 Departments of Biophysics and Biology, California Institute of Technology, Pasadena, CA, United States
Metagenomic studies have revolutionized the study of novel phages. However, these studies trade depth of coverage for breadth. We show that the targeted sequencing of a small region of a phage terminase family can provide sufficient sequence diversity to serve as an individual-specific barcode or a "phageprint", defined as the relative abundance profile of the variants within a terminase family. By collecting ~700 oral samples from ~100 individuals living on multiple continents, we found a consistent trend wherein each individual harbors one or two dominant variants that coexist with numerous low-abundance variants. By tracking phageprints over the span of a month across ten individuals, we observed that phageprints were generally stable, and found instances of concordant temporal fluctuations of variants shared between partners. To quantify these patterns further, we built machine learning models that, with high precision and recall, distinguished individuals even when we eliminated the most abundant variants and further downsampled phageprints to 2% of the remaining variants. Except between partners, phageprints are dissimilar between individuals, and neither country-of-residence, genetics, diet nor cohabitation seem to play a role in the relatedness of phageprints across individuals. By sampling from six different oral sites, we were able to study the impact of millimeters to a few centimeters of separation on an individual's phageprint and found that such limited spatial separation results in site-specific phageprints.
KEYWORDS
virus, metagenomics, forensics, machine learning, virome, oral microbiome, phages, terminase
Frontiers in
Microbiomes
frontiersin.org
01
OPEN ACCESS
EDITED BY
Jesús Muñoz-Rojas,
Meritorious Autonomous
University of Puebla, Mexico
REVIEWED BY
Liliana Lopez Pliego,
Meritorious Autonomous
University of Puebla, Mexico
Alma Rosa Netzahuatl-Muñoz,
University of Tlaxcala, Mexico
*CORRESPONDENCE
Gita Mahmoudabadi
gitam@stanford.edu
RECEIVED
27 March 2024
ACCEPTED
17 July 2024
PUBLISHED
03 September 2024
CITATION
Mahmoudabadi G, Homyk K, Catching AB, Mahmoudabadi A, Foley HB, Tadmor AD and Phillips R (2024) Machine learning models can identify individuals based on a resident oral bacteriophage family. Front. Microbiomes 3:1408203. doi: 10.3389/frmbi.2024.1408203
COPYRIGHT
© 2024 Mahmoudabadi, Homyk, Catching,
Mahmoudabadi, Foley, Tadmor and Phillips.
This is an open-access article distributed under
the terms of the
Creative Commons Attribution
License (CC BY).
The use, distribution or
reproduction in other forums is permitted,
provided the original author(s) and the
copyright owner(s) are credited and that the
original publication in this journal is cited, in
accordance with accepted academic
practice. No use, distribution or reproduction
is permitted which does not comply with
these terms.
TYPE
Original Research
PUBLISHED
03 September 2024
DOI
10.3389/frmbi.2024.1408203
Introduction
Viruses of bacteria, or phages, are among the most numerous and diverse biological entities on our planet. They play important roles as regulators of microbial ecosystems through rapid infection cycles and gene transfer events (Roux et al., 2016; Touchon et al., 2017; Gregory et al., 2019). Yet, compared to their bacterial hosts, and despite their proven potential to transform fields such as medicine, agriculture and biotechnology (Szafrański et al., 2017; Svircev et al., 2018; Kortright et al., 2019; Sieiro et al., 2020; Duan et al., 2022), phages remain some of the least studied members of the human microbiome (Shkoporov and Hill, 2019; Guerin and Hill, 2020). Even across familiar habitats such as the human body, the identity of phages and their corresponding bacterial hosts, their population structure, their modes of transfer between habitats, their co-evolutionary history with bacterial and human hosts, their role in health and disease, and other important topics remain relatively unexplored.
We chose to study phages residing in the human mouth as it represents a multifaceted and medically important ecosystem. Studies have revealed phages as highly abundant members of the human oral cavity, with distinct communities at sites of disease, capable of augmenting the bacterial arsenal of pathogenic genes (Roberts and Mullany, 2010; Edlund et al., 2015; Santiago-Rodriguez et al., 2015; Martínez et al., 2021; Matrishin et al., 2023). These studies have relied on the shotgun metagenomic approach, in part because one of the defining features of viral genomes is the lack of a universally conserved sequence analogous to the 16S ribosomal RNA sequences in bacteria, which is used as a universal marker to draw conclusions about bacterial evolution and taxonomic classification (Woese et al., 1990; Yarza et al., 2014). This marker-based approach is indispensable to microbial ecology because it allows a high coverage depth of the 16S region, which in turn enables precise and reproducible depictions of bacterial community compositions (Caporaso et al., 2011; Proctor et al., 2018).
Using current sequencing platforms, the trade-off for coverage depth is typically the coverage breadth (Supplementary Figure S1). In comparison to the marker-based approach, shotgun metagenomics provides much greater breadth of coverage and offers several advantages, but it also suffers from key disadvantages. The coverage depth is often heterogeneous and remains comparatively low in these studies, meaning that the de novo assembly of genomes from complex environments remains a significant challenge (Yu et al., 2017; Johansen et al., 2022), even for abundant members with relatively short genome lengths (Dutilh et al., 2014; Meyer et al., 2022). Moreover, the genomes assembled through shotgun metagenomics are often consensus genomes or an average representation of similar genomes within an environment (Lapidus and Korobeynikov, 2021).
In contrast, the marker-based approach allows orders of magnitude greater coverage depth by focusing the reads on a small genomic segment, and thus provides a much higher resolution view of microbial communities. The targeted approach is therefore widely used to complement shotgun metagenomic depictions of bacterial communities (Costea et al., 2018; Rath et al., 2019). Because of their high mutation rates and rapid turnovers, viral genomes are incredibly diverse, and the sequence diversity within a virus family could be explored much more deeply through targeted sequencing. Even within a single "species", viral genomes exist as a collection of related variants, which are often described as "quasispecies" or as a "mutant spectrum". The mutant spectra of RNA viruses are well described in early and recent studies of RNA phages and RNA viruses, particularly for lab strains (Eigen, 1971; Weissmann et al., 1973; Domingo and Perales, 2019; Sun et al., 2021). DNA phages, on the other hand, are less studied within this framework, primarily because they have lower mutation rates compared to RNA phages (Domingo et al., 2012). Even less explored are the mutant spectra of DNA phages within a dynamic host environment.
As such, the overarching aim of this study was to apply targeted sequencing to understudied DNA phages in their native context, to explore their inter- and intra-personal diversity, their spatial patterns of distribution, as well as temporal dynamics in a large-scale and high-resolution fashion that allows for observing their individual variants as well as the collective mutant spectra. Thus, we first had to choose regions within phage genomes on which to perform targeted sequencing. While one could relatively easily target sequences of well characterized phages, we were motivated to create a roadmap for mining metagenomic datasets and shedding light on understudied phages.
Towards this goal, we first developed and benchmarked Metagenomic Clustering by Reference Library or MCRL, which is an algorithm for the identification of non-redundant gene families within a metagenome (Tadmor and Phillips, 2022). In a previous study, we then applied MCRL to oral metagenomes of seven individuals from two studies conducted on two different continents (Xie et al., 2010; Belda-Ferre et al., 2012; Tadmor et al., 2023). By focusing the search on the terminase (large subunit) gene families, we were able to narrow down the search from thousands of viral gene families to seven non-homologous terminase families that were shared across individuals in these two studies (Tadmor et al., 2023).
In the absence of a genomic taxonomy for viruses, we have referred to those phages that encode members of the same terminase family as members of the same phage family (Tadmor et al., 2023). This notation is predicated on previous studies, including our own (Mahmoudabadi and Phillips, 2018), that have shown no significant sequence similarity between terminase sequences of unrelated phages (Brüssow and Desiere, 2001; Wangchuk et al., 2021) as well as studies that have used the terminases to build phage phylogenetic trees (Al-Shayeb et al., 2020; Auslander et al., 2020). Moreover, we focused our search on terminases because they are among the most functionally conserved genes in double-stranded DNA phage genomes (Leavitt et al., 2013; Lokareddy et al., 2022). Unlike several other viral genes such as integrases and lysins, terminases lack bacterial homologs, and thus are considered to be unique to phages (Casjens, 2003). Additionally, we have previously successfully used terminases to probe phage-bacteria interactions within a complex host environment, namely the termite gut (Tadmor et al., 2011).
To test whether we were successful in identifying terminase families that were prevalent enough in the human phageome to be practical experimental targets, we searched for them across hundreds of metagenomic samples from the Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012a) spanning ~100 individuals and 18 body sites (Tadmor et al., 2023). Remarkably, we showed that despite the individual-specific nature of the human virome and the small number of individuals from which these terminase families were originally identified, they are prevalent across the HMP cohort. In this study we chose to focus on the HB1 and HA terminase families as they were the two most prevalent families, detected in most individuals within the HMP cohort (Tadmor et al., 2023). In the following paragraphs we summarize some of our earlier findings, particularly those pertinent to the HA and HB1 terminase families.
To identify the putative habitats of the phages encoding these terminase families, we searched through ~4000 environmental metagenomes from the IMG/VR (Paez-Espino et al., 2017) and IMG/M (Chen et al., 2019) databases comprising numerous distinct habitats, in addition to ~100 environmental metagenomes from the VIROME database (Wommack et al., 2012). Most terminase families were found to be largely human-associated, and in instances where remote homologs were found in environmental phages, the human-derived phage sequences were phylogenetically distinguishable from their environmental counterparts. Additionally, by examining various body sites, we showed that most terminase families were primarily localized to the human oral cavity. The HB1 terminase family was an exception in that it is also detected in the human gut, though we showed that the oral and the gut-derived HB1 terminase family members were phylogenetically distinct.
Through experiments where we separated the bacterial and viral fractions of oral samples, we were able to demonstrate that the HA phage family is likely lysogenic and infects various species of the Streptococcus genus, whereas the oral HB1 phage family is likely lytic, and its host species remains to be discovered. Moreover, we show the positions of the closest HA and HB1 terminase homologs in previously sequenced full phage genomes (Supplementary Figure S2). Additionally, through selection pressure analysis and alignment of functional motifs, we showed that HA- and HB1-encoding phages are likely functionally active members of the human oral virome. Finally, we designed primers to target these phage families using their respective terminase families within oral samples from nine individuals and showed that we could indeed reliably capture them experimentally. The primers for HA and HB1 are provided again in this study (Supplementary Table S1).
In this study, we target the HA and HB1 terminase families to obtain at least several thousand sequences per terminase family, per oral sample, and thereby increase the resolution or the coverage depth by several orders of magnitude from our previous study. By creating instructional videos and collection kits, we enabled citizen scientists to gather ~700 samples spanning ~100 individuals residing in different parts of the world (Figure 1). We will demonstrate that at high resolution, the mutant spectrum derived from members of just a single phage terminase family can already serve as a fingerprint, or a "phageprint", highly unique to an individual. Phageprints were not observable through our earlier study of metagenomic datasets (Tadmor et al., 2023), and they demonstrate the power of combining metagenomic mining with targeted sequencing to put a spotlight on uncharacterized phage families and their sequence diversity in their native contexts.
By examining phage terminase families at six different oral sites, and by comparing phageprints of individuals living across the globe, we were able to study the effect of spatial separation, ranging from several millimeters to thousands of kilometers. We found that the spatial separation of just a few centimeters (the distance between an individual's gingival sites and the hard palate, for example) already results in highly distinct phageprints for the HA phage family. In contrast, HB1 phageprints from different oral sites within an individual were highly similar. Additionally, we found that neither genetics nor cohabitation seem to play a role in the relatedness of phageprints across individuals.
Furthermore, by daily sampling of phageprints from the tongue dorsum over the course of a month across ten individuals, we continued to see individual-specific phageprints with many variants that persisted over time. We also identified variants that were fluctuating concordantly in partners. Through various diversity metrics we quantified the inter- and intra-personal distances between phageprints as a function of space and time.
We used machine learning models to further quantify the identifiability of an individual's phageprint and showed remarkably high model performance on unseen data. Performance remained very high even after the most abundant variants were removed, and even when 98% of the remaining variants were randomly removed.
Results
Humans harbor diverse, personal phageprints that are persistent in time
From a methodological standpoint, targeted sequencing of terminase families is very similar to 16S sequencing (Caporaso et al., 2011; Human Microbiome Project Consortium, 2012b). Using barcoded primers, we employed PCR and next generation sequencing to attain millions of paired-end reads for each terminase family (Figure 1). We took stringent measures against contaminants by 1) conducting our DNA extraction, PCR and post-PCR experiments in separate physical spaces, and 2) running five no-template control reactions for every PCR run, as well as three no-sample DNA extraction reactions for every DNA extraction run to ensure there are no contaminants in the DNA extraction kits. Upon sequencing and applying several quality control filters, the reads were demultiplexed based on their barcoded primer sequence. Using error-correcting DNA barcodes, we were able to detect errors and remove sequences that contained errors in their barcode. Furthermore, we eliminated nearly all sequencing errors by using paired-end reads which covered the full length of both terminase families (300 bp) and allowed only paired sequences with 100% match across the entire sequence (see Materials and Methods).
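The paired-read agreement filter described above can be illustrated with a minimal sketch. It assumes that both reads of a pair span the entire amplicon and that primers and barcodes have already been trimmed; the function names are hypothetical and the authors' actual pipeline (see their Materials and Methods) may differ.

```python
# Sketch of a "100% match" paired-end filter (illustrative, not the authors' code).
# Assumption: forward and reverse reads both cover the full amplicon, so the
# forward read should equal the reverse complement of the reverse read.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reads_agree(forward: str, reverse: str) -> bool:
    """True only if the two reads agree at every position."""
    return forward == reverse_complement(reverse)


def filter_pairs(pairs):
    """Keep the forward sequence of each fully concordant pair."""
    return [fwd for fwd, rev in pairs if reads_agree(fwd, rev)]


# Toy read pairs (hypothetical sequences, far shorter than 300 bp).
pairs = [
    ("ATGCGT", "ACGCAT"),  # concordant pair
    ("ATGCGT", "ACGCAA"),  # one mismatch, so the pair is discarded
]
kept = filter_pairs(pairs)
```

Requiring exact agreement is deliberately strict: a sequencing error would have to occur at the same position, with the same substitution, in both reads to slip through.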
All reads derived from the same terminase family were then pooled and clustered based on their DNA sequence similarity into Operational Taxonomic Units (OTUs), or what we will interchangeably refer to as variants. An OTU table is constructed wherein the number of reads belonging to each OTU (columns) within each sample (rows) is denoted. Using the OTU table, we can plot the relative abundances of each OTU within a sample. As a shorthand, we refer to this plot as a phageprint.
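As a small illustration of the definition above, the sketch below converts one row of an OTU table (read counts per OTU for a single sample) into a relative abundance profile. The OTU names and counts are invented for the example, and the function name is hypothetical.

```python
from typing import Dict


def phageprint(otu_counts: Dict[str, int]) -> Dict[str, float]:
    """Convert one sample's OTU read counts into relative abundances."""
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {otu: count / total for otu, count in otu_counts.items()}


# Hypothetical sample: 4000 reads spread over four OTUs.
sample = {"OTU_1": 2600, "OTU_2": 1000, "OTU_3": 396, "OTU_4": 4}
profile = phageprint(sample)

# The OTU with the largest relative abundance in this sample.
dominant = max(profile, key=profile.get)
```

Plotting `profile` as a bar chart, one bar per OTU, gives the kind of plot the text refers to as a phageprint.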
With bacterial 16S data, sequences are generally clustered at 97% sequence similarity into OTUs. At this threshold, each OTU is conventionally referred to as a bacterial species. In the absence of a convention for handling viral targeted sequencing data, we have used here various sequence similarity thresholds for clustering, including 100% sequence similarity, thereby allowing only identical sequences in each cluster. We found the results to be largely robust to variations in the sequence similarity threshold (see Materials and Methods: Examining the effect of OTU sequence similarity threshold, Supplementary Figure S3).
As an example, we show the HA phageprint from a subject's tongue dorsum (top surface) at two time points (Figure 2A). As shown in this figure, and across all other phageprints we have constructed for both terminase families, each phageprint is dominated by a small number of variants or OTUs (typically one or two). In addition to these OTUs, there are many OTUs with abundance values that are low but reproducible, and some that are fairly persistent in time within each subject. Generally, the dominant OTUs are not the same across different subjects.
FIGURE 1
A schematic summary of the main experimental and bioinformatic methods: 1) Discovery of ubiquitous phage families by examining large terminase sequences that occur across different metagenomic datasets described in our earlier work (Tadmor et al., 2023), 2) experimental sampling of several cohorts for temporal and spatial analysis of phageprints in related and unrelated individuals, 3) DNA extraction from oral biofilm samples, 4) PCR using barcoded primers followed by PCR clean-up and paired-end sequencing, 5) joining paired-end reads to eliminate sequencing errors, 6) additional quality control steps to further eliminate errors based on Phred scores and error-correcting barcodes, 7) demultiplexing of reads based on their barcode sequence and linking sequences to the sample they originate from, 8) gathering reads from all samples and clustering them based on sequence similarity into Operational Taxonomic Units (OTUs), 9) counting the number of sequences belonging to each OTU from each sample (i.e. constructing an OTU table), rarefying the table so that each sample is represented by the same total number of sequences, and denoising the OTU table to eliminate OTUs with relative abundances below an experimentally determined reproducibility threshold, 10) visualizing phageprints, which are the relative abundance profiles of OTUs (1 through N) in a given sample, 11) performing various downstream diversity analyses using the constructed OTU table as the basis, 12) creating machine learning models based on full and downsampled OTU tables. These model types include Logistic Regression (LR), Multi-Layer Perceptron (MLP), K-nearest Neighbor (KNN) and Gradient Boosting Classifier (GBC). Note that these steps are performed separately for HA and HB1 sequences.

FIGURE 2
The temporal dynamics of an individual's phageprint over the course of a month (on average 25 daily samples were collected during this period). (A, B) HA phageprints from subject 37 at two different time points, (A) 0th time point, right after brushing tongue dorsal and teeth surfaces and (B) 24 hours after the initial time point (no brushing in between time points). Each phageprint is derived from the analysis of 4000 sequences. OTUs are defined at 98% sequence similarity. (C) HB1 phageprint temporal dynamics on subject 1's tongue dorsum. The x-axis contains OTUs ordered according to the depicted phylogenetic tree of the OTU sequences (the phylogenetic tree is provided largely to serve as a schematic). Each OTU is composed of identical sequences (i.e. 100% sequence similarity threshold). The y-axis depicts the relative abundance of each OTU, and the z-axis shows the fluctuations in relative abundance of each OTU in time. (D) Depictions of HB1 phageprint temporal dynamics in different subjects. The format of these plots is the same as that of panel (C), and the order of OTUs is based on their phylogenetic distance and identical across all plots. All samples are collected from the tongue dorsum. Note that subjects 2 and 4 are partners, and their phageprints share some main features.

Before probing a larger number of individuals, we aimed to quantify our pipeline's detection and reproducibility thresholds to understand what levels of OTU temporal fluctuation are biological versus technical. To that end, we obtained three different samples from a subject's tongue dorsum. We then performed DNA extraction and PCR separately on each sample and sequenced these samples. The logic behind this experiment was to capture a lumped measure of noise arising from the various experimental processes depicted in Supplementary Figure S4. We show that the relative abundances of the variants making up each phageprint are highly reproducible across these three samples: the maximum standard deviation for OTU relative abundances was less than 0.007, with the majority less than 0.002 and close to 0. Moreover, we flagged OTUs that had appeared in only one or two samples out of three. As expected, we observed that the number of reproducible OTUs increases as a function of the relative abundance threshold, and all OTUs with greater than 0.001 relative abundance were reproducible across all three samples (Supplementary Figure S5). Thus, we arrived at 0.001 relative abundance as the reproducibility threshold for OTUs, and denoised OTU tables by eliminating OTUs that did not meet this threshold in any of the samples. We have performed similar benchmarking studies on a larger number of subjects and included separate sequencing runs to account for any variation that may be introduced by a sequencing run (Supplementary Figure S6). In short, through stringent quality control filters and benchmarking of our experimental and bioinformatic workflow, we showed that phageprints are highly reproducible (see Materials and Methods).
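The denoising rule described above (drop any OTU that never reaches the 0.001 reproducibility threshold in any sample) can be sketched as follows. The table layout (sample name mapped to a dictionary of OTU relative abundances), the use of `>=` at the boundary, and the toy values are assumptions made for illustration only.

```python
from typing import Dict

Table = Dict[str, Dict[str, float]]  # sample -> {OTU -> relative abundance}

THRESHOLD = 0.001  # reproducibility threshold reported in the text


def denoise(table: Table, threshold: float = THRESHOLD) -> Table:
    """Remove OTUs whose relative abundance is below threshold in every sample."""
    # An OTU is kept if it meets the threshold in at least one sample
    # (treating "meets" as >= is an assumption of this sketch).
    keep = {
        otu
        for row in table.values()
        for otu, abundance in row.items()
        if abundance >= threshold
    }
    return {
        sample: {otu: ab for otu, ab in row.items() if otu in keep}
        for sample, row in table.items()
    }


# Toy table with two samples and three OTUs (hypothetical values).
table = {
    "sample_1": {"OTU_A": 0.9000, "OTU_B": 0.0995, "OTU_C": 0.0005},
    "sample_2": {"OTU_A": 0.5000, "OTU_B": 0.4998, "OTU_C": 0.0002},
}
clean = denoise(table)
```

Here `OTU_C` is dropped because it stays below 0.001 in both samples, while an OTU that is rare in one sample but above the threshold in another would be retained everywhere.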
To further explore the temporal dynamics of these phageprints, ten subjects collected biofilm from the tongue dorsum every 24 hours for a month, though on average subjects returned samples from 25 days because they missed some sampling days. The HB1 phageprint temporal dynamics on a subject's tongue dorsum are depicted in Figure 2. Here, to provide a more detailed view, we cluster the HB1 sequences into OTUs based on 100% sequence similarity, or in other words, we are depicting the relative abundance of individual sequences.
Given the dynamic nature of an ecosystem like the human mouth, it is counter-intuitive that over a month, the main features of each phageprint are preserved in all subjects. However, as we will investigate further, there are fluctuations that are biological rather than technical. A global trend is that the dominant OTUs typically remain dominant throughout the sampling period in all subjects (Figure 2). This observation is especially interesting in light of the wide range of diets and oral hygiene practices across subjects (Supplementary Figure S7).
To make quantitative pairwise comparisons between phageprints we employed several commonly used metrics such as Bray-Curtis and Unifrac, and in doing so, we distill the comparison of thousands of sequences from any two samples to a single score. All distance metrics paint similar pictures of the HB1 terminase family, depicting it as highly individual-specific and persistent in time (Supplementary Figure S8; Figure 3). Because phageprints in different individuals have such distinct compositions, abundance-based metrics are especially suitable for describing them. However, even the binary Jaccard distance metric, which does not consider variant abundances, points to a similar conclusion. As is expected from the heat maps shown in Supplementary Figure S8, the intra-personal distances are markedly lower than the inter-personal distances, with the notable exception being subjects 2 and 4, who are partners (Figure 3).
Machine learning models detect with high precision and recall an individual's phageprint even when phageprints are heavily downsampled
In addition to these distance metrics, we were motivated to build machine learning models whose performance could further quantify the predictability of an individual's phageprint within the temporal cohort. We first built several types of machine learning models, including Logistic Regression (LR), K-Nearest Neighbor (KNN), Gradient Boosting Classifier (GBC), and Multi-Layer Perceptron (MLP), each of which performs a binary classification of an individual's phageprint from the rest (i.e. one-versus-rest models). The input to these models was the OTU table, where the rows are samples (i.e. day 1 to 30 for each subject) and the columns are the OTUs. Across the temporal cohort consisting of ten individuals, ~7300 HB1 OTUs were collectively detected. This table was split for training (70%) and testing (30%) such that models would be trained on 70% of the time points from each individual. To quantify the performance of the models, we performed ten iterations of random train/test splits and report the median and the 95% confidence intervals for the Area Under the Precision-Recall curve (AUPR) and the Area Under the Receiver Operating Characteristic curve (AUROC).
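The evaluation setup described above can be sketched without committing to a particular model library. The snippet below shows one way to split samples 70/30 within each subject, to build one-versus-rest labels, and to compute AUROC from classifier scores using its rank interpretation. All function names, the seed, and the toy data are assumptions for illustration; the classifiers themselves (LR, KNN, GBC, MLP) are not implemented here.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def split_by_subject(
    subjects: Sequence[str], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Randomly assign train_fraction of each subject's samples to training."""
    rng = random.Random(seed)
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subjects):
        rows_by_subject[subject].append(row)
    train, test = [], []
    for rows in rows_by_subject.values():
        rng.shuffle(rows)
        cut = round(train_fraction * len(rows))
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return sorted(train), sorted(test)


def one_vs_rest_labels(subjects: Sequence[str], target: str) -> List[int]:
    """Label samples from the target subject 1 and all others 0."""
    return [1 if s == target else 0 for s in subjects]


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical cohort: three subjects with ten daily samples each.
subjects = ["S1"] * 10 + ["S2"] * 10 + ["S3"] * 10
train_idx, test_idx = split_by_subject(subjects)
labels_s1 = one_vs_rest_labels(subjects, "S1")

# Toy scores from some classifier on four test samples.
example_auroc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Repeating the split with ten different seeds and summarizing the resulting scores by their median would mirror the ten random train/test iterations mentioned in the text.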
All model types performed remarkably well, with very high performances for both the Logistic Regression and the Multi-Layer Perceptron model types (Figure 4; Supplementary Table S2). We performed the same exercise on an OTU table built from HA terminase family OTUs, and arrived at similarly high model performances (Supplementary Figures S9, S10; Supplementary Tables S4, S5). It is important to note that we excluded subject 4 from this particular analysis because we wanted to measure the model's performance for unrelated individuals, as partners' coevolving phageprints would be a confounding factor. We also provide models that include both partners and demonstrate that they have high performances even when highly similar phageprints are included in the dataset (Supplementary Figure S11). For example, using the GBC model type, the lowest AUPR and AUROC median values obtained across subjects were 0.98 and 0.92, respectively.
Given that phageprints are dominated by one or two OTUs, it is reasonable to assume that the exclusion of these dominant OTUs would dissolve the individual-specific and time-persistent nature of phageprints. To formally test this assumption, we removed the top ten most abundant OTUs of each sample from the entire dataset. A total of ~600 OTUs were removed from the dataset, removing on average two thirds of the reads from each sample. Upon removing these OTUs, we rescaled the OTU table such that the relative abundances of the remaining OTUs would again add up to 1. To our surprise, the exclusion of the most abundant OTUs still resulted in nearly perfect classification (Supplementary Tables S6, S7). We further randomly downsampled to 2% of the total remaining OTUs, resulting in just 226 OTUs, and rescaled the resulting OTU table as previously described. The performance of the models still remained nearly as high as before (Supplementary Tables S8, S9).
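The three table manipulations described above (removing each sample's most abundant OTUs from the whole dataset, rescaling rows to sum to 1, and randomly keeping a small fraction of the remaining OTUs) can be sketched as follows. The table layout, function names, seed, and toy values are assumptions for illustration, not the authors' implementation.

```python
import random
from typing import Dict

Table = Dict[str, Dict[str, float]]  # sample -> {OTU -> relative abundance}


def remove_top_otus(table: Table, n_top: int = 10) -> Table:
    """Drop, from every sample, any OTU ranked in the top n_top of any sample."""
    dominant = set()
    for row in table.values():
        ranked = sorted(row, key=row.get, reverse=True)
        dominant.update(ranked[:n_top])
    return {
        sample: {otu: ab for otu, ab in row.items() if otu not in dominant}
        for sample, row in table.items()
    }


def rescale(table: Table) -> Table:
    """Renormalize each sample so its remaining abundances sum to 1."""
    out = {}
    for sample, row in table.items():
        total = sum(row.values())
        out[sample] = {otu: (ab / total if total > 0 else 0.0) for otu, ab in row.items()}
    return out


def downsample_otus(table: Table, fraction: float = 0.02, seed: int = 0) -> Table:
    """Keep a random fraction of the OTUs (columns) across all samples."""
    otus = sorted({otu for row in table.values() for otu in row})
    n_keep = max(1, round(fraction * len(otus)))
    kept = set(random.Random(seed).sample(otus, n_keep))
    return {
        sample: {otu: ab for otu, ab in row.items() if otu in kept}
        for sample, row in table.items()
    }


# Toy example: remove the single most abundant OTU of each sample, then rescale.
toy = {
    "s1": {"A": 0.6, "B": 0.3, "C": 0.1},
    "s2": {"A": 0.1, "B": 0.7, "C": 0.2},
}
reduced = rescale(remove_top_otus(toy, n_top=1))

# Toy example: keep 2% of 100 hypothetical OTUs.
wide = {"s1": {f"OTU_{i}": 0.01 for i in range(100)}}
subsampled = rescale(downsample_otus(wide, fraction=0.02))
```

Note that an OTU is removed everywhere once it is dominant in any one sample, which is why ten OTUs per sample can add up to several hundred OTUs removed across a cohort.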
Phageprints remain recognizable even when drastically subsampled because many low-abundance OTUs have individual-specific patterns of occurrence. When this small subset of the original OTU table is hierarchically clustered (Supplementary Figure S12), most samples from the same individual cluster together, and thus machine learning models can easily pick out an individual's phageprint from others even using a small fraction of the total data for each subject.
Less than 1% of OTUs are shared across all subjects
We measured the sharing of OTUs across subjects by collapsing the OTU table into a table of subjects by OTUs rather than samples by OTUs, such that if an OTU was identified at any point within the sampling period (~30 days), it is given a value of 1, and 0 otherwise. With this binary table, we created an UpSet plot where the number of OTUs unique to each subject as well as the number of OTUs shared between different sets of subjects is shown (Supplementary Figure S13).
Less than 1% (~0.8%) of all OTUs were detected across all subjects. The relative abundance of these generalist OTUs per subject is hierarchically clustered and shown in Supplementary Figure S14. Again, we see that partners cluster most closely together even based on this small subset of OTUs. Finally, a much higher percentage of total OTUs, about 85%, are detected in at least two subjects, and the rest are only detected in one individual. Based on these results, we can conclude that while the same variants may appear in different subjects, the individual specificity of phageprints emerges in large part because the relative abundances of variants are often individual-specific.
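The collapse from a samples-by-OTUs table to a binary subjects-by-OTUs table, and the sharing percentages derived from it, can be sketched as below. The sample names, the sample-to-subject mapping, and the counts are hypothetical, and the UpSet plot itself is not reproduced.

```python
from collections import defaultdict
from typing import Dict, Set


def presence_by_subject(
    table: Dict[str, Dict[str, int]], sample_to_subject: Dict[str, str]
) -> Dict[str, Set[str]]:
    """For each subject, the set of OTUs seen in at least one of their samples."""
    present = defaultdict(set)
    for sample, row in table.items():
        subject = sample_to_subject[sample]
        present[subject].update(otu for otu, count in row.items() if count > 0)
    return dict(present)


def sharing_fractions(present: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of OTUs found in all subjects, in two or more, and in only one."""
    n_subjects = len(present)
    all_otus = set().union(*present.values())
    n_hosts = {otu: sum(otu in otus for otus in present.values()) for otu in all_otus}
    total = len(all_otus)
    return {
        "all_subjects": sum(n == n_subjects for n in n_hosts.values()) / total,
        "two_or_more": sum(n >= 2 for n in n_hosts.values()) / total,
        "single_subject": sum(n == 1 for n in n_hosts.values()) / total,
    }


# Toy read counts for three subjects (hypothetical).
table = {
    "S1_day1": {"A": 5, "B": 0, "C": 2},
    "S1_day2": {"A": 4, "B": 1, "C": 0},
    "S2_day1": {"A": 7, "D": 3},
    "S3_day1": {"A": 1, "C": 9},
}
sample_to_subject = {"S1_day1": "S1", "S1_day2": "S1", "S2_day1": "S2", "S3_day1": "S3"}

present = presence_by_subject(table, sample_to_subject)
fractions = sharing_fractions(present)
```

In this toy cohort only OTU `A` is found in every subject; the same bookkeeping on the real table would give the ~0.8% and ~85% figures reported in the text.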
FIGURE 3
HB1 phageprint temporal dynamics quantified using pairwise distance metrics and visualized using (A) heatmaps and (B) box-and-whisker plots. The pairwise distance metrics include: Pearson distance (1 - Pearson correlation), Binary Jaccard, Abundance Jaccard, Bray-Curtis and unweighted Unifrac. Top: The heatmap scale applies to all heatmaps shown. Subjects 02 and 04 are partners. Samples from each subject are chronologically ordered. Bottom: Intra- and inter-personal distances between HB1 phageprints in 10 subjects, over the span of a month. The outliers, defined as those outside of 1.5 x IQR (inter-quartile range), are denoted by "+". The box-plots corresponding to the comparisons between the couple in this study are highlighted.