Machine learning models can identify individuals based on a resident oral bacteriophage family


Gita Mahmoudabadi, Kelsey Homyk, Adam B. Catching, Ana Mahmoudabadi, Helen Bermudez Foley, Arbel D. Tadmor and Rob Phillips

Department of Bioengineering, Stanford University, Stanford, CA, United States

Genentech Inc., San Francisco, CA, United States

Biophysics, National Institute of Allergy and Infectious Diseases, Bethesda, MD, United States

WellStar Kennestone Hospital, Marietta, GA, United States

Department of Environmental Health, University of Southern California, Los Angeles, CA, United States

Personalized Computational Genomics, Translationale Onkologie an der Universitätsmedizin der Johannes Gutenberg-Universität Mainz, Mainz, Germany

Departments of Biophysics and Biology, California Institute of Technology, Pasadena, CA, United States

Metagenomic studies have revolutionized the study of novel phages. However, these studies trade depth of coverage for breadth. We show that the targeted sequencing of a small region of a phage terminase family can provide sufficient sequence diversity to serve as an individual-specific barcode or a “phageprint”, defined as the relative abundance profile of the variants within a terminase family. By collecting ~700 oral samples from ~100 individuals living on multiple continents, we found a consistent trend wherein each individual harbors one or two dominant variants that coexist with numerous low-abundance variants. By tracking phageprints over the span of a month across ten individuals, we observed that phageprints were generally stable, and found instances of concordant temporal fluctuations of variants shared between partners. To quantify these patterns further, we built machine learning models that, with high precision and recall, distinguished individuals even when we eliminated the most abundant variants and further downsampled phageprints to 2% of the remaining variants. Except between partners, phageprints are dissimilar between individuals, and neither country-of-residence, genetics, diet nor cohabitation seem to play a role in the relatedness of phageprints across individuals. By sampling from six different oral sites, we were able to study the impact of millimeters to a few centimeters of separation on an individual’s phageprint and found that such limited spatial separation results in site-specific phageprints.

KEYWORDS

virus, metagenomics, forensics, machine learning, virome, oral microbiome, phages, terminase

Frontiers in

Microbiomes

frontiersin.org

OPEN ACCESS

EDITED BY

Jesús Muñoz-Rojas, Meritorious Autonomous University of Puebla, Mexico

REVIEWED BY

Liliana Lopez Pliego, Meritorious Autonomous University of Puebla, Mexico

Alma Rosa Netzahuatl-Muñoz, University of Tlaxcala, Mexico

*CORRESPONDENCE

Gita Mahmoudabadi

gitam@stanford.edu

RECEIVED

27 March 2024

ACCEPTED

17 July 2024

PUBLISHED

03 September 2024

CITATION

Mahmoudabadi G, Homyk K, Catching AB, Mahmoudabadi A, Foley HB, Tadmor AD and Phillips R (2024) Machine learning models can identify individuals based on a resident oral bacteriophage family. Front. Microbiomes 3:1408203. doi: 10.3389/frmbi.2024.1408203

Mahmoudabadi, Foley, Tadmor and Phillips.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

TYPE

Original Research


DOI

10.3389/frmbi.2024.1408203

Introduction

Viruses of bacteria, or phages, are among the most numerous and diverse biological entities on our planet. They play important roles as regulators of microbial ecosystems through rapid infection cycles and gene transfer events (Roux et al., 2016; Touchon et al., 2017; Gregory et al., 2019). Yet, compared to their bacterial hosts, and despite their proven potential to transform fields such as medicine, agriculture and biotechnology (Szafrański et al., 2017; Svircev et al., 2018; Kortright et al., 2019; Sieiro et al., 2020; Duan et al., 2022), phages remain some of the least studied members of the human microbiome (Shkoporov and Hill, 2019; Guerin and Hill, 2020). Even across familiar habitats such as the human body, the identity of phages and their corresponding bacterial hosts, their population structure, their modes of transfer between habitats, their co-evolutionary history with bacterial and human hosts, their role in health and disease, and other important topics remain relatively unexplored.

We chose to study phages residing in the human mouth as it represents a multifaceted and medically important ecosystem. Studies have revealed phages as highly abundant members of the human oral cavity, with distinct communities at sites of disease, capable of augmenting the bacterial arsenal of pathogenic genes (Roberts and Mullany, 2010; Edlund et al., 2015; Santiago-Rodriguez et al., 2015; Martínez et al., 2021; Matrishin et al., 2023). These studies have relied on the shotgun metagenomic approach, in part because one of the defining features of viral genomes is the lack of a universally conserved sequence analogous to the 16S ribosomal RNA sequences in bacteria, which is used as a universal marker to draw conclusions about bacterial evolution and taxonomic classification (Woese et al., 1990; Yarza et al., 2014). This marker-based approach is indispensable to microbial ecology because it allows a high coverage depth of the 16S region, which in turn enables precise and reproducible depictions of bacterial community compositions (Caporaso et al., 2011; Proctor et al., 2018).

Using current sequencing platforms, the trade-off for coverage depth is typically the coverage breadth (Supplementary Figure S1). In comparison to the marker-based approach, shotgun metagenomics provides much greater breadth of coverage and offers several advantages. However, it suffers from several key disadvantages. The coverage depth is often heterogeneous and remains comparatively low in these studies, meaning that the de novo assembly of genomes from complex environments remains a significant challenge (Yu et al., 2017; Johansen et al., 2022), even for abundant members with relatively short genome lengths (Dutilh et al., 2014; Meyer et al., 2022). Moreover, the genomes assembled through shotgun metagenomics are often consensus genomes or an average representation of similar genomes within an environment (Lapidus and Korobeynikov, 2021).

In contrast to these technical challenges, the marker-based approach allows orders of magnitude greater coverage depth by focusing the reads on a small genomic segment, and thus provides a much higher resolution view of microbial communities. The targeted approach is therefore widely used to complement shotgun metagenomic depictions of bacterial communities (Costea et al., 2018; Rath et al., 2019). Because of their high mutation rates and rapid turnovers, viral genomes are incredibly diverse, and the sequence diversity within a virus family could be explored much more deeply through targeted sequencing. Even within a single “species”, viral genomes exist as a collection of related variants, which are often described as “quasispecies” or as a “mutant spectrum”. The mutant spectra of RNA viruses are well described in early and recent studies of RNA phages and RNA viruses, particularly for lab strains (Eigen, 1971; Weissmann et al., 1973; Domingo and Perales, 2019; Sun et al., 2021). DNA phages, on the other hand, are less studied within this framework, primarily because they have lower mutation rates compared to RNA phages (Domingo et al., 2012). Even less explored are the mutant spectra of DNA phages within a dynamic host environment.

As such, the overarching aim of this study was to apply targeted sequencing to understudied DNA phages in their native context, to explore their inter- and intra-personal diversity, their spatial patterns of distribution, as well as temporal dynamics in a large-scale and high-resolution fashion that allows for observing their individual variants as well as the collective mutant spectra. Thus, we first had to choose regions within phage genomes on which to perform targeted sequencing. While one could relatively easily target sequences of well characterized phages, we were motivated to create a roadmap for mining metagenomic datasets and shedding light on understudied phages.

Towards this goal, we first developed and benchmarked Metagenomic Clustering by Reference Library or MCRL, which is an algorithm for the identification of non-redundant gene families within a metagenome (Tadmor and Phillips, 2022). In a previous study, we then applied MCRL to oral metagenomes of seven individuals from two studies conducted in two different continents (Xie et al., 2010; Belda-Ferre et al., 2012; Tadmor et al., 2023). By focusing the search on the terminase (large subunit) gene families, we were able to narrow down the search from thousands of viral gene families to seven non-homologous terminase families that were shared across individuals in these two studies (Tadmor et al., 2023).

In the absence of a genomic taxonomy for viruses, we have referred to those phages that encode members of the same terminase family as members of the same phage family (Tadmor et al., 2023). This notation is predicated on previous studies, including our own (Mahmoudabadi and Phillips, 2018), that have shown no significant sequence similarity between terminase sequences of unrelated phages (Brüssow and Desiere, 2001; Wangchuk et al., 2021) as well as studies that have used the terminases to build phage phylogenetic trees (Al-Shayeb et al., 2020; Auslander et al., 2020). Moreover, we focused our search on terminases because they are among the most functionally-conserved genes in double-stranded DNA phage genomes (Leavitt et al., 2013; Lokareddy et al., 2022). Unlike several other viral genes such as integrases and lysins, terminases lack bacterial homologs, and thus, are considered to be unique to phages (Casjens, 2003). Additionally, we have previously successfully used terminases to probe phage-bacteria interactions within a complex host environment, namely the termite gut (Tadmor et al., 2011).


To test whether we were successful in identifying terminase families that were prevalent enough in the human phageome to be practical experimental targets, we searched for them across hundreds of metagenomic samples from the Human Microbiome Project (HMP) (Human Microbiome Project Consortium, 2012a) spanning ~100 individuals and 18 body sites (Tadmor et al., 2023). Remarkably, we showed that despite the individual-specific nature of the human virome and the small number of individuals from which these terminase families were originally identified, they are prevalent across the HMP cohort. In this study we chose to focus on HB1 and HA terminase families as they were the two most prevalent families, detected in most individuals within the HMP cohort (Tadmor et al., 2023). In the following paragraphs we summarize some of our earlier findings, particularly those pertinent to HA and HB1 terminase families.

To identify the putative habitats of the phages encoding these terminase families, we searched through ~4000 environmental metagenomes from the IMG/VR (Paez-Espino et al., 2017) and IMG/M (Chen et al., 2019) databases comprising numerous distinct habitats, in addition to ~100 environmental metagenomes from the VIROME database (Wommack et al., 2012). Most terminase families were found to be largely human-associated, and in instances where remote homologs were found in environmental phages, the human-derived phage sequences were phylogenetically distinguishable from their environmental counterparts. Additionally, by examining various body sites, we showed that most terminase families were primarily localized to the human oral cavity. The HB1 terminase family was an exception, as it is also detected in the human gut, though we showed that the oral and the gut-derived HB1 terminase family members were phylogenetically distinct.

Through experiments where we separated the bacterial and viral fractions of oral samples, we were able to demonstrate that the HA phage family is likely lysogenic and infects various species of the Streptococcus genus, whereas the oral HB1 phage family is likely lytic, and its host species remains to be discovered. Moreover, we show the positions of the closest HA and HB1 terminase homologs in previously sequenced full phage genomes (Supplementary Figure S2). Additionally, through selection pressure analysis and alignment of functional motifs, we showed that HA- and HB1-encoding phages are likely functionally active members of the human oral virome. Finally, we designed primers to target these phage families using their respective terminase families within oral samples from nine individuals and showed that we could indeed reliably capture them experimentally. The primers for HA and HB1 are provided again in this study (Supplementary Table S1).

In this study, we target the HA and HB1 terminase families to obtain at least several thousand sequences per terminase family, per oral sample, and thereby increase the resolution or the coverage depth by several orders of magnitude from our previous study. By creating instructional videos and collection kits, we enabled citizen scientists to gather ~700 samples spanning ~100 individuals residing in different parts of the world (Figure 1). We will demonstrate that at high resolution, the mutant spectrum derived from members of just a single phage terminase family can already serve as a fingerprint, or a “phageprint”, highly unique to an individual. Phageprints were not observable through our earlier study of metagenomic datasets (Tadmor et al., 2023), and demonstrate the power of combining metagenomic mining with targeted sequencing to put a spotlight on uncharacterized phage families and their sequence diversity in their native contexts.

By examining phage terminase families at 6 different oral sites, and by comparing phageprints of individuals living across the globe, we were able to study the effect of spatial separation, ranging from several millimeters to thousands of kilometers. We found that the spatial separation of just a few centimeters (the distance between an individual’s gingival sites and the hard palate, for example) already results in highly distinct phageprints for the HA phage family. In contrast, HB1 phageprints from different oral sites within an individual were highly similar. Additionally, we found that neither genetics nor cohabitation seem to play a role in the relatedness of phageprints across individuals.

Furthermore, by daily sampling of phageprints from the tongue dorsum over the course of a month across ten individuals, we continued to see individual-specific phageprints with many variants that persisted over time. We also identified variants that were fluctuating concordantly in partners. Through various diversity metrics we quantified the inter- and intra-personal distances between phageprints as a function of space and time.

We used machine learning models to further quantify the identifiability of an individual’s phageprint and showed remarkably high model performance on unseen data. Performance remained very high even after the most abundant variants were removed, and even after 98% of the remaining variants were randomly removed.

Results

Humans harbor diverse, personal phageprints that are persistent in time

From a methodological standpoint, targeted sequencing of terminase families is very similar to 16S sequencing (Caporaso et al., 2011; Human Microbiome Project Consortium, 2012b). Using barcoded primers, we employed PCR and next generation sequencing to attain millions of paired-end reads for each terminase family (Figure 1). We took stringent measures against contaminants by 1) conducting our DNA extraction, PCR and post-PCR experiments in separate physical spaces, and 2) running five no-template control reactions for every PCR run, as well as three no-sample DNA extraction reactions for every DNA extraction run to ensure there are no contaminants in the DNA extraction kits. After sequencing and applying several quality control filters, the reads were demultiplexed based on their barcoded primer sequence. Using error-correcting DNA barcodes, we were able to detect and remove sequences that contained errors in their barcode. Furthermore, we eliminated nearly all sequencing errors by using paired-end reads which covered the full length of both terminase families (300 bp) and allowed only paired sequences with 100% match across the entire sequence (see Materials and Methods).
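The paired-read agreement filter described above can be illustrated with a minimal sketch. It assumes, for illustration only, that both reads have already been trimmed to span the same amplicon, so that the reverse-complemented reverse read should equal the forward read. The function names are hypothetical and this is not the authors' pipeline.

```python
# Hypothetical sketch: keep a read pair only when both reads agree exactly.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def consensus_if_identical(forward: str, reverse: str):
    """Return the shared sequence if the pair matches at 100%, else None.

    Assumption: both reads cover the full amplicon, so the reverse read,
    once reverse-complemented, should be identical to the forward read.
    """
    candidate = reverse_complement(reverse)
    return forward if forward == candidate else None


# Toy read pairs (made-up sequences, far shorter than a real 300 bp amplicon).
pairs = [
    ("ACGTTGCA", "TGCAACGT"),  # the two reads agree, so the pair is kept
    ("ACGTTGCA", "TGCAACGA"),  # one mismatch, so the pair is discarded
]
kept = [s for s in (consensus_if_identical(f, r) for f, r in pairs) if s]
```

Requiring exact agreement discards any pair in which either read carries an error within the overlap, which trades read count for accuracy.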

All reads derived from the same terminase family were then pooled and clustered based on their DNA sequence similarity into Operational Taxonomic Units (OTUs), or what we will interchangeably refer to as variants. An OTU table is constructed wherein the number of reads belonging to each OTU (columns) within each sample (rows) is denoted. Using the OTU table, we can plot the relative abundances of each OTU within a sample. As a shorthand, we refer to this plot as a phageprint.
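As a rough illustration of how an OTU table leads to a phageprint, the sketch below counts reads per OTU in each sample and converts one row to relative abundances. It assumes each read has already been assigned an OTU label (at a 100% similarity threshold the label could simply be the sequence itself). The sample names, OTU labels and function names are invented for the example.

```python
from collections import Counter


def build_otu_table(reads_by_sample):
    """Count reads per OTU (columns) within each sample (rows)."""
    counts = {s: Counter(reads) for s, reads in reads_by_sample.items()}
    otus = sorted({otu for c in counts.values() for otu in c})
    table = {s: [counts[s].get(o, 0) for o in otus] for s in counts}
    return otus, table


def phageprint(row):
    """Convert one row of read counts into relative abundances."""
    total = sum(row)
    if total == 0:
        raise ValueError("sample has no reads")
    return [n / total for n in row]


# Toy input: OTU labels per read, per sample (hypothetical values).
reads_by_sample = {
    "subject1_day1": ["otu_a"] * 6 + ["otu_b"] * 3 + ["otu_c"],
    "subject2_day1": ["otu_b"] * 8 + ["otu_d"] * 2,
}
otus, table = build_otu_table(reads_by_sample)
print1 = phageprint(table["subject1_day1"])
print2 = phageprint(table["subject2_day1"])
```

Each resulting vector sums to 1, so phageprints from samples with different read depths can be compared directly.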

With bacterial 16S data, sequences are generally clustered at 97% sequence similarity into OTUs. At this threshold, each OTU is conventionally referred to as a bacterial species. In the absence of convention for handling viral targeted sequencing data, we have used here various sequence similarity thresholds for clustering including 100% sequence similarity, thereby allowing only identical sequences in each cluster. We found the results to be largely robust to variations in the sequence similarity threshold (see Materials and Methods: Examining the effect of OTU sequence similarity threshold, Supplementary Figure S3).

As an example, we show the HA phageprint from a subject’s tongue dorsum (top surface) at two time points (Figure 2A). As shown in this figure, and across all other phageprints we have constructed for both terminase families, each phageprint is dominated by a small number of variants or OTUs (typically one or two). In addition to these OTUs, there are many OTUs with abundance values that are low but reproducible, and some that are fairly persistent in time within each subject. Generally, the dominant OTUs are not the same across different subjects.

FIGURE 1

A schematic summary of the main experimental and bioinformatic methods: 1) Discovery of ubiquitous phage families by examining large terminase sequences that occur across different metagenomic datasets described in our earlier work (Tadmor et al., 2023), 2) experimental sampling of several cohorts for temporal and spatial analysis of phageprints in related and unrelated individuals, 3) DNA extraction from oral biofilm samples, 4) PCR using barcoded primers followed by PCR clean-up and paired-end sequencing, 5) joining paired-end reads to eliminate sequencing errors, 6) additional quality control steps to further eliminate errors based on Phred scores and error-correcting barcodes, 7) demultiplexing of reads based on their barcode sequence and linking sequences to the sample they originate from, 8) gathering reads from all samples and clustering them based on sequence similarity into Operational Taxonomic Units (OTUs), 9) counting the number of sequences belonging to each OTU from each sample (i.e. constructing an OTU table), and rarefying the table so that each sample is represented by the same total number of sequences, and denoising the OTU table to eliminate OTUs with relative abundances below an experimentally determined reproducibility threshold, 10) visualizing phageprints which are the relative abundance profiles of OTUs (1 through N) in a given sample, 11) performing various downstream diversity analysis using the constructed OTU table as the basis, 12) creating machine learning models based on full and downsampled OTU tables. These model types include Logistic Regression (LR), Multi-Layer Perceptron (MLP), K-nearest Neighbor (KNN) and Gradient Boosting Classifier (GBC). Note that these steps are performed separately for HA and HB1 sequences.

FIGURE 2

The temporal dynamics of an individual’s phageprint over the course of a month (on average 25 daily samples were collected during this period). (A, B) HA phageprints from subject 37 at two different time points, (A) 0th time point, right after brushing tongue dorsal and teeth surfaces and (B) 24 hours after the initial time point (no brushing in between time points). Each phageprint is derived from the analysis of 4000 sequences. OTUs are defined at 98% sequence similarity. (C) HB1 phageprint temporal dynamics on subject 1’s tongue dorsum. The x-axis contains OTUs ordered according to the depicted phylogenetic tree of the OTU sequences (the phylogenetic tree is provided largely to serve as a schematic). Each OTU is composed of identical sequences (i.e. 100% sequence similarity threshold). The y-axis depicts the relative abundance of each OTU, and the z-axis shows the fluctuations in relative abundance of each OTU in time. (D) Depictions of HB1 phageprint temporal dynamics in different subjects. The format of these plots is the same as that of panel (C), and the order of OTUs is based on their phylogenetic distance and identical across all plots. All samples are collected from the tongue dorsum. Note that subjects 2 and 4 are partners, and their phageprints share some main features.

Before probing a larger number of individuals, we aimed to quantify our pipeline’s detection and reproducibility thresholds to understand what level of OTU temporal fluctuation is biological versus technical. To that end, we obtained 3 different samples from a subject’s tongue dorsum. We then performed DNA extraction and PCR separately on each sample and sequenced these samples. The logic behind this experiment was to capture a lumped measure of noise arising from various experimental processes depicted in Supplementary Figure S4. We show that the relative abundances of the variants making up each phageprint across these three samples are highly reproducible, and the maximum standard deviation for OTU relative abundances was less than 0.007, with the majority less than 0.002 and close to 0. Moreover, we flagged OTUs that had appeared in only one or two samples out of three. As expected, we observed that the number of reproducible OTUs increases as a function of the relative abundance threshold, and all OTUs with greater than 0.001 relative abundance were reproducible across all three samples (Supplementary Figure S5). Thus, we arrived at 0.001 relative abundance as the reproducibility threshold for OTUs, and denoised OTU tables by eliminating OTUs that did not meet this threshold across any of the samples. We have performed similar benchmarking studies on a larger number of subjects and included separate sequencing runs to account for any variation that may be introduced by a sequencing run (Supplementary Figure S6). In short, through stringent quality control filters and benchmarking of our experimental and bioinformatic workflow, we showed that phageprints are highly reproducible (see Materials and Methods).
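The denoising rule described above (dropping OTUs that never reach the 0.001 reproducibility threshold in any sample) can be sketched as follows. This is a minimal illustration on made-up numbers. The function name is hypothetical, and treating "meets the threshold" as greater than or equal to 0.001 is an assumption, since the text does not state whether the comparison is strict.

```python
REPRODUCIBILITY_THRESHOLD = 0.001  # relative abundance threshold from the text


def denoise(rel_table, threshold=REPRODUCIBILITY_THRESHOLD):
    """Keep OTUs whose relative abundance reaches the threshold in any sample.

    rel_table maps sample name -> list of relative abundances, with the same
    OTU order in every row. Returns the kept column indices and the filtered
    table. Assumption: the comparison is inclusive (>=).
    """
    rows = list(rel_table.values())
    n_otus = len(rows[0])
    keep = [j for j in range(n_otus) if any(row[j] >= threshold for row in rows)]
    filtered = {s: [row[j] for j in keep] for s, row in rel_table.items()}
    return keep, filtered


# Toy table: the third OTU stays below 0.001 in both samples.
rel_table = {
    "s1": [0.9, 0.0995, 0.0005],
    "s2": [0.5, 0.4996, 0.0004],
}
keep, filtered = denoise(rel_table)
```

The sketch leaves the remaining abundances as they are, because the text does not say whether the table is renormalized after denoising.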

To further explore the temporal dynamics of these phageprints, ten subjects collected biofilm from the tongue dorsum every 24 hours for a month, though on average subjects returned samples from 25 days because they missed some sampling days. The HB1 phageprint temporal dynamics on a subject’s tongue dorsum is depicted in Figure 2. Here, to provide a more detailed view, we cluster the HB1 sequences into OTUs based on 100% sequence similarity, or in other words, we are depicting the relative abundance of individual sequences.

Given the dynamic nature of an ecosystem like the human mouth, it is counter-intuitive that over a month, the main features of each phageprint are preserved in all subjects. However, as we will investigate further, there are fluctuations that are biological rather than technical. A global trend is that the dominant OTUs typically remain dominant throughout the sampling period in all subjects (Figure 2). This observation is especially interesting in light of the wide range of diets and oral hygiene practices across subjects (Supplementary Figure S7).

To make quantitative pairwise comparisons between phageprints we employed several commonly used metrics such as Bray-Curtis and Unifrac, and in doing so, we distill the comparison of thousands of sequences from any two samples to a single score. All distance metrics paint similar pictures of the HB1 terminase family, depicting it as highly individual-specific and persistent in time (Supplementary Figure S8; Figure 3). Because phageprints in different individuals have such distinct compositions, abundance-based metrics are especially suitable for describing them. However, even the binary Jaccard distance metric, which does not consider variant abundances, points to a similar conclusion. As is expected from the heat maps shown in Supplementary Figure S8, the intra-personal distances are markedly lower than the inter-personal distances, with the notable exception being subjects 2 and 4, who are partners (Figure 3).
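To make the pairwise comparison concrete, the sketch below computes two of the metrics named above, Bray-Curtis and binary Jaccard, on small made-up phageprints. These are the standard textbook definitions written in plain Python. The variable names are hypothetical, and the text does not state which software the authors used for these calculations.

```python
def bray_curtis(p, q):
    """Bray-Curtis dissimilarity: sum|p_i - q_i| / sum(p_i + q_i)."""
    den = sum(a + b for a, b in zip(p, q))
    if den == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(p, q)) / den


def binary_jaccard(p, q):
    """Jaccard distance on presence/absence, ignoring abundances."""
    present_p = {i for i, a in enumerate(p) if a > 0}
    present_q = {i for i, b in enumerate(q) if b > 0}
    union = present_p | present_q
    if not union:
        return 0.0
    return 1 - len(present_p & present_q) / len(union)


# Toy phageprints over the same four OTUs (hypothetical values).
day1 = [0.6, 0.3, 0.1, 0.0]   # subject A, first time point
day2 = [0.5, 0.4, 0.1, 0.0]   # subject A, later time point
other = [0.0, 0.8, 0.0, 0.2]  # subject B

intra = bray_curtis(day1, day2)   # same subject across time
inter = bray_curtis(day1, other)  # different subjects
```

In this toy case the intra-personal distance is much smaller than the inter-personal one under both metrics, which mirrors the pattern the paragraph describes.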

Machine learning models detect with high precision and recall an individual’s phageprint even when phageprints are heavily downsampled

In addition to these distance metrics, we were motivated to build machine learning models whose performance could further quantify the predictability of an individual’s phageprint within the temporal cohort. We first built several types of machine learning models, including Logistic Regression (LR), K-Nearest Neighbor (KNN), Gradient Boosting Classifier (GBC), and Multi-Layer Perceptron (MLP), each of which performs a binary classification of an individual’s phageprint from the rest (i.e. one-versus-rest models). The input to these models was the OTU table, where the rows are samples (i.e. day 1 to 30 for each subject) and the columns are the OTUs. Across the temporal cohort consisting of ten individuals, ~7300 HB1 OTUs were collectively detected. This table was split for training (70%) and testing (30%) such that models would be trained on 70% of the time points from each individual. To quantify the performance of the models, we performed ten iterations of random train/test splits and report the median and the 95% confidence intervals for the Area Under the Precision-Recall curve (AUPR) and the Area Under the Receiver Operating Characteristic curve (AUROC).
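The evaluation scheme described above (a per-subject 70/30 split, one-versus-rest scoring, and AUROC on held-out time points) can be sketched in plain Python. Everything here is an illustrative assumption rather than the authors' implementation: the data are synthetic, the classifier is a simple k-nearest-neighbor vote using Bray-Curtis distance, only one split is shown, and only AUROC is computed (AUPR and the ten repeated splits are omitted for brevity).

```python
import random


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    den = sum(a + b for a, b in zip(p, q))
    return sum(abs(a - b) for a, b in zip(p, q)) / den if den else 0.0


def stratified_split(labels, train_fraction, rng):
    """Put ~train_fraction of each subject's time points into training."""
    train, test = [], []
    for subject in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == subject]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train += idx[:cut]
        test += idx[cut:]
    return train, test


def knn_score(x, train_rows, train_labels, target, k=3):
    """One-versus-rest score: share of the k nearest neighbors from target."""
    order = sorted(range(len(train_rows)),
                   key=lambda i: bray_curtis(x, train_rows[i]))
    return sum(train_labels[i] == target for i in order[:k]) / k


def auroc(scores, is_positive):
    """AUROC via the rank (Mann-Whitney) formulation; ties count as 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic cohort: 3 subjects x 10 time points, each with its own dominant OTU.
rng = random.Random(0)
rows, labels = [], []
for subject, dominant in enumerate([0, 1, 2]):
    for _ in range(10):
        counts = [rng.randint(1, 5) for _ in range(4)]
        counts[dominant] += 40
        total = sum(counts)
        rows.append([c / total for c in counts])
        labels.append(subject)

train, test = stratified_split(labels, 0.7, rng)
train_rows = [rows[i] for i in train]
train_labels = [labels[i] for i in train]
target = 0  # classify "subject 0" against the rest
scores = [knn_score(rows[i], train_rows, train_labels, target) for i in test]
result = auroc(scores, [labels[i] == target for i in test])
```

Splitting within each subject, rather than across the whole table, ensures that every individual contributes time points to both the training and the test set, as the paragraph specifies.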

All model types performed remarkably well with very high performances for both the Logistic Regression and the Multi-Layer Perceptron model types (Figure 4; Supplementary Table S2). We performed the same exercise on an OTU table built from HA terminase family OTUs, and arrived at similarly high model performances (Supplementary Figures S9; Supplementary Tables). It is important to note that we excluded subject 4 from this particular analysis because we wanted to measure the model’s performance for unrelated individuals, as partners’ coevolving phageprints would be a confounding factor. We also provide models built that include both partners and demonstrate that they have high performances even when highly similar phageprints are included in the dataset (Supplementary Figure S11). For example, using the GBC model type, the lowest AUPR and AUROC median values obtained across subjects were 0.98 and 0.92, respectively.

Given that phageprints are dominated by one or two OTUs, it is reasonable to assume that the exclusion of these dominant OTUs would dissolve the individual-specific and time-persistent nature of phageprints. To formally test this assumption, we removed the top ten most abundant OTUs of each sample from the entire dataset. A total of ~600 OTUs were removed from the dataset, removing on average two thirds of the reads from each sample. Upon removing these OTUs, we rescaled the OTU table such that the relative abundance of the remaining OTUs would again add up to 1. To our surprise, the exclusion of the top most abundant OTUs still resulted in nearly perfect classification (Supplementary Tables S6). We further randomly downsampled to 2% of the total remaining OTUs, resulting in just 226 OTUs, and rescaled the resulting OTU table as previously described. The performance of the models still remained nearly as high as before (Supplementary Tables S8).
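The two perturbations described above (removing each sample's most abundant OTUs from the whole table, then keeping a random subset of the remaining OTUs, with rescaling after each step) can be sketched as below. The function names and the toy table are hypothetical. The example uses the top 1 OTU per sample and an arbitrary fraction purely to keep the numbers small, whereas the text uses the top ten OTUs and 2%.

```python
import random


def rescale(table):
    """Rescale each row to sum to 1 (all-zero rows are left as zeros)."""
    out = []
    for row in table:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out


def remove_dominant_otus(table, k):
    """Drop, from every sample, any OTU that is in the top k of some sample."""
    drop = set()
    for row in table:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        drop.update(ranked[:k])
    keep = [j for j in range(len(table[0])) if j not in drop]
    return keep, rescale([[row[j] for j in keep] for row in table])


def downsample_otus(table, fraction, rng):
    """Keep a random fraction of OTU columns, then rescale the rows."""
    n = len(table[0])
    n_keep = max(1, round(fraction * n))
    cols = sorted(rng.sample(range(n), n_keep))
    return cols, rescale([[row[j] for j in cols] for row in table])


# Toy relative abundance table: 2 samples x 5 OTUs (hypothetical values).
table = [
    [0.70, 0.20, 0.06, 0.04, 0.00],
    [0.10, 0.75, 0.00, 0.05, 0.10],
]
keep, reduced = remove_dominant_otus(table, k=1)
cols, tiny = downsample_otus(reduced, fraction=0.34, rng=random.Random(0))
```

Because the dominant OTU of one sample is removed from every sample, the set of dropped columns grows with the number of samples, which is consistent with ~600 OTUs being removed in the text even though only ten are taken per sample.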

Phageprints remain identifiable even when drastically subsampled because many low-abundance OTUs have individual-specific patterns of occurrence. When this small subset of the original OTU table is hierarchically clustered (Supplementary Figure S12), most samples from the same individual cluster together, and thus machine learning models can easily pick out an individual’s phageprint from others even using a small fraction of the total data for each subject.


Less than 1% of OTUs are shared across all subjects

We measured the sharing of OTUs across subjects by collapsing the OTU table into a table of subjects by OTUs rather than samples by OTUs, such that if an OTU was identified at any point within the sampling period (~30 days), it is given a value of 1, and 0 otherwise. With this binary table, we created an UpSet plot where the number of OTUs unique to each subject as well as the number of OTUs shared between different sets of subjects is shown (Supplementary Figure S13).
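The collapsing step described above can be sketched as follows: a samples-by-OTUs table is reduced to a subjects-by-OTUs presence table, from which the number of subjects carrying each OTU is counted. The function names, subject labels and counts are invented for the example.

```python
def presence_by_subject(table, sample_subjects):
    """Collapse samples-by-OTUs into subjects-by-OTUs presence (1 or 0).

    An OTU gets a 1 for a subject if it was detected in any of that
    subject's samples during the sampling period.
    """
    n_otus = len(table[0])
    presence = {}
    for row, subject in zip(table, sample_subjects):
        seen = presence.setdefault(subject, [0] * n_otus)
        for j, value in enumerate(row):
            if value > 0:
                seen[j] = 1
    return presence


def subjects_per_otu(presence):
    """Number of subjects in which each OTU was detected."""
    return [sum(col) for col in zip(*presence.values())]


# Toy table: 4 samples (2 per subject) x 4 OTUs, read counts are made up.
table = [
    [5, 0, 1, 0],
    [3, 2, 0, 0],
    [0, 4, 1, 0],
    [0, 6, 2, 1],
]
sample_subjects = ["s1", "s1", "s2", "s2"]

presence = presence_by_subject(table, sample_subjects)
counts = subjects_per_otu(presence)
shared_by_all = sum(c == len(presence) for c in counts)
unique_to_one = sum(c == 1 for c in counts)
```

The per-OTU subject counts are the raw ingredient for an UpSet-style summary, where OTUs are grouped by the exact set of subjects in which they occur.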

Less than 1% (~0.8%) of all OTUs were detected across all subjects. The relative abundance of these generalist OTUs per subject is hierarchically clustered and shown in Supplementary Figure S14. Again, we see that partners cluster most closely together even based on this small subset of OTUs. Finally, a much higher percentage of total OTUs, about 85%, are detected in at least two subjects, and the rest are only detected in one individual. Based on these results, we can conclude that while the same variants may appear in different subjects, the individual specificity of phageprints emerges in large part because the relative abundances of variants are often individual-specific.

FIGURE 3

HB1 phageprint temporal dynamics quantified using pairwise distance metrics and visualized using (A) heatmaps and (B) box-and-whisker plots. The pairwise distance metrics include: Pearson distance (1 - Pearson correlation), Binary Jaccard, Abundance Jaccard, Bray Curtis and unweighted Unifrac. Top: The heatmap scale applies to all heatmaps shown. Subjects 02 and 04 are partners. Samples from each subject are chronologically ordered. Bottom: Intra- and inter-personal distances between HB1 phageprints in 10 subjects, over the span of a month. The outliers, defined as those outside of the 1.5 x IQR (inter-quartile range), are denoted by “ ”. The box-plots corresponding to the comparisons between the couple in this study are highlighted.
