A machine-readable specification for
genomics assays
A. Sina Booeshaghi, Xi Chen, and Lior Pachter
1. Division of Biology and Biological Engineering, California Institute of Technology,
Pasadena, California
2. School of Life Sciences, Southern University of Science and Technology, Shenzhen,
3. Department of Computing and Mathematical Sciences, California Institute of Technology,
Pasadena, California
*Address correspondence to
& lpachter@caltech.edu
Understanding the structure of sequenced fragments from genomics libraries is essential for
accurate read preprocessing. Currently, different assays and sequencing technologies require
custom scripts and programs that do not leverage the common structure of sequence elements
present in genomics libraries. We present seqspec, a machine-readable specification for
libraries produced by genomics assays that facilitates standardization of preprocessing and
enables tracking and comparison of genomics assays. The specification and associated
command-line tool are available at
The proliferation of genomics assays (Ogbeide et al. 2022) has resulted in a corresponding
increase in software for processing the data (Zappia, Phipson, and Oshlack 2018). Frequently,
custom scripts must be created and tailored to the specifics of assays, where developers
reimplement solutions for common preprocessing tasks such as adapter trimming, barcode
identification, error correction, and read alignment (Wu et al. 2022; Ma et al. 2020; Cheow et al.
2016; Healey, Bassham, and Cresko 2022). When software tools are assay specific, parameter
choices in these methods can diverge, making it difficult to perform apples-to-apples
comparisons of data produced by different assays. Furthermore, the lack of preprocessing
standardization makes reanalysis of published data in the context of new data challenging.
While genomics protocols can vary greatly from each other, the libraries they generate share
many common elements. Typically, sequenced fragments will contain one or several “technical
sequences” such as barcodes and unique molecular identifiers (UMIs), as well as biological
sequences that may be aligned to a genome or transcriptome. Standard library preparation kits
generally require that DNA from the libraries is cut, repaired, and ligated to sequencing adapters
(Figure 1). Primers bind to the sequencing adapters and initiate DNA sequencing, whereby
reads are subsequently generated. Illumina sequencing employs a sequencing-by-synthesis
approach in which fluorescently labeled nucleotides are incorporated into single-stranded DNA
and imaged, while PacBio uses zero-mode waveguides for single-molecule detection of dNTP
incorporation. Oxford Nanopore, on the other hand, binds sequencing adapters to pores in a flow
cell, and DNA is sequenced by measuring changes in electrical resistance across the pore (Iizuka,
Yamazaki, and Uemura 2022).
Figure 1: The structure of reads sequenced from genomics libraries. Sequencing libraries
are constructed by combining Atomic Regions to form an adapter-insert-adapter construct. The
seqspec for the assay annotates the construct with Meta Regions.
Many single-cell genomics assays introduce additional library complexity, further complicating
preprocessing. For example, the inDropsv3 (Klein et al. 2015) assay produces variable-length
barcodes, while the 10x Genomics scRNA-seq assay (Zheng et al. 2017) produces fixed-length
barcodes that are derived from a known list of possibilities.
Current file formats such as FASTQ, GenBank, FASTA, and workflow-specific files (Parekh et al.
2018) lack the flexibility to annotate sequenced reads that contain these complex features. In
the absence of sequence annotations, processing can be challenging, limiting the reuse of data
that is stored in publicly accessible databases such as the Sequence Read Archive (Katz et al.
2022). To facilitate utilization of genomics data, a database of assays along with a description of
their associated library structures was assembled by Chen (2020). While this database has
proved to be very useful, its HTML descriptors are not machine-readable. Moreover, the lack of
a formal specification limits the utility and expandability of the database.
The seqspec specification defines a machine-readable file format, based on YAML, that enables
sequence read annotation. Reads are annotated by Regions, which can be nested and
appended to create a seqspec. Regions are annotated with a variety of properties that simplify
the downstream identification of sequenced elements. The following is a list of properties that
can be associated with a Region:
● Region ID: A unique identifier for the Region in the seqspec
● Region type: The type of Region
● Name: A descriptive name for the Region
● Sequence: The specific nucleotide sequence for the Region
● Sequence type: The type of sequence (fixed, onlist, random, joined)
● Minimum length: The minimum length of the sequence for the Region
● Maximum length: The maximum length of the sequence for the Region
● Onlist: The list of permissible sequences from which the Sequence is derived
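As a rough illustration of how these properties fit together, the following Python sketch models a single Region as a dataclass and checks a few consistency rules. The field names mirror the property list above, but the `Region` class, the `validate` helper, the example barcode, and the specific rules are assumptions made for illustration; they are not the seqspec schema or its implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

# Sequence types named in the property list above.
SEQUENCE_TYPES = {"fixed", "onlist", "random", "joined"}


@dataclass
class Region:
    """Hypothetical in-memory model of one Region (not the seqspec API)."""
    region_id: str
    region_type: str
    name: str
    sequence: str
    sequence_type: str
    min_len: int
    max_len: int
    onlist: Optional[List[str]] = None


def validate(region: Region) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem found."""
    errors = []
    if region.sequence_type not in SEQUENCE_TYPES:
        errors.append(f"{region.region_id}: unknown sequence type {region.sequence_type!r}")
    if region.min_len > region.max_len:
        errors.append(f"{region.region_id}: minimum length exceeds maximum length")
    if region.sequence_type == "onlist" and not region.onlist:
        errors.append(f"{region.region_id}: sequence type 'onlist' requires an onlist")
    if region.sequence_type == "fixed" and not (
        region.min_len <= len(region.sequence) <= region.max_len
    ):
        errors.append(f"{region.region_id}: fixed sequence length is outside the length bounds")
    return errors


# A made-up 16 bp barcode Region whose sequence is drawn from a (tiny) onlist.
barcode = Region("barcode", "barcode", "Cell barcode", "N" * 16, "onlist", 16, 16,
                 onlist=["ACGTACGTACGTACGT"])
# A deliberately inconsistent Region: minimum length larger than maximum length.
bad_umi = Region("umi", "umi", "UMI", "N" * 12, "random", 12, 10)

print(validate(barcode))
print(validate(bad_umi))
```

Under these assumed rules, the first Region passes and the second is flagged for its inverted length bounds.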
Some Regions, known as meta Regions, can contain other Regions, a property that is useful for
grouping and identifying sequence types that are contained in reads. The YAML format is a
natural language to represent nested meta-Regions in a human-readable fashion. Python-style
indentation and syntax can be used to create a human-readable file format without the
excessive grouping delimiters of alternative languages such as JSON. Additionally, nested
Regions allow Assays to be represented as an ordered tree where the ordering of subtrees is
significant: atomic Regions are “glued” together in an order that is concordant with the design of
the sequencing library.
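The ordered-tree view can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `Region` class, its `leaves` and `length_bounds` methods, and the example library layout are all hypothetical and chosen only to show that a depth-first, left-to-right traversal recovers the atomic Regions in library order.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Region:
    """Hypothetical node of an ordered tree; children are kept in library order."""
    region_id: str
    min_len: int = 0
    max_len: int = 0
    regions: List["Region"] = field(default_factory=list)

    def leaves(self) -> List["Region"]:
        """Return the atomic (leaf) Regions in left-to-right order."""
        if not self.regions:
            return [self]
        out: List["Region"] = []
        for child in self.regions:
            out.extend(child.leaves())
        return out

    def length_bounds(self) -> Tuple[int, int]:
        """Length bounds of a meta Region are the sums over its children."""
        if not self.regions:
            return (self.min_len, self.max_len)
        bounds = [child.length_bounds() for child in self.regions]
        return (sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds))


# Made-up layout: two meta Regions, each grouping atomic Regions.
library = Region("library", regions=[
    Region("technical", regions=[
        Region("barcode", 16, 16),
        Region("umi", 12, 12),
    ]),
    Region("biological", regions=[
        Region("cdna", 1, 98),
    ]),
])

print([r.region_id for r in library.leaves()])  # ['barcode', 'umi', 'cdna']
print(library.length_bounds())                  # (29, 126)
```

Because the children are stored in a list rather than a set or mapping, swapping two subtrees changes the traversal order, which is what makes the ordering significant.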
seqspec files are machine-readable, and seqspec data can be parsed, processed,
and extracted with the seqspec command-line tool. The tool contains six subcommands that
enable various tasks such as specification checking, finding, formatting, and indexing:
1. seqspec check: check the correctness of attributes against the seqspec specification
2. seqspec find: print Region metadata
3. seqspec format: auto-populate Region metadata for meta Regions
4. seqspec index: extract the 0-indexed position of Regions
5. seqspec print: print an HTML or Markdown file that visualizes the seqspec
6. seqspec split: split a FASTQ file by Regions
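The idea behind positional indexing can be sketched in a few lines of Python. Assuming every atomic Region in a read has a known fixed length (a simplification, since variable-length Regions would need minimum and maximum bounds), the 0-indexed, half-open interval of each Region follows from a running sum. The function `index_regions` and the example lengths are hypothetical and do not reproduce the behavior or output format of the seqspec tool.

```python
from typing import Dict, List, Tuple


def index_regions(regions: List[Tuple[str, int]]) -> Dict[str, Tuple[int, int]]:
    """Map each region ID to its 0-indexed, half-open (start, stop) interval.

    `regions` lists (region_id, length) pairs in the order they appear in the read.
    """
    positions: Dict[str, Tuple[int, int]] = {}
    start = 0
    for region_id, length in regions:
        positions[region_id] = (start, start + length)
        start += length
    return positions


# Made-up read layout: a 16 bp barcode followed by a 12 bp UMI.
read_1 = [("barcode", 16), ("umi", 12)]
positions = index_regions(read_1)
print(positions)  # {'barcode': (0, 16), 'umi': (16, 28)}
```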
Figure 2: Uniform processing enabled with seqspec. The seqspec index command produces
a technology string that identifies appropriate sequence elements and can be passed into
processing tools.
To illustrate how seqspec can be used to facilitate processing and analysis of single-cell
RNA-seq reads, we implemented in the seqspec index command the facility to produce the
relevant technology string for three single-cell RNA-seq preprocessing tools: kallisto bustools
(Melsted et al. 2021), simpleaf/alevin-fry (He et al. 2022), and STARsolo (Kaminow, Yunusov,
and Dobin 2021) (Figure 2). Regions associated with barcodes, UMIs, and cDNA are extracted,
positionally indexed, and formatted on a per-tool basis. The modularity of seqspec makes it
simple to produce tool-compatible technology strings for other assay types.
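To make the per-tool formatting step concrete, the sketch below assembles a technology string from positional indices. It is modeled on the custom technology string accepted by the `-x` option of kallisto bus, which, to our understanding, has the form `barcode:umi:sequence`, where each part is one or more `file,start,stop` triplets and a stop of 0 denotes the end of the read. The function name `kb_technology_string` and the example coordinates are assumptions for illustration; the strings actually produced by seqspec index may differ.

```python
from typing import List, Tuple

# (file index, start, stop); a stop of 0 is taken to mean "to the end of the read".
Position = Tuple[int, int, int]


def kb_technology_string(
    barcodes: List[Position],
    umis: List[Position],
    cdna: List[Position],
) -> str:
    """Join positions into a 'barcode:umi:sequence' string of file,start,stop triplets."""

    def fmt(group: List[Position]) -> str:
        # Multiple segments of the same kind are concatenated with commas.
        return ",".join(f"{f},{s},{e}" for f, s, e in group)

    return ":".join(fmt(group) for group in (barcodes, umis, cdna))


# Made-up layout: barcode and UMI in the first file, cDNA spanning the second file.
tech = kb_technology_string(
    barcodes=[(0, 0, 16)],
    umis=[(0, 16, 28)],
    cdna=[(1, 0, 0)],
)
print(tech)  # 0,0,16:0,16,28:1,0,0
```

Keeping the formatter separate from the positional indices means that supporting another tool only requires a different formatting function over the same positions.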
Standardized annotation of sequencing reads in a human- and machine-readable format serves
several purposes, including enabling uniform processing, organizing sequencing
assays by their constituent components, and providing transparency for users. The flexibility of seqspec
should allow it to be used for all current sequence census assays (Wold and Myers 2008), and
specifications should be readily adaptable to different sequencing platforms; our initial release of
seqspec contains specifications for 38 assays (see
Comparison of seqspecs for different assays immediately reveals shared similarities and
differences. For example, the SPLiT-seq single-cell RNA and the multimodal SHARE-seq
single-cell assays are aimed at different modalities and utilize different protocols to produce
libraries, but the resultant structures are very similar (Figure 1) since they both rely on split-pool
barcoding (Rosenberg et al. 2018). The seqspec for the sci-CAR-seq assay (Cao et al. 2018),
from which split-pool assays such as SHARE-seq are derived, shows that the cell barcoding is
encoded in the Illumina indices. It should be possible to develop an ontology of assays by
comparing the seqspec specifications of assays and quantifying their similarities and
differences.
In demonstrating that seqspec can be used to define options for preprocessing tools, we have
shown that seqspec is immediately useful for uniform processing of genomics data. The
preprocessing applications will hopefully incentivize data generators to define and deposit
seqspec files alongside sequencing reads in public archives such as the Sequence Read
Archive. While seqspec is not a suitable format for general metadata storage, the precise
specification of sequence elements present in reads, including sequencer-specific constructs,
should be helpful in identifying batch effects even when metadata is missing or inaccurate.
We thank Delaney Sullivan for helpful discussions and Rahma Elsiesy for helpful feedback on
Figure 1. Discussions with the Impact of Genomics Variation on Function (IGVF) Single-Cell
Focus Group helped to shape some features of seqspec. Thanks to Idan Gabdank for useful
feedback on seqspec and for suggesting the md5 checksum
Meichen Fang contributed the
A.S.B. and L.P. were supported in part by NIH
