Flexible parsing, interpretation, and editing of technical sequences with splitcode

Sequence analysis


Delaney K. Sullivan 1,2 and Lior Pachter 2,3,*

1 UCLA-Caltech Medical Scientist Training Program, David Geffen School of Medicine, University of California, Los Angeles, CA 90095, United States

2 Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States

3 Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, United States

* Corresponding author. Division of Biology and Biological Engineering and Department of Computing and Mathematical Sciences, California Institute of Technology, 1200 E California Blvd, Pasadena, CA, 91125, United States. E-mail: lpachter@caltech.edu (L.P.)

Associate Editor: Can Alkan

Abstract

Motivation: Next-generation sequencing libraries are constructed with numerous synthetic constructs such as sequencing adapters, barcodes, and unique molecular identifiers. Such sequences can be essential for interpreting results of sequencing assays, and when they contain information pertinent to an experiment, they must be processed and analyzed.

Results: We present a tool called splitcode that enables flexible and efficient parsing, interpreting, and editing of sequencing reads. This versatile tool facilitates simple, reproducible preprocessing of reads from libraries constructed for a large array of single-cell and bulk sequencing assays.

Availability and implementation: The splitcode program is available at http://github.com/pachterlab/splitcode.

1 Introduction

The reads that result from next-generation sequencing libraries can contain many types of synthetic constructs, or technical sequences, including adapters, primers, indices, barcodes, and unique molecular identifiers (UMIs) (Kivioja et al. 2011, Martin 2011, Kebschull and Zador 2018, Melsted et al. 2019, Johnson et al. 2023, Booeshaghi et al. 2024). These oligonucleotide sequences are defined by the technicalities of sequencing-based assays and experiments, with each sequence being either a completely unknown sequence, a known sequence, or an unknown sequence that is a member of a set of known sequences. There are many read preprocessing tools for editing and extracting information from such sequences, including the widely used tools cutadapt (Martin 2011), fastp (Chen et al. 2018), and Trimmomatic (Bolger et al. 2014) for adapter and quality trimming; UMI-tools (Smith et al. 2017) and zUMIs (Parekh et al. 2018) for UMI processing; BBTools (https://sourceforge.net/projects/bbtools/) (Bushnell et al. 2017) and reaper for more general filtering operations; INTERSTELLAR for read structure interpretation (Kijima et al. 2023); and Picard (https://github.com/broadinstitute/picard) and fgbio (https://github.com/fulcrumgenomics/fgbio) for many read manipulation operations, among many other tools (Kong 2011, Roehr et al. 2017, Liu 2019, Battenberg et al. 2022, Cheng et al. 2024). Many of these tools define a “read structure” to describe the layout of a read; e.g. fgbio uses a sequence of <number><operator> pairs where the number is the length of a segment and the operator describes how the segment should be processed.
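To make the read structure idea concrete, the following minimal Python sketch parses a string of <number><operator> pairs and uses it to slice a read into segments. The operator letters (T, B, M, S) and the use of + for "the remainder of the read" reflect our understanding of the fgbio convention and should be treated as assumptions; the function names are hypothetical, and this is not fgbio's implementation.

```python
import re

# Assumed operator meanings (illustrative; consult the fgbio documentation).
OPERATORS = {
    "T": "template",
    "B": "sample barcode",
    "M": "molecular barcode",
    "S": "skip",
}

def parse_read_structure(structure):
    """Split a string such as '8B12M+T' into (length, operator) pairs.

    A length of None stands for '+', i.e. all remaining bases, and is
    only accepted on the final segment.
    """
    segments = re.findall(r"(\d+|\+)([A-Z])", structure)
    # Reject input that contains anything other than <number><operator> pairs.
    if "".join(n + op for n, op in segments) != structure:
        raise ValueError(f"malformed read structure: {structure!r}")
    parsed = []
    for i, (n, op) in enumerate(segments):
        if op not in OPERATORS:
            raise ValueError(f"unknown operator: {op!r}")
        if n == "+":
            if i != len(segments) - 1:
                raise ValueError("'+' is only allowed on the last segment")
            parsed.append((None, op))
        else:
            parsed.append((int(n), op))
    return parsed

def apply_read_structure(read, structure):
    """Return a list of (segment kind, subsequence) for one read."""
    pieces = []
    pos = 0
    for length, op in parse_read_structure(structure):
        end = len(read) if length is None else pos + length
        pieces.append((OPERATORS[op], read[pos:end]))
        pos = end
    return pieces

example = apply_read_structure("ACGTACGT" + "TTTTGGGGCCCC" + "AAAA", "8B12M+T")
```

Here `example` holds an 8-base sample barcode, a 12-base molecular barcode, and the remaining bases as template. A fixed-length description of this kind is exactly what becomes limiting when segments vary in length or position.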

However, no one tool can adequately address all technical sequence preprocessing tasks. Some methods, such as adapter trimming methods, can only remove identified technical sequences from reads but lack the ability to store information about technical sequences that are relevant to the provenance of the read. Other methods can extract and store technical sequences from reads but are limited to extracting sequences at defined positions of defined lengths within reads, and may present limited options for handling variable-position and variable-length segments. Still other methods are designed for only a specific type of assay, such as single-cell RNA-seq. Technologies such as (long-read) SPLiT-seq (Rosenberg et al. 2018, Rebboah et al. 2021), SPRITE (Quinodoz et al. 2018, 2022), and Smart-seq3 (Hagemann-Jensen et al. 2020) contain complex, multifaceted technical sequences that are currently processed by custom scripts or specific use-case modifications to existing tools.

Here, we present splitcode, which introduces versatile new features for general preprocessing needs. splitcode is a flexible solution with a low memory and computational footprint that can reliably, efficiently, and error-tolerantly preprocess technical sequences based on a user-supplied structure of how those sequences are organized within reads. For example, splitcode can simultaneously trim technical sequences, parse combinatorial barcodes that are variable in length and inconsistent in location within a read, and extract UMIs that are defined in location with respect to other technical sequences rather than at a set position within a read. These features make splitcode a suitable tool for processing variable-length staggers at the start of reads; such staggers are often introduced to enhance nucleotide diversity during the early cycles of sequencing, preventing monotemplate issues that would arise from sequencing identical nucleotides during those cycles. The technical sequences that splitcode may be useful for identifying include not only barcodes or UMIs but also ligation linkers, integrase attachment sites, and Tn5 transposase mosaic ends.

Moreover, splitcode can seamlessly interface with other command-line tools, including other sequencing read preprocessors as well as read mappers, by streaming the preprocessed reads into those tools. Thus, splitcode can eliminate the need to write an entirely new file to disk at every step of preprocessing, a practice that currently results in inefficient use of time and disk space. Furthermore, splitcode can stream reads into itself, enabling multiple preprocessing steps to be performed in sequence for more complicated assays.

Received: 12 December 2023; Revised: 14 March 2024; Editorial Decision: 16 April 2024; Accepted: 12 June 2024

© The Author(s) 2024. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Bioinformatics, 2024, 40(6), btae331. https://doi.org/10.1093/bioinformatics/btae331. Advance Access Publication Date: 14 June 2024. Applications Note

2 Materials and methods

2.1 Tag sequence identification

Each sequence in the splitcode config file, along with all sequences within the sequence’s allowable Hamming distance and/or indel error tolerance, is indexed in a hash map. Each sequence is associated with the tag(s) from which it originated. Reads in FASTQ files are scanned from start to end to identify tags based on hash map lookups. Additionally, users can specify locations and conditions within which a specific tag may appear, and only tags satisfying such conditions are identified. Further, restricting tag identification to specific regions of reads reduces the number of hash map queries, thereby improving runtime.
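The indexing and lookup strategy described above can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions, not the splitcode implementation: it handles substitutions only (no indels), uses a plain dict as the hash map, and all function names and example sequences are hypothetical.

```python
from itertools import combinations, product

def hamming_neighbors(seq, max_dist, alphabet="ACGT"):
    """Return seq plus every sequence within max_dist substitutions of it."""
    variants = {seq}
    for d in range(1, max_dist + 1):
        for positions in combinations(range(len(seq)), d):
            for bases in product(alphabet, repeat=d):
                chars = list(seq)
                for pos, base in zip(positions, bases):
                    chars[pos] = base
                variants.add("".join(chars))
    return variants

def build_index(tags):
    """Map every permitted variant to the name(s) of the tag it came from.

    tags: dict of tag name -> (sequence, allowed Hamming distance).
    """
    index = {}
    for name, (seq, dist) in tags.items():
        for variant in hamming_neighbors(seq, dist):
            index.setdefault(variant, set()).add(name)
    return index

def find_tags(read, index, start=0, end=None):
    """Scan read[start:end] left to right; return (position, tag names) hits."""
    end = len(read) if end is None else min(end, len(read))
    lengths = sorted({len(variant) for variant in index})
    hits = []
    pos = start
    while pos < end:
        for k in lengths:
            if pos + k > end:
                break  # longer candidates would run past the search window
            names = index.get(read[pos:pos + k])
            if names:
                hits.append((pos, sorted(names)))
                pos += k - 1  # resume scanning just after the matched tag
                break
        pos += 1
    return hits

tags = {"Barcode_A1": ("ACGT", 1), "Barcode_B1": ("GGCCA", 0)}
index = build_index(tags)
hits = find_tags("TTAGGTCCGGCCATT", index)
```

Each candidate substring costs one dictionary lookup, so narrowing `start` and `end` directly reduces the number of queries, which is the runtime benefit noted in the text.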

2.2 Final barcode sequences

Each combination of tags is assigned a numerical ID, which begins at 0 and is incremented for every newly encountered combination. Each numerical ID, a 32-bit unsigned integer, can be converted to a unique 16-bp final barcode sequence by mapping each nucleotide to a 2-bit binary representation as follows: A = 00, C = 01, G = 10, T = 11. It follows that the numerical ID can be represented in nucleotide space based on the integer’s binary representation. For example, the numerical ID 0 is AAAAAAAAAAAAAAAA, the numerical ID 1 is AAAAAAAAAAAAAAAC, and the numerical ID 30 is AAAAAAAAAAAAACTG. This interconversion between numerical IDs and nucleotide sequences facilitates simplifying complex barcodes.
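The 2-bit encoding lends itself to a short worked example. The Python sketch below assumes only the mapping stated in the text (A = 00, C = 01, G = 10, T = 11) and a most-significant-bits-first layout, which is what the ID 30 example implies; the function names are hypothetical and are not taken from the splitcode source.

```python
NUCLEOTIDES = "ACGT"  # index in this string is the 2-bit code: A=0, C=1, G=2, T=3

def id_to_barcode(numerical_id, length=16):
    """Write an unsigned integer as a fixed-length nucleotide string."""
    if not 0 <= numerical_id < 4 ** length:
        raise ValueError("ID does not fit in the requested barcode length")
    bases = []
    # Walk the integer two bits at a time, most significant pair first.
    for shift in range(2 * (length - 1), -1, -2):
        bases.append(NUCLEOTIDES[(numerical_id >> shift) & 0b11])
    return "".join(bases)

def barcode_to_id(barcode):
    """Inverse of id_to_barcode."""
    value = 0
    for base in barcode:
        value = (value << 2) | NUCLEOTIDES.index(base)
    return value

def assign_ids(tag_combinations):
    """Give each distinct tag combination the next unused ID, starting at 0."""
    ids = {}
    for combo in tag_combinations:
        ids.setdefault(tuple(combo), len(ids))
    return ids

barcode_30 = id_to_barcode(30)  # 30 = 0b011110 -> 01 11 10 -> C T G
```

Because 30 is 011110 in binary, its last three bases are C, T, and G, preceded by thirteen A's for the leading zero bits. A 32-bit ID always fits in 16 bases because each base carries exactly two bits.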

2.3 Software

The splitcode software is written in C++11 and is freely available and open source under the BSD 2-Clause license. The framework for splitcode is a C++ header file, making it possible to incorporate splitcode directly into a software project that involves processing sequencing reads. The GUI for the software is implemented as an HTML webpage and uses Emscripten for compilation of the software to WebAssembly. No new data were generated or analyzed for this article describing the splitcode software. Documentation for the software is available at https://splitcode.readthedocs.io/.

3 Results

3.1 Framework and usage

We refer to the synthetic constructs, or technical sequences, that can be identified in reads as tags. Subsequences can be extracted using identified tags as anchor points or can be extracted at user-defined positions within reads. This feature is particularly useful for unique molecular identifier (UMI) sequences, which are generally unknown sequences that exist at defined locations within reads. Additionally, in the config file, a user can specify read editing options, including trimming and whether identified tags should be replaced with a particular sequence. Thus, identified technical sequences can be modified or trimmed in situ. Taken together, this array of options makes it possible for splitcode to parse data from a large variety of sequencing assays, including those with many levels of multiplexing (Fig. 1).

Following construction of the config file (Fig. 2), users can supply the config file to the splitcode program on the command line. Users can further specify the output options for how the final barcode, the (possibly edited) reads, and the extracted subsequences should be output. The program presents many options for outputting reads, allowing seamless integration with many downstream tools. Importantly, the output can be interleaved and directed to standard output, which can then be piped directly into tools (including splitcode itself, if another round of read processing is needed) that support such input. This feature makes it possible to send processed reads directly to a read mapper, thereby avoiding the inefficiency of creating large intermediate files on disk.
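For readers unfamiliar with interleaved output, the Python sketch below shows what interleaving paired-end FASTQ records into a single stream amounts to: each R1 record is immediately followed by its R2 mate. It is a generic illustration with hypothetical function names and toy records, assuming well-formed four-line FASTQ records, and says nothing about how splitcode implements this internally.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        yield header, sequence, separator, quality

def interleave(r1_handle, r2_handle, out):
    """Write R1/R2 records alternately to out; return the number of pairs."""
    pairs = 0
    for rec1, rec2 in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        for record in (rec1, rec2):
            out.write("\n".join(record) + "\n")
        pairs += 1
    return pairs

# Toy input held in memory so the example is self-contained.
r1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGGG\n+\nIIII\n")
r2 = io.StringIO("@read1/2\nTTGA\n+\nIIII\n@read2/2\nCCCC\n+\nIIII\n")
out = io.StringIO()
n_pairs = interleave(r1, r2, out)
```

Passing `sys.stdout` as `out` would let a downstream program that accepts interleaved input consume the stream through a pipe, so no intermediate file is written.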

3.2 Capabilities

The splitcode program has many options, some of which can be supplied in the config file and others of which (namely the output options) must be supplied on the command line. In the config file, a user can specify “sequence identification” options, for finding tags in reads and editing reads in situ based on identified tags, as well as “read modification and extraction” options, for general read trimming and extracting UMI-like sequences. The latter option group is supplied in the header of the config file, while the “sequence identification” options are supplied as tab-separated values in a tabular format in the file, an example of which is shown in Fig. 2. A list of some of the splitcode config file options is provided in Supplementary Table S1.

A graphical user interface (GUI) for splitcode facilitates its usage (Supplementary Fig. S1). This GUI exists as a web page and helps a user create a config file, which can then be downloaded. Additionally, this GUI enables live testing of configuration options on user-supplied sample sequences.


Finally, splitcode is efficient software: on 150-bp paired-end reads in gzip FASTQ format, splitcode can reach throughputs exceeding 10 million reads per minute with memory usage on the order of a few hundred megabytes on a standard laptop, although these performance results vary depending on the task at hand.

4 Discussion

The preprocessing of FASTQ files is an important first step in bioinformatics pipelines. This step is frequently inefficient, involving multiple steps with the creation of large intermediate files or the writing and running of custom unoptimized scripts, which can be challenging with large-scale sequencing data. splitcode alleviates some of these inefficiencies via a modular and flexible design to effectively and efficiently handle intricate, hierarchical read structures produced by technologies with many layers of multiplexing. While many of splitcode’s features overlap with those of existing bioinformatics software, splitcode is not intended to fully recapitulate all the features of existing tools or to replace or outperform any one tool. Rather, splitcode is intended to serve as one additional, flexible, and versatile tool in a bioinformatics arsenal, and has been designed to be interoperable with other tools.

splitcode operates not as an alignment algorithm, but on a principle of dictionary lookups. In this approach, technical sequences along with their permissible mismatches are cataloged in a hash table. This makes splitcode apt for scenarios requiring identification, interpretation, and modification of short sequences within reads, and it effectively manages extensive lists of lookup sequences. Algorithms like cutadapt, which use a dynamic programming score matrix to optimize alignment, are more suitable for cases, such as general adapter trimming, that require finding the best possible alignment between two sequences or for finding long technical sequences (in which case, storing the allowable mismatches in a hash table is computationally infeasible).
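A quick calculation illustrates why enumerating mismatches stops scaling for long sequences. For a sequence of length L over the four-letter DNA alphabet, the number of sequences within Hamming distance d is the sum over k = 0..d of C(L, k) · 3^k. The Python sketch below evaluates this formula; it counts substitutions only (indels would add further entries), and the lengths and distances used are illustrative choices rather than values from the paper.

```python
from math import comb

def hamming_ball_size(length, max_dist, alphabet_size=4):
    """Count the sequences within max_dist substitutions of one sequence."""
    return sum(
        comb(length, k) * (alphabet_size - 1) ** k
        for k in range(max_dist + 1)
    )

short_tag = hamming_ball_size(8, 1)      # an 8-bp barcode, 1 mismatch allowed
medium_tag = hamming_ball_size(16, 2)    # a 16-bp barcode, 2 mismatches allowed
long_adapter = hamming_ball_size(60, 6)  # a 60-bp sequence, 6 mismatches allowed
```

An 8-bp barcode with one allowed mismatch needs only 25 hash table entries and a 16-bp barcode with two needs 1129, whereas a 60-bp sequence with six allowed mismatches would need more than 10^10 entries, at which point an alignment-based method is the more practical choice.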

We anticipate that splitcode will be used in tandem with other preprocessing tools to provide an effective solution for many bioinformatics needs. Furthermore, we expect that splitcode will continue to expand in functionality based on user feedback, user needs, and possibly the introduction of more complicated read structures that may arise from the development of novel sequence census assays.

Acknowledgements

We thank Benjamin T. Yeh (Caltech) and the laboratory of Mitchell Guttman (Caltech) for discussions which motivated this project. Some of the splitcode source code is derived from source code written by Páll Melsted (University of Iceland), and we are grateful to him for sharing his source code with us. Thanks to A. Sina Booeshaghi for helpful discussions. Thanks to Nils Homer (Fulcrum Genomics LLC) and two other anonymous reviewers for constructive feedback on the manuscript and the software. Illustrations were created with BioRender: http://biorender.com.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Figure 1. Overview of the splitcode workflow. The splitcode program takes in a set of FASTQ files and a user-specified config file, which serves as a recipe describing how the reads should be parsed. The user executes splitcode on the command line, specifying command-line options on how the output should be formatted. The output consists of one or more of the following: the original FASTQ files (possibly edited), the extracted sequences (e.g. UMI sequences which are unknown and need to be extracted by using location information or anchor points), and the final barcodes, which are unique for each combination of identified tags. The output may take the form of FASTQ files, gzip-compressed FASTQ files, BAM files, or interleaved sequences directed to standard output, depending on what the user specifies.


Funding

This work was supported by the UCLA-Caltech Medical Scientist Training Program [NIH NIGMS training grant T32 GM008042 to D.K.S.]; and in part by the National Institutes of Health (NIH) [U19MH114830, 5UM1HG012077-02 to L.P.].

References

Battenberg K, Kelly ST, Ras RA et al. A flexible cross-platform single-cell data processing pipeline. Nat Commun 2022;13:6847.

Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30:2114–20.

Booeshaghi AS, Chen X, Pachter L. A machine-readable specification for genomics assays. Bioinformatics 2024;40:btae168. https://doi.org/10.1093/bioinformatics/btae168

Bushnell B, Rood J, Singer E. BBMerge – accurate paired shotgun read merging via overlap. PLoS One 2017;12:e0185056.

Chen S, Zhou Y, Chen Y et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018;34:i884–90.

Cheng O, Ling MH, Wang C et al. Flexiplex: a versatile demultiplexer and search tool for omics data. Bioinformatics 2024;40:btae102. https://doi.org/10.1093/bioinformatics/btae102

Hagemann-Jensen M, Ziegenhain C, Chen P et al. Single-cell RNA counting at allele and isoform resolution using Smart-seq3. Nat Biotechnol 2020;38:708–14.

Johnson MS, Venkataram S, Kryazhimskiy S. Best practices in designing, sequencing, and identifying random DNA barcodes. J Mol Evol 2023;91:263–80.

Kebschull JM, Zador AM. Cellular barcoding: lineage tracing, screening and beyond. Nat Methods 2018;15:871–9.

Kijima Y, Evans-Yamamoto D, Toyoshima H et al. A universal sequencing read interpreter. Sci Adv 2023;9:eadd2793.

Kivioja T, Vähärautio A, Karlsson K et al. Counting absolute numbers of molecules using unique molecular identifiers. Nat Methods 2011;9:72–4.

Figure 2. Example of splitcode usage. The structure of the reads from this hypothetical sequencing technology contains multiple regions that need to be parsed, including some of variable length. In the config file, each region that needs to be parsed is organized into groups and each “group” contains multiple tags. The tags in the grp_A group have the value 1 in the “distance” column, meaning a Hamming distance 1 error tolerance. The values in the “next” column indicate that after a grp_A tag (i.e. Barcode_A1, Barcode_A2, or Barcode_A3) is found, we should next search only for tags in the grp_B group. The “maxFindsG” values of 1 mean that the maximum number of times a specific group can be found is 1 (e.g. after finding a tag in grp_A, stop searching for tags in grp_A). The “location” for grp_A tags has the value 0:0:5, meaning that the tag is found in file #0 (i.e. the R1 file) within positions 0–5 of the read; for grp_B tags, splitcode searches file #0 within positions 5–100. In the header of the config file, the @extract option contains an expression indicating that we should extract an 8-bp sequence, which we name umi, 3 bases following identification of a grp_B tag. The supplied @trim-3 option means that only 3′-end trimming of 0 bases and 4 bases of the R1 file and the R2 file, respectively, should be performed. Thus, here, the output R1 file will contain the original R1 sequences (i.e. the entirety of Barcode A, Region 1, Barcode B, NNN, UMI, and Region 2) while the output R2 file will contain just the cDNA. The output “Final Barcodes” FASTQ file will contain a sequence uniquely identifying a combination of tags, and the mapping file allows us to map the final barcode sequence back to the tag combination (the numbers in the right-most column of the mapping file represent how many reads that tag combination was found in). Finally, it is important to note that this is simply one of many ways to parse this read structure with splitcode, and users can configure the options as they see fit. Further, users can also customize the output options. For example, users can choose to output reads that contain both grp_A and grp_B tags into one set of files and direct all other reads into a separate set of files, and users can choose whether to output the 8-bp UMI sequence into an independent file or to put it in the FASTQ header of the outputted reads. Users also have the option to output reads as a BAM file with the 8-bp UMI sequence encoded in a SAM tag.
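The anchor-based extraction and 3′-end trimming described in this caption can be mimicked with a few lines of Python. The sketch below is a simplified stand-in, not splitcode itself: it locates the anchor by exact string matching rather than error-tolerant lookup, and the function names and toy sequences are hypothetical.

```python
def extract_umi(read, anchor, spacer=3, umi_length=8, search_start=0):
    """Return the umi_length bases that begin spacer bases after the anchor.

    Returns None when the anchor is absent or the read is too short.
    """
    pos = read.find(anchor, search_start)
    if pos == -1:
        return None
    start = pos + len(anchor) + spacer
    umi = read[start:start + umi_length]
    return umi if len(umi) == umi_length else None

def trim_3prime(read, n_bases):
    """Remove n_bases from the 3' end of a read (0 leaves it unchanged)."""
    return read[:len(read) - n_bases] if n_bases > 0 else read

# Toy R1 layout: Barcode A | Region 1 | Barcode B | NNN | UMI | Region 2
r1 = "ACGT" + "TT" + "GGCCA" + "CAT" + "AACCGGTT" + "GAGA"
umi = extract_umi(r1, anchor="GGCCA", search_start=5)

# Toy R2: cDNA followed by 4 bases to be trimmed from the 3' end.
r2_trimmed = trim_3prime("ACGTACGTAAAA", 4)
```

Because the UMI is located relative to the Barcode B anchor, it is recovered correctly even when Region 1 varies in length, which a fixed-position extraction could not guarantee.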


Kong Y. Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics 2011;98:152–3.

Liu D. Fuzzysplit: demultiplexing and trimming sequenced DNA with a declarative language. PeerJ 2019;7:e7170.

Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17:10.

Melsted P, Ntranos V, Pachter L. The barcode, UMI, set format and BUStools. Bioinformatics 2019;35:4472–3.

Parekh S, Ziegenhain C, Vieth B et al. zUMIs - a fast and flexible pipeline to process RNA sequencing data with UMIs. Gigascience 2018;7:giy059. https://doi.org/10.1093/gigascience/giy059

Quinodoz SA, Bhat P, Chovanec P et al. SPRITE: a genome-wide method for mapping higher-order 3D interactions in the nucleus using combinatorial split-and-pool barcoding. Nat Protoc 2022;17:36–75.

Quinodoz SA, Ollikainen N, Tabak B et al. Higher-order inter-chromosomal hubs shape 3D genome organization in the nucleus. Cell 2018;174:744–57.e24.

Rebboah E, Reese F, Williams K et al. Mapping and modeling the genomic basis of differential RNA isoform expression at single-cell resolution with LR-Split-seq. Genome Biol 2021;22:286.

Roehr JT, Dieterich C, Reinert K. Flexbar 3.0 – SIMD and multicore parallelization. Bioinformatics 2017;33:2941–2.

Rosenberg AB, Roco CM, Muscat RA et al. Single-cell profiling of the developing mouse brain and spinal cord with split-pool barcoding. Science 2018;360:176–82.

Smith T, Heger A, Sudbery I. UMI-tools: modeling sequencing errors in unique molecular identifiers to improve quantification accuracy. Genome Res 2017;27:491–9.
