
Cloud-based Software for NGS Data Management and Analysis for Directed Evolution of Peptide-Based Delivery Vectors

Padia, Umesh and Brown, David and Ding, Xiaozhe and Chen, Xinhong and Kumar, Sripriya R. and Gradinaru, Viviana (2020) Cloud-based Software for NGS Data Management and Analysis for Directed Evolution of Peptide-Based Delivery Vectors. Molecular Therapy, 28 (4). pp. 434-435. ISSN 1525-0016. doi:10.1016/j.ymthe.2020.04.019.

Full text is not posted in this repository. Consult Related URLs below.


Adeno-associated viruses (AAVs) are widely used gene delivery vectors because they transduce both dividing and non-dividing cells, persist long term, and have low immunogenicity. However, natural AAV serotypes have a limited set of tropisms. Directed evolution, informed by next-generation sequencing data, has been used to engineer recombinant AAVs that target specific cell types and tissues. The deluge of data from these deep sequencing experiments has created data management and analysis challenges for which no commercial solutions are currently available. Furthermore, classical approaches to analyzing directed evolution data rely heavily on manual inspection and often overlook patterns present in the larger datasets. To address these challenges, we developed robust cloud-based software that centrally manages next-generation sequencing data, extracts variants, performs structural modeling, and can be extended to incorporate machine learning models that make predictions for variants with specific properties. The software is composed of a set of interconnected, discrete components: a modern web user interface implemented in JavaScript with React, a relational database, a distributed task queue, task workers, and a Django-based API. This architecture allows computationally intensive tasks such as alignments, structural modeling, and machine learning to scale from a single machine to hundreds of machines with minimal configuration. The software automatically imports and manages sequencing data from several commercial and in-house sequencing providers. When data are imported, sequence quality metrics are automatically generated and presented to the user. Variants are extracted by performing pairwise alignments between the natural serotype and the sequencing reads. The variants are then encoded into embeddings, grouped into families, and analyzed for prevalent sequence motifs. We use the Rosetta software libraries to perform comparative modeling simulations on selected variants. Finally, we are developing, and already provide extension support for, PyTorch-based machine learning models that generate novel variants with desirable properties and select candidate variants for additional rounds of optimization and characterization. This software is a general tool for simple, scalable, and centralized analysis of next-generation sequencing data for protein engineering by directed evolution, and it could be generalized in the future to any project with large-scale deep sequencing datasets.
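
The abstract states that variants are extracted by pairwise alignment between the natural serotype and the sequencing reads, but it does not say which aligner, scoring scheme, or output format the software uses. The following is only a minimal sketch of that general idea, assuming a plain Needleman-Wunsch global alignment with a linear gap penalty on toy nucleotide strings. The function names `global_align` and `call_variants`, the scoring values, and the tuple layout are all hypothetical and are not taken from the described software.

```python
from typing import List, Tuple

Variant = Tuple[str, int, str, str]  # (kind, reference position, reference allele, read allele)


def global_align(ref: str, read: str, match: int = 1,
                 mismatch: int = -1, gap: int = -2) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty (illustrative scoring)."""
    n, m = len(ref), len(read)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    aligned_ref, aligned_read = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == read[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            aligned_ref.append(ref[i - 1])
            aligned_read.append(read[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])   # base missing from the read
            aligned_read.append("-")
            i -= 1
        else:
            aligned_ref.append("-")          # extra base present only in the read
            aligned_read.append(read[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_read))


def call_variants(aligned_ref: str, aligned_read: str) -> List[Variant]:
    """List substitutions, deletions, and merged insertions relative to the reference."""
    variants: List[Variant] = []
    pos = 0  # 0-based index of the next reference base
    for r, q in zip(aligned_ref, aligned_read):
        if r == "-":
            # Extend the previous insertion if it sits at the same reference position.
            if variants and variants[-1][0] == "ins" and variants[-1][1] == pos:
                variants[-1] = ("ins", pos, "", variants[-1][3] + q)
            else:
                variants.append(("ins", pos, "", q))
            continue
        if q == "-":
            variants.append(("del", pos, r, ""))
        elif r != q:
            variants.append(("sub", pos, r, q))
        pos += 1
    return variants


if __name__ == "__main__":
    # Toy sequences, not real AAV capsid data.
    print(call_variants(*global_align("AAACCC", "AAAGGGCCC")))    # [('ins', 3, '', 'GGG')]
    print(call_variants(*global_align("ACGTACGT", "ACGAACGT")))   # [('sub', 3, 'T', 'A')]
```

At the scale the abstract describes, the alignment step would more plausibly be handled by an optimized alignment library and distributed across task workers. The quadratic pure-Python loop here is meant only to make the reference-versus-read comparison concrete.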

Item Type: Article
Related URLs:
ORCID:
Ding, Xiaozhe: 0000-0002-0267-0791
Chen, Xinhong: 0000-0003-0408-0813
Kumar, Sripriya R.: 0000-0001-6033-7631
Gradinaru, Viviana: 0000-0001-5868-348X
Additional Information: © 2020 American Society of Gene & Cell Therapy. Available online 28 April 2020.
Issue or Number: 4
Record Number: CaltechAUTHORS:20200604-073231089
Persistent URL:
Official Citation: 2020 ASGCT Annual Meeting Abstracts, Molecular Therapy, Volume 28, Issue 4, Supplement 1, 2020, Pages 1-592, ISSN 1525-0016.
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 103685
Deposited By: Tony Diaz
Deposited On: 04 Jun 2020 15:30
Last Modified: 16 Nov 2021 18:23
