Published April 28, 2020 | public
Journal Article

Cloud-based Software for NGS Data Management and Analysis for Directed Evolution of Peptide-Based Delivery Vectors


Adeno-associated viruses (AAVs) are widely used gene delivery vectors due to their ability to transduce dividing and non-dividing cells, their long-term persistence, and their low immunogenicity. However, natural AAV serotypes have a limited set of tropisms. Directed evolution has been used to engineer recombinant AAVs that target specific cell types and tissues, leveraging next-generation sequencing data. The deluge of data from these deep sequencing experiments has created data management and analysis challenges for which there are currently no commercially available solutions. Furthermore, classical approaches to analyzing data from directed evolution rely heavily on manual inspection and often overlook patterns present in the larger datasets. To address these challenges, we developed robust cloud-based software that provides central management for next-generation sequencing data, extracts variants, performs structural modeling, and can be extended to incorporate machine learning models that make predictions for variants with specific properties. The software is composed of a set of interconnected discrete components: a modern web user interface implemented in JavaScript with React, a relational database, a distributed task queue, task workers, and a Django-based API. This architecture allows computationally intensive tasks such as alignments, structural modeling, and machine learning to scale from a single machine to hundreds of machines with minimal configuration. The software automatically imports and manages sequencing data from several different commercial and in-house sequencing providers. When the data is imported, sequence quality metrics are automatically generated and presented to the user. Variants are extracted by performing pairwise alignments between the natural serotype and the sequencing reads. The variants are further encoded into embeddings, grouped into families, and analyzed for prevalent sequence motifs.
We use the Rosetta software libraries to perform comparative modeling simulations on selected variants. Finally, the software has extension support for PyTorch-based machine learning models, which we are developing to generate novel variants with desirable properties and to select candidate variants for additional rounds of optimization and characterization. This software represents a general tool for simple, scalable, and centralized analysis of next-generation sequencing data for protein engineering by directed evolution, and could in the future be generalized to any project with large-scale deep sequencing datasets.
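
The abstract states that variants are extracted by pairwise alignment between the natural serotype and the sequencing reads, but it does not describe the implementation. As an illustration only, the idea can be sketched with a textbook Needleman-Wunsch global alignment. The function names (`global_align`, `extract_variant`), the scoring values, and the toy sequences are all assumptions made for this sketch and are not taken from the software described here.

```python
def global_align(ref, read, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Scoring values are illustrative assumptions, not the tool's settings.
    Returns the two aligned strings, using '-' for gaps.
    """
    n, m = len(ref), len(read)
    # score[i][j] = best score aligning ref[:i] with read[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == read[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align the two residues
                score[i - 1][j] + gap,      # gap in the read (deletion)
                score[i][j - 1] + gap,      # gap in the reference (insertion)
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_ref, aligned_read = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if ref[i - 1] == read[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_ref.append(ref[i - 1])
                aligned_read.append(read[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_ref.append(ref[i - 1])
            aligned_read.append("-")
            i -= 1
        else:
            aligned_ref.append("-")
            aligned_read.append(read[j - 1])
            j -= 1
    return "".join(reversed(aligned_ref)), "".join(reversed(aligned_read))


def extract_variant(ref, read):
    """List the alignment columns where the read differs from the reference.

    Each entry is (ref_position, ref_residue, read_residue), with 1-based
    reference positions. Insertions carry the position of the preceding
    reference residue and '-' as the reference residue.
    """
    aligned_ref, aligned_read = global_align(ref, read)
    differences = []
    ref_pos = 0
    for r, q in zip(aligned_ref, aligned_read):
        if r != "-":
            ref_pos += 1
        if r != q:
            differences.append((ref_pos, r, q))
    return differences


# Toy sequences (hypothetical, not real capsid data).
print(extract_variant("ACGTACGT", "ACGTTCGT"))      # [(5, 'A', 'T')]
print(extract_variant("AAAATTTT", "AAAACCTTTT"))    # two inserted residues after position 4
```

The quadratic dynamic program is fine for short reads against a short reference window. At the scale the abstract describes, alignment jobs would presumably be dispatched to task workers through the distributed task queue rather than run in a single process.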

Additional Information

© 2020 American Society of Gene & Cell Therapy. Available online 28 April 2020.

Additional details

August 19, 2023
December 22, 2023