Published June 7, 2024 | in press
Journal Article Open

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

Pardo-Palacios, Francisco J.
Wang, Dingjie
Reese, Fairlie
Diekhans, Mark
Carbonell-Sala, Sílvia
Williams, Brian
Loveland, Jane E.
De María, Maite
Adams, Matthew S.
Balderrama-Gutierrez, Gabriela
Behera, Amit K.
Gonzalez Martinez, Jose M.
Hunt, Toby
Lagarde, Julien
Liang, Cindy E.
Li, Haoran
Meade, Marcus Jerryd
Moraga Amador, David A.
Prjibelski, Andrey D.
Birol, Inanc
Bostan, Hamed
Brooks, Ashley M.
Çelik, Muhammed Hasan
Chen, Ying
Du, Mei R. M.
Felton, Colette
Göke, Jonathan
Hafezqorani, Saber
Herwig, Ralf
Kawaji, Hideya
Lee, Joseph
Li, Jian-Liang
Lienhard, Matthias
Mikheenko, Alla
Mulligan, Dennis
Nip, Ka Ming
Pertea, Mihaela
Ritchie, Matthew E.
Sim, Andre D.
Tang, Alison D.
Wan, Yuk Kei
Wang, Changqing
Wong, Brandon Y.
Yang, Chen
Barnes, If
Berry, Andrew E.
Capella-Gutierrez, Salvador
Cousineau, Alyssa
Dhillon, Namrita
Fernandez-Gonzalez, Jose M.
Ferrández-Peral, Luis
Garcia-Reyero, Natàlia
Götz, Stefan
Hernández-Ferrer, Carles
Kondratova, Liudmyla
Liu, Tianyuan
Martinez-Martin, Alessandra
Menor, Carlos
Mestre-Tomás, Jorge
Mudge, Jonathan M.
Panayotova, Nedka G.
Paniagua, Alejandro
Repchevsky, Dmitry
Ren, Xingjie
Rouchka, Eric
Saint-John, Brandon
Sapena, Enrique
Sheynkman, Leon
Smith, Melissa Laird
Suner, Marie-Marthe
Takahashi, Hazuki
Youngworth, Ingrid A.
Carninci, Piero
Denslow, Nancy D.
Guigó, Roderic
Hunter, Margaret E.
Maehr, Rene
Shen, Yin
Tilgner, Hagen U.
Wold, Barbara J. (1)
Vollmers, Christopher
Frankish, Adam
Au, Kin Fai
Sheynkman, Gloria M.
Mortazavi, Ali
Conesa, Ana
Brooks, Angela N.
  • 1. California Institute of Technology

Abstract

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

Copyright and License

© The Author(s) 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Acknowledgement

We thank Lexogen, ONT and PacBio for helpful discussions. ONT provided partial support for flow cells and reagents. We thank T. Sasaki and D. Gilbert for providing the F121-9 hybrid mouse ES cells and K. M. Parsi for assistance with human H1-hES cells and H1-DE cells. We also thank M. Akeson and M. Jain for providing resources and technical advice for Nanopore sequencing. We thank J. Visser for contributing artwork that gives an overview of the LRGASP Consortium. The project is supported by the following grants: Pew Charitable Trust (A.N.B.), NIGMS R35GM138122 (A.N.B.), NHGRI R21HG011280 (A. Conesa, J.M.-T., A.M.-M., A.P. and L.F.-P.), Spanish Ministry of Science PID2020-119537RB-10 (A. Conesa and F.J.P.), NIGMS R35GM142647 (G.M.S.), NIGMS R35GM133569 (C.V.), NHGRI U41HG007234 (J. Lagarde, M.D., R.G., S.C.-S., J.E.L., J.M.G., T.H., I. Barnes, A.E.B., J.M.M. and A.F.), NHGRI F31HG010999 (A.D.T.) and UM1 HG009443 (A. Mortazavi and B.W.), NHGRI R01HG008759 and R01HG011469 (K.F.A., D.W. and H.L.), NHGRI R01HG007182 (I. Birol, K.M.N., S.H. and C.Y.), NHGRI UM1HG009402 (Y.S.), NHMRC Investigator Grant GNT2017257 (M.E.R.), Comunitat Valenciana Grant ACIF/2018/290 (F.J.P.), Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (grant no. 2019-002443 to M.E.R.), an institutional fund from the Department of Biomedical Informatics, The Ohio State University (K.F.A., D.W. and H.L.), an institutional fund from the Department of Computational Medicine and Bioinformatics, University of Michigan (K.F.A., D.W. and H.L.), SPBU 73023672 (A.P.), AMED 22kk0305013h9903, 23kk0305024h0001 (H.K.), Wellcome Trust (WT222155/Z/20/Z) and European Molecular Biology Laboratory (A.F.). P.C. acknowledges the contribution of funds from MEXT (Ministry of Education, Culture, Sports, Science and Technology of Japan) to RIKEN. We acknowledge M. T. Walsh (University of Florida) and E. Schiller (Homosassa Springs Park) for providing archive Lorelei blood samples. 
We acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa and CERCA Programme/Generalitat de Catalunya and the support of the German Federal Ministry of Education and Research with grant no. 161L0242A (M.L. and R.H.). The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the US Government. The funders had no role in study design, data collection, analysis, decision to publish or preparation of the manuscript.

Contributions

These authors contributed equally: Francisco J. Pardo-Palacios, Dingjie Wang, Fairlie Reese, Mark Diekhans, Sílvia Carbonell-Sala, Brian Williams, Jane E. Loveland, Maite De María.

These authors jointly supervised this work: Christopher Vollmers, Adam Frankish, Kin Fai Au, Gloria M. Sheynkman, Ali Mortazavi, Ana Conesa, Angela N. Brooks.

Biosample collection and preparation was carried out by S.C.-S., B.W., M.D.M., A. Cousineau, X.R., M.E.H., R.M., Y.S. and A. Mortazavi. Library preparation and sequencing was carried out by S.C.-S., B.W., M.S.A., G.B.-G., A.K.B., J. Lagarde, C.E.L., D.A.M.A., N.G.P., R.G., B.J.W., C.V., A. Mortazavi and A.N.B. H.T. and P.C. carried out cDNA library technologies development. Data coordination and curation was carried out by F.J.P., F.R., M.D., B.W., G.B.-G., J. Lagarde, M.H.C., S.G., A.M.-M., C.M., I.A.Y., A. Mortazavi and A. Conesa. Quality control was carried out by F.R., B.W., M.D.M., G.B.-G., J. Lagarde, A. Cousineau, X.R., M.E.H., R.M., Y.S., C.V. and A. Mortazavi. Evaluation of Challenge 1 was carried out by F.J.P., J.E.L., J.M.G.M., S.C.-G., J.M.F.-G., C.H.-F., L.K., T.L., J.M.-T., J.M.M., D.R., E.S., A.F., A. Conesa and A.N.B. Evaluation of Challenge 2 was carried out by D.W., G.B.-G., H.L., B.J.W., K.F.A. and A.N.B. Evaluation of Challenge 3 was carried out by F.J.P., S.C.-G., J.M.F.-G., C.H.-F., T.L., C.M., A.P., D.R., E.S., A. Conesa and A.N.B. Validation was carried out by F.J.P., M.D., S.C.-S., M.D., M.J.M., N.D., L.F.-P., N.G.-R., E.R., B.S.-J., L.S., M.L.S., H.T., P.C., N.D.D., M.E.H., G.M.S., A. Mortazavi, A. Conesa and A.N.B. GENCODE benchmarks were carried out by J.E.L., J.M.G.M., T.H., I. Barnes, A.E.B., J.M.M., M.S., A.F., M.M.-T. and A. Conesa. Challenge and submission logistics were carried out by F.J.P., F.R., M.D., J. Lagarde, A.D.T., A. Mortazavi, A. Conesa and A.N.B. Simulation was carried out by F.J.P., F.R., A.D.P. and A. Conesa. LRGASP Challenge Participant/Submitter was carried out by J. Lagarde, A.D.P., I. Birol, H.B., A.M.B., Y.C., M.R.M.D., C.F., J.G., S.H., R.H., H.K., J. Lee, J.-L.L., M.L., A. Mortazavi, A. Mikheenko, D.M., K.M.N., M.P., M.E.R., A.D.S., A.D.T., Y.W., C.W., B.Y.W., H.U.T. and C.Y. Writing was carried out by F.J.P., D.W., F.R., M.D., S.C.-S., B.W., M.D.M., M.A., A.K.B., J. Lagarde, C.E.L., A.D.P., L.F.-P., M.E.H., C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. with input from all co-authors. M.S.A., G.B.-G., A.K.B., J.M.G.M., T.H., J. Lagarde, C.E.L., H.L., M.J.M., D.A.M.A. and A.D.P. contributed equally to this work. C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. jointly supervised the work. More specifically, quality control and R2C2 sequencing was supervised by C.V. GENCODE benchmarks were supervised by A.F. Challenge 2 results were supervised by K.F.A. Validation was supervised by G.M.S. Obtaining human and mouse samples and PacBio sequencing was supervised by A. Mortazavi. Obtaining manatee samples and sequencing and Challenges 1 and 3 were supervised by A. Conesa. Submission logistics and ONT cDNA and dRNA sequencing were supervised by A.N.B. A. Mortazavi, A. Conesa and A.N.B. co-led the overall study.

Data Availability

An overview and documentation about the LRGASP Consortium can be found at https://www.gencodegenes.org/pages/LRGASP/. Biological sequencing data are available from the ENCODE Portal (https://www.encodeproject.org/) and are described in the RNA-seq data matrix (Supplementary Data 1). Experimental data used in GENCODE manual evaluation: ssCAGE WTC11 (Gene Expression Omnibus (GEO): GSE185917); WTC11 QuantSeq (ENCODE: ENCSR322MWL, GEO: GSE219685); H1 QuantSeq (ENCODE: ENCSR813AOB, GEO: GSE219788); and H1-DE QuantSeq (ENCODE: ENCSR198UNH, GEO: GSE219571). Reads generated for experimental validation are available in the NCBI Sequence Read Archive: SRR24680099, manatee whole-blood RT–PCR mixed with human WTC11; GCA_030013775.1, manatee Nanopore genome assembly, BioProject PRJNA939417 (a pre-submission version of the assembly, along with SIRVs, was used in LRGASP at https://cgl.gi.ucsc.edu/data/LRGASP/data/references/lrgasp_manatee_sirv1.fasta.gz); SRR24680098, human WTC11 mixed with manatee whole-blood RT–PCR; and SRR23881262, LRGASP WTC11 experimental validation RT–PCR/ONT. Other data provided to participants, participant submissions, evaluation results and data for generating the paper figures are available from the LRGASP project at https://cgl.gi.ucsc.edu/data/LRGASP/. A UCSC Browser hub with the consolidated models and other data is also available here. LRGASP reference genomes and annotations: https://cgl.gi.ucsc.edu/data/LRGASP/data/references/. LRGASP simulation data: https://cgl.gi.ucsc.edu/data/LRGASP/data/simulation/. Participant submissions: https://cgl.gi.ucsc.edu/data/LRGASP/submissions/. Evaluation results for all challenges: https://cgl.gi.ucsc.edu/data/LRGASP/results/. Spearman correlations of TPMs for each Challenge 2 pipeline: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Spearman_correlation_of_TPM_values.zip. Non-redundant genome annotations derived from the submitted annotations: https://cgl.gi.ucsc.edu/data/LRGASP/annotations/. 
UCSC Browser Hub with LRGASP evaluation data for human, mouse and manatee: LRGASP Hub, Hub URL. LRGASP-consolidated models description and BED files: https://cgl.gi.ucsc.edu/data/LRGASP/consolidated-models/LRGASP-consolidated-models.html. Simulation ground truth, including lists of incorrectly duplicated artificial transcripts: human simulation ground truth and mouse simulation ground truth. Data for generating Challenge 1 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge1_Figures_Data.zip. Data for generating Challenge 2 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge2_Figures_Data.zip. Data for generating Challenge 3 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge3_Figures_Data.zip.
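
The data listed above include Spearman correlations of TPMs for each Challenge 2 pipeline. As an illustration of what such a statistic measures, the following is a minimal sketch of a Spearman rank correlation between two vectors of TPM values, written with the Python standard library only. The function names and the example values are hypothetical and are not taken from the LRGASP evaluation code, which may compute the correlation differently.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical TPM values for four transcripts in two replicates.
tpm_replicate_1 = [0.0, 1.5, 3.2, 100.0]
tpm_replicate_2 = [0.1, 2.0, 2.5, 80.0]
rho = spearman(tpm_replicate_1, tpm_replicate_2)
```

Because the statistic depends only on ranks, it is insensitive to the large dynamic range typical of TPM values, which is one common reason for preferring it over a Pearson correlation on raw values.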

Code Availability

LRGASP-specific code is available at the GitHub LRGASP project (https://github.com/LRGASP/). LRGASP submission commands, which include documentation on submission metadata and data files: https://github.com/LRGASP/lrgasp-submissions/. Read simulation pipeline: https://github.com/LRGASP/lrgasp-simulation/. Challenge 1 evaluation code: https://github.com/LRGASP/lrgasp-challenge-1-evaluation/. Challenge 2 evaluation code: https://github.com/LRGASP/lrgasp-challenge-2-evaluation/. Challenge 3 evaluation code: https://github.com/LRGASP/lrgasp-challenge-3-evaluation/. Code to generate Challenge 1 figures for the paper: https://github.com/LRGASP/Challenge1_Figures_Code/. Code to generate Challenge 2 figures for the paper: https://github.com/LRGASP/Challenge2_Figures_Code/. Code to generate Challenge 3 figures for the paper: https://github.com/LRGASP/Challenge3_Figures_Code/. Primers-Juju source code is available at https://github.com/diekhans/PrimerS-JuJu/ and was developed by The University of California, Santa Cruz and El Centre de Regulació Genòmica. Code used for analysis of long-read RNA-seq data used by submitters is described in the ‘Computational pipeline description from submitters’ section in the Supplementary Information.

Extended Data Fig. 1 SQANTI3 classifications of LRGASP submissions on the WTC11 dataset.

Extended Data Fig. 2 Percentage of transcript models with different ranges of sequence coverage by long reads.

Extended Data Fig. 3 Positional coverage of long unspliced SIRV transcript sequences by long reads for each sample type.

Extended Data Fig. 4 Properties of GENCODE manually annotated loci for WTC11 sample.

Extended Data Fig. 5 Properties of GENCODE manually annotated loci for mouse ES sample.

Extended Data Fig. 6 Overall evaluation results of eight quantification tools.

Extended Data Fig. 7 Top three performance on quantification tools.

Extended Data Fig. 8 SQANTI category classification of transcript models.

Extended Data Fig. 9 Fraction of experimentally validated WTC11 transcripts.

Supplementary Results, Discussion, Methods, Tables 1–13 and Figs. 1–79

Conflict of Interest

The design of the project was discussed with ONT, PacBio and Lexogen. ONT provided partial support for flow cells and reagents. H.U.T. and A. Conesa have, in the past, presented at events organized by PacBio and have received reimbursement or support for travel, accommodation and conference fees. H.U.T. has also spoken at local ONT events during the duration of this project and received food. Unrelated to this project, the laboratory of H.U.T. has purchased reagents from Illumina, PacBio and ONT at discounted prices. S.C.-S., A.N.B. and J.G. have received reimbursement for travel, accommodation and conference fees to speak at events organized by ONT. A.N.B. is a consultant for Remix Therapeutics. A. Conesa is the founder of Biobam Bioinformatics. The other authors declare no competing interests.

Files

s41592-024-02298-3.pdf
Files (46.3 MB)
Checksum  Size
md5:0ba17c48f21e4aae8adbf14590fb90c3  202.3 kB
md5:a36b58c9c83cb465569a0d2ddb5c4697  27.5 kB
md5:b8cf2f2aa5e1ce3ce337cc7330333a06  67.7 kB
md5:67b8a31681167bc049beb0f676e8e6bd  6.0 MB
md5:e536f4087aac560423eead28f7866e98  37.8 MB
md5:a1d1245f1f08d59351d8e8d0330323d5  38.1 kB
md5:d4cafa9a9ef725f3e9969002c2a03b33  77.4 kB
md5:20d0ddbfb42ea9bf70fe0a5d8fe578a2  77.4 kB
md5:5426788e7d09c85b344ce50b1eb29cf9  9.6 kB
md5:a8706ba876a73b51913547284ba01c10  11.1 kB
md5:d3618e85c9c76e6a7b93a74b2a47213c  10.1 kB
md5:615623d9be37f790717768da99dd6f87  38.5 kB
md5:3fef079aa7d2aaf83f8d5dece762b3dc  47.8 kB
md5:04f94ca378e27eb7fc9c31516c441be8  245.9 kB
md5:6b705a56bc5c23c2d06e619a3780978c  24.5 kB
md5:28b2d83079c2fabe7099970726a21289  11.0 kB
md5:53e9428c5d74ef769f7f6b465dbcbcdc  17.2 kB
md5:8270e27cf568cddffce564ecfbd6a2cd  7.3 kB
md5:b29511aefddf52e769eaf5c79194aac8  50.8 kB
md5:ed1fa6b6373f2b749551ed78cebbd76c  45.3 kB
md5:3c1b342f81ace667b73dfe5253c7bd51  111.0 kB
md5:e8e40dfa28b16ecfa0e223e4a4558f9f  166.5 kB
md5:49c6302d1e277e3a3efac3375fc68df3  220.3 kB
md5:0b4964ab7e67f306aef2c68a1e2e4a60  267.9 kB
md5:baa8a6204f308dd52fe91bf6c7fef563  18.7 kB
md5:82b8d8b7a9abebcb6a50df9e8e708a80  36.6 kB
md5:1e97df72b26e4c4fe9be3233b99ba518  314.2 kB
md5:d706a2b5d2f400f65e504c4835eb4ca1  77.7 kB
md5:0f86243b66939ab036ec578672edb035  92.4 kB
md5:83f0fa7ec8693ad4f3cc7198b9bd2f5c  16.8 kB
md5:0fe1c315dd4c019b7779adeace924c8f  170.7 kB
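
Each file in this record is listed with an MD5 checksum, which can be used to confirm that a download is intact. The following is a minimal sketch of such a check using Python's standard `hashlib` module. The function names are hypothetical, and the sketch assumes the recorded value has the form `md5:<hex digest>` as shown in the listing; this record does not state which file name corresponds to which checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(hex_digest, recorded):
    """Compare a computed digest with a recorded 'md5:<hex>' string."""
    expected = recorded.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return hex_digest.lower() == expected
```

A downloaded file could then be checked with `matches_record(md5_of_file(local_path), "md5:...")`, where `local_path` is wherever the file was saved. MD5 is suitable here for detecting accidental corruption, not for guarding against deliberate tampering.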

Additional details

Created: June 12, 2024
Modified: June 12, 2024