Systematic assessment of long-read RNA-seq methods for transcript identification and quantification

Pardo-Palacios, Francisco J.; Wang, Dingjie; Reese, Fairlie; Diekhans, Mark; Carbonell-Sala, Sílvia; Williams, Brian; Loveland, Jane E.; De María, Maite; Adams, Matthew S.; Balderrama-Gutierrez, Gabriela; Behera, Amit K.; Gonzalez Martinez, Jose M.; Hunt, Toby; Lagarde, Julien; Liang, Cindy E.; Li, Haoran; Meade, Marcus Jerryd; Moraga Amador, David A.; Prjibelski, Andrey D.; Birol, Inanc; Bostan, Hamed; Brooks, Ashley M.; Çelik, Muhammed Hasan; Chen, Ying; Du, Mei R. M.; Felton, Colette; Göke, Jonathan; Hafezqorani, Saber; Herwig, Ralf; Kawaji, Hideya; Lee, Joseph; Li, Jian-Liang; Lienhard, Matthias; Mikheenko, Alla; Mulligan, Dennis; Nip, Ka Ming; Pertea, Mihaela; Ritchie, Matthew E.; Sim, Andre D.; Tang, Alison D.; Wan, Yuk Kei; Wang, Changqing; Wong, Brandon Y.; Yang, Chen; Barnes, If; Berry, Andrew E.; Capella-Gutierrez, Salvador; Cousineau, Alyssa; Dhillon, Namrita; Fernandez-Gonzalez, Jose M.; Ferrández-Peral, Luis; Garcia-Reyero, Natàlia; Götz, Stefan; Hernández-Ferrer, Carles; Kondratova, Liudmyla; Liu, Tianyuan; Martinez-Martin, Alessandra; Menor, Carlos; Mestre-Tomás, Jorge; Mudge, Jonathan M.; Panayotova, Nedka G.; Paniagua, Alejandro; Repchevsky, Dmitry; Ren, Xingjie; Rouchka, Eric; Saint-John, Brandon; Sapena, Enrique; Sheynkman, Leon; Smith, Melissa Laird; Suner, Marie-Marthe; Takahashi, Hazuki; Youngworth, Ingrid A.; Carninci, Piero; Denslow, Nancy D.; Guigó, Roderic; Hunter, Margaret E.; Maehr, Rene; Shen, Yin; Tilgner, Hagen U.; Wold, Barbara J.; Vollmers, Christopher; Frankish, Adam; Au, Kin Fai; Sheynkman, Gloria M.; Mortazavi, Ali; Conesa, Ana; Brooks, Angela N.

doi:10.1038/s41592-024-02298-3

Published June 7, 2024 | Version in press

Journal Article Open

1. California Institute of Technology

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

Copyright and License

© The Author(s) 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Acknowledgement

We thank Lexogen, ONT and PacBio for helpful discussions. ONT provided partial support for flow cells and reagents. We thank T. Sasaki and D. Gilbert for providing the F121-9 hybrid mouse ES cells and K. M. Parsi for assistance with human H1-hES cells and H1-DE cells. We also thank M. Akeson and M. Jain for providing resources and technical advice for Nanopore sequencing. We thank J. Visser for contributing artwork that gives an overview of the LRGASP Consortium. The project is supported by the following grants: Pew Charitable Trust (A.N.B.), NIGMS R35GM138122 (A.N.B.), NHGRI R21HG011280 (A. Conesa, J.M.-T., A.M.-M., A.P. and L.F.-P.), Spanish Ministry of Science PID2020-119537RB-10 (A. Conesa and F.J.P.), NIGMS R35GM142647 (G.M.S.), NIGMS R35GM133569 (C.V.), NHGRI U41HG007234 (J. Lagarde, M.D., R.G., S.C.-S., J.E.L., J.M.G., T.H., I. Barnes, A.E.B., J.M.M. and A.F.), NHGRI F31HG010999 (A.D.T.) and UM1 HG009443 (A. Mortazavi and B.W.), NHGRI R01HG008759 and R01HG011469 (K.F.A., D.W. and H.L.), NHGRI R01HG007182 (I. Birol, K.M.N., S.H. and C.Y.), NHGRI UM1HG009402 (Y.S.), NHMRC Investigator Grant GNT2017257 (M.E.R.), Comunitat Valenciana Grant ACIF/2018/290 (F.J.P.), Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (grant no. 2019-002443 to M.E.R.), an institutional fund from the Department of Biomedical Informatics, The Ohio State University (K.F.A., D.W. and H.L.), an institutional fund from the Department of Computational Medicine and Bioinformatics, University of Michigan (K.F.A., D.W. and H.L.), SPBU 73023672 (A.P.), AMED 22kk0305013h9903, 23kk0305024h0001 (H.K.), Wellcome Trust (WT222155/Z/20/Z) and European Molecular Biology Laboratory (A.F.). P.C. acknowledges the contribution of funds from MEXT (Ministry of Education, Culture, Sports, Science and Technology of Japan) to RIKEN. We acknowledge M. T. Walsh (University of Florida) and E. Schiller (Homosassa Springs Park) for providing archive Lorelei blood samples. 
We acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa and CERCA Programme/Generalitat de Catalunya and the support of the German Federal Ministry of Education and Research with grant no. 161L0242A (M.L. and R.H.). The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the US Government. The funders had no role in study design, data collection, analysis, decision to publish or preparation of the manuscript.

Contributions

These authors contributed equally: Francisco J. Pardo-Palacios, Dingjie Wang, Fairlie Reese, Mark Diekhans, Sílvia Carbonell-Sala, Brian Williams, Jane E. Loveland, Maite De María.

These authors jointly supervised this work: Christopher Vollmers, Adam Frankish, Kin Fai Au, Gloria M. Sheynkman, Ali Mortazavi, Ana Conesa, Angela N. Brooks.

Biosample collection and preparation was carried out by S.C.-S., B.W., M.D.M., A. Cousineau, X.R., M.E.H., R.M., Y.S. and A. Mortazavi. Library preparation and sequencing was carried out by S.C.-S., B.W., M.S.A., G.B.-G., A.K.B., J. Lagarde, C.E.L., D.A.M.A., N.G.P., R.G., B.J.W., C.V., A. Mortazavi and A.N.B. H.T. and P.C. carried out cDNA library technologies development. Data coordination and curation was carried out by F.J.P., F.R., M.D., B.W., G.B.-G., J. Lagarde, M.H.C., S.G., A.M.-M., C.M., I.A.Y., A. Mortazavi and A. Conesa. Quality control was carried out by F.R., B.W., M.D.M., G.B.-G., J. Lagarde, A. Cousineau, X.R., M.E.H., R.M., Y.S., C.V. and A. Mortazavi. Evaluation of Challenge 1 was carried out by F.J.P., J.E.L., J.M.G.M., S.C.-G., J.M.F.-G., C.H.-F., L.K., T.L., J.M.-T., J.M.M., D.R., E.S., A.F., A. Conesa and A.N.B. Evaluation of Challenge 2 was carried out by D.W., G.B.-G., H.L., B.J.W., K.F.A. and A.N.B. Evaluation of Challenge 3 was carried out by F.J.P., S.C.-G., J.M.F.-G., C.H.-F., T.L., C.M., A.P., D.R., E.S., A. Conesa and A.N.B. Validation was carried out by F.J.P., M.D., S.C.-S., M.D., M.J.M., N.D., L.F.-P., N.G.-R., E.R., B.S.-J., L.S., M.L.S., H.T., P.C., N.D.D., M.E.H., G.M.S., A. Mortazavi, A. Conesa and A.N.B. GENCODE benchmarks were carried out by J.E.L., J.M.G.M., T.H., I. Barnes, A.E.B., J.M.M., M.S., A.F., M.M.-T. and A. Conesa. Challenge and submission logistics were carried out by F.J.P., F.R., M.D., J. Lagarde, A.D.T., A. Mortazavi, A. Conesa and A.N.B. Simulation was carried out by F.J.P., F.R., A.D.P. and A. Conesa. LRGASP Challenge Participant/Submitter was carried out by J. Lagarde, A.D.P., I. Birol, H.B., A.M.B., Y.C., M.R.M.D., C.F., J.G., S.H., R.H., H.K., J. Lee, J.-L.L., M.L., A. Mortazavi, A. Mikheenko, D.M., K.M.N., M.P., M.E.R., A.D.S., A.D.T., Y.W., C.W., B.Y.W., H.U.T. and C.Y.
Writing was carried out by F.J.P., D.W., F.R., M.D., S.C.-S., B.W., M.D.M., M.A., A.K.B., J. Lagarde, C.E.L., A.D.P., L.F.-P., M.E.H., C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. with input from all co-authors. M.S.A., G.B.-G., A.K.B., J.M.G.M., T.H., J. Lagarde, C.E.L., H.L., M.J.M., D.A.M.A. and A.D.P. contributed equally to this work. C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. jointly supervised the work. More specifically, quality control and R2C2 sequencing was supervised by C.V. GENCODE benchmarks were supervised by A.F. Challenge 2 results were supervised by K.F.A. Validation was supervised by G.M.S. Obtaining human and mouse samples and PacBio sequencing was supervised by A. Mortazavi. Obtaining manatee samples and sequencing and Challenges 1 and 3 were supervised by A. Conesa. Submission logistics and ONT cDNA and dRNA sequencing were supervised by A.N.B. A. Mortazavi, A. Conesa and A.N.B. co-led the overall study.

Data Availability

An overview and documentation about the LRGASP Consortium can be found at https://www.gencodegenes.org/pages/LRGASP/. Biological sequencing data are available from the ENCODE Portal (https://www.encodeproject.org/) and are described in the RNA-seq data matrix (Supplementary Data 1). Experimental data used in GENCODE manual evaluation: ssCAGE WTC11 (Gene Expression Omnibus (GEO): GSE185917); WTC11 QuantSeq (ENCODE: ENCSR322MWL, GEO: GSE219685); H1 QuantSeq (ENCODE: ENCSR813AOB, GEO: GSE219788); and H1-DE QuantSeq (ENCODE: ENCSR198UNH, GEO: GSE219571). Reads generated for experimental validation are available in the NCBI Sequence Read Archive: SRR24680099, manatee whole-blood RT–PCR mixed with human WTC11; GCA_030013775.1, manatee Nanopore genome assembly, BioProject PRJNA939417 (a pre-submission version of the assembly, along with SIRVs, was used in LRGASP at https://cgl.gi.ucsc.edu/data/LRGASP/data/references/lrgasp_manatee_sirv1.fasta.gz); SRR24680098, human WTC11 mixed with manatee whole-blood RT–PCR; and SRR23881262, LRGASP WTC11 experimental validation RT–PCR/ONT. Other data provided to participants, participant submissions, evaluation results and data for generating the paper figures are available from the LRGASP project at https://cgl.gi.ucsc.edu/data/LRGASP/. A UCSC Browser hub with the consolidated models and other data is also available here. LRGASP reference genomes and annotations: https://cgl.gi.ucsc.edu/data/LRGASP/data/references/. LRGASP simulation data: https://cgl.gi.ucsc.edu/data/LRGASP/data/simulation/. Participant submissions: https://cgl.gi.ucsc.edu/data/LRGASP/submissions/. Evaluation results for all challenges: https://cgl.gi.ucsc.edu/data/LRGASP/results/. Spearman correlations of TPMs for each Challenge 2 pipeline: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Spearman_correlation_of_TPM_values.zip. Non-redundant genome annotations derived from the submitted annotations: https://cgl.gi.ucsc.edu/data/LRGASP/annotations/. 
UCSC Browser Hub with LRGASP evaluation data for human, mouse and manatee: LRGASP Hub, Hub URL. LRGASP-consolidated models description and BED files: https://cgl.gi.ucsc.edu/data/LRGASP/consolidated-models/LRGASP-consolidated-models.html. Simulation ground truth, including lists of incorrectly duplicated artificial transcripts: human simulation ground truth and mouse simulation ground truth. Data for generating Challenge 1 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge1_Figures_Data.zip. Data for generating Challenge 2 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge2_Figures_Data.zip. Data for generating Challenge 3 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge3_Figures_Data.zip.

Code Availability

LRGASP-specific code is available at the GitHub LRGASP project (https://github.com/LRGASP/). LRGASP submission commands, which include documentation on submission metadata and data files: https://github.com/LRGASP/lrgasp-submissions/. Read simulation pipeline: https://github.com/LRGASP/lrgasp-simulation/. Challenge 1 evaluation code: https://github.com/LRGASP/lrgasp-challenge-1-evaluation/. Challenge 2 evaluation code: https://github.com/LRGASP/lrgasp-challenge-2-evaluation/. Challenge 3 evaluation code: https://github.com/LRGASP/lrgasp-challenge-3-evaluation/. Code to generate Challenge 1 figures for the paper: https://github.com/LRGASP/Challenge1_Figures_Code/. Code to generate Challenge 2 figures for the paper: https://github.com/LRGASP/Challenge2_Figures_Code/. Code to generate Challenge 3 figures for the paper: https://github.com/LRGASP/Challenge3_Figures_Code/. Primers-Juju source code is available at https://github.com/diekhans/PrimerS-JuJu/ and was developed by The University of California, Santa Cruz and El Centre de Regulació Genòmica. Code used for analysis of long-read RNA-seq data used by submitters is described in the ‘Computational pipeline description from submitters’ section in the Supplementary Information.

Extended Data Fig. 1 SQANTI3 classifications of LRGASP submissions on the WTC11 dataset.

Extended Data Fig. 2 Percentage of transcript models with different ranges of sequence coverage by long reads.

Extended Data Fig. 3 Positional coverage of long unspliced SIRV transcript sequences by long reads for each sample type.

Extended Data Fig. 4 Properties of GENCODE manually annotated loci for WTC11 sample.

Extended Data Fig. 5 Properties of GENCODE manually annotated loci for mouse ES sample.

Extended Data Fig. 6 Overall evaluation results of eight quantification tools.

Extended Data Fig. 7 Top three performance on quantification tools.

Extended Data Fig. 8 SQANTI category classification of transcript models.

Extended Data Fig. 9 Fraction of experimentally validated WTC11 transcripts.

Supplementary Results, Discussion, Methods, Tables 1–13 and Figs. 1–79

Conflict of Interest

The design of the project was discussed with ONT, PacBio and Lexogen. ONT provided partial support for flow cells and reagents. H.U.T. and A. Conesa have, in the past, presented at events organized by PacBio and have received reimbursement or support for travel, accommodation and conference fees. H.U.T. has also spoken at local ONT events during the duration of this project and received food. Unrelated to this project, the laboratory of H.U.T. has purchased reagents from Illumina, PacBio and ONT at discounted prices. S.C.-S., A.N.B. and J.G. have received reimbursement for travel, accommodation and conference fees to speak at events organized by ONT. A.N.B. is a consultant for Remix Therapeutics. A. Conesa is the founder of Biobam Bioinformatics. The other authors declare no competing interests.

Files

Files (46.3 MB)

Name	Checksum	Size
41592_2024_2298_Fig10_ESM.jpg	md5:e8e40dfa28b16ecfa0e223e4a4558f9f	166.5 kB
41592_2024_2298_Fig11_ESM.jpg	md5:1e97df72b26e4c4fe9be3233b99ba518	314.2 kB
41592_2024_2298_Fig12_ESM.jpg	md5:0b4964ab7e67f306aef2c68a1e2e4a60	267.9 kB
41592_2024_2298_Fig13_ESM.jpg	md5:3c1b342f81ace667b73dfe5253c7bd51	111.0 kB
41592_2024_2298_Fig14_ESM.jpg	md5:ed1fa6b6373f2b749551ed78cebbd76c	45.3 kB
41592_2024_2298_Fig6_ESM.jpg	md5:0ba17c48f21e4aae8adbf14590fb90c3	202.3 kB
41592_2024_2298_Fig7_ESM.jpg	md5:49c6302d1e277e3a3efac3375fc68df3	220.3 kB
41592_2024_2298_Fig8_ESM.jpg	md5:04f94ca378e27eb7fc9c31516c441be8	245.9 kB
41592_2024_2298_Fig9_ESM.jpg	md5:0fe1c315dd4c019b7779adeace924c8f	170.7 kB
41592_2024_2298_MOESM10_ESM.xlsx	md5:a1d1245f1f08d59351d8e8d0330323d5	38.1 kB
41592_2024_2298_MOESM11_ESM.xlsx	md5:82b8d8b7a9abebcb6a50df9e8e708a80	36.6 kB
41592_2024_2298_MOESM12_ESM.xlsx	md5:baa8a6204f308dd52fe91bf6c7fef563	18.7 kB
41592_2024_2298_MOESM13_ESM.xlsx	md5:b8cf2f2aa5e1ce3ce337cc7330333a06	67.7 kB
41592_2024_2298_MOESM14_ESM.xlsx	md5:d3618e85c9c76e6a7b93a74b2a47213c	10.1 kB
41592_2024_2298_MOESM15_ESM.xlsx	md5:b29511aefddf52e769eaf5c79194aac8	50.8 kB
41592_2024_2298_MOESM16_ESM.xlsx	md5:83f0fa7ec8693ad4f3cc7198b9bd2f5c	16.8 kB
41592_2024_2298_MOESM17_ESM.xlsx	md5:8270e27cf568cddffce564ecfbd6a2cd	7.3 kB
41592_2024_2298_MOESM18_ESM.xlsx	md5:0f86243b66939ab036ec578672edb035	92.4 kB
41592_2024_2298_MOESM19_ESM.xlsx	md5:53e9428c5d74ef769f7f6b465dbcbcdc	17.2 kB
41592_2024_2298_MOESM1_ESM.pdf	md5:e536f4087aac560423eead28f7866e98	37.8 MB
41592_2024_2298_MOESM20_ESM.xlsx	md5:28b2d83079c2fabe7099970726a21289	11.0 kB
41592_2024_2298_MOESM21_ESM.xlsx	md5:6b705a56bc5c23c2d06e619a3780978c	24.5 kB
41592_2024_2298_MOESM22_ESM.xlsx	md5:5426788e7d09c85b344ce50b1eb29cf9	9.6 kB
41592_2024_2298_MOESM23_ESM.xlsx	md5:a8706ba876a73b51913547284ba01c10	11.1 kB
41592_2024_2298_MOESM4_ESM.xlsx	md5:a36b58c9c83cb465569a0d2ddb5c4697	27.5 kB
41592_2024_2298_MOESM5_ESM.xlsx	md5:3fef079aa7d2aaf83f8d5dece762b3dc	47.8 kB
41592_2024_2298_MOESM6_ESM.xlsx	md5:d706a2b5d2f400f65e504c4835eb4ca1	77.7 kB
41592_2024_2298_MOESM7_ESM.xlsx	md5:20d0ddbfb42ea9bf70fe0a5d8fe578a2	77.4 kB
41592_2024_2298_MOESM8_ESM.xlsx	md5:d4cafa9a9ef725f3e9969002c2a03b33	77.4 kB
41592_2024_2298_MOESM9_ESM.xlsx	md5:615623d9be37f790717768da99dd6f87	38.5 kB
s41592-024-02298-3.pdf	md5:67b8a31681167bc049beb0f676e8e6bd	6.0 MB
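
Each file in the listing above carries an MD5 checksum, so a downloaded copy can be checked for transfer errors. The following is a minimal sketch of such a check using only the Python standard library. The function names and the local file path are hypothetical and chosen for illustration; they are not part of the record or of any LRGASP tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large files (such as the 37.8 MB supplement) are not
    loaded into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_listed_md5(field):
    """Normalize a checksum as written in the listing: drop an
    optional 'md5:' prefix and lowercase the hex digits."""
    value = field.strip()
    if value.lower().startswith("md5:"):
        value = value[4:]
    return value.lower()


def matches_listing(path, listed):
    """True if the file's MD5 digest equals the listed checksum."""
    return md5_of_file(path) == parse_listed_md5(listed)


if __name__ == "__main__":
    # Hypothetical local path: assumes the article PDF was saved
    # to the current directory under its listed name.
    local = Path("s41592-024-02298-3.pdf")
    listed = "md5:67b8a31681167bc049beb0f676e8e6bd"
    if local.exists():
        print("checksum OK" if matches_listing(local, listed) else "checksum MISMATCH")
    else:
        print(f"{local} not found; download it first")
```

MD5 is adequate here for detecting accidental corruption during download; it should not be relied on as protection against deliberate tampering.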

Additional details

ISSN: 1548-7105

Pew Charitable Trusts
National Institutes of Health
R35GM138122
National Institutes of Health
R21HG011280
Ministerio de Ciencia, Innovación y Universidades
PID2020-119537RB-10
National Institutes of Health
R35GM142647
National Institutes of Health
R35GM133569
National Institutes of Health
U41HG007234
National Institutes of Health
F31HG010999
National Institutes of Health
UM1 HG009443
National Institutes of Health
R01HG008759
National Institutes of Health
R01HG011469
National Institutes of Health
R01HG007182
National Institutes of Health
UM1HG009402
National Health and Medical Research Council
GNT2017257
Generalitat Valenciana
ACIF/2018/290
Chan Zuckerberg Initiative (United States)
2019-002443
The Ohio State University
University of Michigan–Ann Arbor
SPBU 73023672
Japan Agency for Medical Research and Development
22kk0305013h9903
Japan Agency for Medical Research and Development
23kk0305024h0001
Wellcome Trust
WT222155/Z/20/Z
European Molecular Biology Laboratory
Ministry of Education, Culture, Sports, Science and Technology
Federal Ministry of Education and Research
161L0242A

Caltech groups: Division of Biology and Biological Engineering (BBE)
