Systematic assessment of long-read RNA-seq methods for transcript identification and quantification
- Creators
- Pardo-Palacios, Francisco J.
- Wang, Dingjie
- Reese, Fairlie
- Diekhans, Mark
- Carbonell-Sala, Sílvia
- Williams, Brian
- Loveland, Jane E.
- De María, Maite
- Adams, Matthew S.
- Balderrama-Gutierrez, Gabriela
- Behera, Amit K.
- Gonzalez Martinez, Jose M.
- Hunt, Toby
- Lagarde, Julien
- Liang, Cindy E.
- Li, Haoran
- Meade, Marcus Jerryd
- Moraga Amador, David A.
- Prjibelski, Andrey D.
- Birol, Inanc
- Bostan, Hamed
- Brooks, Ashley M.
- Çelik, Muhammed Hasan
- Chen, Ying
- Du, Mei R. M.
- Felton, Colette
- Göke, Jonathan
- Hafezqorani, Saber
- Herwig, Ralf
- Kawaji, Hideya
- Lee, Joseph
- Li, Jian-Liang
- Lienhard, Matthias
- Mikheenko, Alla
- Mulligan, Dennis
- Nip, Ka Ming
- Pertea, Mihaela
- Ritchie, Matthew E.
- Sim, Andre D.
- Tang, Alison D.
- Wan, Yuk Kei
- Wang, Changqing
- Wong, Brandon Y.
- Yang, Chen
- Barnes, If
- Berry, Andrew E.
- Capella-Gutierrez, Salvador
- Cousineau, Alyssa
- Dhillon, Namrita
- Fernandez-Gonzalez, Jose M.
- Ferrández-Peral, Luis
- Garcia-Reyero, Natàlia
- Götz, Stefan
- Hernández-Ferrer, Carles
- Kondratova, Liudmyla
- Liu, Tianyuan
- Martinez-Martin, Alessandra
- Menor, Carlos
- Mestre-Tomás, Jorge
- Mudge, Jonathan M.
- Panayotova, Nedka G.
- Paniagua, Alejandro
- Repchevsky, Dmitry
- Ren, Xingjie
- Rouchka, Eric
- Saint-John, Brandon
- Sapena, Enrique
- Sheynkman, Leon
- Smith, Melissa Laird
- Suner, Marie-Marthe
- Takahashi, Hazuki
- Youngworth, Ingrid A.
- Carninci, Piero
- Denslow, Nancy D.
- Guigó, Roderic
- Hunter, Margaret E.
- Maehr, Rene
- Shen, Yin
- Tilgner, Hagen U.
- Wold, Barbara J.
- Vollmers, Christopher
- Frankish, Adam
- Au, Kin Fai
- Sheynkman, Gloria M.
- Mortazavi, Ali
- Conesa, Ana
- Brooks, Angela N.
Abstract
The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.
Copyright and License
© The Author(s) 2024. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Acknowledgement
We thank Lexogen, ONT and PacBio for helpful discussions. ONT provided partial support for flow cells and reagents. We thank T. Sasaki and D. Gilbert for providing the F121-9 hybrid mouse ES cells and K. M. Parsi for assistance with human H1-hES cells and H1-DE cells. We also thank M. Akeson and M. Jain for providing resources and technical advice for Nanopore sequencing. We thank J. Visser for contributing artwork that gives an overview of the LRGASP Consortium. The project is supported by the following grants: Pew Charitable Trust (A.N.B.), NIGMS R35GM138122 (A.N.B.), NHGRI R21HG011280 (A. Conesa, J.M.-T., A.M.-M., A.P. and L.F.-P.), Spanish Ministry of Science PID2020-119537RB-10 (A. Conesa and F.J.P.), NIGMS R35GM142647 (G.M.S.), NIGMS R35GM133569 (C.V.), NHGRI U41HG007234 (J. Lagarde, M.D., R.G., S.C.-S., J.E.L., J.M.G., T.H., I. Barnes, A.E.B., J.M.M. and A.F.), NHGRI F31HG010999 (A.D.T.) and UM1 HG009443 (A. Mortazavi and B.W.), NHGRI R01HG008759 and R01HG011469 (K.F.A., D.W. and H.L.), NHGRI R01HG007182 (I. Birol, K.M.N., S.H. and C.Y.), NHGRI UM1HG009402 (Y.S.), NHMRC Investigator Grant GNT2017257 (M.E.R.), Comunitat Valenciana Grant ACIF/2018/290 (F.J.P.), Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (grant no. 2019-002443 to M.E.R.), an institutional fund from the Department of Biomedical Informatics, The Ohio State University (K.F.A., D.W. and H.L.), an institutional fund from the Department of Computational Medicine and Bioinformatics, University of Michigan (K.F.A., D.W. and H.L.), SPBU 73023672 (A.P.), AMED 22kk0305013h9903, 23kk0305024h0001 (H.K.), Wellcome Trust (WT222155/Z/20/Z) and European Molecular Biology Laboratory (A.F.). P.C. acknowledges the contribution of funds from MEXT (Ministry of Education, Culture, Sports, Science and Technology of Japan) to RIKEN. We acknowledge M. T. Walsh (University of Florida) and E. Schiller (Homosassa Springs Park) for providing archive Lorelei blood samples. 
We acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, Centro de Excelencia Severo Ochoa and CERCA Programme/Generalitat de Catalunya and the support of the German Federal Ministry of Education and Research with grant no. 161L0242A (M.L. and R.H.). The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Any use of trade, firm or product names is for descriptive purposes only and does not imply endorsement by the US Government. The funders had no role in study design, data collection, analysis, decision to publish or preparation of the manuscript.
Contributions
These authors contributed equally: Francisco J. Pardo-Palacios, Dingjie Wang, Fairlie Reese, Mark Diekhans, Sílvia Carbonell-Sala, Brian Williams, Jane E. Loveland, Maite De María.
These authors jointly supervised this work: Christopher Vollmers, Adam Frankish, Kin Fai Au, Gloria M. Sheynkman, Ali Mortazavi, Ana Conesa, Angela N. Brooks.
Biosample collection and preparation was carried out by S.C.-S., B.W., M.D.M., A. Cousineau, X.R., M.E.H., R.M., Y.S. and A. Mortazavi. Library preparation and sequencing was carried out by S.C.-S., B.W., M.S.A., G.B.-G., A.K.B., J. Lagarde, C.E.L., D.A.M.A., N.G.P., R.G., B.J.W., C.V., A. Mortazavi and A.N.B. H.T. and P.C. carried out cDNA library technologies development. Data coordination and curation was carried out by F.J.P., F.R., M.D., B.W., G.B.-G., J. Lagarde, M.H.C., S.G., A.M.-M., C.M., I.A.Y., A. Mortazavi and A. Conesa. Quality control was carried out by F.R., B.W., M.D.M., G.B.-G., J. Lagarde, A. Cousineau, X.R., M.E.H., R.M., Y.S., C.V. and A. Mortazavi. Evaluation of Challenge 1 was carried out by F.J.P., J.E.L., J.M.G.M., S.C.-G., J.M.F.-G., C.H.-F., L.K., T.L., J.M.-T., J.M.M., D.R., E.S., A.F., A. Conesa and A.N.B. Evaluation of Challenge 2 was carried out by D.W., G.B.-G., H.L., B.J.W., K.F.A. and A.N.B. Evaluation of Challenge 3 was carried out by F.J.P., S.C.-G., J.M.F.-G., C.H.-F., T.L., C.M., A.P., D.R., E.S., A. Conesa and A.N.B. Validation was carried out by F.J.P., M.D., S.C.-S., M.D., M.J.M., N.D., L.F.-P., N.G.-R., E.R., B.S.-J., L.S., M.L.S., H.T., P.C., N.D.D., M.E.H., G.M.S., A. Mortazavi, A. Conesa and A.N.B. GENCODE benchmarks were carried out by J.E.L., J.M.G.M., T.H., I. Barnes, A.E.B., J.M.M., M.S., A.F., M.M.-T. and A. Conesa. Challenge and submission logistics were carried out by F.J.P., F.R., M.D., J. Lagarde, A.D.T., A. Mortazavi, A. Conesa and A.N.B. Simulation was carried out by F.J.P., F.R., A.D.P. and A. Conesa. LRGASP Challenge Participant/Submitter was carried out by J. Lagarde, A.D.P., I. Birol, H.B., A.M.B., Y.C., M.R.M.D., C.F., J.G., S.H., R.H., H.K., J. Lee, J.-L.L., M.L., A. Mortazavi, A. Mikheenko, D.M., K.M.N., M.P., M.E.R., A.D.S., A.D.T., Y.W., C.W., B.Y.W., H.U.T. and C.Y. Writing was carried out by F.J.P., D.W., F.R., M.D., S.C.-S., B.W., M.D.M., M.A., A.K.B., J. Lagarde, C.E.L., A.D.P., L.F.-P., M.E.H., C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. with input from all co-authors.
M.S.A., G.B.-G., A.K.B., J.M.G.M., T.H., J. Lagarde, C.E.L., H.L., M.J.M., D.A.M.A. and A.D.P. contributed equally to this work. C.V., A.F., K.F.A., G.M.S., A. Mortazavi, A. Conesa and A.N.B. jointly supervised the work. More specifically, quality control and R2C2 sequencing was supervised by C.V. GENCODE benchmarks were supervised by A.F. Challenge 2 results were supervised by K.F.A. Validation was supervised by G.M.S. Obtaining human and mouse samples and PacBio sequencing was supervised by A. Mortazavi. Obtaining manatee samples and sequencing and Challenges 1 and 3 were supervised by A. Conesa. Submission logistics and ONT cDNA and dRNA sequencing were supervised by A.N.B. A. Mortazavi, A. Conesa and A.N.B. co-led the overall study.
Data Availability
An overview and documentation about the LRGASP Consortium can be found at https://www.gencodegenes.org/pages/LRGASP/. Biological sequencing data are available from the ENCODE Portal (https://www.encodeproject.org/) and are described in the RNA-seq data matrix (Supplementary Data 1). Experimental data used in GENCODE manual evaluation: ssCAGE WTC11 (Gene Expression Omnibus (GEO): GSE185917); WTC11 QuantSeq (ENCODE: ENCSR322MWL, GEO: GSE219685); H1 QuantSeq (ENCODE: ENCSR813AOB, GEO: GSE219788); and H1-DE QuantSeq (ENCODE: ENCSR198UNH, GEO: GSE219571). Reads generated for experimental validation are available in the NCBI Sequence Read Archive: SRR24680099, manatee whole-blood RT–PCR mixed with human WTC11; GCA_030013775.1, manatee Nanopore genome assembly, BioProject PRJNA939417 (a pre-submission version of the assembly, along with SIRVs, was used in LRGASP at https://cgl.gi.ucsc.edu/data/LRGASP/data/references/lrgasp_manatee_sirv1.fasta.gz); SRR24680098, human WTC11 mixed with manatee whole-blood RT–PCR; and SRR23881262, LRGASP WTC11 experimental validation RT–PCR/ONT. Other data provided to participants, participant submissions, evaluation results and data for generating the paper figures are available from the LRGASP project at https://cgl.gi.ucsc.edu/data/LRGASP/. A UCSC Browser hub with the consolidated models and other data is also available here. LRGASP reference genomes and annotations: https://cgl.gi.ucsc.edu/data/LRGASP/data/references/. LRGASP simulation data: https://cgl.gi.ucsc.edu/data/LRGASP/data/simulation/. Participant submissions: https://cgl.gi.ucsc.edu/data/LRGASP/submissions/. Evaluation results for all challenges: https://cgl.gi.ucsc.edu/data/LRGASP/results/. Spearman correlations of TPMs for each Challenge 2 pipeline: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Spearman_correlation_of_TPM_values.zip. Non-redundant genome annotations derived from the submitted annotations: https://cgl.gi.ucsc.edu/data/LRGASP/annotations/. 
UCSC Browser Hub with LRGASP evaluation data for human, mouse and manatee: LRGASP Hub, Hub URL. LRGASP-consolidated models description and BED files: https://cgl.gi.ucsc.edu/data/LRGASP/consolidated-models/LRGASP-consolidated-models.html. Simulation ground truth, including lists of incorrectly duplicated artificial transcripts: human simulation ground truth and mouse simulation ground truth. Data for generating Challenge 1 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge1_Figures_Data.zip. Data for generating Challenge 2 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge2_Figures_Data.zip. Data for generating Challenge 3 figures for the paper: https://cgl.gi.ucsc.edu/data/LRGASP/paper/Challenge3_Figures_Data.zip.
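The data listing above includes Spearman correlations of TPM values for each Challenge 2 pipeline. As a reminder of what that statistic measures, the following is a minimal sketch of a Spearman rank correlation between two TPM vectors. It is not the consortium's evaluation code; the function names and the toy TPM values are hypothetical, and ties are handled by average ranks, which is one common convention.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical TPM estimates for the same four transcripts from two pipelines.
tpm_pipeline_a = [12.0, 0.5, 130.0, 7.2]
tpm_pipeline_b = [10.1, 0.9, 98.4, 8.8]
rho = spearman(tpm_pipeline_a, tpm_pipeline_b)
```

Because the statistic depends only on rank order, two pipelines that order transcripts identically by abundance score 1 even when their absolute TPM values differ.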
Code Availability
LRGASP-specific code is available at the GitHub LRGASP project (https://github.com/LRGASP/). LRGASP submission commands, which include documentation on submission metadata and data files: https://github.com/LRGASP/lrgasp-submissions/. Read simulation pipeline: https://github.com/LRGASP/lrgasp-simulation/. Challenge 1 evaluation code: https://github.com/LRGASP/lrgasp-challenge-1-evaluation/. Challenge 2 evaluation code: https://github.com/LRGASP/lrgasp-challenge-2-evaluation/. Challenge 3 evaluation code: https://github.com/LRGASP/lrgasp-challenge-3-evaluation/. Code to generate Challenge 1 figures for the paper: https://github.com/LRGASP/Challenge1_Figures_Code/. Code to generate Challenge 2 figures for the paper: https://github.com/LRGASP/Challenge2_Figures_Code/. Code to generate Challenge 3 figures for the paper: https://github.com/LRGASP/Challenge3_Figures_Code/. Primers-Juju source code is available at https://github.com/diekhans/PrimerS-JuJu/ and was developed by The University of California, Santa Cruz and El Centre de Regulació Genòmica. Code used for analysis of long-read RNA-seq data used by submitters is described in the ‘Computational pipeline description from submitters’ section in the Supplementary Information.
Extended Data Fig. 1 SQANTI3 classifications of LRGASP submissions on the WTC11 dataset.
Conflict of Interest
The design of the project was discussed with ONT, PacBio and Lexogen. ONT provided partial support for flow cells and reagents. H.U.T. and A. Conesa have, in the past, presented at events organized by PacBio and have received reimbursement or support for travel, accommodation and conference fees. H.U.T. has also spoken at local ONT events during the duration of this project and received food. Unrelated to this project, the laboratory of H.U.T. has purchased reagents from Illumina, PacBio and ONT at discounted prices. S.C.-S., A.N.B. and J.G. have received reimbursement for travel, accommodation and conference fees to speak at events organized by ONT. A.N.B. is a consultant for Remix Therapeutics. A. Conesa is the founder of Biobam Bioinformatics. The other authors declare no competing interests.
Files
MD5 checksum | Size |
---|---|
md5:0ba17c48f21e4aae8adbf14590fb90c3 | 202.3 kB |
md5:a36b58c9c83cb465569a0d2ddb5c4697 | 27.5 kB |
md5:b8cf2f2aa5e1ce3ce337cc7330333a06 | 67.7 kB |
md5:67b8a31681167bc049beb0f676e8e6bd | 6.0 MB |
md5:e536f4087aac560423eead28f7866e98 | 37.8 MB |
md5:a1d1245f1f08d59351d8e8d0330323d5 | 38.1 kB |
md5:d4cafa9a9ef725f3e9969002c2a03b33 | 77.4 kB |
md5:20d0ddbfb42ea9bf70fe0a5d8fe578a2 | 77.4 kB |
md5:5426788e7d09c85b344ce50b1eb29cf9 | 9.6 kB |
md5:a8706ba876a73b51913547284ba01c10 | 11.1 kB |
md5:d3618e85c9c76e6a7b93a74b2a47213c | 10.1 kB |
md5:615623d9be37f790717768da99dd6f87 | 38.5 kB |
md5:3fef079aa7d2aaf83f8d5dece762b3dc | 47.8 kB |
md5:04f94ca378e27eb7fc9c31516c441be8 | 245.9 kB |
md5:6b705a56bc5c23c2d06e619a3780978c | 24.5 kB |
md5:28b2d83079c2fabe7099970726a21289 | 11.0 kB |
md5:53e9428c5d74ef769f7f6b465dbcbcdc | 17.2 kB |
md5:8270e27cf568cddffce564ecfbd6a2cd | 7.3 kB |
md5:b29511aefddf52e769eaf5c79194aac8 | 50.8 kB |
md5:ed1fa6b6373f2b749551ed78cebbd76c | 45.3 kB |
md5:3c1b342f81ace667b73dfe5253c7bd51 | 111.0 kB |
md5:e8e40dfa28b16ecfa0e223e4a4558f9f | 166.5 kB |
md5:49c6302d1e277e3a3efac3375fc68df3 | 220.3 kB |
md5:0b4964ab7e67f306aef2c68a1e2e4a60 | 267.9 kB |
md5:baa8a6204f308dd52fe91bf6c7fef563 | 18.7 kB |
md5:82b8d8b7a9abebcb6a50df9e8e708a80 | 36.6 kB |
md5:1e97df72b26e4c4fe9be3233b99ba518 | 314.2 kB |
md5:d706a2b5d2f400f65e504c4835eb4ca1 | 77.7 kB |
md5:0f86243b66939ab036ec578672edb035 | 92.4 kB |
md5:83f0fa7ec8693ad4f3cc7198b9bd2f5c | 16.8 kB |
md5:0fe1c315dd4c019b7779adeace924c8f | 170.7 kB |
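The files in this record are listed with MD5 checksums, which can be used to confirm that a download arrived intact. The following is a minimal sketch using Python's standard `hashlib` module. The helper names and the example file path are hypothetical, and the record does not prescribe any particular verification procedure.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path: str, listed_checksum: str) -> bool:
    """Compare a file's digest with a listed value such as 'md5:0ba1...'."""
    expected = listed_checksum.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected


# Usage (hypothetical local file name; the checksum is the first one listed):
# matches_listing("downloaded_file.pdf", "md5:0ba17c48f21e4aae8adbf14590fb90c3")
```

Reading in chunks keeps memory use flat for the larger files in the listing. MD5 is adequate for detecting accidental corruption but is not a defence against deliberate tampering.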
Additional details
- ISSN: 1548-7105
- Pew Charitable Trusts
- National Institutes of Health: R35GM138122
- National Institutes of Health: R21HG011280
- Ministerio de Ciencia, Innovación y Universidades: PID2020-119537RB-10
- National Institutes of Health: R35GM142647
- National Institutes of Health: R35GM133569
- National Institutes of Health: U41HG007234
- National Institutes of Health: NIH Postdoctoral Fellowship F31HG010999
- National Institutes of Health: UM1 HG009443
- National Institutes of Health: R01HG008759
- National Institutes of Health: R01HG011469
- National Institutes of Health: R01HG007182
- National Institutes of Health: UM1HG009402
- National Health and Medical Research Council: GNT2017257
- Generalitat Valenciana: ACIF/2018/290
- Chan Zuckerberg Initiative (United States): 2019-002443
- The Ohio State University
- University of Michigan–Ann Arbor
- SPBU 73023672
- Japan Agency for Medical Research and Development: 22kk0305013h9903
- Japan Agency for Medical Research and Development: 23kk0305024h0001
- Wellcome Trust: WT222155/Z/20/Z
- European Molecular Biology Laboratory
- Ministry of Education, Culture, Sports, Science and Technology
- Federal Ministry of Education and Research: 161L0242A
- Caltech groups: Division of Biology and Biological Engineering