Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Abstract
High-throughput mRNA sequencing (RNA-Seq) promises simultaneous transcript discovery and abundance estimation. However, this would require algorithms that are not restricted by prior gene annotations and that account for alternative transcription and splicing. Here we introduce such algorithms in an open-source software program called Cufflinks. To test Cufflinks, we sequenced and analyzed >430 million paired 75-bp RNA-Seq reads from a mouse myoblast cell line over a differentiation time series. We detected 13,692 known transcripts and 3,724 previously unannotated ones, 62% of which are supported by independent expression data or by homologous genes in other species. Over the time series, 330 genes showed complete switches in the dominant transcription start site (TSS) or splice isoform, and we observed more subtle shifts in 1,304 other genes. These results suggest that Cufflinks can illuminate the substantial regulatory flexibility and complexity in even this well-studied model of muscle development and that it can improve transcriptome-based genome annotation.
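To make the notion of a switch in the dominant isoform concrete, the following is a minimal sketch and not the statistical test that Cufflinks or Cuffdiff applies. It assumes per-isoform abundance estimates are already available for each time point and flags a gene when the most abundant isoform changes over the series. The function names, the isoform labels and the abundance values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def dominant_isoform(abundances: Dict[str, float]) -> Optional[str]:
    """Return the most abundant isoform, or None if nothing is expressed."""
    if not abundances or max(abundances.values()) <= 0:
        return None
    return max(abundances, key=abundances.get)


def has_dominant_switch(series: List[Dict[str, float]]) -> bool:
    """True if the dominant isoform differs between any two expressed time points."""
    dominants = [dominant_isoform(point) for point in series]
    expressed = {name for name in dominants if name is not None}
    return len(expressed) > 1


# Hypothetical abundances (arbitrary units) at three time points.
gene_a = [
    {"iso1": 30.0, "iso2": 5.0},
    {"iso1": 12.0, "iso2": 14.0},
    {"iso1": 2.0, "iso2": 40.0},
]
gene_b = [
    {"iso1": 30.0, "iso2": 5.0},
    {"iso1": 25.0, "iso2": 9.0},
    {"iso1": 28.0, "iso2": 11.0},
]

print(has_dominant_switch(gene_a))  # dominant isoform changes from iso1 to iso2
print(has_dominant_switch(gene_b))  # iso1 stays dominant throughout
```

A real analysis would also need to account for uncertainty in the abundance estimates, which this comparison of point values ignores.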
Additional Information
© 2010 Nature Publishing Group. Received 02 February 2010; Accepted 22 March 2010; Published online 02 May 2010.
This work was supported in part by the US National Institutes of Health (NIH) grants R01-LM006845 and ENCODE U54-HG004576, as well as the Beckman Foundation, the Bren Foundation, the Moore Foundation (Cell Center Program) and the Miller Research Institute. We thank I. Antosechken and L. Schaeffer of the Caltech Jacobs Genome Center for DNA sequencing, and D. Trout, B. King and H. Amrhein for data pipeline and database design, operation and display. We are grateful to R. K. Bradley, K. Datchev, I. Hallgrímsdóttir, J. Landolin, B. Langmead, A. Roberts, M. Schatz and D. Sturgill for helpful discussions.
Author Contributions: C.T. and L.P. developed the mathematics and statistics and designed the algorithms; B.A.W. and G.K. performed the RNA-Seq and B.A.W. designed and executed experimental validations; C.T. implemented Cufflinks and Cuffdiff; G.P. implemented Cuffcompare; M.J.v.B. and A.M. tested the software; C.T., G.P. and A.M. performed the analysis; L.P., A.M. and B.J.W. conceived the project; C.T., L.P., A.M., B.J.W. and S.L.S. wrote the manuscript.
Software availability. TopHat (http://tophat.cbcb.umd.edu) is freely available as source code. It takes a reference genome (as a Bowtie (ref. 29) index) and RNA-Seq reads as FASTA or FASTQ and produces alignments in SAM (ref. 30) format. TopHat is distributed under the Artistic License and runs on Linux and Mac OS X. The Cufflinks assembler and abundance estimation algorithms (http://cufflinks.cbcb.umd.edu/) are open-source C++ programs and are freely available in both source and binary form. The package includes the assembler along with utilities to structurally compare Cufflinks output between samples (Cuffcompare) and to perform differential expression testing (Cuffdiff). Cufflinks is distributed under the Boost License and runs on Linux and Mac OS X. The source code for Cufflinks version 0.8.0 is provided in Supplementary Data 3.
The authors declare no competing financial interests.
Attached Files
Accepted Version - nihms190938.pdf
Supplemental Material - nbt.1621-S1.pdf
Supplemental Material - nbt.1621-S2.xls
Supplemental Material - nbt.1621-S3.zip
Additional details
- Alternative title
- Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms
- PMCID
- PMC3146043
- Eprint ID
- 18505
- Resolver ID
- CaltechAUTHORS:20100601-111602154
- NIH
- R01-LM006845
- NIH
- ENCODE U54-HG004576
- Arnold and Mabel Beckman Foundation
- Bren Foundation
- Gordon and Betty Moore Foundation
- Miller Institute for Basic Research in Science
- Created
- 2010-06-28 (Created from EPrint's datestamp field)
- Updated
- 2021-11-08 (Created from EPrint's last_modified field)