A Caltech Library Service

Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

McGee, Warren A. and Pimentel, Harold and Pachter, Lior and Wu, Jane Y. (2019) Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data. (Unpublished)

PDF - Submitted Version
Creative Commons Attribution.




*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the total RNA present. Thus, datasets carry only relative information, even though absolute RNA copy numbers are often of interest. Current normalization methods assume most features are not changing, which can lead to misleading conclusions when there are large shifts. However, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when such large shifts occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We tested several tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). For these tools, we compared their standard normalization to either “compositional normalization”, which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. We show that common normalizations result in reduced performance with current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used by a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, has poor performance. Further, we show that previous criticisms of spike-ins did not take into account the compositional nature of the data. We conclude that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more broadly used in a compositional manner to minimize misleading conclusions from differential analyses.
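
The idea of anchoring log-ratios on negative control features can be illustrated with a small toy calculation. The sketch below is not the absSimSeq or sleuth-ALR code; it is a minimal Python illustration under stated assumptions. The feature names (geneA, geneB, geneC, spike), the copy numbers, and the helper functions to_composition and alr are all hypothetical. It converts absolute abundances into proportions, as sequencing effectively does, and then compares a naive fold change on proportions with an additive log-ratio (ALR) fold change that uses a spike-in as the reference.

```python
import math


def to_composition(absolute):
    """Convert absolute abundances to proportions that sum to 1."""
    total = sum(absolute.values())
    return {name: value / total for name, value in absolute.items()}


def alr(composition, reference_names):
    """Additive log-ratio: log of each feature relative to the
    geometric mean of the chosen reference (negative control) features."""
    log_ref = sum(math.log(composition[n]) for n in reference_names)
    log_ref /= len(reference_names)
    return {name: math.log(value) - log_ref
            for name, value in composition.items()}


# Hypothetical absolute copy numbers per cell. "spike" stands in for a
# negative control added at the same amount in both conditions.
control = {"geneA": 100.0, "geneB": 100.0, "geneC": 100.0, "spike": 10.0}
# Large shift: geneA and geneB triple, geneC is truly unchanged.
treated = {"geneA": 300.0, "geneB": 300.0, "geneC": 100.0, "spike": 10.0}

comp_control = to_composition(control)
comp_treated = to_composition(treated)

# Naive log2 fold change computed directly on proportions.
naive_lfc = {g: math.log2(comp_treated[g] / comp_control[g]) for g in control}

# log2 fold change of ALR values anchored on the spike-in.
alr_control = alr(comp_control, ["spike"])
alr_treated = alr(comp_treated, ["spike"])
alr_lfc = {g: (alr_treated[g] - alr_control[g]) / math.log(2) for g in control}

for gene in control:
    print(f"{gene}: naive={naive_lfc[gene]:+.3f}  alr={alr_lfc[gene]:+.3f}")
```

In this toy case the naive comparison makes the unchanged geneC (and the spike-in itself) appear to drop by more than two-fold, because the total RNA grew. The ALR comparison reports no change for geneC and a three-fold increase for geneA and geneB, matching the absolute copy numbers. Real data add sampling noise and spike-in variation, which this sketch ignores.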

Item Type: Report or Paper (Discussion Paper)
Related URLs:
URL    URL Type    Description
Paper
McGee, Warren A. (ORCID: 0000-0003-4301-6689)
Pachter, Lior (ORCID: 0000-0002-9164-6231)
Wu, Jane Y. (ORCID: 0000-0003-1794-1213)
Additional Information: The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

We are grateful to Rosemary Braun and David Kuo for helpful suggestions and critical reading of the manuscript. WAM and JYW are supported by the NIH (F30 NS090893 to WAM; R01CA175360 and R01NS107396 to JYW). HP is supported by the Howard Hughes Medical Institute Hanna Gray Fellowship.

Authors' Contributions: WAM conceived the idea, designed the approach, and wrote the software for sleuth-ALR and absSimSeq. WAM and HP wrote the code for the analysis pipeline. JYW and LP provided supervision. WAM and JYW wrote the manuscript.

Availability of data and code: The yeast starvation dataset was taken from Marguerat et al. [26], from ArrayExpress at accession E-MTAB-1154, and the absolute counts were taken from Supplementary Table S2 of [26]. The GEUVADIS Finnish data can be found at ArrayExpress under accession E-GEUV-1, using the samples with the population code “FIN” and sex “female”. The Bottomly et al. data [35] can be found on the Sequence Read Archive (SRA) under accession SRP004777. Human annotations were taken from Gencode v. 25 and Ensembl v. 87, mouse annotations from Gencode v. M12 and Ensembl v. 87, and yeast annotations from Ensembl Genomes Fungi release 37. The code and vignette for absSimSeq can be found on GitHub at, the code and vignette for using sleuth-ALR can be found at, and the full code to reproduce the analyses in this paper can be found at.

Software versions used: kallisto v. 0.44.0, limma v. 3.34.9, edgeR v. 3.20.9, RUVSeq v. 1.12.0, and DESeq2 v. 1.18.1. The version of polyester used is a forked branch that modified version 1.14.1 with significant speed improvements (found here: ). The version of sleuth used is a forked branch that modified version 0.29.0 with speed improvements and modifications to allow for sleuth-ALR (found here: ). The version of ALDEx2 used is a forked branch that modified version 1.10.0 to make some speed improvements and to fix a bug that prevented getting effects if the ALR transformation with one feature was used (found here: ). All R code was run using R version 3.4.4, and the full pipeline was run using snakemake.

The authors declare no competing financial interests.
Funding Agency                          Grant Number
NIH                                     F30 NS090893
Howard Hughes Medical Institute (HHMI)  UNSPECIFIED
Subject Keywords: Compositional Data Analysis, sleuth-ALR, absSimSeq, *Seq, Differential Analysis, Normalization, spike-ins
Record Number: CaltechAUTHORS:20190304-085432513
Persistent URL:
Official Citation: Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data. Warren A McGee, Harold Pimentel, Lior Pachter, Jane Y Wu. bioRxiv 564955; doi:
Usage Policy: No commercial reproduction, distribution, display or performance rights in this work are provided.
ID Code: 93419
Deposited By: George Porter
Deposited On: 04 Mar 2019 17:41
Last Modified: 03 Oct 2019 20:54
