Compositional Data Analysis is necessary for simulating and analyzing RNA-Seq data

*Seq techniques (e.g. RNA-Seq) generate compositional datasets, i.e. the number of fragments sequenced is not proportional to the sample's total RNA content. Thus, datasets carry only relative information, even though absolute RNA copy numbers are of interest. Current normalization methods assume that most features do not change, which can lead to misleading conclusions when there are many changes. Furthermore, there are few real datasets and no simulation protocols currently available that can directly benchmark methods when many changes occur. We present absSimSeq, an R package that simulates compositional data in the form of RNA-Seq reads. We compared absSimSeq with several existing tools used for RNA-Seq differential analysis: sleuth, DESeq2, edgeR, limma, and ALDEx2 (which explicitly takes a compositional approach). We compared the standard normalization of these tools to either "compositional normalization", which uses log-ratios to anchor the data on a set of negative control features, or RUVSeq, another tool that directly uses negative control features. Our analysis shows that common normalizations reduce the performance of current methods when there is a large change in the total RNA per cell. Performance improves when spike-ins are included and used with a compositional approach, even if the spike-ins have substantial variation. In contrast, RUVSeq, which normalizes count data rather than compositional data, performs poorly. Further, we show that previous criticisms of spike-ins did not take into consideration the compositional nature of the data. We demonstrate that absSimSeq can generate more representative datasets for testing performance, and that spike-ins should be more frequently used in a compositional manner to minimize misleading conclusions in differential analyses.
Author Summary

A critical question in biomedical research is "Is there any change in RNA transcript abundance when cellular conditions change?" RNA Sequencing (RNA-Seq) is a powerful tool that can help answer this question, but two critical parts of obtaining accurate measurements are (A) understanding the kind of data that RNA-Seq produces, and (B) "normalizing" the data between samples to allow for a fair comparison. Most tools assume that RNA-Seq data is count data, but in reality it is "compositional" data, meaning only percentages/proportions are available, which cannot directly answer the critical question. This leads to distorted results when attempting to simulate or analyze data that has a large global change. To address this problem, we designed a new simulation protocol called absSimSeq that can more accurately represent RNA-Seq data when there are large changes. We also proposed a "compositional normalization" method that can utilize "negative control" features, known a priori not to change between conditions, to anchor the data. When many features are changing, this approach improves performance over commonly used normalization methods across multiple tools. This work highlights the importance of having negative control features available and of treating RNA-Seq data as compositional.


Up-Regulated Genes
… SPAC1142.01) as a denominator; this gene had the most consistent abundance (TPM value) across all samples. The compositional normalization methods (all tools in red: "ALDEx2 ALR overlap"; "sleuth-ALR"; all "+C.N." tools) used opt3 (Pombase: SPCC1840.12) as a denominator; this gene was considered a "validated reference gene". "RUVg" for edgeR and DESeq2 (in blue) also used opt3 as a negative control gene. Only compositional normalization methods, using opt3, were able to accurately reflect the severe global decrease observed in the data, as shown by the number of genes showing down-regulation of the absolute counts. Note that the ALDEx2 Welch and Wilcoxon statistics yielded <5 significant hits.

Discussion
Compositional normalization performed similarly to current normalization methods when there was only a small change to total RNA (Fig 2A; Fig 4; Fig 5). This similarity is expected because, in this scenario, it is valid to assume that most features are not changing. However, when there was a large change in global RNA, compositional normalization using negative control features (spike-ins) had much better performance compared to the current normalization methods, for both simulated data (Fig 2B-2C) and real data (Table 1). In this case, the assumption held by the current methods is violated, and this is likely what greatly reduced their performance.
Further, although the IQLR transformation in ALDEx2 was designed to be robust to large changes in global RNA, it only modestly improved performance. This indicates that at least some of the features it selected and assumed to be unchanging were indeed changing, in both the simulated data and the real data.
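To make the IQLR idea concrete, here is a minimal sketch with made-up numbers (our illustration, not ALDEx2's exact implementation): each sample is CLR-transformed, per-feature variance is computed across samples, and only features whose variance falls within the interquartile range are kept as the "unchanging" denominator set. A feature with a large change (like g2 below) is excluded, but nothing guarantees that the retained features are truly unchanging in copy numbers.

```python
import math
import statistics

# Three hypothetical samples of relative abundances for four genes.
# g2 changes strongly across samples; g1, g3, g4 are roughly stable.
samples = [
    {"g1": 10.0, "g2": 100.0, "g3": 50.0, "g4": 20.0},
    {"g1": 12.0, "g2": 400.0, "g3": 55.0, "g4": 18.0},
    {"g1": 11.0, "g2": 900.0, "g3": 52.0, "g4": 22.0},
]
genes = list(samples[0])

def clr(sample):
    """Centered log-ratio: log value minus the mean log of the sample."""
    logs = {g: math.log(v) for g, v in sample.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {g: lv - mean_log for g, lv in logs.items()}

clrs = [clr(s) for s in samples]
# Variance of each gene's CLR values across samples.
var = {g: statistics.pvariance([c[g] for c in clrs]) for g in genes}

# Keep only genes whose CLR variance lies in the interquartile range.
qs = statistics.quantiles(var.values(), n=4)
q1, q3 = qs[0], qs[2]
iqlr_set = [g for g in genes if q1 <= var[g] <= q3]
```

With these numbers, the strongly changing g2 is excluded from the denominator set, illustrating why IQLR is more robust than assuming all features are unchanged, yet still an approximation when many features shift together.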

The worse performance of current normalization methods is likely related to how fold changes behave in the absolute case versus the relative case. Current normalization methods assume that most features are not changing; one could equivalently assume that the total RNA per cell is unchanged [23]. Thus, if the total RNA content changes, current methods will anchor the data on whatever this global change is. This results in a shift in the observed distribution of fold changes (see Supplementary Figure S16 of [4]). Extreme changes will still be observed to have the same direction, but features that change less dramatically than the global change will be observed to have the wrong sign, and many unchanged features will appear to be changing.
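The sign-flipping described above can be reproduced with a small numeric sketch (hypothetical copy numbers, not data from the paper). Total-count normalization implicitly divides by the sample total, so when condition B has much less total RNA, every observed proportion is inflated relative to the absolute truth:

```python
# Hypothetical absolute copy numbers per cell for four genes.
abs_A = {"g1": 100, "g2": 100, "g3": 100, "g4": 100}  # condition A
abs_B = {"g1": 100, "g2": 50, "g3": 25, "g4": 10}     # condition B: global decrease

tot_A = sum(abs_A.values())  # 400
tot_B = sum(abs_B.values())  # 185

fold_changes = {}
for g in abs_A:
    true_fc = abs_B[g] / abs_A[g]                     # change in copy numbers
    rel_fc = (abs_B[g] / tot_B) / (abs_A[g] / tot_A)  # change in proportions
    fold_changes[g] = (true_fc, rel_fc)

# g1 is truly unchanged but appears ~2.16-fold up in the proportions;
# g2 is truly halved but appears slightly up (wrong sign);
# g3 and g4 change more than the global shift and keep the right direction.
```

If g1 were a known negative control, dividing every proportion by g1's proportion would recover the true fold changes exactly, which is the anchoring role that compositional normalization gives to spike-ins.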
Interestingly, the choice of normalization method had a much greater impact on performance than the choice of differential analysis tool, validating a finding from an older study [14]. This was true both for the simulated data (Fig 2B-2C) and for the yeast dataset (Table 1). In both cases, compositional normalization clearly outperformed current methods when analyzing a dataset with substantial changes to the total RNA content, whether simulated or real. However, this was only true when negative control features were used (Table 1). This indicates that the choice of tool is much less important than the choice of normalization and the availability of negative control features (spike-ins, validated reference gene) to properly anchor the data.
Surprisingly, RUVg had poor performance in both the simulated data and the experimental yeast dataset. It is unclear why this occurred, other than the likely possibility that it treats the data as count data rather than compositional data. More work would need to be done to see if RUVg could be modified to more accurately capture the global trend in the data from negative control features.
In our simulations, the total RNA content was either tripled (the "up" study) or decreased by 33% (the "down" study). The overall change in the "up" study was less extreme than what was observed after c-Myc overexpression [24,39]. In that context, the researchers found a general transcriptional activation that was not captured by the traditional analysis of the RNA-Seq data; cell number normalization using spike-ins was required to see the overall trend of increasing gene expression, and the total RNA content increase observed by RNA-Seq was ~5.5-fold (see Table S2 of [24]). The overall change in the "down" study was less extreme than that observed in the Marguerat et al. dataset, which showed an 88% decrease in total RNA content when using normalized RNA-Seq data. How often large shifts occur in real datasets is unclear because of how infrequently spike-ins or validated reference genes are used when generating data. Future work should determine more carefully how drastic compositional changes need to be before performance starts to degrade for methods that assume most features are not changing.

Results from Bottomly et al. self-consistency test and GEUVADIS null experiment
Sleuth-ALR had the best self-consistency (Fig 5), and sleuth-ALR and limma had the best performance in the negative control dataset (Fig 6). ALDEx2 was unable to identify any hits using three samples per group with the standard statistical methods (Wilcoxon and Welch), and its "overlap" statistic showed a very high FDR, indicating that its results were inconsistent between the "training" and larger "validation" datasets. This indicates that …

… is used to validate differential analysis results in this scenario, a predicted reference gene after using spike-ins …

… parameters for the spike-ins were set to the median dispersion of all transcripts that had a mean TPM within 5% of the TPM for the spike-in.
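The dispersion-matching rule in the last sentence can be sketched as follows (hypothetical TPM and dispersion values; absSimSeq's actual implementation may differ in details):

```python
import statistics

# Hypothetical (mean_tpm, dispersion) pairs for endogenous transcripts.
transcripts = [
    (95.0, 0.30), (98.0, 0.25), (100.0, 0.20), (104.0, 0.45), (200.0, 0.10),
]

def spike_in_dispersion(spike_tpm, transcripts, tol=0.05):
    """Median dispersion of transcripts whose mean TPM is within
    tol (5% by default) of the spike-in's TPM."""
    window = [d for tpm, d in transcripts
              if abs(tpm - spike_tpm) <= tol * spike_tpm]
    return statistics.median(window) if window else None

# A spike-in at TPM 100 matches the four transcripts in [95, 105];
# it is assigned their median dispersion.
disp = spike_in_dispersion(100.0, transcripts)
```

This gives each spike-in a realistic level of biological-like variation, matching the key point above that compositional approaches still perform well even when spike-ins are noisy.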
S1 Table summarizes the simulation parameters and the number of transcripts that are differentially expressed, and S2 Table shows the average global copy numbers per cell per condition for each of the fifteen runs. Note that the experimental group in the "up" study had a ~2.8-fold increase, on average, in the total RNA copy numbers per cell. This is less than the 5.5-fold increase in total mRNA observed after over-expressing the oncogene c-Myc (see the normalized data in Table S2 in Lovén et al., 2012). The experimental group in the "down" study had a ~33% decrease in the total RNA copy numbers per cell. This is less than the decrease in total RNA observed in the yeast dataset (see Supplementary Table S2 …).

Thus, if there are one or more features which are known a priori to be negative controls (a validated reference gene, a pool of spike-ins), these are natural choices for use as a denominator in sleuth-ALR, either using a single feature or using the geometric mean of multiple features. Since the copy numbers of these features are expected to be constant between samples, there is now an anchor for the relative proportions between samples. If negative controls are unavailable, we propose identifying one or more features that have the most consistent proportion across all samples. For sleuth-ALR, we chose the coefficient of variation as the metric to measure consistency in proportion. Without external information, though, it is unknown whether this "consistent feature" is indeed also consistent in copy numbers. If there is a global change in RNA content, this feature would represent the average change. All current normalization methods assume there is no such change, and so make corrections to the data to remove any perceived change; they are therefore mathematically equivalent to this proposed approach (see [5] for a full mathematical proof). However, this approach has the advantage of making explicit
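The anchoring described above amounts to an additive log-ratio (ALR) transformation. A minimal sketch with hypothetical TPM values (sleuth-ALR's real interface differs): divide every feature by the geometric mean of the negative control features before taking logs, so a feature whose ratio to the controls is constant across samples shows no change.

```python
import math

# Hypothetical TPM values for one sample; spike1/spike2 are negative controls.
sample = {"geneA": 200.0, "geneB": 50.0, "spike1": 10.0, "spike2": 40.0}
controls = ["spike1", "spike2"]

# Geometric mean of the negative control features (the ALR denominator).
gm = math.exp(sum(math.log(sample[c]) for c in controls) / len(controls))

# ALR value for each non-control feature: log2 ratio to the denominator.
alr = {g: math.log2(v / gm) for g, v in sample.items() if g not in controls}
```

Because the controls are expected to have constant copy numbers across samples, differences in these log-ratios between conditions estimate changes in absolute copy numbers rather than in proportions.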