Paper intensive reading (15): Mitigating the adverse impact of batch effects in sample pattern detection

Paper title: Mitigating the adverse impact of batch effects in sample pattern detection

Google Scholar citations: 6


Published: 1 March 2018


Authors: Teng Fei1, Tengjiao Zhang2, Weiyang Shi3,* and Tianwei Yu1,*

Emory University and Tongji University; same author as Paper intensive reading (12)


Motivation: It is well known that batch effects exist in RNA-seq data and other profiling data. Although some methods do a good job adjusting for batch effects by modifying the data matrices, it is still difficult to remove the batch effects entirely. The remaining batch effect can cause artifacts in the detection of patterns in the data.

Results: In this study, we consider the batch effect issue in the pattern detection among the samples, such as clustering, dimension reduction and construction of networks between subjects. Instead of adjusting the original data matrices, we design an adaptive method to directly adjust the dissimilarity matrix between samples. In simulation studies, the method achieved better results recovering true underlying clusters, compared to the leading batch effect adjustment method ComBat. In real data analysis, the method effectively corrected distance matrices and improved the performance of clustering algorithms.





  • In this paper, we proposed novel approaches based on the interpolating quantile normalization. As the data become challenging, i.e. true clusters are closer to each other, and the batch effect is heterogeneous on different clusters, our methods outperform ComBat. Note: this is where the novelty of the proposed method lies.
  • However, ComBat is a more general method, which adjusts the data matrix for many kinds of down-stream analysis, while our method focuses on adjusting the dissimilarity matrix between samples, mainly serving the purpose of pattern detection in the samples. It does not correct the raw count matrix to adjust for batch effects.
  • Our method modifies the dissimilarity matrix so that various clustering approaches can achieve better performance. Note: the method can be combined with many clustering methods.
  • Limitations of the method: On the one hand, the vectorization approach may suffer from insufficient discrimination due to the lack of extreme values. On the other hand, the row/column iterative approach is more easily affected by the wrong extreme values since each column and each row are polarized.
  • The vectorization approach performed better on data with high similarity between batches. Note: it suits data whose batches are highly similar to one another.
  • Preprocessing: the preprocessing method can affect the result of the clustering analysis.
  • The choice between the two preprocessing strategies may depend on the data.
  • Although the iterative approach seems to have limitations explained earlier, we generally recommend this approach. Note: it can improve the robustness of the method.
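
The notes above describe a method that adjusts the sample-by-sample dissimilarity matrix and then hands it to a clustering algorithm. The sketch below only illustrates that interface (a dissimilarity matrix goes in, cluster labels come out) and contains no batch correction. The use of 1 minus Pearson correlation as the dissimilarity, the naive average-linkage routine, and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def correlation_dissimilarity(X):
    """X is samples x genes; returns the samples x samples matrix 1 - Pearson r.
    (Assumed dissimilarity, chosen only for illustration.)"""
    return 1.0 - np.corrcoef(X)

def average_linkage(D, k):
    """Naive agglomerative clustering on a precomputed dissimilarity matrix D.
    Repeatedly merges the two clusters with the smallest mean pairwise
    dissimilarity until k clusters remain. Hypothetical helper."""
    clusters = [[i] for i in range(D.shape[0])]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(D.shape[0], dtype=int)
    for lab, members in enumerate(clusters):
        labels[members] = lab
    return labels

# Toy data: two groups of three samples, each group sharing one expression profile.
rng = np.random.default_rng(0)
profile_a, profile_b = rng.normal(size=(2, 50))
X = np.vstack([
    profile_a + 0.3 * rng.normal(size=(3, 50)),
    profile_b + 0.3 * rng.normal(size=(3, 50)),
])

D = correlation_dissimilarity(X)
# A dissimilarity-matrix correction step would sit here, between D and clustering.
labels = average_linkage(D, k=2)
print(labels)
```

Because the clustering step only sees `D`, any algorithm that accepts a precomputed dissimilarity matrix could be swapped in, which matches the note that the method can be combined with many clustering methods.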


  • The existence of batch effects increases the difficulty in comparing the data from different labs, platforms and processing times.
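  • If batch effects are ignored, the analysis can give wrong results. For example, clustering mouse and human gene expression data led to the conclusion that samples group by species rather than by tissue, but after adjusting for batch effects the opposite conclusion was reached.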
  • Many methods have been proposed:
  1. Johnson et al. (2007) proposed the empirical Bayes algorithm of ComBat, which removes the additive and multiplicative batch effects for each gene from each batch. Note: this is the current gold-standard method.
  2. Gagnon-Bartsch and Speed (2012) applied the removal of unwanted variation method to make adjustments according to the variations of the control genes, which are not differentially expressed (DE) among the batches.
  • Most methods, including ComBat, share a shortcoming: they attempt to modify the data matrix (N subjects × p genes) so that the measurements from different batches become comparable. ComBat appears to be more effective for the microarray data, which is less skewed than RNA-seq data. Note: it works better on microarray data and may be less effective on RNA-seq data.
  • Moreover, real data may have high irregularity such that the additive and the multiplicative parameters are insufficient to capture all batch effects.
  • Ad hoc approaches based on quantile normalization are introduced in this manuscript.
  • According to simulation results, clustering based on the normalized dissimilarity matrix obtained by our methods outperformed ComBat in recapturing the underlying cluster structure in the data, especially when the data were more challenging as the percentage of genes that differentiate the underlying clusters was small. Note: in simulations the method beats ComBat, particularly on challenging datasets.
  • In real data analysis, we analyzed two datasets with dominating batch effects (Gilad and Mizrahi-Man, 2015; Zhang et al., 2016) and two scRNA-seq datasets where the batch effects are relatively weak (Muraro et al., 2016; Usoskin et al., 2015). Our methods improved the clustering accuracy and outperformed ComBat in both situations. Note: the method also beats ComBat on several real datasets.
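
For reference, the "additive and multiplicative batch effects" that ComBat removes can be written as a location-and-scale model. The block below gives the form in which the Johnson et al. (2007) model is commonly stated; the index convention (i for batch, j for sample within batch, g for gene) is a notational choice made here and does not come from these notes.

```latex
% ComBat location-and-scale model, as commonly written
% i: batch, j: sample within batch, g: gene
% gamma_{ig}: additive batch effect, delta_{ig}: multiplicative batch effect
Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim N(0, \sigma_g^2)

% Batch-adjusted value, using the empirical Bayes estimates (starred)
Y^{*}_{ijg} = \frac{\hat{\sigma}_g}{\hat{\delta}^{*}_{ig}}
              \left( Z_{ijg} - \hat{\gamma}^{*}_{ig} \right)
              + \hat{\alpha}_g + X\hat{\beta}_g,
\qquad
Z_{ijg} = \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g}{\hat{\sigma}_g}
```

With only one shift and one scale parameter per gene and batch, this model cannot express more irregular distortions, which is the limitation the notes point out for skewed RNA-seq data.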


1. Introduction

2. Materials and methods

2.1 Problem setup

2.2 Preprocessing

2.3 Interpolating quantile normalization for vectors of different lengths
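
These notes do not record the details of this section, so the following is only a generic sketch of what quantile normalization with interpolation can look like when the input vectors have different lengths (for example, values taken from batches of different sizes). It is an assumed illustration rather than the authors' exact procedure, and the function name and the `grid_size` parameter are hypothetical. Each vector's empirical quantile function is evaluated on a shared probability grid by linear interpolation, the curves are averaged into a reference, and every value is then mapped to the reference at its own rank-based probability.

```python
import numpy as np

def interpolating_quantile_normalize(vectors, grid_size=None):
    """Quantile-normalize vectors of possibly different lengths (sketch).
    Returns a list of arrays with the same lengths and within-vector ordering
    as the inputs, all drawn from one shared reference distribution."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    if grid_size is None:
        grid_size = max(len(v) for v in vectors)
    grid = np.linspace(0.0, 1.0, grid_size)

    # Evaluate each empirical quantile function on the common grid
    # (np.quantile interpolates linearly by default), then average.
    reference = np.mean([np.quantile(v, grid) for v in vectors], axis=0)

    normalized = []
    for v in vectors:
        n = len(v)
        # Rank of each entry (0 .. n-1); ties are broken by position.
        ranks = np.argsort(np.argsort(v, kind="stable"), kind="stable")
        probs = ranks / (n - 1) if n > 1 else np.full(n, 0.5)
        # Read the reference quantile function at each entry's probability.
        normalized.append(np.interp(probs, grid, reference))
    return normalized

# Usage: a length-5 vector and a length-3 vector on very different scales.
short_and_long = interpolating_quantile_normalize([[0, 1, 2, 3, 4], [10, 20, 30]])
for vec in short_and_long:
    print(vec)
```

When all vectors have the same length and the default grid is used, this reduces to ordinary quantile normalization, where each value is replaced by the mean of the values sharing its rank.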

2.4 Dissimilarity matrix correction

2.5 Clustering and evaluation methods

3. Results

3.1 Simulation study

3.2 ENCODE data for human and mouse tissues

3.3 Human-mouse brain RNA-seq data

3.4 Mouse neuron scRNA-seq data

3.5 Human pancreas scRNA-seq data

4. Discussion






