Paper title: Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment
Google Scholar citations: 0
Pages: 25
Published: June 2019
Venue: preprint
Authors: Teng Fei and Tianwei Yu
Abstract:
Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. We present scBatch, a numerical algorithm that conducts batch effect correction on the count matrix of RNA sequencing (RNA-seq) data. Different from traditional methods, scBatch starts with establishing an ideal correction of the sample distance matrix that effectively reflect the underlying biological subgroups, without considering the actual correction of the raw count matrix itself. It then seeks an optimal linear transformation of the count matrix to approximate the established sample pattern. The benefit of such an approach is the final result is not restricted by assumptions on the mechanism of the batch effect. As a result, the method yields good clustering and gene differential expression (DE) results. We compared the new method, scBatch, with leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data. The comparisons demonstrated that scBatch achieved better sample clustering and DE gene detection results.
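This paper proposes scBatch, a new method for correcting batch effects, and compares it with two widely accepted methods, ComBat and mnnCorrect. The new method achieved better results.
Conclusions: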
-
The proposed method, scBatch, can obtain better clustering patterns, maintain crucial marker information, and detect more DE genes. (Open question: is detecting more DE genes necessarily better?)
-
The method assumes roughly balanced sample population among batches. According to the literature, this assumption is reasonable, and the method retains some robustness even when the assumption holds only approximately.
-
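Compared with ComBat and MNN, scBatch needs more time to reach its optimal result. Running time also depends on sample size, and the relationship is nonlinear. Even at the same sample size, computing time depends on how complex the batch effect is. The running time is long but still within an acceptable range.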
-
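The method leaves considerable room for improvement, in two main directions: 1. It uses the simplest linear transformation of the raw count matrix, but non-linear transformations could also be tried; 2. Distance is measured with the Pearson correlation matrix because it is easy to interpret and convenient for gradient computation, but other distance metrics such as Spearman correlation could be tried as well and may bring new insights. In that case, a more universal numerical gradient descent algorithm may be applied.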
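To make the second direction concrete, the following is a minimal sketch of how a Pearson-based and a Spearman-based sample correlation matrix could be computed from a genes × samples matrix. It is not code from the paper: the function names, the toy Poisson data, and the use of 1 − correlation as the dissimilarity are assumptions made for illustration.

```python
import numpy as np

def pearson_matrix(counts):
    """Pearson correlation between the columns (samples) of a genes x samples matrix."""
    return np.corrcoef(counts, rowvar=False)

def spearman_matrix(counts):
    """Spearman correlation, computed as the Pearson correlation of per-sample gene ranks."""
    # argsort of argsort gives 0-based ranks within each sample (column).
    # Simplification: ties are broken by position instead of being averaged.
    order = np.argsort(counts, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable").astype(float)
    return np.corrcoef(ranks, rowvar=False)

# Toy data (assumption): 200 genes, 6 samples, Poisson counts.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20.0, size=(200, 6)).astype(float)

pearson = pearson_matrix(counts)
spearman = spearman_matrix(counts)

# One common way to turn a correlation into a dissimilarity (assumption).
pearson_dist = 1.0 - pearson
spearman_dist = 1.0 - spearman
```

Because Spearman correlation depends only on ranks, it is unchanged by monotone transformations such as `log1p`, which is one reason it might behave differently from Pearson correlation on skewed count data. The price is that ranks are not differentiable, which fits the remark about needing a more general numerical gradient descent.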
Introduction:
- RNA-seq, a major tool for transcriptomics
- Owing to the limitations of sequencing technology and sample preparation, technical variations exist among reads from different batches of experiments. These are possible causes of batch effects.
- The severity of batch effects varies across datasets.
-
The correction of the batch effects can yield better clustering results
-
Johnson et al. (2007) proposed an empirical Bayes algorithm, ComBat, which continued to be a successful method in RNA-seq data. (Note: ComBat dates from 2007, yet a 2020 review still discusses it as the best-performing method. Is the field slow to move on?)
-
The main purpose of ComBat: to normalize the data by removing additive and multiplicative batch effects.
-
Researchers also attempted to find and correct unknown batch effects by utilizing control genes in microarray (Gagnon-Bartsch and Speed, 2012) and RNA-seq data (Leek, 2014; Risso et al., 2014; Chen and Zhou, 2017). Like ComBat, these methods are regression-based.
-
The strategies proposed in many recent methods allow for more complex batch effect mechanisms.
-
To achieve better clustering performance, Fei et al. (2018) developed a non-parametric approach, named QuantNorm, to correct the sample distance matrix by quantile normalization. (Note: why is QuantNorm missing from the later method comparisons? QuantNorm does not support DE detection.)
-
Haghverdi et al. (2018) utilized the mutual nearest neighbor relationships among samples from different batches to establish the MNN correction scheme.
-
The authors consider that these two methods already achieve reasonable performances in sample pattern detection, for example finding clusters or conducting dimension reduction.
-
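Recent methods, however, pay little attention to DE detection; QuantNorm is one example. MNN does support DE tests, but the authors do not recommend using the corrected matrix for DE analysis.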
-
This paper takes on DE detection and proposes a new method: to utilize the corrected sample distance matrix to further correct the count matrix.
-
we seek a linear transformation to the count matrix, such that the Pearson correlation matrix of the transformed matrix approximates the corrected correlation matrix obtained from QuantNorm.
-
we propose a random block coordinate descent algorithm to conduct linear transformation on the 𝑝 (genes) ×𝑛 (samples) count matrix.
-
Simulation studies demonstrate that in terms of DE gene detection, our method corrects the count matrix better compared to ComBat and MNN, with consistently higher area under the receiver operating characteristic curve (AUC) and area under the precision-recall curve (PRAUC).
-
In real data analyses, the proposed method also shows strong performances in clustering and DE detection in a bulk RNA-seq dataset (Lin et al., 2014) and two scRNA-seq datasets (Usoskin et al., 2015; Xin et al., 2016). The method was tested on several datasets, with clustering and DE detection as the two main evaluation tasks.
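The core idea noted above (find a linear transformation of the count matrix whose Pearson correlation matrix approximates a corrected target, using random block coordinate descent) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the transformation has the form Y = XW with an n × n weight matrix W, uses a squared Frobenius loss, replaces any analytic gradient with finite differences, and stands in for the QuantNorm-corrected matrix with the correlation of a batch-free toy signal. All function and variable names are hypothetical.

```python
import numpy as np

def corr_loss(X, W, D):
    """Squared Frobenius distance between cor(X @ W) and the target matrix D."""
    R = np.corrcoef(X @ W, rowvar=False)  # n x n sample correlation matrix
    return float(np.sum((R - D) ** 2))

def block_gradient(X, W, D, cols, eps=1e-5):
    """Central-difference gradient of the loss for the selected columns of W."""
    G = np.zeros((W.shape[0], len(cols)))
    for b, j in enumerate(cols):
        for i in range(W.shape[0]):
            bump = np.zeros_like(W)
            bump[i, j] = eps
            G[i, b] = (corr_loss(X, W + bump, D) - corr_loss(X, W - bump, D)) / (2 * eps)
    return G

def random_block_descent(X, D, n_iter=100, block=2, lr=0.1, seed=0):
    """Update a random block of columns of W per iteration, with step halving."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = np.eye(n)  # identity means "no correction yet"
    for _ in range(n_iter):
        cols = rng.choice(n, size=block, replace=False)
        G = block_gradient(X, W, D, cols)
        current = corr_loss(X, W, D)
        step = lr
        while step > 1e-8:  # accept only steps that lower the loss
            candidate = W.copy()
            candidate[:, cols] -= step * G
            if corr_loss(X, candidate, D) < current:
                W = candidate
                break
            step /= 2
    return W

# Toy data (assumption): Gaussian values standing in for log-scale expression,
# with a gene-wise shift shared by the first three samples as the batch effect.
rng = np.random.default_rng(1)
p, n = 100, 6
signal = rng.normal(size=(p, n))
batch_shift = np.outer(rng.normal(size=p), np.array([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
X = signal + batch_shift
D = np.corrcoef(signal, rowvar=False)  # stand-in for the corrected correlation matrix

W_hat = random_block_descent(X, D)
initial_loss = corr_loss(X, np.eye(n), D)
final_loss = corr_loss(X, W_hat, D)
corrected = X @ W_hat  # transformed matrix, usable for downstream analysis
```

Because each update is accepted only if it lowers the loss, `final_loss` cannot exceed `initial_loss`. Finite differences keep the sketch short but scale poorly with n, which is consistent with the running-time concern raised in the conclusions.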
Paper outline:
1. Introduction
2. Results
2.1 Batch effect correction based on corrected sample correlation matrix
2.2 scBatch achieves better DE detection in simulation
2.3 scBatch obtained better sample patterns for bulk RNA-seq data
2.4 scBatch shows strong performance in cell heterogeneity investigation
2.5 Mouse neuron dataset GSE59739
2.6 Human pancreas data GSE81608
3. Discussion
4. Methods
4.1 Main algorithm
4.2 Simulation design
4.3 Datasets and preprocessing
4.4 Analysis and performance evaluation scheme
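Excerpts from the main text:
- The evaluation does not rest on visual inspection of figures alone; DE detection and clustering quality are also assessed with quantitative metrics.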
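As an illustration of what such metrics look like for DE detection, the sketch below computes the two quantities named in the Introduction notes, AUC and the area under the precision-recall curve, from per-gene scores and known DE labels. It is a plain-Python sketch rather than the paper's evaluation code: the toy labels and scores are invented, and average precision is used here as a common step-wise approximation of PRAUC.

```python
def roc_auc(is_de, scores):
    """AUC as the chance that a true DE gene outscores a non-DE gene (ties count 0.5)."""
    pos = [s for y, s in zip(is_de, scores) if y == 1]
    neg = [s for y, s in zip(is_de, scores) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def average_precision(is_de, scores):
    """Mean precision at the rank of each true DE gene (requires at least one DE gene)."""
    ranked = sorted(zip(scores, is_de), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example (assumption): 1 marks a truly DE gene; a higher score means
# stronger evidence of differential expression, e.g. 1 - p-value.
is_de = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

auc = roc_auc(is_de, scores)           # 8/9, about 0.889
ap = average_precision(is_de, scores)  # 11/12, about 0.917
```

In the toy example one non-DE gene (score 0.7) outranks one DE gene (score 0.6), which is why both values fall slightly below 1.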