RNA 38. SCI文章基于测序数据去除批次效应(SVA)

最新推荐文章于 2024-12-23 16:56:02 发布

桓峰基因

最新推荐文章于 2024-12-23 16:56:02 发布

阅读量841

点赞数

文章标签： linux 运维服务器

本文链接：https://blog.csdn.net/weixin_41368414/article/details/134797273

版权

这期推荐软件包SVA: 基于高通量测序数据去除批次效应，下面我们就看看怎么来实现吧！

简介

异质性和潜在变量现在被广泛认为是高通量实验中偏差和变异性的主要来源。基因组实验中最著名的潜在变异来源是批效应——当样本在不同的日子、不同的小组或由不同的人处理时。然而，还有大量其他变量可能对高通量测量产生重大影响。在这里，我们描述了用于识别、估计和去除高通量实验中不需要的变异源的sva包。sva包支持使用sva函数进行代理变量估计，使用ComBat函数直接调整已知批处理效果，以及使用fsva函数调整预测问题中的批处理和潜在变量。

sva包可以通过两种方式去除批次效应:

(1)在高通量实验中识别和估计未知变异源的替代变量，

(2)使用ComBat直接去除已知的批效应。

批效应的定义如下:

批次效应是测量的子组，它们在不同条件下具有不同的性质，与研究中的生物或科学变量无关。例如，如果一组实验在星期一进行，另一组实验在星期二进行，如果两个技术人员负责不同的实验子集，或者如果两个不同批次的试剂，可能会发生批处理效应，使用芯片或仪器。

软件包安装

软件包安装如下：

if(!require(sva))
  BiocManager::install("sva")
if(!require(bladderbatch))  
  BiocManager::install("bladderbatch")
if(!require(zebrafishRNASeq))
  BiocManager::install("zebrafishRNASeq")

数据读取

数据准备需要主要几个文件：

表达矩阵（a matrix with features (genes, transcripts, voxels) the rows and samples in the columns）；
表现数据（phenotype data）。

library(sva)
library(bladderbatch)
library(pamr)
library(limma)

实例操作

设置来自ExpressionSet的数据

接下来，我们创建完整的模型矩阵——包括调整变量和感兴趣的变量(癌症状态)。在这种情况下，我们只有感兴趣的变量。由于癌症状态有多个层次，我们将其视为一个因素变量。

data(bladderdata)
bladderEset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 57 samples 
##   element names: exprs, se.exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM71019.CEL GSM71020.CEL ... GSM71077.CEL (57 total)
##   varLabels: sample outcome batch cancer
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu133a
pheno = pData(bladderEset)
head(pheno)
##              sample outcome batch cancer
## GSM71019.CEL      1  Normal     3 Normal
## GSM71020.CEL      2  Normal     2 Normal
## GSM71021.CEL      3  Normal     2 Normal
## GSM71022.CEL      4  Normal     3 Normal
## GSM71023.CEL      5  Normal     3 Normal
## GSM71024.CEL      6  Normal     3 Normal
table(pheno$batch)
## 
##  1  2  3  4  5 
## 11 18  4  5 19
edata = exprs(bladderEset)
head(edata[, 1:5])
##           GSM71019.CEL GSM71020.CEL GSM71021.CEL GSM71022.CEL GSM71023.CEL
## 1007_s_at    10.115170     8.628044     8.779235     9.248569    10.256841
## 1053_at       5.345168     5.063598     5.113116     5.179410     5.181383
## 117_at        6.348024     6.663625     6.465892     6.116422     5.980457
## 121_at        8.901739     9.439977     9.540738     9.254368     8.798086
## 1255_g_at     3.967672     4.466027     4.144885     4.189338     4.078509
## 1294_at       7.775183     7.110154     7.248430     7.017220     7.896419
mod = model.matrix(~as.factor(cancer), data = pheno)
mod0 = model.matrix(~1, data = pheno)

应用sva函数来估计批处理和人工因素

sva函数执行两个不同的步骤。首先，它确定的数量需要估计的潜在因素。如果调用sva函数时没有指定的n.sv参数，将为您估计因子的数量。因子的数量也可以使用num.sv来估计。接下来，我们应用sva函数来估计代理变量:

n.sv = num.sv(edata, mod, method = "leek")
n.sv
## [1] 2
svobj = sva(edata, mod, mod0, n.sv = n.sv)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

使用f.pvalue函数调整代理变量

f.pvalue函数可用于计算数据矩阵每一行的参数f检验p值。在膀胱研究的情况下，这将对应于为矩阵的22,283行中的每一行计算参数f检验p值。f检验比较模型mod和模型mo0。它们必须是嵌套模型，因此mod0中的所有变量都必须出现在mod中。首先，我们可以计算相对于癌症状态的差异表达的f检验p值，不需要调整替代变量，根据多次检验进行调整，并计算q值小于0.05的显著数量。请注意，近70%的基因在组间FDR小于5%的情况下表达强烈差异。这个数字似乎是人为的高，即使对于像癌症这样的强表现型。现在我们可以执行相同的分析，但是要对代理变量进行调整。第一步是在null和full模型中都包含代理变量。原因是我们想要调整代理变量，所以我们将它们视为必须包含在两个模型中的调整变量。然后p值和q值可以像之前一样计算。

pValues = f.pvalue(edata, mod, mod0)
qValues = p.adjust(pValues, method = "BH")
modSv = cbind(mod, svobj$sv)
mod0Sv = cbind(mod0, svobj$sv)
pValuesSv = f.pvalue(edata, modSv, mod0Sv)
qValuesSv = p.adjust(pValuesSv, method = "BH")

使用limma包调整代理变量

limma包是差分表达式分析中最常用的包之一。sva包可以很容易地与limma包一起使用，以执行调整的差异表达分析。这个过程的第一步是拟合包含代理变量的线性模型。从这里开始，您可以使用limma函数来执行通常的分析。举个例子，假设我们想计算癌症的微分表达式。要做到这一点，我们首先计算癌症/正常项之间的对比。我们没有在对比中包括替代变量，因为它们只是用来调整分析。

fit = lmFit(edata, modSv)
contrast.matrix <- cbind(C1 = c(-1, 1, 0, rep(0, svobj$n.sv)), C2 = c(0, -1, 1, rep(0,
    svobj$n.sv)), C3 = c(-1, 0, 1, rep(0, svobj$n.sv)))
fitContrasts = contrasts.fit(fit, contrast.matrix)
eb = eBayes(fitContrasts)
topTableF(eb, adjust = "BH")
##                    C1          C2        C3   AveExpr        F      P.Value
## 207783_x_at -13.45607  0.26592268 -13.19015 12.938786 8622.529 1.207531e-69
## 201492_s_at -13.27594  0.15357702 -13.12236 13.336090 8605.649 1.274450e-69
## 208834_x_at -12.76411  0.06134018 -12.70277 13.160201 6939.501 4.749368e-67
## 212869_x_at -13.77957  0.26008165 -13.51948 13.452076 6593.346 1.939773e-66
## 212284_x_at -13.59977  0.29135767 -13.30841 13.070844 5495.716 2.893287e-64
## 208825_x_at -12.70979  0.08250821 -12.62728 13.108072 5414.741 4.350100e-64
## 211445_x_at -10.15890 -0.06633356 -10.22523  9.853817 5256.114 9.845076e-64
## 213084_x_at -12.59345  0.03015520 -12.56329 13.046529 4790.107 1.260201e-62
## 201429_s_at -13.33686  0.28358293 -13.05328 12.941208 4464.995 8.675221e-62
## 214327_x_at -12.60146  0.20934783 -12.39211 11.832607 4312.087 2.257025e-61
##                adj.P.Val
## 207783_x_at 1.419929e-65
## 201492_s_at 1.419929e-65
## 208834_x_at 3.527673e-63
## 212869_x_at 1.080599e-62
## 212284_x_at 1.289423e-60
## 208825_x_at 1.615555e-60
## 211445_x_at 3.133969e-60
## 213084_x_at 3.510132e-59
## 201429_s_at 2.147888e-58
## 214327_x_at 5.029329e-58

应用ComBat()调整已知批次

ComBat函数使用经验贝叶斯框架对已知批次进行调整。为了使用该函数，必须在数据集中有一个已知的批处理变量。

batch = pheno$batch
modcombat = model.matrix(~1, data = pheno)
combat_edata = ComBat(dat = edata, batch = batch, mod = modcombat, par.prior = TRUE,
    prior.plots = FALSE)
pValuesComBat = f.pvalue(combat_edata, mod, mod0)
qValuesComBat = p.adjust(pValuesComBat, method = "BH")

library(pheatmap)
pheatmap(as.matrix(edata),annotation_col =pheno,show_rownames = F)

pheatmap(as.matrix(combat_edata),annotation_col =pheno,show_rownames = F)

ComBat-Seq用于批量调整RNA-Seq计数数据

count_matrix <- matrix(rnbinom(400, size = 10, prob = 0.1), nrow = 50, ncol = 8)
batch <- c(rep(1, 4), rep(2, 4))
adjusted <- ComBat_seq(count_matrix, batch = batch, group = NULL)
## Found 2 batches
## Using null model in ComBat-seq.
## Adjusting for 0 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Shrinkage off - using GLM estimates for parameters
## Adjusting the data
group <- rep(c(0, 1), 4)
adjusted_counts <- ComBat_seq(count_matrix, batch = batch, group = group)
## Found 2 batches
## Using full model in ComBat-seq.
## Adjusting for 1 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Shrinkage off - using GLM estimates for parameters
## Adjusting the data
cov1 <- rep(c(0, 1), 4)
cov2 <- c(0, 0, 1, 1, 0, 0, 1, 1)
covar_mat <- cbind(cov1, cov2)
adjusted_counts <- ComBat_seq(count_matrix, batch = batch, group = NULL, covar_mod = covar_mat)
## Found 2 batches
## Using null model in ComBat-seq.
## Adjusting for 2 covariate(s) or covariate level(s)
## Estimating dispersions
## Fitting the GLM model
## Shrinkage off - using GLM estimates for parameters
## Adjusting the data

用线性模型去除已知的批处理效果

modBatch = model.matrix(~as.factor(cancer) + as.factor(batch), data = pheno)
mod0Batch = model.matrix(~as.factor(batch), data = pheno)
pValuesBatch = f.pvalue(edata, modBatch, mod0Batch)
qValuesBatch = p.adjust(pValuesBatch, method = "BH")

替代变量vs直接调整

方差滤波在特征数量较大时加快计算速度(m>100,000)

n.sv = num.sv(edata, mod, vfilter = 2000, method = "leek")
svobj = sva(edata, mod, mod0, n.sv = n.sv, vfilter = 2000)
## Number of significant surrogate variables is:  2 
## Iteration (out of 5 ):1  2  3  4  5

应用fsva函数去除批处理效果进行预测

set.seed(12354)
trainIndicator = sample(1:57, size = 30, replace = FALSE)
testIndicator = (1:57)[-trainIndicator]
trainData = edata[, trainIndicator]
testData = edata[, testIndicator]
trainPheno = pheno[trainIndicator, ]
testPheno = pheno[testIndicator, ]
mydata = list(x = trainData, y = trainPheno$cancer)
mytrain = pamr.train(mydata)
## 123456789101112131415161718192021222324252627282930
table(pamr.predict(mytrain, testData, threshold = 2), testPheno$cancer)
##         
##          Biopsy Cancer Normal
##   Biopsy      4      0      0
##   Cancer      0     15      0
##   Normal      1      4      3
trainMod = model.matrix(~cancer, data = trainPheno)
trainMod0 = model.matrix(~1, data = trainPheno)
trainSv = sva(trainData, trainMod, trainMod0)
## Number of significant surrogate variables is:  5 
## Iteration (out of 5 ):1  2  3  4  5
fsvaobj = fsva(trainData, trainMod, trainSv, testData)
mydataSv = list(x = fsvaobj$db, y = trainPheno$cancer)
mytrainSv = pamr.train(mydataSv)
## 123456789101112131415161718192021222324252627282930

用于测序的Sva (svaseq)

library(zebrafishRNASeq)
data(zfGenes)
filter = apply(zfGenes, 1, function(x) length(x[x > 5]) >= 2)
filtered = zfGenes[filter, ]
genes = rownames(filtered)[grep("^ENS", rownames(filtered))]
controls = grepl("^ERCC", rownames(filtered))
group = as.factor(rep(c("Ctl", "Trt"), each = 3))
dat0 = as.matrix(filtered)
## Set null and alternative models (ignore batch)
mod1 = model.matrix(~group)
mod0 = cbind(mod1[, 1])
svseq = svaseq(dat0, mod1, mod0, n.sv = 1)$sv
## Number of significant surrogate variables is:  1 
## Iteration (out of 5 ):1  2  3  4  5
plot(svseq, pch = 19, col = "blue")

## Supervised sva
sup_svseq = svaseq(dat0, mod1, mod0, controls = controls, n.sv = 1)$sv
## sva warning: controls provided so supervised sva is being performed.
## Number of significant surrogate variables is:  1
plot(sup_svseq, svseq, pch = 19, col = "blue")

Reference

W.E. Johnson, C. Li, and A. Rabinovic. Adjusting batch effects in microarray data using empirical bayes methods. Biostatistics, 8(1):118–127, 2007.
J.T. Leek and J.D. Storey. Capturing heterogeneity in gene expression studies by ‘surrogate variable analysis’. PLoS Genetics 3:e161, 2007.
J.T. Leek and J.D. Storey. A general framework for multiple testing dependence. Proceedings of the National Academy of Sciences 105:18718-18723, 2008.
J. T. Leek, R. B. Scharpf, H. C. Bravo, D. Simcha, B. Langmead, W. E. Johnson, D. Geman, K. Baggerly, and R. A. Irizarry. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet., 11:733–739, Oct 2010.
L. Dyrskjot, M. Kruhoffer, T. Thykjaer, N. Marcussen, J. L. Jensen, K. Moller, and T. F. Orntoft. Gene expression in the urinary bladder: a common carcinoma in situ gene expression signature exists disregarding histopathological classification. Cancer Res., 64:4040–4048, Jun 2004.
Yuqing Zhang, David F Jenkins, Solaiappan Manimaran, and W Evan Johnson. Alternative empirical bayes models for adjusting for batch effects in genomic studies. BMC bioinformatics, 19(1):262, 2018.
Johann A Gagnon-Bartsch and Terence P Speed. Using control genes to correct for unwanted variation in microarray data. Biostatistics,13(3):539–552, 2012.