This covers two normalization approaches (NormalizeData / SCTransform) and preprocessing for large datasets (including >200,000 cells).
Dataset integration is anchor-based: anchors are computed after each dataset has been normalized, so different normalization methods produce different anchors.
Integration based on NormalizeData() (preferred)
LC.list <- lapply(X = LC.list, FUN = function(x) {
x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 20)
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures=2000)
LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "LogNormalize")
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "LogNormalize")
# Dimensionality reduction and clustering after correction
DefaultAssay(LC.integrated) <- "integrated"
LC.integrated <- ScaleData(LC.integrated, verbose = FALSE)
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
ElbowPlot(LC.integrated) # Elbow plot: ranks PCs by the percentage of variance each explains; used to choose how many PCs to keep
LC.integrated <- RunHarmony(LC.integrated, reduction = "pca", group.by.vars = "orig.ident", assay.use = "integrated")
# Either the pca or the harmony reduction can be used below (harmony is used here)
LC.integrated <- RunUMAP(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- RunTSNE(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindNeighbors(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindClusters(LC.integrated, resolution = 1)
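Rather than eyeballing ElbowPlot(), the number of PCs can also be estimated numerically from the variance explained per component. A minimal sketch, not part of the original workflow; it assumes LC.integrated already carries a "pca" reduction, and the 90% / 5% thresholds are arbitrary heuristics:

```r
# Percent variance explained by each PC, from the stdev slot of the reduction
stdev <- LC.integrated[["pca"]]@stdev
pct   <- stdev^2 / sum(stdev^2) * 100
cum   <- cumsum(pct)

# First PC where cumulative variance exceeds 90% and the marginal gain drops below 5%
n_pcs <- which(cum > 90 & pct < 5)[1]
n_pcs
```

The resulting n_pcs could then replace the fixed dims = 1:50 used above.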
Integration based on SCTransform()
An improved method based on regularized negative binomial regression; it replaces NormalizeData(), ScaleData(), and FindVariableFeatures().
After this correction runs, the SCT assay becomes the default assay for downstream steps.
With this method, select at least 3000 features, more than in the NormalizeData() workflow. The Seurat documentation explains that these additional features do not reflect technical differences between cells, but rather capture subtler biological variation.
The corrected values are stored in SCT@scale.data. This matrix is not sparse; with the default SCTransform() argument return.only.var.genes = TRUE, it holds the normalized values for the variable features only.
SCT@counts is computed from SCT@scale.data; the log-normalized version of SCT@counts is stored in SCT@data and is commonly used for visualization.
In principle, SCT@scale.data is the best choice for differential expression and integration.
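The three SCT slots described above can be inspected directly. A minimal sketch, assuming a Seurat object `obj` that has already been run through SCTransform() (not part of the original code):

```r
library(Seurat)

DefaultAssay(obj) <- "SCT"

# Corrected counts, recomputed from the model residuals
corrected_counts <- GetAssayData(obj, assay = "SCT", slot = "counts")

# log1p of the corrected counts; typically used for visualization
log_data <- GetAssayData(obj, assay = "SCT", slot = "data")

# Dense residual matrix; rows limited to variable features
# because return.only.var.genes = TRUE by default
residuals <- GetAssayData(obj, assay = "SCT", slot = "scale.data")
dim(residuals)
```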
LC.list <- lapply(X = LC.list, FUN = function(x) {
x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 20)
x <- SCTransform(x, vars.to.regress = "percent.mt", verbose = FALSE)
})
features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures = 3000)
LC.list <- PrepSCTIntegration(object.list = LC.list, anchor.features = features)
LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "SCT")
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "SCT")
# Dimensionality reduction and clustering after correction
DefaultAssay(LC.integrated) <- "integrated"
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
ElbowPlot(LC.integrated) # Elbow plot: ranks PCs by the percentage of variance each explains; used to choose how many PCs to keep
# Remaining steps are the same as above
Processing large datasets:
Instead of canonical correlation analysis (CCA), use reciprocal PCA (RPCA).
CCA is well suited to identifying anchors when cell types are conserved (even when experimental conditions or disease states introduce very strong expression shifts, or when integrating datasets across modalities or species). Across different experiments, however, when a large proportion of cells do not overlap between datasets, the very large differences in gene expression make CCA a poor fit.
RPCA is appropriate when cells in different biological states are unlikely to "align" after integration, i.e.: ① a substantial fraction of cells in one dataset has no match in the others; ② the datasets come from multiple lanes of the same platform (e.g., multiple lanes of 10x Genomics); ③ the datasets are very large.
Because RPCA-based integration is more conservative, you can increase k.anchor in FindIntegrationAnchors() (default = 5) to better align cell subpopulations; for example, naive and memory T cells are hard to distinguish.
Note: run PCA on each individual dataset before integration, and set reduction = "rpca" in FindIntegrationAnchors().
The SCTransform() workflow is similar: PCA is still required, but ScaleData() is not.
LC.list <- list(LC05, LC09, LC10, LC11, LC12, LC19, LC21, LC22, LC24, LC25, LC27, LC28, LC29, LC30, LC31, LC33)
# Compute mitochondrial QC metrics (percentage of counts); [[ adds a column to the object metadata
LC.list <- lapply(X = LC.list, FUN = function(x) {
x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 20)
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures=2000)
LC.list <- lapply(X = LC.list, FUN = function(x) {
x <- ScaleData(x, features = features, verbose = FALSE)
x <- RunPCA(x, features = features, verbose = FALSE)
})
LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "LogNormalize", reduction = "rpca", k.anchor = 20)
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "LogNormalize")
LC.integrated <- ScaleData(LC.integrated, verbose = FALSE)
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
LC.integrated <- RunHarmony(LC.integrated, reduction = "pca", group.by.vars = "orig.ident", assay.use = "integrated")
LC.integrated <- RunUMAP(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- RunTSNE(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindNeighbors(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindClusters(LC.integrated, resolution = 1)
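The SCTransform() variant of this RPCA workflow, mentioned in the notes above, differs in two places: ScaleData() is dropped and PrepSCTIntegration() is required before anchor finding. A hedged sketch, assuming LC.list is the same list of Seurat objects created at the start of the code above:

```r
library(Seurat)

# SCT normalization replaces NormalizeData/FindVariableFeatures/ScaleData
LC.list  <- lapply(LC.list, SCTransform, verbose = FALSE)
features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures = 3000)
LC.list  <- PrepSCTIntegration(object.list = LC.list, anchor.features = features)

# PCA per dataset is still required for RPCA
LC.list <- lapply(LC.list, RunPCA, features = features, verbose = FALSE)

LC.anchors    <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features,
                                        normalization.method = "SCT", reduction = "rpca")
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "SCT")
```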
The correction methods fall roughly into these three types and can be mixed and matched; experiment to find the approach that best suits your own dataset.
Harmony and PCA are two different dimensionality-reduction approaches and are also worth noting (I haven't explored them in depth yet).