Seurat | 多数据集预处理校正标准化及整合流程 | 大数据集处理

13 篇文章 0 订阅
3 篇文章 1 订阅

主要涉及两种校正方式(NormlizeData/Sctansform)以及针对大数据集(including >200,000 cells)进行的数据预处理。

数据集整合是基于anchor,而anchor的产生需要对不同数据集进行校正后计算,不同的校正方式,产生不同的anchor。

基于NormalizeData()的校正(首选)

LC.list <- lapply(X = LC.list, FUN = function(x) {
    x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
    x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 20)
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures=2000)
LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "LogNormalize")
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "LogNormalize")


#校正后的降维、聚类

DefaultAssay(LC.integrated) <- "integrated"
LC.integrated <- ScaleData(LC.integrated, verbose = FALSE)
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
ElbowPlot(LC.integrated) # 肘部图:根据主成分中每个变量的方差比。对pca的维度进行选择
LC.integrated <- RunHarmony(LC.integrated, reduction="pca", group.by.vars="orig.ident",assay.use = "integrated")

# 下面是可以选择pca/harmony不同的两种降维方式(这里以harmony为例)

LC.integrated <- RunUMAP(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- RunTSNE(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindNeighbors(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindClusters(LC.integrated, resolution = 1)

基于SCTransform()的校正

基于正则化负二项回归的改进方法,代替函数NormalizeData()ScaleData()FindVariableFeatures()。

针对SCT矩阵,该校正执行之后,SCT矩阵后续会变成默认矩阵。

使用该校正方法时,feature选择最好是3000及以上。比NormalizeData()校正方法中的feature要多。在这里,官网给的解释是,多出来的这些feature并不是因为细胞间校正技术的不同,而是代表更细微的生物学变化。

校正后的结果保存在SCT@scale.data中。这个矩阵中为非稀疏矩阵,在SCTransform()中默认参数为‘return.only.var.genes = TRUE’,使得这个矩阵中保存的是高变feature的相关归一化值。

SCT@counts 通过SCT@scale.data来计算。SCT@counts的对数标准化数据保存在SCT@data中,常用于可视化。

 原则上,对数据进行差异表达与整合,最佳的选择是用SCT@scale.data。

LC.list <- lapply(X = LC.list, FUN = function(x) {
    x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
    x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 50 00 & percent.mt < 20)
    x <- SCTransform(x, vars.to.regress = "percent.mt", verbose = FALSE)
})

features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures = 3000)
LC.list <- PrepSCTIntegration(object.list = LC.list, anchor.features = features)

LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "SCT")
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "SCT")

#校正后的降维、聚类

DefaultAssay(LC.integrated) <- "integrated"
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
ElbowPlot(LC.integrated) # 肘部图:根据主成分中每个变量的方差比。对pca的维度进行选择

#后续同上

大型数据集处理:

不再使用典型相关分析canonical correlation analysis (CCA),而是用相关主成分分析reciprocal PCA (RPCA)。

CCA适合于在细胞类型保守的情况下(实验条件或疾病状态引入非常强的表达变化,或跨模式和物种整合数据集)识别锚点,但在不同的实验中(当大比例的单元格在数据集之间不重叠时),基因表达存在非常大的差异。

RPCA适合情况:当不同生物状态的细胞在整合后不太可能“对齐”时,即:①一个数据集中的大部分细胞无法在其他数据集中匹配;②同一平台的多个通道(multiple lanes of 10x genomics);③数据集过大。

因为基于RPCA的整合方法比较保守,在FindIntegrationAnchors() 中可以增加k.anchor的数量(default=5),来对齐细胞亚群。例如Naive T与memory T区分难度较大。

注意点:在整合数据前需要对单个数据集进行主成分分析。

FindIntegrationAnchors()时,需要设置reduction = "rpca"。

SCTransform()做法类似,也要run PCA,但不需要ScaleData()。

LC.list <- list(LC05, LC09, LC10, LC11, LC12, LC19, LC21, LC22, LC24, LC25, LC27, LC28, LC29, LC30, LC31, LC33)

# 计算线粒体QC指标(计数百分比),[[表示增加一列,
LC.list <- lapply(X = LC.list, FUN = function(x) {
    x[["percent.mt"]] <- PercentageFeatureSet(x, pattern = "^mt-")
    x <- subset(x, subset = nFeature_RNA > 500 & nFeature_RNA < 5000 & percent.mt < 20)
    x <- NormalizeData(x)
    x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})

features <- SelectIntegrationFeatures(object.list = LC.list, nfeatures=2000)
LC.list <- lapply(X = LC.list, FUN = function(x) {
    x <- ScaleData(x, features = features, verbose = FALSE)
    x <- RunPCA(x, features = features, verbose = FALSE)
})

LC.anchors <- FindIntegrationAnchors(object.list = LC.list, anchor.features = features, normalization.method = "LogNormalize",reduction = "rpca", k.anchor = 20)
LC.integrated <- IntegrateData(anchorset = LC.anchors, normalization.method = "LogNormalize")

LC.integrated <- ScaleData(LC.integrated, verbose = FALSE)
LC.integrated <- RunPCA(LC.integrated, npcs = 50, verbose = FALSE)
LC.integrated <- RunHarmony(LC.integrated, reduction="pca", group.by.vars="orig.ident",assay.use = "integrated")
LC.integrated <- RunUMAP(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- RunTSNE(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindNeighbors(LC.integrated, reduction = "harmony", dims = 1:50)
LC.integrated <- FindClusters(LC.integrated, resolution = 1)

校正方法基本按照这三种类型,可以搭配,尝试找出最适合自身数据集的方式进行校正。

harmony与pca是两种不同的降维方式,也值得记录一下。(还没深入探索过qaq)

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值