Seurat -- ScaleData学习

All_Will_Be_Fine噻

已于 2023-04-27 16:11:24 修改

阅读量3.2k

点赞数 5

分类专栏： bioinfo R scRNAseq 文章标签： r语言 scRNA-seq

于 2023-04-27 16:10:44 首次发布

本文链接：https://blog.csdn.net/jiangshandaiyou/article/details/130405786

版权

R 同时被 3 个专栏收录

84 篇文章

订阅专栏

bioinfo

69 篇文章

订阅专栏

scRNAseq

22 篇文章

订阅专栏

brief

seurat提供了一个教学，其中global scale normalization之后又对数据进行了scale。
默认是对上一步 selected highly variable features进行scale。

概要图以及系列博文可以参见链接。

在这里插入图片描述

如果是 SCTransform则不需要手动运行这一步。

下面是就是教程提供的流程：

library(dplyr)
library(Seurat)
library(patchwork)
library(sctransform)
library(ggplot2)

rm(list=ls())
# 使用read10X读取output of the cellranger pipeline from 10X，包括barcodes，genes，matrix.mtx三个文件
pbmc.data <- Read10X(data.dir = "D:/djs/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19")
# 使用 CreateSeuratObject函数构造seurat对象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k",
                           min.cells = 3, min.features = 200,
                           names.delim = "-",names.field = 1)

# 计算 a percentage of cell reads originating from the mitochondrial genes
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# 计算 complexity of the RNA species
pbmc@meta.data$log10GenesPerUMI <- log10(pbmc$nFeature_RNA) / log10(pbmc$nCount_RNA)

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc,assay = "RNA" ,selection.method = "vst", nfeatures = 2000)

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes,assay = "RNA")

pbmc <- RunPCA(pbmc)

这里是该函数具体的参数以及意义：
- features
  Vector of features names to scale/center. Default is variable features.
- vars.to.regress
  Variables to regress out (previously latent.vars in RegressOut). For example, nUMI, or percent.mito.
- split.by
  Name of variable in object metadata or a vector or factor defining grouping of cells. See argument f in split for more details
- model.use
  Use a linear model or generalized linear model (poisson, negative binomial) for the regression. Options are ‘linear’ (default), ‘poisson’, and ‘negbinom’
- use.umi
  Regress on UMI count data. Default is FALSE for linear modeling, but automatically set to TRUE if model.use is ‘negbinom’ or ‘poisson’
- do.scale
  Whether to scale the data.
- do.center
  Whether to center the data.
- scale.max
  Max value to return for scaled data. The default is 10. Setting this can help reduce the effects of features that are only expressed in a very small number of cells. If regressing out latent variables and using a non-linear model, the default is 50.
- block.size
  Default size for number of features to scale at in a single computation. Increasing block.size may speed up calculations but at an additional memory cost.
- min.cells.to.block
  If object contains fewer than this number of cells, don’t block for scaling calculations.
- verbose
  Displays a progress bar for scaling procedure
- assay
  Name of Assay to scale

这个函数和基础函数scale()的结果一样吗？

没区别
可能是调用代码不一样~~

t(pbmc@assays$RNA@scale.data[1:3,1:6])

scale(t(as.array(pbmc@assays$RNA@data)),center = T,scale = T)[1:6,1:3]

在这里插入图片描述

为什么要scale？

一个gene 的表达量在不同细胞中的分布可以认为是正态分布，当你将这个gene的表达量中心化以及标准化成为标准正态分布后（z-score），不同gene的表达量分布就在同一个尺度上了，方便比较。

高表达的gene在下游的分析中和低表达gene在下游分析中权重也就一致了，不然高表达的gene在下游的分析中比如PCA就会占据主导地位，而细胞间的变异需要同时考虑gene的表达量以及gene的特异性表达，特异性表达的基因表达量通常不高。

ScaleData()后的数据存放和后续应用

scale标准化的数据储存在"RNA" assay的 seurat_obj[[‘RNA’]]@scale.data中
我们也注意到seurat_obj[[‘RNA’]]@data全是非负数，而且是针对基因矩阵的所有基因；而seurat_obj[[‘RNA’]]@scale.data则有正负数，默认情况，只针对高可变基因进行scale标准化；
那么，我们在seurat下游分析中，什么情况使用data，什么时候使用scale.data：
- 下游分析中的PCA线性降维聚类，umap、tsne聚类均是应用高可变基因的scale.data进行后续分析的；
- 在基因可视化分析中，FeaturePlot、FeatureScatter、VlnPlot、DotPlot等函数默认slot =“data”，只有DoHeatmap()默认使用slot = “scale.data”，多个基因跨细胞比较；
- FindAllMarkers()找差异基因是默认slot =“data”，它是针对所有基因找差异基因，而不是高可变基因集

多个数据集整合应该怎样调用ScaleData()

这里的多个数据集只包括scRNA-seq数据集。
如果仅仅是数据集之间的merge(需要做简单的QC验证没有批次效应)，那应该使用 RNA assays下面的data进行 scale。当然你也可以在运行ScaleData时加入split.by区分数据集以分别进行scale(没验证过会出现什么问题)。
如果数据集之间进行了integrated运算，那应该使用integrated assays下面的data进行 scale