Seurat包的打分函数AddModuleScore

本文介绍了Seurat包中的AddModuleScore函数,通过计算基因集在单个细胞中的表达得分,结合控制特征的随机比较,提供了基因集在细胞层次的量化评估。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

在单细胞数据分析的过程中,Seurat包提供了一个为一个基因集打分的函数AddModuleScore(自定基因集),为基因集进行打分常见的富集分析软件GSVA,今天我们来看看Seurat这个函数的用法和意义。

library(Seurat)
?AddModuleScore

给出的解释说明是Calculate module scores for feature expression programs in single cells,也就是计算单个细胞在一个基因集的score值。
描述是这样的

Calculate the average expression levels of each program (cluster) on 
single cell level, subtracted by the aggregated expression of control 
feature sets. All analyzed features are binned based on averaged 
expression, and the control features are randomly selected from each bin.

看完这个解释,云里雾里,不知所云。
再来看看用法和参数:

Usage:

     AddModuleScore(
       object,
       features,
       pool = NULL,
       nbin = 24,
       ctrl = 100,
       k = FALSE,
       assay = NULL,
       name = "Cluster",
       seed = 1,
       search = FALSE,
       ...
     )
Arguments:

  object: Seurat object

features: Feature expression programs in list

    pool: List of features to check expression levels agains, defaults
          to ‘rownames(x = object)’

    nbin: Number of bins of aggregate expression levels for all
          analyzed features

    ctrl: Number of control features selected from the same bin per
          analyzed feature

       k: Use feature clusters returned from DoKMeans

   assay: Name of assay to use

    name: Name for the expression programs

    seed: Set a random seed. If NULL, seed is not set.
   search: Search for symbol synonyms for features in ‘features’ that
          don't match features in ‘object’? Searches the HGNC's gene
          names database; see ‘UpdateSymbolList’ for more details

     ...: Extra parameters passed to ‘UpdateSymbolList’

计算结果会返回一个数值,有正有负,今天我们的任务就是来参透这个函数,既然看解释不明白,那就只能看原函数了。

function (object, features, pool = NULL, nbin = 24, ctrl = 100, 
    k = FALSE, assay = NULL, name = "Cluster", seed = 1, search = FALSE, 
    ...) 
{
    if (!is.null(x = seed)) {
        set.seed(seed = seed)
    }
    assay.old <- DefaultAssay(object = object)
    assay <- assay %||% assay.old
    DefaultAssay(object = object) <- assay
    assay.data <- GetAssayData(object = object)
    features.old <- features
    if (k) {
        .NotYetUsed(arg = "k")
        features <- list()
        for (i in as.numeric(x = names(x = table(object@kmeans.obj[[1]]$cluster)))) {
            features[[i]] <- names(x = which(x = object@kmeans.obj[[1]]$cluster == 
                i))
        }
        cluster.length <- length(x = features)
    }
    else {
        if (is.null(x = features)) {
            stop("Missing input feature list")
        }
        features <- lapply(X = features, FUN = function(x) {
            missing.features <- setdiff(x = x, y = rownames(x = object))
            if (length(x = missing.features) > 0) {
                warning("The following features are not present in the object: ", 
                  paste(missing.features, collapse = ", "), ifelse(test = search, 
                    yes = ", attempting to find updated synonyms", 
                    no = ", not searching for symbol synonyms"), 
                  call. = FALSE, immediate. = TRUE)
                if (search) {
                  tryCatch(expr = {
                    updated.features <- UpdateSymbolList(symbols = missing.features, 
                      ...)
                    names(x = updated.features) <- missing.features
                    for (miss in names(x = updated.features)) {
                      index <- which(x == miss)
                      x[index] <- updated.features[miss]
                    }
                  }, error = function(...) {
                    warning("Could not reach HGNC's gene names database", 
                      call. = FALSE, immediate. = TRUE)
                  })
                  missing.features <- setdiff(x = x, y = rownames(x = object))
                  if (length(x = missing.features) > 0) {
                    warning("The following features are still not present in the object: ", 
                      paste(missing.features, collapse = ", "), 
                      call. = FALSE, immediate. = TRUE)
                  }
                }
            }
            return(intersect(x = x, y = rownames(x = object)))
        })
        cluster.length <- length(x = features)
    }
    if (!all(LengthCheck(values = features))) {
        warning(paste("Could not find enough features in the object from the following feature lists:", 
            paste(names(x = which(x = !LengthCheck(values = features)))), 
            "Attempting to match case..."))
        features <- lapply(X = features.old, FUN = CaseMatch, 
            match = rownames(x = object))
    }
    if (!all(LengthCheck(values = features))) {
        stop(paste("The following feature lists do not have enough features present in the object:", 
            paste(names(x = which(x = !LengthCheck(values = features)))), 
            "exiting..."))
    }
    pool <- pool %||% rownames(x = object)
    data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
    data.avg <- data.avg[order(data.avg)]
    data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
        n = nbin, labels = FALSE, right = FALSE)
    names(x = data.cut) <- names(x = data.avg)
    ctrl.use <- vector(mode = "list", length = cluster.length)
    for (i in 1:cluster.length) {
        features.use <- features[[i]]
        for (j in 1:length(x = features.use)) {
            ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
        }
    }
    ctrl.use <- lapply(X = ctrl.use, FUN = unique)
    ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
        ncol = ncol(x = object))
    for (i in 1:length(ctrl.use)) {
        features.use <- ctrl.use[[i]]
        ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
            ])
    }
    features.scores <- matrix(data = numeric(length = 1L), nrow = cluster.length, 
        ncol = ncol(x = object))
    for (i in 1:cluster.length) {
        features.use <- features[[i]]
        data.use <- assay.data[features.use, , drop = FALSE]
        features.scores[i, ] <- Matrix::colMeans(x = data.use)
    }
    features.scores.use <- features.scores - ctrl.scores
    rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
    features.scores.use <- as.data.frame(x = t(x = features.scores.use))
    rownames(x = features.scores.use) <- colnames(x = object)
    object[[colnames(x = features.scores.use)]] <- features.scores.use
    CheckGC()
    DefaultAssay(object = object) <- assay.old
    return(object)
}

我们来一步一步解析这个函数
我们先用它的默认值

library(Seurat)
load(Rdata)
assay.old <- DefaultAssay(object = object)  ###结果为RNA
assay <- assay %||% assay.old   ### %||%的用法可自行查询,这里还是显示RNA
DefaultAssay(object = object) <- assay 
assay.data <- GetAssayData(object = object)
features.old <- features  ## 运行到这里,得到矩阵信息assay.data和gene list。
###这里我们慢慢运行
cluster.length <- length(x = features)
missing.features <- setdiff(x = features, y = rownames(x = object))  ##setdiff函数的用法 我们这里显然是不会有不同的。
###接下来我们就可以直接跳到
pool <- pool %||% rownames(x = object)
data.avg <- Matrix::rowMeans(x = assay.data[pool, ])
data.avg <- data.avg[order(data.avg)]    ###排序,从小到大
data.cut <- cut_number(x = data.avg + rnorm(n = length(data.avg))/1e+30, 
n = nbin, labels = FALSE, right = FALSE)
###rnorm(n, mean = 0, sd = 1) n 为产生随机值个数(长度),mean 是平均数, sd 是标准差 ,也就是说这里将数据切成了nbin份(由小到大)。
names(x = data.cut) <- names(x = data.avg)
ctrl.use <- vector(mode = "list", length = cluster.length)  ##空的列表
 for (i in 1:cluster.length) {
        features.use <- features[[i]]
        for (j in 1:length(x = features.use)) {
            ctrl.use[[i]] <- c(ctrl.use[[i]], names(x = sample(x = data.cut[which(x = data.cut == 
                data.cut[features.use[j]])], size = ctrl, replace = FALSE)))
        }
    }
ctrl.use <- lapply(X = ctrl.use, FUN = unique)
ctrl.scores <- matrix(data = numeric(length = 1L), nrow = length(x = ctrl.use), 
        ncol = ncol(x = object))   ####空的
for (i in 1:length(ctrl.use)) {
        features.use <- ctrl.use[[i]]
        ctrl.scores[i, ] <- Matrix::colMeans(x = assay.data[features.use, 
            ])
    }       #######细胞的平均值
features.scores.use <- features.scores - ctrl.scores
    rownames(x = features.scores.use) <- paste0(name, 1:cluster.length)
    features.scores.use <- as.data.frame(x = t(x = features.scores.use))
    rownames(x = features.scores.use) <- colnames(x = object)
    object[[colnames(x = features.scores.use)]] <- features.scores.use
    CheckGC()
    DefaultAssay(object = object) <- assay.old

走到这里,相信大家应该都明白了,也就是我们感兴趣的基因,抽出来,每一个细胞算一个这些基因表达的平均值,
背景基因的平均值在于找每个基因的所在的bin,在该bin内随机抽取相应的ctrl个基因作为背景,最后所有的目标基因算一个平均值,所有的背景基因算一个平均值,两者相减就是该gene set 的score值。
至于生物学意义,仁者见仁智者见智了。

> # 去除细胞周期影响 > s.genes <- cc.genes$s.genes > g2m.genes <- cc.genes$g2m.genes > control_data1 <- CellCycleScoring( + control_data1, + s.features = s.genes, + g2m.features = g2m.genes + ) 警告: The following features are not present in the object: MCM5, PCNA, TYMS, FEN1, MCM2, MCM4, RRM1, UNG, GINS2, MCM6, CDCA7, DTL, PRIM1, UHRF1, MLF1IP, HELLS, RFC2, RPA2, NASP, RAD51AP1, GMNN, WDR76, SLBP, CCNE2, UBR7, POLD3, MSH2, ATAD2, RAD51, RRM2, CDC45, CDC6, EXO1, TIPIN, DSCC1, BLM, CASP8AP2, USP1, CLSPN, POLA1, CHAF1B, BRIP1, E2F8, not searching for symbol synonyms 警告: The following features are not present in the object: HMGB2, CDK1, NUSAP1, UBE2C, BIRC5, TPX2, TOP2A, NDC80, CKS2, NUF2, CKS1B, MKI67, TMPO, CENPF, TACC3, FAM64A, SMC4, CCNB2, CKAP2L, CKAP2, AURKB, BUB1, KIF11, ANP32E, TUBB4B, GTSE1, KIF20B, HJURP, CDCA3, HN1, CDC20, TTK, CDC25C, KIF2C, RANGAP1, NCAPD2, DLGAP5, CDCA2, CDCA8, ECT2, KIF23, HMMR, AURKA, PSRC1, ANLN, LBR, CKAP5, CENPE, CTCF, NEK2, G2E3, GAS2L3, CBX5, CENPA, not searching for symbol synonyms 错误于AddModuleScore(object = object, features = features, name = name, : The following feature lists do not have enough features present in the object: S.Score exiting...The following feature lists do not have enough features present in the object: G2M.Score exiting... 此外: 警告信息: In AddModuleScore(object = object, features = features, name = name, : Could not find enough features in the object from the following feature lists: S.Score Attempting to match case...Could not find enough features in the object from the following feature lists: G2M.Score Attempting to match case...解析报错原因
最新发布
03-20
<think>嗯,用户在使用Seurat的CellCycleScoring函数时遇到了“not enough features present”的错误,提示特征基因缺失。这个问题通常是因为用户的数据集里缺少了Seurat内置的细胞周期相关基因。首先,我需要确认错误的原因,可能是基因名称不匹配,或者过滤步骤去除了这些基因。 接下来,我应该检查用户是否使用了正确的基因命名方式,比如大小写或者符号差异。例如,人类基因通常是全大写,而小鼠是首字母大写。然后,建议用户检查基因过滤步骤,是否在预处理时过于严格,导致细胞周期基因被过滤掉。可能需要调整过滤参数,或者在运行函数时暂时关闭过滤。 另外,用户可能需要扩展自己的基因列表,结合多个数据源的细胞周期基因,或者使用自定义的基因集。还可以尝试降低CellCycleScoring函数中的参数,比如减少需要的基因数量,或者使用其他方法如cyclone进行细胞周期评分。最后,提醒用户验证结果,确保细胞周期阶段划分的准确性。总结起来,解决方案括检查基因名称、调整过滤策略、使用自定义基因集、修改函数参数,以及结果验证。</think>### CellCycleScoring 错误 "not enough features present" 解决方案 #### 1. 错误原因诊断 该错误发生在以下情况时: - 数据集中缺少超过80%的细胞周期特征基因(s.genes 和 g2m.genes)[^1] - 基因命名格式不一致(如大小写、符号后缀差异) - 预处理过滤步骤过度删除了特征基因 #### 2. 分步解决方案 **步骤1:检查基因名称格式** ```r # 查看Seurat内置基因列表格式 cc.genes <- Seurat::cc.genes head(cc.genes$s.genes) # 人类示例:CDK11A, CCNH head(cc.genes$g2m.genes) # 小鼠示例:Ccnb2, Ube2c # 匹配数据集的基因命名 rownames(seurat_obj)[grep("CDK11A", rownames(seurat_obj), ignore.case = T)] ``` **步骤2:调整基因过滤策略** ```r # 临时关闭基因过滤验证 CellCycleScoring(object = seurat_obj, g2m.features = cc.genes$g2m.genes, s.features = cc.genes$s.genes, set.ident = TRUE, strict = FALSE) # 关闭严格匹配 # 或在预处理时保留更多基因 seurat_obj <- CreateSeuratObject(counts, min.cells = 3, # 原为默认值5时可放宽 min.features = 200) ``` **步骤3:扩展特征基因列表** ```r # 从其他数据库补充基因 custom.s.genes <- union(cc.genes$s.genes, c("NEW_GENE1","NEW_GENE2")) custom.g2m.genes <- union(cc.genes$g2m.genes, c("NEW_GENE3","NEW_GENE4")) # 使用扩展后的基因列表 seurat_obj <- CellCycleScoring(seurat_obj, s.features = custom.s.genes, g2m.features = custom.g2m.genes) ``` **步骤4:参数调整** ```r # 降低匹配阈值(默认nbin=24) seurat_obj <- CellCycleScoring(seurat_obj, s.features = cc.genes$s.genes, g2m.features = cc.genes$g2m.genes, nbin = 10) # 降低分箱数量 ``` #### 3. 替代方案 当特征基因严重缺失时,可改用其他方法: ```r # 使用cyclone方法 library(scran) hs.pairs <- readRDS(system.file("exdata", "human_cycle_markers.rds", package="scran")) assignments <- cyclone(sce, hs.pairs) ``` #### 4. 结果验证 ```r # 检查相位分配结果 table(seurat_obj@meta.data$Phase) # 验证细胞周期效应 seurat_obj <- ScaleData(seurat_obj, vars.to.regress = c("S.Score", "G2M.Score")) seurat_obj <- RunPCA(seurat_obj) DimPlot(seurat_obj, reduction = "pca", group.by = "Phase") ```
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值