打分模型 | 基因表达

最新推荐文章于 2022-08-11 17:39:02 发布

weixin_33709219

最新推荐文章于 2022-08-11 17:39:02 发布

阅读量918

点赞数

根据基因表达来打分的线性模型有很多很多，但是如何评估模型的优劣呢？

cell cycle score

Removal of cell cycle effect.

The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primarily, depth) from downstream analyses. However, heterogeneity in cell cycle stage, particularly among mitotic cells transitioning between S and G2/M phases, also can drive substantial transcriptomic variation that can mask biological signal. To mitigate this effect, we use a two-step approach:

1) quantify cell cycle stage for each cell using supervised analyses with known stage-specific markers,

2) regress the effect of cell cycle stage using the same negative binomial regression as outlined above.

For the first step we use a previously published list of cell cycle dependent genes (43 S phase genes, 54 G2/M phase genes) for an enrichment analysis similar to that proposed in ref. 11.

For each cell, we compare the sum of phase-specific gene expression (log10 transformed UMIs) to the distribution of 100 random background genes sets, where the number of background genes is identical to the phase gene set, and the background genes are drawn from the same expression bins. Expression bins are defined by 50 non-overlapping windows of the same range based on log10(mean UMI). The phase-specific enrichment score is the expression z-score relative to the mean and standard deviation of the background gene sets. Our final ‘cell cycle score’ (Extended Data Fig. 1) is the difference between S-phase score and G2/M-phase score.

除了要normalize测序深度，还要考虑细胞周期的阶段（总共有四个细胞周期stage），尤其是S期和G2/M期。这里首先定量哪些已知的细胞周期的marker (43 S phase genes, 54 G2/M phase genes)，其次，比较总和/100个随机的背景基因；背景基因来自相同的bin（50 non-overlapping windows of the same range），最后计算了一个z-score。

note: A z-score for a sample indicates the number of standard deviations away from the mean of expression in the reference. The formula is : z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample.

For a final normalized dataset with cell cycle effect removed, we perform negative binomial regression with technical factors and cell cycle score as predictors. Although the cell cycle activity was regressed out of the data for downstream analysis, we stored the computed cell cycle score before regression, enabling us to remember the mitotic phase of each individual cell. Notably, our regression strategy is tailored to mitigate the effect of transcriptional heterogeneity within mitotic cells in different phases, and should not affect global differences between mitotic and non-mitotic cells that may be biologically relevant.

声称自己的normalization只去除了mitotic cells内部的差异。

# input S markers, and N=100, and gene bin (all genes are divided to 33 bins)
get.bg.lists <- function(goi, N, expr.bin) {
  res <- list()
  # check the marker local in which bin, and table the count
  goi.bin.tab <- table(expr.bin[goi])
  for (i in 1:N) {
  	# each time, random select 38 background genes and merge
    res[[i]] <- unlist(lapply(names(goi.bin.tab), function(b) {
      # find the genes in this bin & genes not in the markers
      sel <- which(expr.bin == as.numeric(b) & !(names(expr.bin) %in% goi))
      # random sample, count corespond to the number of marker
      sample(names(expr.bin)[sel], goi.bin.tab[b])
    }))
  }
  return(res)
}

# input log2 expr, S markers, 100 background genes
enr.score <- function(expr, goi, bg.lst) {
  # S markers mean of each cell
  goi.mean <- apply(expr[goi, ], 2, mean)
  # get background genes mean by 100 samples
  bg.mean <- sapply(1:length(bg.lst), function(i) apply(expr[bg.lst[[i]], ], 2, mean))
  # return z-score
  return((goi.mean - apply(bg.mean, 1, mean)) / apply(bg.mean, 1, sd))
}


get.cc.score <- function(cm, N=100, seed=42) {
  set.seed(seed)
  cat('get.cc.score, ')
  cat('number of random background gene sets set to', N, '\n')
  min.cells <- 5 # min detected gene number  
  cells.mols <- apply(cm, 2, sum) # total umi
  gene.cells <- apply(cm>0, 1, sum) # total detected gene number
  cm <- cm[gene.cells >= min.cells, ] # filter
  gene.mean <- apply(cm, 1, mean) # gene mean
  # divide all genes to 50 bins
  breaks <- unique(quantile(log10(gene.mean), probs = seq(0,1, length.out = 50)))
  gene.bin <- cut(log10(gene.mean), breaks = breaks, labels = FALSE) # table(gene.bin) to see count
  names(gene.bin) <- rownames(cm)
  gene.bin[is.na(gene.bin)] <- 0
  regev.s.genes <- read.table(file='./annotation/s_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
  regev.g2m.genes <- read.table(file='./annotation/g2m_genes.txt', header=FALSE, stringsAsFactors=FALSE)$V1
  # list for go, mouse to human
  goi.lst <- list('S'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.s.genes))],
                  'G2M'=rownames(cm)[!is.na(match(toupper(rownames(cm)), regev.g2m.genes))])
  
  n <- min(40, min(sapply(goi.lst, length))) # 38
  # sort the marker by their expression
  goi.lst <- lapply(goi.lst, function(x) x[order(gene.mean[x], decreasing = TRUE)[1:n]])
  # double list, background gene list
  bg.lst <- list('S'=get.bg.lists(goi.lst[['S']], N, gene.bin),
                 'G2M'=get.bg.lists(goi.lst[['G2M']], N, gene.bin))
  # merge all genes, markers  + background, sort by name
  all.genes <- sort(unique(c(unlist(goi.lst, use.names=FALSE), unlist(bg.lst, use.names=FALSE))))
  expr <- log10(cm[all.genes, ]+1)
  # 
  s.score <- enr.score(expr, goi.lst[['S']], bg.lst[['S']])
  g2m.score <- enr.score(expr, goi.lst[['G2M']], bg.lst[['G2M']])
  
  phase <- as.numeric(g2m.score > 2 & s.score <= 2)
  phase[g2m.score <= 2 & s.score > 2] <- -1
  
  return(data.frame(score=s.score-g2m.score, s.score, g2m.score, phase))
}

问题：

1. 这确实是一种打分模型，但是如何评估模型是否准确呢？

想构建一个打分模型，用于scRNA-seq的细胞类型打分，以及bulk RNA-seq的细胞组成成分预测。

其实有篇文章已经做过这种工作了，做了第一部分，第二部分还没有评估。

weixin_33709219

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
打分模型 | 基因表达

根据基因表达来打分的线性模型有很多很多，但是如何评估模型的优劣呢？ cell cycle scoreRemoval of cell cycle effect.The normalization method described above aims to reduce the effect of technical factors in scRNA-seq data (primaril...
复制链接

扫一扫