Single-cell data processing workflow: Monocle2

weixin_52505487

Last modified 2022-09-03 10:38:14

First published 2022-09-01 21:03:37

Original link: https://www.jianshu.com/p/5d6fd4561bc0

Monocle2

1. Install the package

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("monocle")
library(monocle)

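2. Prepare the data (from a Seurat object)

library(monocle)
# Monocle needs three inputs to build a CellDataSet (CDS):
# the expression matrix, phenotype (cell) data and feature (gene) data.
# Expression matrix: raw counts from the Seurat object, stored as a sparse matrix
expr_matrix <-  as(as.matrix(scRNAsub_nonMalignant@assays$RNA@counts),"sparseMatrix")
# Phenotype data (p_data): per-cell metadata; ideally this includes cluster or
# cell type annotations, experimental condition and similar information
p_data <- scRNAsub_nonMalignant@meta.data
# Feature data (f_data): per-gene annotation, e.g. biotype or GC content
f_data <- data.frame(gene_short_name = rownames(scRNAsub_nonMalignant),
                       row.names =  rownames(scRNAsub_nonMalignant))
## nrow(expr_matrix) must equal nrow(f_data) (number of genes; 28321 here)
## ncol(expr_matrix) must equal nrow(p_data) (number of cells; 6972 here)

# Convert p_data and f_data from data.frame to AnnotatedDataFrame
pd <- new("AnnotatedDataFrame",data = p_data)
fd <- new("AnnotatedDataFrame",data = f_data)

With the matrix, pd and fd in hand, the Monocle object can be built.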
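Before building the CellDataSet it can help to confirm that the three inputs line up. The following is a minimal sketch on toy data; toy_expr, toy_pd and toy_fd are hypothetical stand-ins for expr_matrix, p_data and f_data, not objects from the analysis above.

```r
# Toy stand-ins for expr_matrix, p_data and f_data (hypothetical data).
genes <- paste0("gene", 1:5)
cells <- paste0("cell", 1:3)
toy_expr <- matrix(0:14, nrow = 5, dimnames = list(genes, cells))
toy_pd <- data.frame(celltype = c("A", "B", "A"), row.names = cells)
toy_fd <- data.frame(gene_short_name = genes, row.names = genes)

# Rows of the expression matrix are genes and should match the feature data;
# columns are cells and should match the phenotype data, in the same order.
inputs_ok <- identical(rownames(toy_expr), rownames(toy_fd)) &&
  identical(colnames(toy_expr), rownames(toy_pd))
print(inputs_ok)
```

Running the same two identical() checks on the real expr_matrix, p_data and f_data is a cheap way to catch mismatched or reordered names early.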

3. Monocle

3.1 Create the Monocle object from Seurat data

cds <- newCellDataSet(expr_matrix,
                      phenoData = pd,
                      featureData = fd,
                      lowerDetectionLimit = 0.1,
                      expressionFamily = VGAM::negbinomial.size())

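FPKM/TPM values are usually log-normally distributed, whereas UMI or read counts are better modeled with a negative binomial distribution. To work with count data, pass a negative binomial distribution as the expressionFamily argument of newCellDataSet.

negbinomial.size() and negbinomial(): for UMI count matrices, typically 10X data. negbinomial() gives more accurate results but is much slower, so negbinomial.size() is generally recommended.
tobit(): for FPKM or TPM input; Monocle2 log-transforms the values automatically when the object is built.
gaussianff(): for FPKM or TPM that has already been log-transformed. (FPKM is now rarely used for single-cell data; Smart-seq2 data usually uses TPM.)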
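To illustrate why a negative binomial family suits count data, here is a small simulation sketch. The mean and size values are arbitrary toy choices, not estimates from any dataset.

```r
set.seed(1)
# Simulated UMI-like counts: negative binomial with mean 5 and size 0.5 (toy values).
counts <- rnbinom(10000, mu = 5, size = 0.5)

count_mean <- mean(counts)
count_var <- var(counts)

# A Poisson model would force the variance to equal the mean. For the negative
# binomial the theoretical variance is mu + mu^2 / size, so the simulated counts
# are strongly overdispersed.
print(c(mean = count_mean, variance = count_var))
```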

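3.2 Build the Monocle2 object directly from an expression matrix

library(data.table)
## Read the data
data <- fread("fpkm.txt",data.table = F, header = T)
pd <- fread("metadata.txt", data.table = F, header = T)
fd <- fread("gene_annotation.txt", data.table = F, header = T)
## Assumption: the first column of each file holds the gene or cell IDs.
## Move them into the row names so the three tables can be matched.
rownames(data) <- data[[1]]; data <- data[, -1]
rownames(pd) <- pd[[1]]
rownames(fd) <- fd[[1]]
# Create the CDS
pd <- new("AnnotatedDataFrame", data = pd)
fd <- new("AnnotatedDataFrame", data = fd)
cds <- newCellDataSet(as.matrix(data),phenoData = pd, featureData = fd, expressionFamily = tobit())
# For large datasets, converting to a sparse matrix is recommended
cds <- newCellDataSet(as(as.matrix(data),"sparseMatrix"), phenoData = pd, featureData = fd, expressionFamily = tobit())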

3.3 Convert a Seurat object directly into a CellDataSet

cds <- importCDS(scRNAsub_nonMalignant)

4. Estimate size factors and dispersions

cds <- estimateSizeFactors(cds)
cds <- estimateDispersions(cds)

Running cds <- estimateDispersions(cds) raised an error:

cds <- estimateDispersions(object = cds)
Error: (converted from warning) `select_()` was deprecated in dplyr 0.7.0.
Please use `select()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Fix

# Reinstall dplyr to get the current version, then reload it
install.packages("dplyr")
library(dplyr)
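The "(converted from warning)" prefix in the message above indicates that warnings were being promoted to errors in that session, which is what options(warn = 2) does. The sketch below reproduces this behavior with a stand-in warning; it does not call dplyr or Monocle.

```r
# Promote warnings to errors, as options(warn = 2) does; keep the old setting.
old_opts <- options(warn = 2)

result <- tryCatch({
  warning("`select_()` was deprecated")  # stand-in for the dplyr deprecation warning
  "finished"
}, error = function(e) conditionMessage(e))

# Restore the previous warning level.
options(old_opts)

print(result)  # an error message rather than "finished"
```

If getOption("warn") returns 2 or more in the session, setting options(warn = 0) before calling estimateDispersions is another thing to try.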

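Unlike Seurat, which stores the normalized expression matrix in the object, Monocle stores only intermediate results and derives values from them when needed. After estimateSizeFactors, estimateDispersions and detectGenes (run in the next section), cds carries additional information: size factors, dispersions, num_cells_expressed and num_genes_expressed.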

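5. Filter low-quality cells

Most single-cell workflows include at least some libraries made from dead cells or empty wells. It is equally important to remove doublets. Such cells can disrupt downstream steps such as pseudotime ordering or clustering.
Doublets: libraries accidentally generated from two or more cells.

cds <- detectGenes(cds, min_expr = 0.1)
# This adds a num_cells_expressed column to fData(cds)
print(head(fData(cds)))
# 28321 genes at this point
expressed_genes <- row.names(subset(fData(cds),num_cells_expressed >= 10))
## Keep genes expressed in at least 10 cells; 17453 genes remain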
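The gene filter above can be pictured as counting, for each gene, the cells whose expression exceeds the detection threshold. The following base R sketch mimics that on a toy matrix; the values are hypothetical, and treating the threshold as a strict "greater than" is an assumption about detectGenes.

```r
# Toy count matrix: 4 genes x 6 cells (hypothetical values).
toy_counts <- matrix(
  c(0, 0, 1, 0, 0, 0,
    2, 3, 0, 1, 4, 2,
    0, 0, 0, 0, 0, 0,
    1, 0, 2, 0, 3, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)
min_expr <- 0.1

# Number of cells in which each gene exceeds the detection threshold.
num_cells_expressed <- rowSums(toy_counts > min_expr)

# Keep genes detected in at least 3 cells (the text uses 10 on real data).
kept_genes <- names(num_cells_expressed)[num_cells_expressed >= 3]
print(kept_genes)
```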

6. Cell classification

The Monocle website describes four approaches to classifying cells:
① classifying cells by type
② clustering cells without marker genes
③ clustering cells using marker genes
④ imputing cell type

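7. Selecting trajectory-defining genes

Step 1: choose the genes that define the process
Step 2: reduce dimensionality
Step 3: build the pseudotime trajectory and order cells along it

7.1 Step 1: choose the genes that define the process

# Step 1: choose the genes that define the process
# The Monocle tutorial offers four ways to select them:
# genes differentially expressed over development
# genes differentially expressed between clusters
# genes with high dispersion
# user-defined developmental marker genes

A note from the literature on gene selection in Monocle:
A monocle function, DifferentialGeneTest, was used to detect genes with differential expression between clusters, and the top 2,000 with a q-value < 0.01 were selected to construct the cluster-based trajectories.
This suggests the authors selected genes differentially expressed between clusters.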

# Use the highly variable genes selected by Seurat
express_genes <- VariableFeatures(scRNAsub_nonMalignant)
cds <- setOrderingFilter(cds, express_genes)
plot_ordering_genes(cds)

## Use genes differentially expressed between clusters
deg.cluster <- FindAllMarkers(scRNAsub_nonMalignant, logfc.threshold = 0)
express_genes <- subset(deg.cluster, p_val_adj < 0.01)$gene
cds <- setOrderingFilter(cds, express_genes)
plot_ordering_genes(cds)

# Use Monocle to select highly dispersed genes
disp_table <- dispersionTable(cds)
disp.genes <- subset(disp_table, mean_expression >= 0.1 & dispersion_empirical >= 1 * dispersion_fit)$gene_id
cds <- setOrderingFilter(cds, disp.genes)
plot_ordering_genes(cds)

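The following analysis uses the cluster differential genes, as in the literature quoted above.

diff <- differentialGeneTest(cds[express_genes,],fullModelFormulaStr = "~celltype",cores = 1)
# The term after ~ is the variable to test against; in principle it can be any column of p_data

## Use the differentially expressed genes to build the trajectory; the cutoff is qval < 0.01.
## decreasing = F sorts in ascending order of qval.
deg <- subset(diff, qval < 0.01) # the cutoff controls how many genes are kept
deg <- deg[order(deg$qval,decreasing = F),]
head(deg)

# Option 1: use all significant genes
ordergene <- rownames(deg)
# Option 2: if there are too many genes, keep the top 2000 by qval
# (head() avoids NA entries when fewer than 2000 genes pass the cutoff)
ordergene <- head(row.names(deg)[order(deg$qval)], 2000)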
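Selecting the top genes by index can pad the result with NA when fewer genes pass the cutoff than requested. A small sketch with a hypothetical five-gene table shows the difference between indexing and head():

```r
# Toy result table with only 5 genes (hypothetical q-values).
toy_deg <- data.frame(qval = c(0.03, 0.001, 0.2, 0.0005, 0.01),
                      row.names = paste0("gene", 1:5))
ranked <- row.names(toy_deg)[order(toy_deg$qval)]

by_index <- ranked[1:8]      # padded with NA because only 5 genes exist
by_head  <- head(ranked, 8)  # returns at most the genes available

print(by_index)
print(by_head)
```

NA entries in the ordering gene list are worth removing before the list is passed to later steps.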

Save the differential gene results

write.table(deg, file = "train.monocle.DEG.xls", col.names = T, row.names = F, sep = "\t", quote = F)

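After selecting the trajectory genes, visualize them

cds <- setOrderingFilter(cds, ordergene)
# This step matters: once the desired gene list is ready, setOrderingFilter records it in the cds object, and all later steps depend on this list
# After setOrderingFilter, the flags are stored in cds@featureData@data[["use_for_ordering"]] and can be inspected with
# table(cds@featureData@data[["use_for_ordering"]])
pdf("train.ordergenes.pdf")
plot_ordering_genes(cds)
dev.off()

[Figure: output of plot_ordering_genes(cds)]
In the figure, black points are the differential genes used to build the trajectory and gray points are background genes. The red line is the dispersion fit computed in section 4 (size factors and dispersions). The selected genes are among those with relatively high dispersion.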

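7.2 Step 2: dimensionality reduction

Once cells are ordered, the trajectory can be visualized in the reduced-dimension space. So first choose the genes used for ordering, then reduce the data with the reversed graph embedding algorithm (DDRTree).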

cds <- reduceDimension(cds, max_components = 2, method = "DDRTree")

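7.3 Step 3: build the pseudotime trajectory and order cells along it

The expression data is projected into a lower-dimensional space, and machine learning is used to learn a trajectory describing how cells move from one state to another. The trajectory is assumed to be tree-shaped, with a "root" at one end and "leaves" at the other. At the start of the biological process, cells begin at the root and travel along the trunk until they reach the first branch, where each cell must choose a path; it then continues along the tree until it reaches a leaf. A cell's pseudotime value is its distance back to the root.
Cells are ordered, and the trajectory completed, according to the expression trends of the ordering genes.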
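The idea that pseudotime is a distance back to the root can be illustrated on a toy tree. This is only a conceptual sketch in base R: the edge list, node names and the tree_distance helper are hypothetical and are not how Monocle stores or computes its trajectory.

```r
# Toy tree as an edge list with edge lengths (hypothetical; not Monocle's internals).
edges <- data.frame(
  from   = c("root", "a", "b", "b"),
  to     = c("a", "b", "leaf1", "leaf2"),
  length = c(1, 2, 1.5, 3),
  stringsAsFactors = FALSE
)

# Walk outward from the root, accumulating path length to each node.
tree_distance <- function(edges, root) {
  d <- setNames(0, root)
  queue <- root
  while (length(queue) > 0) {
    node <- queue[1]
    queue <- queue[-1]
    for (i in which(edges$from == node | edges$to == node)) {
      nb <- if (edges$from[i] == node) edges$to[i] else edges$from[i]
      if (!nb %in% names(d)) {
        d[nb] <- d[node] + edges$length[i]
        queue <- c(queue, nb)
      }
    }
  }
  d
}

pseudotime <- tree_distance(edges, "root")
print(pseudotime)
```

The two leaves share the path through "b" and then diverge, which mirrors how cells on different branches can have comparable pseudotime but different States.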

cds <- orderCells(cds)
# If the root is already known, set it with the root_state argument.
cds <- orderCells(cds, root_state = 5)

Error

Error in if (class(projection) != "matrix") projection <- as.matrix(projection) : 
  the condition has length > 1
In addition: Warning message:
In graph.dfs(dp_mst, root = root_cell, neimode = "all", unreachable = FALSE,  :
  Argument `neimode' is deprecated; use `mode' instead

Cause
The expression matrix was reportedly built incorrectly. The code was:

cds <- newCellDataSet(expr_matrix,
                      phenoData = pd,
                      featureData = fd,
                      lowerDetectionLimit = 0.1,
                      expressionFamily = VGAM::negbinomial.size())

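Fix (not tried)
Reportedly, changing matrix<-as.sparse(scRNAsub_nonMalignant@assays$RNA@counts) to matrix<-as.sparse(scRNAsub_nonMalignant@assays$RNA@counts) resolves it (the two expressions are identical as written here, so the intended change cannot be recovered from this text). The cause is said to be changes in function behavior between R versions.
Monocle version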

packageVersion('monocle')
[1] ‘2.24.1’

Fix from GitHub (worked)
1. Download monocle from https://www.bioconductor.org/packages/3.15/bioc/src/contrib/Archive/monocle/
2. Decompress the package 'monocle_2.24.0.tar.gz' to get a 'monocle' folder.
3. Open the folder 'monocle/R' and edit the function 'project2MST()' in the 'order_cell.R' file, changing

if(class(projection) != 'matrix') projection <- as.matrix(projection)

to

projection <- as.matrix(projection)

4. Save the 'order_cell.R' file.
5. Copy the 'monocle' folder to the location where packages are usually installed.
6. In RStudio, load the 'monocle' folder with:

devtools::load_all("/usr/local/lib/R/site-library/monocle")

7. Done; orderCells(cds) now runs.
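The reason a test of the form class(x) != "matrix" fails can be shown in base R. From R 4.0 a matrix also inherits from "array", so class() returns two values, and from R 4.2 an if condition longer than one is an error. The sketch below uses is.matrix() as a length-one alternative; it is an illustration, not the patched Monocle source.

```r
m <- matrix(1:4, nrow = 2)

# In R >= 4.0 this prints "matrix" "array", so `class(m) != "matrix"`
# is a length-2 logical that `if` rejects in R >= 4.2.
print(class(m))

# is.matrix() always returns a single TRUE or FALSE.
is_mat <- is.matrix(m)

# Convert only when the input is not already a matrix.
df <- data.frame(x = 1:2, y = 3:4)
projection <- if (!is.matrix(df)) as.matrix(df) else df
print(dim(projection))
```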

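7.3.1 Visualization

Cells can be colored by phenotype information:

1. Pseudotime
2. Cell type
3. State
4. Seurat clusters
5. Expression of selected genes
6. Pseudotime-dependent genes (genes that vary along pseudotime)
7. Branch analysis of the single-cell trajectory

plot_cell_trajectory(cds, color_by = "Pseudotime")

Extract cells of interest for downstream analysis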

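# Cells from the ABC and GCB subgroups are of interest
# (%>% requires dplyr or magrittr to be loaded)
pdata <- Biobase::pData(cds)
ABC.cells <- subset(pdata, COO == "ABC") %>% rownames()
GCB.cells <- subset(pdata, COO == "GCB") %>% rownames()
# Written with save(), so read it back with load() rather than readRDS()
save(ABC.cells, GCB.cells, file = "Moncle_ABC.rds")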

③ Color cells by State

colour=c("#DC143C","#0000FF","#20B2AA","#FFA500","#9370DB","#98FB98","#F08080","#1E90FF","#7CFC00","#FFFF00",  
            "#808000","#FF00FF","#FA8072","#7B68EE","#9400D3","#800080","#A0522D","#D2B48C","#D2691E","#87CEEB","#40E0D0","#5F9EA0",
            "#FF1493","#0000CD","#008B8B","#FFE4B5","#8A2BE2","#228B22","#E9967A","#4682B4","#32CD32","#F0E68C","#FFFFE0","#EE82EE",
            "#FF6347","#6A5ACD","#9932CC","#8B008B","#8B4513","#DEB887")
# Color by cell State
pdf(file = "train.monocle.State.pdf",height = 4,width = 4)
plot_cell_trajectory(cds, color_by = "State",cell_size = 0.5) + 
  theme_bw(base_rect_size = 1.5)+
  theme(panel.grid.major=element_blank(),
        panel.grid.minor=element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "none",
        axis.title = element_blank())+
  scale_color_manual(values = colour)
dev.off()

# Color by cell type
# Draw a contour plot
# The contour plot needs to be combined with an RNA velocity calculation

⑥ Pseudotime-dependent genes

# Run the regression on the ordering genes (ordergene) to test whether their expression changes significantly with pseudotime
# Without the subset, every gene would be tested against pseudotime
Time_diff <- differentialGeneTest(cds[ordergene,],fullModelFormulaStr = "~sm.ns(Pseudotime)")
Time_diff <- Time_diff[,c(5,2,3,4,1,6,7)] # move the gene column to the front; optional
write.csv(Time_diff, file = "Time_diff.csv",row.names = F)
Time_genes <- Time_diff %>% pull(gene_short_name) %>% as.character()
p <- plot_pseudotime_heatmap(cds[Time_genes,], num_clusters = 4, show_rownames = T, return_heatmap = T)
# num_clusters = 4 groups the heatmap rows into four clusters
ggsave("Time_heatmapAll.pdf", p, width = 5, height = 10)

⑥¹ Extract the genes in each cluster for separate analysis

p$tree_row
Call:
hclust(d = d, method = method)
Cluster method : ward.D2 
Number of objects: 2200

clusters <- cutree(p$tree_row, k = 3) # k can be changed
clustering <- data.frame(clusters)
clustering[,1] <- as.character(clustering[,1])
colnames(clustering) <- "Gene_Clusters"
table(clustering)
# Keep the row names: they hold the gene names
write.csv(clustering, "Time_clustering_all.csv",row.names = T)

Once the genes in each cluster have been extracted, they can be used for GO and KEGG enrichment analysis.
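As a sketch of how per-cluster gene lists can be pulled out of a hierarchical clustering, the example below runs hclust and cutree on a toy matrix and splits the gene names by cluster. The matrix is simulated and the object names are hypothetical; on real data the tree would come from p$tree_row.

```r
set.seed(42)
# Toy smoothed-expression matrix: 6 genes x 10 pseudotime bins (hypothetical),
# built as two well-separated groups of three genes.
toy <- rbind(
  matrix(rnorm(30, mean = 0), nrow = 3),
  matrix(rnorm(30, mean = 10), nrow = 3)
)
rownames(toy) <- paste0("gene", 1:6)

tree <- hclust(dist(toy), method = "ward.D2")
clusters <- cutree(tree, k = 2)

# One character vector of gene names per cluster, ready for enrichment tools.
genes_by_cluster <- split(names(clusters), clusters)
print(genes_by_cluster)
```

Each element of genes_by_cluster can then be passed to whichever GO or KEGG enrichment tool is in use.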

⑥² Take the top 100 pseudotime-dependent genes

Time_genes <- top_n(Time_diff, n = 100, desc(qval)) %>% pull(gene_short_name) %>% as.character()
p <- plot_pseudotime_heatmap(cds[Time_genes,], num_clusters = 4, show_rownames = T, return_heatmap = T)
ggsave("Time_heatmapTop100.pdf", p, width = 5, height = 10)

⑥³ Order the significant genes as in the heatmap and save them

hp.genes <- p$tree_row$labels[p$tree_row$order]
Time_diff_sig <- Time_diff[hp.genes, c("gene_short_name", "pval", "qval")]
write.csv(Time_diff_sig, "Time_diff_sig.csv",row.names = F)

⑥⁴ Genes can also be chosen manually for the heatmap

marker_genes <- row.names(subset(fData(cds),gene_short_name %in% c("your_marker_genes")))
diff_test_res <- differentialGeneTest(cds[marker_genes,], fullModelFormulaStr = "~sm.ns(Pseudotime)")
sig_gene_names <- row.names(subset(diff_test_res, qval < 0.1))
plot_pseudotime_heatmap(cds[sig_gene_names,], num_clusters = 6, show_rownames = T, return_heatmap = T)

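⑦ Branch analysis of the single-cell trajectory

The search above is global: it finds genes associated with pseudotime from the start to the end of the trajectory. This step instead finds genes associated with a branch point.
Monocle provides a dedicated statistical test for this: branched expression analysis modeling (BEAM).

BEAM_res <- BEAM(cds[ordergene,], branch_point = 1, cores = 2)
# ordergene is the set of differential genes selected earlier to build the trajectory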
BEAM_res <- BEAM_res[order(BEAM_res$qval),]
BEAM_res <- BEAM_res[,c("gene_short_name", "pval", "qval")]
head(BEAM_res)
write.csv(BEAM_res, "BEAM_resg.csv",row.names = F)
plot_genes_branched_heatmap(cds[row.names(subset(BEAM_res, qval < 1e-4)),], branch_point = 1, num_clusters = 4, cores = 1, use_gene_short_name = T, show_rownames = T)

⑦¹ Visualize the top 100 genes

BEAM_genes <- top_n(BEAM_res, n = 100, desc(qval)) %>% pull(gene_short_name) %>% as.character()
p <- plot_genes_branched_heatmap(cds[BEAM_genes,], branch_point = 1, num_clusters = 3, show_rownames = T, return_heatmap = T)
ggsave("BEAM_heatmapTop100.pdf", p$ph_res, width = 5, height = 10)

⑦² Order the significant genes as in the heatmap and save them

hp.genes <- p$ph_res$tree_row$labels[p$ph_res$tree_row$order]
BEAM_diff_sig <- BEAM_res[hp.genes, c("gene_short_name", "pval", "qval")]
write.csv(BEAM_diff_sig, "BEAM_diff_sig.csv",row.names = F)

⑦⁴ Genes can also be chosen manually for plotting

BEAM_genes <- row.names(subset(fData(cds),gene_short_name %in% c("your_marker_genes")))
plot_genes_branched_pseudotime(cds[BEAM_genes,], branch_point = 1, color_by = "State", ncol = 1)