Removing batch effects across multiple 10X single-cell (10X spatial transcriptomics) samples with RCA2

Latest recommended article published on 2024-08-08 20:24:08

追风少年ii

Tags: machine learning, artificial intelligence, spatial transcriptomics, batch correction

Article link: https://blog.csdn.net/weixin_53637133/article/details/138409977

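Hello everyone. The National Day holiday is over and we are back at work; I hope you all had a good break. Today I want to share a good new method for integrating multiple samples and removing batch effects: RCA2. The paper is "RCA2: a scalable supervised clustering algorithm that reduces batch effects in scRNA-seq data", published in Nucleic Acids Research in September 2021 (impact factor of 17). We will skim the paper and then walk through the example code.

Batch effects between samples have been discussed many times, but removing them well is not easy. Some batch-effect removal methods are listed below for anyone interested.

First, the key points of the paper.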

ABSTRACT

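1. The authors demonstrate, using multiple benchmarks, that supervised clustering, which uses reference transcriptomes as a guide, is robust to batch effects and data quality artifacts. (Until now we have mostly used unsupervised clustering, whose quality is hard to judge.)

2. RCA2 is the first algorithm to combine reference projection (batch effect robustness) with graph-based clustering (scalability). It also includes many downstream analysis modules. RCA2 also provides new reference panels for human and mouse and supports generation of custom panels. (These panels deserve attention.)

3. RCA2 facilitates cell type-specific QC, which is essential for accurate clustering of data from heterogeneous tissues. (I have stressed quality control many times: it is the foundation, and if the foundation is shaky, no amount of downstream analysis will be right.)

4. Scalable supervised clustering methods such as RCA2 will facilitate unified analysis of cohort-scale SC datasets. (This is also the goal that single-cell data analysis is working toward.)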

INTRODUCTION

There are two approaches to single-cell clustering: (i) unsupervised (de novo) clustering, which is the most prevalent (with the Louvain graph-based clustering algorithm being the most widely used), and (ii) supervised clustering, which exploits a panel of reference transcriptomes.

Single-cell clustering methods face several challenges: (i) cells may cluster by technical variation and batch effects rather than biological properties, (ii) scRNA-seq data tend to be noisy, primarily due to sampling noise and (iii) the gene expression matrix can be very large, since modern datasets commonly include > 100,000 cells. (I am sure everyone has run into these problems.) As a result, different clustering methods often lead to different results. In addition, de novo clustering requires an error-prone, time-consuming manual step in which cell clusters are assigned to cell types (annotation) based on a subjective assessment of marker gene expression. Supervised clustering and supervised cell type annotation algorithms have been developed to address these limitations.

Unlike the above-mentioned methods, RCA was not primarily designed for cell type annotation. Rather, the objective of RCA is to cluster single cells in the space of reference transcriptome projections. This is fundamentally different from unsupervised clustering approaches, which cluster cells in the space defined by over a thousand feature genes. (You can look up the earlier RCA version online; I have used it before and the results were mediocre.) RCA is the only supervised clustering algorithm for scRNA-seq data. However, the original version of RCA could not scale to datasets larger than 20,000 cells on a high-end laptop, used only a single reference panel, did not implement methods to identify differential gene expression, did not offer KEGG and Gene Ontology (GO) enrichment analysis, was benchmarked on only a single Smart-seq dataset and could not be easily integrated into existing data analysis workflows. (So the improved RCA2 looks like a complete pipeline for single-cell data analysis.)

MATERIALS AND METHODS

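This part requires a fair amount of mathematics.

Projection to a reference: this step preprocesses the reference and query data.

Clustering and interpreting the projection

Reference panels: several built-in reference datasets are provided.

The paper also compares RCA2 with other clustering and batch-correction methods, including SEURAT, SEURAT INTEGRATION, SCTRANSFORM, SCTRANSFORM INTEGRATION, SCRAN (30), SCANPY, MNNCORRECT (31) and SCANORAMA; we will see how the results turn out below.

Silhouette Index for quantifying batch effect (this is used to measure how well the batch effect has been removed). The paper relies on the Silhouette Index (SI), which is worth looking up.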

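RESULTS

The RCA2 workflow is essentially a complete single-cell analysis pipeline.

Differences between RCA2 and the earlier RCA version

Comparison of multiple methods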

Note: a robust method should have low batch SI and high cell type SI, indicating that cells are separated by cell type rather than by batch.
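To make the note above concrete, here is a minimal pure-Python sketch of the silhouette index computed twice on the same embedding, once with batch labels and once with cell type labels. The toy coordinates, the labels and the function name `mean_silhouette` are illustrative assumptions, not taken from the paper or from RCA2, and the paper's exact SI computation (distance metric, embedding) may differ.

```python
import math

def mean_silhouette(points, labels):
    """Average silhouette width over all points for the given labelling."""
    n = len(points)
    scores = []
    for i in range(n):
        # Group distances from point i to every other point by that point's label.
        by_label = {}
        for j in range(n):
            if i != j:
                by_label.setdefault(labels[j], []).append(math.dist(points[i], points[j]))
        own = by_label.pop(labels[i], [])
        if not own or not by_label:
            scores.append(0.0)  # singleton group or a single label: score defined as 0
            continue
        a = sum(own) / len(own)                                # mean distance within own group
        b = min(sum(d) / len(d) for d in by_label.values())    # mean distance to nearest other group
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / n

# Toy embedding (assumption): two cell types far apart, two batches mixed within each type.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
cell_type = ["T", "T", "B", "B"]
batch = ["b1", "b2", "b1", "b2"]

cell_type_si = mean_silhouette(points, cell_type)
batch_si = mean_silhouette(points, batch)
print(round(cell_type_si, 3), round(batch_si, 3))
```

On this toy data the cell type SI is high and the batch SI is negative, which is the pattern the note describes for a robust method.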

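Heatmap of differentially expressed genes after RCA2 clustering; the clusters are clearly separated.

The paper also tests several other datasets, which we will not go through one by one. One point is especially interesting, as the figure below shows, and it comes up again in the code walkthrough.

Now, let's look at the code.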

library(RCAv2)
PBMCs<-RCAv2::createRCAObjectFrom10X("10xPBMCs/") ## using the official example data

Perform basic QC steps and data normalization

PBMCs<-RCAv2::dataFilter(PBMCs,
                  nGene.thresholds = c(300,5000), 
                  nUMI.thresholds = c(400,30000),
                  percent.mito.thresholds = c(0.025,0.2),
                  min.cell.exp = 3,
                  plot=T,
                  filename = "PBMCs_filter_example.pdf")

PBMCs<-RCAv2::dataLogNormalise(PBMCs)
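The thresholds passed to dataFilter above can be read as three per-cell range checks. The following Python sketch illustrates that reading only: the field names (`n_genes`, `n_umi`, `pct_mito`) and the toy cells are hypothetical, and strict inequalities are an assumption chosen to mirror the Seurat `subset` call later in this post, since how dataFilter treats the boundaries is not stated here. The gene-level `min.cell.exp` filter is not covered.

```python
def passes_qc(cell, n_gene=(300, 5000), n_umi=(400, 30000), pct_mito=(0.025, 0.2)):
    """Return True if a cell's QC metrics fall strictly inside all three ranges."""
    return (n_gene[0] < cell["n_genes"] < n_gene[1]
            and n_umi[0] < cell["n_umi"] < n_umi[1]
            and pct_mito[0] < cell["pct_mito"] < pct_mito[1])

# Hypothetical per-cell metrics; pct_mito is a fraction (0.05 means 5%), as in the call above.
cells = [
    {"id": "cell_1", "n_genes": 1200, "n_umi": 4000, "pct_mito": 0.05},   # passes
    {"id": "cell_2", "n_genes": 150, "n_umi": 380, "pct_mito": 0.04},     # too few genes and UMIs
    {"id": "cell_3", "n_genes": 2500, "n_umi": 9000, "pct_mito": 0.35},   # high mitochondrial fraction
    {"id": "cell_4", "n_genes": 6200, "n_umi": 45000, "pct_mito": 0.06},  # above the upper bounds
]

kept = [c["id"] for c in cells if passes_qc(c)]
print(kept)
```

Note that the lower mitochondrial bound also removes cells with almost no mitochondrial signal, so it is worth checking the filter plot before fixing the thresholds for a new tissue.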

Compute a projection to a reference data set (pay attention to the parameters):

PBMCs<-RCAv2::dataProject(PBMCs,
                     method = "GlobalPanel",
                     corMeth = "pearson")
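As a rough intuition for what a reference projection with corMeth = "pearson" produces, the sketch below correlates each cell's expression vector with each reference profile over a shared gene list. It is a simplified illustration in Python, not RCAv2's dataProject: the real function involves further steps (such as gene selection and scaling) that are not reproduced here, and the gene order, panel values and function names are all made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant vector: correlation undefined, report 0 in this sketch
    return cov / (sx * sy)

def project(cells, panel):
    """Map every cell to its correlation with every reference profile."""
    return {cell_id: {ref: pearson(expr, profile) for ref, profile in panel.items()}
            for cell_id, expr in cells.items()}

# Hypothetical log-expression values over the same four genes, in the same order.
panel = {"T cell": [5, 0, 1, 0], "B cell": [0, 5, 0, 1]}
cells = {"cell_1": [4, 1, 1, 0], "cell_2": [0, 3, 0, 2]}

projection = project(cells, panel)
best = {cell_id: max(scores, key=scores.get) for cell_id, scores in projection.items()}
print(best)
```

Each cell is now described by a short vector of similarities to known cell types rather than by thousands of genes, which is the space in which the clustering below operates.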

In addition to the GlobalPanel, RCA now provides 12 reference panels (you can also supply your own):

GlobalPanel from the original RCA (Li et al., 2017) containing both primary cell types and tissues. Can be limited to only cell types with "GlobalPanel_CellTypes".
ColonEpiPanel: 9 colon epithelial samples from Li et al. (Nature genetics, 2017).
MonacoPanel: 29 PBMC cell types from Monaco, G., et al. (Cell reports, 2019).
MonacoBCellPanel: 5 B cell sub-types from Monaco, G., et al. (Cell reports, 2019).
MonacoMonoPanel: 5 Monocyte sub-types from Monaco, G., et al. (Cell reports, 2019).
MonacoTCellPanel: 15 T cell sub-types from Monaco, G., et al. (Cell reports, 2019).
CITESeqPanel based on Seurat 4.0 containing 34 cell types.
ENCODEHumanPanel: 93 human cell types from ENCODE.
NovershternPanel: 15 PBMC cell types from Novershtern et al. (Cell, 2011).
NovershternTCellPanel: 6 T cell sub-types from Novershtern et al. (Cell, 2011).
ENCODEMousePanel: 15 mouse cell types from ENCODE.
ZhangMouseBrainPanel: 7 mouse brain cell types from Tasic et al. (Nature neuroscience, 2016).

To use the custom panel MyPanel.RDS use the following command:

PBMCs<-RCAv2::dataProject(PBMCs,
                     method = "Custom",
                     customPath = "MyPanel.RDS",
                     corMeth = "pearson")

To benefit from multiple panels at the same time, users can exploit the dataProjectMultiPanel function:

PBMCs<-RCAv2::dataProjectMultiPanel(PBMCs,method=list("NovershternPanel", 
"MonacoPanel", "GlobalPanel_CellTypes"),corMeth="pearson")

Cluster the projection and visualize it

PBMCs<-RCAv2::dataClust(PBMCs)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Heatmap_PBMCs.pdf",var.thrs=1)

PBMCs<-computeUMAP(PBMCs)
RCAv2::plotRCAUMAP(PBMCs,filename = "UMAP_PBMCs.pdf")

PBMCs<-computeUMAP(PBMCs, nDIMS = 3)
RCAv2::plotRCAUMAP3D(PBMCs,filename = "UMAP3D_PBMCs.html")

#Estimate the most probable cell type label for each cell
PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)
#Generate the cluster composition plot
RCAv2::plotRCAClusterComposition(PBMCs,filename="Cluster_Composition.pdf")

Based on the heatmap as well as the stacked bar plots we can relabel the clusters according to the major cell type annotations:

RCAcellTypes<-PBMCs$clustering.out$dynamicColorsList[[1]]
RCAcellTypes[which(RCAcellTypes=="blue")]<-"Monocytes"
RCAcellTypes[which(RCAcellTypes=="green")]<-"Dendritic cells"
RCAcellTypes[which(RCAcellTypes=="yellow")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="grey")]<-"B cells"
RCAcellTypes[which(RCAcellTypes=="brown")]<-"NK cells"
RCAcellTypes[which(RCAcellTypes=="turquoise")]<-"T cells"
RCAcellTypes[which(RCAcellTypes=="red")]<-"Myeloid cells"
RCAcellTypes[which(RCAcellTypes=="black")]<-"Progenitor cells"

Example

CD56Exp<-PBMCs$data[which(rownames(PBMCs$data)=="NCAM1"),]
RCAv2::plotRCAUMAP(PBMCs,cellPropertyList = list(CellTypes=RCAcellTypes,CD56=CD56Exp),filename = "UMAP_PBMCs.pdf")

To obtain cluster based cell type predictions, the user can run the function estimateCellTypeFromProjectionPerCluster:

PBMCs<-RCAv2::createRCAObjectFrom10X("../Documents/10xExample/")
PBMCs<-RCAv2::dataFilter(PBMCs,nGene.thresholds = c(300,4500),
                  percent.mito.thresholds = c(0.025,0.1),
                  min.cell.exp = 3)
PBMCs<-RCAv2::dataLogNormalise(PBMCs)
PBMCs<-RCAv2::dataProject(PBMCs,
                          method = "NovershternPanel",
                          corMeth = "pearson",
                          nPCs = 0,
                          approx = FALSE)
PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.15)
PBMCs<-estimateCellTypeFromProjectionPerCluster(PBMCs)

Graph based clustering as an alternative to hierarchical clustering

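As single-cell datasets keep growing, hierarchical clustering requires machines with large amounts of memory. To overcome this potential limitation, RCAv2 also offers graph-based clustering using a shared nearest neighbour (SNN) approach.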

PBMCs<-RCAv2::dataSNN(PBMCs,k=100,eps=25,minPts=30,dist.fun="All",corMeth="pearson")

This function has three main parameters:

* k: the number of neighbours considered per cell,
* eps: the minimum number of shared neighbours between two cells,
* minPts: the minimum number of points that share eps neighbours such that a point is considered a core point.
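The interplay of the three parameters can be sketched in a few lines of Python. This is a toy reading of shared-nearest-neighbour clustering, not the code behind dataSNN: the requirement that two cells appear in each other's neighbour lists before their shared neighbours count is an assumption of this sketch, and the function names and coordinates are made up.

```python
import math

def knn(points, k):
    """Set of indices of the k nearest neighbours of each point (excluding itself)."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours

def core_points(points, k, eps, min_pts):
    """Indices of points with at least min_pts strong links.

    A strong link (assumption of this sketch) joins two points that are in each
    other's k-nearest-neighbour lists and share at least eps neighbours.
    """
    nn = knn(points, k)
    cores = []
    for i in range(len(points)):
        strong = sum(1 for j in nn[i]
                     if i in nn[j] and len(nn[i] & nn[j]) >= eps)
        if strong >= min_pts:
            cores.append(i)
    return cores

# Toy projection coordinates (assumption): four nearby cells and one distant outlier.
toy = [(0.0, 0.0), (1.0, 0.0), (2.1, 0.0), (3.3, 0.0), (20.0, 0.0)]
cores = core_points(toy, k=2, eps=1, min_pts=1)
print(cores)
```

Raising eps or minPts makes it harder for a point to qualify as a core point, while a larger k widens each neighbour list and tends to merge nearby groups.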

The dist.fun parameter controls whether the distance matrix used for SNN clustering is based on the full correlation distance matrix or, if dist.fun is set to PCA, on a PC reduction of the reference projection. The corMeth parameter sets which correlation function is used for the distance computation (pearson (default), spearman or kendall). As for the hierarchical clustering, heatmaps and UMAPs can be generated as well.

To help the user choose the parameters for clustering, we provide a parameter space exploration feature leading to a 3D UMAP illustrating the number of clusters depending on the three parameters, as shown below.

The figure above can be generated with the following code:

parameterSpaceSNN(PBMCs,kL=c(30:50),epsL=c(5:20),minPtsL=c(5:10),folderpath=".",filename="Graph_based_Clustering_Parameter_Space.html")

where kL, epsL and minPtsL define the search space for k, eps and minPts respectively. Note that executing this function will take longer for large search spaces.
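To get a feel for how quickly the search space grows, the size of the grid in the call above can be counted directly. The Python sketch below only enumerates parameter combinations and assumes that one clustering is evaluated per combination; it does not call RCAv2.

```python
from itertools import product

# R's 30:50 is inclusive on both ends, hence the +1 on the upper bounds here.
k_values = range(30, 51)
eps_values = range(5, 21)
min_pts_values = range(5, 11)

grid = list(product(k_values, eps_values, min_pts_values))
print(len(k_values), len(eps_values), len(min_pts_values), len(grid))
```

For the ranges above this gives 21 x 16 x 6 = 2016 combinations, so narrowing any one range shrinks the total multiplicatively.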

In addition to those clustering parameters, via the dist.fun parameter one can choose whether a PCA reduction of the projection matrix, or the entire projection matrix should be used to construct the snn graph.

Using Louvain graph clustering implemented in Seurat

PBMCs<-dataSClust(PBMCs,res=0.5)

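where res is the resolution parameter passed to the Seurat clustering function.

For graph-based clustering, the RCA projection heatmap is drawn without a column dendrogram.

This function uses the same PCA approximation as the original Seurat function. Setting approx to FALSE computes an exact PCA. If the corMeth parameter is set to pearson, spearman or kendall, the function computes the full distance matrix using correlation distance instead of the default Euclidean distance. This can also be combined with the exact PCA. Note, however, that these options require more available memory than the default. With the elbow plot function, an ElbowPlot can be generated to guide the choice of the number of PCs to consider.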

Clustering free analysis of the projection

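Especially for very large datasets, clustering the projection can be challenging. For these cases, RCA includes a clustering-independent cell type assignment approach, inspired by SingleR and scMatch, that is based purely on each cell's z-score distribution across the reference projection. A call to the function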

PBMCs<-estimateCellTypeFromProjection(PBMCs,confidence = NULL)

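returns the most probable cell type for each cell and stores it in the PBMCs object. Via the confidence parameter, a threshold (between 0 and 1) can be imposed on the ratio between the two most probable cell types. In uncertain cases, the cell is labelled as unknown.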
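One way to picture the confidence threshold is the Python sketch below. The direction of the ratio (runner-up score divided by best score, with a call kept when the ratio is at or below the threshold) is an assumption made for illustration; the package may define the ratio differently, and the function name and scores are hypothetical.

```python
def call_cell_type(scores, confidence=None):
    """Pick the best-scoring reference, or "Unknown" if the runner-up is too close.

    scores: {reference_name: projection score}, higher meaning more similar.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best = ranked[0]
    if confidence is None or len(ranked) < 2:
        return best_name  # no threshold requested: always return the top hit
    second = ranked[1][1]
    # Assumed rule: accept only when the runner-up is clearly weaker than the best hit.
    if best > 0 and second / best <= confidence:
        return best_name
    return "Unknown"

clear_cell = {"CD4._Tcells": 4.0, "CD8._Tcells": 1.0, "CD19._BCells": -0.5}
ambiguous_cell = {"CD4._Tcells": 4.0, "CD8._Tcells": 3.9, "CD19._BCells": -0.5}

print(call_cell_type(clear_cell, confidence=0.5))
print(call_cell_type(ambiguous_cell, confidence=0.5))
print(call_cell_type(ambiguous_cell))
```

With confidence left unset, as in the call above, every cell receives its top-ranked label, including cells whose two best candidates are nearly tied.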

table(unlist(PBMCs$cell.Type.Estimate))
BDCA4._DentriticCells                     CD14._Monocytes             CD19._BCells.neg._sel.. 
                                 69                                 108                                 307 
                      CD33._Myeloid                               CD34.                         CD4._Tcells 
                               1363                                   3                                2266 
                      CD56._NKCells                         CD8._Tcells                 L45_CMP_Bone.Marrow 
                                458                                 364                                   4 
             L51_B.Cell_Bone.Marrow 

#Retrieve annotation
SimplifiedAnnotation<-unlist(PBMCs$cell.Type.Estimate)
#Relabel it
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD33._Myeloid")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD4._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD8._Tcells")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD14._Monocytes")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="BDCA4._DentriticCells")]<-"Dendritic cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L93_B.Cell_Plasma.Cell")]<- "B cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L52_Platelet")]<-"Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L74_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L51_B.Cell_Bone.Marrow")]<-"B cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L75_T.Cell_CD4.Centr..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L85_NK.Cell_CD56Hi")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD34.")]<-"Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L45_CMP_Bone.Marrow")]<- "Progenitor"
SimplifiedAnnotation[which(SimplifiedAnnotation=="WholeBlood")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L69_Dendritic.Cell_Monocyte.derived")]<- "Myeloid"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L80_T.Cell_CD8.Eff..Memory")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L60_Monocyte_CD16")]<- "Monocytes"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L86_NK.Cell_CD56Lo")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="L73_T.Cell_CD4.Naive")]<-"T cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD56._NKCells")]<-"NK cells"
SimplifiedAnnotation[which(SimplifiedAnnotation=="CD19._BCells.neg._sel..")]<- "B cells"

umapFigures<-RCAv2::plotRCAUMAP(PBMCs,
                      cellPropertyList = list(`Cell Type`=SimplifiedAnnotation),
                      filename = "UMAP_PBMCs.pdf")

Compute DE genes for RCA clusters

PBMCs<-RCAv2::dataDE(PBMCs,
  logFoldChange = 1.5,
  method = "wilcox",
  mean.Exp = 0.5,
  deep.Split.Values = 1,
  min.pct = 0.25,
  min.diff.pct = -Inf,
  random.seed = 1,
  min.cells.group = 3,
  pseudocount.use = 1,
  p.adjust.methods = "BH",
  top.genes.per.cluster = 10
)

Here, logFoldChange is the log fold change required to call a gene differentially expressed. The method parameter indicates which statistical test is used. Multiple test correction is performed using the method indicated in p.adjust.methods. The parameters mean.Exp and min.pct indicate the minimum expression value as well as the minimum percentage of cells expressing a gene. Furthermore, the pseudocount can be adjusted via the pseudocount.use parameter. The top.genes.per.cluster parameter indicates how many genes should be selected as top DE genes per pairwise comparison for each cluster. Both the entire set of DE genes as well as the top DE genes are stored in the PBMCs rca.obj. The top DE genes can be plotted in a heatmap via the plotDEHeatmap function:

RCAv2::plotDEHeatmap(PBMCs,scale=FALSE)

The scale parameter allows the user to plot either the normalized UMI counts or scaled count (z-transformed). An example is shown below.
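Returning to the multiple-testing correction named by p.adjust.methods = "BH" in the dataDE call: the Benjamini-Hochberg adjustment itself is short enough to write out. The Python sketch below implements the standard procedure on made-up p-values; it illustrates the correction only and is not the code path RCAv2 uses.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values are penalised most and the ordering of genes is preserved.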

Compute enrichment for GO terms and KEGG pathways

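Using the clusterProfiler package, RCAv2 supports enrichment testing for GO terms and KEGG pathways via the functions doEnrichGo and doEnrichKEGG, respectively. Both require the annotation parameter to be set, which is needed for both ID mapping and GO term assignment. An example annotation for human is org.Hs.eg.db from Bioconductor.

GO analysis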

doEnrichGo<-function(rca.obj,
                    annotation=NULL,
                    ontology="BP",
                    p.Val=0.05,
                    q.Val=0.2,
                    p.Adjust.Method="BH",
                    gene.label.type="SYMBOL",
                    filename="GoEnrichment.pdf",
                    background.set="ALL",
                    background.set.threshold=NULL,
                    n.Cells.Expressed=NULL,
                    cluster.ID=NULL,
                    deep.split=NULL)

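Here, ontology is one of BP, MF or CC, p.Val and q.Val are the thresholds used by clusterProfiler, and p.Adjust.Method indicates which method is used to correct for multiple testing. For convenience, the function maps gene IDs automatically; to that end, gene.label.type holds the type of the original labels and defaults to SYMBOL, in line with 10X data. The background set is based either on all clusters or only on the cluster under investigation. Selection is made via the background.set.threshold or n.Cells.Expressed parameter. Note that the former is either a numerical value or one of the following: Min, 1stQ, Mean, Median, 3thQ. These thresholds are computed on the distribution of the mean expression values of all genes. To analyse only one specific cluster, set the cluster.ID parameter; if hierarchical clustering was used, a value for deep.split can be given to select a custom split.

The doEnrichGo function generates bar plots, goplots and dot plots for each cluster separately. The file names can be changed with the filename parameter. Example bar and dot plots are shown for the PBMC NK cell cluster, and goplots are shown for the PBMC B cells.

KEGG analysis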

doEnrichKEGG<-function(rca.obj,
                     annotation=NULL,
                     org="hsa",
                     key="kegg",
                     p.Val=0.05,
                     q.Val=0.2,
                     p.Adjust.Method="BH",
                     gene.label.type="SYMBOL",
                     filename="KEGGEnrichment.pdf",
                     background.set="ALL",
                     background.set.threshold=NULL,
                     n.Cells.Expressed=NULL,
                     cluster.ID=NULL,
                     deep.split=NULL)

Cluster/Cell-type specific quality control

RCAv2 offers straightforward ways to perform cluster-specific quality control. We illustrate this functionality using an inhouse dataset of 45926 cells obtained from five bone marrow samples. A link to download the data will be made available here at a later stage. First, we load the data, project it against the global panel and cluster it:

normalBoneMarrow<-readRDS("../Documents/DUKE_Normal.RDS")
PBMCs<-RCAv2::createRCAObject(normalBoneMarrow@assays$RNA@data,dataIsNormalized = T)

PBMCs<-RCAv2::dataProject(PBMCs,
                          method = "GlobalPanel",
                          corMeth = "pearson")

PBMCs<-RCAv2::dataSClust(PBMCs,res = 0.1)
RCAv2::plotRCAHeatmap(PBMCs,filename = "Control_HeatmapPostQC.pdf")

we identify the cluster IDs as:

cellTypes<-c("Progenitor B","CMP/MEP","CMP/GMP","GMP/Dendritic cells","CD8 T cells","NK cells","CD4 T cells", "B cells", "Erythroid Progenitor","Monocytes","BT")
clusterColors<-c("purple","black","blue","magenta","turquoise","yellow","green","pink","greenyellow","red","brown")
names(cellTypes)<-clusterColors
cellTypeLabels<-cellTypes[PBMCs$clustering.out$dynamicColorsList[[1]]]

and plot cluster quality scores using

plotClusterQuality(PBMCs,width = 15,height = 9,cluster.labels = cellTypeLabels)

Combining RCA with Seurat

Data processing can also be carried out with Seurat. Here is an example of how you can combine an RCA analysis with data preprocessed in Seurat.

Load and preprocess data

Using the same 10x data as before, we generate a Seurat object and perform an initial analysis:

library(Seurat)

#Load the data
PBMCs.10x.data<-Seurat::Read10X('../Downloads/10xExample/')

#Generate a Seurat object
pbmc_Seurat <- CreateSeuratObject(counts = PBMCs.10x.data$`Gene Expression`, 
                  min.cells = 3, 
                  min.features  = 200, 
                  project = '10X_PBMC', 
                  assay = 'RNA')

#Compute the fraction of mitochondrial counts per cell
mito.genes<-grep(pattern='^MT-',x=rownames(pbmc_Seurat@assays[['RNA']]),value=T)
percent.mito <- Matrix::colSums(pbmc_Seurat@assays[['RNA']][mito.genes, ])/
                                Matrix::colSums(pbmc_Seurat@assays[['RNA']])
pbmc_Seurat <- AddMetaData(object = pbmc_Seurat, metadata = percent.mito, col.name = 'percent.mito')

#Perform QC using the same parameters as above
pbmc_Seurat <- subset(pbmc_Seurat, nFeature_RNA >300 & nFeature_RNA < 5000 &
                        nCount_RNA > 400 & nCount_RNA<30000 &
                        percent.mito > 0.025 & percent.mito < 0.2)

#Normalize the data
pbmc_Seurat <- NormalizeData(object = pbmc_Seurat, normalization.method = 'LogNormalize', scale.factor = 10000)

To run RCA, no further processing steps would be needed. However, we want to also compare the RCA result to the Seurat based clustering, therefore we first go on with a Seurat based analysis:


#Find HVGs
pbmc_Seurat <- FindVariableFeatures(object = pbmc_Seurat, 
                   mean.function = ExpMean, 
                   dispersion.function = LogVMR, 
                   x.low.cutoff = 0.0125, 
                   x.high.cutoff = 3, 
                   y.cutoff = 0.5, 
                   nfeatures = 2000)

#Center and scale the data
pbmc_Seurat <- ScaleData(object = pbmc_Seurat)

#Run PCA on the data
pbmc_Seurat <- RunPCA(object = pbmc_Seurat,  npcs = 50, verbose = FALSE)

#Plot different aspects of the PCA
ElbowPlot(object = pbmc_Seurat,ndims = 50)

Based on the elbow plot (not shown here), we use 20 PCs for further analysis.

#Find Neighbors
pbmc_Seurat <- FindNeighbors(pbmc_Seurat, reduction = 'pca', dims = 1:20)

#Find Clusters
pbmc_Seurat <- FindClusters(pbmc_Seurat, resolution = 0.2, algorithm = 1)

We generate a UMAP of the data stored in the Seurat object using the umap R package:

#Load required libraries
library(umap)
library(ggplot2)
library(randomcoloR)

#Compute Umap from first 20PCs
umap_resultS<- umap(pbmc_Seurat@reductions$pca@cell.embeddings[,c(1:20)])
umap_resultSL<-as.data.frame(umap_resultS$layout)

#Derive distinguishable colors for the seurat clusters
myColors<-distinctColorPalette(length(unique(pbmc_Seurat$seurat_clusters)))

#Generate a UMAP
umapAll_Seurat_RCA<-ggplot(umap_resultSL,aes(x=V1,y=V2,color=pbmc_Seurat$seurat_clusters))+theme_bw(30)+
  geom_point(size=1.5)+labs(colour='ClusterID')+theme(legend.title = element_text(size=10))+
  guides(colour = guide_legend(override.aes = list(size=4)))+theme(legend.position = 'right')+
  theme(legend.text=element_text(size=10))+scale_color_manual(values=myColors)+xlab('UMAP1')+ylab('UMAP2')
umapAll_Seurat_RCA

We obtain the following UMAP:

Generate an RCA object and perform RCA analysis

We use the RCA function createRCAObject to generate an RCA object from the raw and optionally also the normalized data stored in our Seurat object.

library(RCAv2)
RCA_from_Seurat<-RCAv2::createRCAObject(pbmc_Seurat@assays$RNA@counts, pbmc_Seurat@assays$RNA@data)

Next, we can compute the projection, cluster the data, and estimate the most likely cell type for each cell as above:

#Compute projection
RCA_from_Seurat<-RCAv2::dataProject(rca.obj = RCA_from_Seurat)

#Cluster the projection
RCA_from_Seurat<-RCAv2::dataClust(RCA_from_Seurat)

#Estimate most likely cell type
RCA_from_Seurat<-RCAv2::estimateCellTypeFromProjection(RCA_from_Seurat)

Using the RCA cell type labels, RCA and Seurat clusters, we generate two new UMAPs whose coordinates are based on the PCs derived from HVGs and that are colored according to RCA clusters and cell type labels.

#Simplify the cell type annotation
SimplifiedAnnotation<-unlist(RCA_from_Seurat$cell.Type.Estimate)
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD33._Myeloid')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD4._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD8._Tcells')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD14._Monocytes')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='BDCA4._DentriticCells')]<-'Dendritic cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L93_B.Cell_Plasma.Cell')]<- 'B cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L52_Platelet')]<-'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L74_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L51_B.Cell_Bone.Marrow')]<-'B cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L75_T.Cell_CD4.Centr..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L85_NK.Cell_CD56Hi')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD34.')]<-'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L45_CMP_Bone.Marrow')]<- 'Progenitor'
SimplifiedAnnotation[which(SimplifiedAnnotation=='WholeBlood')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L69_Dendritic.Cell_Monocyte.derived')]<- 'Myeloid'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L80_T.Cell_CD8.Eff..Memory')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L60_Monocyte_CD16')]<- 'Monocytes'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L86_NK.Cell_CD56Lo')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='L73_T.Cell_CD4.Naive')]<-'T cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD56._NKCells')]<-'NK cells'
SimplifiedAnnotation[which(SimplifiedAnnotation=='CD19._BCells.neg._sel..')]<- 'B cells'

#Plot a umap colored by the simplified cell type labels
myColors<-distinctColorPalette(length(unique(SimplifiedAnnotation)))
umapAll_Seurat_Estimated_CT<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=SimplifiedAnnotation))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='Cell type')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
scale_color_manual(values=myColors)+
ggtitle('b)')+
xlab('UMAP1')+ylab('UMAP2')+
theme(legend.title = element_text(size=12))

#Plot a umap colored by the RCA cluster ID
umapAll_Seurat_RCA_Clusters<-ggplot(umap_resultSL,
aes(x=V1,y=V2,color=RCA_from_Seurat$clustering.out$dynamicColorsList[[1]]))+
theme_bw(30)+
geom_point(size=1.5)+
theme(legend.position = 'bottom')+
labs(colour='RCA Cluster ID')+
guides(colour = guide_legend(override.aes = list(size=4)))+
theme(legend.text=element_text(size=10))+
xlab('UMAP1')+ylab('UMAP2')+
scale_color_identity(guide='legend')+
ggtitle('a)')+
theme(legend.title = element_text(size=12))

#Combine the Figures into one
library(gridExtra)
grid.arrange(umapAll_Seurat_RCA_Clusters,umapAll_Seurat_Estimated_CT,nrow=1)

The RCA clusters show a high concordance to the Seurat clusters shown in the previous UMAP.

Add projection and annotations to the Seurat object

For greater convenience the results of RCA can be saved within the Seurat object for further analysis.

pbmc_Seurat[['RCA.clusters']]<-RCA_from_Seurat$clustering.out$dynamicColorsList
pbmc_Seurat[['cellTypeLabel']]<-RCA_from_Seurat$cell.Type.Estimate
pbmc_Seurat[['Projection']]<-CreateAssayObject(data=RCA_from_Seurat$projection.data)

Add a UMAP based on the projection to the Seurat object

Also, a UMAP reduction based on the projection space can be added to the Seurat object:

RCA_from_Seurat<-computeUMAP(RCA_from_Seurat)
pbmc_Seurat[['RCA_umap']]<-CreateDimReducObject(embeddings=as.matrix(RCA_from_Seurat$umap.coordinates),key='RCA_umap_',assay=DefaultAssay(pbmc_Seurat))

Visualizing RNA velocity on RCA result

RNA velocity describes the rate of gene expression change for an individual gene at a given time point based on the ratio of its spliced and unspliced messenger RNA (mRNA). Here, we describe how one can use the scvelo package, in Python, to visualize RNA velocity on the RCA generated result.

To transfer spliced RNA counts to scvelo, first transpose the raw RCA data matrix to get a cells x genes matrix, and export it to a CSV file.

# R
raw.data.counts <- t(rca_obj$raw.data)
write.table(x = raw.data.counts, file = 'raw_counts.csv', append = FALSE, quote = FALSE, sep = ',')

In addition, export the RCA projection and UMAP embeddings to respective CSV files too.

# R
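# doublet_index is not defined in this post; it is assumed to hold the column indices of cells to exclude
projection.data <- as.matrix(t(rca_obj$projection.data[, -doublet_index]))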
write.table(x = projection.data, file = 'projection_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')

umap.data <- as.matrix(rca_obj$umap.coordinates)
write.table(x = umap.data, file = 'umap_data.csv', append = FALSE, quote = FALSE, col.names = F, row.names = F, sep = ',')

Create an iPython notebook in the same folder and import the required packages as below.

# Python
import scvelo as scv
import scanpy as sc
import numpy as np
import pandas as pd
scv.set_figure_params()

Then, create a Scanpy object using the raw counts from the CSV file.

# Python
adata = sc.read_csv(filename='raw_counts.csv')

Populate the PCA slot in the Scanpy object as the projection data from RCA.


# Python
projection_data = np.loadtxt('projection_data.csv',delimiter=',')
projection_data.shape

adata.obsm['X_pca'] = projection_data

Populate the UMAP slot in the Scanpy object as the umap coordinates from RCA.

# Python
umap_data = np.loadtxt('umap_data.csv',delimiter=',')
umap_data.shape

adata.obsm['X_umap'] = umap_data

Load the unspliced loom object generated by velocyto.

# Python
ldata = scv.read('merged.loom', cache=True)

Then, merge the spliced and unspliced objects together as described below:

# Python
merged_data = scv.utils.merge(adata, ldata)

As recommended by the scvelo tutorial, perform the following steps to compute RNA velocity:

# Python
scv.pp.filter_and_normalize(merged_data)
scv.pp.moments(merged_data)
scv.tl.velocity(merged_data, mode='stochastic')
scv.tl.velocity_graph(merged_data)

It is possible that not all barcodes had sufficient quality of both spliced and unspliced reads, and thus some cells may have been discarded during the merging process. To ensure your cell type labels are still maintained, export the merged data observations from the merged scvelo object to a CSV file.

# Python
merged_data.obs.to_csv('merged_data_obs.csv')

In R, load this CSV file in and extract the RCA labels and filter only those which were considered in the merged data by scvelo.

# R
merged_data_obs <- read.csv(file = 'merged_data_obs.csv', row.names = 1)
rca_clusters <- rca_obj$clustering.out$dynamicColorsList$Clusters
names(rca_clusters) <- colnames(rca_obj$raw.data)
rca_clusters <- rca_clusters[rownames(merged_data_obs)]
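The filtering step above is a lookup by barcode. In R, indexing a named vector with a name it does not contain silently yields NA, so a barcode mismatch can go unnoticed. The Python sketch below shows the same alignment with an explicit check; the function name `align_labels` and the barcodes are hypothetical, and it works on plain dictionaries rather than on the RCA or scvelo objects.

```python
def align_labels(labels_by_barcode, surviving_barcodes):
    """Return labels in the order of surviving_barcodes, failing loudly on unknown barcodes."""
    missing = [b for b in surviving_barcodes if b not in labels_by_barcode]
    if missing:
        raise KeyError(f"{len(missing)} barcode(s) have no RCA label, e.g. {missing[0]}")
    return [labels_by_barcode[b] for b in surviving_barcodes]

# Hypothetical RCA cluster colours keyed by cell barcode.
rca_labels = {"AAAC-1": "blue", "AAAG-1": "turquoise", "AACC-1": "brown", "AAGT-1": "blue"}

# Hypothetical barcodes kept after merging spliced and unspliced data (one cell was dropped).
kept_barcodes = ["AACC-1", "AAAC-1", "AAGT-1"]

aligned = align_labels(rca_labels, kept_barcodes)
print(aligned)
```

The returned list follows the order of the surviving barcodes, which is the order the labels need to have when they are attached to the merged object.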

Note: If your cell names have underscores in them, scanpy will automatically split the cell name into barcode and sample_batch.

In this case, replace the last line of the above block of code with the following:

# R
merged_barcodes <- paste0(merged_data_obs$sample_batch, rownames(merged_data_obs))
rca_clusters <- rca_clusters[merged_barcodes]

Now export these cluster labels to a CSV file.

# R
rca_cluster_df <- data.frame(Clusters = rca_clusters)
write.table(x = rca_cluster_df, file = 'rca_cluster_df.csv', append = FALSE, quote = FALSE, col.names = T, row.names = F, sep = ',')

Back in the scvelo iPynb, load this RCA cluster annotation table and set it as the observation slot of your merged data.

# Python
rca_clusters = pd.read_csv('rca_cluster_df.csv')
merged_data.obs = rca_clusters

Now, it’s finally time to visualize the RNA velocity results. There are 3 visualization options provided by scvelo, namely velocity_embedding, velocity_embedding_grid and velocity_embedding_stream. Use them as demonstrated below

# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = 'tab20', figsize = (10,10), save = 'embedding.png')

Using RCA colors

Since the RCA clusters already have color annotations, you can use the RCA colors in the palette as described below:

# Python
### Velocity embedding
scv.pl.velocity_embedding(merged_data, basis='umap', color = ['Clusters'], legend_loc = 'right margin', palette = merged_data.obs['Clusters'].sort_values().unique().tolist(), figsize = (10,10), save = 'RCAColor_embedding.png')

Life is good, and even better with you.