Biological analysis of single-cell RNA-seq data

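I. Clustering analysis

  One of the most common applications of scRNA-seq is the de novo discovery and annotation of cell types based on transcription profiles. Computationally, this is a hard unsupervised clustering problem: we need to identify groups of cells from the similarity of their transcriptomes, without any prior knowledge of the labels. Moreover, in most situations we do not know the number of clusters in advance. The problem is made even more challenging by the high level of noise (both technical and biological) and the large number of dimensions (i.e. genes).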

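1. Dimensionality reduction

  When working with large datasets, it is usually necessary to apply some form of dimensionality reduction. Projecting the data onto a lower-dimensional subspace can substantially reduce the amount of noise, and in a 2- or 3-dimensional subspace the data are also much easier to visualize.
Methods: PCA and tSNE (tSNE is used mostly for visualization; dimensionality reduction is not its main purpose).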

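2. Clustering methods

2.1 Hierarchical clustering

Hierarchical clustering generally falls into two types:
Agglomerative: bottom-up. Each cell is initially assigned to its own cluster, and clusters are merged as we move up the hierarchy.
Divisive: top-down. All cells start in a single cluster, which is then split recursively to form the hierarchy.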
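The agglomerative (bottom-up) strategy can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `agglomerative`, the choice of average linkage, Euclidean distance, and the toy 2-D "cells" are all made up for the example and are not taken from any of the tools discussed here.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage; returns lists of point indices."""
    # Every point starts in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical cells described by two features each.
cells = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.3)]
result = agglomerative(cells, 2)
```

Stopping at a fixed number of clusters is a simplification; a full hierarchical method records every merge so that the tree can be cut at any level afterwards.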

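2.2 K-means

In k-means, the goal is to partition N cells into k different clusters. In an iterative manner, cluster centers are assigned and each cell is assigned to its nearest cluster.
Most methods for scRNA-seq analysis include a k-means step at some point.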
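The iteration described above (assign each cell to its nearest center, then move each center) can be sketched as follows. This is a minimal illustration, not the implementation used by any particular scRNA-seq tool; the function name `kmeans`, the fixed starting centers, and the toy data are assumptions made for the example. Real implementations usually pick starting centers at random or with k-means++ and run several restarts.

```python
from math import dist


def kmeans(points, centers, n_iter=20):
    """Lloyd's algorithm: alternate an assignment step and a center-update step."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers


# Hypothetical cells with two features each; two obvious groups.
cells = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9)]
labels, centers = kmeans(cells, centers=[cells[0], cells[3]])
```

Note that k has to be supplied by the user, which is one of the challenges discussed later in this section.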

2.3 Graph-based methods

A graph (network) is built in which each node represents a cell, and weights are assigned to the edges.

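2.4 Challenges in clustering

What is the number of clusters k?
What is a cell type?
Scalability: the number of cells in scRNA-seq experiments has grown by several orders of magnitude (from 10^2 to 10^6).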

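II. Tools for scRNA-seq data

1. SINCERA

Based on hierarchical clustering.
Data are converted to z-scores before clustering.
k is determined by finding the first singleton cluster in the hierarchy.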
https://research.cchmc.org/pbge/sincera.html
Reference:
Guo, Minzhe, Hui Wang, S. Steven Potter, Jeffrey A. Whitsett, and Yan Xu. 2015. “SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis.” PLoS Comput Biol 11 (11). Public Library of Science (PLoS): e1004575. doi:10.1371/journal.pcbi.1004575.
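As a rough illustration of the z-score step, the sketch below standardizes each gene across cells. It is not SINCERA's code: the function name `zscore_rows`, the choice of per-gene (row-wise) scaling, the use of the population standard deviation, and the handling of constant genes are all assumptions made for the example.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each row (gene) to mean 0 and standard deviation 1 across cells."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        # A gene with no variation carries no signal; map it to zeros (assumption).
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scaled


# Hypothetical expression matrix: rows are genes, columns are cells.
expr = [[1.0, 2.0, 3.0],
        [10.0, 10.0, 10.0]]
z = zscore_rows(expr)
```

After this transformation every gene contributes on the same scale to the distances used by hierarchical clustering, regardless of its absolute expression level.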

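2. pcaReduce

Combines PCA, k-means and iterative hierarchical clustering.
Starting from a large number of clusters, pcaReduce iteratively merges similar clusters; after each merge it removes the principal component that explains the least variance in the data.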
https://github.com/JustinaZ/pcaReduce
Reference:
Žurauskienė, Justina, and Christopher Yau. 2016. “pcaReduce: Hierarchical Clustering of Single Cell Transcriptional Profiles.” BMC Bioinformatics 17 (1). Springer Nature. doi:10.1186/s12859-016-0984-y.

3. SC3

  • SC3 is based on PCA and spectral dimensionality reductions
  • Utilises k-means
  • Additionally performs consensus clustering

http://bioconductor.org/packages/release/bioc/html/SC3.html
Reference:
Kiselev, Vladimir Yu, Kristina Kirschner, Michael T Schaub, Tallulah Andrews, Andrew Yiu, Tamir Chandra, Kedar N Natarajan, et al. 2017. “SC3: Consensus Clustering of Single-Cell RNA-Seq Data.” Nat Meth 14 (5). Springer Nature: 483–86. doi:10.1038/nmeth.4236.

4. tSNE + k-means

  • Based on tSNE maps
  • Utilises k-means

5. SNN-Cliq

SNN-Cliq is a graph-based method. First, the method identifies the k nearest neighbours of each cell according to the distance measure. This is used to calculate the number of shared nearest neighbours (SNN) between each pair of cells. A graph is built by placing an edge between two cells if they have at least one SNN. Clusters are defined as groups of cells with many edges between them, using a "clique" method. SNN-Cliq requires several parameters to be defined manually.
http://bioinfo.uncc.edu/SNNCliq/
Reference:
Xu, Chen, and Zhengchang Su. 2015. “Identification of Cell Types from Single-Cell Transcriptomes Using a Novel Clustering Method.” Bioinformatics 31 (12). Oxford University Press (OUP): 1974–80. doi:10.1093/bioinformatics/btv088.
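The graph-construction step of an SNN approach can be sketched as below. This is a simplified illustration rather than the SNN-Cliq implementation: the function names `knn` and `snn_graph`, the use of Euclidean distance, the exclusion of a cell from its own neighbour list, and the plain shared-neighbour count used as edge weight are all assumptions. The clique-finding step that turns the graph into clusters is not shown.

```python
from math import dist


def knn(points, k):
    """Set of indices of the k nearest neighbours of each point (excluding itself)."""
    neighbours = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        neighbours.append(set(others[:k]))
    return neighbours


def snn_graph(points, k):
    """Edges {(i, j): number of shared neighbours} for pairs with at least one SNN."""
    neighbours = knn(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            shared = len(neighbours[i] & neighbours[j])
            if shared >= 1:
                edges[(i, j)] = shared
    return edges


# Hypothetical cells forming two well-separated groups of three.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
edges = snn_graph(cells, k=2)
```

In this toy example edges appear only within each group, which is the property a community-detection or clique method then exploits. The value of k is one of the parameters that has to be chosen manually.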

6. Seurat clustering

Seurat clustering is based on a community detection approach similar to SNN-Cliq and to one previously proposed for analyzing CyTOF data (Levine et al. 2015). Seurat has become more like an all-in-one tool for scRNA-seq data analysis.

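7. Comparing clusterings

To compare two sets of clustering labels we can use the adjusted Rand index (ARI), which measures the similarity between two clusterings. Its maximum value is 1, which means the two clusterings are identical; a value of 0 means the agreement is no better than expected by chance. The index is often described as lying in [0, 1], but it can be negative when the agreement is worse than chance.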
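For reference, the adjusted Rand index can be computed from the contingency table of the two labelings. The sketch below uses only the Python standard library; the function name `adjusted_rand_index` and the toy labels are made up for the example, and returning 1.0 in the degenerate case is a convention chosen here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(labels_a)
    # Pairs of cells placed together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both put all cells in one cluster
    return (together - expected) / (maximum - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # same partition, renamed
opposite = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # no pair kept together
```

The first comparison shows that the index ignores the cluster names themselves; the second gives a negative value, illustrating agreement that is worse than chance.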

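III. Feature selection

For most genes detected in a scRNA-seq experiment, the differences in expression level between cells are due only to technical noise. One consequence of this is that technical noise and batch effects can obscure the biological signal of interest.
Feature selection is therefore very beneficial for downstream analysis: it increases the signal-to-noise ratio of the data and reduces computational complexity. Feature selection here usually refers to unsupervised methods, which require no prior knowledge such as cell-type labels or biological groups. Differential expression, by contrast, can be considered a form of supervised feature selection, since it uses the known biological label of each sample to identify features (e.g. genes) that are expressed at different levels across groups.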

1. Normalizing for library size
scRNA-seq data can be QCed and normalized for library size using M3Drop, which removes cells with few detected genes, removes undetected genes, and converts raw counts to CPM.
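The three steps named above (drop cells with few detected genes, drop undetected genes, convert to counts per million) can be sketched in plain Python as follows. This is not M3Drop's code: the function name `filter_and_cpm`, the `min_genes` threshold, and the toy matrix are hypothetical and chosen only for illustration.

```python
def filter_and_cpm(counts, min_genes=2):
    """counts[g][c] is the raw count of gene g in cell c; returns a CPM matrix."""
    n_cells = len(counts[0])
    # Keep cells in which at least `min_genes` genes were detected.
    keep_cells = [c for c in range(n_cells)
                  if sum(1 for row in counts if row[c] > 0) >= min_genes]
    filtered = [[row[c] for c in keep_cells] for row in counts]
    # Drop genes that are not detected in any remaining cell.
    filtered = [row for row in filtered if any(x > 0 for x in row)]
    # Scale each cell so that its counts sum to one million.
    totals = [sum(col) for col in zip(*filtered)]
    return [[x / t * 1e6 for x, t in zip(row, totals)] for row in filtered]


# Hypothetical raw counts: 4 genes (rows) by 3 cells (columns).
raw = [[10, 0, 5],
       [30, 0, 15],
       [0, 1, 0],
       [0, 0, 0]]
cpm = filter_and_cpm(raw)
```

Here the second cell is removed because only one gene was detected in it, after which the last two genes have no counts left and are dropped as well. The two remaining cells differ twofold in library size but end up with identical CPM profiles.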
2. For unsupervised feature selection there are two main approaches: highly variable genes (HVG) and high dropout genes.
2.1 Highly variable genes (HVG)
HVG assumes that if genes have large differences in expression across cells some of those differences are due to biological difference between the cells rather than technical noise. However, because of the nature of count data, there is a positive relationship between the mean expression of a gene and the variance in the read counts across cells. This relationship must be corrected for to properly identify HVGs.
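The figure below uses rowMeans and rowVars to show the relationship between mean expression and variance for all genes in the dataset (plotted on a log scale).
[Figure: mean expression versus variance for all genes, log scale]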
A good way to correct for the relationship between variance and mean expression is the Brennecke method (Philip Brennecke, Simon Anders, Jong Kyoung Kim, Aleksandra A Kołodziejczyk, Xiuwei Zhang et al., “Accounting for technical noise in single-cell RNA-seq experiments”).

To use the Brennecke method, we first normalize for library size, then calculate the mean and the squared coefficient of variation (the variance divided by the squared mean expression) of each gene. A quadratic curve is fit to the relationship between these two variables for the ERCC spike-ins, and then a chi-square test is used to find genes significantly above the curve. This method is included in the M3Drop package as the Brennecke_getVariableGenes(counts, spikes) function. However, this dataset does not contain spike-ins, so we will use the entire dataset to estimate the technical noise.
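The two per-gene quantities that the method starts from are easy to compute by hand. The sketch below covers only that first step; the curve fit and the chi-square test are not reproduced. The function name `mean_and_cv2`, the use of the sample variance, and the toy expression values are assumptions made for the example.

```python
from statistics import mean, variance


def mean_and_cv2(expr):
    """Return (mean, squared coefficient of variation) for one gene's normalized values."""
    mu = mean(expr)
    if mu == 0:
        raise ValueError("CV^2 is undefined for a gene with zero mean expression")
    # CV^2 = variance / mean^2 (sample variance used here).
    return mu, variance(expr) / mu ** 2


# Hypothetical library-size-normalized expression of two genes across four cells.
stable_gene = [10.0, 10.0, 10.0, 10.0]
bursty_gene = [0.0, 0.0, 0.0, 40.0]

stable_stats = mean_and_cv2(stable_gene)
bursty_stats = mean_and_cv2(bursty_gene)
```

Both genes have the same mean expression, but the second has a much larger CV^2, which is the kind of excess variability the subsequent test looks for relative to the fitted technical noise curve.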

In the figure below the red curve is the fitted technical noise model and the dashed line is the 95% CI. Pink dots are the genes with significant biological variability after multiple-testing correction.
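[Figure: fitted technical noise model (red curve) with 95% CI (dashed line); genes with significant biological variability shown as pink dots]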
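2.2 High dropout genes
An alternative to finding HVGs is to identify genes with unexpectedly high numbers of zeros (dropouts). Zeros are a dominant feature of single-cell RNA-seq data, typically accounting for more than half of the entries in the final expression matrix. They arise for two main reasons: first, mRNAs failing to be reverse transcribed; second, for UMI-tagged data in particular, low sequencing coverage.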

2.2.1 Failed reverse transcription of mRNAs

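These zeros result from mRNAs failing to be reverse transcribed (see Andrews and Hemberg, 2016, "Modelling dropouts for feature selection in scRNASeq experiments"). Reverse transcription is an enzyme reaction, so it can be modelled with the Michaelis-Menten equation:
$$P_{\mathrm{dropout}} = 1 - \frac{S}{K + S}$$
where S is the mRNA concentration in the cell (estimated by the gene's average expression) and K is the Michaelis-Menten constant.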
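Because the Michaelis-Menten equation is a convex non-linear function, genes that are differentially expressed across two or more populations of cells in the dataset are shifted up/right of the Michaelis-Menten model (see figure below).
[Figure: dropout rate versus mean expression; differentially expressed genes lie up/right of the Michaelis-Menten curve]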
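A small numeric sketch shows why a differentially expressed gene ends up above the curve. It assumes the dropout model has the form P_dropout = 1 - S / (K + S) used by M3Drop; the function name `mm_dropout` and the values of K, S1 and S2 are hypothetical and chosen only for illustration.

```python
def mm_dropout(s, k):
    """Expected dropout rate at mean expression s under the Michaelis-Menten model."""
    return 1 - s / (k + s)


K = 10.0              # hypothetical Michaelis-Menten constant
S1, S2 = 5.0, 100.0   # hypothetical expression of one gene in two cell populations

# A 50:50 mixture of the two populations: what we observe for this gene overall.
mix_mean = (S1 + S2) / 2
mix_dropout = (mm_dropout(S1, K) + mm_dropout(S2, K)) / 2

# What the model predicts for a non-differentially-expressed gene with the same mean.
model_dropout = mm_dropout(mix_mean, K)
```

Because the curve is convex, the average of the two dropout rates (`mix_dropout`) is larger than the dropout rate at the average expression (`model_dropout`), so the gene sits above the fitted curve and can be flagged as an outlier.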
Note: adding log="x" to the plot call shows how this looks on the log scale, which is the scale used in M3Drop figures. As an exercise, produce the same plot with different expression levels (S1 and S2) and/or mixtures (mix).
We use M3Drop to identify significant outliers to the right of the MM curve, applying a 1% FDR multiple-testing correction.
2.2.2 Low sequencing coverage
An alternative method is contained in the M3Drop package that is tailored specifically for UMI-tagged data which generally contains many zeros resulting from low sequencing coverage in addition to those resulting from insufficient reverse-transcription. This model is the Depth-Adjusted Negative Binomial (DANB). This method describes each expression observation as a negative binomial model with a mean related to both the mean expression of the respective gene and the sequencing depth of the respective cell, and a variance related to the mean-expression of the gene.

Unlike the Michaelis-Menten and HVG methods, there isn’t a reliable statistical test for features selected by this model, so we will consider the top 1500 genes instead.