10X单细胞（10X空间转录组）轨迹分析（拟时分析）之基因开关（GeneSwitches）

追风少年ii

于 2024-04-27 10:25:15 发布

阅读量1.2k

点赞数 20

文章标签：机器学习人工智能

本文链接：https://blog.csdn.net/weixin_53637133/article/details/138243919

版权

hello，大家好，今天给大家分享的是，拟时序轨迹分析软件中对轨迹关键驱动基因的下游分析，挖掘拟时序分化轨迹开关基因的分析，这个方面大家应该都不陌生，文章在GeneSwitches: ordering gene expression and functional events in single-cell experiments，我们首先来分享一下文献，最后看一看参考代码。

首先认识一下什么是开关基因，开关基因指细胞分化转化过程中表达沉默或表达激活的基因，它可能会引起或推动发育体系，在可选择的相关细胞途径中进行转换，对生物过程的发生发展有着重要的意义。

Abstract

1、Based on the similarity of gene expression profiles, many tools have been developed to generate an in silico（电脑模拟） ordering of cells in the form of pseudo-time trajectories.

2、these tools do not provide a means to find the ordering of critical gene expression changes over pseudo-time.（做轨迹分析做的好的生信人员都不是很多，每个轨迹分析的官方示例都是例题，需要灵活运用，能做到灵活运用的生信人员，比例不会很高，轨迹分析有很多需要注意的地方，大家可以参考文章10X单细胞轨迹分析之回顾，但这个也是简单的总结，远远不够）。

关于轨迹转变基因这个，以monocle2为例，主要就是计算不同state之间的差异基因，这些基因的代表了不同的发育轨迹。（这个做法是否合理，有待商榷）。

3、We present GeneSwitches, a tool that takes any single-cell pseudo-time trajectory and determines the precise（准确的，精确的） order of gene expression and functional-event changes over time.（这个很多软件就是计算差异,不知道这个方法有什么新的思路没有）。

4、GeneSwitches uses a statistical framework based on logistic regression to identify the order in which genes are either switched on or off along pseudo-time.（开关基因的识别，这个是我们最为关心的地方）。

5、With this information, users can identify the order in which surface markers appear, investigate how functional ontologies are gained or lost over time and compare the ordering of switching genes from two related pseudo-temporal processes（开关基因的意义，下游分析，这个是后话了~~~）

Introduction 我们提炼一下

现有软件（monocle2、Slingshot等），A limitation of current pseudo-time methods is that extracting the underlying order of gene expression changes from these trajectories can be difficult（其实有关空间位置检测变化基因的能力，monocle3已经可以做到~~）.

being able to interpret these gene expression changes in terms of the order that they occur would allow for a fuller understanding of the underlying biological processes.(这远不是我们简单计算一下差异基因就能解决的问题~~~)。

To address this, here we developed GeneSwitches, a statistical framework that processes scRNA-seq data together with a pseudo-time trajectory to find the set of genes that switch during the transition.（软件可以帮助我们找到轨迹发生过程中的开关基因）。

For each gene, we calculate a switching time and associated confidence level.（对于每个基因，我们计算转换时间和相关的置信水平。，有点酷~~~~），能实现这样目的的前提有两个`(i)`investigate how gene-regulatory networks or gene ontologies are gained or lost over time（这个绝对可以）`(ii)` stratify selected gene sets (e.g. surface markers) by the order in which they appear(按基因出现的顺序对选定的基因集（例如表面标记）进行分层 ) `(iii)` identify key differences in the gene expression changes in cell transitions that bifurcate over time(分化转变基因，这个最为关键)。

接下来，我们来看看真实的案例（下图）

GeneSwitches functions and examples

workflow （看看数据处理的过程）

1、GeneSwitches requires two inputs, namely the gene expression matrix and the pseudo-time ordering of each single cell。(矩阵和时间，分析过程如下图)。

2、First, GeneSwitches binarizes the gene expression into either an ‘on’ or ‘off’ state to enable the identification of switching events.This binarization is performed gene-wise in a data-driven manner by fitting a mixture model of two Gaussian distributions to the gene expression）The threshold separating the ‘on’ and ‘off’ gene expression states is then determined by the intersection of the two fitted Gaussian distributions.（这个地方涉及到一些数学上的知识，大家感兴趣可以查一下）。

3、the pseudo-time can be partitioned to identify switching events within specific parts of the trajectory（也可以截取一段的时间序列进行开关基因的识别）。

4、Ordering and visualization of switching genes，the binarized gene expression is used as a dependent variable in a logistic regression with the pseudo-time value of each cell providing the independent variable.In doing so, the probability of expression throughout pseudo-time is calculated and the quality of fit is determined by McFadden’s Pseudo R2. 这样就找到了很多的开关基因，开关基因以0.5为界限，然后进行表达的可视化。

我们简单来总结一下，GeneSwitches首先通过对分化轨迹中的基因进行二值化分析，筛选出了表达特性上存在on和off两个状态的潜在开关基因。随后软件对这些潜在开关基因进行逻辑回归分析和McFadden’s Pseudo R2拟时间相关性分析：通过逻辑回归分析推算出每个开关基因的开关时间(Switching Time Point)；通过拟时间相关性分析得到每个相关性R2值，其中表达激活的开关基因与拟时序正相关(R2>0)，被定义为上调型开关基因(Up-regulation)，而表达沉默的开关基因与拟时序负相关(R2<0)，被定义为下调型开关基因(Down-regulation)。拟时序相关性越高代表基因与该轨迹进程的关系越密切。得到了每个潜在开关基因的开关时间和相关性R2值后，将Top开关基因按其开关时间在拟时序上排序可视化，可以更直观地展示分化轨迹进程中的关键基因作用。同时，GeneSwitches嵌入了功能分析模块，通过基因注释(如表面蛋白，转录因子或其他功能类型)和富集分析(GO, KEGG和HALLMARK)的方法帮助解读开关基因分析结果，更好地与生物过程和疾病发生发展联系起来。除此之外，该软件不仅可以对单条轨迹进行开关基因分析，还能比较两条不同分化分支之间开关基因的异同，可用于寻找分支特异性的开关基因或比较相同开关基因在不同分支中开关时间的先后差异。

看一下基因的排序和可视化

Ordering and visualization of gene classes and functional groups

1、Switching genes can be used to investigate the functional nature of the pseudo-time trajectory.例如，it might be desirable to know for a set of known surface proteins at what point they are activated or deactivated during a transition in order to facilitate the identification of suitable markers on which to sort cells that are transitioning.（开关基因什么时候转变的）。

2、GeneSwitches can also identify the order in which functional ontologies are acquired or lost during a transition.（功能的获得和丢失）。

3、软件提供了可视化的功能，To visualize these changes, we provide the functionality to plot the density of switching genes from each ontological class with respect to pseudo-time in order to study when and how different functional classes are important。

4、具体案例（单轨迹），首先monocle2计算矩阵的轨迹值，然后识别开关基因GeneSwitches identified that TIMP1 and VIM were early surface proteins to be activated, indicating that they might represent good candidate markers to identify cells progressing along the differentiation process more rapidly。We also observe that POU5F1 is deactivated early, whilst MYH7 is activated late(下图d)。 Functional ontology analysis showed that the cell cycle-related ontologies were down-regulated at an early time and cells acquired cardiac-related functions later in the pseudotime（功能缺失和获得的时间关系）。

这部分我们简单总结一下，从人胚胎干细胞(hESC)分化为心肌细胞(CM)的分化轨迹开关基因分析。通过开关分析发现了拟时序较为早期表达激活的开关基因VIM和TIMP1，可以作为加速分化进程的候选基因。另外他们还找到了早期表达沉默的开关基因POU5F1与后期表达激活的开关基因MYH7。开关基因富集分析结果表明与细胞周期(Cell Cycle)相关的通路在拟时序早期下调，而与心脏功能相关的通路在拟时序后期上调。

双轨迹开关基因分析案例，这个用于多个分化分支的情况

If GeneSwitches has been used to analyse two related pseudo-time trajectories, it is possible to compare the switching genes and their switching times.例如，在分化过程中，某些细胞群通常会分叉，每组都处于不同的细胞状态。 GeneSwitches can be used to compare these trajectories, looking for similarities and differences in the switching genes, as well as their switching times.（比较和寻找差异、开关基因的时间）。下图展示了用GeneSwitches分析两条有联系的分化分支的开关基因分析，两条分化轨迹分别为从人胚胎干细胞(hESC)到心肌细胞(CM)的分化分支1(Definitive CM)和到非收缩细胞的分化分支2(Non-contractile)。分析发现分支1上调了一些与心脏功能相关的基因如CSRP3和NKX2-5，而分支2则上调了如DCN和COL1A2等成纤维细胞的marker基因。

In summary, GeneSwitches can help identify the timing of gene expression events within a pseudo-time trajectory, which in turn allows for a fuller understanding of the order of regulatory and functional events that occur during a cellular transition.

看一看代码

GeneSwitches

GeneSwitches 的目标是以单细胞分辨率发现细胞状态转换期间基因表达和功能事件的顺序。它适用于任何单细胞轨迹或细胞的伪时间排序，以发现充当细胞状态之间开/关开关的基因，重要的是这些开关发生的顺序。

Installation

Check and install required packages

Users may use following codes to check and install all the required packages.

list.of.packages <- c("SingleCellExperiment", "Biobase", "fastglm", "ggplot2", "monocle",
                      "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "devtools",
                      "mixtools")

## for package "fastglm", "ggplot2", "plyr", "RColorBrewer", "ggrepel", "ggridges", "gridExtra", "mixtools"
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

## for package "SingleCellExperiment", "Biobase"
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) BiocManager::install(new.packages)

Install GeneSwitches

The source code of GeneSwitches can be installed from GitHub with:

devtools::install_github("SGDDNB/GeneSwitches")

Input datasets

GeneSwitches 需要两个输入，即基因表达矩阵和每个细胞的相应伪时间排序。我们将这些输入数据集转换为一个 SingleCellExperiment 对象（Lun and Risso 2017），在下面你会发现一个完整的“从头到尾”的工作流程来实现这个分析的潜力。

## load libraries
library(GeneSwitches)
library(SingleCellExperiment)

这里的实例代码，我们将使用已发布的单细胞 RNA-seq 数据集，这些数据从人类胚胎干细胞 (hESC) 到心肌细胞 (CM) 的分化（Friedman 等人，2018 年）。运用monocle2进行轨迹分析。选择这个数据集的部分原因是它显示了心脏 hESC 分化的分叉细胞命运，它产生了明确的心肌细胞 (Path1) 或非收缩性心脏衍生物 (Path2)，允许应用 GeneSwitches 的所有方面。

## Download example files to current directory
get_example_inputData()
## Load input data log-normalized gene expression
load("./logexpdata.RData")
## Load Monocle2 object with pseudo-time and dimensionality reduction
load("./cardiac_monocle2.RData")

Direct input (NOT run)

Users can input the gene expression (logexpdata; recommend for log-normalized expression), pseudo-time (cell_pseudotime) and dimensionality reductions (rd_PCA; optional and only for gene expression plots) into SingleCellExperiment object as follows.

### create SingleCellExperiment object with log-normalized single cell data
#sce <- SingleCellExperiment(assays = List(expdata = logexpdata))
### add pseudo-time information
#colData(sce)$Pseudotime <- cell_pseudotime
### add dimensionality reductions, e.g. PCA, UMAP, tSNE
#pca <- prcomp(t(assays(sce)$expdata), scale. = FALSE)
#rd_PCA <- pca$x[,1:2]
#reducedDims(sce) <- SimpleList(PCA = rd_PCA)

Convert from trajectory results

Alternatively, GeneSwitches provides functions to convert Monocle2 or Slingshot results into SingleCellExperiment object directly. For Monocle2 trajectory, users need to indicate the states of the desired path, which can be checked by plotting the trajectory using Monocle2 function plot_cell_trajectory or the following function.

## plot Monocle2 trajectory colored by State
# monocle:::plot_cell_trajectory(cardiac_monocle2, color_by = "State")
plot_monocle_State(cardiac_monocle2)

Based on the marker genes, the pseudo-time trajectory starts from State 3, which are hESC cells. Definitive CM cells are in State 1 and non-contractile cardiac derivatives are in State 5. Therefore, we focus on Path1 with cells in states 3, 2, 1 and Path2 with cells in states 3, 2, 5, and extract these two paths from Monocle2 object.

## Input log-normalized gene expression, Monocle2 pseudo-time and dimensionality reduction
## Path1 containing cells in states 3,2,1
sce_p1 <- convert_monocle2(monocle2_obj = cardiac_monocle2, 
                           states = c(3,2,1), expdata = logexpdata)
## Path2 containing cells in states 3,2,5
sce_p2 <- convert_monocle2(monocle2_obj = cardiac_monocle2, 
                           states = c(3,2,5), expdata = logexpdata)

If we are only interested in the trajectory within a certain range of pseudotime, function subset_pseudotime can be used to subset the SingleCellExperiment object accordingly, followed by filtering out lowly expressed genes.

### Subset cells to pseudotime range from 10 to 25
#sce_p1_subset <- subset_pseudotime(sce_p1, min_time = 10, max_time = 25, minexp = 0, mincells = 10)

In Part I, we will apply GeneSwitches on a single trajectory, Path1, to demonstrate the general workflow and functions. Comparison of GeneSwitches results from two trajectories (Path1 & 2) will be shown in Part II.

PART I. GeneSwitches on a single trajectory

I-1. Binarize gene expression

由于我们关注的是打开或关闭的基因，因此我们首先将基因表达数据二值化为 1(on) 或 0(off) 状态。为了实现这一点，对于每个基因，我们将两个高斯分布的混合模型拟合到输入基因表达中，以计算用于二值化的基因特异性阈值。在拟合之前，我们在基因表达中添加了均值为零和标准差为 0.1 的高斯噪声，这确保了基因表达拟合的数值稳定性。然后去除不具有明显双峰“开-关”分布的基因。对于使用 3 个内核的 2000 个细胞，此步骤可能需要 2 分钟。

### binarize gene expression using gene-specific thresholds
sce_p1 <- binarize_exp(sce_p1, ncores = 3)

Alternatively, we can use a global threshold for fast binarization. We plot a histogram of expression of all the genes in all cells and look for a break between the zero and expressed distributions to identify the global threshold.

### check the threshold for binarization
#h <- hist(assays(sce_p1)$expdata, breaks = 200, plot = FALSE)
#{plot(h, freq = FALSE, xlim = c(0,2), ylim = c(0,1), main = "Histogram of gene expression",
#xlab = "Gene expression", col = "darkgoldenrod2", border = "grey")
#abline(v=0.2, col="blue")}

###In this example, we choose 0.2 (blue line, also set as default) as the threshold.
# bn_cutoff <- 0.2
# sce_p1 <- binarize_exp(sce_p1, fix_cutoff = TRUE, binarize_cutoff = bn_cutoff)

I-2. Fit logistic regression & estimate switching time

Logistic regression is applied to model the binary states (on or off) of gene expression. Then the switching pseudo-time point is determined by the time at which the fitted line crosses the probability threshold 0.5. We use random downsampling of zero expressions (downsample = TRUE) to rescue the prediction of switching time for genes with high zero inflation.

## fit logistic regression and find the switching pseudo-time point for each gene
## with downsampling. This step takes less than 1 mins
sce_p1 <- find_switch_logistic_fastglm(sce_p1, downsample = TRUE, show_warning = FALSE)

I-3. Visualize ordering of switching genes

First, we filter poorly fitted genes based on zero-expression percentage (>90%), FDR (>0.05) and McFadden’s Pseudo R^2 (<0.03). We can then the number of top best fitting (high McFadden’s Pseudo R^2) genes to plot. One can also extract specific gene type(s) to plot, with provided gene type lists containing surface proteins (downloaded from here) and transcription factors (TFs, downloaded from here). Users are allowed to pass their own gene type lists as a data frame to parameter genelists, with rows as genes (non-duplicated) and two columns with name genenames and genetypes.

## filter top 15 best fitting switching genes among all the genes
sg_allgenes <- filter_switchgenes(sce_p1, allgenes = TRUE, topnum = 15)
## filter top 15 best fitting switching genes among surface proteins and TFs only
sg_gtypes <- filter_switchgenes(sce_p1, allgenes = FALSE, topnum = 20,
                                genelists = gs_genelists, genetype = c("Surface proteins", "TFs"))
## combine switching genes and remove duplicated genes from sg_allgenes
sg_vis <- rbind(sg_gtypes, sg_allgenes[setdiff(rownames(sg_allgenes), rownames(sg_gtypes)),])

Finally, plot the selected genes along the pseudo-timeline. Genes that are switched on are plotted above the line, while those switching off are below the line.

plot_timeline_ggplot(sg_vis, timedata = sce_p1$Pseudotime, txtsize = 3)

It is possible to use the dimensionality reduction provided from the user to visualise the gene expression and logistic regression fitting plots if needed.

plot_gene_exp(sce_p1, gene = "VIM", reduction = "monocleRD", downsample = F)

I-4. Order pathways along the pseudo-timeline

GeneSwitches can be used to order pathways or genesets as well. We include the pathways provided by MSigDB hallmark (Liberzon,A. et al., 2015), C2 curated and C5 gene ontology geneset collections. A Hypergeometric test is first applied to extract the pathways that are significantly overrepresented amongst those that are changing along the trajectory. The Switching time of the pathway is then determined by the median switching time of genes in that pathway.

## filter genes for pathway analysis using r^2 cutoff 0.1
sg_pw <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.1)
## apply hypergeometric test and determine the switching time
switch_pw <- find_switch_pathway(rowData(sce_p1), sig_FDR = 0.05,
                                 pathways = msigdb_h_c2_c5, sg_pw)
## remove redundant pathways
switch_pw_reduce <- reduce_pathways(switch_pw, pathways = msigdb_h_c2_c5, 
                                    redundant_pw_rate = 0.8)

To better visualise the functional changes ridge plots of pathways genes show the density of switching genes along the pseudo-time. Top 10 significantly changed pathways are plotted here, ordered by the switching time.

plot_pathway_density(switch_pw_reduce[1:10,], sg_pw, orderbytime = TRUE)
#> Picking joint bandwidth of 2.49

We can also select specific pathway(s) to plot the switching genes in it. Among top 10 significantly changed pathways, we plot genes related to myogenesis and cardiac muscle tissue development.

sg_vis <- filter_switchgenes(sce_p1, topnum = 50, pathway_name = c("HALLMARK_MYOGENESIS",
                                                                "GO_CARDIAC_MUSCLE_TISSUE_DEVELOPMENT"))
plot_timeline_ggplot(sg_vis, timedata=sce_p1$Pseudotime, txtsize=3)

“Multiple” lables the genes in more than one pathways.

PART II. Comparing switching genes from two trajectories

Before comparison, we need to apply same steps in I-1 and I-2 on the cells from Path2 to identify switching pseudo-time point for each gene.

sce_p2 <- binarize_exp(sce_p2)
sce_p2 <- find_switch_logistic_fastglm(sce_p2, downsample = TRUE, show_warnings = FALSE)

And we filter out poorly fitted genes for both paths using the same cutoff.

sg_p1 <- filter_switchgenes(sce_p1, allgenes = TRUE, r2cutoff = 0.03)
sg_p2 <- filter_switchgenes(sce_p2, allgenes = TRUE, r2cutoff = 0.03)

We then plot common switching genes between two paths to compare their ordering.

sg_com <- common_genes(sg_p1, sg_p2, r2cutoff = 0.4,
                       path1name = "Definitive CM", path2name = "non-contractile")
common_genes_plot(sg_com, timedata = sce_p1$Pseudotime)

More importantly, we can plot the distinct switching genes of the two paths.

sg_disgs <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52,
                           path1name = "Definitive CM", path2name = "non-contractile",
                           path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime)
plot_timeline_ggplot(sg_disgs, timedata = sce_p1$Pseudotime, color_by = "Paths", 
                     iffulltml = FALSE, txtsize = 3)

We can also scale the timelines to be the same length (default number of bins is 100) so that differences are based on percentage of the trajectory covered rather than pseudo-time.

sg_disgs_scale <- distinct_genes(sg_p1, sg_p2, r2cutoff = 0.52, 
                                 path1name = "Definitive CM", path2name = "non-contractile",
                                 path1time = sce_p1$Pseudotime, path2time = sce_p2$Pseudotime, 
                                 scale_timeline = T, bin = 100)
# timedata need to be 1 to (number of bins)
plot_timeline_ggplot(sg_disgs_scale, timedata = 1:100, color_by = "Paths", 
                     iffulltml = FALSE, txtsize = 3)

这两个不同转换基因的图只显示了一系列发生转换事件的伪时间线。这个范围实际上是在轨迹的末端，而共同基因大多处于早期（共同基因图）。

Similarly, we can check the gene expression plots for the two paths.

gn <- "DCN"
p <- plot_gene_exp(sce_p1, gene = gn, reduction = "monocleRD", 
                   downsample = FALSE, fitting = TRUE)
#> Warning: glm.fit: algorithm did not converge

p <- plot_gene_exp(sce_p2, gene = gn, reduction = "monocleRD", 
                   downsample = FALSE, fitting = TRUE)

生活很好，等你超越

追风少年ii

关注

20
点赞
踩
25

收藏

觉得还不错? 一键收藏
2
评论
10X单细胞（10X空间转录组）轨迹分析（拟时分析）之基因开关（GeneSwitches）

10X单细胞（10X空间转录组）轨迹分析（拟时分析）之基因开关（GeneSwitches）
复制链接

扫一扫