Cell composition change analysis for 10X single-cell (10X spatial transcriptomics) data with scCODA

追风少年ii

Published 2024-05-11 09:23:41

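Hello, it's Friday again. I've been busy and haven't posted for a while, but the anxiety of being 29 and on the doorstep of thirty is still with me, and I don't know how to ease it. Anyway, in this post I want to share a new method, a Bayesian model for compositional single-cell data analysis. The paper is "scCODA is a Bayesian model for compositional single-cell data analysis", published in Nature Communications (NC) in December 2021, and the method is quite good.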

Research background

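Changes in cell-type composition are a major driver of biological processes. Because the data are compositional and sample sizes are small, these changes are hard to detect in single-cell experiments.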

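Recent advances in single-cell RNA sequencing (scRNA-seq) allow large-scale quantitative transcriptional profiling of individual cells across a wide range of tissues. This makes it possible to monitor transcriptional changes between conditions or developmental stages and to identify distinct cell types in a data-driven way.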

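Although shifts in cell-type composition are important drivers of biological processes such as disease, development, aging, and immunity, detecting them with scRNA-seq is not trivial. Statistical tests have to account for several sources of technical and methodological limitation, including the small number of experimental replicates. In most single-cell technologies the total number of cells per sample is restricted, which means that cell-type counts are proportional in nature. This in turn leads to a negative bias in estimated correlations between cell types. For example, if only one specific cell type is depleted after a perturbation, the relative frequencies of all other cell types rise. Taken at face value, this would inflate the number of cell types that appear to differ. Standard univariate statistical models that test each cell type independently may therefore report some population shifts as real effects even though they are caused only by the inherent negative correlation of cell-type proportions. Yet the statistical methods commonly applied to cell-type composition analysis ignore this effect.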
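The depletion example can be made concrete with a few lines of Python. This is only an illustrative sketch: the cell types "A", "B", "C" and their numbers are hypothetical and are not taken from the paper.

```python
# Hypothetical absolute cell numbers before and after a perturbation.
# Only cell type "A" is depleted; "B" and "C" do not change in absolute terms.
before = {"A": 500, "B": 300, "C": 200}
after = {"A": 100, "B": 300, "C": 200}


def proportions(counts):
    """Convert absolute counts to relative frequencies that sum to one."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


prop_before = proportions(before)
prop_after = proportions(after)

for cell_type in before:
    print(f"{cell_type}: {prop_before[cell_type]:.2f} -> {prop_after[cell_type]:.2f}")
```

Although "B" and "C" keep the same absolute numbers, their proportions rise from 0.30 to 0.50 and from 0.20 to 0.33. A test that looks at each cell type on its own would flag all three as changed.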

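To account for the inherent bias in cell-type compositions, the authors draw on compositional analysis methods for microbiome data and propose a Bayesian approach for differential abundance analysis of cell-type compositions that also addresses the low-replicate problem. The single-cell compositional data analysis (scCODA) framework models cell-type counts with a hierarchical Dirichlet-Multinomial distribution, which accounts for the uncertainty in cell-type proportions and for the negative correlation bias by modeling all measured cell-type proportions jointly rather than individually. The model uses a Logit-normal spike-and-slab prior with a log-link function to estimate the effects of binary (or continuous) covariates on cell-type proportions in a parsimonious way. Because compositional analysis always needs a reference against which compositional changes can be identified, scCODA can either select a suitable cell type as the reference automatically or use a pre-specified reference cell type. This means that credible changes detected by scCODA must be interpreted relative to the chosen reference. In addition, the framework provides access to other well-established compositional test statistics and is fully integrated into the Scanpy pipeline.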
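To get a feeling for why a Dirichlet-Multinomial model captures the negative correlation, the following toy simulation draws proportions from a Dirichlet distribution and then counts from a multinomial with a fixed number of cells. It is a sketch using only the standard library, not scCODA's model code; the concentration parameters, the number of cells, and the helper names are all hypothetical.

```python
import random

random.seed(0)


def sample_dirichlet(alpha):
    """Draw one proportion vector from a Dirichlet distribution via gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_counts(proportions, n_cells):
    """Draw multinomial cell-type counts for one sample with a fixed cell total."""
    picks = random.choices(range(len(proportions)), weights=proportions, k=n_cells)
    return [picks.count(i) for i in range(len(proportions))]


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


alpha = [5.0, 5.0, 5.0]  # hypothetical concentration parameters, three cell types
samples = [sample_counts(sample_dirichlet(alpha), n_cells=200) for _ in range(500)]

type_0 = [s[0] for s in samples]
type_1 = [s[1] for s in samples]
corr_01 = pearson(type_0, type_1)
print(f"correlation between cell types 0 and 1: {corr_01:.2f}")
```

No biological effect is simulated here, yet the counts of any two cell types come out negatively correlated because every sample has the same total number of cells.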

Code example

Challenges of analyzing cell proportions in single-cell data

scRNA-seq population data is compositional. This must be considered to avoid an inflation of false-positive results.
Most datasets consist only of very few samples, making frequentist tests inaccurate.
A condition usually only affects a fraction of cell types. Therefore, sparse effects are preferable.
The scCODA model overcomes all these limitations in a fully Bayesian model that outperforms other compositional and non-compositional methods. (The software is a Python package.)

scCODA - Compositional analysis of single-cell data

# Setup
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt

from sccoda.util import comp_ana as mod
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz

import sccoda.datasets as scd

Load data

# Load data

cell_counts = scd.haber()

print(cell_counts)

            Mouse  Endocrine  Enterocyte  Enterocyte.Progenitor  Goblet  Stem
0       Control_1         36          59                    136      36   239
1       Control_2          5          46                     23      20    50
2       Control_3         45          98                    188     124   250
3       Control_4         26         221                    198      36   131
4  H.poly.Day10_1         42          71                    203     147   271
5  H.poly.Day10_2         40          57                    383     170   321
6   H.poly.Day3_1         52          75                    347      66   323
7   H.poly.Day3_2         65         126                    115      33    65
8          Salm_1         37         332                    113      59    90
9          Salm_2         32         373                    116      67   117

Preprocessing

# Convert data to anndata object
data_all = dat.from_pandas(cell_counts, covariate_columns=["Mouse"])

# Extract condition from mouse name and add it as an extra column to the covariates
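# regex=True is required: since pandas 2.0, str.replace treats the pattern literally by default
data_all.obs["Condition"] = data_all.obs["Mouse"].str.replace(r"_[0-9]", "", regex=True)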
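What the pattern `_[0-9]` does to the mouse names can be checked in isolation with the standard `re` module. This is a small sketch on a handful of names copied from the table above; it also shows that a literal (non-regex) replacement leaves the names unchanged, which is what pandas `str.replace` does unless `regex=True` is passed (the default since pandas 2.0).

```python
import re

# A few mouse names copied from the count table.
mouse_names = ["Control_1", "H.poly.Day10_2", "H.poly.Day3_1", "Salm_2"]

# Regex replacement: strip the "_<digit>" replicate suffix to get the condition.
conditions = [re.sub(r"_[0-9]", "", name) for name in mouse_names]
print(conditions)

# Literal replacement: the text "_[0-9]" never occurs, so nothing is removed.
literal = [name.replace("_[0-9]", "") for name in mouse_names]
print(literal)
```

The first list contains the conditions `Control`, `H.poly.Day10`, `H.poly.Day3`, and `Salm`; the second list is identical to the input.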

For our first example, we want to look at how the Salmonella infection influences the cell composition. Therefore, we subset our data.

# Select control and salmonella data
data_salm = data_all[data_all.obs["Condition"].isin(["Control", "Salm"])]
viz.boxplots(data_salm, feature_name="Condition")
plt.show()

Model setup and inference

We can now create the model and run inference on it. Creating a sccoda.util.comp_ana.CompositionalAnalysis class object sets up the compositional model and prepares everything for parameter inference. It needs the following information:

The data object from above.
The formula parameter. It specifies how the covariates are used in the model. It can process R-style formulas via the patsy package, e.g. formula="Cov1 + Cov2 + Cov3". Here, we simply use the “Condition” covariate of our dataset.
The reference_cell_type parameter is used to specify a cell type that is believed to be unchanged by the covariates in formula. This is necessary because compositional analysis must always be performed relative to a reference (see Büttner, Ostner et al., 2021 for a more thorough explanation). If no knowledge about such a cell type exists prior to the analysis, taking a cell type that has a nearly constant relative abundance over all samples is often a good choice. It is also possible to let scCODA find a suitable reference cell type by using reference_cell_type="automatic". Here, we take Goblet cells as the reference.

model_salm = mod.CompositionalAnalysis(data_salm, formula="Condition", reference_cell_type="Goblet")
sim_results = model_salm.sample_hmc()

Result interpretation

sim_results.summary()

Compositional Analysis summary:

Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition

Intercepts:
                       Final Parameter  Expected Sample
Cell Type
Endocrine                        1.102        34.068199
Enterocyte                       2.328       116.089840
Enterocyte.Progenitor            2.523       141.085258
Goblet                           1.753        65.324318
Stem                             2.705       169.247878
TA                               2.113        93.631267
TA.Early                         2.861       197.821355
Tuft                             0.449        17.731884


Effects:
                                         Final Parameter  Expected Sample  \
Covariate         Cell Type
Condition[T.Salm] Endocrine                       0.0000        24.315528
                  Enterocyte                      1.3571       321.891569
                  Enterocyte.Progenitor           0.0000       100.696915
                  Goblet                          0.0000        46.623988
                  Stem                            0.0000       120.797449
                  TA                              0.0000        66.827533
                  TA.Early                        0.0000       141.191224
                  Tuft                            0.0000        12.655794

                                         log2-fold change
Covariate         Cell Type
Condition[T.Salm] Endocrine                     -0.486548
                  Enterocyte                     1.471333
                  Enterocyte.Progenitor         -0.486548
                  Goblet                        -0.486548
                  Stem                          -0.486548
                  TA                            -0.486548
                  TA.Early                      -0.486548
                  Tuft                          -0.486548

Intercepts

The first column of the intercept summary shows the parameters determined by the MCMC inference.

The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.
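The relationship between the two columns can be reproduced by hand. The sketch below assumes that the expected sample is the softmax of the intercepts scaled by the mean sampling depth, and that this depth is 835 cells (inferred from the sum of the “Expected Sample” column, not reported directly by scCODA). Under these assumptions the printed values are recovered up to rounding.

```python
import math

# "Final Parameter" values copied from the intercept summary above.
intercepts = {
    "Endocrine": 1.102,
    "Enterocyte": 2.328,
    "Enterocyte.Progenitor": 2.523,
    "Goblet": 1.753,
    "Stem": 2.705,
    "TA": 2.113,
    "TA.Early": 2.861,
    "Tuft": 0.449,
}

# Assumption: mean sampling depth, inferred from the sum of "Expected Sample".
mean_depth = 835

# Softmax over the intercepts gives expected proportions; scale them to cell counts.
weights = {ct: math.exp(a) for ct, a in intercepts.items()}
total_weight = sum(weights.values())
expected = {ct: mean_depth * w / total_weight for ct, w in weights.items()}

for ct, n in expected.items():
    print(f"{ct:<22} {n:8.2f}")
```

Because of the exponential, a difference of 1 in the parameter corresponds to a factor of about 2.7 in expected cell numbers, which is why the raw parameters are hard to read without this column.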

Effects

For the effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types. Most important is the distinction between zero and non-zero entries. A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.

Since the numerical values of the “Final parameter” columns are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea of the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log2-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected will change in abundance as well as soon as a credible effect is detected on another cell type, due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, and therefore the log2-fold change is 0.
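The same calculation can be traced by hand for the Salmonella effect. As before, this is a sketch under assumptions: the expected sample is taken to be a softmax of intercept plus effect scaled by a mean sampling depth of 835 cells (inferred from the column sums). The parameter values are copied from the summary above.

```python
import math

# Intercepts and Condition[T.Salm] effects copied from the summary above.
intercepts = {
    "Endocrine": 1.102, "Enterocyte": 2.328, "Enterocyte.Progenitor": 2.523,
    "Goblet": 1.753, "Stem": 2.705, "TA": 2.113, "TA.Early": 2.861, "Tuft": 0.449,
}
effects = {ct: 0.0 for ct in intercepts}
effects["Enterocyte"] = 1.3571  # the only credible effect at the default FDR

mean_depth = 835  # assumption: inferred from the sum of "Expected Sample"


def expected_sample(params):
    """Softmax of the parameters, scaled to the mean sampling depth."""
    weights = {ct: math.exp(p) for ct, p in params.items()}
    total = sum(weights.values())
    return {ct: mean_depth * w / total for ct, w in weights.items()}


control = expected_sample(intercepts)
salm = expected_sample({ct: intercepts[ct] + effects[ct] for ct in intercepts})
log2_fc = {ct: math.log2(salm[ct] / control[ct]) for ct in intercepts}

for ct in intercepts:
    print(f"{ct:<22} {salm[ct]:8.2f} {log2_fc[ct]:8.3f}")
```

Only Enterocytes have a non-zero parameter, but every other cell type ends up with the same negative log2-fold change of about -0.49, which is the sum-to-one constraint at work.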

Interpretation

In the salmonella case, we see only a credible increase of Enterocytes, while all other cell types are unaffected by the disease. The log2-fold change of Enterocytes between control and infected samples with the same total cell count lies at about 1.47.

Adjusting the False discovery rate

scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries.

The desired FDR level can be easily set after inference via sim_results.set_fdr(). By default, the value is 0.05, but we recommend increasing it if no effects are found at a more conservative level.

In our example, setting a desired FDR of 0.4 reveals effects on Endocrine, Enterocyte, and Stem cells.

sim_results.set_fdr(est_fdr=0.4)
sim_results.summary()

Compositional Analysis summary (extended):

Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Spike-and-slab threshold: 0.434

MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 79.348 sec. Acceptance rate: 51.9%

Intercepts:
                       Final Parameter  HDI 3%  HDI 97%     SD  \
Cell Type
Endocrine                        1.102   0.363    1.740  0.369
Enterocyte                       2.328   1.694    2.871  0.314
Enterocyte.Progenitor            2.523   1.904    3.088  0.320
Goblet                           1.753   1.130    2.346  0.330
Stem                             2.705   2.109    3.285  0.318
TA                               2.113   1.459    2.689  0.332
TA.Early                         2.861   2.225    3.378  0.307
Tuft                             0.449  -0.248    1.207  0.394

                       Expected Sample
Cell Type
Endocrine                    34.068199
Enterocyte                  116.089840
Enterocyte.Progenitor       141.085258
Goblet                       65.324318
Stem                        169.247878
TA                           93.631267
TA.Early                    197.821355
Tuft                         17.731884


Effects:
                                         Final Parameter  HDI 3%  HDI 97%  \
Covariate         Cell Type
Condition[T.Salm] Endocrine                     0.327533  -0.506    1.087
                  Enterocyte                    1.357100   0.886    1.872
                  Enterocyte.Progenitor         0.000000  -0.395    0.612
                  Goblet                        0.000000   0.000    0.000
                  Stem                         -0.240268  -0.827    0.168
                  TA                            0.000000  -0.873    0.252
                  TA.Early                      0.000000  -0.464    0.486
                  Tuft                          0.000000  -1.003    0.961

                                            SD  Inclusion probability  \
Covariate         Cell Type
Condition[T.Salm] Endocrine              0.338               0.457133
                  Enterocyte             0.276               0.998400
                  Enterocyte.Progenitor  0.163               0.338200
                  Goblet                 0.000               0.000000
                  Stem                   0.219               0.434800
                  TA                     0.220               0.364000
                  TA.Early               0.128               0.284733
                  Tuft                   0.319               0.392533

                                         Expected Sample  log2-fold change
Covariate         Cell Type
Condition[T.Salm] Endocrine                    34.413767          0.014560
                  Enterocyte                  328.331183          1.499910
                  Enterocyte.Progenitor       102.711411         -0.457971
                  Goblet                       47.556726         -0.457971
                  Stem                         96.897648         -0.804604
                  TA                           68.164454         -0.457971
                  TA.Early                    144.015830         -0.457971
                  Tuft                         12.908980         -0.457971
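The spike-and-slab threshold of 0.434 in this summary can be reproduced from the inclusion probabilities with a simple rule: pick the smallest cutoff for which the average of (1 - inclusion probability) over the selected effects falls below the desired FDR. The sketch below is an assumption about how such a cutoff can be derived, not a copy of scCODA's code, and the function name `fdr_threshold` is made up for illustration.

```python
import math

# Inclusion probabilities copied from the extended summary above.
# Goblet is the reference cell type (probability 0) and is left out.
inclusion_prob = {
    "Endocrine": 0.457133,
    "Enterocyte": 0.998400,
    "Enterocyte.Progenitor": 0.338200,
    "Stem": 0.434800,
    "TA": 0.364000,
    "TA.Early": 0.284733,
    "Tuft": 0.392533,
}


def fdr_threshold(probs, target_fdr):
    """Smallest cutoff whose estimated FDR among selected effects is below target."""
    for cutoff in sorted(set(probs)):
        selected = [p for p in probs if p >= cutoff]
        estimated_fdr = sum(1 - p for p in selected) / len(selected)
        if estimated_fdr < target_fdr:
            # Round the cutoff down to three decimals, as in the printed summary.
            return math.floor(cutoff * 1000) / 1000, estimated_fdr
    return 1.0, 0.0


threshold, est_fdr = fdr_threshold(list(inclusion_prob.values()), target_fdr=0.4)
credible = sorted(ct for ct, p in inclusion_prob.items() if p > threshold)
print(threshold, round(est_fdr, 3), credible)

strict_threshold, _ = fdr_threshold(list(inclusion_prob.values()), target_fdr=0.05)
strict = sorted(ct for ct, p in inclusion_prob.items() if p > strict_threshold)
print(strict_threshold, strict)
```

With a target of 0.4 this yields the cutoff 0.434 and selects Endocrine, Enterocyte, and Stem, matching the non-zero entries in the effects table. With the default of 0.05 only Enterocyte remains, as in the first summary.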

Data visualization

# Stacked barplot for each sample
viz.stacked_barplot(data_all, feature_name="samples")
plt.show()

# Stacked barplot for the levels of "Condition"
viz.stacked_barplot(data_all, feature_name="Condition")
plt.show()

# Grouped boxplots. No facets, relative abundance, no dots.
viz.boxplots(
    data_all,
    feature_name="Condition",
    plot_facets=False,
    y_scale="relative",
    add_dots=False,
)
plt.show()

# Grouped boxplots. Facets, log scale, added dots and custom color palette.
viz.boxplots(
    data_all,
    feature_name="Condition",
    plot_facets=True,
    y_scale="log",
    add_dots=True,
    cmap="Reds",
)
plt.show()

Finding a reference cell type

The scCODA model requires a cell type to be set as the reference category. However, choosing this cell type is often difficult. A good first choice is a reference cell type that closely preserves the changes in relative abundance during the compositional analysis.

For this, it is important that the reference cell type is not rare, to avoid large relative changes being caused by small absolute changes. Also, the relative abundance of the reference should vary as little as possible across all samples.

The visualization viz.rel_abundance_dispersion_plot shows the presence (share of non-zero samples) over all samples for each cell type versus its dispersion in relative abundance. Cell types that have a higher presence than a certain threshold (default 0.9) are suitable candidates for the reference and thus colored.

viz.rel_abundance_dispersion_plot(
    data=data_all,
    abundant_threshold=0.9
)
plt.show()
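The two quantities in this plot can also be computed by hand to shortlist reference candidates. The sketch below uses a hypothetical count table and takes the variance-to-mean ratio of the relative abundance as the dispersion measure; that choice is an assumption for illustration and may differ from what scCODA plots.

```python
# Hypothetical count table: rows are samples, columns are cell types.
cell_types = ["A", "B", "C"]
counts = [
    [100, 50, 0],
    [110, 20, 5],
    [90, 80, 0],
    [105, 35, 10],
]


def presence_and_dispersion(counts, cell_types):
    """Per cell type: share of non-zero samples and dispersion of relative abundance."""
    n_samples = len(counts)
    relative = [[c / sum(row) for c in row] for row in counts]
    stats = {}
    for j, cell_type in enumerate(cell_types):
        presence = sum(row[j] > 0 for row in counts) / n_samples
        values = [row[j] for row in relative]
        mean = sum(values) / n_samples
        variance = sum((v - mean) ** 2 for v in values) / n_samples
        # Assumption: variance-to-mean ratio as the dispersion measure.
        stats[cell_type] = {"presence": presence, "dispersion": variance / mean}
    return stats


stats = presence_and_dispersion(counts, cell_types)
abundant_threshold = 0.9
candidates = [ct for ct in cell_types if stats[ct]["presence"] > abundant_threshold]
best = min(candidates, key=lambda ct: stats[ct]["dispersion"])
print(candidates, best)
```

Cell type "C" is missing from half of the samples and is excluded; among the remaining candidates, "A" varies least in relative abundance and would be the natural reference in this toy table.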

Diagnostics and plotting

Just as the summary dataframes are compatible with arviz, the result class itself is an extension of arviz’s InferenceData class. This means that we can use all its MCMC diagnostic and plotting functionality. As an example, looking at the MCMC trace plots and kernel density estimates, we see that they are indicative of a well-sampled MCMC chain:

Note: Due to the spike-and-slab priors, the beta parameters have many values at 0, which looks like a convergence issue, but is actually not.
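Why a spike-and-slab parameter piles up at zero can be seen in a toy simulation. This is not scCODA's sampler; the inclusion probability and the slab mean and standard deviation are hypothetical values chosen for illustration.

```python
import random

random.seed(1)


def spike_and_slab_draws(n_draws, inclusion_prob, slab_mean, slab_sd):
    """Each draw is exactly 0 (spike) or a normal value (slab)."""
    draws = []
    for _ in range(n_draws):
        if random.random() < inclusion_prob:
            draws.append(random.gauss(slab_mean, slab_sd))
        else:
            draws.append(0.0)
    return draws


chain = spike_and_slab_draws(10_000, inclusion_prob=0.4, slab_mean=0.5, slab_sd=0.2)

n_zero = chain.count(0.0)
share_nonzero = 1 - n_zero / len(chain)
print(f"draws at exactly zero: {n_zero}")
print(f"share of non-zero draws: {share_nonzero:.3f}")
```

A trace of such a chain shows a flat line at zero interrupted by excursions into the slab. The share of non-zero draws plays the role of an inclusion probability, so the mass at zero is part of the posterior rather than a sign of a stuck chain.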

Caution: Trying to plot a kernel density estimate for an effect on the reference cell type results in an error, since it is constant at 0 for the entire chain. To avoid this, add coords={"cell_type": sim_results.posterior.coords["cell_type_nb"]} as an argument to az.plot_trace, which causes the plots for the reference cell type to be skipped.

import arviz as az

# sim_results is the result object returned by model_salm.sample_hmc() above
az.plot_trace(
    sim_results,
    divergences=False,
    var_names=["alpha", "beta"],
    coords={"cell_type": sim_results.posterior.coords["cell_type_nb"]},
)
plt.show()