Joint analysis of 10X single-cell and spatial transcriptomics data with DAVAE

Latest recommended article published on 2024-07-12 11:00:36

追风少年ii


Article link: https://blog.csdn.net/weixin_53637133/article/details/137966051


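Summary (generated automatically by CSDN): This post introduces DAVAE, a general-purpose method for the joint analysis of 10X single-cell, 10X ATAC and spatial transcriptomics data, aimed at the batch-effect problem in large-scale single-cell data integration. The method uses deep learning with a variational autoencoder to integrate data across samples, technologies and modalities, scales to large datasets, and supports multimodal data integration.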

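Day 10 of quarantine. The loneliness is still here, so keep going and cherish what you have. Everyone makes choices, and once a choice is made, cherish the people in front of you. Right, today's method is the joint analysis of 10X single-cell, 10X ATAC and 10X spatial transcriptomics data. The reference paper is "A versatile and scalable single-cell data integration algorithm based on domain-adversarial and variational approximation", published in Briefings in Bioinformatics in September 2021. The journal's impact factor is about 11, which is already very high for a pure methods paper. Let's look at the principle first and then go through the example code.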

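Over the past decade, single-cell sequencing has become a highly sensitive technology that can quantitatively measure gene expression levels, DNA methylation landscapes, chromatin accessibility and in situ expression at single-cell resolution. Large numbers of single-cell datasets have been generated across different technologies, organisms and modalities, and several large-scale comprehensive single-cell atlases are under construction, covering almost every aspect of biology and complex disease. The challenge is therefore to develop scalable and effective methods that integrate large single-cell datasets across samples, technologies and modalities, and that yield biological insight into cellular heterogeneity, biological states and cell types, cell development and spatial patterns in complex tissues.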

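The main problem in single-cell data integration is removing the various sources of noise, such as batch effects, that get in the way of comparing two or more heterogeneous tissues. Many algorithms have been proposed over the past decade; different algorithms focus on different data types and each has its own strengths.

Reference-based integration algorithms include scmap and scAlign, which transfer annotations from a reference scRNA-seq atlas onto query scRNA-seq data, but these methods cannot predict new cell types. Some methods designed for bulk RNA-seq, including ComBat, RUVseq and limma, can also be used for scRNA-seq integration, but their models make the strong assumption that the cell composition of every batch is the same. Several factor-analysis-based algorithms have also been proposed, such as scMerge, LIGER, SPOTLight and Duren's method, but their high computational cost makes them hard to apply to large-scale datasets.

Deep learning variants, including DCA, scVI, scGen and DESC, integrate scRNA-seq data with autoencoders or variational autoencoders and obtain batch-free cell representations from the bottleneck layer. Because their underlying models are designed specifically for scRNA-seq data, however, these methods may be less effective at aligning single-cell data across modalities. For example, scVI uses a hierarchical Bayesian model to fit count expression data to a zero-inflated negative binomial distribution.

Another effective strategy is based on mutual nearest neighbors (MNN), first used in mnnCorrect to detect pairs of similar cells across scRNA-seq batches. mnnCorrect obtains a batch-correction vector by averaging over many MNN pairs, but because it integrates datasets one after another, the order of the input datasets can lead to a suboptimal solution. Two similar algorithms were inspired by MNN: Seurat 3.0 and Scanorama. Seurat 3.0 uses k-MNN for each cell in a pair of datasets to identify matching pairs, called "anchors", based on cell embeddings reduced by canonical correlation analysis (CCA). Although Seurat can align single-cell data across modalities, it relies on a strategy other than CCA to capture the biological structure of scATAC-seq data. Scanorama uses a generalized mutual-nearest-neighbor matching on SVD-based embeddings to find similar cells across all scRNA-seq datasets rather than within pairs of datasets.

There are also other integration models, such as graph-based models (e.g. BBKNN), clustering-based models (e.g. Harmony, DC3), geometry-based models and multimodal intersection models (e.g. MIA). Among the existing methods above, Seurat 3.0, LIGER, DC3 and Stanley's method can integrate single-cell data across modalities; Duren's method integrates scRNA-seq and scATAC-seq data; SPOTLight and MIA are designed specifically for integrating scRNA-seq and spatial transcriptomics data; all the others apply only to scRNA-seq data.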
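To make the mutual-nearest-neighbor idea described above concrete, here is a minimal sketch in plain Python. It is not the mnnCorrect, Seurat or Scanorama implementation: the function names `k_nearest` and `mutual_nearest_pairs` and the toy 2-D coordinates are made up for illustration, and real tools work on reduced embeddings with much faster neighbor search.

```python
import math

def k_nearest(query, reference, k):
    """For each point in `query`, return the set of indices of its k nearest points in `reference`."""
    result = []
    for q in query:
        order = sorted(range(len(reference)), key=lambda j: math.dist(q, reference[j]))
        result.append(set(order[:k]))
    return result

def mutual_nearest_pairs(batch_a, batch_b, k=1):
    """Return (i, j) pairs where cell i of batch A and cell j of batch B
    are each among the other's k nearest neighbors."""
    a_to_b = k_nearest(batch_a, batch_b, k)
    b_to_a = k_nearest(batch_b, batch_a, k)
    return sorted(
        (i, j)
        for i in range(len(batch_a))
        for j in a_to_b[i]
        if i in b_to_a[j]
    )

# Toy "cells" in a 2-D embedding (hypothetical values).
batch_a = [(0.0, 0.0), (5.0, 5.0), (9.0, 0.0)]
batch_b = [(0.5, 0.2), (5.2, 4.9), (20.0, 20.0)]

pairs = mutual_nearest_pairs(batch_a, batch_b, k=1)
print(pairs)  # [(0, 0), (1, 1)]
```

The third cell of each batch finds a nearest neighbor in the other batch, but that neighbor prefers a different cell, so no pair is formed. This mutual requirement is what keeps cell types present in only one batch from being forced onto unrelated cells.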

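Although the methods above offer many strategies for integrating multiple single-cell datasets, only a few of them support integration across samples, technologies and modalities; few have shown the ability to integrate paired multimodal data, and most do not scale to large datasets. To address these limitations, the authors propose a general and scalable method that supports the following integration tasks: (i) integrating multiple scRNA-seq datasets into an atlas reference; (ii) transferring labels from well-characterized scRNA-seq data to scATAC-seq data and spatially resolved transcriptomes; (iii) integrating multimodal single-cell data; and (iv) integrating large-scale datasets.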

Overview of DAVAE

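The problem considered here is integrating multiple scRNA-seq datasets, and multiple single-cell datasets across modalities. To solve it, the authors propose a general framework, the domain-adversarial and variational autoencoder (DAVAE), which fits normalized gene expression (or chromatin accessibility) to a nonlinear model that maps a latent variable z into expression space using a nonlinear function, a KL regularizer and a domain-adversarial regularizer. As the overview figure in the paper shows, DAVAE performs this regression with a deep multilayer perceptron architecture made up of a variational approximation network, a generative Bayesian neural network and a domain-adversarial classifier. The deep neural network makes it possible to learn the regression model efficiently from large-scale datasets. The latent factors in the shared low-dimensional space can then be used for clustering, trajectory inference, cross-modal transfer learning and many other downstream integrative analyses.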
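The KL regularizer mentioned above has a simple closed form in the usual VAE setup, where the approximate posterior is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes that setup. It is not taken from the DAVAE source code, and the function names are illustrative only.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

rng = random.Random(2021)

# Hypothetical encoder outputs for one cell with a 2-D latent space.
mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]

z = sample_latent(mu, log_var, rng)           # one stochastic latent sample
kl = kl_to_standard_normal(mu, log_var)       # penalty for deviating from the prior
kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals prior

print(round(kl, 4), kl_zero)
```

The penalty is zero only when the posterior matches the prior exactly, and grows as the mean moves away from zero or the variance moves away from one. In a full model this term would be added to a reconstruction loss and, for DAVAE, to the domain-adversarial term.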

Example code

Integrating multiple scRNA-seq data

Importing scbean package

import scbean.model.davae as davae
import scbean.tools.utils as tl
import scanpy as sc
import matplotlib
from numpy.random import seed
seed(2021)
matplotlib.use('TkAgg')

# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

Loading data

base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"
file1 = base_path+"293t/hg19/"
file2 = base_path+"jurkat/hg19/"
file3 = base_path+"mixed/hg19/"

adata_b1 = tl.read_sc_data(file1, fmt='10x_mtx', batch_name="293t")
adata_b2 = tl.read_sc_data(file2, fmt='10x_mtx', batch_name="jurkat")
adata_b3 = tl.read_sc_data(file3, fmt='10x_mtx', batch_name="mixed")

Or, if the data are stored as .h5ad files:

base_path = "/Users/zhongyuanke/data/vipcca/mixed_cell_lines/"

adata_b1 = tl.read_sc_data(base_path+"293t.h5ad", batch_name="293t")
adata_b2 = tl.read_sc_data(base_path+"jurkat.h5ad", batch_name="jurkat")
adata_b3 = tl.read_sc_data(base_path+"mixed.h5ad", batch_name="mixed")

Data preprocessing

Here, we filter and normalize each dataset separately and concatenate them into one AnnData object.

adata_all = tl.davae_preprocessing([adata_b1, adata_b2, adata_b3], index_unique="-")
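The steps inside `tl.davae_preprocessing` are not shown in this post. As a rough picture of what "normalize" usually means for count data, the sketch below applies library-size scaling followed by log1p to a toy count matrix in plain Python. The target sum of 10,000 and the function name `normalize_log1p` are assumptions for illustration, not scbean defaults.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(c * scale) for c in cell])
    return normalized

# Two toy cells with very different sequencing depth (hypothetical counts).
counts = [
    [10, 0, 90],
    [1, 1, 2],
]
norm = normalize_log1p(counts)

for row in norm:
    print([round(v, 3) for v in row])
```

After scaling, the two cells are comparable even though the first was sequenced far more deeply, and zero counts stay at zero.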

DAVAE Integration

# Reduce TensorFlow log output (most effective when set before TensorFlow is first imported)
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

adata_integrate = davae.fit_integration(
    adata_all,
    batch_num=3,
    domain_lambda=2.0,
    epochs=25,
    sparse=True,
    hidden_layers=[64, 32, 6]
)

1. The metadata of each cell is saved in adata_integrate.obs
2. The DAVAE embedding of each cell is saved in adata_integrate.obsm['X_davae']

UMAP Visualization

import umap
adata_integrate.obsm['X_umap']=umap.UMAP().fit_transform(adata_integrate.obsm['X_davae'])
sc.pl.umap(adata_integrate, color=['_batch', 'celltype'], s=3)

Integrating spatial data

Importing scbean package

import scbean.model.davae as davae
from scbean.tools import utils as tl
import scanpy as sc
import matplotlib.pyplot as plt
import matplotlib
from numpy.random import seed
seed(2021)
matplotlib.use('TkAgg')

# Command for Jupyter notebooks only
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')

DAVAE integration of two spatial gene expression data

base_path = '/Users/zhongyuanke/data/'
file1_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/'
file2_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Posterior/'
file1 = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.h5'
file2 = base_path+'spatial/mouse_brain/10x_mouse_brain_Posterior/V1_Mouse_Brain_Sagittal_Posterior_filtered_feature_bc_matrix.h5'

adata_spatial_anterior = sc.read_visium(file1_spatial, count_file=file1)
adata_spatial_posterior = sc.read_visium(file2_spatial, count_file=file2)
adata_spatial_anterior.var_names_make_unique()
adata_spatial_posterior.var_names_make_unique()

Data preprocessing

Here, we filter and normalize each dataset separately and concatenate them into one AnnData object.

adata_spatial = tl.spatial_preprocessing([adata_spatial_anterior, adata_spatial_posterior])

DAVAE integration

adata_integrate = davae.fit_integration(
    adata_spatial,
    epochs=25,
    split_by='loss_weight',
    hidden_layers=[128, 64, 32, 5],
    sparse=True,
    domain_lambda=0.5,
)
adata_spatial.obsm["X_davae"] = adata_integrate.obsm['X_davae']

UMAP visualization and clustering

sc.set_figure_params(facecolor="white", figsize=(5, 4))
sc.pp.neighbors(adata_spatial, use_rep='X_davae', n_neighbors=12)
sc.tl.umap(adata_spatial)
sc.tl.louvain(adata_spatial, key_added="clusters")
sc.pl.umap(adata_spatial, color=['library_id', "clusters"],
           size=8, color_map='Set2', frameon=False)

Visualization in spatial coordinates

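# Map each cluster label to its color; using the category list avoids hard-coding the cluster count
clusters_colors = dict(
    zip(adata_spatial.obs["clusters"].cat.categories, adata_spatial.uns["clusters_colors"])
)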
fig, axs = plt.subplots(1, 2, figsize=(10, 6))

for i, library in enumerate(
    ["V1_Mouse_Brain_Sagittal_Anterior", "V1_Mouse_Brain_Sagittal_Posterior"]
):
    ad = adata_spatial[adata_spatial.obs.library_id == library, :].copy()
    sc.pl.spatial(
        ad,
        img_key="hires",
        library_id=library,
        color="clusters",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in ad.obs.clusters.unique().tolist()
        ],
        legend_loc=None,
        show=False,
        ax=axs[i],
    )

plt.tight_layout()
plt.show()
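The `palette` argument in the loop above keeps only the colors of clusters that actually occur in each tissue section, so that a given cluster gets the same color in both panels. A toy version of that filtering, with made-up cluster labels and hex colors:

```python
# Hypothetical global color table: one color per cluster label.
all_colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]
clusters_colors = dict(zip([str(i) for i in range(len(all_colors))], all_colors))

# Cluster labels of the spots in one section; clusters "1" and "3" are absent here.
clusters_in_section = ["0", "2", "2", "0"]
present = set(clusters_in_section)

# Keep colors in the global order, skipping clusters missing from this section.
palette = [color for label, color in clusters_colors.items() if label in present]

print(palette)  # ['#1f77b4', '#2ca02c']
```

Passing the full color list instead would assign colors by position within each section, and the same cluster could then appear in different colors in the two panels.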

DAVAE integration of spatial gene expression and scRNA-seq data (joint single-cell and spatial analysis)

import pandas as pd
from sklearn.metrics.pairwise import cosine_distances
import numpy as np
base_path = '/Users/zhongyuanke/data/'
file_rna = base_path+'spatial/mouse_brain/adata_processed_sc.h5ad'
adata_rna = sc.read_h5ad(file_rna)
file1 = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/V1_Mouse_Brain_Sagittal_Anterior_filtered_feature_bc_matrix.h5'
file1_spatial = base_path+'spatial/mouse_brain/10x_mouse_brain_Anterior/'
adata_spatial_anterior = sc.read_visium(file1_spatial, count_file=file1)
adata_spatial_anterior.var_names_make_unique()
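# Keep spots with a spatial y coordinate below 6000; copy so later assignments do not act on a view
adata_spatial_anterior = adata_spatial_anterior[
    adata_spatial_anterior.obsm["spatial"][:, 1] < 6000, :
].copy()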

Preprocessing

adata_all = tl.spatial_rna_preprocessing(
    adata_spatial_anterior,
    adata_rna,
)

DAVAE integration

adata_integrate = davae.fit_integration(
    adata_all,
    epochs=40,
    batch_size=128,
    domain_lambda=2.5,
    sparse=True,
    hidden_layers=[128, 64, 32, 10]
)

Calculate distance

len_anterior = adata_spatial_anterior.shape[0]
len_rna = adata_rna.shape[0]
davae_emb = adata_integrate.obsm['X_davae']

adata_spatial_anterior.obsm["davae_embedding"] = davae_emb[0:len_anterior, :]
adata_rna.obsm['davae_embedding'] = davae_emb[len_anterior:len_rna+len_anterior, :]

distances_anterior = 1 - cosine_distances(
    adata_rna.obsm["davae_embedding"],
    adata_spatial_anterior.obsm['davae_embedding'],
)
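Note that `1 - cosine_distances(...)` is a cosine similarity, so despite the variable name, larger values in `distances_anterior` mean more similar embeddings. The matrix has one row per scRNA-seq cell and one column per spatial spot. The plain-Python sketch below shows how the quantity behaves on toy vectors; the helper name `cosine_similarity` is made up here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, which equals 1 minus the cosine distance."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])        # 0
opposite = cosine_similarity([1.0, 1.0], [-1.0, -1.0])        # close to -1

print(round(same_direction, 6), orthogonal, round(opposite, 6))
```

Only the direction of the embedding vectors matters, not their length, which is why a cell and a spot can score as highly similar even if their embeddings differ in magnitude.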

Transfer label

def label_transfer(dist, labels):
    lab = pd.get_dummies(labels).to_numpy().T
    class_prob = lab @ dist
    norm = np.linalg.norm(class_prob, 2, axis=0)
    class_prob = class_prob / norm
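    # Min-max scale each class across spots; np.ptp replaces ndarray.ptp, which was removed in NumPy 2.0
    class_prob = (class_prob.T - class_prob.min(1)) / np.ptp(class_prob, axis=1)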
    return class_prob

class_prob_anterior = label_transfer(distances_anterior, adata_rna.obs.cell_subclass)
cp_anterior_df = pd.DataFrame(
    class_prob_anterior,
    columns=np.sort(adata_rna.obs.cell_subclass.unique())
)
cp_anterior_df.index = adata_spatial_anterior.obs.index
adata_anterior_transfer = adata_spatial_anterior.copy()
adata_anterior_transfer.obs = pd.concat(
    [adata_spatial_anterior.obs, cp_anterior_df],
    axis=1
)
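To see what `label_transfer` computes, the following plain-Python sketch walks through the same three steps on a tiny example: sum each spot's similarity to the cells of every class, L2-normalize across classes for each spot, then min-max scale each class across spots. The function name `label_transfer_toy`, the labels and the similarity values are made up for illustration, and the sketch assumes every class score varies across spots.

```python
import math

def label_transfer_toy(similarity, labels):
    """similarity[c][s] is the similarity between labeled cell c and spot s.
    Returns the sorted class names and a spots-by-classes score table."""
    classes = sorted(set(labels))
    n_spots = len(similarity[0])

    # Step 1: per class, sum the similarities of its cells to each spot.
    scores = [
        [sum(row[s] for row, lab in zip(similarity, labels) if lab == cls)
         for s in range(n_spots)]
        for cls in classes
    ]

    # Step 2: for each spot, L2-normalize its scores across classes.
    for s in range(n_spots):
        norm = math.sqrt(sum(scores[k][s] ** 2 for k in range(len(classes))))
        for k in range(len(classes)):
            scores[k][s] /= norm

    # Step 3: for each class, min-max scale across spots and transpose.
    scaled = [[0.0] * len(classes) for _ in range(n_spots)]
    for k, row in enumerate(scores):
        low, high = min(row), max(row)
        for s in range(n_spots):
            scaled[s][k] = (row[s] - low) / (high - low)
    return classes, scaled

labels = ["L4", "L4", "L5 PT"]      # three labeled cells
similarity = [                      # rows: cells, columns: three spots
    [0.9, 0.5, 0.1],
    [0.8, 0.5, 0.2],
    [0.1, 0.5, 0.9],
]
classes, scaled = label_transfer_toy(similarity, labels)

print(classes)
for spot_scores in scaled:
    print([round(v, 3) for v in spot_scores])
```

The first spot scores highest for "L4" and the last for "L5 PT", while the middle spot gets intermediate values for both. Because of the final min-max step, these are relative scores per class rather than calibrated probabilities.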

Visualize the neuronal cortical layers

sc.set_figure_params(facecolor="white", figsize=(2, 2))
sc.pl.spatial(
    adata_anterior_transfer,
    img_key="hires",
    color=["L2/3 IT", "L4", "L5 PT", "L6 CT"],
    size=1.5,
    color_map='Blues',
)