3.基于python的scRNA-seq细胞类型注释-手动注释

最新推荐文章于 2024-06-17 17:35:06 发布

tzc_fly

最新推荐文章于 2024-06-17 17:35:06 发布

阅读量751

点赞数 26

分类专栏：生物计算工具文章标签： python

本文链接：https://blog.csdn.net/qq_40943760/article/details/139003206

版权

生物计算工具专栏收录该内容

49 篇文章 62 订阅

订阅专栏

在生物学上，不同的细胞具有不同的表型，细胞的表面抗原，或者特定的表达基因会有所区别。例如在B细胞中，我们可以根据细胞是否表达浆细胞表面抗原，来识别B细胞中的浆细胞。此外，在发育的谱系中，例如内胚层细胞，其也有着与外胚层细胞不同的标志基因。

不过，通过marker gene对细胞类型进行分配有一定的局限性，B细胞会高表达B细胞表面抗原MS4A1，但是我们所测得的B细胞却有着不同的状态，如幼稚B细胞，成熟B细胞，浆细胞等，甚至还有一些疾病/组织特异性B细胞。故细胞类型的注释并不是绝对的，我们一般会首先获得大类细胞，再从大类细胞中识别细胞类型的"亚型"或"细胞状态"。

参考：
[1] https://github.com/Starlitnightly/single_cell_tutorial
[2] https://github.com/theislab/single-cell-best-practices

注释前的处理-聚类

首先从一个稍微复杂的数据集开始注释，这里使用降维可视化中保存的人类骨髓数据集：

import omicverse as ov
import scanpy as sc
ov.ov_plot_set()

adata = ov.read('./data/s4d8_dimensionality_reduction.h5ad')

在手动注释中，流程是先聚类，然后进行：1.主要簇的注释，2.选定簇，在簇内注释亚群。

所以，第一步注释的聚类分辨率不用很高，使用默认的resolution=1就行：

sc.pp.neighbors(adata, n_neighbors=15,
                n_pcs=30,use_rep='scaled|original|X_pca')
sc.tl.leiden(adata, key_added="leiden_res1", resolution=1.0)

# UMAP可视化
adata.obsm["X_mde"] = ov.utils.mde(adata.obsm["scaled|original|X_pca"])
ov.utils.embedding(adata,
                basis='X_mde',
                color=[ "leiden_res1"],
                title=['Clusters'],
                palette=ov.palette()[:],
                show=False,frameon='small',)

聚类结果如下：
fig1

基于marker字典注释

这里根据文献构造字典，比如基于文献可以获得骨髓细胞类型的marker gene----之前研究特定细胞类型和亚型并报告了这些细胞类型的marker gene的论文。理想情况下，一个marker gene集应该在多个数据集中得到验证。

然后假设我们有一个正确的marker字典，比如：

marker_dict={
‘Fibroblast’:[‘ACTA2’],
‘Endothelium’:[‘PTPRB’,‘PECAM1’],
‘Epithelium’:[‘KRT5’,‘KRT14’],
‘Mast cell’:[‘KIT’,‘CD63’],
‘Neutrophil’ :[‘FCGR3A’,‘ITGAM’],
‘cDendritic cell’:[‘FCER1A’,‘CST3’],
‘pDendritic cell’:[‘IL3RA’,‘GZMB’,‘SERPINF1’,‘ITM2C’],
‘Monocyte’ :[‘CD14’,‘LYZ’,‘S100A8’,‘S100A9’,‘LST1’,],
‘Macrophage’:[‘CSF1R’,‘CD68’],
‘B cell’:[‘MS4A1’,‘CD79A’],
‘Plasma cell’:[‘MZB1’,‘IGKC’,‘JCHAIN’],
‘Proliferative signal’:[‘MKI67’,‘TOP2A’,‘STMN1’],
‘NK/NKT cell’:[‘GNLY’,‘NKG7’,‘KLRD1’],
‘T cell’:[‘CD3D’,‘CD3E’],
}

然后循环遍历所有细胞类型，并仅保留adata中与marker字典中重叠的基因作为marker基因。用dotplot统计这些重叠的marker gene在聚类簇和已知细胞类型中的表达量：

small_marker_dict={
    'Fibroblast':['ACTA2'],
    'Endothelium':['PTPRB','PECAM1'],
    'Epithelium':['KRT5','KRT14'],
    'Mast cell':['KIT','CD63'],
    'Neutrophil' :['FCGR3A','ITGAM'],
    'cDendritic cell':['FCER1A','CST3'],
    'pDendritic cell':['IL3RA','GZMB','SERPINF1','ITM2C'],
    'Monocyte' :['CD14','LYZ','S100A8','S100A9','LST1',],
    'Macrophage':['CSF1R','CD68'],
    'B cell':['MS4A1','CD79A'],
    'Plasma cell':['MZB1','IGKC','JCHAIN'],
    'Proliferative signal':['MKI67','TOP2A','STMN1'],
    'NK/NKT cell':['GNLY','NKG7','KLRD1'],
    'T cell':['CD3D','CD3E'],
}
# check if the markers are in the data
smarker_genes_in_data = dict()
for ct, markers in small_marker_dict.items():
    markers_found = list()
    for marker in markers:
        if marker in adata.var.index:
            markers_found.append(marker)
    smarker_genes_in_data[ct] = markers_found
#del [] # remove the last marker
del_markers = list()
for ct, markers in smarker_genes_in_data.items():
    if markers==[]:
        del_markers.append(ct)
for ct in del_markers:
    del smarker_genes_in_data[ct]


sc.pl.dotplot(
    adata,
    groupby="leiden_res1",
    var_names=smarker_genes_in_data,
    dendrogram=True,
    standard_scale="var",  # standard scale: normalize each gene to range from 0 to 1
)

fig2
然后根据经验给簇分配细胞类型名称，比如根据上图：

簇8，0，18，1：NK细胞

然后可视化这部分簇，观察它们是否连在一起，如果连在一起就有更大可能是注释正确的：

cluster2annotation = {
     '8': 'NK cells',
     '0': 'NK cells',
     '18': 'NK cells',
     '1': 'NK cells',
}
adata.obs['major_celltype'] = adata.obs['leiden_res1'].map(cluster2annotation).astype('category')

ov.utils.embedding(adata,
                basis='X_mde',
                color=[ "leiden_res1","major_celltype"],
                title=['Clusters','Cell types'],
                palette=ov.palette()[:],wspace=0.55,
                show=False,frameon='small',)

fig3

基于簇特异性marker基因的注释

直接提前给定一个marker字典范围太过于宽广，字典的内容我们很难控制（一方面是细胞类型提前指定哪些，细胞类型对应的marker基因提前指定大概多少个，以及具体是哪几个），为了更简便，我们可以计算聚类后每个簇的marker基因（那些高度差异表达的基因），然后查找是否可以将这些marker基因与任何已知的细胞类型联系起来。

这里，一般会使用t-test检验来计算不同簇的差异基因：

# 计算差异基因
adata.uns['log1p']['base']=None
sc.tl.rank_genes_groups(
    adata, groupby="leiden_res1", use_raw=False,
    method="wilcoxon", key_added="dea_leiden_res1"
)

# 点图可视化差异表达情况
sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res1", 
    standard_scale="var", n_genes=3, key="dea_leiden_res1"
)

fig4
差异表达基因在多个簇中都具有高表达，为此，还需要手动过滤部分差异表达基因：

sc.tl.filter_rank_genes_groups(
    adata,
    min_in_group_fraction=0.1,
    max_out_group_fraction=0.2,
    key="dea_leiden_res1",
    key_added="dea_leiden_res1_filtered",
)

sc.pl.rank_genes_groups_dotplot(
    adata, groupby="leiden_res1", dendrogram=True,
    standard_scale="var", n_genes=3, key="dea_leiden_res1_filtered"
)

fig5
上图中，簇13和15是一类细胞，我们现在要找到13和15的差异基因对应的是什么细胞类型。如果这里有字典数据库，我们就能根据每个目标簇的多个差异基因去索引细胞类型。由于涉及字典的索引，我们明显可以看到对经验的依赖，这也是手动的特点。

基于marker数据库的注释

为了提高自动化程度，后来人们将基于簇特异性marker注释与marker基因数据库连接，从而减少了手动过程。

这里，我们使用AUCell来评估每一类细胞的marker基因在簇中的富集情况。

首先，获取数据的raw格式，因为我们需要所有基因：

print(adata)
print(adata.raw)
adata_raw=adata.raw.to_adata()
print(adata_raw)

"""
AnnData object with n_obs × n_vars = 14814 × 2000
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'scDblFinder_score', 'scDblFinder_class', 'log2_nUMIs', 'log2_nGenes', 'leiden_res1', 'major_celltype'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'counts|original|cum_sum_eigenvalues', 'counts|original|pca_var_ratios', 'hvg', 'layers_counts', 'log1p', 'neighbors', 'scaled|original|cum_sum_eigenvalues', 'scaled|original|pca_var_ratios', 'tsne', 'umap', 'leiden_res1', 'leiden_res1_colors', 'dendrogram_leiden_res1', 'major_celltype_colors', 'dea_leiden_res1', 'dea_leiden_res1_filtered'
    obsm: 'X_tsne', 'X_tsne_counts', 'X_tsne_scaled', 'X_umap', 'X_umap_counts', 'X_umap_scaled', 'counts|original|X_pca', 'scaled|original|X_pca', 'X_mde', 'X_pca'
    varm: 'counts|original|pca_loadings', 'scaled|original|pca_loadings'
    layers: 'counts', 'lognorm', 'scaled', 'soupX_counts'
    obsp: 'connectivities', 'distances'
Raw AnnData with n_obs × n_vars = 14814 × 20171
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
AnnData object with n_obs × n_vars = 14814 × 20171
    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_20_genes', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'log1p_total_counts_ribo', 'pct_counts_ribo', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'outlier', 'mt_outlier', 'scDblFinder_score', 'scDblFinder_class', 'log2_nUMIs', 'log2_nGenes', 'leiden_res1', 'major_celltype'
    var: 'gene_ids', 'feature_types', 'genome', 'mt', 'ribo', 'hb', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'percent_cells', 'robust', 'mean', 'var', 'residual_variances', 'highly_variable_rank', 'highly_variable_features'
    uns: 'counts|original|cum_sum_eigenvalues', 'counts|original|pca_var_ratios', 'hvg', 'layers_counts', 'log1p', 'neighbors', 'scaled|original|cum_sum_eigenvalues', 'scaled|original|pca_var_ratios', 'tsne', 'umap', 'leiden_res1', 'leiden_res1_colors', 'dendrogram_leiden_res1', 'major_celltype_colors', 'dea_leiden_res1', 'dea_leiden_res1_filtered'
    obsm: 'X_tsne', 'X_tsne_counts', 'X_tsne_scaled', 'X_umap', 'X_umap_counts', 'X_umap_scaled', 'counts|original|X_pca', 'scaled|original|X_pca', 'X_mde', 'X_pca'
    obsp: 'connectivities', 'distances'
"""

然后下载CellMarker_Augmented_2021数据库，一共包含1097种细胞类型，并且平均每一种细胞类型有80个marker基因。

使用富集分析自动给每个簇注释几个潜在可能的细胞类型（细胞类型只在数据库中出现）：

pathway_dict=ov.utils.geneset_prepare('./database/CellMarker_Augmented_2021.txt',organism='Human')

adata_aucs=ov.single.pathway_aucell_enrichment(adata_raw,
                                                pathways_dict=pathway_dict,
                                                num_workers=8)
adata_aucs.obs=adata_raw[adata_aucs.obs.index].obs
adata_aucs.obsm=adata_raw[adata_aucs.obs.index].obsm
adata_aucs.obsp=adata_raw[adata_aucs.obs.index].obsp

#adata_aucs.uns['log1p']['base']=None
sc.tl.rank_genes_groups(
    adata_aucs, groupby="leiden_res1", use_raw=False,
    method="wilcoxon", key_added="dea_leiden_aucs_res1"
)
sc.pl.rank_genes_groups_dotplot(adata_aucs,groupby='leiden_res1',
                                cmap='RdBu_r',key='dea_leiden_aucs_res1',
                                standard_scale='var',n_genes=3)

fig6
也可以逐个打印出来，查看每个簇对应的前2个细胞类型：

for cluster in adata_aucs.uns['dendrogram_leiden_res1']['categories_ordered']:
    special_cluster=str(cluster)
    degs = sc.get.rank_genes_groups_df(adata_aucs, group=special_cluster, key='dea_leiden_aucs_res1')
    degs=degs.dropna()
    print('{}: {}'.format(special_cluster,str(degs.names[:2].tolist()).replace("'","").replace("[","").replace("]","").replace(",","|")))

14: Natural Regulatory T (Treg) cell:Peripheral Blood| Natural Killer T (NKT) cell:Fetal Kidney
2: Germinal Center B cell:Undefined| B cell:Kidney
10: Monocyte:Fetal Kidney| Cancer Stem cell:Bone Marrow
12: Endothelial cell:Fetal Gonad| Lake Et al.Science.Ex3:Brain
16: Leydig cell:Fetal Gonad| Lake Et al.Science.Ex3:Brain
6: Immature Transitional B cell:Blood| Plasmablast:Peripheral Blood
4: B cell:Kidney| Transitional B cell:Blood
9: B cell:Kidney| Lymphoid cell:Peripheral Blood
7: Cancer Stem cell:Bone Marrow| Cardiomyocyte:Pluripotent Stem Cell
13: Erythroid cell:Umbilical Cord Blood| Erythroblast:Blood
15: Red Blood Cell (erythrocyte):Placenta| Erythroid cell:Umbilical Cord Blood
8: pro-Natural Killer Cell (pro-NK cell):Peripheral Blood| Natural Killer cell:Peripheral Blood
0: CD4+ Cytotoxic T cell:Liver| Effector CD8+ Memory T (Tem) cell:Peripheral Blood
18: Natural Killer T (NKT) cell:Kidney| Natural Killer cell:Peripheral Blood
1: T cell:Placenta| CD4+ Cytotoxic T cell:Liver
17: Erythroid precursor:Bone Marrow| Gonadal Mitotic Phase Fetal Germ cell:Fetal Gonad
5: T Helper cell:Liver| Exhausted CD8+ T cell:Liver
3: Naive CD4+ T cell:Peripheral Blood| Naive CD8+ T cell:Peripheral Blood
11: Mucosal-associated Invariant T cell:Liver| T Helper cell:Liver