Hands-on experience with CellTypist


brief

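Like SingleR, CellTypist annotates cell types in single-cell data. Its classifier is a logistic regression model trained on published, annotated single-cell datasets. The training set is large and its labels appear to be well curated, so the classifier performs well on immune cell annotation. (I've heard it is somewhat more accurate than SingleR, and it was published in Science, so why not give it a try!)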

Notes

  • Built-in reference models and the cell types they annotate
    Full list: https://www.celltypist.org/models

    import celltypist
    from celltypist import models
    #Show all available models that can be downloaded and used.
    models.models_description()
    #Download a specific model, for example, `Immune_All_Low.pkl`.
    models.download_models(model = 'Immune_All_Low.pkl')
    #Download a list of models, for example, `Immune_All_Low.pkl` and `Immune_All_High.pkl`.
    models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'])
    #Update the models by re-downloading the latest versions if you think they may be outdated.
    models.download_models(model = ['Immune_All_Low.pkl', 'Immune_All_High.pkl'], force_update = True)
    #Show the local directory storing these models.
    models.models_path
    
  • if the model argument is not specified, CellTypist will by default use the Immune_All_Low.pkl model

  • The mode argument of celltypist.annotate:
    mode = 'best match': each query cell is assigned the cell type with the largest score/probability among all possible cell types.
    mode = 'prob match': when a query cell cannot be assigned to any cell type in the reference model (i.e., a novel cell type) or can be assigned to multiple cell types (i.e., multi-label classification), probability matching can be turned on with a probability cutoff (p_thres, default 0.5) to decide which cell types (none, one, or multiple) are assigned to a given cell.
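The difference between the two modes can be sketched on toy numbers. This is only an illustration of the idea described above, not CellTypist's code: `assign_labels` is a hypothetical helper, the probabilities are made up, and using a strict `>` against the cutoff is an assumption.

```python
# Toy illustration only: `assign_labels` is a hypothetical helper,
# not a CellTypist function. The probabilities below are invented.
def assign_labels(probabilities, mode="best match", p_thres=0.5):
    """probabilities: dict mapping cell type -> probability for one cell."""
    if mode == "best match":
        # Always return the single cell type with the highest probability.
        return max(probabilities, key=probabilities.get)
    if mode == "prob match":
        # Return every cell type above the cutoff (assumed strict '>').
        # The list may be empty, hold one label, or hold several.
        return [ct for ct, p in probabilities.items() if p > p_thres]
    raise ValueError(f"unknown mode: {mode!r}")


cell_a = {"B cells": 0.92, "T cells": 0.03, "NK cells": 0.01}  # one clear hit
cell_b = {"B cells": 0.30, "T cells": 0.20, "NK cells": 0.10}  # nothing passes 0.5
cell_c = {"B cells": 0.05, "T cells": 0.70, "NK cells": 0.65}  # two types pass 0.5

for name, cell in [("a", cell_a), ("b", cell_b), ("c", cell_c)]:
    print(name, assign_labels(cell, "best match"), assign_labels(cell, "prob match"))
```

Note that 'best match' still returns a label for cell b even though no cell type is convincing, which is the situation 'prob match' is meant to expose.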

  • Majority voting classifier
    By default, CellTypist will only do the prediction jobs to infer the identities of input cells, which renders the prediction of each cell independent. To combine the cell type predictions with the cell-cell transcriptomic relationships, CellTypist offers a majority voting approach based on the idea that similar cell subtypes are more likely to form a (sub)cluster regardless of their individual prediction outcomes.
    In other words, by default CellTypist annotates each cell on its own, whereas the majority voting classifier takes transcriptomic similarity between cells into account: cells are grouped into approximate clusters by similarity, and each cluster takes the cell type that most of its cells were predicted as.
    That is, the cells are first clustered, and then each cluster is annotated. (During the majority voting, to define cell-cell relations, CellTypist will use a heuristic over-clustering approach according to the size of the input data with the aid of a Leiden clustering pipeline.)

    #Turn on the majority voting classifier as well.
    predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', majority_voting = True)
    # You can also supply your own clustering
    #Add your own over-clustering result.
    # an input plain file with the over-clustering result of one cell per line.
    # or a list-like object (such as a numpy 1D array) indicating the over-clustering result of all cells.
    predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl', majority_voting = True, over_clustering = '/path/to/over_clustering/file')
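The voting step itself can be sketched in a few lines. This is a minimal sketch of the idea, assuming the per-cell labels and the cluster assignments are already available; `majority_vote` is a hypothetical helper and the data are invented. It leaves out anything CellTypist does beyond a plain per-cluster vote.

```python
from collections import Counter, defaultdict


# Hypothetical helper illustrating per-cluster majority voting;
# this is not CellTypist's implementation.
def majority_vote(predicted_labels, clusters):
    """Return one label per cell: the most common predicted label in its cluster."""
    if len(predicted_labels) != len(clusters):
        raise ValueError("need exactly one cluster assignment per cell")
    # Collect the per-cell predictions of every cluster.
    labels_by_cluster = defaultdict(list)
    for label, cluster in zip(predicted_labels, clusters):
        labels_by_cluster[cluster].append(label)
    # Pick the most frequent label in each cluster
    # (ties go to the label seen first in that cluster).
    winner = {
        cluster: Counter(labels).most_common(1)[0][0]
        for cluster, labels in labels_by_cluster.items()
    }
    # Broadcast the winning label back to every cell of the cluster.
    return [winner[cluster] for cluster in clusters]


predicted = ["B cells", "B cells", "T cells", "T cells", "T cells", "NK cells"]
clusters = [0, 0, 0, 1, 1, 1]  # invented over-clustering result, one entry per cell
voted = majority_vote(predicted, clusters)
print(voted)
```

The third and sixth cells are relabelled to match their neighbours, which is how isolated per-cell misclassifications get smoothed out.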
    

Worked example

Official tutorial

conda activate R4
conda install -c bioconda -c conda-forge celltypist
python

### Python interpreter
import celltypist
from celltypist import models

#Select the model from the above list. If the `model` argument is not provided, will default to `Immune_All_Low.pkl`.
model = models.Model.load(model = 'Immune_All_Low.pkl')
#The model summary information.
model
#Examine cell types contained in the model.
model.cell_types
#Examine genes/features contained in the model.
model.features

# the input data as a count table (cell-by-gene or gene-by-cell) in the format of txt/csv/tsv/tab/mtx/mtx.gz.

#Get a demo test data. This is a UMI count csv file with cells as rows and gene symbols as columns.
input_file = celltypist.samples.get_sample_csv()

#Predict the identity of each input cell.
# predictions = celltypist.annotate(input_file, model = 'Immune_All_Low.pkl')
predictions = celltypist.annotate(input_file, model = model)

# If your input file is in a gene-by-cell format (genes as rows and cells as columns), pass in the transpose_input = True argument.
# In addition, if the input is provided in the .mtx format, you will also need to specify the gene_file and cell_file
predictions = celltypist.annotate(input_file, model = model, transpose_input = True, gene_file = '/path/to/gene/file.txt', cell_file = '/path/to/cell/file.txt')

# Annotation results
#Summary information for the prediction result.
predictions
#Examine the predicted cell type labels.
predictions.predicted_labels
#Examine the matrix representing the decision score of each cell belonging to a given cell type.
predictions.decision_matrix
#Examine the matrix representing the probability each cell belongs to a given cell type (transformed from decision matrix by the sigmoid function).
predictions.probability_matrix
# Save the annotation results
#Export the three results to csv tables.
predictions.to_table(folder = '/path/to/a/folder', prefix = '')
#Alternatively, export the three results to a single Excel table (.xlsx).
predictions.to_table(folder = '/path/to/a/folder', prefix = '', xlsx = True)

#Visualise the predicted cell types overlaid onto the UMAP.
predictions.to_plots(folder = '/path/to/a/folder', prefix = '')
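The comments above say that the probability matrix is obtained from the decision matrix through the sigmoid function. A small sketch of that transformation for a single cell, using invented decision scores (the dictionary layout is an assumption for illustration, not the library's data structure):

```python
import math


def sigmoid(score):
    """Logistic function mapping a decision score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Invented decision scores of one cell against three cell types.
decision_scores = {"B cells": 2.0, "T cells": 0.0, "NK cells": -3.0}

# Apply the sigmoid to each score independently.
probabilities = {cell_type: sigmoid(s) for cell_type, s in decision_scores.items()}

for cell_type, p in probabilities.items():
    print(f"{cell_type}: {p:.3f}")
```

Because each score is transformed independently, the probabilities of one cell need not sum to 1. A score of 0 maps to exactly 0.5, which matches the default p_thres cutoff.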

Real data

R 
list.files("./processed_data/")
# [1] "pre_sce.rds"      "sce.combined.rds"
# sce.combined.rds is the integrated dataset; we need its counts data
sce <- readRDS("../processed_data/sce.combined.rds")
write.csv(sce@assays[["RNA"]]$counts,file="sce_integrated_raw_counts.csv")


python
import celltypist
from celltypist import models

models.models_description()
predictions = celltypist.annotate("./sce_integrated_raw_counts.csv", model = 'Immune_All_Low.pkl',transpose_input = True)
predictions.to_table(folder = './', prefix = './celltypsit_Immune_All_Low_')

predictions = celltypist.annotate("./sce_integrated_raw_counts.csv", model = 'Cells_Intestinal_Tract.pkl',transpose_input = True)
predictions.to_table(folder = './', prefix = './celltypsit_Cells_Intestinal_Tract_')

#########
models.models_description()
predictions = celltypist.annotate("./sce_integrated_raw_counts.csv", model = 'Immune_All_Low.pkl',transpose_input = True,majority_voting = True)
predictions.to_table(folder = './', prefix = './celltypsit_Immune_All_Low_MV_')

predictions = celltypist.annotate("./sce_integrated_raw_counts.csv", model = 'Cells_Intestinal_Tract.pkl',transpose_input = True,majority_voting = True)
predictions.to_table(folder = './', prefix = './celltypsit_Cells_Intestinal_Tract_MV_')

R
# Merge the labels into the Seurat object, then visualize
ct_CIT <- read.table("celltypsit_Cells_Intestinal_Tract_predicted_labels.csv",sep=",",header=T)
ct_IAL <- read.table("celltypsit_Immune_All_Low_predicted_labels.csv",sep=",",header=T)
library(stringr)
rownames(ct_CIT) <- str_replace_all(ct_CIT$X,pattern="\\.",replacement="-")
rownames(ct_IAL) <- str_replace_all(ct_IAL$X,pattern="\\.",replacement="-")

sce@meta.data$celltypist_IAL <- ct_IAL[rownames(sce@meta.data),"predicted_labels"]
sce@meta.data$celltypist_CIT <- ct_CIT[rownames(sce@meta.data),"predicted_labels"]
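Once the same cells carry labels from both models, one simple way to compare the models is to cross-tabulate the two label columns. The sketch below is written in Python on invented placeholder labels. `cross_tabulate` is a hypothetical helper, and the label names do not come from the actual models.

```python
from collections import Counter


# Hypothetical helper: count how often each pair of labels co-occurs
# on the same cell. Inputs must be in the same cell order.
def cross_tabulate(labels_a, labels_b):
    if len(labels_a) != len(labels_b):
        raise ValueError("both label lists must cover the same cells")
    return Counter(zip(labels_a, labels_b))


# Placeholder labels, one entry per cell (not real model output).
ial_labels = ["IAL_type_1", "IAL_type_1", "IAL_type_2", "IAL_type_1"]
cit_labels = ["CIT_type_1", "CIT_type_1", "CIT_type_2", "CIT_type_3"]

table = cross_tabulate(ial_labels, cit_labels)
for (ial, cit), n_cells in table.most_common():
    print(f"{ial:12s} {cit:12s} {n_cells}")
```

Pairs with high counts show where the two models agree on a population. Labels from one model that scatter across many labels of the other are worth a closer look.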

Summary

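  • Try several models and cross-check their results
  • The tool seems prone to overfitting; setting majority_voting = True appears to improve results considerably
  • Next time I'll go back to SingleR!!!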