SCCAF:全称为单细胞聚类评估框架,是一种全新的单细胞推断细胞类型的方法
该方法的核心就是通过不停的聚类评估在聚类在评估的方法最后筛选出最优的细胞类型
SCCAF方法不仅会识别出特异性的细胞类型,还会识别出各种类型中的细胞对应的marker基因
关于这个方法和文章我就不多做介绍了,下面的这个教程,做了很深入的讲解
https://mp.weixin.qq.com/s/AnhKvVlr_2uzEz2jY_nNbA
那么,直接进行试验看一下到底有啥用
请务必参考官方文档https://pypi.org/project/SCCAF/
官方文档的一句话请务必牢记:Before applying SCCAF, please make sure the doublets have been excluded and the batch effect has been effectively regressed.
__Attention:__在学习一个流程的时候,首先要测试一下官方的数据,了解输入的数据类型,以及数据特征。然后用自己的数据进行处理和测试,这才是完整的学校一个流程的方法
# 首先进行软件安装
# pip install sccaf
# 初始化
import os
import math
import itertools
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scanpy as sc
from collections import Counter
from SCCAF import *
%matplotlib inline
%pylab inline
%config InlineBackend.figure_format = 'svg'
warnings.filterwarnings("ignore")
end = ''
import urllib.request
Populating the interactive namespace from numpy and matplotlib
SCCAF 是基于 Scanpy 对象生成的结果。那这样正好。不像R语言,颠来倒去的在各种类型里面变来变去
首先使用官方数据跑一边流程
#urllib.request.urlretrieve("https://drive.google.com/uc?export=download&id=1hprRFkEk8q27kqrt9RLvY6aqu6XWPb2C", 'Zeisel.h5')
ad = sc.read(filename="Zeisel.h5")
sc.tl.leiden(ad, resolution=0.6)
这里要注意一下,文章使用的是louvain聚类方法,但是由于我的版本是0.7的,存在bug,暂时还没修复就先用ledien方法聚类
ad.obs_keys()
sc.pl.tsne(ad, color=['cell','leiden'], frameon=False)
['group #',
'total mRNA mol',
'well',
'sex',
'age',
'diameter',
'cell_id',
'level1class',
'level2class',
'n_genes',
'louvain',
'cell',
'self-project',
'L1_Round0',
'L1_Round0_self-projection',
'L1_result',
'L1_Round1',
'L1_Round1_self-projection',
'L1_Round2',
'L1_Round2_self-projection',
'L1_Round3',
'L1_Round3_self-projection',
'leiden']
从图中,可以明显的看出不同手动注释的细胞,不同群落的边界基本上是没有重叠的。
然后,在使用SCCAF的时候是接受过度聚类和欠聚类的
这里强调一下啊,过度聚类不是说一下子聚个好几十类出来,因一旦涉及到可视化之类的问题,必然会用到画板,设计到画板就要强调颜色的数量了,一般不要超过12种,不然报错的话解决起来就很麻烦
然后我们用SCCAF方法验证一下这个手动聚类出来的效果
figsize(8,3)
y_prob, y_pred, y_test, clf, cvsm, acc = SCCAF_assessment(ad.X, ad.obs['cell'],n=100)
aucs = plot_roc(y_prob, y_test, clf, cvsm=cvsm, acc=acc)
plt.show()
Mean CV accuracy: 0.9522
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9593
然后让算法自动聚类,直接跑流程就行
ad.obs['L1_Round0'] = ad.obs['leiden']
SCCAF_optimize_all(min_acc=0.953, ad=ad, basis ='tsne',c_iter=5) # use='pca'
R1norm_cutoff: 0.500000
R2norm_cutoff: 0.050000
Accuracy: 0.000000
======================
Round1 ...
Mean CV accuracy: 0.9301
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9205
... storing 'L1_Round0_self-projection' as categorical
IGRAPH U-W- 12 0 --
+ attr: weight (e)
Converge SCCAF_optimize no. cluster!
m1: 0.318182
m2: 0.021178
Accuracy: 0.908434
R1norm_cutoff: 0.308182
R2norm_cutoff: 0.020178
Accuracy: 0.908434
======================
Round1 ...
Mean CV accuracy: 0.9140
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9171
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9094
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9186
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.9166
Accuracy on the training set: 1.0000
Accuracy on the hold-out set: 0.8988
Max R1mat: 0.466102
Max R2mat: 0.031714
min_acc: 0.898795
看一下最后的结果,由原本的12类变成了10类
sc.pl.tsne(ad, color=['L1_result','cell',],frameon=False, title=['SCCAF','Human Annotation'])
最后我们总结一下基本流程:
-
走标准流程得到聚类信息 https://zhuanlan.zhihu.com/p/338240909
-
将聚类信息设置为初始化
L1_Round0
-
然后跑流程即可
注意一下,迭代次数设置的多的话可以确保结果稳定
Best Regards,
Yuan.SH
---------------------------------------
School of Basic Medical Sciences,
Fujian Medical University,
Fuzhou, Fujian, China.
please contact with me via the following ways:
(a) e-mail :yuansh3354@163.com