在Linux上跑GSEA只是工程需要,在Windows上不能批量的跑;在单细胞数据上跑GSEA那是客户的需要。有需要就需要被满足,我们今天介绍一下如何提取Seurat数据做GSEA。这里的GSEA指的是GSEA官网软件,而不是fgsea, clusterProfiler等,该软件是由JAVA写的,所以应该安装合适的JAVA,为了能看懂一些常见的报错,建议学点JAVA(半年时间过去了)。
java安装与否
which java
/usr/bin/java
/usr/bin/java -version
openjdk version "1.8.0_181"
OpenJDK Runtime Environment (build 1.8.0_181-b13)
OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
gsea 安装与否
检查gsea是否安装成功
java -Xmx512m -cp gsea-3.0.jar xtools.gsea.Gsea -h
xtools.api.param.MissingReqdParamException:
Some required parameters (3) were not specified. The parameters are:
>gmx< Gene sets database (gmx or gmt files - one or more are allowed)
>res< Expression dataset - with rows as genes and columns as samples (for instance: res, gct, pcl files)
>cls< Phenotype labels for the samples in the expression dataset (cls file)
at xtools.api.param.ToolParamSet.check(ToolParamSet.java:340)
at xtools.api.AbstractTool.init(AbstractTool.java:169)
at xtools.api.AbstractTool.init(AbstractTool.java:193)
at xtools.gsea.Gsea.(Gsea.java:51)
at xtools.gsea.Gsea.main(Gsea.java:139)
选择一个通路(gene set)
在整理这份资料的时候,发现关于GSEA的大部分疑难杂症我们生信技能树都有文章,正所谓:
生信教程技能树
入门生信谁敢坑
提示:通路名字中尽量不要有/,如GLYCOLYSIS / GLUCONEOGENESIS ,因为在Linux中/意味着路径,在输出文件中有以通路名命名的文件。细胞命名也最好不要有这个符号呀。
提取数据做GSEA
SeuratRunGSEA
Idents(seuo)
seuo
seuo@assays$RNA@counts-> counts # 是用count还是data?
expr_data
# 整理GSEA需要的表达谱格式
write.table(rbind(c('symbols',colnames(expr_data)),
cbind(rownames(expr_data),expr_data)),
file='expr.txt',quote=F,sep='\t',col.names=F,row.names=F)
# 整理GSEA需要的分组格式
pheno
con
write(paste(length(pheno),'2 1'),con)
write(paste('# ',test1,' ',test2,sep=''),con)
classes
for (i in 1:length(pheno)){
classes
}
write(classes,con)
close(con)
# GSEA 命令
command
' -collapse false -nperm 1000 -permute gene_set -rnd_type no_balance -scoring_scheme weighted -rpt_label ',testname,
' -metric Diff_of_Classes -sort real -order descending -include_only_symbols false -make_sets true -median false -num 100',
' -plot_top_x 20 -rnd_seed 123456 -save_rnd_lists false -set_max 10000 -set_min 5 -zip_report false -out ', outdir, ' -gui false',sep='')
system(command)
com
write(command,com) # 保存运行的命令,以便在命令行中调参
close(com)
}
照例,请出pbmc3k数据集,测试一番:
library(Seurat)
library(SeuratData)
gmt
outdir = './'
base.name = './'
test1='B'
test2 = 'NK'
testname='seurat_annotations'
pbmc3k
An object of class Seurat
13714 features across 2700 samples within 1 assay
Active assay: RNA (13714 features, 0 variable features)
> head(pbmc3k@meta.data)
orig.ident nCount_RNA nFeature_RNA seurat_annotations
AAACATACAACCAC pbmc3k 2419 779 Memory CD4 T
AAACATTGAGCTAC pbmc3k 4903 1352 B
AAACATTGATCAGC pbmc3k 3147 1129 Memory CD4 T
AAACCGTGCTTCCG pbmc3k 2639 960 CD14+ Mono
AAACCGTGTATGCG pbmc3k 980 521 NK
AAACGCACTGGTAC pbmc3k 2163 781 Memory CD4 T
运行:
SeuratRunGSEA(pbmc3k,testname,test1,test2,gmt,outdir,testname)
结果在seurat_annotations.Gsea.XXXXXXXXXXXXX下,导出,用浏览器打开index.html文件即可:
剩下的就是结合生物学意义解读结果了,网上有许多的介绍,这砖不搬了。
两个参数注意一啊,我们用的都是默认的:
-create_svgs
Create SVG plot images along with PNGs (GZ compressed to save space as these are very large)
Default: false
Hints : true,false
-create_gcts
Create GCT files for the data backing the Gene Set Enrichment Heatmaps
Default: false
Hints : true,false
preRank 模式
对应细胞数较多的情况。
runGSEA_preRank
set.seed(1314)
Idents(seuo)
mk
mk$gene
mk %>% filter(p_val_adj >0) ->mk1
mk$p_val_adj[which(mk$p_val_adj==0)]
mk %>% mutate(sign = ifelse(avg_logFC>0,1,-1),Pstatistics=-1*log(p_val_adj)*sign) -> mk
preRank.matrix
prerank
write.table(preRank.matrix,
file=prerank,
quote=F,
sep='\t',
col.names=F,
row.names=F)
#call java gsea version
command
' -scoring_scheme weighted -make_sets true -rnd_seed 123456 -set_max 500 -set_min 1 -zip_report false ',
' -out preRankResults -create_svgs true -gui false -rpt_label ',testname, sep='')
system(command)
com
write(command,com)
close(com)
}