Cluster analysis with the ConsensusClusterPlus package

Latest recommended article published on 2024-08-13 11:28:01

qq_27390023

Article link: https://blog.csdn.net/qq_27390023/article/details/125624145

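The ConsensusClusterPlus function of the ConsensusClusterPlus package determines the number of clusters and the cluster membership of each item from stability evidence. The calcICL function computes cluster-consensus and item-consensus.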
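To make "stability evidence" concrete, here is a minimal base-R sketch of the resampling idea behind consensus clustering. It is a toy illustration and not the package's internal code: the helper `consensus_matrix`, its arguments, and the simulated matrix `d` are all made up for this example.

```r
# Toy sketch of consensus clustering in base R (NOT the package's own code).
set.seed(1)

# Simulated data: 20 features x 12 samples, two groups with opposite profiles.
sig <- c(rep(2, 10), rep(-2, 10))
d <- cbind(matrix(rnorm(120), nrow = 20) + sig,
           matrix(rnorm(120), nrow = 20) - sig)
colnames(d) <- paste0("s", 1:12)

# Hypothetical helper: fraction of subsamples in which two samples
# fall into the same cluster, among subsamples containing both.
consensus_matrix <- function(d, k, reps = 50, p_item = 0.8) {
  n <- ncol(d)
  together <- matrix(0, n, n)  # times a pair was clustered together
  sampled  <- matrix(0, n, n)  # times a pair was drawn in the same subsample
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(p_item * n)))
    dis <- as.dist(1 - cor(d[, idx]))        # 1 - Pearson correlation
    cl  <- cutree(hclust(dis, method = "average"), k)
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  m <- together / pmax(sampled, 1)           # avoid division by zero
  dimnames(m) <- list(colnames(d), colnames(d))
  m
}

cm <- consensus_matrix(d, k = 2)
round(cm[c(1, 2, 7), c(1, 2, 7)], 2)
```

Values near 1 mean a pair of samples almost always clusters together; values near 0 mean it almost never does. A K for which most entries are close to 0 or 1 is a stable choice.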

Usage

ConsensusClusterPlus(
  d=NULL, maxK=3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc",
  title="untitled_consensus_cluster", innerLinkage="average",
  finalLinkage="average", distance="pearson", ml=NULL, tmyPal=NULL,
  seed=NULL, plot=NULL, writeTable=FALSE, weightsItem=NULL,
  weightsFeature=NULL, verbose=F, corUse="everything")

calcICL(res,title="untitled_consensus_cluster",plot=NULL,writeTable=FALSE)

Arguments

`d`	data to be clustered; either a data matrix where columns=items/samples and rows are features. For example, a gene expression matrix of genes in rows and microarrays in columns, or ExpressionSet object, or a distance object (only for cases of no feature resampling)
`maxK`	integer value. maximum cluster number to evaluate.
`reps`	integer value. number of subsamples.
`pItem`	numerical value. proportion of items to sample.
`pFeature`	numerical value. proportion of features to sample.
`clusterAlg`	character value. cluster algorithm. 'hc' hierarchical (hclust), 'pam' for partitioning around medoids, 'km' for k-means upon data matrix, or a function that returns a clustering. See example and vignette for more details.
`title`	character value for output directory. Directory is created only if plot is not NULL or writeTable is TRUE. This title can be an absolute or relative path.
`innerLinkage`	hierarchical linkage method for subsampling.
`finalLinkage`	hierarchical linkage method for consensus matrix.
`distance`	character value. 'pearson': (1 - Pearson correlation), 'spearman': (1 - Spearman correlation), 'euclidean', 'binary', 'maximum', 'canberra', 'minkowski' or custom distance function.
`ml`	optional. prior result, if supplied then only do graphics and tables.
`tmyPal`	optional character vector of colors for consensus matrix
`seed`	optional numerical value. sets random seed for reproducible results.
`plot`	character value. NULL - print to screen, 'pdf', 'png', 'pngBMP' for bitmap png, helpful for large datasets.
`writeTable`	logical value. TRUE - write output and log to csv.
`weightsItem`	optional numerical vector. weights to be used for sampling items.
`weightsFeature`	optional numerical vector. weights to be used for sampling features.
`res`	result of ConsensusClusterPlus.
`verbose`	boolean. If TRUE, print messages to the screen to indicate progress. This is useful for large datasets.
`corUse`	optional character value. specifies how to handle missing data in correlation distances 'everything','pairwise.complete.obs', 'complete.obs' see cor() for description.
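The `clusterAlg` and `distance` arguments also accept user-supplied functions. The sketch below shows what such hooks might look like, assuming (as in the examples in the package documentation) that a clustering hook receives a distance object and `k` and returns one cluster label per item. The names `wardHook` and `manhattanDist` and the toy matrix are hypothetical, and the hooks are exercised here on their own, outside the package.

```r
# Hypothetical clustering hook: Ward linkage on a distance object.
# Assumed contract: takes a dist object and k, returns cluster labels.
wardHook <- function(this_dist, k) {
  tree <- hclust(this_dist, method = "ward.D2")
  assignment <- cutree(tree, k)
  return(assignment)
}

# Hypothetical distance hook: Manhattan distance between the rows of x.
manhattanDist <- function(x) {
  dist(x, method = "manhattan")
}

# Stand-alone check on toy data: 10 items in rows, two clear groups.
set.seed(42)
toy <- rbind(matrix(rnorm(10, mean = 0), nrow = 5),
             matrix(rnorm(10, mean = 10), nrow = 5))
rownames(toy) <- paste0("s", 1:10)

assignment <- wardHook(manhattanDist(toy), k = 2)
assignment
```

In the package documentation's examples, such hooks are passed by name as character strings (for example `clusterAlg="dianaHook"`), so the same convention would presumably apply to these two functions; check the current manual before relying on it.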

# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("ConsensusClusterPlus")

### 1. Prepare the data
## Rows are features, columns are samples
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]

# Keep the 5000 probes with the largest median absolute deviation (MAD)
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:5000],]
# order(mads): indices that sort mads in ascending order
# rev(order(mads)): the same indices reversed, i.e. descending order

d = sweep(d,1, apply(d,1,median,na.rm=TRUE))
# sweep: returns an array obtained from an input array
# by sweeping out a summary statistic.
# Here the median of each row is subtracted from that row,
# e.g. for the first row: d[1,]-median(d[1,])
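The filtering and centering steps are easier to follow on a matrix small enough to inspect. The following sketch repeats them on a made-up 6 x 8 matrix; `toy`, `top`, and `centered` are illustrative names, and keeping the top 2 rows stands in for the top 5000 probes.

```r
# Toy matrix: row i equals scale_i * (1:8), so a larger scale means a larger MAD.
scales <- c(1, 10, 2, 0.5, 5, 1.5)
toy <- outer(scales, 1:8)
dimnames(toy) <- list(paste0("probe", 1:6), paste0("s", 1:8))

# Row-wise median absolute deviation.
mads <- apply(toy, 1, mad)

# Keep the 2 most variable rows; same result as rev(order(mads))[1:2].
top <- toy[order(mads, decreasing = TRUE)[1:2], ]

# Subtract each row's median from that row.
centered <- sweep(top, 1, apply(top, 1, median, na.rm = TRUE))

rownames(top)
apply(centered, 1, median)
```

After centering, every retained row has median zero, so the correlation-based distances used later reflect the shape of each profile rather than its overall level.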

### 2. Run consensus clustering
library(ConsensusClusterPlus)
output_dir="/Users/zhengxueming/test/test0705"
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
                               title=output_dir,clusterAlg="hc",distance="pearson",
                               seed=1213,plot="png")
# str(results)
# str(results[[2]])

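## The clustering plots for each K and the evaluation plots are written to output_dir
# Choose the best K from the consensus CDF and Delta area plots: starting at K=2,
# compute the relative change in the area under the CDF curve between K and K-1,
# and pick the K after which the increase is no longer appreciable
# tracking plot: columns are samples, rows are values of K; colors show each sample's cluster at each K,
# which helps to spot unstable clusters and unstable samples qualitatively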
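The Delta area rule described above can be reproduced by hand. The sketch below is a simplified version under stated assumptions: the CDF is evaluated on a 0.01 grid with a left-rectangle sum, which may differ in numerical detail from what the package plots, and the three consensus matrices are idealized toy inputs built by the hypothetical helper `perfect`.

```r
# Hypothetical helper: an idealized consensus matrix from fixed labels
# (1 if two items share a label, 0 otherwise).
perfect <- function(labels) outer(labels, labels, "==") * 1

# Toy consensus matrices for K = 2, 3, 4 over six items.
cms <- list(`2` = perfect(c(1, 1, 1, 2, 2, 2)),
            `3` = perfect(c(1, 1, 2, 2, 3, 3)),
            `4` = perfect(c(1, 1, 2, 2, 3, 4)))

# Area under the empirical CDF of the pairwise consensus values.
cdf_area <- function(cm, grid = seq(0, 1, by = 0.01)) {
  v <- cm[lower.tri(cm)]                # each pair counted once
  cdf <- ecdf(v)(grid)                  # CDF evaluated on the grid
  sum(cdf[-length(cdf)] * diff(grid))   # left-rectangle rule
}

areas <- sapply(cms, cdf_area)

# Delta area: the area itself for the first K, then the relative change.
delta <- c(areas[1], diff(areas) / head(areas, -1))

round(areas, 3)
round(delta, 3)
```

In this toy case the relative gain drops sharply between K=3 and K=4, which is the kind of flattening the rule looks for.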

# the top ten rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:10,1:10]

# View the colors assigned to the clusters
results[[6]][["clrs"]]

#consensusTree - hclust object 
results[[2]][["consensusTree"]]


### 3. Compute cluster-consensus and item-consensus
# calcICL derives both from the ConsensusClusterPlus result.
icl = calcICL(results,title=output_dir,plot="png")
# PNG files whose names start with icl are written to output_dir
# icl is a list with the elements "clusterConsensus" and "itemConsensus"
icl[["clusterConsensus"]]
icl[["itemConsensus"]][1:5,]
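As a rough guide to reading these two tables, the sketch below computes both quantities by hand from a small consensus matrix. It assumes the usual definitions: cluster-consensus is the mean pairwise consensus among a cluster's members, and item-consensus is the mean consensus between one item and the members of a given cluster. The matrix, the labels, and both helper functions are made up for illustration, and this is not the calcICL source.

```r
# Toy consensus matrix for five samples (symmetric, diagonal 1).
samples <- paste0("s", 1:5)
cm <- matrix(c(1.0, 0.9, 0.8, 0.1, 0.0,
               0.9, 1.0, 0.7, 0.2, 0.1,
               0.8, 0.7, 1.0, 0.3, 0.2,
               0.1, 0.2, 0.3, 1.0, 0.6,
               0.0, 0.1, 0.2, 0.6, 1.0),
             nrow = 5, byrow = TRUE, dimnames = list(samples, samples))
cls <- c(s1 = 1, s2 = 1, s3 = 1, s4 = 2, s5 = 2)  # toy cluster labels

# Mean pairwise consensus among the members of each cluster.
cluster_consensus <- function(cm, cls) {
  ks <- sort(unique(cls))
  vals <- vapply(ks, function(k) {
    sub <- cm[cls == k, cls == k, drop = FALSE]
    mean(sub[lower.tri(sub)])
  }, numeric(1))
  setNames(vals, paste0("cluster", ks))
}

# Mean consensus between one item and the other members of cluster k.
item_consensus <- function(cm, cls, item, k) {
  others <- setdiff(names(cls)[cls == k], item)
  mean(cm[item, others])
}

cc <- cluster_consensus(cm, cls)
ic_own <- item_consensus(cm, cls, "s3", 1)    # s3 against its own cluster
ic_other <- item_consensus(cm, cls, "s3", 2)  # s3 against the other cluster
cc
c(own = ic_own, other = ic_other)
```

An item whose consensus with its own cluster is not clearly higher than with any other cluster is a candidate for an unstable assignment.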


### 4. Choose a suitable K (here K=5) and build a data frame of per-sample cluster assignments
sample_cluster <- results[[5]]$consensusClass

sample_cluster_df <- data.frame(sample = names(sample_cluster),
                                cluster = sample_cluster)
head(sample_cluster_df)
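Once the assignments are in a data frame, a quick summary helps confirm that the chosen K gives sensible group sizes. The sketch below uses a made-up named vector in place of a real `consensusClass` result, so the sample names and labels are purely illustrative.

```r
# Made-up assignments standing in for a consensusClass vector.
sample_cluster <- c(s1 = 1L, s2 = 1L, s3 = 2L, s4 = 3L, s5 = 2L, s6 = 1L)

sample_cluster_df <- data.frame(sample = names(sample_cluster),
                                cluster = sample_cluster,
                                stringsAsFactors = FALSE)

# Number of samples per cluster.
sizes <- table(sample_cluster_df$cluster)

# Sample names grouped by cluster.
members <- split(sample_cluster_df$sample, sample_cluster_df$cluster)

sizes
members
```

A cluster holding only one or two samples is often a sign that K is too large for the data.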