Cluster analysis with the ConsensusClusterPlus package

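The ConsensusClusterPlus function in the ConsensusClusterPlus package uses stability evidence to determine the number of clusters and class membership. The calcICL function computes cluster-consensus and item-consensus.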

Usage

ConsensusClusterPlus(
d=NULL, maxK = 3, reps=10, pItem=0.8, pFeature=1, clusterAlg="hc",title="untitled_consensus_cluster",
innerLinkage="average", finalLinkage="average", distance="pearson", ml=NULL,
tmyPal=NULL,seed=NULL,plot=NULL,writeTable=FALSE,weightsItem=NULL,weightsFeature=NULL,verbose=F,corUse="everything")

calcICL(res,title="untitled_consensus_cluster",plot=NULL,writeTable=FALSE)

Arguments

d

data to be clustered; either a data matrix where columns=items/samples and rows are features. For example, a gene expression matrix of genes in rows and microarrays in columns, or ExpressionSet object, or a distance object (only for cases of no feature resampling)

maxK

integer value. maximum cluster number to evaluate.

reps

integer value. number of subsamples.

pItem

numerical value. proportion of items to sample.

pFeature

numerical value. proportion of features to sample.

clusterAlg

character value. cluster algorithm. 'hc' for hierarchical (hclust), 'pam' for partitioning around medoids, 'km' for k-means upon data matrix, or a function that returns a clustering. See example and vignette for more details.

title

character value for output directory. Directory is created only if plot is not NULL or writeTable is TRUE. This title can be an absolute or relative path.

innerLinkage

hierarchical linkage method for subsampling.

finalLinkage

hierarchical linkage method for consensus matrix.

distance

character value. 'pearson' (1 - Pearson correlation), 'spearman' (1 - Spearman correlation), 'euclidean', 'binary', 'maximum', 'canberra', 'minkowski', or a custom distance function.

ml

optional. prior result, if supplied then only do graphics and tables.

tmyPal

optional character vector of colors for consensus matrix

seed

optional numerical value. sets random seed for reproducible results.

plot

character value. NULL - print to screen, 'pdf', 'png', 'pngBMP' for bitmap png, helpful for large datasets.

writeTable

logical value. TRUE - write output and log to csv.

weightsItem

optional numerical vector. weights to be used for sampling items.

weightsFeature

optional numerical vector. weights to be used for sampling features.

res

result of ConsensusClusterPlus.

verbose

boolean. If TRUE, print messages to the screen to indicate progress. This is useful for large datasets.

corUse

optional character value. specifies how to handle missing data in correlation distances 'everything','pairwise.complete.obs', 'complete.obs' see cor() for description.
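
The roles of reps and pItem are easier to see on a toy problem. The sketch below is not the package's implementation. It is a minimal base-R illustration of the resampling idea, assuming Euclidean distance and average-linkage hclust; the names toy_consensus, together and sampled are made up for this example.

```r
# Toy consensus matrix in base R (an illustrative sketch, not the package's code).
set.seed(1)

# 20 features (rows) x 12 items (columns): two well-separated groups of 6 items
x <- cbind(matrix(rnorm(20 * 6, mean = 0), nrow = 20),
           matrix(rnorm(20 * 6, mean = 5), nrow = 20))
colnames(x) <- paste0("s", seq_len(ncol(x)))

toy_consensus <- function(x, k, reps = 50, pItem = 0.8) {
  n <- ncol(x)
  # together[i, j]: number of runs in which items i and j fell in the same cluster
  together <- matrix(0, n, n, dimnames = list(colnames(x), colnames(x)))
  # sampled[i, j]: number of runs in which items i and j were both subsampled
  sampled <- together
  for (r in seq_len(reps)) {
    idx <- sort(sample(n, floor(pItem * n)))          # subsample a fraction pItem of the items
    d_r <- dist(t(x[, idx]))                          # Euclidean distance between items (assumption)
    cl <- cutree(hclust(d_r, method = "average"), k)  # cluster the subsample into k groups
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  # consensus value = co-clustered runs / co-sampled runs (0 if a pair was never co-sampled)
  together / pmax(sampled, 1)
}

cm <- toy_consensus(x, k = 2)
round(cm[1:3, c(1, 2, 7)], 2)
```

On data this cleanly separated, pairs from the same group should have consensus values near 1 and pairs from different groups values near 0. Real expression data gives intermediate values, which is what the CDF and Delta area plots summarize.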

# if (!require("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# 
# BiocManager::install("ConsensusClusterPlus")

### 1. Prepare the data
## Rows are features, columns are samples
library(ALL)
data(ALL)
d=exprs(ALL)
d[1:5,1:5]

# Keep the 5000 probes with the largest median absolute deviation (MAD)
mads=apply(d,1,mad)
d=d[rev(order(mads))[1:5000],]
# order(mads): returns the indices that sort mads in ascending order
# rev(order(mads)): the same indices reversed, i.e. descending order

d = sweep(d,1, apply(d,1,median,na.rm=T))
# sweep:Return an array obtained from an input array 
# by sweeping out a summary statistic.
# Here: subtract each row's median from that row of the input matrix.
# For example, the first row becomes d[1,]-median(d[1,])

### 2. Run consensus clustering
library(ConsensusClusterPlus)
output_dir="/Users/zhengxueming/test/test0705"
results = ConsensusClusterPlus(d,maxK=6,reps=50,pItem=0.8,pFeature=1,
                               title=output_dir,clusterAlg="hc",distance="pearson",
                               seed=1213,plot="png")
# str(results)
# str(results[[2]])

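## The consensus matrix plot for each K and the evaluation plots are written to output_dir
# Choose the best K from the consensus CDF and Delta area plots: for each K > 2, the Delta area
# plot shows the relative change in the area under the CDF curve between K and K-1 (for K=2 it
# shows the area itself); pick the K beyond which the area no longer increases appreciably
# tracking plot: columns are samples and rows are the values of K; colors show each sample's
# cluster at each K, for a qualitative look at unstable clusters and unstable samples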

# the top ten rows and columns of results for k=2:
results[[2]][["consensusMatrix"]][1:10,1:10]

# Inspect the colors assigned to the clusters (here for k=6)
results[[6]][["clrs"]]

#consensusTree - hclust object 
results[[2]][["consensusTree"]]


### 3. Compute cluster-consensus and item-consensus
# calculating cluster-consensus and item-consensus.
icl = calcICL(results,title=output_dir,plot="png")
# PNG files whose names start with icl are written to output_dir
# icl is a list with the elements "clusterConsensus" and "itemConsensus"
icl[["clusterConsensus"]]
icl[["itemConsensus"]][1:5,]


### 4. Choose a suitable K (K = 5 here) and build a data frame of the per-sample cluster assignments
sample_cluster <- results[[5]]$consensusClass

sample_cluster_df <- data.frame(sample = names(sample_cluster),
                                cluster = sample_cluster)
head(sample_cluster_df)
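
The Delta area rule mentioned in step 2 can also be sketched numerically. The helpers below, cdf_area and delta_area, are hypothetical names written for this post, not functions of the package, and the area is only a grid-based approximation of what the package plots. The matrix and the area values are made-up inputs.

```r
# Approximate area under the empirical CDF of the pairwise consensus values.
cdf_area <- function(cm, grid = seq(0, 1, by = 0.01)) {
  v <- cm[lower.tri(cm)]   # one consensus value per pair of samples
  mean(ecdf(v)(grid))      # average CDF height over a grid on [0, 1]
}

# Relative change in area from K-1 to K; the first entry (K = 2) is the area itself.
delta_area <- function(areas) {
  c(areas[1], diff(areas) / areas[-length(areas)])
}

# Hypothetical 4-sample consensus matrix with two perfectly stable clusters
cm <- matrix(c(1, 1, 0, 0,
               1, 1, 0, 0,
               0, 0, 1, 1,
               0, 0, 1, 1), nrow = 4, byrow = TRUE)
cdf_area(cm)

# Hypothetical areas for K = 2, 3, 4
delta_area(c(0.5, 0.6, 0.63))
```

With the results object from step 2, the same helpers could be applied to results[[k]][["consensusMatrix"]] for k from 2 to 6 and the output compared with the Delta area plot in output_dir. The numbers need not match the plot exactly because of the grid approximation.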

Reference

https://www.bioconductor.org/packages/release/bioc/vignettes/ConsensusClusterPlus/inst/doc/ConsensusClusterPlus.pdf
 

