profile聚类

最新推荐文章于 2024-08-14 11:22:43 发布

weixin_43231976

最新推荐文章于 2024-08-14 11:22:43 发布

阅读量91

点赞数

分类专栏： DMP 文章标签：聚类机器学习算法

本文链接：https://blog.csdn.net/weixin_43231976/article/details/124960715

版权

DMP 专栏收录该内容

2 篇文章 0 订阅

订阅专栏

针对profile的处理

#峰峦图绘制

set.seed(10)#使生成的十个数据前后一致
n <- 10
palette <- distinctColorPalette(n)#生成差异明显的十个颜色
g <- ggplot(dag3_species_plot, aes(x = Relative, y = Level,fill=Level,color=Level)) + theme_classic() +
  geom_density_ridges(alpha=0.8,scale = 2)+scale_fill_manual(values = palette) +
  scale_color_manual(values = palette)+ylab("")+xlab("Norm. Rel. Abundance (-log2)")
print(g)

==================================

species clustering using PAM

==================================

R语言聚类包括：
Kmeans函数：kmeans聚类
pam函数：PAM聚类
hclust函数：层次聚类
cutree函数：层次聚类解
Mclust函数：EM聚类
mclustBIC函数：EM聚类	
PAM（Partition Around Medoids）是K-medoid（K中心点划分）的基础算法，基本流程如下：
     首先随机选择k个对象作为中心，把每个对象分配给离它最近的中心。然后随机地选择一个非中心对象替换中心对
象，计算分配后的距离改进量。聚类的过程就是不断迭代，进行中心对象和非中心对象的反复替换过程，直到目标函
数不再有改进为止。

dag3_species=as.data.frame(t(dag3_species))
dag3_species.dist=dist.JSD(dag3_species)#计算两列之间的距离
dag3_species.cluster=pam.clustering(dag3_species.dist, k=3)

#转置后的dag3_species是以species为行，sample为列.
#dist.JSD（Jenson shannon距离）计算两列之间的距离，形成2000x2000的距离矩阵
> dim(dag3_species.dist)
[1] 2000 2000
#pam.clustering进行聚类，返回每个样本的聚类结果。
>print(dag3_species.cluster)
 [1] 1 1 2 1 1 2 2 2 2 1 2 2 2 3 1 3 2 2 1 1 2 3 1 2 1 3 3 1 1 1

# > perform clustering
nclusters = index.G1(t(dag3_species), dag3_species.cluster, d = dag3_species.dist, centrotypes = "medoids")
nclusters=NULL

# > calculate CH index to identify optimal number of clusters
for (k in 1:10) {
  if (k==1) {
    nclusters[k]=NA
  } else {
    dag3_species.cluster_temp=pam.clustering(dag3_species.dist, k)
    nclusters[k]=index.G1(t(dag3_species),dag3_species.cluster_temp,  d = dag3_species.dist,
                          centrotypes = "medoids")
  }
}
plot(nclusters, type="h", xlab="k clusters", ylab="CH index")

cluster=data.frame(row.names = colnames(dag3_species),Cluster=dag3_species.cluster)
write.table(cluster,"microbiome_clustering/plots/MockData_species_PAM.txt",row.names = T,quote = F,sep = "\t")
dag3_species=as.data.frame(t(dag3_species))

解释：
原理：CH index的数值越大越好。数值越小，组间协方差越小，组与组之间边界不明显，类别内部数据的协方差越小越好，类别之间的协方差越大越好，这样的Calinski-Harabasz分数会高。
代码：利用循环计算index.G1，以确定最优的聚类个数
>nclusters
[1]        NA 312.78338 177.74214 111.35853  76.67199  45.74820
[7]  37.86139  36.48413  32.82202  29.79001

参考博客：
1.K-means和PAM聚类算法Python实现及对比
2.基于sklearn的聚类算法的聚类效果指标

weixin_43231976

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
profile聚类

针对profile的处理#峰峦图绘制set.seed(10)#使生成的十个数据前后一致n <- 10palette <- distinctColorPalette(n)#生成差异明显的十个颜色g <- ggplot(dag3_species_plot, aes(x = Relative, y = Level,fill=Level,color=Level)) + theme_classic() + geom_density_ridges(alpha=0.8,scale = 2
复制链接

扫一扫