针对profile的处理
#峰峦图绘制
set.seed(10)#使生成的十个数据前后一致
n <- 10
palette <- distinctColorPalette(n)#生成差异明显的十个颜色
g <- ggplot(dag3_species_plot, aes(x = Relative, y = Level,fill=Level,color=Level)) + theme_classic() +
geom_density_ridges(alpha=0.8,scale = 2)+scale_fill_manual(values = palette) +
scale_color_manual(values = palette)+ylab("")+xlab("Norm. Rel. Abundance (-log2)")
print(g)
==================================
species clustering using PAM
==================================
R语言聚类包括:
Kmeans函数:kmeans聚类
pam函数:PAM聚类
hclust函数:层次聚类
cutree函数:层次聚类解
Mclust函数:EM聚类
mclustBIC函数:EM聚类
PAM(Partition Around Medoids)是K-medoid(K中心点划分)的基础算法,基本流程如下:
首先随机选择k个对象作为中心,把每个对象分配给离它最近的中心。然后随机地选择一个非中心对象替换中心对
象,计算分配后的距离改进量。聚类的过程就是不断迭代,进行中心对象和非中心对象的反复替换过程,直到目标函
数不再有改进为止。
dag3_species=as.data.frame(t(dag3_species))
dag3_species.dist=dist.JSD(dag3_species)#计算两列之间的距离
dag3_species.cluster=pam.clustering(dag3_species.dist, k=3)
#转置后的dag3_species是以species为行,sample为列.
#dist.JSD(Jenson shannon距离)计算两列之间的距离,形成2000x2000的距离矩阵
> dim(dag3_species.dist)
[1] 2000 2000
#pam.clustering进行聚类,返回每个样本的聚类结果。
>print(dag3_species.cluster)
[1] 1 1 2 1 1 2 2 2 2 1 2 2 2 3 1 3 2 2 1 1 2 3 1 2 1 3 3 1 1 1
# > perform clustering
nclusters = index.G1(t(dag3_species), dag3_species.cluster, d = dag3_species.dist, centrotypes = "medoids")
nclusters=NULL
# > calculate CH index to identify optimal number of clusters
for (k in 1:10) {
if (k==1) {
nclusters[k]=NA
} else {
dag3_species.cluster_temp=pam.clustering(dag3_species.dist, k)
nclusters[k]=index.G1(t(dag3_species),dag3_species.cluster_temp, d = dag3_species.dist,
centrotypes = "medoids")
}
}
plot(nclusters, type="h", xlab="k clusters", ylab="CH index")
cluster=data.frame(row.names = colnames(dag3_species),Cluster=dag3_species.cluster)
write.table(cluster,"microbiome_clustering/plots/MockData_species_PAM.txt",row.names = T,quote = F,sep = "\t")
dag3_species=as.data.frame(t(dag3_species))
解释:
原理:CH index的数值越大越好。数值越小,组间协方差越小,组与组之间边界不明显,类别内部数据的协方差越小越好,类别之间的协方差越大越好,这样的Calinski-Harabasz分数会高。
代码:利用循环计算index.G1,以确定最优的聚类个数
>nclusters
[1] NA 312.78338 177.74214 111.35853 76.67199 45.74820
[7] 37.86139 36.48413 32.82202 29.79001