cluster soil traits

Find a suitable cluster method

In GWAS analysis, both highly positve and negative related traits will return us the similar results, such as below (a vs b: 0.996, a vs c: -0.976),what we want to do is randomly remain one from (a, b, c).

abc


But just image a simple situation as below when we do clustering from data both with high positive and negative correlation coefficient:

abc
a10.90.9
b0.910.9
c0.90.91
def
d10.9-0.9
e0.91-0.9
f-0.9-0.91
x<-c(0.9,0.9,0.9)
dist(x)
  1 2
2 0  
3 0 0

y<-c(0.9,0.9,-0.9)
dist(y)
    1   2
2 0.0    
3 1.8 1.8

Actually, we want (d,e,f) cluster together for the selection of GWAS traits, but the cluster method based on distance cannot consider this.
So, the first thing we need to do is use the absolute value. Then, how to find the most suitable cluster method?
Luckily, this course give us the solution: https://www.datanovia.com/en/courses/cluster-validation-essentials/

1. choosing the best clustering algorithms and optimal number of clusters (Internal measures)

library(clValid)
corm<-as.matrix(read.table("pro_alti_rm.cor"))
rcorm<-abs(corm)
clmethods<-c("hierarchical","kmeans","pam") ##pam represent kmediods
intern <- clValid(rcorm, nClust = 2:30, clMethods = clmethods, validation = "internal") ## internal measures for comparing 

Optimal Scores:

             Score   Method       Clusters
Connectivity 10.3996 hierarchical 2       
Dunn          0.2259 hierarchical 29      
Silhouette    0.4277 kmeans       2     ## kmediods also with minimum silhouette in 2 clusters

So, in internal measures, suggest kmeans with 2 cluster or hierarchical with 2 or 29 cluster.

2. Clustering with hierarchical in k=29

library(pheatmap)
corm<-as.matrix(read.table("pro_alti_rm.cor"))
rcorm<-abs(corm)
t<-pheatmap(rcorm,silent = TRUE)
png(file="soil_hclu.png",height=700,width=700)
pheatmap(rcorm,show_rownames=FALSE,show_colnames=FALSE,cutree_cols=29,legend=FALSE,na_col=NA,border_color=NA,annotation_col =data.frame(type=cutree(t$tree_row,k=29)),annotation_legend = FALSE,annotation_names_col=FALSE)
dev.off()

hclu with k=29As we can see, the clusters of diagonal is what we want.

3. randomly select one trait in a cluster

clu<-data.frame(trait=colnames(rcorm),type=cutree(t$tree_col,k=29))
for (i in 1:29)  # randomly select one trait in a cluster
 {
     print (as.character(sample(clu[which(clu$type==i),1],1)))
 }

4. combine the manhattan plot

#!usr/bin/python3
import math
from PIL import Image

column = 5
width = 3600
height = 2400
size = (3600, 2400)

file="random_k29.trait"
with open(file) as f:
    list_im=f.read().splitlines()

imgs = [Image.open(i) for i in list_im]

row_num = math.ceil(len(imgs)/column)
target = Image.new('RGB', (width*column, height*row_num))
for i in range(len(list_im)):
    if i % column == 0:
        end = len(list_im) if i + column > len(list_im) else i + column
        for col, image in enumerate(imgs[i:i+column]):
            target.paste(image, (width*col, height*(i//column),
                                 width*(col + 1), height*(i//column + 1)))
target.show()
target.save('random_k29_combine.png')

manhattan plot

5. Stability measures

stab <- clValid(rcorm, nClust = 2:30, clMethods = clmethods, validation = "stability")
summary(stab)

Optimal Scores:

    Score  Method Clusters
APN 0.0000 kmeans 2       
AD  1.2498 pam    30      
ADM 0.0000 kmeans 2       
FOM 0.0735 pam    30  

Suggest kmediods with 30 or kmeans with 2.

6. k-mediods analysis

library(cluster)
cor_pam<-pam(rcorm,30)
rownames(cor_pam$medoids) #centroid trait of k-medoids
layout(matrix(c(1,2),1,2)) #show two plot in one page
plot(pam(rcorm,30))
layout(matrix(1))

在这里插入图片描述

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值