Find a suitable clustering method
In GWAS analysis, traits that are highly correlated, whether positively or negatively (for example a vs b: 0.996, a vs c: -0.976), give us nearly identical results. What we want to do is randomly keep just one trait from such a group (a, b, c).
But imagine a simple situation like the one below, where we cluster data that contain both high positive and high negative correlation coefficients:
 | a | b | c
---|---|---|---
a | 1 | 0.9 | 0.9
b | 0.9 | 1 | 0.9
c | 0.9 | 0.9 | 1
 | d | e | f
---|---|---|---
d | 1 | 0.9 | -0.9
e | 0.9 | 1 | -0.9
f | -0.9 | -0.9 | 1
x <- c(0.9, 0.9, 0.9)    # three correlations with the same sign
dist(x)
  1 2
2 0
3 0 0
y <- c(0.9, 0.9, -0.9)   # the same magnitudes, but one correlation with the opposite sign
dist(y)
    1   2
2 0.0
3 1.8 1.8
Actually, for selecting GWAS traits we want (d, e, f) to cluster together, but a clustering method based on raw distances cannot capture this.
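Taking the absolute value of the correlations removes this artefact. A quick check on the toy vector from above (a minimal sketch):
abs(y)         # 0.9 0.9 0.9
dist(abs(y))   # all pairwise distances are now 0, so the three traits can cluster together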
So the first thing we need to do is work with the absolute values of the correlation matrix. Then, how do we find the most suitable clustering method?
Luckily, this course gives us the solution: https://www.datanovia.com/en/courses/cluster-validation-essentials/
1. Choosing the best clustering algorithm and the optimal number of clusters (internal measures)
library(clValid)
corm <- as.matrix(read.table("pro_alti_rm.cor"))
rcorm <- abs(corm)                                  # cluster on absolute correlations
clmethods <- c("hierarchical", "kmeans", "pam")     # pam stands for k-medoids
intern <- clValid(rcorm, nClust = 2:30, clMethods = clmethods, validation = "internal")  # internal validation measures
summary(intern)
Optimal Scores:
Score Method Clusters
Connectivity 10.3996 hierarchical 2
Dunn 0.2259 hierarchical 29
Silhouette 0.4277 kmeans 2 ## k-medoids also reaches its best silhouette at 2 clusters
So the internal measures suggest k-means with 2 clusters, or hierarchical clustering with 2 or 29 clusters.
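To see how each measure behaves across the whole range of cluster numbers rather than only at the optimum, the clValid object can be inspected further (a sketch; optimalScores() and plot() are methods provided by the clValid package):
optimalScores(intern)   # the optimal-score table above, as a data frame
plot(intern)            # each validation measure against the number of clusters, one line per method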
2. Hierarchical clustering with k = 29
library(pheatmap)
corm <- as.matrix(read.table("pro_alti_rm.cor"))
rcorm <- abs(corm)
t <- pheatmap(rcorm, silent = TRUE)   # run once silently just to get the dendrograms
png(file = "soil_hclu.png", height = 700, width = 700)
pheatmap(rcorm, show_rownames = FALSE, show_colnames = FALSE, cutree_cols = 29,
         legend = FALSE, na_col = NA, border_color = NA,
         annotation_col = data.frame(type = cutree(t$tree_row, k = 29)),   # colour each column by its cluster
         annotation_legend = FALSE, annotation_names_col = FALSE)
dev.off()
As we can see, the clusters along the diagonal are what we want.
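Before sampling, it is worth checking how many traits fall into each of the 29 clusters; a small sketch using the same dendrogram:
table(cutree(t$tree_row, k = 29))   # number of traits per cluster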
3. Randomly select one trait from each cluster
clu <- data.frame(trait = colnames(rcorm), type = cutree(t$tree_col, k = 29))
for (i in 1:29) {   # randomly select one trait from each cluster
  print(as.character(sample(clu[which(clu$type == i), 1], 1)))
}
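The same selection can also be done in a single call; fixing the seed (the value here is arbitrary, purely an assumption for reproducibility) makes the random pick repeatable:
set.seed(1)   # arbitrary seed, only so the random selection is reproducible
picked <- sapply(split(as.character(clu$trait), clu$type), sample, size = 1)
picked        # one randomly chosen trait per cluster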
4. Combine the Manhattan plots
#!/usr/bin/env python3
import math
from PIL import Image

column = 5                    # plots per row in the combined figure
width = 3600                  # width of a single Manhattan plot
height = 2400                 # height of a single Manhattan plot
size = (width, height)
file = "random_k29.trait"     # one Manhattan plot image path per line

with open(file) as f:
    list_im = f.read().splitlines()

imgs = [Image.open(i).resize(size) for i in list_im]   # scale every plot to a common size
row_num = math.ceil(len(imgs) / column)

target = Image.new('RGB', (width * column, height * row_num))
for i, image in enumerate(imgs):
    # place plot i at column (i % column), row (i // column) of the grid
    target.paste(image, (width * (i % column), height * (i // column)))

target.show()
target.save('random_k29_combine.png')
5. Stability measures
stab <- clValid(rcorm, nClust = 2:30, clMethods = clmethods, validation = "stability")
summary(stab)
Optimal Scores:
Score Method Clusters
APN 0.0000 kmeans 2
AD 1.2498 pam 30
ADM 0.0000 kmeans 2
FOM 0.0735 pam 30
The stability measures suggest k-medoids with 30 clusters or k-means with 2 clusters.
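As with the internal measures, the stability measures can also be viewed across the whole range of cluster numbers (a sketch; for all four measures, APN, AD, ADM and FOM, smaller values are better):
plot(stab)   # APN, AD, ADM and FOM against the number of clusters, one line per method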
6. k-medoids analysis
library(cluster)
cor_pam <- pam(rcorm, 30)
rownames(cor_pam$medoids)       # the medoid (representative) trait of each k-medoids cluster
layout(matrix(c(1, 2), 1, 2))   # show the two diagnostic plots side by side on one page
plot(cor_pam)                   # cluster plot and silhouette plot
layout(matrix(1))
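To use the k-medoids result in the same way as step 3, the cluster membership can be taken from the pam object and one trait sampled per cluster (a sketch; the seed is again an arbitrary choice for reproducibility):
clu_pam <- data.frame(trait = rownames(rcorm), type = cor_pam$clustering)
set.seed(1)   # arbitrary seed
sapply(split(as.character(clu_pam$trait), clu_pam$type), sample, size = 1)   # one trait per k-medoids cluster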