原始数据:13个品牌的鼠标参数,如下表。
brand Touch Chips Driver Compatibility Game
1 Brand1 7.5 17.5 7.0 8.0 8.0
2 Brand2 7.5 19.5 7.0 7.0 9.0
3 Brand3 8.5 18.0 8.5 8.0 9.5
4 Brand4 9.0 18.5 8.5 8.0 9.5
5 Brand5 7.0 14.0 6.5 7.0 7.5
6 Brand6 7.0 16.0 6.5 7.5 8.0
7 Brand7 7.5 17.0 8.0 7.5 8.0
8 Brand8 8.0 17.5 8.5 7.5 8.5
9 Brand9 7.0 16.5 6.0 8.0 7.0
10 Brand10 7.5 17.0 7.5 8.5 8.0
11 Brand11 8.0 16.0 6.5 7.0 7.0
12 Brand12 7.0 15.5 6.0 8.0 7.0
13 Brand13 7.5 17.0 8.0 7.0 7.0
一、层次聚类分析:
这里采用Ward方法。
步骤如下:
1、数据读入,
> m<-read.csv(choose.files())
2、选择数据,
> library(dplyr)
> m<-select(m,Touch:Game)
> m
Touch Chips Driver Compatibility Game
1 7.5 17.5 7.0 8.0 8.0
2 7.5 19.5 7.0 7.0 9.0
3 8.5 18.0 8.5 8.0 9.5
4 9.0 18.5 8.5 8.0 9.5
5 7.0 14.0 6.5 7.0 7.5
6 7.0 16.0 6.5 7.5 8.0
7 7.5 17.0 8.0 7.5 8.0
8 8.0 17.5 8.5 7.5 8.5
9 7.0 16.5 6.0 8.0 7.0
10 7.5 17.0 7.5 8.5 8.0
11 8.0 16.0 6.5 7.0 7.0
12 7.0 15.5 6.0 8.0 7.0
13 7.5 17.0 8.0 7.0 7.0
3、数据标准化
> m.s<-scale(m)
> m.s
Touch Chips Driver Compatibility Game
[1,] -0.1868322 0.41293963 -0.2836931 0.7595545 0.0000000
[2,] -0.1868322 1.84446367 -0.2836931 -1.2152872 1.0954451
[3,] 1.4323802 0.77082064 1.2968829 0.7595545 1.6431677
[4,] 2.2419864 1.12870165 1.2968829 0.7595545 1.6431677
[5,] -0.9964384 -2.09222745 -0.8105518 -1.2152872 -0.5477226
[6,] -0.9964384 -0.66070340 -0.8105518 -0.2278664 0.0000000
[7,] -0.1868322 0.05505862 0.7700242 -0.2278664 0.0000000
[8,] 0.6227740 0.41293963 1.2968829 -0.2278664 0.5477226
[9,] -0.9964384 -0.30282239 -1.3374105 0.7595545 -1.0954451
[10,] -0.1868322 0.05505862 0.2431655 1.7469754 0.0000000
[11,] 0.6227740 -0.66070340 -0.8105518 -1.2152872 -1.0954451
[12,] -0.9964384 -1.01858441 -1.3374105 0.7595545 -1.0954451
[13,] -0.1868322 0.05505862 0.7700242 -1.2152872 -1.0954451
attr(,"scaled:center")
Touch Chips Driver Compatibility Game
7.615385 16.923077 7.269231 7.615385 8.000000
attr(,"scaled:scale")
Touch Chips Driver Compatibility Game
0.6175842 1.3971124 0.9490211 0.5063697 0.9128709
4、计算相异度
> m.dist<-dist(m.s) #返回包含相异度信息的特殊类,传给下面的hclust()函数做聚类分析,必不可少的步骤,
> class(m.dist)
[1] "dist"
> m.dist
1 2 3 4 5 6 7 8 9 10 11 12
2 2.6738102
3 2.8192460 3.2361673
4 3.4073003 3.6207314 0.8851785
5 3.3777070 4.3738518 5.2189544 5.8312143
6 1.7495004 3.0633159 4.0080796 4.6696517 1.8232556
7 1.4877497 2.5470160 2.6620939 3.3172954 3.0065795 1.9146796
8 2.1044436 2.5451805 1.7200410 2.3041623 3.9387805 2.9181911 1.1666709
9 1.8649659 3.8828623 4.6358786 5.1938577 2.7712002 1.6064429 2.7202614 3.7079193
10 1.1750144 3.6680445 2.8141578 3.4405217 3.9308549 2.4855779 2.0439129 2.4685747 2.3359712
11 2.6806252 3.4653866 4.3065072 4.6545784 2.2295986 2.1901709 2.4168157 3.0445017 2.6320159 3.5004425
12 2.2394293 4.3200494 4.8518689 5.4348628 2.3728231 1.6064429 2.9024908 3.9096818 0.7157620 2.5458578 2.6320159
13 2.5176177 3.0186571 3.8486029 4.3277534 2.8398099 2.4168157 1.4747881 2.1762631 3.0207321 3.2019650 1.9146796 3.1858206
转化成矩阵,结果同上,对角线为0。
> as.matrix(m.dist)
1 2 3 4 5 6 7 8 9 10 11 12 13
1 0.000000 2.673810 2.8192460 3.4073003 3.377707 1.749500 1.487750 2.104444 1.864966 1.175014 2.680625 2.239429 2.517618
2 2.673810 0.000000 3.2361673 3.6207314 4.373852 3.063316 2.547016 2.545181 3.882862 3.668045 3.465387 4.320049 3.018657
3 2.819246 3.236167 0.0000000 0.8851785 5.218954 4.008080 2.662094 1.720041 4.635879 2.814158 4.306507 4.851869 3.848603
4 3.407300 3.620731 0.8851785 0.0000000 5.831214 4.669652 3.317295 2.304162 5.193858 3.440522 4.654578 5.434863 4.327753
5 3.377707 4.373852 5.2189544 5.8312143 0.000000 1.823256 3.006579 3.938781 2.771200 3.930855 2.229599 2.372823 2.839810
6 1.749500 3.063316 4.0080796 4.6696517 1.823256 0.000000 1.914680 2.918191 1.606443 2.485578 2.190171 1.606443 2.416816
7 1.487750 2.547016 2.6620939 3.3172954 3.006579 1.914680 0.000000 1.166671 2.720261 2.043913 2.416816 2.902491 1.474788
8 2.104444 2.545181 1.7200410 2.3041623 3.938781 2.918191 1.166671 0.000000 3.707919 2.468575 3.044502 3.909682 2.176263
9 1.864966 3.882862 4.6358786 5.1938577 2.771200 1.606443 2.720261 3.707919 0.000000 2.335971 2.632016 0.715762 3.020732
10 1.175014 3.668045 2.8141578 3.4405217 3.930855 2.485578 2.043913 2.468575 2.335971 0.000000 3.500443 2.545858 3.201965
11 2.680625 3.465387 4.3065072 4.6545784 2.229599 2.190171 2.416816 3.044502 2.632016 3.500443 0.000000 2.632016 1.914680
12 2.239429 4.320049 4.8518689 5.4348628 2.372823 1.606443 2.902491 3.909682 0.715762 2.545858 2.632016 0.000000 3.185821
13 2.517618 3.018657 3.8486029 4.3277534 2.839810 2.416816 1.474788 2.176263 3.020732 3.201965 1.914680 3.185821 0.000000
5、执行层次聚类分析
> m.h<-hclust(m.dist,method='ward.D2')
> class(m.h)
[1] "hclust"
> m.h
Call:
hclust(d = m.dist, method = "ward.D2")
Cluster method : ward.D2
Distance : euclidean
Number of objects: 13
> str(m.h) #structrue函数
List of 7
$ merge : int [1:12, 1:2] -9 -3 -7 -1 -6 -11 3 -5 -2 6 ...
$ height : num [1:12] 0.716 0.885 1.167 1.175 1.808 ...
$ order : int [1:13] 11 13 5 6 9 12 3 4 2 7 ...
$ labels : NULL
$ method : chr "ward.D2"
$ call : language hclust(d = m.dist, method = "ward.D2")
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
> m.h$merge
[,1] [,2]
[1,] -9 -12
[2,] -3 -4
[3,] -7 -8
[4,] -1 -10
[5,] -6 1
[6,] -11 -13
[7,] 3 4
[8,] -5 5
[9,] -2 7
[10,] 6 8
[11,] 2 9
[12,] 10 11
6、树状图(Dendomgram)呈现
> m.cnum<-cutree(m.h,k=4)
> m.cnum
[1] 1 1 2 2 3 3 1 1 3 1 4 3 4
> plot(m.h)![在这里插入图片描述](https://img-blog.csdnimg.cn/2019122012043216.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3dlaXhpbl80NTM4NzE2MA==,size_16,color_FFFFFF,t_70)
> rect.hclust(m.h,k=3)
7、可视化进阶(待完善)
> install.packages('dendextend')
> library(dendextend).
> m.dend<-as.dendrogram(m.h) # 聚类分析结果转化为dendrogram类
> install.packages('colorspace') #
> library(colorspace)
> m.dend<-color_branches(m.dend,k=4)
> m.dend
'dendrogram' with 2 branches and 13 members total, at height 7.174631
> class(m.dend)
[1] "dendrogram"
> labels_colors(m.dend)<-rainbow_hcl(4)[sort_levels_values(m.cnum[order.dendrogram(m.dend)])]#使品牌标签变为与各自聚类相同的颜色
> plot(m.dend)
> install.packages('gplots') #绘制热图捕捉各个聚类的视觉特征,使用heatmap.2()函数
> library(gplots)
> heatmap.2(m.s,main='不同品牌鼠标属性热 图',dendrogram='row',Rowv=m.dend,Colv=FALSE,col=colorspace::diverge_hcl(256),RowSideColors=labels_colors(m.dend)[order(order.dendrogram(m.dend))]) #注意参数m.s,是用来做聚类分析的标准化数据,并非dist()函数转化的结果。
二、非层次聚类分析步骤:(为何结果每次不同)
这里采用K-means方法。步骤如下:
> set.seed(100)
> m.km<-kmeans(m.s,center=4)
> m.km
K-means clustering with 4 clusters of sizes 2, 4, 5, 2
Cluster means:
Touch Chips Driver Compatibility Game
1 -0.18683220 0.2339991 -0.0202638 1.2532650 0.0000000
2 0.01556935 0.5918801 0.6383096 -0.7215768 0.1369306
3 -0.67259591 -0.9470082 -1.0212953 -0.2278664 -0.7668116
4 1.83718328 0.9497611 1.2968829 0.7595545 1.6431677
Clustering vector:
[1] 1 2 4 4 3 3 2 2 3 1 3 3 2
Within cluster sum of squares by cluster:
[1] 0.6903294 7.5874419 9.1861416 0.3917705
(between_SS / total_SS = 70.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
> m.km$cluster
[1] 1 2 4 4 3 3 2 2 3 1 3 3 2
> tapply(row.names(m),m.km$cluster,unique) #显示各聚类包含的鼠标品牌
$`1`
[1] "1" "10"
$`2`
[1] "2" "7" "8" "13"
$`3`
[1] "5" "6" "9" "11" "12"
$`4`
[1] "3" "4"
> install.packages('GGally')
> library(GGally)
> m.km.df<-as.data.frame(cbind(m.s,cluster=m.km$cluster))
> m.km.df
Touch Chips Driver Compatibility Game cluster
1 -0.1868322 0.41293963 -0.2836931 0.7595545 0.0000000 1
2 -0.1868322 1.84446367 -0.2836931 -1.2152872 1.0954451 2
3 1.4323802 0.77082064 1.2968829 0.7595545 1.6431677 4
4 2.2419864 1.12870165 1.2968829 0.7595545 1.6431677 4
5 -0.9964384 -2.09222745 -0.8105518 -1.2152872 -0.5477226 3
6 -0.9964384 -0.66070340 -0.8105518 -0.2278664 0.0000000 3
7 -0.1868322 0.05505862 0.7700242 -0.2278664 0.0000000 2
8 0.6227740 0.41293963 1.2968829 -0.2278664 0.5477226 2
9 -0.9964384 -0.30282239 -1.3374105 0.7595545 -1.0954451 3
10 -0.1868322 0.05505862 0.2431655 1.7469754 0.0000000 1
11 0.6227740 -0.66070340 -0.8105518 -1.2152872 -1.0954451 3
12 -0.9964384 -1.01858441 -1.3374105 0.7595545 -1.0954451 3
13 -0.1868322 0.05505862 0.7700242 -1.2152872 -1.0954451 2
> class(m.km.df)
[1] "data.frame"
> m.km.df$cluster<-factor(m.km.df$cluster)
> ggparcoord(m.km.df,columns=1:6,groupColumn='cluster')+facet_wrap(~cluster)+theme_bw()