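Experiment task:
Using the 16 factors given in Table 1, perform a cluster analysis of the development status of China's 31 regions.
- Perform hierarchical clustering with six between-cluster distances and draw the dendrograms
- Apply k-means clustering to the 16 factors to analyze the development status of China's 31 regions
Experiment steps:
1. Hierarchical clustering with six between-cluster distances, with dendrograms
① Read the data from the table above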
library(openxlsx)
# Read the worksheet once; with rowNames=TRUE the first column (region names) becomes the row names
X=read.xlsx("C:\\Users\\Dell\\Desktop\\多元统计和r语言\\shiyansi.xlsx",rowNames=TRUE)
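② Rename the columns and compute the distance matrix
colnames(X)=c("y","x1","x2","x3","x4","x5","x6","x7","x8","x9","x10","x11","x12",
"x13","x14","x15","x16")
# Euclidean distances between regions (the p argument only applies to method="minkowski")
D=dist(X,method="euclidean")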
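As a small illustration of what `dist` returns, the sketch below uses a made-up 3×2 matrix (the names `toy`, `v1`, `v2` are hypothetical, not the Table 1 data) and checks one entry by hand. Euclidean distance is dominated by the variables with the largest numeric range, so standardizing with `scale` before calling `dist` is a common choice; whether that is appropriate here depends on the units used in Table 1.

```r
# Toy data: three hypothetical regions measured on two variables.
toy <- matrix(c(0, 0,
                3, 4,
                6, 8),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("v1", "v2")))

d <- dist(toy, method = "euclidean")  # lower-triangle distance object
m <- as.matrix(d)                     # full symmetric matrix for inspection
print(m)

# Distance between A and B is sqrt(3^2 + 4^2) = 5.
m["A", "B"]
```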
③ Single linkage (nearest neighbor), using Euclidean distance
a=hclust(D,method = "single")
plot(a)
④ Complete linkage (farthest neighbor), using Euclidean distance
b=hclust(D,method = "complete")
plot(b)
⑤ Ward's method (Euclidean distance)
c=hclust(D,method="ward.D2")
plot(c)
⑥ Median method
d=hclust(D,method="median")
plot(d)
⑦ Average linkage (Euclidean distance)
e=hclust(D,method="average")
plot(e)
⑧ Centroid method (Euclidean distance)
f=hclust(D,method="centroid")
plot(f)
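The six fits above can also be produced in one pass and compared side by side. The following is a minimal sketch under stated assumptions: it uses R's built-in `USArrests` data as a stand-in for Table 1, standardizes it first, and cuts every tree into 3 groups (an arbitrary choice for illustration). The names `dat`, `D0`, `trees`, `groups`, and `sizes` are hypothetical. R's `hclust` documentation notes that the centroid and median methods are intended for squared Euclidean distances; the sketch keeps plain Euclidean distances to mirror the steps above.

```r
# Stand-in data: the built-in USArrests set (50 states, 4 variables),
# standardized so that no single variable dominates the distances.
dat <- scale(USArrests)
D0  <- dist(dat, method = "euclidean")

methods <- c("single", "complete", "ward.D2", "median", "average", "centroid")

# Fit one tree per linkage method and cut each into 3 groups.
trees <- lapply(methods, function(m) hclust(D0, method = m))
names(trees) <- methods
groups <- sapply(trees, cutree, k = 3)   # 50 x 6 matrix of cluster labels

# Cluster sizes per method, largest first; very unbalanced sizes hint at chaining.
sizes <- apply(groups, 2, function(g) sort(tabulate(g, nbins = 3), decreasing = TRUE))
print(sizes)

# Draw all six dendrograms on one page for a side-by-side comparison.
op <- par(mfrow = c(2, 3), mar = c(2, 4, 2, 1))
for (m in methods) plot(trees[[m]], main = m, xlab = "", sub = "", cex = 0.5)
par(op)
```

Comparing the `sizes` table across columns is a quick way to see which linkage rules produce one dominant cluster plus a few outliers and which produce more balanced groups.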
2. Apply k-means clustering to the 16 factors to analyze the development status of China's 31 regions
① Install and load the packages
#install.packages("factoextra")
#install.packages("cluster")
#install.packages("NbClust")
#install.packages("dplyr")
#install.packages("pacman")
# Load the packages
library(factoextra)
library(dplyr)
library(cluster)
library(pacman)
library(NbClust)
names(X)
② Standardize the data
X.scaled<-scale(X[2:17])   # columns 2 to 17 are the 16 factors x1..x16
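To make the effect of `scale` concrete, here is a small sketch on made-up numbers (the data frame `toy` and its columns `gdp` and `rate` are hypothetical). It shows that each column ends up with mean 0 and standard deviation 1, and that the centering and scaling constants are stored as attributes.

```r
# Toy data frame with two columns on very different scales (made-up values).
toy <- data.frame(gdp = c(100, 200, 300, 400), rate = c(0.1, 0.4, 0.2, 0.3))

toy.scaled <- scale(toy)          # subtract column mean, divide by column sd

round(colMeans(toy.scaled), 10)   # both columns now have mean 0
apply(toy.scaled, 2, sd)          # and standard deviation 1

# The centering and scaling values are kept as attributes.
attr(toy.scaled, "scaled:center")
attr(toy.scaled, "scaled:scale")
```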
③ Choose the number of clusters with the voting result of the NbClust function
L=NbClust(X.scaled,distance="euclidean",method="average")
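table(L$Best.nc[1,])   # votes per candidate number of clusters; k = 3 is chosen
win.graph(width=6, height=5,pointsize=9)   # Windows-only; opens a larger device to avoid the "figure margins too large" error
barplot(table(L$Best.nc[1,]),xlab = "No. of clusters")
The vote indicates that 3 receives the most votes, so 3 is chosen as the number of clusters, i.e. the value of k.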
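The NbClust vote can be cross-checked with an elbow plot of the total within-cluster sum of squares. This is a different, swapped-in heuristic and not part of the original analysis. The sketch assumes the built-in `USArrests` data as a stand-in for the standardized Table 1 data; the seed, the range `1:8`, and the names `dat`, `ks`, `wss` are arbitrary choices for illustration.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
dat <- scale(USArrests)           # stand-in for the standardized Table 1 data

ks  <- 1:8
# Total within-cluster sum of squares for each candidate k.
wss <- sapply(ks, function(k) kmeans(dat, centers = k, nstart = 25)$tot.withinss)

# Look for the k after which the curve flattens out.
plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

If the bend of the curve sits near the k chosen by the vote, the two criteria agree; if not, both candidate values are worth inspecting.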
④ Run k-means with 3 clusters
kmeans1<-kmeans(X.scaled,centers=3,nstart = 25)
# Plot the clusters; data is the matrix that was actually clustered
fviz_cluster(object=kmeans1,data=X.scaled,
ellipse.type = "euclid",star.plot=TRUE,repel=TRUE,
geom = "point",palette='jco',main="",
ggtheme=theme_minimal())+
theme(axis.title = element_blank())
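⑤ Assign each region to a cluster
summary(kmeans1)    # lists the components of the fitted object
kmeans1$cluster     # cluster label of each region
kmeans1$size        # number of regions in each cluster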
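The components of a fitted k-means object can be inspected as in the sketch below. It assumes the built-in `USArrests` data as a stand-in for Table 1; the seed and the names `dat` and `km` are hypothetical. Because k-means starts from random centers, the cluster numbering can change between runs unless a seed is fixed.

```r
set.seed(1)                               # arbitrary seed; labels depend on it
dat <- scale(USArrests)                   # stand-in data, 50 rows
km  <- kmeans(dat, centers = 3, nstart = 25)

km$size                                   # number of observations per cluster
head(km$cluster)                          # cluster label of each row
round(km$centers, 2)                      # cluster means (standardized units)
km$betweenss / km$totss                   # share of total variation between clusters

# Members (here: states) of each cluster, listed by name.
split(rownames(dat), km$cluster)
```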
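⑥ Compute the mean of each factor within each cluster
# z and by_fenzu are not defined in the original listing; the next two lines
# reconstruct them from how they are used later (column kmeans1.cluster)
z=data.frame(X[2:17],kmeans1$cluster)
by_fenzu=group_by(z,kmeans1.cluster)
# as.data.frame makes all 16 columns print
as.data.frame(summarize(by_fenzu,across(x1:x16,mean)))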
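Analysis: for factors 1, 2, 3, 4, 5, 6, 8 and 9 the regions in cluster 3 stand out; for factors 7 and 11 the regions in cluster 1 stand out; for factor 12 the regions in cluster 2 stand out.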
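The same kind of per-cluster summary can be obtained with base R's `aggregate`, which may be handy when dplyr is not available. This is a sketch on the built-in `USArrests` data as a stand-in for Table 1; the seed and the names `raw`, `km`, and `grp_means` are hypothetical. The last line reports, for each variable, which cluster has the highest mean, which is the comparison the analysis above makes by eye.

```r
set.seed(1)                               # arbitrary seed
raw <- USArrests                          # stand-in for the 16 factors
km  <- kmeans(scale(raw), centers = 3, nstart = 25)

# Means of the original (unscaled) variables within each cluster.
grp_means <- aggregate(raw, by = list(cluster = km$cluster), FUN = mean)
print(grp_means)

# For each variable, the cluster with the highest mean.
sapply(grp_means[-1], function(v) grp_means$cluster[which.max(v)])
```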
⑦ Split the data by cluster and examine the density of each factor within each cluster, using the first five factors as an example
Data1=z[which(z$kmeans1.cluster==1),]
Data2=z[which(z$kmeans1.cluster==2),]
Data3=z[which(z$kmeans1.cluster==3),]
# Density curves of factor 1
par(mfrow=c(1,3))
plot(density(Data1[,1]),main="1.1")
plot(density(Data2[,1]),main="1.2")
plot(density(Data3[,1]),main="1.3")
# Density curves of factor 2
par(mfrow=c(1,3))
plot(density(Data1[,2]),main="2.1")
plot(density(Data2[,2]),main="2.2")
plot(density(Data3[,2]),main="2.3")
# Density curves of factor 3
par(mfrow=c(1,3))
plot(density(Data1[,3]),main="3.1")
plot(density(Data2[,3]),main="3.2")
plot(density(Data3[,3]),main="3.3")
# Density curves of factor 4
par(mfrow=c(1,3))
plot(density(Data1[,4]),main="4.1")
plot(density(Data2[,4]),main="4.2")
plot(density(Data3[,4]),main="4.3")
# Density curves of factor 5
par(mfrow=c(1,3))
plot(density(Data1[,5]),main="5.1")
plot(density(Data2[,5]),main="5.2")
plot(density(Data3[,5]),main="5.3")
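The repeated plotting calls above follow one pattern (factor j, cluster k), so they can be written as a nested loop. A minimal sketch, assuming the built-in `USArrests` data as a stand-in and hypothetical names `z0` and `vals`; only the first two variables are drawn to keep it short.

```r
set.seed(1)                                           # arbitrary seed
z0 <- data.frame(scale(USArrests))                    # stand-in factors
z0$cluster <- kmeans(z0, centers = 3, nstart = 25)$cluster

for (j in 1:2) {                       # first two variables only, for brevity
  op <- par(mfrow = c(1, 3))           # one row of three panels per variable
  for (k in 1:3) {
    vals <- z0[z0$cluster == k, j]     # values of variable j in cluster k
    plot(density(vals), main = paste0(j, ".", k))  # title "variable.cluster"
  }
  par(op)                              # restore the previous layout
}
```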
# Scatterplot matrix, with points colored by cluster
pairs(~x1+x2+x3+x4+x5,data=z,col=unclass(z$kmeans1.cluster))
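Indexing a color vector by the cluster label is a common way to color a scatterplot matrix. The sketch below shows the idiom on the built-in `USArrests` data as a stand-in; the seed, the three color names, and the variables `dat`, `cl`, and `cols` are assumptions chosen for illustration.

```r
set.seed(1)                                   # arbitrary seed
dat <- USArrests                              # stand-in data
cl  <- kmeans(scale(dat), centers = 3, nstart = 25)$cluster

# One color per cluster label: label 1 -> red, 2 -> darkgreen, 3 -> blue.
cols <- c("red", "darkgreen", "blue")[cl]

# One-sided formula: every term becomes one variable in the matrix.
pairs(~ Murder + Assault + UrbanPop + Rape, data = dat, col = cols, pch = 19)
```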