Machine Learning for Data Analysis (4): k-Means Cluster Analysis

I am using Python 3.7 on Windows 10 with Anaconda.
The data set tree_addhealth.csv and the code are from the Coursera course Machine Learning for Data Analysis:
https://www.coursera.org/learn/machine-learning-data-analysis

from pandas import DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
# note: train_test_split moved from the deprecated sklearn.cross_validation to sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.cluster import KMeans

"""
Data Management
"""
data = pd.read_csv("tree_addhealth.csv")

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)


data_clean = data.dropna()

# subset clustering variables
cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]
cluster.describe()

# standardize clustering variables to have mean=0 and standard deviation=1
# astype('float64') ensures that my clustering variables have a numeric format
clustervar=cluster.copy()
for col in clustervar.columns:
    clustervar[col]=preprocessing.scale(clustervar[col].astype('float64'))
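
As a quick sanity check (my own sketch, not part of the course code), we can confirm the standardization worked. Note that preprocessing.scale standardizes with the population standard deviation (ddof=0), while pandas' .std() defaults to the sample version, so I pass ddof=0 here.

# sketch: each column should now have mean ~0 and standard deviation ~1
print(clustervar.mean().round(6))
print(clustervar.std(ddof=0).round(6))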

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
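
To confirm the 70/30 split behaved as intended (a small check of my own):

# sketch: clus_train should hold about 70% of the rows, clus_test about 30%
print(clus_train.shape, clus_test.shape)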

# k-means cluster analysis for 1-9 clusters
from scipy.spatial.distance import cdist
# clusters holds the numbers 1 through 9; we will fit one cluster solution for each
clusters=range(1,10)
# meandist will store the average distance value calculated for each cluster solution
meandist=[]

# run the cluster analysis below once for each value of k
for k in clusters:
    # n_clusters is the number of clusters; substituting k runs the analysis for 1 through 9 clusters
    model=KMeans(n_clusters=k)
    model.fit(clus_train)
    # .predict uses the results stored in the model object to assign each observation
    # to the closest cluster
    clusassign=model.predict(clus_train)
    # cdist calculates the Euclidean distance from each observation in clus_train to each
    # cluster centroid (stored in model.cluster_centers_); np.min with axis=1 keeps, for
    # each observation, the distance to its nearest centroid; sum adds those minimum
    # distances across all observations, and dividing by clus_train.shape[0] (the number
    # of observations) gives the average distance for this cluster solution
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])
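
As an aside (a sketch of my own, not course code): scikit-learn's KMeans also exposes an inertia_ attribute, the sum of squared distances from each sample to its closest centroid. Plotting inertia against k gives the same kind of elbow curve without calling cdist by hand, though it is based on squared rather than plain Euclidean distances.

# sketch: elbow curve using KMeans' built-in inertia_ (sum of squared distances)
inertias=[]
for k in clusters:
    km=KMeans(n_clusters=k).fit(clus_train)
    inertias.append(km.inertia_)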

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""

# clusters is the object that includes the values of 1 through 9 for the range of clusters we specified and meandist is the average distance value that we just calculated.
plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')


 

What this plot shows is the decrease in the average minimum distance of the observations from the cluster centroids for each of the cluster solutions. We can see that the average distance decreases as the number of clusters increases.
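
To make the bend easier to spot, one rough heuristic (my own sketch, not course code) is to print how much the average distance drops with each additional cluster and look for where the improvement levels off:

# sketch: successive decreases in average distance as k grows from 1 to 9
drops=[meandist[i] - meandist[i + 1] for i in range(len(meandist) - 1)]
for k, d in zip(range(2, 10), drops):
    print(f'k={k}: average distance decreased by {d:.4f}')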

# Interpret 3 cluster solution
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
# plot clusters

As an example, let's interpret the 3-cluster solution. We rerun the cluster analysis, this time asking for three clusters: the object model3 holds the results of KMeans(n_clusters=3) fit to the training data, and clusassign holds the cluster assignments based on the 3-cluster model.
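
One caveat worth noting (my addition, not from the course): k-means initialization is random, so assignments can vary from run to run. For reproducible results, scikit-learn's random_state parameter can fix the seed, for example:

# sketch: the same 3-cluster model with a fixed seed, so reruns give identical assignments
model3_seeded=KMeans(n_clusters=3, random_state=123).fit(clus_train)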

# conduct the canonical discriminant analysis (approximated here with PCA)
from sklearn.decomposition import PCA
# retain the first two canonical variables
pca_2 = PCA(2)
# pca_2.fit_transform fits the analysis specified with the PCA command and applies it
# to the clus_train data set to calculate the canonical variables
plot_columns = pca_2.fit_transform(clus_train)
# x=plot_columns[:,0] plots the first canonical variable (the first column of the
# plot_columns matrix) on the x axis; y=plot_columns[:,1] plots the second canonical
# variable on the y axis. model3.labels_ contains the cluster assignments from the
# 3-cluster solution, so c=model3.labels_ color-codes the points by cluster.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

Here is the scatter plot. What it shows is that two of the clusters are densely packed, meaning that the observations within them are highly correlated with each other and within-cluster variance is relatively low. But those two clusters appear to have a good deal of overlap, meaning that there is not good separation between them. The third cluster shows better separation, but its observations are more spread out, indicating less correlation among them and higher within-cluster variance. This suggests that a two-cluster solution might be better, and that it would be especially important to further evaluate the two-cluster solution as well.
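
One way to judge the separation more directly (a sketch of my own, reusing the fitted pca_2 and model3 objects) is to project the three cluster centroids into the same two-dimensional space and overlay them on the scatter plot:

# sketch: overlay the cluster centroids, projected into the canonical space
centers_2d = pca_2.transform(model3.cluster_centers_)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_, alpha=0.5)
plt.scatter(x=centers_2d[:,0], y=centers_2d[:,1], c='red', marker='x', s=100)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Cluster centroids in canonical space')
plt.show()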

Next we can take a look at the pattern of means on the clustering variables for each cluster to see whether the clusters are distinct and meaningful. To do this, we have to link the cluster assignment variable back to its corresponding observation in the clus_train data set, which holds the clustering variables. The first thing we need to do is create a unique identifier variable from the clus_train index.
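
As an aside before the course's multi-step merge below: because model3.labels_ is ordered exactly like the rows of clus_train, the same result can be sketched in one line (my shortcut, not course code):

# sketch: attach the cluster labels directly as a new column
merged_alt = clus_train.assign(cluster=model3.labels_)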

"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the 
# cluster training data to merge with the cluster assignment variable
# reset_index moves the index into a new column named 'index';
# inplace=True modifies clus_train directly rather than returning a copy
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
# This will be combined with the cluster assignment variable, so that we can merge the two datasets together by each observation's unique identifier. 
cluslist=list(clus_train['index'])
# create a list of cluster assignments for each observation.
labels=list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the 
# cluster assignment dataframe 
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
# groupby on the new cluster assignment variable plus .mean() calculates the mean
# of every clustering variable within each cluster; then print the means
clustergrp = merged_train.groupby('cluster').mean()
print("Clustering variable means by cluster")
print(clustergrp)
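
With eleven clustering variables the printed table may be truncated in the console; if so, pandas' display options can be widened (a sketch, not course code):

# sketch: show every column of the cluster means table
pd.set_option('display.max_columns', None)
print(clustergrp)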


# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data 
gpa_data=data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1=pd.DataFrame(gpa_train)
# reset the index for the GPA variable data frame to create the unique identifier to link the datasets. 
gpa_train1.reset_index(level=0, inplace=True)
# merge the data sets by the unique identifier index into a data frame called merged_train_all
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print(gpamod.summary())

print('means for GPA by cluster')
m1 = sub1.groupby('cluster').mean()
print(m1)

print('standard deviations for GPA by cluster')
m2 = sub1.groupby('cluster').std()
print(m2)

mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())
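
A natural follow-up (my sketch, not course code) is to assign the held-out test observations to the trained clusters and check whether the GPA pattern replicates. This assumes the GPA split aligns row-for-row with the cluster split, which it does here since both splits use test_size=.3 and random_state=123 on the same cleaned data:

# sketch: assign test observations to the trained clusters and compare GPA means
test_labels = model3.predict(clus_test)
gpa_test1 = pd.DataFrame(gpa_test)
gpa_test1['cluster'] = test_labels
print(gpa_test1.groupby('cluster')['GPA1'].mean())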
