

KMeans was the first unsupervised learning algorithm I learned when I started getting deeper into the world of machine learning. Back then I thought there was nothing special about it, since all it does is cluster data points on a simple Cartesian plane. Now I realize that it can be very helpful, especially when applied to datasets with high dimensionality.

Today, in this article, I would like to do another simple project: implementing the KMeans algorithm on a mall customers dataset. The main objective of this project is to perform customer segmentation based on income and spending. In marketing, this kind of task is commonly known as market segmentation. The dataset used in this project can be downloaded from this Kaggle link. Before getting into the algorithm, I want to do a little bit of data analysis first.

Note: I share the entire code used in this project at the end of this article.


Below are all the modules used in this project. Instead of writing a KMeans function manually, here I use the Scikit-Learn module to make things simpler. I will probably write about KMeans from scratch someday in a separate article.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.cluster import KMeans

Anyway, after the CSV file has been downloaded, we can load it using the read_csv() function and display the first several rows.

df = pd.read_csv('Mall_Customers.csv')
df.head()
Here’s what the first 5 rows look like.

As I mentioned earlier, in this project we will only use the values of the annual income and spending score columns. According to the dataset details, the spending score ranges from 1 to 100 (inclusive), where a higher value means more spending. Now I want to show you the value distribution of both columns in the form of boxplots. Below is the code to do that.

plt.boxplot(df['Annual Income (k$)'], vert=False)
plt.title('Income distribution')
plt.xlabel('Annual Income (k$)')
plt.show()  # draw the first figure before starting the second one

plt.boxplot(df['Spending Score (1-100)'], vert=False)
plt.title('Spending score distribution')
plt.xlabel('Spending Score (1-100)')
plt.show()
Income distribution shown using boxplot.
Spending score distribution shown using boxplot.

I think both plots are self-explanatory, especially if you’ve taken a statistics class. But I want to highlight several things about the two plots above. First, the box in the first plot spans roughly $43,000 to $78,000, which means the middle half of the customers in our dataset earn within that range per year. There’s also a super-rich person whose income almost reaches $140,000 a year. In statistics, such a data point is called an outlier. Next, the second figure says that the median spending score lies almost exactly at 50. There’s no outlier in this distribution, which we can tell because no little circle appears in the second boxplot.
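To make the "little circle" rule concrete, here is a small sketch of how a boxplot typically decides what counts as an outlier: anything further than 1.5 times the interquartile range (IQR) from the box edges, which is Matplotlib's default whisker setting. The income values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up annual incomes in k$ (illustrative only, not the real dataset).
incomes = np.array([15, 20, 40, 45, 50, 60, 65, 70, 75, 80, 137])

# The box of a boxplot runs from Q1 to Q3, with a line at the median.
q1, median, q3 = np.percentile(incomes, [25, 50, 75])
iqr = q3 - q1

# Limits of the common 1.5 * IQR rule; points outside are drawn as circles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

outliers = incomes[(incomes < lower_limit) | (incomes > upper_limit)]
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("Outliers:", outliers)
```

Running the same calculation on df['Annual Income (k$)'] should flag the same customer that shows up as a circle in the first boxplot.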

If you want, you can also display both distributions using the distplot() function that comes with the Seaborn module. Note that distplot() is deprecated in Seaborn 0.11 and later in favor of histplot() and displot(), so newer versions will show a warning.

sns.distplot(df['Annual Income (k$)'])
plt.show()
sns.distplot(df['Spending Score (1-100)'])
plt.show()
Income distribution shown using distplot() function.
Spending score distribution shown using distplot() function.

KMeans clustering

Now that the income and spending score distributions have been analyzed, we are going to extract the values of the two columns into a separate array. After running the code below, we should have a 2-dimensional array X.

X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values
The first 5 rows of the X array. They are exactly the same as the two column values of the data frame df.

Now we are going to use a scatter plot to see the customer distribution based on the two features. Here I decided to use the x-axis and y-axis to represent income and spending respectively.

plt.scatter(X[:,0], X[:,1], s=7)
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
Data points distribution.

By looking at the data distribution above, we can guess that those data points can probably be put into 5 different clusters: upper left, upper right, lower left, lower right and center. Therefore, I am going to choose 5 as the value of K. In KMeans, the letter K is simply the number of clusters that the algorithm is going to create. The drawback of this algorithm is that we need to choose the value of K ourselves, which can be difficult, especially when we are working with data that has more than 2 dimensions. Fortunately, there is a trick for figuring out the optimum value of K, which we are going to discuss later. By the way, if you need more explanation about the details of the KMeans algorithm, I suggest you visit this article.
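For readers who want a feel for what happens inside the algorithm, here is a minimal sketch of the classic K-means loop (Lloyd's algorithm) in plain NumPy. It is only an illustration under simple assumptions (random initial centroids picked from the data, a made-up toy array, a hypothetical function name simple_kmeans) and it is not how Scikit-Learn implements KMeans internally.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm) sketch; not Scikit-Learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its members
        # (a centroid with no members stays where it is).
        new_centroids = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated, made-up groups of (income, spending score) pairs.
toy = np.array([[20, 15], [22, 18], [25, 12], [90, 85], [95, 80], [88, 90]])
labels, centroids = simple_kmeans(toy, k=2)
print(labels)
print(centroids)
```

The two steps inside the loop, assigning points and then recomputing centroids, are the whole algorithm; everything else is bookkeeping.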

Anyway, we will start the clustering by initializing a KMeans() object. Notice that I pass 5 as the n_clusters argument.

kmeans = KMeans(n_clusters=5, random_state=44)

Next, we are going to actually cluster the data points stored in the X array using the fit() method. The process should not take long since our dataset only consists of 200 samples.

kmeans.fit(X)

Now I want to show the final result of this clustering by running the prediction on our X data itself. I put the prediction result in the y_kmeans array, which stores the cluster label of every single sample in X.

y_kmeans = kmeans.predict(X)

We should obtain the following output after running the code above.


Cluster predictions of each data point.
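As a side note, once fit() has run, the same cluster labels are also available from the model's labels_ attribute. The sketch below illustrates this on a tiny made-up array rather than the mall data; n_init=10 is set explicitly only because its default differs between Scikit-Learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny made-up (income, spending score) array standing in for X.
X_toy = np.array([[20, 15], [22, 18], [25, 12], [90, 85], [95, 80], [88, 90]])

model = KMeans(n_clusters=2, n_init=10, random_state=44)
model.fit(X_toy)

labels_from_predict = model.predict(X_toy)
print(labels_from_predict)
# For the training data, predict() and labels_ give the same assignment.
print(np.array_equal(labels_from_predict, model.labels_))
```

The label numbers themselves are arbitrary: which group ends up as 0 and which as 1 depends on the initialization, so only the grouping matters.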

Now that the cluster label of each sample has been obtained, we can display those clusters using the plt.scatter() function again, this time using the values of y_kmeans for color-coding. I will also display the centers of those clusters, whose coordinates can be taken from the cluster_centers_ attribute.

centroids = kmeans.cluster_centers_
plt.figure(figsize=(7,7))
plt.scatter(X[:,0], X[:,1], s=20, c=y_kmeans, cmap='gist_rainbow')
plt.scatter(centroids[:,0], centroids[:,1], s=75, c='black')
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
Clusters created by the KMeans algorithm, where K = 5.

Predict new data

Now let’s assume that we have new customers with the following details, where the first column (from the left) denotes annual income and the second one shows the spending score:


pred_data = np.array([[30,10],[70,50],[20,80],[100,80],[100,20],[20,20],[60,60]])
New customer details, columns (from the left): annual income and spending score.

Then we can just predict those data points using predict() method — exactly the same as what we’ve done in the previous step.


predictions = kmeans.predict(pred_data)

And here’s the content of predictions array:


Content of our predictions array.

To make the predictions clearer, we are going to display them on another scatter plot, which can be seen in the following figure.


plt.figure(figsize=(7,7))
plt.scatter(X[:,0], X[:,1], s=20, c=y_kmeans, cmap='gist_rainbow')
plt.scatter(pred_data[:,0], pred_data[:,1], s=250, c=predictions, cmap='gist_rainbow', marker='+')
plt.scatter(centroids[:,0], centroids[:,1], s=75, c='black')
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
Display new data points (shown using the “+” sign).

Here I decided to display all important points in our dataset. In the figure above, the new customers are drawn using the “+” sign, while the training samples and centroids are displayed as small and large circles respectively. Judging by these predictions, the model behaves sensibly: every new point is grouped into the cluster we would expect from its position. The assignment itself is simple: the model calculates the distance from the new point to every centroid and takes the closest centroid as its cluster.
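The nearest-centroid rule described above can be sketched in a few lines of NumPy. The new customer values are the same ones used earlier, but the centroid coordinates here are hypothetical placeholders; the real ones come from kmeans.cluster_centers_ and will differ.

```python
import numpy as np

# Hypothetical centroids (income in k$, spending score), for illustration only.
centroids = np.array([[25.0, 20.0], [25.0, 80.0], [55.0, 50.0], [88.0, 17.0], [87.0, 82.0]])
new_points = np.array([[30, 10], [70, 50], [20, 80], [100, 80], [100, 20], [20, 20], [60, 60]])

# Euclidean distance from every new point to every centroid, shape (7, 5).
distances = np.linalg.norm(new_points[:, None, :] - centroids[None, :, :], axis=2)

# The index of the closest centroid is the predicted cluster.
nearest = distances.argmin(axis=1)
print(nearest)
```

With the fitted model's own centroids in place of the placeholders, this should reproduce what kmeans.predict(pred_data) returns.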

How to interpret clusters

The previous discussion was mostly about the technical side of KMeans. But there’s another important thing that we need to be able to explain: how to interpret those clusters. What’s the point of doing customer segmentation if we don’t understand what kind of segments we obtained? So now let’s look into what we just got.

  1. Red cluster (bottom right): All customers who fall into this group have relatively high income. However, they probably prefer to save their money instead of purchasing stuff in the mall.

  2. Lime cluster (middle): People in this cluster are average in terms of both earning and spending.

  3. Blue cluster (bottom left): To me, this customer group’s behavior makes perfect sense. They tend to spend less because they don’t have much money.

  4. Pink cluster (upper left): This cluster is kind of strange to me, since they don’t have much income yet their spending is pretty high.

  5. Green(?) cluster (upper right): I have no idea what color this is, lol :). Anyway, this cluster belongs to those who make a lot of money and spend a lot as well. In fact, people in this cluster might be our actual target market. For example, the marketing division of this mall could send advertisements through email or chat more often to this group than to those in other clusters, because people in this cluster are relatively willing to spend money.
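Instead of reading the segments off the plot alone, one option is to summarize each cluster numerically. The sketch below assumes a data frame with a "cluster" column holding the labels (in this project that would be y_kmeans); the rows are made up purely for illustration.

```python
import pandas as pd

# Made-up rows; in the project, "cluster" would hold the values of y_kmeans.
customers = pd.DataFrame({
    "Annual Income (k$)": [20, 25, 90, 95, 55, 60],
    "Spending Score (1-100)": [15, 80, 10, 85, 50, 45],
    "cluster": [0, 1, 2, 3, 4, 4],
})

# Average income and spending per cluster, plus the number of customers in it.
profile = customers.groupby("cluster").agg(
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
    size=("Annual Income (k$)", "size"),
)
print(profile)
```

A table like this makes it easy to name the segments, for example "high income, low spending", without relying on the plot colors.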

Finding out the best value for K (elbow method)

What we did in the previous step was choose the value of K by looking directly at the scatter plot. As humans, we can easily see how to roughly cluster these data. Now, what if we have more than 2 attributes to compare? For example, what if we also take the values of the Age and Gender columns into account? There would be 4 features in total, and it is practically impossible to picture what that data distribution looks like. Therefore, we cannot simply guess the optimum number of clusters. This is where the elbow method comes in.

The elbow method boils down to measuring how far the members of each cluster are from their centroid. Squared and summed over all clusters, this distance is commonly called WCSS (Within Cluster Sum of Squares), and its formula looks something like the following figure:

WCSS formula, where c denotes a cluster centroid and x a data point in that cluster.
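Written out, the standard definition of WCSS is the following. The index names and the set notation are assumptions of this sketch and may differ from the original figure: K is the number of clusters, C_k is the set of points assigned to cluster k, and c_k is that cluster's centroid.

```latex
\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^{2}
```

In words: for every point, take the squared Euclidean distance to the centroid of its own cluster, then add everything up.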

Fortunately, we don’t need to do the calculation from scratch, since the value is stored in the inertia_ attribute (yes, with that trailing underscore) of our KMeans object every time we train a model. So the idea of the elbow method is to train a KMeans model several times with increasing K values and store the WCSS of each run. Remember that we still use the X array for this, which contains only the annual incomes and spending scores.

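wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=44).fit(X)  # train a model with K = i
    wcss.append(kmeans.inertia_)  # inertia_ holds the WCSS of the fitted model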
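To check that inertia_ really is the WCSS described above, here is a small sketch that computes the sum of squared distances by hand and compares it with the attribute. It uses a tiny made-up array instead of the mall data, and n_init=10 is set explicitly only because its default differs between Scikit-Learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny made-up (income, spending score) array standing in for X.
X_toy = np.array([[20, 15], [22, 18], [25, 12], [90, 85], [95, 80], [88, 90]], dtype=float)
model = KMeans(n_clusters=2, n_init=10, random_state=44).fit(X_toy)

# Look up, for each sample, the centroid of the cluster it was assigned to.
assigned_centroids = model.cluster_centers_[model.labels_]

# Sum of squared distances from each sample to its own centroid.
manual_wcss = np.sum((X_toy - assigned_centroids) ** 2)

print(manual_wcss, model.inertia_)
```

The two printed numbers should agree up to floating-point rounding.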

Now that the WCSS values have been stored in the wcss list, we can plot them using the following code.


plt.title('WCSS (Within Cluster Sum of Squares)')
plt.plot(range(1,11), wcss)
The optimal K value for this clustering task is 5. Our initial guess was correct.

An article about the elbow method on Geeks for Geeks says:

To determine the optimal number of clusters, we have to select the value of K at the “elbow” i.e. the point after which the distortion/inertia start decreasing in a linear fashion. — Geeks for Geeks, 2019


According to the graph above, we can conclude that the optimal value for K in our case is indeed 5, which means our initial guess was correct. Keep in mind, though, that guessing K by eye will not work in every case, since plenty of problems involve datasets with more than 2 features. That is exactly where the elbow method is useful.
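Reading the elbow off a plot is a judgment call. If you want a rough automatic check, one common heuristic (not part of the project above, and only one of several possible choices) is to pick the point on the curve that lies furthest from the straight line joining its first and last points. The WCSS values below are made up for illustration and are not measured results.

```python
import numpy as np

# Made-up WCSS values for K = 1..10 (illustrative only, not measured results).
ks = np.arange(1, 11)
wcss_demo = np.array([270, 180, 105, 73, 44, 37, 30, 25, 21, 19], dtype=float)

# Rescale both axes to [0, 1] so that neither dominates the distance.
x = (ks - ks[0]) / (ks[-1] - ks[0])
y = (wcss_demo - wcss_demo[-1]) / (wcss_demo[0] - wcss_demo[-1])

# Perpendicular distance of each point from the line joining the first point
# (0, 1) and the last point (1, 0), i.e. the line x + y = 1.
distance_from_line = np.abs(x + y - 1) / np.sqrt(2)

# The K whose point is furthest from that line is taken as the elbow.
elbow_k = int(ks[distance_from_line.argmax()])
print(elbow_k)
```

Treat the result as a hint rather than a verdict: neighbouring K values often score almost the same, so it is still worth looking at the plot.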

That’s all for today’s project. Hopefully you learned something new from this article. See you in the next one!

Note: here’s the code :)


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

print("Loading data ...")
df = pd.read_csv('Mall_Customers.csv')

print("Displaying annual income distribution ...")
plt.boxplot(df['Annual Income (k$)'], vert=False)
plt.title('Income distribution')
plt.xlabel('Annual Income (k$)')
plt.show()

print("Displaying spending score distribution ...")
plt.boxplot(df['Spending Score (1-100)'], vert=False)
plt.title('Spending score distribution')
plt.xlabel('Spending Score (1-100)')
plt.show()

print("Displaying annual income and score distribution with distplot ...")
sns.distplot(df['Annual Income (k$)'])
plt.show()
sns.distplot(df['Spending Score (1-100)'])
plt.show()

print("Selecting 2 features (income and spending) ...")
X = df[['Annual Income (k$)', 'Spending Score (1-100)']].values

print("Displaying annual income vs spending distribution ...")
plt.scatter(X[:,0], X[:,1], s=7)
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
plt.show()

print("Initializing KMeans object ...")
kmeans = KMeans(n_clusters=5, random_state=41)

print("Training KMeans model ...")
kmeans.fit(X)

print("Predicting train data ...")
y_kmeans = kmeans.predict(X)

print("Displaying cluster centers ...")
centroids = kmeans.cluster_centers_

plt.scatter(X[:,0], X[:,1], s=20, c=y_kmeans, cmap='gist_rainbow')
plt.scatter(centroids[:,0], centroids[:,1], s=75, c='black')
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
plt.show()

print("Creating new data ...")
pred_data = np.array([[30,10],[70,50],[20,80],[100,80],[100,20],[20,20],[60,60]])

print("Predicting new data ...")
predictions = kmeans.predict(pred_data)

print("Displaying clusters of new data ...")
plt.scatter(X[:,0], X[:,1], s=20, c=y_kmeans, cmap='gist_rainbow')
plt.scatter(pred_data[:,0], pred_data[:,1], s=250, c=predictions, cmap='gist_rainbow', marker='+')
plt.scatter(centroids[:,0], centroids[:,1], s=75, c='black')
plt.title('Annual income vs spending score distribution')
plt.xlabel('Annual income (k$)')
plt.ylabel('Spending score (1-100)')
plt.show()

print("Calculating WCSS ...")
wcss = []
for i in range(1, 11):
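    kmeans = KMeans(n_clusters=i, random_state=41).fit(X)  # train a model with K = i
    wcss.append(kmeans.inertia_)  # inertia_ holds the WCSS of the fitted model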

print("Displaying WCSS plot for elbow method ...")
plt.title('WCSS (Within Cluster Sum of Squares)')
plt.plot(range(1,11), wcss)
plt.show()

