[Python] Customer Segmentation with the K-means Clustering Algorithm

Latest recommended article published on 2024-05-14 11:00:23

Author: 这一步就是天涯海角

Category: Machine Learning. Tags: python, clustering

Article link: https://blog.csdn.net/lam_yx/article/details/108149282


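I. Background

1. Project description

You own a supermarket mall. Through membership cards, you have some basic data about your customers: customer ID, age, gender, annual income and spending score.
The spending score is assigned to each customer based on defined parameters such as customer behavior and purchase data.
Problem statement: as the owner of the mall, you want to understand which customers can easily be grouped together (target customers), so that the marketing team can draw on these groups and plan its strategy accordingly.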

2. Data description

Field	Description
CustomerID	Customer ID
Gender	Gender
Age	Age
Annual Income (k$)	Annual income, in thousands of US dollars
Spending Score (1-100)	Spending score, ranging from 1 to 100

II. Modules used

import numpy as np
import pandas as pd

from pandas import plotting
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py

from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')

III. Data visualization

1. Reading the data

io = '.../Mall_Customers.csv'
df = pd.read_csv(io)    # read_csv already returns a DataFrame
# Rename columns
df.rename(columns={
    'Annual Income (k$)': 'Annual Income', 'Spending Score (1-100)': 'Spending Score'}, inplace=True)
print(df.head())
print(df.describe())
print(df.shape)
print(df.count())
print(df.dtypes)

The output is as follows (the dashed lines are separators and are not printed by the code above).

   CustomerID  Gender  Age  Annual Income  Spending Score
0           1    Male   19             15              39
1           2    Male   21             15              81
2           3  Female   20             16               6
3           4  Female   23             16              77
4           5  Female   31             17              40
-----------------------------------------------------------------
       CustomerID         Age  Annual Income  Spending Score
count  200.000000  200.000000     200.000000      200.000000
mean   100.500000   38.850000      60.560000       50.200000
std     57.879185   13.969007      26.264721       25.823522
min      1.000000   18.000000      15.000000        1.000000
25%     50.750000   28.750000      41.500000       34.750000
50%    100.500000   36.000000      61.500000       50.000000
75%    150.250000   49.000000      78.000000       73.000000
max    200.000000   70.000000     137.000000       99.000000
-----------------------------------------------------------------
(200, 5)
CustomerID        200
Gender            200
Age               200
Annual Income     200
Spending Score    200
dtype: int64
-----------------------------------------------------------------
CustomerID         int64
Gender            object
Age                int64
Annual Income      int64
Spending Score     int64
dtype: object
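
Because the CSV path above is elided, the loading and renaming step can be tried on an in-memory copy of the five rows printed by df.head(). This is only a minimal sketch: the names `sample_csv` and `sample` are illustrative, and the data covers just the rows shown above, not the full 200-row file.

```python
from io import StringIO

import pandas as pd

# First five rows as printed by df.head() above, with the original column names.
sample_csv = """CustomerID,Gender,Age,Annual Income (k$),Spending Score (1-100)
1,Male,19,15,39
2,Male,21,15,81
3,Female,20,16,6
4,Female,23,16,77
5,Female,31,17,40
"""

sample = pd.read_csv(StringIO(sample_csv))
sample = sample.rename(columns={
    'Annual Income (k$)': 'Annual Income',
    'Spending Score (1-100)': 'Spending Score',
})

# Basic sanity checks before any plotting or clustering.
missing_per_column = sample.isnull().sum()        # missing values per column
gender_counts = sample['Gender'].value_counts()   # rows per gender

print(sample.columns.tolist())
print(missing_per_column)
print(gender_counts)
```

The same two checks (`isnull().sum()` and `value_counts()`) can be run on the full `df` once the real file is loaded.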

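2. Data visualization

2.1 Parallel coordinates plot

A parallel coordinates plot visualizes multivariate data by drawing each attribute (variable) of a high-dimensional dataset as one of a series of parallel axes: the vertical direction shows the attribute value and the horizontal direction lists the attributes.
If lines of the same color are tightly grouped on an attribute and lines of different colors are clearly separated, that attribute helps to distinguish the label classes.
If the lines on an attribute are tangled and the colors are mixed, that attribute is probably of little value for distinguishing the label classes.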

plotting.parallel_coordinates(df.drop('CustomerID', axis=1), 'Gender')
plt.title('Parallel coordinates plot', fontsize=12)
plt.grid(linestyle='-.')
plt.show()

[Figure: parallel coordinates plot of Age, Annual Income and Spending Score, colored by Gender]
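
pandas draws all attributes of a parallel coordinates plot against one shared vertical axis, so an attribute with a wide range (such as annual income) can visually dominate one with a narrow range. One possible remedy, which is not part of the original post, is to min-max scale each numeric column to [0, 1] before plotting. The sketch below uses a synthetic stand-in table (`demo`, randomly generated, not the real customer data) and an output file name chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Synthetic stand-in for the customer table (NOT the real data).
rng = np.random.default_rng(0)
n = 60
demo = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(18, 71, size=n),
    "Annual Income": rng.integers(15, 138, size=n),
    "Spending Score": rng.integers(1, 100, size=n),
})

numeric_cols = ["Age", "Annual Income", "Spending Score"]

# Min-max scale every numeric attribute to [0, 1] so the axes are comparable.
col_min = demo[numeric_cols].min()
col_max = demo[numeric_cols].max()
scaled = demo.copy()
scaled[numeric_cols] = (demo[numeric_cols] - col_min) / (col_max - col_min)

# One line per customer, colored by the class column "Gender".
ax = parallel_coordinates(scaled, "Gender", color=["tab:blue", "tab:orange"])
ax.set_title("Parallel coordinates plot (min-max scaled)")
ax.grid(linestyle="-.")
plt.savefig("parallel_coordinates_demo.png")
```

After scaling, the vertical position of a line only tells where a value sits within its own attribute's range, which is usually what matters when judging whether the classes separate.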

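2.2 Distributions of age, annual income and spending score

Histograms and kernel density plots are used here. (Note: in a kernel density plot, probability is read from the area under the curve, for example the area to the left of X for P(x<X), not from the height of the curve.)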
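
The note about area versus height can be checked numerically. The sketch below is an illustration rather than part of the original post: it assumes SciPy is available, uses `scipy.stats.gaussian_kde` on synthetic values (not the customer data), and compares the height of the density curve with the areas under it.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic values concentrated in a narrow range (NOT the customer data).
rng = np.random.default_rng(42)
values = rng.normal(loc=0.5, scale=0.1, size=500)

kde = gaussian_kde(values)

# The curve height is a density, not a probability; for tightly
# concentrated data it can be larger than 1.
peak_height = kde(np.array([0.5]))[0]

# Probabilities are areas under the curve.
total_area = kde.integrate_box_1d(-np.inf, np.inf)     # whole curve
prob_below_half = kde.integrate_box_1d(-np.inf, 0.5)   # estimate of P(x < 0.5)

print(f"height at 0.5    : {peak_height:.2f}")
print(f"total area       : {total_area:.2f}")
print(f"area left of 0.5 : {prob_below_half:.2f}")
```

The height at the peak is well above 1 here, while the total area stays at 1, which is why the y-axis of a density plot should not be read as a probability.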

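# seaborn style and font in one call; a second sns.set() would reset the palette to its default
sns.set(palette="muted", color_codes=True, font='SimHei', font_scale=0.8)    # SimHei renders Chinese labels
# Configuration
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly with a CJK font
# Plotting
plt.figure(1, figsize=(13, 6))
n = 0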
for x in


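The k-means clustering algorithm is an iterative clustering method. To partition the data into K groups, it first picks K objects at random as the initial cluster centers, then computes the distance from every object to each center and assigns each object to its nearest center. A center together with the objects assigned to it represents one cluster. After each round of assignment, every cluster center is recomputed as the mean of the objects currently in that cluster. The process repeats until a termination condition is met: no (or only a minimal number of) objects are reassigned to a different cluster, no (or only a minimal number of) cluster centers change, or the sum of squared errors reaches a local minimum.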
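
The steps described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the implementation used by scikit-learn: the function name `kmeans`, its parameters, and the synthetic two-dimensional blobs are all chosen for this example only.

```python
import numpy as np


def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Basic k-means loop; returns (centers, labels, sse)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers have effectively stopped moving.
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:
            break
    # Final assignment and sum of squared errors (SSE) for the returned centers.
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    sse = float((distances[np.arange(len(points)), labels] ** 2).sum())
    return centers, labels, sse


# Synthetic 2-D data: three blobs (NOT the customer data).
rng = np.random.default_rng(0)
blob_centers = np.array([[20.0, 80.0], [60.0, 50.0], [100.0, 20.0]])
points = np.vstack([c + rng.normal(scale=3.0, size=(50, 2)) for c in blob_centers])

centers, labels, sse = kmeans(points, k=3)
print(centers.round(1))                  # final cluster centers
print(np.bincount(labels, minlength=3))  # number of points per cluster
print(round(sse, 1))                     # sum of squared errors
```

Because the initial centers are random, a run can settle in a local minimum that splits one blob and merges two others. The `KMeans` class imported earlier from `sklearn.cluster` mitigates this with k-means++ initialization and the `n_init` restarts parameter, and exposes the sum of squared errors as `inertia_`.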