Clustering case study: segmenting users by their product-category preferences

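This article shows how to use Pandas, PCA, and K-means in Python to preprocess the Instacart dataset and segment users by their product-category preferences. The tables are merged first, then features are built, PCA reduces the dimensionality, K-means clusters the users, and the silhouette coefficient is used to evaluate the clustering.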

Goal:

Apply PCA and K-means to segment users by their preferences across product categories (aisles).

Import modules

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Load the data

# Load the four Instacart tables
order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

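The tables are:

  • order_products__prior.csv: which products each order contains
    • Fields: order_id, product_id, add_to_cart_order, reordered
  • products.csv: product information
    • Fields: product_id, product_name, aisle_id, department_id
  • orders.csv: order information for each user
    • Fields: order_id, user_id, eval_set, order_number, ….
  • aisles.csv: the specific category (aisle) each product belongs to
    • Fields: aisle_id, aisle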

Inspect the data

order_product.head()

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0

products.head()

   product_id                                       product_name  aisle_id  department_id
0           1                         Chocolate Sandwich Cookies        61             19
1           2                                   All-Seasons Salt       104             13
2           3               Robust Golden Unsweetened Oolong Tea        94              7
3           4  Smart Ones Classic Favorites Mini Rigatoni Wit...        38              1
4           5                          Green Chile Anytime Sauce         5             13

orders.head()

   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0
3   2254736        1    prior             4          4                  7                    29.0
4    431534        1    prior             5          4                 15                    28.0

aisles.head()

   aisle_id                       aisle
0         1       prepared soups salads
1         2           specialty cheeses
2         3         energy granola bars
3         4               instant foods
4         5  marinades meat preparation

Basic data processing

Merge the tables

# Basic data processing
# Merge the tables: attach product information to every order line
table1 = pd.merge(order_product, products, on="product_id")
table1

          order_id  product_id  add_to_cart_order  reordered                    product_name  aisle_id  department_id
0                2       33120                  1          1              Organic Egg Whites        86             16
1               26       33120                  5          0              Organic Egg Whites        86             16
2              120       33120                 13          0              Organic Egg Whites        86             16
3              327       33120                  5          1              Organic Egg Whites        86             16
4              390       33120                 28          1              Organic Egg Whites        86             16
...            ...         ...                ...        ...                             ...       ...            ...
32434484   3265099       43492                  3          0        Gourmet Burger Seasoning       104             13
32434485   3361945       43492                 19          0        Gourmet Burger Seasoning       104             13
32434486   3267201       33097                  2          0  Piquillo & Jalapeno Bruschetta        81             15
32434487   3393151       38977                 32          0                  Original Jerky       100             21
32434488   3400803       23624                  7          0     Flatbread Pizza All Natural        79              1

32434489 rows × 7 columns

# Attach the user who placed each order, then the aisle name of each product
table2 = pd.merge(table1, orders, on="order_id")
table = pd.merge(table2, aisles, on="aisle_id")
table.shape

(32434489, 14)
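
The chain of three merges can be tried on a few made-up rows before running it on 32 million. The following is a minimal sketch, not the article's code: the toy values are invented, and `validate="many_to_one"` is an optional safeguard added here for illustration. It asks pandas to raise an error if a lookup table has duplicate keys, which would otherwise silently multiply rows.

```python
import pandas as pd

# Tiny made-up tables that mirror the Instacart column names (illustrative data only).
order_product = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [100, 101]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [100, 101], "aisle": ["eggs", "tea"]})

# validate="many_to_one" raises MergeError if the right-hand key is not unique.
toy = (
    order_product
    .merge(products, on="product_id", validate="many_to_one")
    .merge(orders, on="order_id", validate="many_to_one")
    .merge(aisles, on="aisle_id", validate="many_to_one")
)

# With complete lookup tables, the inner joins keep one row per order line.
print(toy.shape)
print(toy[["user_id", "aisle"]])
```

The row count staying equal to that of the order-line table, as it does in the real output above (32434489 rows before and after), is a quick sanity check that no join duplicated or dropped rows.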

table.head()

   order_id  product_id  add_to_cart_order  reordered        product_name  aisle_id  department_id  \
0         2       33120                  1          1  Organic Egg Whites        86             16
1        26       33120                  5          0  Organic Egg Whites        86             16
2       120       33120                 13          0  Organic Egg Whites        86             16
3       327       33120                  5          1  Organic Egg Whites        86             16
4       390       33120                 28          1  Organic Egg Whites        86             16

   user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order aisle
0   202279    prior             3          5                  9                     8.0  eggs
1   153404    prior             2          0                 16                     7.0  eggs
2    23750    prior            11          6                  8                    10.0  eggs
3    58707    prior            21          6                  9                     8.0  eggs
4   166654    prior            48          0                 12                     9.0  eggs

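Build the cross-tabulation

# Cross-tabulate: one row per user, one column per aisle, each cell counting purchases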
data = pd.crosstab(table["user_id"], table["aisle"])
data.shape

(206209, 134)
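
To see what `pd.crosstab` produces, here is a small sketch on invented rows (the user ids and aisle names are placeholders). Each cell counts how many times that user bought something from that aisle; the `groupby` version is included only as an equivalent formulation for comparison.

```python
import pandas as pd

# Four made-up purchase records: user 7 bought eggs twice and tea once, user 8 bought eggs once.
toy = pd.DataFrame({
    "user_id": [7, 7, 7, 8],
    "aisle": ["eggs", "eggs", "tea", "eggs"],
})

# Rows are users, columns are aisles, values are purchase counts (0 where there is none).
counts = pd.crosstab(toy["user_id"], toy["aisle"])
print(counts)

# The same table built with groupby + unstack.
alt = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)
print(alt)
```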

data.head()
(shown transposed for readability: rows are aisles, columns are user_id)

aisle / user_id                  1    2    3    4    5
air fresheners candles           0    0    0    0    0
asian foods                      0    3    0    0    2
baby accessories                 0    0    0    0    0
baby bath body care              0    0    0    0    0
baby food formula                0    0    0    0    0
bakery desserts                  0    0    0    0    0
baking ingredients               0    2    0    0    0
baking supplies decor            0    0    0    0    0
beauty                           0    0    0    0    0
beers coolers                    0    0    0    0    0
...                            ...  ...  ...  ...  ...
spreads                          1    3    4    0    0
tea                              0    1    1    0    0
tofu meat alternatives           0    1    0    0    0
tortillas flat bread             0    0    0    1    0
trail mix snack mix              0    0    0    0    0
trash bags liners                0    0    0    0    0
vitamins supplements             0    0    0    0    0
water seltzer sparkling water    0    2    2    1    0
white wines                      0    0    0    0    0
yogurt                           1   42    0    0    3

5 rows × 134 columns

# Keep only the first 1000 users to reduce the amount of computation
new_data = data[:1000]

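Feature engineering: PCA

# Feature engineering: PCA. A float n_components keeps just enough components to retain 90% of the variance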
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)
trans_data.shape

(1000, 22)
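
When `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction, which is how 134 aisle columns became 22 here. The sketch below illustrates this on synthetic data (the data, its shape, and the three latent factors are assumptions made up for the example, not the Instacart features).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Ask PCA to retain 90% of the variance instead of a fixed number of components.
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

# Cumulative share of variance explained by the components that were kept.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("variance retained:", cumulative[-1])
```

Inspecting `explained_variance_ratio_` on the real data in the same way would show how much each of the 22 components contributes.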

Machine learning (K-means)

estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)
y_pre
array([2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2,
       0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0,
       0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 0, 2, 1, 2, 0, 2, 2, 0, 3, 2, 2, 2, 1, 2, 0, 2, 2, 0, 0,
       0, 0, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 1,
       2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 0, 2, 2,
       2, 2, 3, 4, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 3, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 0, 2, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 3, 2, 2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 0,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 0, 1, 2, 2, 1, 0, 2, 0, 0, 2, 1, 2, 0, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 2, 0, 2, 2, 2,
       0, 0, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
       2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2,
       2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 1, 2, 3, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 3, 2, 2, 2, 0, 2, 2, 2, 2, 1, 1,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 1,
       0, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 0, 1, 2], dtype=int32)
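
A raw label array like the one above is hard to read by eye, and the label numbers themselves are arbitrary: rerunning K-means without a fixed seed can assign different numbers to the same groups. The following sketch, on synthetic blobs standing in for `trans_data` (an assumption, since the real PCA output is not reproduced here), fixes `random_state` and summarizes the labels as cluster sizes. The `n_init=10` and `random_state=0` settings are illustrative choices, not the article's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for trans_data (assumption: made-up blobs, not the Instacart features).
X, _ = make_blobs(n_samples=300, centers=5, n_features=4, random_state=0)

# random_state pins the centroid initialisation; n_init=10 keeps the best of 10 restarts.
estimator = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = estimator.fit_predict(X)

# Number of samples assigned to each cluster label 0..4.
sizes = np.bincount(labels, minlength=5)
print(sizes)
```

Applied to `y_pre`, the same `np.bincount` call would show how unevenly the 1000 users are spread across the five clusters.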

Model evaluation

# Model evaluation: mean silhouette coefficient of the clustering
silhouette_score(trans_data, y_pre)

0.4793021644455867

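sklearn.metrics.silhouette_score(X, labels)

  • Computes the mean silhouette coefficient over all samples
  • X: the feature values
  • labels: the cluster label assigned to each sample
  • For one sample the coefficient is (b - a) / max(a, b), where a is the mean distance to the other samples in its own cluster and b is the mean distance to the samples in the nearest other cluster; the score lies between -1 and 1, and higher means tighter, better-separated clusters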
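
The article fixes `n_clusters=5`; one common way to check such a choice is to compute the silhouette score for several values of k and compare. This is a sketch under stated assumptions: the data is synthetic (`make_blobs`), and the range of k, `n_init`, and `random_state` are illustrative settings rather than anything from the original analysis.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2-D data with 4 generated groups (assumption: stand-in for trans_data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Fit K-means for each candidate k and record the mean silhouette coefficient.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(k, round(score, 3))

# The k with the highest mean silhouette is one reasonable candidate, not a guaranteed optimum.
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

Replacing `X` with `trans_data` would run the same comparison on the PCA-reduced Instacart features.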