聚类算法_案例：探究⽤户对物品类别的喜好细分

¥骁勇善战¥

已于 2023-09-26 12:18:03 修改

阅读量125

点赞数

文章标签：机器学习 python 算法聚类

于 2023-09-13 11:53:34 首次发布

本文链接：https://blog.csdn.net/m0_68462260/article/details/132849475

版权

本文介绍了如何使用Python中的Pandas、PCA和K-means算法对Instacart数据集进行预处理，分析用户对商品类别的喜好细分。首先合并数据表，然后进行特征工程，通过PCA降维并应用K-means进行聚类，最后评估了聚类效果的轮廓系数。

摘要由CSDN通过智能技术生成

目标：

应⽤ pca 和 K-means 实现⽤户对物品类别的喜好细分划分

导入模块

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

获取数据

# 获取数据
order_product = pd.read_csv("./data/instacart/order_products__prior.csv")
products = pd.read_csv("./data/instacart/products.csv")
orders = pd.read_csv("./data/instacart/orders.csv")
aisles = pd.read_csv("./data/instacart/aisles.csv")

数据如下：

order_products__prior.csv：订单与商品信息
- 字段：order_id, product_id, add_to_cart_order, reordered
products.csv：商品信息
- 字段：product_id, product_name, aisle_id, department_id
orders.csv：⽤户的订单信息
- 字段：order_id,user_id,eval_set,order_number,….
aisles.csv：商品所属具体物品类别
- 字段： aisle_id, aisle

查看数据

order_product.head()

	order_id	product_id	add_to_cart_order	reordered
0	2	33120	1	1
1	2	28985	2	1
2	2	9327	3	0
3	2	45918	4	1
4	2	30035	5	0

products.head()

	product_id	product_name	aisle_id	department_id
0	1	Chocolate Sandwich Cookies	61	19
1	2	All-Seasons Salt	104	13
2	3	Robust Golden Unsweetened Oolong Tea	94	7
3	4	Smart Ones Classic Favorites Mini Rigatoni Wit...	38	1
4	5	Green Chile Anytime Sauce	5	13

orders.head()

	order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
0	2539329	1	prior	1	2	8	NaN
1	2398795	1	prior	2	3	7	15.0
2	473747	1	prior	3	3	12	21.0
3	2254736	1	prior	4	4	7	29.0
4	431534	1	prior	5	4	15	28.0

aisles.head()

	aisle_id	aisle
0	1	prepared soups salads
1	2	specialty cheeses
2	3	energy granola bars
3	4	instant foods
4	5	marinades meat preparation

数据基本处理

合并表格

# 数据基本处理
# 合并表格
table1 = pd.merge(order_product, products, on=["product_id", "product_id"])
table1

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id
0	2	33120	1	1	Organic Egg Whites	86	16
1	26	33120	5	0	Organic Egg Whites	86	16
2	120	33120	13	0	Organic Egg Whites	86	16
3	327	33120	5	1	Organic Egg Whites	86	16
4	390	33120	28	1	Organic Egg Whites	86	16
...	...	...	...	...	...	...	...
32434484	3265099	43492	3	0	Gourmet Burger Seasoning	104	13
32434485	3361945	43492	19	0	Gourmet Burger Seasoning	104	13
32434486	3267201	33097	2	0	Piquillo & Jalapeno Bruschetta	81	15
32434487	3393151	38977	32	0	Original Jerky	100	21
32434488	3400803	23624	7	0	Flatbread Pizza All Natural	79	1

32434489 rows × 7 columns

table2 = pd.merge(table1, orders, on=["order_id", "order_id"])

table = pd.merge(table2, aisles, on=["aisle_id", "aisle_id"])

table.shape

(32434489, 14)

table.head()

	order_id	product_id	add_to_cart_order	reordered	product_name	aisle_id	department_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order	aisle
0	2	33120	1	1	Organic Egg Whites	86	16	202279	prior	3	5	9	8.0	eggs
1	26	33120	5	0	Organic Egg Whites	86	16	153404	prior	2	0	16	7.0	eggs
2	120	33120	13	0	Organic Egg Whites	86	16	23750	prior	11	6	8	10.0	eggs
3	327	33120	5	1	Organic Egg Whites	86	16	58707	prior	21	6	9	8.0	eggs
4	390	33120	28	1	Organic Egg Whites	86	16	166654	prior	48	0	12	9.0	eggs

交叉表合并

# 交叉表合并
data = pd.crosstab(table["user_id"], table["aisle"])

data.shape

(206209, 134)

data.head()

aisle	air fresheners candles	asian foods	baby accessories	baby bath body care	baby food formula	bakery desserts	baking ingredients	baking supplies decor	beauty	beers coolers	...	spreads	tea	tofu meat alternatives	tortillas flat bread	trail mix snack mix	trash bags liners	vitamins supplements	water seltzer sparkling water	white wines	yogurt
user_id
1	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	1
2	0	3	0	0	0	0	2	0	0	0	...	3	1	1	0	0	0	0	2	0	42
3	0	0	0	0	0	0	0	0	0	0	...	4	1	0	0	0	0	0	2	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	1	0	0
5	0	2	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	3

5 rows × 134 columns

# 数据截取
new_data = data[:1000]

特征工程 - pca

# 特征工程 — pca
transfer = PCA(n_components=0.9)
trans_data = transfer.fit_transform(new_data)

trans_data.shape

(1000, 22)

机器学习（k-means）

estimator = KMeans(n_clusters=5)
y_pre = estimator.fit_predict(trans_data)

y_pre

array([2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 0, 2,
       0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0,
       0, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 1, 2,
       2, 2, 2, 0, 2, 1, 2, 0, 2, 2, 0, 3, 2, 2, 2, 1, 2, 0, 2, 2, 0, 0,
       0, 0, 1, 2, 2, 0, 1, 2, 0, 2, 2, 1, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
       1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 1,
       2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 2, 0, 2, 2,
       2, 2, 3, 4, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 2, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,
       2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 3, 2,
       2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 0, 0, 2, 0,
       2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 3, 0, 2, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2,
       2, 2, 2, 3, 2, 2, 2, 0, 2, 0, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2,
       2, 0, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1,
       0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 3, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 1, 0, 2, 2, 0,
       2, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 1, 2, 2, 2, 0, 1, 2, 2, 1, 0, 2, 0, 0, 2, 1, 2, 0, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 0, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 0, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 0, 0, 2, 0, 2, 2, 2,
       0, 0, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 0, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2,
       2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2,
       2, 2, 1, 2, 0, 0, 2, 2, 0, 2, 1, 0, 2, 2, 2, 1, 2, 3, 0, 2, 0, 2,
       2, 0, 2, 2, 0, 0, 2, 0, 2, 1, 2, 3, 2, 2, 2, 0, 2, 2, 2, 2, 1, 1,
       0, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 1, 2, 0, 2, 0, 2, 0, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 2, 2, 0,
       2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0,
       1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1, 2, 2, 2, 1,
       0, 2, 2, 0, 1, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 0, 2, 2, 0, 1, 2], dtype=int32)

# 5.模型评估
silhouette_score(trans_data, y_pre)

0.4793021644455867

sklearn.metrics.silhouette_score(X, labels)

计算所有样本的平均轮廓系数
X：特征值
labels：被聚类标记的⽬标值