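Kaggle case study
Data source: the Kaggle website
Baidu Cloud share, extraction code: bt3t
Files:
- products.csv: product information
- order_products__prior.csv: which products each order contains
- orders.csv: which orders each user placed
- aisles.csv: the specific aisle (item category) each product belongs to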
from sklearn.decomposition import PCA
import pandas as pd
Read a table:
pd.read_csv('products.csv').head()
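This shows the first five rows of the table, which is enough to see its column headers. The same method can be used to inspect what the other three tables store. The key columns of each table are: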
prior: product_id, order_id
products: product_id, aisle_id
orders: order_id, user_id
aisles: aisle_id, aisle
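The header check can also be run over all four tables in one pass. The following is a minimal sketch, not the original notebook code: the CSV strings are invented stand-ins for the real files and carry only the key columns listed above, and `list_columns` is a hypothetical helper name. It relies on `nrows=0`, which makes `pd.read_csv` parse only the header line.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the real CSV files; the rows are invented.
sample_files = {
    "prior": "order_id,product_id\n1,10\n",
    "products": "product_id,aisle_id\n10,24\n",
    "orders": "order_id,user_id\n1,100\n",
    "aisles": "aisle_id,aisle\n24,fresh fruits\n",
}

def list_columns(csv_text):
    """Return the column names of a CSV without loading its rows."""
    # nrows=0 parses only the header line.
    return list(pd.read_csv(io.StringIO(csv_text), nrows=0).columns)

columns = {name: list_columns(text) for name, text in sample_files.items()}
for name, cols in columns.items():
    print(name, cols)
```

With the real data, a file path such as 'orders.csv' would be passed to `pd.read_csv` in place of the `io.StringIO` wrapper.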
Create the user-product table
- Read all the data and merge the four tables into one
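prior = pd.read_csv('order_products__prior.csv')
products = pd.read_csv('products.csv')
orders = pd.read_csv('orders.csv')
aisles = pd.read_csv('aisles.csv')
_mg = pd.merge(prior, products, on='product_id')  # attach aisle_id to each purchased product
_mg = pd.merge(_mg, orders, on='order_id')  # attach user_id to each order
mt = pd.merge(_mg, aisles, on='aisle_id')  # attach the aisle name
View the first five rows:
mt.head()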
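To see what the chained merge produces, it can help to run it on a few rows by hand. The sketch below uses invented toy tables that keep only the key columns listed earlier; the values and the name `merged` are assumptions made for illustration.

```python
import pandas as pd

# Toy tables with invented rows; only the key columns are kept.
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [24, 83]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [100, 200]})
aisles = pd.DataFrame({"aisle_id": [24, 83],
                       "aisle": ["fresh fruits", "fresh vegetables"]})

# Chain three inner joins: product -> order -> aisle.
merged = (
    prior.merge(products, on="product_id")
         .merge(orders, on="order_id")
         .merge(aisles, on="aisle_id")
)

print(merged[["user_id", "order_id", "product_id", "aisle"]])
```

Each row of the result is one purchased product, now carrying both the user who bought it and the aisle it belongs to. Because `merge` defaults to an inner join, rows whose key has no match in the other table would be dropped.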
- Cross-tabulation (a special grouping tool)
cross = pd.crosstab(mt['user_id'], mt['aisle'])
cross.head(10)
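As a small illustration of why the cross-tabulation is described as a grouping tool, the sketch below builds the same count table twice on invented data: once with `pd.crosstab` and once with an explicit `groupby`. The frame `purchases` and its values are hypothetical.

```python
import pandas as pd

# Invented purchase records: one row per purchased product.
purchases = pd.DataFrame({
    "user_id": [100, 100, 100, 200],
    "aisle": ["fresh fruits", "fresh fruits", "yogurt", "yogurt"],
})

# Rows become users, columns become aisles, cells count purchases.
counts = pd.crosstab(purchases["user_id"], purchases["aisle"])
print(counts)

# The same table built with an explicit groupby, for comparison.
by_group = (
    purchases.groupby(["user_id", "aisle"])
             .size()
             .unstack(fill_value=0)
)
```

User and aisle pairs that never occur are filled with 0, so every user ends up with one fixed-length vector of aisle counts. This is the shape PCA expects as input.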
Principal component analysis
pca = PCA(n_components=.9)
data = pca.fit_transform(cross)
data
Output:
array([[-2.42156587e+01, 2.42942720e+00, -2.46636975e+00, …,
6.86800336e-01, 1.69439402e+00, -2.34323022e+00],
[ 6.46320806e+00, 3.67511165e+01, 8.38255336e+00, …,
4.12121252e+00, 2.44689740e+00, -4.28348478e+00],
[-7.99030162e+00, 2.40438257e+00, -1.10300641e+01, …,
1.77534453e+00, -4.44194030e-01, 7.86665571e-01],
…,
[ 8.61143331e+00, 7.70129866e+00, 7.95240226e+00, …,
-2.74252456e+00, 1.07112531e+00, -6.31925661e-02],
[ 8.40862199e+01, 2.04187340e+01, 8.05410372e+00, …,
7.27554259e-01, 3.51339470e+00, -1.79079914e+01],
[-1.39534562e+01, 6.64621821e+00, -5.23030367e+00, …,
8.25329076e-01, 1.38230701e+00, -2.41942061e+00]])
data.shape
Output:
(206209, 27)
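The 27 columns in the result follow from passing a float to `n_components`: for a value between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio reaches that fraction. The sketch below shows the behavior on synthetic data rather than on the Instacart table. The array `X`, its dimensions, and the noise level are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features driven by 3 hidden factors
# plus a small amount of noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = factors @ mixing + 0.05 * rng.normal(size=(200, 10))

# A float in (0, 1) asks PCA to keep the fewest components whose
# cumulative explained variance ratio reaches that fraction.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(X)

kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(reduced.shape, kept, round(float(explained), 3))
```

After fitting, `pca.n_components_` reports how many components were kept, and summing `pca.explained_variance_ratio_` confirms that the retained variance is at least 90%.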