Feature Engineering Case Study: Merging Tables, Cross-Tabulation, and Principal Component Analysis

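This post walks through a feature dimensionality-reduction example on the Instacart Market Basket Analysis dataset. It joins the order tables, builds a user-by-aisle cross-tabulation of purchase counts, and applies principal component analysis (PCA) to compress that table, keeping most of the information in the purchase records while giving downstream models far fewer features to handle.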

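Goal: reduce feature dimensionality with principal component analysis (PCA)

Method:

Join the tables: user_id ----> aisle

Cross-tabulation: for each user, count how many products they bought in each aisle (fine-grained product category)

Dimensionality reduction: principal component analysis (PCA)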
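
The three steps above can be sketched end to end on a few rows of made-up data. The tiny tables below are hypothetical stand-ins for the Instacart files (only the column names match the real data), so treat this as an illustration of the flow rather than the analysis itself.

```python
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical miniature versions of the four Instacart tables.
orders = pd.DataFrame({"order_id": [1, 2, 3, 4], "user_id": [10, 10, 20, 30]})
order_products = pd.DataFrame({
    "order_id":   [1, 1, 2, 3, 3, 4],
    "product_id": [100, 101, 100, 102, 101, 102],
})
products = pd.DataFrame({"product_id": [100, 101, 102], "aisle_id": [1, 2, 2]})
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["tea", "yogurt"]})

# Step 1: join the tables so every purchased item carries user_id and aisle.
merged = (
    orders.merge(order_products, on="order_id")
          .merge(products, on="product_id")
          .merge(aisles, on="aisle_id")
)

# Step 2: count purchases per (user, aisle) pair.
user_aisle = pd.crosstab(merged["user_id"], merged["aisle"])

# Step 3: compress the aisle columns, keeping at least 90% of the variance.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(user_aisle)

print(user_aisle)
print(reduced.shape, pca.explained_variance_ratio_)
```

Each row of `reduced` still corresponds to one user; only the columns change, from aisle counts to principal components.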

Data source: https://www.kaggle.com/c/instacart-market-basket-analysis/data

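·order_products__prior.csv: the products contained in each prior order
    。Fields: order_id, product_id, add_to_cart_order, reordered
    。Meaning: order ID, product ID, position in which the product was added to the cart, whether the user had ordered this product before (1) or not (0)
·products.csv: product information
    。Fields: product_id, product_name, aisle_id, department_id
    。Meaning: product ID, product name, aisle (fine-grained category) ID, department (broad category) ID
·orders.csv: each user's orders
    。Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
    。Meaning: order ID, user ID, which evaluation set the order belongs to (prior, train, or test), the order's sequence number for that user, day of the week, hour of the day the order was placed, days since the user's previous order
·aisles.csv: the aisle (fine-grained category) each product belongs to
    。Fields: aisle_id, aisle
    。Meaning: aisle ID, aisle name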
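
Only a few of these fields are needed to link user_id to aisle, and the joined table in this post has more than 32 million rows, so it can help to load just the key columns with compact integer types. A minimal sketch, assuming a three-row in-memory sample in the layout of orders.csv in place of the real file; `csv_text` and `orders_small` are names made up for this illustration.

```python
import io
import pandas as pd

# Three-row sample in the layout of orders.csv, held in memory for the demo.
csv_text = """order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,8,
2398795,1,prior,2,3,7,15.0
473747,1,prior,3,3,12,21.0
"""

# Read only the two columns the user-to-aisle join needs, as 32-bit integers.
orders_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "user_id"],
    dtype={"order_id": "int32", "user_id": "int32"},
)

print(orders_small.dtypes)
print(orders_small)
```

With the real data, the same `usecols` and `dtype` arguments would be passed together with the file path instead of the `io.StringIO` object.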
import numpy as np
import pandas as pd

# Load the data (the paths point to a local copy of the Kaggle files)
aisles = pd.read_csv(r"E:\instacart-market-basket-analysis\aisles.csv")
orders = pd.read_csv(r"E:\instacart-market-basket-analysis\orders.csv")
products = pd.read_csv(r"E:\instacart-market-basket-analysis\products.csv")
order_products_prior = pd.read_csv(r"E:\instacart-market-basket-analysis\order_products__prior.csv")

# Inspect the first rows of each table (display() is available in Jupyter/IPython)
display(aisles.head(3))
display(orders.head(3))
display(products.head(3))
display(order_products_prior.head(3))
   aisle_id                  aisle
0         1  prepared soups salads
1         2      specialty cheeses
2         3    energy granola bars

   order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1     prior             1          2                  8                     NaN
1   2398795        1     prior             2          3                  7                    15.0
2    473747        1     prior             3          3                 12                    21.0

   product_id                          product_name  aisle_id  department_id
0           1            Chocolate Sandwich Cookies        61             19
1           2                      All-Seasons Salt       104             13
2           3  Robust Golden Unsweetened Oolong Tea        94              7

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0

# Join the tables: user_id ----> aisle
data01 = pd.merge(orders, order_products_prior, how="inner", on="order_id")
data02 = pd.merge(data01, products, on="product_id")
data03 = pd.merge(data02, aisles, on="aisle_id")
# Show the shape and the last 10,000 rows (the notebook truncates the display)
display(data03.shape, data03.tail(10000))
(32434489, 14)
          order_id  user_id  eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order  product_id  add_to_cart_order  reordered                                       product_name  aisle_id  department_id             aisle
...
32424506   3176388    76823     prior             1          6                 12                     NaN       44471                 19          0                  Free & Clear Unscented Baby Wipes        82             18  baby accessories
...
32424510   2328300    77187     prior             1          1                  9                     NaN       44471                  1          0                  Free & Clear Unscented Baby Wipes        82             18  baby accessories
...
32434472    962734   167413     prior             1          1                 12                     NaN        5500                  9          0               Blended Juice Beverage, Mango Orange       113              1      frozen juice
...
32434477   2849065   138824     prior             1          6                 13                     NaN        2642                 20          0  Frozen Concentrated Orange Juice With Added Ca...       113              1      frozen juice
...

10000 rows × 14 columns
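
The joined table carries 14 columns, but the cross-tabulation in the next step only uses user_id and aisle. One possible variation, sketched here on hypothetical toy tables, is to keep just the key columns during the join and to pass pandas' `validate` argument, which raises a `MergeError` if a key that should be unique on the lookup side turns out to be duplicated. The name `user_aisle_rows` is made up for this sketch.

```python
import pandas as pd

# Hypothetical toy tables with the same key columns as the Instacart files.
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 20], "order_dow": [3, 5]})
order_products = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
products = pd.DataFrame({
    "product_id": [100, 101],
    "product_name": ["a", "b"],
    "aisle_id": [1, 2],
})
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["tea", "yogurt"]})

# Carry only the join keys plus user_id and aisle through the merges.
# validate="many_to_one" checks that each key is unique in the right-hand table.
user_aisle_rows = (
    order_products[["order_id", "product_id"]]
    .merge(orders[["order_id", "user_id"]], on="order_id", validate="many_to_one")
    .merge(products[["product_id", "aisle_id"]], on="product_id", validate="many_to_one")
    .merge(aisles, on="aisle_id", validate="many_to_one")
    [["user_id", "aisle"]]
)

print(user_aisle_rows)
```

The result has one row per purchased item, which is all `pd.crosstab` needs.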

# Build the cross-tabulation user_id ----> aisle: one row per user, one column per aisle, values are purchase counts
data04 = pd.crosstab(data03["user_id"], data03["aisle"])
display(data04.shape, data04.head(10))
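(206209, 134)
aisle    air fresheners candles  asian foods  baby accessories  baby bath body care  baby food formula  bakery desserts  baking ingredients  baking supplies decor  beauty  beers coolers  ...
user_id
1                             0            0                 0                    0                  0                0                   0                      0       0              0  ...
2                             0            3                 0                    0                  0                0                   2                      0       0              0  ...
3                             0            0                 0                    0                  0                0                   0                      0       0              0  ...
4                             0            0                 0                    0                  0                0                   0                      0       0              0  ...
5                             0            2                 0                    0                  0                0                   0                      0       0              0  ...
6                             0            0                 0                    0                  0                0                   0                      0       0              0  ...
7                             0            0                 0                    0                  0                0                   2                      0       0              0  ...
8                             0            1                 0                    0                  0                0                   1                      0       0              0  ...
9                             0            0                 0                    0                  6                0                   2                      0       0              0  ...
10                            0            1                 0                    0                  0                0                   0                      0       0              0  ...

aisle    ...  spreads  tea  tofu meat alternatives  tortillas flat bread  trail mix snack mix  trash bags liners  vitamins supplements  water seltzer sparkling water  white wines  yogurt
user_id
1        ...        1    0                       0                     0                    0                  0                     0                              0            0       1
2        ...        3    1                       1                     0                    0                  0                     0                              2            0      42
3        ...        4    1                       0                     0                    0                  0                     0                              2            0       0
4        ...        0    0                       0                     1                    0                  0                     0                              1            0       0
5        ...        0    0                       0                     0                    0                  0                     0                              0            0       3
6        ...        0    0                       0                     0                    0                  0                     0                              0            0       0
7        ...        0    0                       0                     0                    0                  0                     0                              0            0       5
8        ...        0    0                       0                     0                    0                  0                     0                              0            0       0
9        ...        0    0                       0                     0                    0                  0                     0                              2            0      19
10       ...        0    0                       0                     0                    0                  0                     0                              0            0       2

10 rows × 134 columns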
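
To make clear what `pd.crosstab` computes here, the sketch below builds the same kind of count table in two ways on a handful of hypothetical purchase rows: once with `crosstab`, and once by grouping, counting, and pivoting the aisle level into columns. The data and the names `purchases`, `via_crosstab`, and `via_groupby` are invented for the illustration.

```python
import pandas as pd

# Hypothetical long-format purchases: one row per purchased item.
purchases = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "aisle":   ["tea", "yogurt", "yogurt", "tea", "tea", "spreads"],
})

# Count table via crosstab: rows are users, columns are aisles.
via_crosstab = pd.crosstab(purchases["user_id"], purchases["aisle"])

# Same table via groupby: count each (user, aisle) pair, then spread aisles
# into columns, filling pairs that never occur with 0.
via_groupby = purchases.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(via_crosstab)
print((via_crosstab.to_numpy() == via_groupby.to_numpy()).all())
```

Most cells are zero because a user buys from only a few aisles, which is the same pattern visible in the 134-column table above.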

# Principal component analysis, keeping 90% of the variance
from sklearn.decomposition import PCA

# 1. Data: the cross-tabulation data04 built above
data = data04

# 2. Instantiate the transformer
#    n_components: a float between 0 and 1 is the fraction of variance to retain;
#                  an integer is the number of components to keep
transfer = PCA(n_components=0.9)

# 3. Fit the model and transform the data
xi = transfer.fit_transform(data)
# Shape of the reduced data and each component's explained-variance ratio
print(xi.shape, transfer.explained_variance_ratio_)

# 4. Collect the new principal-component variables in a DataFrame, named F1, F2, ...
Fi = ["F" + str(i) for i in range(1, xi.shape[1] + 1)]
data_pca = pd.DataFrame(xi, columns=Fi)
display(data_pca.head(3))
(206209, 27) [0.48237998 0.09585824 0.05185877 0.03590181 0.0293466  0.02393094
 0.01899492 0.0183208  0.01487788 0.0134451  0.01121877 0.01102918
 0.01052171 0.00980307 0.00832174 0.00726185 0.00712991 0.00683061
 0.00640343 0.00580483 0.00534075 0.00487297 0.00477908 0.00462158
 0.00444346 0.00413755 0.00408034]
           F1          F2          F3          F4          F5          F6          F7          F8          F9         F10  ...
0  -24.215659    2.429427   -2.466370   -0.145686    0.269042   -1.432932    2.140677   -2.738031   -2.714316   -1.743135  ...
1    6.463208   36.751116    8.382553   15.097530   -6.920938   -0.978375    6.011567    3.787725   -8.180749   -9.040861  ...
2   -7.990302    2.404383  -11.030064    0.672230   -0.442368   -2.823272   -6.284140    6.512509   -2.148634   -1.585257  ...

   ...        F18        F19        F20        F21        F22        F23        F24        F25        F26        F27
0  ...  -3.225987  -4.580076   0.777403  -3.699129   1.907214   2.995386   0.772923   0.686800   1.694394  -2.343230
1  ...  -0.737606  -0.737402   0.740042  -0.091338   5.151285  -4.584815  -3.237894   4.121213   2.446897  -4.283485
2  ...   5.434733  -3.604842   4.282794  -0.445834   3.039337  -1.469566  -2.946656   1.775345  -0.444194   0.786666

3 rows × 27 columns
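
The count of 27 components follows from the explained-variance ratios printed above: PCA keeps adding components until their cumulative share of the variance passes the requested 0.9. A short check, using the 27 ratios copied from that output (the names `ratios`, `cumulative`, and `n_needed` are made up for this sketch):

```python
import numpy as np

# Explained-variance ratios as printed by transfer.explained_variance_ratio_.
ratios = np.array([
    0.48237998, 0.09585824, 0.05185877, 0.03590181, 0.0293466, 0.02393094,
    0.01899492, 0.0183208, 0.01487788, 0.0134451, 0.01121877, 0.01102918,
    0.01052171, 0.00980307, 0.00832174, 0.00726185, 0.00712991, 0.00683061,
    0.00640343, 0.00580483, 0.00534075, 0.00487297, 0.00477908, 0.00462158,
    0.00444346, 0.00413755, 0.00408034,
])

# Running total of the variance explained by the first k components.
cumulative = np.cumsum(ratios)

# Position of the first running total that reaches 0.9, converted to a count.
n_needed = int(np.searchsorted(cumulative, 0.9)) + 1

print(n_needed)
print(round(float(cumulative[0]), 4), round(float(cumulative[-1]), 4))
```

The first component alone accounts for roughly 48% of the variance, and the 134 aisle columns shrink to 27 components while retaining about 90%.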
