Feature Engineering Case Study: Merging Tables, Cross-Tabulation, and Principal Component Analysis

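Goal: reduce feature dimensionality with principal component analysis (PCA)

Method:

Merge tables: user_id ----> aisle

Cross-tabulation: for each user, count how many products were bought from each aisle (fine-grained product category)

Dimensionality reduction: principal component analysis (PCA)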

Data source: https://www.kaggle.com/c/instacart-market-basket-analysis/data

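·order_products__prior.csv: the products contained in each prior order
    ◦ Fields: order_id, product_id, add_to_cart_order, reordered
    ◦ Meaning: order ID, product ID, position in which the product was added to the cart, whether the user has ordered this product before (1) or not (0)
·products.csv: product information
    ◦ Fields: product_id, product_name, aisle_id, department_id
    ◦ Meaning: product ID, product name, aisle (fine-grained category) ID, department (broad category) ID
·orders.csv: each user's orders
    ◦ Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
    ◦ Meaning: order ID, user ID, evaluation set the order belongs to (prior, train or test), sequence number of the order for that user, day of the week, hour of the day the order was placed, days since the user's previous order (NaN for a first order)
·aisles.csv: the aisle (fine-grained category) each product belongs to
    ◦ Fields: aisle_id, aisle
    ◦ Meaning: aisle ID, aisle name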
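Because the merged table later grows to tens of millions of rows, it can help to load only the columns that are needed and give them compact dtypes. The following is a minimal sketch and not part of the original workflow: the in-memory sample stands in for orders.csv, and the choice of columns and of `int32` is an assumption made for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the orders.csv layout (stands in for the real file)
sample_csv = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order\n"
    "2539329,1,prior,1,2,8,\n"
    "2398795,1,prior,2,3,7,15.0\n"
    "473747,1,prior,3,3,12,21.0\n"
)

# Read only the two columns needed for the user_id ----> aisle join,
# and store them as 32-bit integers instead of the default 64-bit
orders_small = pd.read_csv(
    sample_csv,
    usecols=["order_id", "user_id"],
    dtype={"order_id": "int32", "user_id": "int32"},
)
print(orders_small.dtypes)
print(orders_small.shape)
```

With the real files, the same `usecols` and `dtype` arguments would be passed alongside the file path.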
import numpy as np
import pandas as pd
from IPython.display import display  # built into Jupyter; imported explicitly for clarity

# Load the data
aisles = pd.read_csv(r"E:\instacart-market-basket-analysis\aisles.csv", sep=",", encoding="utf-8")
orders = pd.read_csv(r"E:\instacart-market-basket-analysis\orders.csv", sep=",", encoding="utf-8")
products = pd.read_csv(r"E:\instacart-market-basket-analysis\products.csv", sep=",", encoding="utf-8")
order_products_prior = pd.read_csv(r"E:\instacart-market-basket-analysis\order_products__prior.csv", sep=",", encoding="utf-8")

# Preview the first rows of each table
display(aisles.head(3))
display(orders.head(3))
display(products.head(3))
display(order_products_prior.head(3))
   aisle_id                  aisle
0         1  prepared soups salads
1         2      specialty cheeses
2         3    energy granola bars

   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0

   product_id                          product_name  aisle_id  department_id
0           1            Chocolate Sandwich Cookies        61             19
1           2                      All-Seasons Salt       104             13
2           3  Robust Golden Unsweetened Oolong Tea        94              7

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
# Merge tables: user_id ----> aisle
data01 = pd.merge(orders, order_products_prior, how="inner", on="order_id")
data02 = pd.merge(data01, products, how="inner", on="product_id")
data03 = pd.merge(data02, aisles, how="inner", on="aisle_id")
# Show the shape and the last 10,000 rows (the notebook abbreviates the display)
display(data03.shape, data03.tail(10000))
(32434489, 14)

[Table output: the last 10,000 rows of data03 (index 32424489 to 32434488), abbreviated by the notebook. Columns: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order, product_id, add_to_cart_order, reordered, product_name, aisle_id, department_id, aisle. The preview starts with rows for "Free & Clear Unscented Baby Wipes" (aisle "baby accessories") and ends with rows from the aisle "frozen juice".]

10000 rows × 14 columns
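The three joins above can be checked on a small scale. The sketch below uses hypothetical miniature tables (all IDs and aisle names are made up) and adds pandas' `validate="many_to_one"` argument, which raises an error if a key is duplicated on the right-hand side. This is an optional safeguard that the original code does not use.

```python
import pandas as pd

# Hypothetical miniature versions of the four tables (all values are invented)
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
order_products = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
products = pd.DataFrame({"product_id": [100, 101], "aisle_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [7, 8], "aisle": ["tea", "yogurt"]})

# Chain the joins; each right-hand table must have unique keys,
# otherwise validate="many_to_one" raises pandas.errors.MergeError
merged = (
    order_products
    .merge(orders, on="order_id", how="inner", validate="many_to_one")
    .merge(products, on="product_id", how="inner", validate="many_to_one")
    .merge(aisles, on="aisle_id", how="inner", validate="many_to_one")
)
print(merged[["user_id", "aisle"]])
```

Each row of `merged` is one purchased product, now carrying both the buyer's user_id and the aisle name, which is exactly what the cross-tabulation step needs.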

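# Build the cross-tabulation user_id ----> aisle:
# one row per user, one column per aisle, each cell the number of products bought
data04 = pd.crosstab(data03["user_id"], data03["aisle"])
display(data04.shape, data04.head(10))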
(206209, 134)

[Table output: the first 10 rows of data04, indexed by user_id (1 to 10). Each of the 134 columns is one aisle, from "air fresheners candles", "asian foods", "baby accessories", "baby bath body care", "baby food formula", "bakery desserts", "baking ingredients", "baking supplies decor", "beauty", "beers coolers", ... to "spreads", "tea", "tofu meat alternatives", "tortillas flat bread", "trail mix snack mix", "trash bags liners", "vitamins supplements", "water seltzer sparkling water", "white wines", "yogurt". Each cell is the number of products the user bought from that aisle; most cells are 0.]

10 rows × 134 columns
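To make the meaning of the cross-tabulation concrete, here is a small sketch on hypothetical data (the user IDs and aisle names are invented). It also shows that `pd.crosstab` gives the same counts as grouping by both columns and unstacking, which is one way to read what the function does.

```python
import pandas as pd

# Hypothetical purchase records: one row per purchased product
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["tea", "tea", "yogurt", "yogurt", "spreads"],
})

# Rows are users, columns are aisles, cells are purchase counts
counts = pd.crosstab(toy["user_id"], toy["aisle"])

# Equivalent formulation: count each (user, aisle) pair, then pivot aisles into columns
same = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(counts)
```

User 1 bought two products from "tea" and one from "yogurt", so that row holds 2 and 1 in those columns and 0 under "spreads".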

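# Principal component analysis, keeping 90% of the variance
from sklearn.decomposition import PCA
import pandas as pd
import numpy as np

# 1. Data: use data04 produced by the code above
data = data04

# 2. Instantiate the transformer
transfer = PCA(n_components=0.9)
# n_components: a float between 0 and 1 keeps enough components to explain that share of the variance;
#               an integer sets the number of components to keep
# 3. Call fit_transform()
xi = transfer.fit_transform(data)
# Inspect the shape of the transformed data and the explained variance ratio of each component
print(xi.shape, transfer.explained_variance_ratio_)
# 4. Output the newly constructed principal components as columns F1, F2, ...
Fi = ["F" + str(i) for i in range(1, xi.shape[1] + 1)]
# Note: this reuses the name data02 and overwrites the merged table from earlier
data02 = pd.DataFrame(xi, columns=Fi)
display(data02.head(3))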
(206209, 27) [0.48237998 0.09585824 0.05185877 0.03590181 0.0293466  0.02393094
 0.01899492 0.0183208  0.01487788 0.0134451  0.01121877 0.01102918
 0.01052171 0.00980307 0.00832174 0.00726185 0.00712991 0.00683061
 0.00640343 0.00580483 0.00534075 0.00487297 0.00477908 0.00462158
 0.00444346 0.00413755 0.00408034]
Columns F1 to F10:

          F1         F2         F3         F4         F5         F6         F7         F8         F9        F10
0 -24.215659   2.429427  -2.466370  -0.145686   0.269042  -1.432932   2.140677  -2.738031  -2.714316  -1.743135
1   6.463208  36.751116   8.382553  15.097530  -6.920938  -0.978375   6.011567   3.787725  -8.180749  -9.040861
2  -7.990302   2.404383 -11.030064   0.672230  -0.442368  -2.823272  -6.284140   6.512509  -2.148634  -1.585257

Columns F18 to F27 (F11 to F17 are elided in the notebook display):

         F18        F19        F20        F21        F22        F23        F24        F25        F26        F27
0  -3.225987  -4.580076   0.777403  -3.699129   1.907214   2.995386   0.772923   0.686800   1.694394  -2.343230
1  -0.737606  -0.737402   0.740042  -0.091338   5.151285  -4.584815  -3.237894   4.121213   2.446897  -4.283485
2   5.434733  -3.604842   4.282794  -0.445834   3.039337  -1.469566  -2.946656   1.775345  -0.444194   0.786666

3 rows × 27 columns
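The 27 explained variance ratios printed above sum to roughly 0.90, which is why `n_components=0.9` kept 27 components. The sketch below illustrates that selection rule on hypothetical random data (the data and its shape are made up): it fits a full PCA, accumulates the ratios, and finds the smallest number of components whose cumulative share exceeds 0.9.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 10 features with increasingly large scales
X = rng.normal(size=(200, 10)) * np.arange(1, 11)

# Fit PCA with all components and accumulate the explained variance ratios
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest k whose cumulative explained variance exceeds 0.9
k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# Let scikit-learn choose the number of components from the same threshold
auto = PCA(n_components=0.9).fit(X)
print(k, auto.n_components_)
```

Both numbers should agree. Plotting `cumulative` against the component index is a common way to judge whether a threshold other than 0.9 would be a better trade-off.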
