JD Consumer Behavior Analysis - Data Cleaning


1. Data Overview

Column overview:
customer_id: user ID
product_id: product ID
action_date: timestamp of the action (2018/2/1 to 2018/4/15)
action_id: action ID
type: 'PageView' = browse, 'Follow' = follow/favorite, 'Order' = purchase, 'SavedCart' = add to cart, 'Comment' = comment
age_range: age bracket, 1-6
gender: M = male, W = female; recoded to 1 and 0 later for easier computation
customer_register_date: user registration date
customer_level: user level, 1-7
city_level: city tier of the user, 1-6, roughly from super-first-tier down to fifth-tier cities
brand: brand
shop_id: shop ID
category: product category
product_market_date: date the product was listed
vender_id: vendor ID (read here as a warehouse code)
fans_number: number of followers
vip_number: meaning unclear; would need to ask whoever uploaded the data
shop_register_date: shop registration date; about 38.56% of the values are missing in the sample (a quick check is sketched just after this list), which noticeably affects the results
shop_category: shop category
shop_score: shop rating
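A quick way to back up the shop_register_date note above is to look at per-column missing rates. A minimal sketch, assuming the same Excel file that is read in the next section:

import pandas as pd

# load the raw file and compute the share of missing values per column
df_check = pd.read_excel('京东消费者分析数据.xlsx')
missing_rate = df_check.isnull().mean().sort_values(ascending=False)
print(missing_rate.head())  # shop_register_date stood out at roughly 38.56% missing in the author's 30% sample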

2. Inspecting the Data

# data cleaning
import numpy as np
import pandas as pd
import gc
import warnings
warnings.filterwarnings('ignore')
from datetime import datetime
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['SimHei']
# set the font so Chinese labels display correctly
plt.rcParams['axes.unicode_minus']=False
# matplotlib defaults to a Unicode minus; this makes characters such as the minus sign display correctly

# read the data
df_original = pd.read_excel('京东消费者分析数据.xlsx')
df = df_original.sample(frac=0.3, random_state=None)  # 30% random sample; with random_state=None the sample differs between runs
del df_original  # free the full dataset from memory
gc.collect()
print(df)

df.info()
df.isnull().sum()
df.describe()

 Out:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 55148 entries, 107652 to 183471
Data columns (total 20 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   customer_id             55148 non-null  int64         
 1   product_id              55148 non-null  int64         
 2   action_date             55148 non-null  datetime64[ns]
 3   action_id               55148 non-null  int64         
 4   type                    55148 non-null  object        
 5   age_range               55118 non-null  float64       
 6   gender                  55148 non-null  object        
 7   customer_register_date  55148 non-null  datetime64[ns]
 8   customer_level          55148 non-null  int64         
 9   city_level              55107 non-null  float64       
 10  brand                   55148 non-null  object        
 11  shop_id                 55148 non-null  int64         
 12  category                55148 non-null  object        
 13  product_market_date     55148 non-null  datetime64[ns]
 14  vender_id               55148 non-null  int64         
 15  fans_number             55148 non-null  int64         
 16  vip_number              55148 non-null  int64         
 17  shop_register_date      33574 non-null  datetime64[ns]
 18  shop_category           55148 non-null  object        
 19  shop_score              55148 non-null  float64       
dtypes: datetime64[ns](4), float64(3), int64(8), object(5)
memory usage: 8.8+ MB

Out:

customer_id                   0
product_id                    0
action_date                   0
action_id                     0
type                          0
age_range                    30
gender                        0
customer_register_date        0
customer_level                0
city_level                   41
brand                         0
shop_id                       0
category                      0
product_market_date           0
vender_id                     0
fans_number                   0
vip_number                    0
shop_register_date        21574
shop_category                 0
shop_score                    0
dtype: int64

Out: df.describe() summary statistics for the 11 numeric columns (customer_id, product_id, action_id, age_range, customer_level, city_level, shop_id, vender_id, fans_number, vip_number, shop_score). Notable points: age_range runs from 1 to 6, customer_level from 1 to 7, city_level from 1 to 6, and shop_score from -1 to 10.
df['type'].unique()
# list the distinct values of type
Out:array(['PageView', 'Order', 'Comment', 'Follow', 'SavedCart'],
      dtype=object)
df['action_id'].value_counts()
10439703    2689
4099416        4
6069655        4
9401782        4
1553885        3
            ... 
199997         1
8050723        1
1858919        1
2400243        1
7382395        1
Name: action_id, Length: 51971, dtype: int64
df[df['action_id']==10439703]
Out: 2689 rows × 20 columns; all records sharing action_id 10439703, spread across different customers, dates, and action types (Comment, Follow, SavedCart, and so on).

In theory an action_id should uniquely identify one action by one user in the system. Here, however, the same action_id is reused across different users and scenarios. Since this dataset is only a 30% random sample, the id can be treated as system-generated and reused rather than as a unique key.
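To see how widespread the reuse is, a small check on the sampled data (a sketch using the df built above) counts how many rows share their action_id:

# share of rows whose action_id appears more than once, and distinct ids vs rows
dup_share = df['action_id'].duplicated(keep=False).mean()
print(f'{dup_share:.2%} of rows share an action_id with another row')
print('distinct action_id values:', df['action_id'].nunique(), 'out of', len(df), 'rows')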
# check products and shops to get a rough sense of which items and shops are most popular
df['product_id'].value_counts()
df['shop_id'].value_counts()
# check the vendor/warehouse data; the most frequent one is presumably JD's own warehouse
df['vender_id'].value_counts()
# check shop registration dates: about 38.56% are missing in this sample, which noticeably affects the results
df['shop_register_date'].value_counts()
# so drop shop_register_date
df.drop('shop_register_date',axis=1,inplace=True)

3. Recoding the Relevant Columns

3.1 Recode gender as 1/0 and action type as 1-5

# recode gender as 1 (M) and 0 (W)
df.loc[df['gender'] == 'M', 'gender'] = 1
df.loc[df['gender'] == 'W', 'gender'] = 0
# recode the action type as 1-5
df.loc[df['type'] == 'PageView', 'type'] = 1
df.loc[df['type'] == 'Follow', 'type'] = 2
df.loc[df['type'] == 'SavedCart', 'type'] = 3
df.loc[df['type'] == 'Order', 'type'] = 4
df.loc[df['type'] == 'Comment', 'type'] = 5
df
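The chained .loc assignments do the job; an equivalent, more compact sketch uses map with explicit dictionaries (an alternative to the block above, to be run instead of it, not after it):

# one-step recoding with explicit mappings
gender_map = {'M': 1, 'W': 0}
type_map = {'PageView': 1, 'Follow': 2, 'SavedCart': 3, 'Order': 4, 'Comment': 5}
df['gender'] = df['gender'].map(gender_map)
df['type'] = df['type'].map(type_map)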

3.2 When users prefer to browse and place orders

df['date'] = df['action_date'].dt.date
df['time'] = df['action_date'].dt.hour
df['Zone'] = pd.cut(df['time'],bins=[-1,5,10,13,18,24],labels=['凌晨','上午','中午','下午','晚上'])
# bucket the hour into 凌晨 (early morning), 上午 (morning), 中午 (midday), 下午 (afternoon), 晚上 (evening)
df.drop('action_date',axis=1,inplace=True)
df
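A quick sanity check on the binning (a sketch): every hour from 0 to 23 should fall into one of the five buckets, so Zone should have no missing values.

print(df['Zone'].value_counts())   # distribution of actions across 凌晨/上午/中午/下午/晚上
print(df['Zone'].isnull().sum())   # 0 expected: the bins [-1,5,10,13,18,24] cover hours 0-23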

3.2.1 Users (labels)

users = df['customer_id'].unique()
labels = pd.DataFrame(users,columns=['customer_id'])
labels

 

3.2.2 User activity times (time_browse)

time_browse=df[df['type']==1].groupby(['customer_id','Zone']).product_id.count().reset_index()
time_browse.rename(columns={'product_id':'hour_counts'}, inplace=True)
time_browse

 

3.2.3 Each user's maximum browse count (time_browse_max)

# each user's maximum browse count across time buckets
time_browse_max = time_browse.groupby('customer_id').hour_counts.max().reset_index()
time_browse_max.rename(columns={'hour_counts':'read_counts_max'},inplace=True)
time_browse_max
# merge the max back into time_browse
time_browse = pd.merge(time_browse,time_browse_max,how='left',on='customer_id')
time_browse

3.2.4 Each user's most-browsed time bucket (time_browse_Zone)

# pick each user's most-browsed time bucket(s); ties are joined with commas
time_browse_Zone=time_browse.loc[time_browse['hour_counts']==time_browse['read_counts_max'],'Zone'].groupby(time_browse['customer_id']).aggregate(lambda x:','.join(x)).reset_index()
time_browse_Zone

 

3.2.5 Add the most-browsed time bucket to the user label table (labels)

labels = pd.merge(labels,time_browse_Zone,how='left',on='customer_id')
labels.rename(columns={'Zone':'time_browse'},inplace=True)
# NaN values are expected here because only 30% of the data was sampled
labels

 

 

3.2.6 Each user's preferred ordering time bucket (time_Order_hour)

# the time bucket each user orders in most
time_Order = df[df['type']==4].groupby(['customer_id','Zone']).product_id.count().reset_index()
time_Order.rename(columns={'product_id':'hour_counts'},inplace=True)
time_Order_max = time_Order.groupby('customer_id').hour_counts.max().reset_index()
time_Order_max.rename(columns={'hour_counts':'buy_counts_max'},inplace=True)
time_Order = pd.merge(time_Order,time_Order_max,how='left',on='customer_id')
time_Order_hour = time_Order.loc[time_Order['hour_counts']==time_Order['buy_counts_max'],'Zone'].groupby(time_Order['customer_id']).aggregate(lambda x:','.join(x)).reset_index()
time_Order_hour

 

3.2.7 Add the preferred ordering time bucket to the user label table (labels)

labels = pd.merge(labels,time_Order_hour,how='left',on='customer_id')
labels.rename(columns={'Zone':'time_Order'},inplace=True)
labels

 

4. Using product category as the anchor for shop and brand recommendations

df_browse = df.loc[df['type']==1,['customer_id','product_id','brand','category','shop_category']]  # browse
df_Follow = df.loc[df['type']==2,['customer_id','product_id','brand','category','shop_category']]  # follow
df_SavedCart = df.loc[df['type']==3,['customer_id','product_id','brand','category','shop_category']]  # add to cart
df_Order = df.loc[df['type']==4,['customer_id','product_id','brand','category','shop_category']]  # order
df_Comment = df.loc[df['type']==5,['customer_id','product_id','brand','category','shop_category']]  # comment

4.1 Most-browsed categories

# group by user and category and count browse events
df_cate_most_browse = df_browse.groupby(['customer_id','category']).product_id.count().reset_index()
df_cate_most_browse.rename(columns={'product_id':'category_counts'},inplace=True)
# each user's highest browse count across categories
df_cate_most_browse_max = df_cate_most_browse.groupby('customer_id').category_counts.max().reset_index()
df_cate_most_browse_max.rename(columns={'category_counts':'category_counts_max'},inplace=True)

4.1.1 Each user's most-browsed category (df_cate_browse)

df_cate_most_browse = pd.merge(df_cate_most_browse,df_cate_most_browse_max,how='left',on='customer_id')
df_cate_most_browse['category'] = df_cate_most_browse['category'].astype(str)
# pick each user's most-browsed category(ies); ties are joined with commas
df_cate_browse = df_cate_most_browse.loc[df_cate_most_browse['category_counts']==df_cate_most_browse['category_counts_max'],'category'].groupby(df_cate_most_browse['customer_id']).aggregate(lambda x:','.join(x)).reset_index()
df_cate_browse

4.1.2 Add the most-browsed category to the user label table (labels)

# add each user's most-browsed category to the label table
labels = pd.merge(labels,df_cate_browse,how='left',on='customer_id')
labels.rename(columns={'category':'cate_most_browse'},inplace=True)
labels

 

4.2 Most-followed categories

4.2.1 Each user's most-followed category (df_cate_Follow)

# group by user and category and count follow events
df_cate_most_Follow = df_Follow.groupby(['customer_id','category']).product_id.count().reset_index()
df_cate_most_Follow.rename(columns={'product_id':'category_counts'},inplace=True)

df_cate_most_Follow_max = df_cate_most_Follow.groupby('customer_id').category_counts.max().reset_index()
df_cate_most_Follow_max.rename(columns={'category_counts':'category_counts_max'},inplace=True)

df_cate_most_Follow = pd.merge(df_cate_most_Follow,df_cate_most_Follow_max,how='left',on='customer_id')
df_cate_most_Follow['category'] = df_cate_most_Follow['category'].astype(str)
df_cate_Follow = df_cate_most_Follow.loc[df_cate_most_Follow['category_counts']==df_cate_most_Follow['category_counts_max'],'category'].groupby(df_cate_most_Follow['customer_id']).aggregate(lambda x:','.join(x)).reset_index()
df_cate_Follow

4.2.2 Add the most-followed category to the user label table (labels)

labels = pd.merge(labels,df_cate_Follow,how='left',on='customer_id')
labels.rename(columns={'category':'cate_most_Follow'},inplace=True)
labels

4.2.3 Each user's most-added-to-cart categories, located via the maximum add-to-cart count and joined with ','

df_cate_most_SavedCart = df_SavedCart.groupby(['customer_id','category']).product_id.count().reset_index()  # count added-to-cart items per user and category
df_cate_most_SavedCart.rename(columns={'product_id':'category_counts'},inplace=True)  # rename the item-count column

df_cate_most_SavedCart_max = df_cate_most_SavedCart.groupby('customer_id').category_counts.max().reset_index()  # each user's maximum add-to-cart count
df_cate_most_SavedCart_max.rename(columns={'category_counts':'category_counts_max'},inplace=True)  # rename the max-count column

df_cate_most_SavedCart = pd.merge(df_cate_most_SavedCart,df_cate_most_SavedCart_max,how='left',on='customer_id')
df_cate_most_SavedCart['category'] = df_cate_most_SavedCart['category'].astype(str)
df_cate_SavedCart = df_cate_most_SavedCart.loc[df_cate_most_SavedCart['category_counts']==df_cate_most_SavedCart['category_counts_max'],'category'].groupby(df_cate_most_SavedCart['customer_id']).aggregate(lambda x:','.join(x)).reset_index()
# pick the category(ies) with the maximum add-to-cart count; ties are joined with commas
df_cate_SavedCart

4.2.4 Add the most-added-to-cart category to the user label table (labels)

labels = pd.merge(labels,df_cate_SavedCart,how='left',on='customer_id')
labels.rename(columns={'category':'cate_most_SavedCart'},inplace=True)
labels
Out: labels preview with columns customer_id, time_browse, time_Order, cate_most_browse, cate_most_Follow, cate_most_SavedCart; 48853 rows × 6 columns. Many of the Order, Follow, and SavedCart columns are NaN because only 30% of the data was sampled.

4.3 Most-purchased categories

4.3.1 Each user's most-purchased category (df_cate_Order)

df_cate_most_Order = df_Order.groupby(['customer_id','category']).product_id.count().reset_index()
df_cate_most_Order.rename(columns={'product_id':'category_counts'},inplace=True)

df_cate_most_Order_max = df_cate_most_Order.groupby('customer_id').category_counts.max().reset_index()
df_cate_most_Order_max.rename(columns={'category_counts':'category_counts_max'},inplace=True)

df_cate_most_Order = pd.merge(df_cate_most_Order,df_cate_most_Order_max,how='left',on='customer_id')
df_cate_most_Order['category'] = df_cate_most_Order['category'].astype(str)
df_cate_Order = df_cate_most_Order.loc[df_cate_most_Order['category_counts']==df_cate_most_Order['category_counts_max'],'category'].groupby(df_cate_most_Order['customer_id']).aggregate(lambda x:','.join(x)).reset_index()

df_cate_Order
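By this point the same count, take the max, keep ties, join with commas pattern has been written out four times (browse, Follow, SavedCart, Order). A refactoring sketch (the helper most_frequent_category is hypothetical, not part of the original code) that produces the same per-user result:

def most_frequent_category(df_action, out_col):
    # count events per user and category
    counts = df_action.groupby(['customer_id', 'category']).product_id.count().reset_index(name='n')
    # keep each user's most frequent category(ies); ties are joined with commas
    top = counts[counts['n'] == counts.groupby('customer_id')['n'].transform('max')]
    top = top.groupby('customer_id')['category'].agg(lambda x: ','.join(x.astype(str))).reset_index()
    return top.rename(columns={'category': out_col})

# e.g. labels = labels.merge(most_frequent_category(df_Order, 'cate_most_Order'), how='left', on='customer_id')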

4.3.2 Add the most-purchased category to the user label table (labels)

labels = pd.merge(labels,df_cate_Order,how='left',on='customer_id')
labels.rename(columns={'category':'cate_most_Order'},inplace=True)
labels

4.4 User behavior in the last 30 days (month_active)

The last action_date in the dataset is 2018/04/15; counting back 30 days gives 2018/03/16, so the filter below keeps dates after 2018/03/15.
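The 30-day and 7-day cutoffs below are hard-coded strings; a small sketch of deriving them from the data instead (using the date column created earlier), which yields the same windows:

end_date = pd.to_datetime(df['date']).max()       # 2018-04-15 in this dataset
cutoff_30d = end_date - pd.Timedelta(days=30)     # 2018-03-16
cutoff_7d = end_date - pd.Timedelta(days=7)       # 2018-04-08
# df_near_30 = df_short[df_short['date'] >= cutoff_30d]   # equivalent to filtering > '2018-03-15'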

df_short = df[['customer_id','product_id','type','brand','category','shop_category','date']]
df_short['date'] = pd.to_datetime(df_short['date'],format='%Y-%m-%d')
df_short
df_near_30 = df_short[df_short['date']>datetime.strptime('2018-03-15', '%Y-%m-%d')]
#df_near_30
df_near_30_Order = df_near_30[df_near_30['type']==4].groupby('customer_id').product_id.count().reset_index()
labels = pd.merge(labels,df_near_30_Order,how='left',on='customer_id')
labels.rename(columns={'product_id':'month_buy'},inplace=True)
df_near_30_SavedCart = df_near_30[df_near_30['type']==3].groupby('customer_id').product_id.count().reset_index()
labels = pd.merge(labels,df_near_30_SavedCart,how='left',on='customer_id')
labels.rename(columns={'product_id':'month_Cart'},inplace=True)
month_active = df_near_30.groupby('customer_id')['date'].nunique()
labels = pd.merge(labels,month_active,how='left',on='customer_id')
labels.rename(columns={'date':'month_active'},inplace=True)
labels
Out: labels preview with month_buy, month_Cart, and month_active appended; 48853 rows × 10 columns.

4.5 User behavior in the last 7 days (df_week_buy)

The last action_date in the dataset is 2018/04/15; counting back 7 days gives 2018/04/08, so the filter below keeps dates after 2018/04/07.

df_week = df_short[df_short['date']>datetime.strptime('2018-04-07', '%Y-%m-%d')]

df_week_buy = df_week[df_week['type']==4].groupby('customer_id').product_id.count().reset_index()
labels = pd.merge(labels,df_week_buy,how='left',on='customer_id')
labels.rename(columns={'product_id':'week_buy'},inplace=True)


df_week_SavedCart = df_week[df_week['type']==3].groupby('customer_id').product_id.count().reset_index()
labels = pd.merge(labels,df_week_SavedCart,how='left',on='customer_id')
labels.rename(columns={'product_id':'week_Cart'},inplace=True)


week_active = df_week.groupby('customer_id')['date'].nunique()  # distinct active days within the last 7 days
labels = pd.merge(labels,week_active,how='left',on='customer_id')
labels.rename(columns={'date':'week_active'},inplace=True)
labels
Out: labels preview with week_buy, week_Cart, and week_active appended; 48853 rows × 13 columns.

4.6 Days since the last browse (last_browse), add-to-cart (last_SavedCart), and order (last_Order), measured from the final day (2018/04/15)

# days since the last browse, relative to 2018/04/15
last_browse = df_short[df_short['type']==1].groupby('customer_id')['date'].max().apply(lambda x:(datetime.strptime('2018-04-15','%Y-%m-%d')-x).days)
labels = pd.merge(labels,last_browse,how='left',on='customer_id')
labels.rename(columns={'date':'last_browse'},inplace=True)
labels
# days since the last add-to-cart, relative to 2018/04/15
last_SavedCart = df_short[df_short['type']==3].groupby('customer_id')['date'].max().apply(lambda x:(datetime.strptime('2018-04-15','%Y-%m-%d')-x).days)
labels = pd.merge(labels,last_SavedCart,how='left',on='customer_id')
labels.rename(columns={'date':'last_SavedCart'},inplace=True)
labels
# days since the last order, relative to 2018/04/15
last_Order = df_short[df_short['type']==4].groupby('customer_id')['date'].max().apply(lambda x:(datetime.strptime('2018-04-15','%Y-%m-%d')-x).days)
labels = pd.merge(labels,last_Order,how='left',on='customer_id')
labels.rename(columns={'date':'last_Order'},inplace=True)
labels
Out: labels preview with last_browse, last_SavedCart, and last_Order appended; 48853 rows × 16 columns.
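The three recency computations above parse the same reference date three times; a small readability sketch (not part of the original code) that factors it out:

END = datetime.strptime('2018-04-15', '%Y-%m-%d')   # last day in the dataset

def days_since(last_dates):
    # last_dates: per-customer Series of last action dates
    return last_dates.apply(lambda x: (END - x).days)

# e.g. last_browse = days_since(df_short[df_short['type']==1].groupby('customer_id')['date'].max())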

4.7 Days between the two most recent purchases (interval_buy)

# days between the two most recent purchases
df_interval_buy = df_short[df_short['type']==4].groupby(['customer_id','date']).product_id.count().reset_index()
interval_buy = df_interval_buy.groupby('customer_id')['date'].apply(lambda x:x.sort_values().diff(1).dropna().tail(1)).reset_index()  # tail(1): the gap between the two most recent purchase dates
interval_buy['date'] = interval_buy['date'].apply(lambda x : x.days)
interval_buy.drop('level_1',axis=1,inplace=True)
interval_buy.rename(columns={'date':'interval_buy'},inplace=True)
labels = pd.merge(labels,interval_buy,how='left',on='customer_id')
labels
Out: labels preview with interval_buy appended; 48853 rows × 17 columns.

4.8 Users who browsed but did not buy (browse_not_buy)

df_browse_buy = df.loc[(df['type']==1) | (df['type']==4),['customer_id','product_id','type','time']]
browse_not_buy = pd.pivot_table(df_browse_buy,index=['customer_id','product_id'],columns=['type'],values=['time'],aggfunc=['count'])

browse_not_buy.columns = ['browse','buy']  # flatten the two-level columns
browse_not_buy.fillna(0,inplace=True)  # a missing count means the action never happened
browse_not_buy['browse_not_buy'] = 0
browse_not_buy.loc[(browse_not_buy['browse']>0) & (browse_not_buy['buy']==0),'browse_not_buy'] = 1  # flag products browsed but never ordered
browse_not_buy = browse_not_buy.groupby('customer_id')['browse_not_buy'].sum().reset_index()
labels = pd.merge(labels,browse_not_buy,how='left',on='customer_id')
labels['browse_not_buy'] = labels['browse_not_buy'].apply(lambda x: '是' if x>0 else '否')
labels
Out: labels preview with browse_not_buy appended; 48853 rows × 18 columns.
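The pivot_table builds a per-(customer, product) matrix of browse vs. buy counts; the same idea can be sketched with pd.crosstab (an alternative, not the approach used above):

# cross-tabulate action counts per (customer, product); type 1 = browse, type 4 = buy
tab = pd.crosstab([df_browse_buy['customer_id'], df_browse_buy['product_id']], df_browse_buy['type'])
tab.columns = ['browse', 'buy']
browse_only = ((tab['browse'] > 0) & (tab['buy'] == 0)).groupby(level='customer_id').sum()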

4.9 Users who added to cart but did not buy (cart_not_buy)

df_cart_buy = df.loc[(df['type']==3) | (df['type']==4),['customer_id','product_id','type','time']]  # add-to-cart (3) and order (4) events
cart_not_buy = pd.pivot_table(df_cart_buy,index=['customer_id','product_id'],columns=['type'],values=['time'],aggfunc=['count'])
cart_not_buy.columns = ['cart','buy']
cart_not_buy.fillna(0,inplace=True)
cart_not_buy['cart_not_buy'] = 0
cart_not_buy.loc[(cart_not_buy['cart']>0) & (cart_not_buy['buy']==0),'cart_not_buy'] = 1
cart_not_buy = cart_not_buy.groupby('customer_id')['cart_not_buy'].sum().reset_index()
labels = pd.merge(labels,cart_not_buy,how='left',on='customer_id')
labels['cart_not_buy'] = labels['cart_not_buy'].apply(lambda x: '是' if x>0 else '否')
labels
Out: labels preview with cart_not_buy appended; 48853 rows × 21 columns (the extra cart_not_buy_x and cart_not_buy_y columns in this preview come from the merge having been run more than once in the same session).

4.10 Repeat buyers (buy_again)

Users with no purchase are labeled '未购买' (no purchase), users with exactly one purchase are labeled '否' (no), and users with repeat purchases are labeled '是' (yes).

# repeat-buyer flag
buy_again = df[df['type']==4].groupby('customer_id')['product_id'].count().reset_index()
buy_again.rename(columns={'product_id':'buy_again'},inplace=True)
labels = pd.merge(labels,buy_again,how='left',on='customer_id')
# no purchase -> '未购买', exactly one purchase -> '否', repeat purchases -> '是'
labels['buy_again'] = labels['buy_again'].apply(lambda x: '是' if x>1 else '否' if x==1 else '未购买')
labels
# export the label table
labels.to_csv('JD_labels.csv')
Out: the final label table with buy_again appended, exported to JD_labels.csv; 48973 rows × 20 columns (the row count differs from the earlier previews, presumably because the unseeded 30% sample was re-drawn in this run).
