日萌社
人工智能AI:Keras PyTorch MXNet TensorFlow PaddlePaddle 深度学习实战(不定时更新)
阿里云官网:天池新人实战赛o2o优惠券使用预测
链接:https://pan.baidu.com/s/1ACzlCYAZhpg1dm7SBVw-Sw?pwd=bea7
提取码:bea7
Sklearn:天池新人实战赛o2o优惠券使用预测 part1
Sklearn:天池新人实战赛o2o优惠券使用预测 part2
Sklearn:天池新人实战赛o2o优惠券使用预测 part3
数据
本赛题提供用户在2016年1月1日至2016年6月30日之间真实线上线下消费行为,预测用户在2016年7月领取优惠券后15天以内的使用情况。
注意: 为了保护用户和商家的隐私,所有数据均作匿名处理,同时采用了有偏采样和必要过滤。
评价方式
本赛题目标是预测投放的优惠券是否核销。针对此任务及一些相关背景知识,使用优惠券核销预测的平均AUC(ROC曲线下面积)作为评价标准。 即对每个优惠券coupon_id单独计算核销预测的AUC值,再对所有优惠券的AUC值求平均作为最终的评价标准。 关于AUC的含义与具体计算方法。
字段表
Table 1: 用户线下消费和优惠券领取行为
Field | Description |
User_id | 用户ID |
Merchant_id | 商户ID |
Coupon_id | 优惠券ID:null表示无优惠券消费,此时Discount_rate和Date_received字段无意义 |
Discount_rate | 优惠率:x \in [0,1]代表折扣率;x:y表示满x减y。单位是元 |
Distance | user经常活动的地点离该merchant的最近门店距离是x*500米(如果是连锁店,则取最近的一家门店),x\in[0,10];null表示无此信息,0表示低于500米,10表示大于5公里; |
Date_received | 领取优惠券日期 |
Date | 消费日期:如果Date=null & Coupon_id != null,该记录表示领取优惠券但没有使用,即负样本;如果Date!=null & Coupon_id = null,则表示普通消费日期;如果Date!=null & Coupon_id != null,则表示用优惠券消费日期,即正样本; |
Table 2: 用户线上点击/消费和优惠券领取行为
Field | Description |
User_id | 用户ID |
Merchant_id | 商户ID |
Action | 0 点击, 1购买,2领取优惠券 |
Coupon_id | 优惠券ID:null表示无优惠券消费,此时Discount_rate和Date_received字段无意义。“fixed”表示该交易是限时低价活动。 |
Discount_rate | 优惠率:x \in [0,1]代表折扣率;x:y表示满x减y;“fixed”表示低价限时优惠; |
Date_received | 领取优惠券日期 |
Date | 消费日期:如果Date=null & Coupon_id != null,该记录表示领取优惠券但没有使用;如果Date!=null & Coupon_id = null,则表示普通消费日期;如果Date!=null & Coupon_id != null,则表示用优惠券消费日期; |
Table 3:用户O2O线下优惠券使用预测样本
Field | Description |
User_id | 用户ID |
Merchant_id | 商户ID |
Coupon_id | 优惠券ID |
Discount_rate | 优惠率:x \in [0,1]代表折扣率;x:y表示满x减y. |
Distance | user经常活动的地点离该merchant的最近门店距离是x*500米(如果是连锁店,则取最近的一家门店),x\in[0,10];null表示无此信息,0表示低于500米,10表示大于5公里; |
Date_received | 领取优惠券日期 |
Table 4:选手提交文件字段,其中user_id,coupon_id和date_received均来自Table 3,而Probability为预测值
Field | Description |
User_id | 用户ID |
Coupon_id | 优惠券ID |
Date_received | 领取优惠券日期 |
Probability | 15天内用券概率,由参赛选手给出 |
一:测试集中有重复的USER_ID, COUPON_ID, DATE_RECEIVED,怎么测评
答:三者都重复只计算一个,提交分数取max那个,真实label则取(1如果有至少一个被使用了)
二:为什么提交的结果AUC=0.5的原因说明
答:
1.没有严格按照user_id,coupon_id,date_received,probability这样4列提交。特别地,第二列是coupon_id不是merchant_id,
第二列是coupon_id不是merchant_id,第二列是coupon_id不是merchant_id;
2.分隔符使用了中文','而不是英文',';
3.提交的user_id,coupon_id,date_received与测试集不同(比如直接提交sample_submission.csv)
4. 预测结果的所有概率值相等
GBDT调参总结
https://www.cnblogs.com/nxf-rabbit75/p/10593524.html
GBDT+LR融合代码
https://www.cnblogs.com/vegbirds/p/12191016.html
特征说明
- **用户线下相关的特征**
- 用户领取优惠券次数
- 用户获得优惠券但没有消费的次数
- 用户获得优惠券并核销次数
- 用户领取优惠券后进行核销率
- 用户满200:20, 500:50 减的优惠券核销率
- 用户核销满200:20, 500:50减的优惠券占所有核销优惠券的比重
- 用户核销优惠券的平均/最低/最高消费折率
- 用户核销过优惠券的不同商家数量,及其占所有不同商家的比重
- 用户核销过的不同优惠券数量,及其占所有不同优惠券的比重
- 用户平均核销每个商家多少张优惠券
- 用户核销优惠券中的平均/最大/最小用户-商家距离
- **商家相关的特征**
- 商家优惠券被领取次数
- 商家优惠券被领取后不核销次数
- 商家优惠券被领取后核销次数
- 商家优惠券被领取后核销率
- 商家优惠券核销的平均/最小/最大消费折率
- 核销商家优惠券的不同用户数量,及其占领取不同的用户比重
- 商家优惠券平均每个用户核销多少张
- 商家被核销过的不同优惠券数量
- 商家被核销过的不同优惠券数量占所有领取过的不同优惠券数量的比重
- 商家平均每种优惠券核销多少张
- 商家被核销优惠券的平均时间率
- 商家被核销优惠券中的平均/最小/最大用户-商家距离
- **用户-商家交互特征**
- 用户领取商家的优惠券次数
- 用户领取商家的优惠券后不核销次数
- 用户领取商家的优惠券后核销次数
- 用户领取商家的优惠券后核销率
- 用户对每个商家的不核销次数占用户总的不核销次数的比重
- 用户对每个商家的优惠券核销次数占用户总的核销次数的比重
- 用户对每个商家的不核销次数占商家总的不核销次数的比重
- 用户对每个商家的优惠券核销次数占商家总的核销次数的比重
- **优惠券相关的特征**
- 优惠券类型(直接优惠为0, 满减为1)
- 优惠券折率
- 满减优惠券的最低消费
- 历史出现次数
- 历史核销次数
- 历史核销率
- 历史核销时间率
- 领取优惠券是一周的第几天
- 领取优惠券是一月的第几天
- 历史上用户领取该优惠券次数
- 历史上用户消费该优惠券次数
- 历史上用户对该优惠券的核销率
- **其它特征**
- 用户领取的所有优惠券数目
- 用户领取的特定优惠券数目
- 用户此次之后/前领取的所有优惠券数目
- 用户此次之后/前领取的特定优惠券数目
- 用户上/下一次领取的时间间隔
- 用户领取特定商家的优惠券数目
- 用户领取的不同商家数目
- 用户当天领取的优惠券数目
- 用户当天领取的特定优惠券数目
- 用户领取的所有优惠券种类数目
- 商家被领取的优惠券数目
- 商家被领取的特定优惠券数目
- 商家被多少不同用户领取的数目
- 商家发行的所有优惠券种类数目
- **用户线上相关的特征**
- 用户线上操作次数
- 用户线上点击率
- 用户线上购买率
- 用户线上领取率
- 用户线上不消费次数
- 用户线上优惠券核销次数
- 用户线上优惠券核销率
- 用户线下不消费次数占线上线下总的不消费次数的比重
- 用户线下的优惠券核销次数占线上线下总的优惠券核销次数的比重
- 用户线下领取的记录数量占总的记录数量的比重
特征分析
In [1]:
import pandas as pd
import numpy as np
from datetime import date
import datetime as dt
In [3]:
#keep_default_na=False 表示读取出来的是null , 如果不加就是缺失值nan
获取数据¶
In [2]:
off_train = pd.read_csv('./data/ccf_offline_stage1_train.csv', keep_default_na=False, header=0)
off_test = pd.read_csv('./data/ccf_offline_stage1_test_revised.csv',keep_default_na=False, header=0)
on_train = pd.read_csv('./data/ccf_online_stage1_train.csv', keep_default_na=False, header=0)
观察数据训练集和测试集的数据表特征
In [4]:
off_train.head()
Out[4]:
User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | Date | |
---|---|---|---|---|---|---|---|
0 | 1439408 | 2632 | null | null | 0 | null | 20160217 |
1 | 1439408 | 4663 | 11002 | 150:20 | 1 | 20160528 | null |
2 | 1439408 | 2632 | 8591 | 20:1 | 0 | 20160217 | null |
3 | 1439408 | 2632 | 1078 | 20:1 | 0 | 20160319 | null |
4 | 1439408 | 2632 | 8591 | 20:1 | 0 | 20160613 | null |
In [8]:
off_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754884 entries, 0 to 1754883
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Coupon_id object
Discount_rate object
Distance object
Date_received object
Date object
dtypes: int64(2), object(5)
memory usage: 93.7+ MB
In [5]:
off_test.head()
Out[5]:
User_id | Merchant_id | Coupon_id | Discount_rate | Distance | Date_received | |
---|---|---|---|---|---|---|
0 | 4129537 | 450 | 9983 | 30:5 | 1 | 20160712 |
1 | 6949378 | 1300 | 3429 | 30:5 | null | 20160706 |
2 | 2166529 | 7113 | 6928 | 200:20 | 5 | 20160727 |
3 | 2166529 | 7113 | 1808 | 100:10 | 5 | 20160727 |
4 | 6172162 | 7605 | 6500 | 30:1 | 2 | 20160708 |
In [9]:
off_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113640 entries, 0 to 113639
Data columns (total 6 columns):
User_id 113640 non-null int64
Merchant_id 113640 non-null int64
Coupon_id 113640 non-null int64
Discount_rate 113640 non-null object
Distance 113640 non-null object
Date_received 113640 non-null int64
dtypes: int64(4), object(2)
memory usage: 5.2+ MB
In [6]:
on_train.head()
Out[6]:
User_id | Merchant_id | Action | Coupon_id | Discount_rate | Date_received | Date | |
---|---|---|---|---|---|---|---|
0 | 13740231 | 18907 | 2 | 100017492 | 500:50 | 20160513 | null |
1 | 13740231 | 34805 | 1 | null | null | null | 20160321 |
2 | 14336199 | 18907 | 0 | null | null | null | 20160618 |
3 | 14336199 | 18907 | 0 | null | null | null | 20160618 |
4 | 14336199 | 18907 | 0 | null | null | null | 20160618 |
In [10]:
on_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429826 entries, 0 to 11429825
Data columns (total 7 columns):
User_id int64
Merchant_id int64
Action int64
Coupon_id object
Discount_rate object
Date_received object
Date object
dtypes: int64(3), object(4)
memory usage: 610.4+ MB
查看优惠券和购买商品不同组合的情况
In [11]:
print('有优惠券,购买商品条数', off_train[(off_train['Date_received'] != 'null') & (off_train['Date'] != 'null')].shape[0])
print('无优惠券,购买商品条数', off_train[(off_train['Date_received'] == 'null') & (off_train['Date'] != 'null')].shape[0])
print('有优惠券,不购买商品条数', off_train[(off_train['Date_received'] != 'null') & (off_train['Date'] == 'null')].shape[0])
print('无优惠券,不购买商品条数', off_train[(off_train['Date_received'] == 'null') & (off_train['Date'] == 'null')].shape[0])
有优惠券,购买商品条数 75382
无优惠券,购买商品条数 701602
有优惠券,不购买商品条数 977900
无优惠券,不购买商品条数 0
查看测试集和训练集中用户和商户的不同
In [19]:
# 在测试集中出现的用户但训练集没有出现 这个是新用户编号 有结果知道有两个新用户
print('1. User_id in training set but not in test set', set(off_test['User_id']) - set(off_train['User_id']))
# 在测试集中出现的商户但训练集没有出现 商户有一个是新的
print('2. Merchant_id in training set but not in test set', set(off_test['Merchant_id']) - set(off_train['Merchant_id']))
1. User_id in training set but not in test set {2495873, 1286474}
2. Merchant_id in training set but not in test set {5920}
查看数据类型
In [21]:
print('Discount_rate 类型:',off_train['Discount_rate'].unique())
print('Distance 类型:', off_train['Distance'].unique())
Discount_rate 类型: ['null' '150:20' '20:1' '200:20' '30:5' '50:10' '10:5' '100:10' '200:30'
'20:5' '30:10' '50:5' '150:10' '100:30' '200:50' '100:50' '300:30'
'50:20' '0.9' '10:1' '30:1' '0.95' '100:5' '5:1' '100:20' '0.8' '50:1'
'200:10' '300:20' '100:1' '150:30' '300:50' '20:10' '0.85' '0.6' '150:50'
'0.75' '0.5' '200:5' '0.7' '30:20' '300:10' '0.2' '50:30' '200:100'
'150:5']
Distance 类型: ['0' '1' 'null' '2' '10' '4' '7' '9' '3' '5' '6' '8']
更换特征名称
In [23]:
# 更换特征名称
off_train.columns=['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']
off_test.columns = ['user_id','merchant_id','coupon_id','discount_rate','distance','date_received']
on_train.columns = ['user_id','merchant_id','action','coupon_id','discount_rate','date_received','date']
数据分批处理
#将数据分为3个数据集 利用滑窗法
#将2016年1月1日到4月13日的数据提取特征,利用4月14日的到5月14日的作为测试集
#将2月1日到5月14日的作为数据集提取特征,利用5月15日6月15日的作为测试集
#将3月15日到6月30日作为数据集提取特征,再测试7月1日到7月31日的数据
交叉验证集一
In [216]:
# 将2016年1月1日到4月13日的数据提取特征
feature1 = off_train[(off_train.date >= '20160101') & (off_train.date <= '20160413') | (
(off_train.date == 'null') & (off_train.date_received >= '20160101') & (off_train.date_received <= '20160413'))]
# 利用4月14日的到5月14日的作为测试集
dataset1 = off_train[(off_train.date_received >= '201604014')
& (off_train.date_received <= '20160514')]
交叉验证集二
In [25]:
# 在2月1日到5月14日之间使用了券,只要领取时间在2月1日到5月14日之间,并包括没有数据中没有领取券的
feature2 = off_train[(off_train.date >= '20160201') & (off_train.date <= '20160514') | (
(off_train.date == 'null') & (off_train.date_received >= '20160201') & (off_train.date_received <= '20160514'))]
# 提取数据集2的测试集
dataset2 = off_train[(off_train.date_received >= '20160515')
& (off_train.date_received <= '20160615')]
测试集
In [26]:
# 数据集3的特征为 取数据中领券和用券日期大于3月15日和小于6月30日的
feature3 = off_train[((off_train.date >= '20160315') & (off_train.date <= '20160630')) | (
(off_train.date == 'null') & (off_train.date_received >= '20160315') & (off_train.date_received <= '20160630'))]
# 使数据集3等于test集 没有label标签
dataset3 = off_test
特征工程
其他特征处理
分析这段实际时间用户收取的所有的优惠券数目
In [45]:
t = dataset1[['user_id']]
# 相当于给原有数据加上一列,这个月用户收取的所有优惠券数目,并初始化为1
t['this_month_user_receive_all_coupon_count'] = 1
# reset_index()--drop:默认为False,不删除原来索引,如果为True,删除原来的索引值
t = t.groupby('user_id').agg('sum').reset_index()
分析用户收到相同优惠券的数量
In [218]:
dataset1.head()
Out[218]:
user_id | merchant_id | coupon_id | discount_rate | distance | date_received | date | |
---|---|---|---|---|---|---|---|
7 | 1832624 | 3381 | 7610 | 200:20 | 0 | 20160429 | null |
18 | 163606 | 1569 | 5054 | 200:30 | 10 | 20160421 | null |
20 | 94107 | 3381 | 7610 | 200:20 | 2 | 20160412 | null |
43 | 4061024 | 3381 | 7610 | 200:20 | 10 | 20160426 | null |
44 | 4061024 | 7555 | 9871 | 30:5 | 10 | 20160409 | null |
In [47]:
# 提取数据集1的优惠券Id和用户Id
t1 = dataset1[['user_id', 'coupon_id']]
# 提取这个月用户收到的相同的优惠券的数量
t1['this_month_user_receive_same_coupn_count'] = 1
t1 = t1.groupby(['user_id', 'coupon_id']).agg('sum').reset_index()
In [214]:
t1.head()
Out[214]:
user_id | merchant_id | user_merchant_received | |
---|---|---|---|
0 | 4 | 1433 | 1 |
1 | 35 | 3381 | 4 |
2 | 36 | 1041 | 1 |
3 | 36 | 5717 | 1 |
4 | 64 | 2146 | 1 |
分析用户接受优惠券的最大时间和最小时间
In [49]:
# 提取数据集1的用户id,优惠券id以及优惠券接收的时间
t2 = dataset1[['user_id', 'coupon_id', 'date_received']]
# 将数据转换为str类型
t2.date_received = t2.date_received.astype('str')
# 如果出现相同的用户接收相同的优惠券在接收时间上用‘:’连接上第n次接受优惠券的时间
t2 = t2.groupby(['user_id', 'coupon_id'])['date_received'].agg(
lambda x: ':'.join(x)).reset_index()
In [51]:
# 将接收时间的一组按着':'分开,这样就可以计算接受了优惠券的数量,apply是合并
t2['receive_number'] = t2.date_received.apply(lambda s: len(s.split(':')))
t2 = t2[t2.receive_number > 1]
In [57]:
# 最大接受的日期
t2['max_date_received'] = t2.date_received.apply(
lambda s: max([int(d) for d in s.split(':')]))
# 最小的接收日期
t2['min_date_received'] = t2.date_received.apply(
lambda s: min([int(d) for d in s.split(':')]))
In [59]:
t2 = t2[['user_id', 'coupon_id', 'max_date_received', 'min_date_received']]
In [65]:
t2.head()
Out[65]:
user_id | coupon_id | max_date_received | min_date_received | |
---|---|---|---|---|
51 | 2227 | 7088 | 20160511 | 20160426 |
198 | 8397 | 8556 | 20160506 | 20160425 |
218 | 9527 | 9416 | 20160409 | 20160406 |
220 | 9653 | 2109 | 20160506 | 20160424 |
224 | 9690 | 7224 | 20160419 | 20160417 |
分析用户使用的优惠券数量 一次还是多次
In [62]:
t3 = dataset1[['user_id', 'coupon_id', 'date_received']]
# 将两表融合只保留左表数据,这样得到的表,相当于保留了最近接收时间和最远接受时间
t3 = pd.merge(t3, t2, on=['user_id', 'coupon_id'], how='left')
In [67]:
t3.head()
Out[67]:
user_id | coupon_id | date_received | max_date_received | min_date_received | |
---|---|---|---|---|---|
0 | 1832624 | 7610 | 20160429 | NaN | NaN |
1 | 163606 | 5054 | 20160421 | NaN | NaN |
2 | 94107 | 7610 | 20160412 | NaN | NaN |
3 | 4061024 | 7610 | 20160426 | NaN | NaN |
4 | 4061024 | 9871 | 20160409 | NaN | NaN |
In [68]:
# 这个优惠券最近接受时间
t3['this_month_user_receive_same_coupon_lastone'] = t3.max_date_received - \
t3.date_received.astype(int)
# 这个优惠券最远接受时间
t3['this_month_user_receive_same_coupon_firstone'] = t3.date_received.astype(
int)-t3.min_date_received
In [72]:
t3.max_date_received.head()
Out[72]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: max_date_received, dtype: float64
In [70]:
t3.tail()
Out[70]:
user_id | coupon_id | date_received | max_date_received | min_date_received | this_month_user_receive_same_coupon_lastone | this_month_user_receive_same_coupon_firstone | |
---|---|---|---|---|---|---|---|
182073 | 457900 | 13118 | 20160418 | NaN | NaN | NaN | NaN |
182074 | 4056416 | 3469 | 20160514 | NaN | NaN | NaN | NaN |
182075 | 188086 | 4723 | 20160415 | NaN | NaN | NaN | NaN |
182076 | 212662 | 3739 | 20160504 | 20160508.0 | 20160504.0 | 4.0 | 0.0 |
182077 | 212662 | 3739 | 20160508 | 20160508.0 | 20160504.0 | 0.0 | 4.0 |
In [74]:
def is_firstlastone(x):
if x==0:
return 1
elif x>0:
return 0
else:
return -1 # 表明这个优惠券只接受了一次
In [75]:
t3.this_month_user_receive_same_coupon_lastone = t3.this_month_user_receive_same_coupon_lastone.apply(
is_firstlastone)
t3.this_month_user_receive_same_coupon_firstone = t3.this_month_user_receive_same_coupon_firstone.apply(
is_firstlastone)
In [76]:
t3.tail()
Out[76]:
user_id | coupon_id | date_received | max_date_received | min_date_received | this_month_user_receive_same_coupon_lastone | this_month_user_receive_same_coupon_firstone | |
---|---|---|---|---|---|---|---|
182073 | 457900 | 13118 | 20160418 | NaN | NaN | -1 | -1 |
182074 | 4056416 | 3469 | 20160514 | NaN | NaN | -1 | -1 |
182075 | 188086 | 4723 | 20160415 | NaN | NaN | -1 | -1 |
182076 | 212662 | 3739 | 20160504 | 20160508.0 | 20160504.0 | 0 | 1 |
182077 | 212662 | 3739 | 20160508 | 20160508.0 | 20160504.0 | 1 | 0 |
In [78]:
# 将表格中接收优惠券日期中为最近和最远的日期时置为1其余为0,若只接受了一次优惠券为-1
t3 = t3[['user_id', 'coupon_id', 'date_received', 'this_month_user_receive_same_coupon_lastone',
'this_month_user_receive_same_coupon_firstone']]
t3.head()
Out[78]:
user_id | coupon_id | date_received | this_month_user_receive_same_coupon_lastone | this_month_user_receive_same_coupon_firstone | |
---|---|---|---|---|---|
0 | 1832624 | 7610 | 20160429 | -1 | -1 |
1 | 163606 | 5054 | 20160421 | -1 | -1 |
2 | 94107 | 7610 | 20160412 | -1 | -1 |
3 | 4061024 | 7610 | 20160426 | -1 | -1 |
4 | 4061024 | 9871 | 20160409 | -1 | -1 |
分析用户所接收到的所有优惠券的数量
In [79]:
# 一个用户所接收到的所有优惠券的数量
t4 = dataset1[['user_id', 'date_received']]
t4['this_day_receive_all_coupon_count'] = 1
t4 = t4.groupby(['user_id', 'date_received']).agg('sum').reset_index()
In [81]:
t4.head()
Out[81]:
user_id | date_received | this_day_receive_all_coupon_count | |
---|---|---|---|
0 | 173 | 20160414 | 1 |
1 | 285 | 20160501 | 1 |
2 | 316 | 20160430 | 1 |
3 | 377 | 20160512 | 1 |
4 | 387 | 20160421 | 1 |
用户不同时间所接收到不同优惠券的数量
In [83]:
# 一个用户不同时间所接收到不同优惠券的数量
t5 = dataset1[['user_id', 'coupon_id', 'date_received']]
t5['this_day_user_receive_same_coupon_count'] = 1
t5 = t5.groupby(['user_id', 'coupon_id', 'date_received']
).agg('sum').reset_index()
In [84]:
t5.head()
Out[84]:
user_id | coupon_id | date_received | this_day_user_receive_same_coupon_count | |
---|---|---|---|---|
0 | 173 | 7610 | 20160414 | 1 |
1 | 285 | 1532 | 20160501 | 1 |
2 | 316 | 8952 | 20160430 | 1 |
3 | 377 | 2857 | 20160512 | 1 |
4 | 387 | 7610 | 20160421 | 1 |
一个用户不同优惠券 的接受时间
In [86]:
# 一个用户不同优惠券 的接受时间
t6 = dataset1[['user_id', 'coupon_id', 'date_received']]
t6.date_received = t6.date_received.astype('str')
t6 = t6.groupby(['user_id', 'coupon_id'])['date_received'].agg(
lambda x: ':'.join(x)).reset_index()
# 重命名inplace代表深拷贝
t6.rename(columns={'date_received': 'dates'}, inplace=True)
In [89]:
t6.tail(10)
Out[89]:
user_id | coupon_id | dates | |
---|---|---|---|
170756 | 7360488 | 7610 | 20160405 |
170757 | 7360582 | 7610 | 20160404 |
170758 | 7360602 | 10888 | 20160424:20160421:20160430 |
170759 | 7360603 | 3732 | 20160409 |
170760 | 7360608 | 11860 | 20160407 |
170761 | 7360608 | 4872 | 20160407 |
170762 | 7360845 | 3732 | 20160410 |
170763 | 7360845 | 5054 | 20160413 |
170764 | 7360863 | 3992 | 20160422 |
170765 | 7360928 | 5439 | 20160423 |
分析用户领取优惠券和不用优惠券之间的最小时间差
In [91]:
t7 = dataset1[['user_id', 'coupon_id', 'date_received']]
# 将t6和t7融合
t7 = pd.merge(t7, t6, on=['user_id', 'coupon_id'], how='left')
In [94]:
t7.tail()
Out[94]:
user_id | coupon_id | date_received | dates | |
---|---|---|---|---|
182073 | 457900 | 13118 | 20160418 | 20160418 |
182074 | 4056416 | 3469 | 20160514 | 20160514 |
182075 | 188086 | 4723 | 20160415 | 20160415 |
182076 | 212662 | 3739 | 20160504 | 20160504:20160508 |
182077 | 212662 | 3739 | 20160508 | 20160504:20160508 |
In [93]:
t7.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 182078 entries, 0 to 182077
Data columns (total 4 columns):
user_id 182078 non-null int64
coupon_id 182078 non-null object
date_received 182078 non-null object
dates 182078 non-null object
dtypes: int64(1), object(3)
memory usage: 6.9+ MB
In [95]:
# 注意这里所有的时间格式都已经是'str'格式
t7['date_received_date'] = t7.date_received.astype('str')+'-'+t7.dates
# print(t7)
In [97]:
t7.tail()
Out[97]:
user_id | coupon_id | date_received | dates | date_received_date | |
---|---|---|---|---|---|
182073 | 457900 | 13118 | 20160418 | 20160418 | 20160418-20160418 |
182074 | 4056416 | 3469 | 20160514 | 20160514 | 20160514-20160514 |
182075 | 188086 | 4723 | 20160415 | 20160415 | 20160415-20160415 |
182076 | 212662 | 3739 | 20160504 | 20160504:20160508 | 20160504-20160504:20160508 |
182077 | 212662 | 3739 | 20160508 | 20160504:20160508 | 20160508-20160504:20160508 |
In [98]:
# 计算领券日期-领取不同优惠券之间的时间差
def get_day_gap_before(s):
date_received, dates = s.split('-')
dates = dates.split(':')
gaps = []
for d in dates:
this_gap = (date(int(date_received[0:4]), int(date_received[4:6]), int(
date_received[6:8]))-date(int(d[0:4]), int(d[4:6]), int(d[6:8]))).days
if this_gap > 0:
gaps.append(this_gap)
if len(gaps) == 0:
return -1
else:
return min(gaps)
# 计算领取不用优惠券-领券日期的时间差
def get_day_gap_after(s):
date_received, dates = s.split('-')
dates = dates.split(':')
gaps = []
for d in dates:
this_gap = (date(int(d[0:4]), int(d[4:6]), int(d[6:8]))-date(
int(date_received[0:4]), int(date_received[4:6]), int(date_received[6:8]))).days
if this_gap > 0:
gaps.append(this_gap)
if len(gaps) == 0:
return -1
else:
return min(gaps)
In [99]:
t7['day_gap_before'] = t7.date_received_date.apply(get_day_gap_before)
t7['day_gap_after'] = t7.date_received_date.apply(get_day_gap_after)
t7 = t7[['user_id', 'coupon_id', 'date_received',
'day_gap_before', 'day_gap_after']]
In [101]:
t7.tail()
Out[101]:
user_id | coupon_id | date_received | day_gap_before | day_gap_after | |
---|---|---|---|---|---|
182073 | 457900 | 13118 | 20160418 | -1 | -1 |
182074 | 4056416 | 3469 | 20160514 | -1 | -1 |
182075 | 188086 | 4723 | 20160415 | -1 | -1 |
182076 | 212662 | 3739 | 20160504 | -1 | 4 |
182077 | 212662 | 3739 | 20160508 | 4 | -1 |
将所有特征融合在一张表中
In [102]:
# 将所有特征融合在一张表中
other_feature1 = pd.merge(t1, t, on='user_id')
other_feature1 = pd.merge(other_feature1, t3, on=['user_id', 'coupon_id'])
other_feature1 = pd.merge(other_feature1, t4, on=['user_id', 'date_received'])
other_feature1 = pd.merge(other_feature1, t5, on=[
'user_id', 'coupon_id', 'date_received'])
other_feature1 = pd.merge(other_feature1, t7, on=[
'user_id', 'coupon_id', 'date_received'])
other_feature1.to_csv('./other_feature1.csv',index=None)
In [104]:
other_feature1.head()
Out[104]:
user_id | coupon_id | this_month_user_receive_same_coupn_count | this_month_user_receive_all_coupon_count | date_received | this_month_user_receive_same_coupon_lastone | this_month_user_receive_same_coupon_firstone | this_day_receive_all_coupon_count | this_day_user_receive_same_coupon_count | day_gap_before | day_gap_after | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 173 | 7610 | 1 | 1 | 20160414 | -1 | -1 | 1 | 1 | -1 | -1 |
1 | 285 | 1532 | 1 | 1 | 20160501 | -1 | -1 | 1 | 1 | -1 | -1 |
2 | 316 | 8952 | 1 | 1 | 20160430 | -1 | -1 | 1 | 1 | -1 | -1 |
3 | 377 | 2857 | 1 | 1 | 20160512 | -1 | -1 | 1 | 1 | -1 | -1 |
4 | 387 | 7610 | 1 | 1 | 20160421 | -1 | -1 | 1 | 1 | -1 | -1 |
优惠券相关特征处理
In [107]:
# 计算折扣率
def calc_discount_rate(s):
s = str(s)
s = s.split(':')
if len(s) == 1:
return float(s[0])
else:
return 1.0-float(s[1])/float(s[0])
def get_discount_man(s):
s = str(s)
s = s.split(':')
if len(s)==1:
return 'null'
else:
return int(s[0])
def get_discount_jian(s):
s = str(s)
s = s.split(':')
if len(s) == 1:
return 'null'
else:
return int(s[1])
def is_man_jian(s):
s = str(s)
s = s.split(':')
if len(s)==1:
return 0
else:
return 1
In [118]:
# 将时间转化为第几周
# 显示时间是第几周
# tt是获取到的特征中消费的最大时间
tt = feature1[feature1.date != 'null'].date.unique().max()
dataset1['day_of_week'] = dataset1.date_received.astype('str').apply(
lambda x: date(int(x[0:4]), int(x[4:6]), int(x[6:8])).weekday()+1)
# 显示时间是月中的第几天
dataset1['day_of_month'] = dataset1.date_received.astype(
'str').apply(lambda x: int(x[6:8]))
In [129]:
# 显示领取优惠券时期和截止日之间的天数
dataset1['days_distance'] = dataset1.date_received.astype('str').apply(
lambda x: (date(int(x[0:4]), int(x[4:6]), int(x[6:8]))-date(int(tt[0:4]), int(tt[4:6]), int(tt[6:8]))).days)
# 显示满了多少钱后开始减
dataset1['discount_man'] = dataset1.discount_rate.apply(get_discount_man)
# 显示满减的减少的钱
dataset1['discount_jian'] = dataset1.discount_rate.apply(get_discount_jian)
# 返回优惠券是否是满减券
dataset1['is_man_jian'] = dataset1.discount_rate.apply(is_man_jian)
# 显示打折力度
dataset1['discount_rate'] = dataset1.discount_rate.apply(
calc_discount_rate)
In [131]:
dataset1.head()
Out[131]:
user_id | merchant_id | coupon_id | discount_rate | distance | date_received | date | day_of_week | day_of_month | days_distance | discount_man | discount_jian | is_man_jian | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 1832624 | 3381 | 7610 | 0.900000 | 0 | 20160429 | null | 5 | 29 | 16 | 200 | 20 | 1 |
18 | 163606 | 1569 | 5054 | 0.850000 | 10 | 20160421 | null | 4 | 21 | 8 | 200 | 30 | 1 |
20 | 94107 | 3381 | 7610 | 0.900000 | 2 | 20160412 | null | 2 | 12 | -1 | 200 | 20 | 1 |
43 | 4061024 | 3381 | 7610 | 0.900000 | 10 | 20160426 | null | 2 | 26 | 13 | 200 | 20 | 1 |
44 | 4061024 | 7555 | 9871 | 0.833333 | 10 | 20160409 | null | 6 | 9 | -4 | 30 | 5 | 1 |
In [193]:
d = dataset1[['coupon_id']]
d['coupon_count'] = 1
# 显示每一种优惠券的数量
d = d.groupby('coupon_id').agg('sum').reset_index()
dataset1 = pd.merge(dataset1, d, on='coupon_id', how='left')
dataset1.to_csv('./coupon1_feature.csv',index=None)
商户特征处理
In [134]:
merchant1 = feature1[['merchant_id','coupon_id','distance','date_received','date']]
t = merchant1[['merchant_id']]
#删除重复行数据
t.drop_duplicates(inplace=True)
显示每个商品的销售数量
In [135]:
#显示卖出的商品
t1 = merchant1[merchant1.date!='null'][['merchant_id']]
t1['total_sales'] = 1
#显示每个商品的销售数量
t1 = t1.groupby('merchant_id').agg('sum').reset_index()
显示使用了优惠券消费的商品,正样本
In [136]:
#显示使用了优惠券消费的商品,正样本
t2 = merchant1[(merchant1.date!='null')&(merchant1.coupon_id!='null')][['merchant_id']]
t2['sales_use_coupon'] = 1
t2 = t2.groupby('merchant_id').agg('sum').reset_index()
In [138]:
t2.head()
Out[138]:
merchant_id | sales_use_coupon | |
---|---|---|
0 | 13 | 1 |
1 | 14 | 1 |
2 | 15 | 11 |
3 | 17 | 2 |
4 | 20 | 1 |
显示商品的优惠券的总数量
In [140]:
#显示了商品的优惠券的总数量
t3 = merchant1[merchant1.coupon_id != 'null'][['merchant_id']]
t3 ['total_coupon'] = 1
t3 = t3.groupby('merchant_id').agg('sum').reset_index()
In [141]:
t3.head()
Out[141]:
merchant_id | total_coupon | |
---|---|---|
0 | 8 | 2 |
1 | 13 | 3 |
2 | 14 | 10 |
3 | 15 | 79 |
4 | 17 | 26 |
显示商品销量和距离的关系
In [142]:
#显示商品销量和距离的关系
t4 = merchant1[(merchant1.date != 'null')&(merchant1.coupon_id != 'null')][['merchant_id','distance']]
#把数据中的null值全部替换为-1
t4.replace('null',-1,inplace=True)
t4.distance = t4.distance.astype('int')
#再把数据中的-1全部替换为NaN
t4.replace(-1,np.nan,inplace=True)
显示用户离商品的最小距离
In [143]:
#返回用户离商品的距离最小值
t5 = t4.groupby('merchant_id').agg('min').reset_index()
t5.rename(columns={'distance':'merchant_min_distance'},inplace = True)
显示用户离商品的最大距离
In [144]:
#返回用户离商品的距离最大值
t6 = t4.groupby('merchant_id').agg('max').reset_index()
t6.rename(columns={'distance':'merchant_max_distance'},inplace = True)
显示距离的平均值
In [145]:
#返回距离的平均值 和直接使用聚合函数没什么区别
t7 = t4.groupby('merchant_id').agg('mean').reset_index()
t7.rename(columns = {'distance':'merchant_mean_distance'},inplace= True)
In [150]:
t7.head()
Out[150]:
merchant_id | merchant_mean_distance | |
---|---|---|
0 | 13 | 0.000000 |
1 | 14 | 0.000000 |
2 | 15 | 1.727273 |
3 | 17 | 0.500000 |
4 | 20 | 0.000000 |
显示距离的中位置
In [151]:
#返回距离的中位值
t8 = t4.groupby('merchant_id').agg('median').reset_index()
t8.rename(columns={'distance':'merchant_median_distance'},inplace = True)
合并处理后的特征表
In [152]:
merchant1_feature = pd.merge(t,t1,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t2,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t3,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t5,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t6,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t7,on='merchant_id',how='left')
merchant1_feature = pd.merge(merchant1_feature,t8,on='merchant_id',how='left')
In [154]:
merchant1_feature.head()
Out[154]:
merchant_id | total_sales | sales_use_coupon | total_coupon | merchant_min_distance | merchant_max_distance | merchant_mean_distance | merchant_median_distance | |
---|---|---|---|---|---|---|---|---|
0 | 2632 | 14.0 | 1.0 | 28.0 | 1.0 | 1.0 | 1.000000 | 1.0 |
1 | 3381 | 14962.0 | 2168.0 | 100369.0 | 0.0 | 10.0 | 1.650457 | 1.0 |
2 | 2099 | 5783.0 | 1705.0 | 16824.0 | 0.0 | 10.0 | 0.968072 | 0.0 |
3 | 4833 | 522.0 | 116.0 | 8321.0 | 0.0 | 10.0 | 3.037736 | 2.0 |
4 | 8390 | 1056.0 | 133.0 | 690.0 | 0.0 | 10.0 | 0.865079 | 0.0 |
In [155]:
#将数据中的NaN用0来替换
merchant1_feature.sales_use_coupon = merchant1_feature.sales_use_coupon.replace(np.nan,0)
#即优惠券的使用率
merchant1_feature['merchant_coupon_transfer_rate'] = merchant1_feature.sales_use_coupon.astype('float')/merchant1_feature.total_coupon
#即卖出商品中使用优惠券的占比
merchant1_feature['coupon_rate'] = merchant1_feature.sales_use_coupon.astype('float') / merchant1_feature.total_sales
#将数据中的NaN用0来替换
merchant1_feature.total_coupon = merchant1_feature.total_coupon.replace(np.nan,0)
保存处理结果
In [156]:
merchant1_feature.to_csv('./merchant1_feature.csv',index=None)
用户的相关特征处理
获取处理的特征项
In [158]:
user1 = feature1[['user_id','merchant_id','coupon_id','discount_rate','distance','date_received','date']]
t = user1[['user_id']]
#去掉数据中重复的用户Id
t.drop_duplicates(inplace=True)
统计用户买的商品种类
In [159]:
#用户购买商品的种类数
t1 = user1[user1.date!='null'][['user_id','merchant_id']]
#同样去掉重复用的用户id和商品id
t1.drop_duplicates(inplace=True)
t1.merchant_id = 1
t1 = t1.groupby('user_id').agg('sum').reset_index()
t1.rename(columns={'merchant_id':'count_merchant'},inplace=True)
统计使用优惠券购买商品的用户和到店的距离
In [160]:
#使用了优惠券购买商品的用户id和距离
t2 = user1[(user1.date!='null')&(user1.coupon_id!='null')][['user_id','distance']]
#将null值替换为-1
t2.replace('null',-1,inplace=True)
t2.distance = t2.distance.astype('int')#转换数据类型为int
t2.replace(-1,np.nan,inplace=True)
In [161]:
t2.head()
Out[161]:
user_id | distance | |
---|---|---|
38 | 2881376 | 0.0 |
75 | 114747 | 0.0 |
76 | 114747 | 0.0 |
77 | 114747 | 0.0 |
84 | 114747 | 0.0 |
统计使用优惠券购买商品的用户离店铺的最短距离
In [162]:
#得到使用优惠券购买商品的用户离店铺的最短距离
t3 = t2.groupby('user_id').agg('min').reset_index()
t3.rename(columns={'distance':'user_min_distance'},inplace=True)
统计使用优惠券购买商品的用户离店铺的最大距离
In [163]:
#得到最大距离
t4 = t2.groupby('user_id').agg('max').reset_index()
t4.rename(columns={'distance':'user_max_distance'},inplace=True)
统计使用优惠券购买商品的用户离店铺的平均距离
In [164]:
#得到平均距离
t5 = t2.groupby('user_id').agg('mean').reset_index()
t5.rename(columns={'distance':'user_mean_distance'},inplace=True)
统计使用优惠券购买商品的用户离店铺的中间距离
In [165]:
#得到中间距离
t6 = t2.groupby('user_id').agg('median').reset_index()
t6.rename(columns={'distance':'user_median_distance'},inplace=True)
统计每个用户使用优惠券购买的物品数量
In [167]:
#每个用户使用优惠券购买的物品数量
t7 = user1[(user1.date != 'null')&(user1.coupon_id != 'null')][['user_id']]
t7['buy_use_coupon'] = 1
t7 = t7.groupby('user_id').agg('sum').reset_index()
统计用户购买的物品总数
In [168]:
#购买物品的总数
t8 = user1[user1.date != 'null'][['user_id']]
t8['buy_total'] = 1
t8 = t8.groupby('user_id').agg('sum').reset_index()
统计用户收到的优惠券数量
In [170]:
#接受的优惠券的总数
t9 = user1[user1.coupon_id != 'null'][['user_id']]
t9['coupon_received'] = 1
t9 = t9.groupby('user_id').agg('sum').reset_index()
统计用户接受到优惠券的日期和使用之间的间隔
In [171]:
def get_user_date_datereceived_gap(s):
s = s.split(':')
return (date(int(s[0][0:4]), int(s[0][4:6]), int(s[0][6:8])) - date(int(s[1][0:4]), int(s[1][4:6]), int(s[1][6:8]))).days
In [172]:
#接受到优惠券的日期和使用之间的间隔***
t10 = user1[(user1.date_received != 'null')&(user1.date != 'null')][['user_id','date_received','date']]
t10['user_date_datereceived_gap'] = t10.date + ':'+ t10.date_received
t10.user_date_datereceived_gap = t10.user_date_datereceived_gap.apply(get_user_date_datereceived_gap)
t10 = t10[['user_id','user_date_datereceived_gap']]
In [174]:
t10.head()
Out[174]:
user_id | user_date_datereceived_gap | |
---|---|---|
38 | 2881376 | 8 |
75 | 114747 | 25 |
76 | 114747 | 11 |
77 | 114747 | 1 |
84 | 114747 | 7 |
统计用户优惠券使用时间的间隔取平均数
In [175]:
#将用户优惠券使用时间的间隔取平均数
t11 = t10.groupby('user_id').agg('mean').reset_index()
t11.rename(columns={'user_date_datereceived_gap':'avg_user_date_datereceived_gap'},inplace=True)
优惠券使用的间隔天数最小值
In [177]:
#间隔天数的最小值
t12 = t10.groupby('user_id').agg('min').reset_index()
t12.rename(columns={'user_date_datereceived_gap':'min_user_date_datereceived_gap'},inplace=True)
优惠券使用的间隔天数最大值
In [178]:
#间隔天数的最大值
t13 = t10.groupby('user_id').agg('max').reset_index()
t13.rename(columns={'user_date_datareceived_gap':'max_user_date_datereceived_gap'},inplace=True)
合并特征表
In [180]:
user1_feature = pd.merge(t,t1,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t3,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t4,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t5,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t6,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t7,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t8,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t9,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t11,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t12,on='user_id',how='left')
user1_feature = pd.merge(user1_feature,t13,on='user_id',how='left')
处理特征表中的缺失值
In [181]:
user1_feature.count_merchant = user1_feature.count_merchant.replace(np.nan,0)
user1_feature.buy_use_coupon = user1_feature.buy_use_coupon.replace(np.nan,0)
user1_feature['buy_use_coupon_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.buy_total.astype('float')
user1_feature['user_coupon_transfer_rate'] = user1_feature.buy_use_coupon.astype('float') / user1_feature.coupon_received.astype('float')
user1_feature.buy_total = user1_feature.buy_total.replace(np.nan,0)
user1_feature.coupon_received = user1_feature.coupon_received.replace(np.nan,0)
user1_feature.to_csv('./user1_feature.csv',index=None)
用户和商家之间的特征处理
获取用户和商家去重数量
In [182]:
all_user_merchant = feature1[['user_id','merchant_id']]
all_user_merchant.drop_duplicates(inplace=True)
统计用户买了同一家商户的商品有多少
In [183]:
#只保留销售了商品的商户id
t = feature1[['user_id','merchant_id','date']]
t = t[t.date!='null'][['user_id','merchant_id']]
#用户一共买了这家商户的多少商品
t['user_merchant_buy_total'] = 1
t = t.groupby(['user_id','merchant_id']).agg('sum').reset_index()
t.drop_duplicates(inplace=True)
统计用户一共收到一个商户的多少优惠券
In [184]:
t1 = feature1[['user_id','merchant_id','coupon_id']]
t1 = t1[t1.coupon_id!='null'][['user_id','merchant_id']]
t1['user_merchant_received'] = 1
t1 = t1.groupby(['user_id','merchant_id']).agg('sum').reset_index()
t1.drop_duplicates(inplace = True)
用户在一家商户中使用优惠券购买的商品的数目
In [185]:
t2 = feature1[['user_id','merchant_id','date','date_received']]
t2 = t2[(t2.date!='null')&(t2.date_received!='null')][['user_id','merchant_id']]
t2['user_merchant_buy_use_coupon'] = 1
t2 = t2.groupby(['user_id','merchant_id']).agg('sum').reset_index()
t2.drop_duplicates(inplace = True)
统计用户在一家商家的所有记录的总数
In [187]:
t3 = feature1[['user_id','merchant_id']]
t3['user_merchant_any'] = 1
t3 = t3.groupby(['user_id','merchant_id']).agg('sum').reset_index()
t3.drop_duplicates(inplace = True)
统计用户没有使用优惠券购买的商品的数目
In [189]:
t4 = feature1[['user_id','merchant_id','date','coupon_id']]
t4 = t4[(t4.date!='null')&(t4.coupon_id=='null')][['user_id','merchant_id']]
t4['user_merchant_buy_common'] = 1
t4 = t4.groupby(['user_id','merchant_id']).agg('sum').reset_index()
t4.drop_duplicates(inplace = True)
合并用户和商家的特征处理表
In [190]:
user_merchant1 = pd.merge(all_user_merchant,t,on=['user_id','merchant_id'],how='left')
user_merchant1 = pd.merge(user_merchant1,t1,on=['user_id','merchant_id'],how='left')
user_merchant1 = pd.merge(user_merchant1,t2,on=['user_id','merchant_id'],how='left')
user_merchant1 = pd.merge(user_merchant1,t3,on=['user_id','merchant_id'],how='left')
user_merchant1 = pd.merge(user_merchant1,t4,on=['user_id','merchant_id'],how='left')
统计用户和商家之间的优惠券的使用概率, 缺失值的处理
In [191]:
# 都是针对一家商户和一个用户
user_merchant1.user_merchant_buy_use_coupon = user_merchant1.user_merchant_buy_use_coupon.replace(
np.nan, 0)
user_merchant1.user_merchant_buy_common = user_merchant1.user_merchant_buy_common.replace(
np.nan, 0)
# 优惠券的转换率,用户使用了的优惠券/一共收到的优惠券
user_merchant1['user_merchant_coupon_transfer_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype(
'float') / user_merchant1.user_merchant_received.astype('float')
# 用户使用优惠券的概率,在一家商户使用优惠券购买的商品/在一家商户购买商品的总数
user_merchant1['user_merchant_coupon_buy_rate'] = user_merchant1.user_merchant_buy_use_coupon.astype(
'float') / user_merchant1.user_merchant_buy_total.astype('float')
# 用户在商户消费的概率 用户在商户购买的总数/在一家商户浏览的总次数
user_merchant1['user_merchant_rate'] = user_merchant1.user_merchant_buy_total.astype(
'float') / user_merchant1.user_merchant_any.astype('float')
# 用户在一家商户不适用优惠券购买的概率 普通购买的商品数/购买商品的总数
user_merchant1['user_merchant_common_buy_rate'] = user_merchant1.user_merchant_buy_common.astype(
'float') / user_merchant1.user_merchant_buy_total.astype('float')
user_merchant1.to_csv('./user_merchant1.csv', index=None)
训练集和测试集的再次处理
获取之前保存的处理完成的特征
In [203]:
coupon1 = pd.read_csv('./coupon1_feature.csv', keep_default_na=False)
merchant1 = pd.read_csv('./merchant1_feature.csv', keep_default_na=False)
user1 = pd.read_csv('./user1_feature.csv', keep_default_na=False)
user_merchant1 = pd.read_csv('./user_merchant1.csv', keep_default_na=False)
other_feature1 = pd.read_csv('./other_feature1.csv', keep_default_na=False)
dataset1 = pd.merge(coupon1,merchant1,on='merchant_id',how='left')
dataset1 = pd.merge(dataset1,user1,on='user_id',how='left')
dataset1 = pd.merge(dataset1,user_merchant1,on=['user_id','merchant_id'],how='left')
dataset1 = pd.merge(dataset1,other_feature1,on=['user_id','coupon_id','date_received'],how='left')
dataset1.drop_duplicates(inplace=True)
In [204]:
dataset1.shape
Out[204]:
(181040, 55)
添加is_weeekend特征
In [205]:
dataset1.user_merchant_buy_total = dataset1.user_merchant_buy_total.replace(np.nan,0)
dataset1.user_merchant_any = dataset1.user_merchant_any.replace(np.nan,0)
dataset1.user_merchant_received = dataset1.user_merchant_received.replace(np.nan,0)
dataset1['is_weekend'] = dataset1.day_of_week.apply(lambda x:1 if x in (6,7) else 0)
# 使用get_dummies进行one_hot编码
weekday_dummies = pd.get_dummies(dataset1.day_of_week)
weekday_dummies.columns = ['weekday'+str(i+1) for i in range(weekday_dummies.shape[1])]
#one-hot编码与原数据合并
dataset1 = pd.concat([dataset1,weekday_dummies],axis=1)
In [206]:
def get_label(s):
s = s.split(':')
if s[0]=='null':
return 0
elif (date(int(s[0][0:4]),int(s[0][4:6]),int(s[0][6:8]))-date(int(s[1][0:4]),int(s[1][4:6]),int(s[1][6:8]))).days<=15:
return 1
else:
return -1
In [209]:
dataset1.head()
Out[209]:
user_id | coupon_id | discount_rate | distance | day_of_month | days_distance | discount_man | discount_jian | is_man_jian | coupon_count_x | ... | day_gap_after | is_weekend | weekday1 | weekday2 | weekday3 | weekday4 | weekday5 | weekday6 | weekday7 | label | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1832624 | 7610 | 0.900000 | 0 | 29 | 16 | 200 | 20 | 1 | 37647 | ... | -1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 163606 | 5054 | 0.850000 | 10 | 21 | 8 | 200 | 30 | 1 | 19874 | ... | -1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 94107 | 7610 | 0.900000 | 2 | 12 | -1 | 200 | 20 | 1 | 37647 | ... | -1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 4061024 | 7610 | 0.900000 | 10 | 26 | 13 | 200 | 20 | 1 | 37647 | ... | -1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 4061024 | 9871 | 0.833333 | 10 | 9 | -4 | 30 | 5 | 1 | 12674 | ... | -1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 59 columns
处理标签值
In [208]:
dataset1['label'] = dataset1.date.astype(
'str') + ':' + dataset1.date_received.astype('str')
dataset1.label = dataset1.label.apply(get_label)
dataset1.drop(['merchant_id', 'day_of_week', 'date', 'date_received',
'coupon_count'], axis=1, inplace=True)
In [211]:
# 所有的表都要一起处理null
dataset1 = dataset1.replace('null', np.nan)
In [212]:
dataset1.to_csv('./dataset1.csv',index=None)
In [ ]:
# 1、深入理解数据分布,尤其是测试目标的数据分布,是成功的第一步
# 2、特征决定上限,调参只是帮助你逼近这个上限而已
# 3、多与人交流,自己埋头苦干很容易跳进坑里一辈子出不来