目录
一、项目背景与目标
O2O行业关联数亿消费者,各类APP每天记录了超过百亿条用户行为和位置记录,因而成为大数据科研和商业化运营的最佳结合点之一。以优惠券盘活老用户或吸引新客户进店消费是O2O的一种重要营销方式。然而随机投放的优惠券对多数用户造成无意义的干扰。对商家而言,滥发的优惠券可能降低品牌声誉,同时难以估算营销成本。个性化投放是提高优惠券核销率的重要技术,它能让具有一定偏好的消费者得到真正的实惠,同时赋予商家更强的营销能力。
利用用户在2016年1月1日至2016年6月30日之间真实线下消费行为数据,预测用户在2016年7月领取优惠券后15天以内的使用情况。
预测前的进行了初步数据分析:O2O优惠券数据分析报告
二、数据描述
编程语言:Python
数据来源:https://tianchi.aliyun.com/competition/entrance/231593/information
数据字段:
三、问题分析
问题一:预测数据集特征只有6个,如何全面构造特征工程来表达样本?
问题二:如何划分数据集,利用历史数据预测未来数据?
历史数据——>提取特征——>代表一种习惯或者固有惯性——>不易改变
因此可以利用7月份(待预测)的前几个月数据集——>提取固有特征——>基于固有特征进行预测
数据划分:
- 2016.01.01-2016.04.30预测→2016.05.01-2016.05.31(数据集1)
- 2016.02.01-2016.05.31预测→2016.06.01-2016.06.30(数据集2)
- 2016.03.01-2016.06.30预测→2016.07.01-2016.07.31(待预测数据集)
实现过程:
- 对2016.01.01-2016.04.30的数据提取特征
- 将提取的特征应用于2016.05.01-2016.05.31数据中,另外两组同理
- 将处理后的2016.05.01-2016.05.31和2016.06.01-2016.06.30的数据合并为一个数据集
- 将合并后的数据集划分train、test进行模型训练
- 将训练好的模型用于预测2016.07.01-2016.07.31。
四、数据探索与预处理
import pandas as pd
import numpy as np
data_off = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_train.csv")
off_test = pd.read_csv("/项目准备/O2O优惠券使用预测/offline_test.csv")
off_test1 = off_test
off_test.head()
data_off.shape
data_off.info()
data_off.describe()
# 消费日期的最大最小值
# 领券日期的最大最小值
print(data_off['Date'].max(),data_off['Date'].min())
print(data_off['Date_received'].max(),data_off['Date_received'].min())
输出:
20160630.0 20160101.0
20160615.0 20160101.0
# 缺失值
data_off.isnull().sum()
# 没有优惠券时coupon_id,字段discount_rate和date_received也同时没有
nan1 = data_off["Discount_rate"].isnull()
nan2 = data_off['Date_received'].isnull()
nan3 = data_off['Coupon_id'].isnull()
np.all(nan1==nan2),np.all(nan1==nan3)
输出:
(True, True)
# 删除重复值
data_off.drop_duplicates(inplace=True)
data_off.info()
# 将日期float64类型转换为日期类型
data_off['Date'] = pd.to_datetime(data_off['Date'],format='%Y%m%d')
data_off['Date_received'] = pd.to_datetime(data_off['Date_received'],format='%Y%m%d')
off_test['Date_received'] = pd.to_datetime(off_test['Date_received'],format='%Y%m%d')
data_off.info()
五、特征工程(构造特征)
5.1 特征构造-整体数据
5.1.1 时间特征
# 从领券到消费的天数
date_interval = data_off['Date']-data_off['Date_received']
data_off['date_interval'] = [d.days for d in date_interval]
#领券日期是周几
data_off['receive_week']=[d.weekday()+1 for d in data_off['Date_received']]
off_test['receive_week']=[d.weekday()+1 for d in off_test['Date_received']]
#优惠券领取时间是否是周末
data_off['receive_isWeekend']=data_off['receive_week'].apply(lambda x:1 if x>5 else 0)
off_test['receive_isWeekend']=off_test['receive_week'].apply(lambda x:1 if x>5 else 0)
5.1.2 优惠券特征
# 折扣率
def deal_rate(x):
if pd.isna(x):
y =float(x)
elif ":" in x:
a = float(x.split(":")[0])# 分母
b = a-float(x.split(":")[1])# 分子
y = np.round(b/a,2)
else:
y = float(x)
return y
data_off['Discount_rate_%'] = data_off['Discount_rate'].map(deal_rate)
off_test['Discount_rate_%'] = off_test['Discount_rate'].map(deal_rate)
# 门槛
def deal_mk(x):
if pd.isna(x):# nan
y =float(x)
elif ":" in x:# 满减券
y = int(x.split(":")[0])# 分母
else:# 打折券
y = np.nan
return y
data_off['Discount_rate_mk'] = data_off['Discount_rate'].apply(deal_mk,1)
off_test['Discount_rate_mk'] = off_test['Discount_rate'].apply(deal_mk,1)
data_off.head()
5.1.3 预测目标值构造
data_off['Y'] = data_off['date_interval'].apply(lambda x:1 if x<=15 else 0)
data_off.head()
5.2 数据划分-时间滑窗
feature1=data_off[((data_off['Date_received']>='2016-01-01')&(data_off['Date_received']<='2016-04-30')) | ((data_off['Date']>='2016-01-01')&(data_off['Date']<='2016-04-30'))]
feature1.reset_index(drop=True,inplace=True)
database1=data_off[((data_off['Date_received']>='2016-05-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-05-01')&(data_off['Date']<='2016-05-31'))]
database1.reset_index(drop=True,inplace=True)
print(' 1-4月数据总计%i行'%len(feature1))
print(' 5月数据总计%i行'%len(database1))
feature2=data_off[((data_off['Date_received']>='2016-02-01')&(data_off['Date_received']<='2016-05-31')) | ((data_off['Date']>='2016-02-01')&(data_off['Date']<='2016-05-31'))]
feature2.reset_index(drop=True,inplace=True)
database2=data_off[((data_off['Date_received']>='2016-06-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-06-01')&(data_off['Date']<='2016-06-30'))]
database2.reset_index(drop=True,inplace=True)
print(' 2-5月数据总计%i行'%len(feature2))
print(' 6月数据总计%i行'%len(database2))
feature3=data_off[((data_off['Date_received']>='2016-03-01')&(data_off['Date_received']<='2016-06-30')) | ((data_off['Date']>='2016-03-01')&(data_off['Date']<='2016-06-30'))]
feature3.reset_index(drop=True,inplace=True)
database3=off_test
print(' 3-5月数据总计%i行'%len(feature3))
print(' 7月数据总计%i行'%len(database3))
5.3 特征构造-滑窗数据
对每个划分后的数据集分别进行指标提取
5.3.1 用户特征
def user_feature(feature):
all_users = feature['User_id']
users = all_users.drop_duplicates()
# 1.用户消费次数(不对商家去重)
users_goods = feature[pd.notna(feature.Date)][['User_id','Merchant_id']]
users_goods['Merchant_id']=1
users_goods_nums = users_goods.groupby(by = 'User_id').sum('Merchant_id')
users_goods_nums.columns=['buy_num']
users = pd.merge(users,users_goods_nums,on='User_id',how = 'left')
# 2.每个用户的领券次数
Coupon = feature[pd.notna(feature['Coupon_id'])][['User_id','Coupon_id']]
Coupon['Coupon_id'] = 1
Coupon_num = Coupon.groupby(by='User_id').sum('Coupon_id')
Coupon_num.columns = ['Coupon_get_num']
users = pd.merge(users,Coupon_num,on='User_id',how='left')
users['Coupon_get_num']=users['Coupon_get_num'].replace(np.nan,0)
# 3.用户领券消费次数
Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))][['User_id','Coupon_id']]
Used_Coupon['Coupon_id'] = 1
Used_Coupon_num = Used_Coupon.groupby(by='User_id').sum('Coupon_id')
Used_Coupon_num.columns = ['Coupon_use_num']
users = pd.merge(users,Used_Coupon_num,on='User_id',how='left')
users['Coupon_use_num']=users['Coupon_use_num'].replace(np.nan,0)
# 4.用户用券购买概率
users['yqgmgl'] = users['Coupon_use_num']/users['buy_num']
# 5.用户核销率
users['Coupon_use_rate'] = users['Coupon_use_num']/users['Coupon_get_num']
# 6.每个用户15天内核销优惠券的张数
Used_Coupon = feature[(pd.notna(feature['Date']))&(pd.notna(feature['Date_received']))&(feature['date_interval']<=15)][['User_id','Coupon_id']]
Used_Coupon['Coupon_id'] = 1
Used_Coupon_num15 = Used_Coupon.groupby(by='User_id').sum('Coupon_id')
Used_Coupon_num15.columns = ['Coupon_use_num15']
users =