Competition Review
Problem Description
The competition provides roughly 40,000 desensitized, sampled user consumption behavior records; the task is to predict whether a user will purchase and whether they will favorite a given product within a future time window.
Data Background
The dataset consists of desensitized, sampled consumption behavior data from a website. It covers three broad categories of information: basic user information, basic product information, and user behavior information. Apart from the user id, all user fields are desensitized; apart from the seller id and product id, all product fields are desensitized. The behavior information includes two explicit user actions (favorite and purchase); everything else is desensitized behavior data.
Field List
Dates are renumbered consecutively: the first day is 01, the second 02, and so on. None of the features (beyond those explained above) has a stated meaning.
| Column | Type | Description | Example |
| --- | --- | --- | --- |
| user_id | String | Unique user identifier (desensitized) | abc123 |
| Product_id | String | Product id (desensitized) | abc123 |
| seller | String | Seller id (desensitized) | abc123 |
| day | String | Date | 1, 2, ..., 30 |
| action_type | Int | Behavior-related variable (desensitized) | 1 |
| ProductInfo_X | Int | Product-related variable | 1, 50, 100 |
| WebInfo_X | Int | Web-behavior-related variable | 1, 50, 100 |
| UserInfo_X | Int | User-related variable | 1, 50, 100 |
| purchase | Int | Whether the user purchased | 0, 1 |
| favorite | Int | Whether the user favorited | 0, 1 |
Evaluation Metric
1. The evaluation metric is AUC (Area Under the ROC Curve); an AUC is computed separately for the purchase predictions and the favorite predictions.
The final score is AUC* = (AUC(purchase) + AUC(favorite)) / 2.
2. AUC* is the contestant's score: the larger AUC*, the better the result and the higher the team ranks.
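As a quick illustration (not code from the original write-up; the label and prediction arrays below are hypothetical names), AUC* can be reproduced with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical arrays: true 0/1 labels and predicted scores for the two targets
auc_purchase = roc_auc_score(y_true_purchase, pred_purchase)
auc_favorite = roc_auc_score(y_true_favorite, pred_favorite)
auc_star = (auc_purchase + auc_favorite) / 2
print(f"AUC* = {auc_star:.4f}")
```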
Data Processing and Exploratory Analysis
We used pandas-profiling to analyze the training and test sets.
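For reference, a minimal sketch of generating such a report (file names and the title are illustrative; assumes pandas-profiling v2+):

```python
import pandas as pd
from pandas_profiling import ProfileReport  # pip install pandas-profiling

train_data = pd.read_csv("train.csv")  # hypothetical input path
# minimal=True skips the expensive correlation computations on wide data
ProfileReport(train_data, title="train profiling", minimal=True).to_file("train_report.html")
```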
- Missing values
Among all attributes, two columns stand out with many missing values. We fill them with the column mean and treat the occasional negative entries as outliers, replacing them with their absolute values (shown here for UserInfo_92):
```python
train_test['UserInfo_92'] = train_test['UserInfo_92'].fillna(train_test['UserInfo_92'].mean())
train_test['UserInfo_92'] = train_test['UserInfo_92'].abs()
```
- String type conversion: convert user_id, Product_id, seller, and day to numeric types
```python
# Encode the string features; pd.factorize returns a (codes, uniques) tuple,
# so take element [0] and cast it to an unsigned integer type
train_test["user_id"] = pd.factorize(train_test["user_id"])[0].astype(np.uint16)
train_test["Product_id"] = pd.factorize(train_test["Product_id"])[0].astype(np.uint16)
train_test["seller"] = pd.factorize(train_test["seller"])[0].astype(np.uint16)
train_test["day"] = pd.to_numeric(train_test["day"])
```
- Merge the shared columns of the training and test sets into a single train_test DataFrame, keeping the two labels separately
```python
# Extract the two labels
lable_purchase = train_data['purchase']
lable_favorite = train_data['favorite']
# Concatenate train and test on the columns they share
col_name = train_data.columns
features = test_data.columns.tolist()
train_data = train_data[features]
train_test = pd.concat([train_data, test_data], sort=False)
```
Purchase Task
Preliminary Round
- Data preprocessing
- Drop the string-typed features
```python
train_test = train_test.drop(['user_id', 'Product_id', 'seller'], axis=1)
```
- Feature selection and crossing
- Statistical features: for each column, take the difference between the value and that column's standard deviation as a new feature A, and the absolute value of A as a new feature B
```python
col_name = data_test.columns.tolist()
for i in col_name:
    data_test[i + 'std'] = data_test[i] - data_test[i].std()   # difference from the column's std (feature A)
    data_test[i + 'std_abs'] = data_test[i + 'std'].abs()      # absolute value of that difference (feature B)
```
- Model selection
- We use xgboost
```python
import xgboost as xgb

def xgb_model(X_t, X_v, y_t, y_v, test):
    print("XGB model start")
    xgb_val = xgb.DMatrix(X_v, label=y_v)
    xgb_train = xgb.DMatrix(X_t, label=y_t)
    xgb_test = xgb.DMatrix(test)
    params = {
        'booster': 'gbtree',
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'lambda': 10,        # L2 regularization on weights; larger values make overfitting less likely
        'subsample': 0.7,    # randomly subsample training rows
        'silent': True,      # suppress training log output
        'eta': 0.01,         # learning rate
        'n_jobs': -1,
    }
    plst = list(params.items())
    num_rounds = 5000        # maximum boosting rounds
    watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]
    # Train with early stopping on the validation AUC
    model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=200)
    res = model.predict(xgb_test)  # predict on the test set
    print('test prediction finished')
    return res
```
- Result fusion
- None
Final Round
- Data preprocessing
- Drop the few columns dominated by a single value: 'ProductInfo_84' (83.1%), 'action_type' (96.1%), 'ProductInfo_107' (99.3%), plus 'UserInfo_112' and 'UserInfo_148' (a sketch of this filter follows below)
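The write-up does not include the filtering code itself; here is a minimal sketch, assuming a dominance threshold of 0.8 (the cutoff and variable names are illustrative, not from the source):

```python
# Share of the single most frequent value in each column; columns where one
# value dominates carry almost no signal and can be dropped.
threshold = 0.8  # assumed cutoff; the columns dropped above sit at ~83% and higher
dominant_ratio = train_test.apply(lambda col: col.value_counts(normalize=True).iloc[0])
cols_to_drop = dominant_ratio[dominant_ratio > threshold].index.tolist()
train_test = train_test.drop(columns=cols_to_drop)
```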
- Feature selection
- Cross features: build crosses over the four attributes 'user_id', 'Product_id', 'seller', and 'day' (two-way and three-way), then keep only the crosses that improve the model (see the sketch below)
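The purchase-side crossing code is not reproduced in the write-up; a minimal sketch of what such groupby crosses look like (the derived column names here are illustrative):

```python
# Two-way cross: number of records for each (user, seller) pair
train_test['u_s_count'] = train_test.groupby(['user_id', 'seller'])['user_id'].transform('count')
# Two-way cross: number of distinct products each user touched
train_test['u_p_nunique'] = train_test.groupby('user_id')['Product_id'].transform('nunique')
# Three-way cross: distinct sellers each user visited per day
train_test['u_d_s_nunique'] = train_test.groupby(['user_id', 'day'])['seller'].transform('nunique')
```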
- Purchase-rate binarization: for every int-typed feature, compute the purchase rate of each of its category values; the candidates whose zero-purchase-rate categories take a noticeable share form trans_list. From trans_list, remove the features whose zero-purchase-rate categories cover only 0.0% or 0.1% of rows (these form dlist). Each remaining feature is then binarized: a category with purchase rate 0 maps to 0, any other category maps to 1. Every such column becomes a 0/1 purchase indicator, which adds robustness; the original feature is not kept.
```python
# Features to transform: int-typed features with many zero-purchase-rate categories
trans_list = ['UserInfo_70', 'UserInfo_266', 'UserInfo_244', 'UserInfo_246', 'ProductInfo_46', 'ProductInfo_53',
              'UserInfo_62', 'UserInfo_119', 'UserInfo_216', 'UserInfo_151', 'ProductInfo_102', 'UserInfo_167',
              'UserInfo_6', 'UserInfo_223', 'UserInfo_45', 'UserInfo_226', 'UserInfo_180', 'UserInfo_254',
              'UserInfo_166', 'UserInfo_168', 'ProductInfo_81', 'UserInfo_18', 'ProductInfo_19', 'ProductInfo_7',
              'ProductInfo_191', 'ProductInfo_73', 'WebInfo_2', 'ProductInfo_6', 'ProductInfo_37', 'UserInfo_22',
              'UserInfo_198', 'UserInfo_230', 'UserInfo_174', 'UserInfo_51', 'UserInfo_38', 'UserInfo_171',
              'UserInfo_116', 'UserInfo_145', 'ProductInfo_45', 'UserInfo_154', 'UserInfo_78', 'UserInfo_10',
              'UserInfo_123', 'UserInfo_107', 'UserInfo_50', 'ProductInfo_62', 'UserInfo_41', 'ProductInfo_125',
              'UserInfo_68', 'UserInfo_124', 'UserInfo_87', 'ProductInfo_94', 'ProductInfo_104', 'UserInfo_35',
              'UserInfo_239', 'UserInfo_89', 'ProductInfo_18', 'UserInfo_252', 'UserInfo_177', 'UserInfo_150',
              'UserInfo_212', 'UserInfo_236', 'UserInfo_257', 'UserInfo_55', 'UserInfo_164', 'UserInfo_40', 'ProductInfo_215',
              'ProductInfo_63', 'ProductInfo_90', 'UserInfo_26', 'UserInfo_120', 'UserInfo_221', 'UserInfo_247',
              'UserInfo_39', 'UserInfo_110', 'UserInfo_214', 'UserInfo_241', 'ProductInfo_10', 'ProductInfo_44',
              'UserInfo_32', 'UserInfo_111', 'UserInfo_170', 'ProductInfo_179', 'UserInfo_163', 'UserInfo_175',
              'UserInfo_182']
# Removed from trans_list: zero-purchase-rate categories cover only 0.0% or 0.1%
dlist = ['UserInfo_89', 'UserInfo_41', 'UserInfo_266', 'UserInfo_26', 'UserInfo_236',
         'UserInfo_230', 'UserInfo_221', 'UserInfo_198', 'UserInfo_174', 'UserInfo_170',
         'UserInfo_168', 'UserInfo_163', 'UserInfo_10', 'ProductInfo_6', 'ProductInfo_46', 'UserInfo_107',
         'UserInfo_110', 'UserInfo_116', 'UserInfo_119', 'UserInfo_120', 'UserInfo_123', 'UserInfo_124',
         'UserInfo_154', 'UserInfo_164', 'UserInfo_166', 'UserInfo_171', 'UserInfo_175', 'UserInfo_180',
         'UserInfo_182', 'UserInfo_214', 'UserInfo_223', 'UserInfo_241', 'UserInfo_252', 'UserInfo_254',
         'UserInfo_257', 'UserInfo_35', 'UserInfo_45', 'UserInfo_50', 'UserInfo_62', 'UserInfo_68', 'UserInfo_70',
         'UserInfo_78', 'UserInfo_87']
```
```python
# train_test has no purchase label column at this point; attach the label to
# train_data temporarily and drop it when done
train_data['purchase'] = lable_purchase
for k in trans_list:
    # k is the column, j iterates over its category values
    feature = train_data[k].unique()
    for j in feature:
        # tmp is the purchase rate of category j
        tmp = train_data[train_data[k] == j]['purchase'].sum() / train_data[train_data[k] == j].shape[0]
        if tmp == 0:
            # purchase rate 0: write a sentinel first so pre-existing 0/1 values don't interfere
            train_test.loc[train_test[k] == j, k] = -9999
        else:
            train_test.loc[train_test[k] == j, k] = 9999
    train_test.loc[train_test[k] == 9999, k] = 1
    train_test.loc[train_test[k] == -9999, k] = 0
train_data = train_data.drop('purchase', axis=1)
```
- Statistical features: for every column, compute (1) the difference between the value and the column's standard deviation and (2) the absolute value of (1) (the same operations as in the preliminary round)
- Model selection
- Same as the preliminary round: xgboost with essentially the same parameters
- Result fusion
- Blend with UserInfo_82 at a 1:9 ratio. Submitting UserInfo_82 alone (negated, then min-max normalized) scores an AUC of 0.556, roughly the same as xgb trained without the purchase-rate transformation, which suggests that this single column, negated and normalized, already captures the purchase tendency fairly well. With the purchase-rate transformation, xgb scores 0.586; blending that result with UserInfo_82 raises the online score to 0.594.
```python
# xgb result after the purchase-rate transformation
purchase_result = xgb_model(X_train_p, X_valid_p, y_train_p, y_valid_p, test)
# Negate UserInfo_82, then min-max normalize it (0.556 online on its own)
test_data['UserInfo_82'] = -test_data['UserInfo_82']
test_data['UserInfo_82'] = (test_data['UserInfo_82'] - test_data['UserInfo_82'].min()) / \
                           (test_data['UserInfo_82'].max() - test_data['UserInfo_82'].min())
u82 = test_data['UserInfo_82']
# Blend the transformed xgb result with u82 at 9:1
res = purchase_result * 0.9 + u82 * 0.1
```
So the question arises: why negate UserInfo_82 before normalizing? The likely reason is that UserInfo_82 is negatively correlated with purchasing. AUC depends only on the ranking of the scores, and AUC(y, -x) = 1 - AUC(y, x), so negating a column whose raw AUC sits below 0.5 flips it above 0.5; the min-max step then maps it into [0, 1] so it can be blended with the model's probabilities on a comparable scale.
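A quick demonstration on synthetic data (not competition data) that negation flips AUC around 0.5 while min-max scaling, being monotonic, leaves it unchanged:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # synthetic 0/1 labels
x = -y + rng.normal(scale=2.0, size=1000)    # feature negatively correlated with y

print(roc_auc_score(y, x))                   # below 0.5
print(roc_auc_score(y, -x))                  # = 1 - AUC(y, x), above 0.5
x_scaled = (-x - (-x).min()) / ((-x).max() - (-x).min())
print(roc_auc_score(y, x_scaled))            # identical to AUC(y, -x)
```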
Favorite Task
Preliminary Round
- Data preprocessing
- Same as the preliminary-round purchase task
- Feature selection and crossing
- Hand-picked cross features (all of these have an interpretable meaning; in the final round we enumerated feature combinations, which surfaced some crosses with no obvious meaning but good effect)
```python
# Hand-pick some meaningful cross features
train_test['browse_Product_un'] = train_test.groupby("user_id").Product_id.transform('count')
train_test['browse_seller'] = train_test.groupby("user_id").seller.transform('nunique')
train_test['browse_days'] = train_test.groupby("user_id").day.transform('nunique')
train_test['last_days'] = train_test.groupby("user_id").day.transform('max')
train_test['later_days'] = train_test.groupby("user_id").day.transform('min')
train_test['Product_count'] = train_test.groupby('Product_id')['Product_id'].transform('count')
train_test['user_action'] = train_test.groupby('user_id')['action_type'].transform('nunique')
train_test['seller_seller'] = train_test.groupby('seller')['seller'].transform('count')
train_test['day_seller_count'] = train_test.groupby('day')['seller'].transform('count')
train_test['user_seller_user'] = train_test.groupby(['user_id', 'seller'])['user_id'].transform('count')
train_test['Product_day_user'] = train_test.groupby(['Product_id', 'day'])['user_id'].transform('count')
train_test['seller_day_user'] = train_test.groupby(['seller', 'day'])['user_id'].transform('count')
train_test['day_user_count'] = train_test.groupby('day')['day'].transform('count')
train_test['seller_user'] = train_test.groupby(['seller'])['seller'].transform('count')
train_test['u123_'] = train_test.groupby(['user_id'])['UserInfo_123'].transform('count')
# --- number of web actions
train_test['webaction1'] = train_test['WebInfo_1'] + train_test['WebInfo_2'] + train_test['WebInfo_3']
# --- per-day max and min Product_id (intended as the most/least favorited product per day)
train_test['product_day_max'] = train_test.groupby(['day'])['Product_id'].transform('max')
train_test['product_day_min'] = train_test.groupby(['day'])['Product_id'].transform('min')
```
- Statistical features
```python
# ~~~~~~~~~~ float64 column handling ~~~~~~~~~~
col_name = train_test.columns.tolist()
product_float = []
user_float = []
web_float = []
pu_col = []
product_int = []
user_int = []
web_int = []
for i in range(0, len(col_name)):
    if train_test[col_name[i]].dtype == 'float64':
        pu_col.append(col_name[i])
        if col_name[i][0] == 'P':
            product_float.append(col_name[i])
        elif col_name[i][0] == 'U':
            user_float.append(col_name[i])
        elif col_name[i][0] == 'W':  # there are no float64 WebInfo columns
            web_float.append(col_name[i])
    if train_test[col_name[i]].dtype == 'int64':
        if col_name[i][0] == 'P':
            product_int.append(col_name[i])
        elif col_name[i][0] == 'U':
            user_int.append(col_name[i])
        elif col_name[i][0] == 'W':
            web_int.append(col_name[i])

for i in product_float:
    train_test[i] = train_test[i].apply(lambda x: np.exp(x))
    train_test[i + '_mean'] = train_test.groupby(['Product_id'])[i].transform('mean')
    train_test[i + "_min"] = train_test.groupby('Product_id')[i].transform("min")
    train_test[i + "_max"] = train_test.groupby('Product_id')[i].transform("max")
    train_test[i + "_median"] = train_test.groupby('Product_id')[i].transform("median")

for i in user_float:
    train_test[i] = train_test[i].apply(lambda x: np.exp(x))
    train_test[i + '_mean'] = train_test.groupby(['user_id'])[i].transform('mean')
    train_test[i + "_min"] = train_test.groupby('user_id')[i].transform("min")
    train_test[i + "_max"] = train_test.groupby('user_id')[i].transform("max")
    train_test[i + "_median"] = train_test.groupby('user_id')[i].transform("median")
    train_test['seller_product_score_2'] = train_test.groupby(['seller', 'Product_id'])[i].transform('mean')

for i in user_int:  # 0.633645 offline, 0.60074 online
    train_test[i + "u_cont"] = train_test.groupby(['user_id'])[i].transform('count')   # 0.628459
    train_test[i + "u_contm"] = train_test.groupby(['user_id'])[i].transform('mean')   # 0.632236

for i in product_int:
    train_test[i + "p_cont"] = train_test.groupby(['Product_id'])[i].transform('count')
    train_test[i + "p_contm"] = train_test.groupby(['Product_id'])[i].transform('mean')

# --- sum all ProductInfo columns: 0.634057
train_test['psum'] = np.zeros(train_test.shape[0])
for i in range(1, len(train_test.columns)):
    if train_test.columns[i][0] == 'P':
        train_test['psum'] = train_test['psum'] + train_test[train_test.columns[i]]
train_test['psum'] = train_test['psum'].apply(lambda x: np.log(x + 1))

# int64 columns whose train/test distributions look inconsistent in the plots
d_col = ['UserInfo_150', 'UserInfo_14', 'UserInfo_216', 'UserInfo_239',
         'UserInfo_22', 'UserInfo_6', 'UserInfo_246', 'WebInfo_2']
```
- Model selection
- Same as the preliminary-round purchase model: xgboost
Final Round
- Data preprocessing
- Same as the final-round purchase task
- Feature selection and crossing
- The favorite task reuses some of the cross features from the preliminary round whose meaning is known
- Left as-is, the anonymous features hurt the model, so we drop them and keep only 'user_id', 'seller', 'Product_id', 'day', 'action_type', 'WebInfo_1', 'WebInfo_3', 'UserInfo_123', then build cross features over what remains (two-way and three-way crosses)
- Unlike the purchase task, the favorite crosses come from enumerating combinations and recursively keeping the ones that help the model, so some of them have no real-world meaning (a sketch of this selection loop follows after the code).
```python
features = ['user_id', 'seller', 'Product_id', 'day', 'action_type',
            'WebInfo_1', 'WebInfo_3', 'UserInfo_123']
train_data_f = train_data[features]
test_data_f = test_data[features]
train_test_f = pd.concat([train_data_f, test_data_f], sort=False)
train_test_f = train_test_f.fillna(train_test_f.mean())

# From the user's point of view
train_test_f['u_p_count'] = train_test_f.groupby("user_id").Product_id.transform('count')        # 0.6900794  products a user browsed
train_test_f['u_p_nunique'] = train_test_f.groupby("user_id").Product_id.transform('nunique')    # 0.735705288  distinct products a user browsed
train_test_f['u_s_nunique'] = train_test_f.groupby("user_id").seller.transform('nunique')        # distinct sellers a user browsed
train_test_f['u_d_nunique'] = train_test_f.groupby("user_id").day.transform('nunique')           # on how many distinct days the user browsed
train_test_f['last_day'] = train_test_f.groupby("user_id").day.transform('max')                  # 0.691511  latest browsing day
train_test_f['first_day'] = train_test_f.groupby("user_id").day.transform('min')                 # earliest browsing day
train_test_f['during_days'] = train_test_f['last_day'] - train_test_f['first_day']               # browsing span in days
train_test_f['u_s_u_count'] = train_test_f.groupby(['user_id', 'seller'])['user_id'].transform('count')  # 0.734514318  strong feature

# From the product's point of view
train_test_f['p_p_count'] = train_test_f.groupby('Product_id')['Product_id'].transform('count')          # 0.709898  product frequency, strong feature
train_test_f['p_u_nunique'] = train_test_f.groupby('Product_id')['user_id'].transform('nunique')         # 0.73514088  distinct users per product
train_test_f['p_d_u_count'] = train_test_f.groupby(['Product_id', 'day'])['user_id'].transform('count')  # 0.712725  daily user views per product
train_test_f['p_u_count'] = train_test_f.groupby(['Product_id'])['user_id'].transform('count')           # total user views per product
train_test_f['p_d_unique'] = train_test_f.groupby(['Product_id']).day.transform('nunique')               # 0.735715770

# From the seller's point of view
train_test_f['s_s_count'] = train_test_f.groupby('seller')['seller'].transform('count')                  # 0.7134725  seller frequency
train_test_f['s_u_count'] = train_test_f.groupby('seller')['user_id'].transform('count')                 # user views per seller
train_test_f['s_u_nunique'] = train_test_f.groupby('seller')['user_id'].transform('nunique')             # distinct users per seller
train_test_f['s_d_u_count'] = train_test_f.groupby(['seller', 'day'])['user_id'].transform('count')      # 0.7252421  daily user views per seller
train_test_f['s_p_count'] = train_test_f.groupby(['seller'])['Product_id'].transform('count')            # products per seller

# From the day's point of view
train_test_f['d_u_count'] = train_test_f.groupby('day')['user_id'].transform('count')                    # 0.7257619  records per day
train_test_f['d_s_count'] = train_test_f.groupby('day')['seller'].transform('count')                     # 0.7257619  seller records per day
train_test_f['d_p_count'] = train_test_f.groupby('day')['Product_id'].transform('count')                 # product records per day

# UserInfo_123
train_test_f['u_u123_count'] = train_test_f.groupby(['user_id'])['UserInfo_123'].transform('count')      # 0.735175687

# ======= three-way crosses: add them all, then filter =======
train_test_f['u_d_s_count'] = train_test_f.groupby(['user_id', 'day'])['seller'].transform('count')             # 0.741555
train_test_f['u_s_p_unique'] = train_test_f.groupby(['user_id', 'seller'])['Product_id'].transform('nunique')   # 0.7417228218
train_test_f['u_p_d_mean'] = train_test_f.groupby(['user_id', 'Product_id'])['day'].transform('mean')           # 0.7417228218
train_test_f['s_u_d_mean'] = train_test_f.groupby(['seller', 'user_id'])['day'].transform('mean')               # 0.7417228218
train_test_f['u_p_u_count'] = train_test_f.groupby(['user_id', 'Product_id'])['user_id'].transform('count')     # 0.7469330771
train_test_f['u_d_p_count'] = train_test_f.groupby(['user_id', 'day'])['Product_id'].transform('count')         # 0.7464669548
train_test_f['s_p_d_max'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('max')              # 0.7469330771
train_test_f['u_d_u_count'] = train_test_f.groupby(['user_id', 'day'])['user_id'].transform('count')            # 0.7469734682
train_test_f['u_s_a_nunique'] = train_test_f.groupby(['user_id', 'seller'])['action_type'].transform('nunique') # 0.7504861997
# The following (user, seller) day statistics together reach 0.75052
train_test_f['u_s_d_max'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('max')
train_test_f['u_s_d_min'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('min')
train_test_f['u_s_d_mean'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('mean')
train_test_f['u_s_d_std'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('std')
train_test_f['u_s_d_median'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('median')
train_test_f['u_s_d_nunique'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('nunique')
train_test_f['u_s_d_count'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('count')
train_test_f['p_s_u_nunique'] = train_test_f.groupby(['Product_id', 'seller'])['user_id'].transform('nunique')  # 0.7506777329
# The (seller, Product_id) day statistics: 0.7504883551
train_test_f['p_s_d_max'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('max')
train_test_f['p_s_d_min'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('min')
train_test_f['p_s_d_mean'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('mean')
train_test_f['p_s_d_std'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('std')
train_test_f['p_s_d_median'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('median')
train_test_f['p_s_d_nunique'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('nunique')
train_test_f['p_s_d_count'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('count')
train_test_f['p_s_a_nunique'] = train_test_f.groupby(['Product_id', 'seller'])['action_type'].transform('nunique')
train_test_f['u_p_s_nunique'] = train_test_f.groupby(['user_id', 'Product_id'])['seller'].transform('nunique')
```
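The recursive filtering loop itself is not included in the write-up; a minimal sketch of the greedy forward selection it describes, where evaluate_auc is a hypothetical helper that trains the model on a feature list and returns the validation AUC:

```python
# Greedy forward selection over candidate cross features (illustrative only).
def select_features(base_features, candidates, evaluate_auc):
    selected = list(base_features)
    best_auc = evaluate_auc(selected)
    for feat in candidates:
        trial_auc = evaluate_auc(selected + [feat])
        if trial_auc > best_auc:   # keep the cross only if it raises validation AUC
            selected.append(feat)
            best_auc = trial_auc
    return selected, best_auc
```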
- Model selection
- The favorite model is CatBoost, with 'user_id', 'Product_id', and 'seller' passed as categorical features; learning rate 0.04, 1000 iterations, early stopping after 100 rounds
```python
from catboost import CatBoostClassifier

def cb_model(X_t, X_v, y_t, y_v, test):
    print('cb_model start')
    # Collect the column indices of the categorical features
    category = []
    c_list = ['user_id', 'Product_id', 'seller']  # 'day' was also tried
    for index, value in enumerate(X_t.columns):
        if value in c_list:
            category.append(index)
    model = CatBoostClassifier(iterations=1000, learning_rate=0.04, cat_features=category,
                               loss_function='Logloss', logging_level='Verbose', eval_metric='AUC')
    model.fit(X_t, y_t, eval_set=(X_v, y_v), early_stopping_rounds=100)
    res = model.predict_proba(test)[:, 1]
    # importance = model.get_feature_importance(prettified=True)  # inspect feature importances
    # print(importance)
    return res
```
- 5-fold cross-validation
```python
from sklearn.model_selection import StratifiedKFold

favorite_result = np.zeros(test_f.shape[0])  # accumulates the fold predictions
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)  # 5-fold CV
# data_f is the favorite training set, lable_favorite its label
for train_index, test_index in kf.split(data_f, lable_favorite):
    X_train_p, y_train_p = data_f.iloc[train_index], lable_favorite.iloc[train_index]
    X_valid_p, y_valid_p = data_f.iloc[test_index], lable_favorite.iloc[test_index]
    # average the five fold models' test predictions
    favorite_result += cb_model(X_train_p, X_valid_p, y_train_p, y_valid_p, test_f) / 5.0
```