Competition overview
1) Tools: Python, pandas, sklearn.
2) Approach: use statistics of each user-item pair's interactions on the previous day to predict that pair's purchase behaviour today.
Note: the feature time window is one day, and each classification sample is a user-item pair (user_id, item_id). This post mainly walks through the competition workflow; it does not go into detail on feature construction, feature processing, or feature selection.
From this post you get the full pipeline: downloading the data from the official site, processing it, extracting daily counts of the four interaction types for each user-item pair to form training and test sets, then training, testing, and predicting with a model. It also shows how sklearn can address the class-imbalance problem, giving you a real, hands-on feel for working the Tmall mobile recommendation competition.
step1: Inspect and process the user table
import pandas as pd
import numpy as np
%time userAll = pd.read_csv('E:/python/gbdt/fresh_comp_offline/tianchi_fresh_comp_train_user.csv',\
usecols = ['user_id','item_id','behavior_type','time'])
Wall time: 14.2 s
userAll.head()
| | user_id | item_id | behavior_type | time |
|---|---|---|---|---|
| 0 | 10001082 | 285259775 | 1 | 2014-12-08 18 |
| 1 | 10001082 | 4368907 | 1 | 2014-12-12 12 |
| 2 | 10001082 | 4368907 | 1 | 2014-12-12 12 |
| 3 | 10001082 | 53616768 | 1 | 2014-12-02 15 |
| 4 | 10001082 | 151466952 | 1 | 2014-12-12 11 |
userAll.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23291027 entries, 0 to 23291026
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 710.8+ MB
userAll.duplicated().sum()
11505107
step2: Download, inspect, and process the item-subset table
%time itemSub = pd.read_csv('tianchi_fresh_comp_train_item.csv',usecols = ['item_id'])
Wall time: 428 ms
itemSub.item_id.is_unique
False
itemSub.item_id.value_counts().head()
25013404 8724
311093202 5999
228198932 5597
238357777 5522
313822206 4517
Name: item_id, dtype: int64
itemSub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620918 entries, 0 to 620917
Data columns (total 1 columns):
item_id 620918 non-null int64
dtypes: int64(1)
memory usage: 4.7 MB
itemSub.duplicated().sum()
198060
itemSet = itemSub[['item_id']].drop_duplicates()
itemSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 422858 entries, 0 to 620917
Data columns (total 1 columns):
item_id 422858 non-null int64
dtypes: int64(1)
memory usage: 6.5 MB
step3: Intersect the user table with the item subset
Since the user-item predictions (which users buy which items) are scored only on the item subset, we can restrict attention to users' interactions with that subset when predicting user-item purchases.
Of course, one could also use the full user table and predict user-item purchases by analysing users' behaviour across different item categories.
%time userSub = pd.merge(userAll,itemSet,on = 'item_id',how = 'inner')
Wall time: 4.4 s
userSub.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2084859 entries, 0 to 2084858
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 79.5+ MB
userSub.head()
| | user_id | item_id | behavior_type | time |
|---|---|---|---|---|
| 0 | 10001082 | 275221686 | 1 | 2014-12-03 01 |
| 1 | 10001082 | 275221686 | 1 | 2014-12-13 14 |
| 2 | 10001082 | 275221686 | 1 | 2014-12-08 07 |
| 3 | 10001082 | 275221686 | 1 | 2014-12-08 07 |
| 4 | 10001082 | 275221686 | 1 | 2014-12-08 00 |
Save this dataset to a csv file
%time userSub.to_csv('userSub.csv')
Wall time: 4.53 s
step4: Process the time data
Reload userSub. (Saving userSub and then reading it back is an indirect way to switch the index to time; besides, userSub is our main dataset for prediction, so it is worth saving anyway.)
%time userSub = pd.read_csv('userSub.csv',usecols = ['user_id','item_id','behavior_type','time'],parse_dates = True)
Wall time: 14 s
userSub.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2084859 entries, 0 to 2084858
Data columns (total 4 columns):
user_id int64
item_id int64
behavior_type int64
time object
dtypes: int64(3), object(1)
memory usage: 63.6+ MB
userSub.head()
| | user_id | item_id | behavior_type | time |
|---|---|---|---|---|
| 0 | 10001082 | 275221686 | 1 | 2014-12-03 01 |
| 1 | 10001082 | 275221686 | 1 | 2014-12-13 14 |
| 2 | 10001082 | 275221686 | 1 | 2014-12-08 07 |
| 3 | 10001082 | 275221686 | 1 | 2014-12-08 07 |
| 4 | 10001082 | 275221686 | 1 | 2014-12-08 00 |
%time userSub = userSub.set_index(pd.to_datetime(userSub.pop('time'), format='%Y-%m-%d %H')).sort_index()
Wall time: 66 ms
userSub.index
DatetimeIndex(['2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
'2014-11-18 00:00:00', '2014-11-18 00:00:00',
...
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00',
'2014-12-18 23:00:00', '2014-12-18 23:00:00'],
dtype='datetime64[ns]', name='time', length=2084859, freq=None)
userSub.head()
| time | user_id | item_id | behavior_type |
|---|---|---|---|
| 2014-11-18 | 129403050 | 52900329 | 1 |
| 2014-11-18 | 23246977 | 353606633 | 1 |
| 2014-11-18 | 140763800 | 369393023 | 1 |
| 2014-11-18 | 140763800 | 187769381 | 1 |
| 2014-11-18 | 32363170 | 134081514 | 1 |
step5: Feature processing
Feature processing has two parts:
1) dummy-encode (one-hot) the user-item interaction types;
2) set a time window and aggregate the interaction counts within it.
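These two steps can be sketched on a toy frame (column names follow the competition data; the rows themselves are made up):

```python
import pandas as pd

# Toy interaction log: behavior_type 1-4 = click, collect, cart, buy.
events = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'item_id': [10, 10, 10, 20],
    'behavior_type': [1, 1, 4, 3],
    'day': ['2014-12-16'] * 4,
})

# 1) dummy-encode the interaction type (only types present in the toy
#    data get a column; the real table has all four).
dummies = pd.get_dummies(events['behavior_type'], prefix='type')
oneHot = pd.concat([events[['day', 'user_id', 'item_id']], dummies], axis=1)

# 2) aggregate within the (day, user, item) window: per-day counts of
#    each interaction type for every user-item pair.
daily = oneHot.groupby(['day', 'user_id', 'item_id'], as_index=False).sum()
print(daily)
```

Here user 1 clicked item 10 twice and bought it once on the same day, so the aggregated row carries type_1 = 2 and type_4 = 1.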
pd.get_dummies(userSub['behavior_type'],prefix = 'type').head()
| | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 |
typeDummies = pd.get_dummies(userSub['behavior_type'],prefix = 'type')
userSubOneHot = pd.concat([userSub[['user_id','item_id','time']],typeDummies],axis = 1)
usertem = pd.concat([userSub[['user_id','item_id']],typeDummies,userSub[['time']]],axis = 1)
usertem.head()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4 | time |
|---|---|---|---|---|---|---|---|
| 0 | 10001082 | 275221686 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-12-03 01 |
| 1 | 10001082 | 275221686 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-12-13 14 |
| 2 | 10001082 | 275221686 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-12-08 07 |
| 3 | 10001082 | 275221686 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-12-08 07 |
| 4 | 10001082 | 275221686 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-12-08 00 |
usertem.groupby(['time','user_id','item_id'],as_index = False).sum().head()
| | time | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|---|
| 0 | 2014-11-18 00 | 1409053 | 58649567 | 2.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2014-11-18 00 | 1446949 | 2432119 | 3.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2014-11-18 00 | 1446949 | 206833072 | 2.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2014-11-18 00 | 1446949 | 347745633 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2014-11-18 00 | 2903578 | 395200199 | 2.0 | 0.0 | 0.0 | 0.0 |
userSubOneHot.head()
| | user_id | item_id | time | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|---|
| 0 | 10001082 | 275221686 | 2014-12-03 01 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10001082 | 275221686 | 2014-12-13 14 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 10001082 | 275221686 | 2014-12-08 07 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 10001082 | 275221686 | 2014-12-08 07 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 10001082 | 275221686 | 2014-12-08 00 | 1.0 | 0.0 | 0.0 | 0.0 |
userSubOneHot.info()
userSubOneHotGroup = userSubOneHot.groupby(['time','user_id','item_id'],as_index = False).sum()
userSubOneHotGroup.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 968243 entries, 0 to 968242
Data columns (total 7 columns):
time 968243 non-null object
user_id 968243 non-null int64
item_id 968243 non-null int64
type_1 968243 non-null float64
type_2 968243 non-null float64
type_3 968243 non-null float64
type_4 968243 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 59.1+ MB
userSubOneHotGroup.head()
| | time | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|---|
| 0 | 2014-11-18 00 | 1409053 | 58649567 | 2.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2014-11-18 00 | 1446949 | 2432119 | 3.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2014-11-18 00 | 1446949 | 206833072 | 2.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2014-11-18 00 | 1446949 | 347745633 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2014-11-18 00 | 2903578 | 395200199 | 2.0 | 0.0 | 0.0 | 0.0 |
Split the timestamp into day and hour
userSubOneHotGroup['time_day'] = pd.to_datetime(userSubOneHotGroup.time.values).date
userSubOneHotGroup['time_hour'] = pd.to_datetime(userSubOneHotGroup.time.values).time
userSubOneHotGroup.head()
| | time | user_id | item_id | type_1 | type_2 | type_3 | type_4 | time_day | time_hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-11-18 00 | 1409053 | 58649567 | 2.0 | 0.0 | 0.0 | 0.0 | 2014-11-18 | 00:00:00 |
| 1 | 2014-11-18 00 | 1446949 | 2432119 | 3.0 | 0.0 | 0.0 | 0.0 | 2014-11-18 | 00:00:00 |
| 2 | 2014-11-18 00 | 1446949 | 206833072 | 2.0 | 0.0 | 0.0 | 0.0 | 2014-11-18 | 00:00:00 |
| 3 | 2014-11-18 00 | 1446949 | 347745633 | 1.0 | 0.0 | 0.0 | 0.0 | 2014-11-18 | 00:00:00 |
| 4 | 2014-11-18 00 | 2903578 | 395200199 | 2.0 | 0.0 | 0.0 | 0.0 | 2014-11-18 | 00:00:00 |
dataHour = userSubOneHotGroup.iloc[:,0:7]
dataHour.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 968243 entries, 0 to 968242
Data columns (total 7 columns):
time 968243 non-null object
user_id 968243 non-null int64
item_id 968243 non-null int64
type_1 968243 non-null float64
type_2 968243 non-null float64
type_3 968243 non-null float64
type_4 968243 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 59.1+ MB
dataHour.to_csv('dataHour.csv')
dataHour.duplicated().sum()
0
dataDay = userSubOneHotGroup.groupby(['time_day','user_id','item_id'],as_index = False).sum()
dataDay.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 904397 entries, 0 to 904396
Data columns (total 7 columns):
time_day 904397 non-null object
user_id 904397 non-null int64
item_id 904397 non-null int64
type_1 904397 non-null float64
type_2 904397 non-null float64
type_3 904397 non-null float64
type_4 904397 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 55.2+ MB
dataDay.head()
| | time_day | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|---|
| 0 | 2014-11-18 | 492 | 76093985 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 2014-11-18 | 492 | 110036513 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2014-11-18 | 492 | 176404510 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2014-11-18 | 492 | 178412255 | 2.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2014-11-18 | 492 | 335961429 | 1.0 | 0.0 | 0.0 | 0.0 |
dataDay.to_csv('dataDay.csv')
dataDay.duplicated().sum()
0
dataDay.type_4.max()
20.0
step6: Build the training and test sets
This post uses the day-sampled table and classifies, for each user-item pair, whether a purchase occurs: the label is 1 if the pair buys, 0 otherwise.
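The pairing used below — features from day t, purchase label from day t+1 — can be sketched on toy values (in the post, t = 2014-12-16 for training):

```python
import pandas as pd

# Day-level counts for two consecutive days (toy values).
day16 = pd.DataFrame({'user_id': [1, 2], 'item_id': [10, 20],
                      'type_1': [3.0, 1.0], 'type_4': [0.0, 0.0]})
day17 = pd.DataFrame({'user_id': [1], 'item_id': [10], 'type_4': [2.0]})

# Left-join next-day purchases onto the features; pairs with no
# purchase the next day come back NaN and are filled with 0.
data = pd.merge(day16, day17, on=['user_id', 'item_id'],
                suffixes=('_x', '_y'), how='left').fillna(0.0)

# Binary label: did the pair buy at all on day t+1?
data['labels'] = (data['type_4_y'] > 0.0).astype(float)
print(data[['user_id', 'item_id', 'labels']])
```

User 1 bought item 10 twice on day 17, so that pair gets label 1.0; the unmatched pair gets 0.0.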
dataDay_load = pd.read_csv('dataDay.csv',usecols = ['time_day','user_id','item_id','type_1',\
'type_2','type_3','type_4'], index_col = 'time_day',parse_dates = True)
dataDay_load.head()
| time_day | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|
| 2014-11-18 | 492 | 76093985 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2014-11-18 | 492 | 110036513 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2014-11-18 | 492 | 176404510 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2014-11-18 | 492 | 178412255 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2014-11-18 | 492 | 335961429 | 1.0 | 0.0 | 0.0 | 0.0 |
dataDay_load.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 904397 entries, 2014-11-18 to 2014-12-18
Data columns (total 6 columns):
user_id 904397 non-null int64
item_id 904397 non-null int64
type_1 904397 non-null float64
type_2 904397 non-null float64
type_3 904397 non-null float64
type_4 904397 non-null float64
dtypes: float64(4), int64(2)
memory usage: 48.3 MB
train_x = dataDay_load.loc['2014-12-16']
train_x.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 30183 entries, 2014-12-16 to 2014-12-16
Data columns (total 6 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4 30183 non-null float64
dtypes: float64(4), int64(2)
memory usage: 1.6 MB
train_x.describe()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|
| count | 3.018300e+04 | 3.018300e+04 | 30183.000000 | 30183.000000 | 30183.000000 | 30183.000000 |
| mean | 7.186918e+07 | 2.032869e+08 | 2.181890 | 0.036776 | 0.058278 | 0.026803 |
| std | 4.595509e+07 | 1.172341e+08 | 1.352044 | 0.189442 | 0.241651 | 0.174335 |
| min | 5.943600e+04 | 1.540200e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.000949e+07 | 1.014034e+08 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.858117e+07 | 2.036895e+08 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.178801e+08 | 3.056362e+08 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.424396e+08 | 4.045617e+08 | 17.000000 | 2.000000 | 4.000000 | 5.000000 |
train_y = dataDay_load.loc['2014-12-17',['user_id','item_id','type_4']]
train_y.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
Data columns (total 3 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_4 29749 non-null float64
dtypes: float64(1), int64(2)
memory usage: 929.7 KB
train_y.describe()
| | user_id | item_id | type_4 |
|---|---|---|---|
| count | 2.974900e+04 | 2.974900e+04 | 29749.000000 |
| mean | 6.997416e+07 | 2.016876e+08 | 0.024202 |
| std | 4.685978e+07 | 1.170012e+08 | 0.165070 |
| min | 5.943600e+04 | 6.619000e+03 | 0.000000 |
| 25% | 2.783149e+07 | 9.903570e+07 | 0.000000 |
| 50% | 5.562218e+07 | 2.005868e+08 | 0.000000 |
| 75% | 1.176616e+08 | 3.039699e+08 | 0.000000 |
| max | 1.424157e+08 | 4.045616e+08 | 4.000000 |
dataSet = pd.merge(train_x,train_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)
dataSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30183 entries, 0 to 30182
Data columns (total 7 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4_x 30183 non-null float64
type_4_y 30183 non-null float64
dtypes: float64(5), int64(2)
memory usage: 1.8 MB
dataSet.describe()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4_x | type_4_y |
|---|---|---|---|---|---|---|---|
| count | 3.018300e+04 | 3.018300e+04 | 30183.000000 | 30183.000000 | 30183.000000 | 30183.000000 | 30183.000000 |
| mean | 7.186918e+07 | 2.032869e+08 | 2.181890 | 0.036776 | 0.058278 | 0.026803 | 0.004705 |
| std | 4.595509e+07 | 1.172341e+08 | 1.352044 | 0.189442 | 0.241651 | 0.174335 | 0.075343 |
| min | 5.943600e+04 | 1.540200e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 3.000949e+07 | 1.014034e+08 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.858117e+07 | 2.036895e+08 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.178801e+08 | 3.056362e+08 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.424396e+08 | 4.045617e+08 | 17.000000 | 2.000000 | 4.000000 | 5.000000 | 3.000000 |
np.sign(dataSet.type_4_y.values).sum()
129.0
np.sign(0.0)
0.0
dataSet['labels'] = dataSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
dataSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30183 entries, 0 to 30182
Data columns (total 8 columns):
user_id 30183 non-null int64
item_id 30183 non-null int64
type_1 30183 non-null float64
type_2 30183 non-null float64
type_3 30183 non-null float64
type_4_x 30183 non-null float64
type_4_y 30183 non-null float64
labels 30183 non-null float64
dtypes: float64(6), int64(2)
memory usage: 2.1 MB
dataSet.head()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4_x | type_4_y | labels |
|---|---|---|---|---|---|---|---|---|
| 0 | 59436 | 184081436 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 61797 | 83261906 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 134211 | 6491625 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 134211 | 79679783 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 134211 | 96616269 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
np.sign(dataSet.type_3.values).sum()
1713.0
trainSet = dataSet.copy()
trainSet.to_csv('trainSet.csv')
test_x = dataDay_load.loc['2014-12-17']
test_x.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
Data columns (total 6 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4 29749 non-null float64
dtypes: float64(4), int64(2)
memory usage: 1.6 MB
test_x.head()
| time_day | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|
| 2014-12-17 | 59436 | 238861461 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2014-12-17 | 60723 | 202829025 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2014-12-17 | 60723 | 371933634 | 2.0 | 0.0 | 0.0 | 0.0 |
| 2014-12-17 | 106362 | 38830684 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2014-12-17 | 106362 | 149517272 | 2.0 | 0.0 | 0.0 | 0.0 |
test_y = dataDay_load.loc['2014-12-18',['user_id','item_id','type_4']]
testSet = pd.merge(test_x,test_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)
testSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29749 entries, 0 to 29748
Data columns (total 7 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4_x 29749 non-null float64
type_4_y 29749 non-null float64
dtypes: float64(5), int64(2)
memory usage: 1.8 MB
testSet.describe()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4_x | type_4_y |
|---|---|---|---|---|---|---|---|
| count | 2.974900e+04 | 2.974900e+04 | 29749.000000 | 29749.000000 | 29749.000000 | 29749.000000 | 29749.000000 |
| mean | 6.997416e+07 | 2.016876e+08 | 2.168241 | 0.038153 | 0.059296 | 0.024202 | 0.004336 |
| std | 4.685978e+07 | 1.170012e+08 | 1.334966 | 0.192093 | 0.242364 | 0.165070 | 0.069681 |
| min | 5.943600e+04 | 6.619000e+03 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.783149e+07 | 9.903570e+07 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.562218e+07 | 2.005868e+08 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.176616e+08 | 3.039699e+08 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.424157e+08 | 4.045616e+08 | 17.000000 | 2.000000 | 3.000000 | 4.000000 | 3.000000 |
testSet['labels'] = testSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
testSet.describe()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4_x | type_4_y | labels |
|---|---|---|---|---|---|---|---|---|
| count | 2.974900e+04 | 2.974900e+04 | 29749.000000 | 29749.000000 | 29749.000000 | 29749.000000 | 29749.000000 | 29749.000000 |
| mean | 6.997416e+07 | 2.016876e+08 | 2.168241 | 0.038153 | 0.059296 | 0.024202 | 0.004336 | 0.004101 |
| std | 4.685978e+07 | 1.170012e+08 | 1.334966 | 0.192093 | 0.242364 | 0.165070 | 0.069681 | 0.063909 |
| min | 5.943600e+04 | 6.619000e+03 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.783149e+07 | 9.903570e+07 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.562218e+07 | 2.005868e+08 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.176616e+08 | 3.039699e+08 | 3.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.424157e+08 | 4.045616e+08 | 17.000000 | 2.000000 | 3.000000 | 4.000000 | 3.000000 | 1.000000 |
testSet.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29749 entries, 0 to 29748
Data columns (total 8 columns):
user_id 29749 non-null int64
item_id 29749 non-null int64
type_1 29749 non-null float64
type_2 29749 non-null float64
type_3 29749 non-null float64
type_4_x 29749 non-null float64
type_4_y 29749 non-null float64
labels 29749 non-null float64
dtypes: float64(6), int64(2)
memory usage: 2.0 MB
testSet['labels'].values.sum()
122.0
testSet.to_csv('testSet.csv')
step7: Train models
Logistic regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
model.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
0.99565980850147429
train_y_est = model.predict(trainSet.iloc[:,2:6])
train_y_est.sum()
2.0
Weighted logistic regression (cost-sensitive handling of class imbalance)
lrW = LogisticRegression(class_weight = 'balanced')
lrW.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
trainLRW_y = lrW.predict(trainSet.iloc[:,2:6])
trainLRW_y.sum()
4792.0
lrW.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
0.84292482523274692
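`class_weight='balanced'` (the successor of the long-deprecated `'auto'` option) assigns each class the weight `n_samples / (n_classes * class_count)`, so the rare positive class is weighted far more heavily. A quick check against scikit-learn's own helper, on made-up labels:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 97 negatives, 3 positives.
y = np.array([0.0] * 97 + [1.0] * 3)

# 'balanced' gives each class n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0.0, 1.0]), y=y)
print(weights)  # class 0: 100/(2*97), class 1: 100/(2*3)

# The same numbers computed by hand.
manual = len(y) / (2 * np.bincount(y.astype(int)))
print(manual)
```

The rare class here is weighted roughly 32x more than the common one, which is why the weighted model above starts predicting positives at all.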
Compute training precision and recall
from sklearn.model_selection import train_test_split,cross_val_score
precisions = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
cv = 5,scoring = 'precision')
print("precision:\n",np.mean(precisions))
precision: 0.0217883289288
recalls = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
cv = 5,scoring = 'recall')
print("recall:\n",np.mean(recalls))
recall: 0.651692307692
f1 = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
cv = 5,scoring = 'f1')
print("f1 score:\n",np.mean(f1))
f1 score:
0.0421179159024
Compute the test F1 score
testLRW_y = lrW.predict(test_x.iloc[:,2:6])
precision_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'precision')
recall_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'recall')
f1_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'f1')
print("f1 score:\n",np.mean(f1_test))
f1 score:
0.0447302553442
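One caveat: `cross_val_score` re-fits the estimator on folds of whatever data it receives, so running it on the test set above actually trains on test data. A more direct check is to score the already-fitted model's predictions with `sklearn.metrics`; a sketch on synthetic stand-ins for the trainSet/testSet columns (in the notebook the real frames would be passed instead):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.RandomState(0)
# Synthetic stand-ins for the four feature columns and the rare label.
X_train, y_train = rng.rand(500, 4), (rng.rand(500) < 0.05).astype(float)
X_test, y_test = rng.rand(200, 4), (rng.rand(200) < 0.05).astype(float)

lrW = LogisticRegression(class_weight='balanced')
lrW.fit(X_train, y_train)

# Evaluate the fitted model on held-out data in one call, instead of
# letting cross_val_score re-fit on the test set.
p, r, f1 = precision_recall_fscore_support(
    y_test, lrW.predict(X_test), average='binary')[:3]
print(p, r, f1)
```

The features here are random noise, so the scores themselves are meaningless; the point is the evaluation pattern.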
step8: Predict user-item pairs for December 19
predict_x = dataDay_load.loc['2014-12-18']
predict_x.to_csv('predict_x.csv')
predict_x.info()
predict_x.describe()
| | user_id | item_id | type_1 | type_2 | type_3 | type_4 |
|---|---|---|---|---|---|---|
| count | 2.894900e+04 | 2.894900e+04 | 28949.000000 | 28949.000000 | 28949.000000 | 28949.000000 |
| mean | 7.057660e+07 | 2.041997e+08 | 2.172476 | 0.033784 | 0.065080 | 0.026978 |
| std | 4.636984e+07 | 1.167057e+08 | 1.317706 | 0.181628 | 0.254529 | 0.174941 |
| min | 1.342110e+05 | 2.934200e+04 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 2.902911e+07 | 1.037920e+08 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 5.540191e+07 | 2.049606e+08 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 1.178487e+08 | 3.065798e+08 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.424116e+08 | 4.045373e+08 | 16.000000 | 2.000000 | 3.000000 | 4.000000 |
predict_y = lrW.predict(predict_x.iloc[:,2:])
predict_y.sum()
4636.0
user_item_19 = predict_x.loc[predict_y > 0.0,['user_id','item_id']]
user_item_19.all()
user_id True
item_id True
dtype: bool
user_item_19.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4636 entries, 2014-12-18 to 2014-12-18
Data columns (total 2 columns):
user_id 4636 non-null int64
item_id 4636 non-null int64
dtypes: int64(2)
memory usage: 108.7 KB
user_item_19.duplicated().sum()
0
user_item_19.to_csv('E:/python/gbdt/predict/tianchi_mobile_recommendation_predict.csv',index = False,encoding = 'utf-8')
Applying other sklearn models
GBDT model
from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(random_state = 10)
gbdt.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
trainGBDT_y = gbdt.predict(trainSet.iloc[:,2:6])
trainGBDT_y.sum()
0.0
gbdt.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
0.99572607096710064
Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
trainRF_y = rf.predict(trainSet.iloc[:,2:6])
rf.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
0.99575920219991387
trainRF_y.sum()
1.0
trainRF_y
array([ 0., 0., 0., ..., 0., 0., 0.])
SVM
from sklearn import svm
svc = svm.SVC()
svc.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
trainSVC_y = svc.predict(trainSet.iloc[:,2:6])
svc.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
0.99572607096710064
trainSVC_y.sum()
0.0
Clearly, class imbalance has a large impact on these models.
On the class-imbalance problem and its remedies:
Introduction to and comparison of imbalanced-data classification algorithms
Ensemble learning and the class-imbalance problem
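Besides cost-sensitive weights, resampling the training set is another common remedy. A minimal undersampling sketch with `sklearn.utils.resample` (synthetic data, purely illustrative; oversampling the minority class or SMOTE-style synthesis work in the other direction):

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic imbalanced training frame: 10 positives out of 1000 rows.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(1000, 4),
                  columns=['type_1', 'type_2', 'type_3', 'type_4'])
df['labels'] = 0.0
df.loc[:9, 'labels'] = 1.0

majority = df[df['labels'] == 0.0]
minority = df[df['labels'] == 1.0]

# Undersample the majority class down to the minority size, keeping
# every positive example.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=10)
balanced = pd.concat([majority_down, minority])
print(balanced['labels'].value_counts())
```

Undersampling throws away most of the negatives, so on small datasets class weights or minority oversampling are usually preferred.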
On hyperparameter tuning:
Notes on tuning scikit-learn gradient boosted trees (GBDT)