A full walkthrough of the Tianchi mobile recommendation offline competition with python-pandas and sklearn

Competition overview

1) Tools: Python, pandas, sklearn

2) Approach: use the previous day's interaction statistics for each user-item pair to predict whether that pair makes a purchase today.

Note: the feature time window is one day, and each classification sample is a user-item pair, i.e. (user_id, item_id). This post mainly walks through the competition workflow and does not go into detail on feature construction, feature processing, or feature selection.

From this post you will get the full pipeline: downloading the data from the official site, processing it, extracting the daily counts of the four interaction types for every user-item pair, assembling the training and test sets, and training, testing, and predicting with a model. It also shows how to handle class imbalance with sklearn, so you get a real, hands-on feel for the Tianchi mobile recommendation competition.

Step 1: Inspect and process the user table

import pandas as pd
import numpy as np

%time userAll = pd.read_csv('E:/python/gbdt/fresh_comp_offline/tianchi_fresh_comp_train_user.csv',\
                      usecols = ['user_id','item_id','behavior_type','time'])
    Wall time: 14.2 s
userAll.head()  # preview the first five rows of the raw data
    user_id    item_id  behavior_type           time
0  10001082  285259775              1  2014-12-08 18
1  10001082    4368907              1  2014-12-12 12
2  10001082    4368907              1  2014-12-12 12
3  10001082   53616768              1  2014-12-02 15
4  10001082  151466952              1  2014-12-12 11
userAll.info()  # inspect the table's structure and memory usage
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 23291027 entries, 0 to 23291026
    Data columns (total 4 columns):
    user_id          int64
    item_id          int64
    behavior_type    int64
    time             object
    dtypes: int64(3), object(1)
    memory usage: 710.8+ MB
userAll.duplicated().sum()  # count duplicate rows; time is truncated to the hour, so identical rows are repeated interactions within the same hour and are kept
    11505107

Step 2: Download, inspect, and process the item-subset table

%time itemSub = pd.read_csv('tianchi_fresh_comp_train_item.csv',usecols = ['item_id'])
    Wall time: 428 ms
itemSub.item_id.is_unique  # check whether item_id values in the subset are unique
    False
itemSub.item_id.value_counts().head()  # count how many times each item_id repeats
    25013404     8724
    311093202    5999
    228198932    5597
    238357777    5522
    313822206    4517
    Name: item_id, dtype: int64
itemSub.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 620918 entries, 0 to 620917
    Data columns (total 1 columns):
    item_id    620918 non-null int64
    dtypes: int64(1)
    memory usage: 4.7 MB
itemSub.duplicated().sum()  # count duplicate rows
    198060
itemSet = itemSub[['item_id']].drop_duplicates()  # drop duplicate rows
itemSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 422858 entries, 0 to 620917
    Data columns (total 1 columns):
    item_id    422858 non-null int64
    dtypes: int64(1)
    memory usage: 6.5 MB

Step 3: Intersect the user table with the item subset

Since the user-item pairs to predict (which users buy which items) are restricted to the item subset, we can consider only the users' interactions with items in that subset when predicting user-item pairs.

Alternatively, you could use the full user table and predict user-item pairs from users' interaction behavior across item categories.

%time userSub = pd.merge(userAll,itemSet,on = 'item_id',how = 'inner')
    Wall time: 4.4 s
userSub.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2084859 entries, 0 to 2084858
    Data columns (total 4 columns):
    user_id          int64
    item_id          int64
    behavior_type    int64
    time             object
    dtypes: int64(3), object(1)
    memory usage: 79.5+ MB
userSub.head()
user_iditem_idbehavior_typetime
01000108227522168612014-12-03 01
11000108227522168612014-12-13 14
21000108227522168612014-12-08 07
31000108227522168612014-12-08 07
41000108227522168612014-12-08 00

Save this dataset to a csv file

%time userSub.to_csv('userSub.csv')
    Wall time: 4.53 s

Step 4: Process the time column

Reload userSub. (userSub is the main dataset we predict from, so it is worth persisting. The time values come back as plain strings; step 5 groups by that column directly, and a time-indexed view is built below just for inspection.)

%time userSub = pd.read_csv('userSub.csv',usecols = ['user_id','item_id','behavior_type','time'],parse_dates = True)
    Wall time: 14 s
userSub.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2084859 entries, 0 to 2084858
    Data columns (total 4 columns):
    user_id          int64
    item_id          int64
    behavior_type    int64
    time             object
    dtypes: int64(3), object(1)
    memory usage: 63.6+ MB
userSub.head()
user_iditem_idbehavior_typetime
01000108227522168612014-12-03 01
11000108227522168612014-12-13 14
21000108227522168612014-12-08 07
31000108227522168612014-12-08 07
41000108227522168612014-12-08 00
userSub_t = userSub.assign(time = pd.to_datetime(userSub['time'])).set_index('time').sort_index()  # a time-indexed, time-sorted view; userSub itself keeps time as a column for the groupbys in step 5
userSub_t.index
    DatetimeIndex(['2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   '2014-11-18 00:00:00', '2014-11-18 00:00:00',
                   ...
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00',
                   '2014-12-18 23:00:00', '2014-12-18 23:00:00'],
                  dtype='datetime64[ns]', name=u'time', length=2084859, freq=None)
userSub_t.head()
              user_id    item_id  behavior_type
time
2014-11-18  129403050   52900329              1
2014-11-18   23246977  353606633              1
2014-11-18  140763800  369393023              1
2014-11-18  140763800  187769381              1
2014-11-18   32363170  134081514              1
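The same time-indexed view can also be produced in one step at load time; index_col hands the parsed column straight to the index:

# Equivalent load: parse the time column and use it as the index directly
userSub_t = pd.read_csv('userSub.csv',usecols = ['user_id','item_id','behavior_type','time'],\
                        index_col = 'time',parse_dates = True).sort_index()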

Step 5: Feature processing

Feature processing has two parts:

1) one-hot (dummy) encode the user-item pairs' interaction types;

2) choose a time window and aggregate the interaction statistics over it (a multi-day sketch follows this list).
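Part 2 is applied below only at one-day granularity, but the same idea extends to longer windows. As a sketch (not used in the rest of this post), 3-day rolling counts per user-item pair could be built from the daily table dataDay constructed later in this step:

# Hypothetical 3-day rolling sums of the four interaction counts per
# user-item pair, from dataDay (time_day, user_id, item_id, type_1..type_4)
daily = dataDay.copy()
daily['time_day'] = pd.to_datetime(daily['time_day'])
rolled = (daily.sort_values(['user_id','item_id','time_day'])
               .set_index('time_day')
               .groupby(['user_id','item_id'])[['type_1','type_2','type_3','type_4']]
               .rolling('3D').sum()
               .reset_index())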

pd.get_dummies(userSub['behavior_type'],prefix = 'type').head()
   type_1  type_2  type_3  type_4
0     1.0     0.0     0.0     0.0
1     1.0     0.0     0.0     0.0
2     1.0     0.0     0.0     0.0
3     1.0     0.0     0.0     0.0
4     1.0     0.0     0.0     0.0
typeDummies = pd.get_dummies(userSub['behavior_type'],prefix = 'type')  # one-hot dummy encoding

userSubOneHot = pd.concat([userSub[['user_id','item_id','time']],typeDummies],axis = 1)
usertem = pd.concat([userSub[['user_id','item_id']],typeDummies,userSub[['time']]],axis = 1)  # append the dummy features to the table (same content as userSubOneHot, with time as the last column)
usertem.head()
    user_id    item_id  type_1  type_2  type_3  type_4           time
0  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-03 01
1  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-13 14
2  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 07
3  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 07
4  10001082  275221686     1.0     0.0     0.0     0.0  2014-12-08 00
usertem.groupby(['time','user_id','item_id'],as_index = False).sum().head()  # groupby sorts the keys and aggregates each user-item pair's interactions
             time   user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18 00    1409053   58649567     2.0     0.0     0.0     0.0
1  2014-11-18 00    1446949    2432119     3.0     0.0     0.0     0.0
2  2014-11-18 00    1446949   20683307     2.0     0.0     0.0     0.0
3  2014-11-18 00    1446949   34774563     1.0     0.0     0.0     0.0
4  2014-11-18 00   29035783   95200199     2.0     0.0     0.0     0.0
userSubOneHot.head()
    user_id    item_id           time  type_1  type_2  type_3  type_4
0  10001082  275221686  2014-12-03 01     1.0     0.0     0.0     0.0
1  10001082  275221686  2014-12-13 14     1.0     0.0     0.0     0.0
2  10001082  275221686  2014-12-08 07     1.0     0.0     0.0     0.0
3  10001082  275221686  2014-12-08 07     1.0     0.0     0.0     0.0
4  10001082  275221686  2014-12-08 00     1.0     0.0     0.0     0.0
userSubOneHot.info()
userSubOneHotGroup = userSubOneHot.groupby(['time','user_id','item_id'],as_index = False).sum()  # alternatively, group with the default index and call .reset_index() after sum()
userSubOneHotGroup.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 968243 entries, 0 to 968242
    Data columns (total 7 columns):
    time       968243 non-null object
    user_id    968243 non-null int64
    item_id    968243 non-null int64
    type_1     968243 non-null float64
    type_2     968243 non-null float64
    type_3     968243 non-null float64
    type_4     968243 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 59.1+ MB
userSubOneHotGroup.head()
             time   user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18 00    1409053   58649567     2.0     0.0     0.0     0.0
1  2014-11-18 00    1446949    2432119     3.0     0.0     0.0     0.0
2  2014-11-18 00    1446949   20683307     2.0     0.0     0.0     0.0
3  2014-11-18 00    1446949   34774563     1.0     0.0     0.0     0.0
4  2014-11-18 00   29035783   95200199     2.0     0.0     0.0     0.0

Split the timestamp into day and hour


#time_day_Series = userSubOneHotGroup.time.map(lambda x:x.split(' ')[0])

#time_hour_Series = userSubOneHotGroup.time.map(lambda x:x.split(' ')[1])

userSubOneHotGroup['time_day'] = pd.to_datetime(userSubOneHotGroup.time.values).date

userSubOneHotGroup['time_hour'] = pd.to_datetime(userSubOneHotGroup.time.values).time

userSubOneHotGroup.head()
             time   user_id    item_id  type_1  type_2  type_3  type_4    time_day time_hour
0  2014-11-18 00    1409053   58649567     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
1  2014-11-18 00    1446949    2432119     3.0     0.0     0.0     0.0  2014-11-18  00:00:00
2  2014-11-18 00    1446949   20683307     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
3  2014-11-18 00    1446949   34774563     1.0     0.0     0.0     0.0  2014-11-18  00:00:00
4  2014-11-18 00   29035783   95200199     2.0     0.0     0.0     0.0  2014-11-18  00:00:00
dataHour = userSubOneHotGroup.iloc[:,0:7]  # the hourly-frequency table: time, ids, and the four type counts
dataHour.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 968243 entries, 0 to 968242
    Data columns (total 7 columns):
    time       968243 non-null object
    user_id    968243 non-null int64
    item_id    968243 non-null int64
    type_1     968243 non-null float64
    type_2     968243 non-null float64
    type_3     968243 non-null float64
    type_4     968243 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 59.1+ MB
# save

dataHour.to_csv('dataHour.csv')
dataHour.duplicated().sum()  # no duplicate rows
    0
dataDay = userSubOneHotGroup.groupby(['time_day','user_id','item_id'],as_index = False)[['type_1','type_2','type_3','type_4']].sum()  # daily counts; selecting the type columns keeps non-numeric columns out of the sum
dataDay.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 904397 entries, 0 to 904396
    Data columns (total 7 columns):
    time_day    904397 non-null object
    user_id     904397 non-null int64
    item_id     904397 non-null int64
    type_1      904397 non-null float64
    type_2      904397 non-null float64
    type_3      904397 non-null float64
    type_4      904397 non-null float64
    dtypes: float64(4), int64(2), object(1)
    memory usage: 55.2+ MB
dataDay.head()
     time_day  user_id    item_id  type_1  type_2  type_3  type_4
0  2014-11-18      492   76093985     1.0     0.0     0.0     0.0
1  2014-11-18      492  110036513     2.0     0.0     0.0     0.0
2  2014-11-18      492  176404510     1.0     0.0     0.0     0.0
3  2014-11-18      492  178412255     2.0     0.0     0.0     0.0
4  2014-11-18      492  335961429     1.0     0.0     0.0     0.0
# save
dataDay.to_csv('dataDay.csv')
dataDay.duplicated().sum()  # no duplicate rows
    0
dataDay.type_4.max()
    20.0

Step 6: Build the training and test sets

This post works with the daily-frequency table and classifies whether each user-item pair makes a purchase: pairs with a purchase get label 1, the rest get label 0.
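The cells below pair Dec 16 features with Dec 17 purchase labels by hand. The same pairing, wrapped in a hypothetical helper over dataDay_load (loaded just below), makes it easy to slide the window by one day:

# Hypothetical helper: features from one day, purchase labels from the next
def make_day_set(data, feature_day, label_day):
    x = data.loc[feature_day]                                # feature rows for day d
    y = data.loc[label_day,['user_id','item_id','type_4']]   # purchases on day d+1
    s = pd.merge(x, y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)
    s['labels'] = (s['type_4_y'] > 0.0).astype(float)        # 1.0 if any purchase happened
    return s

# e.g. trainSet = make_day_set(dataDay_load,'2014-12-16','2014-12-17')
#      testSet  = make_day_set(dataDay_load,'2014-12-17','2014-12-18')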

dataDay_load = pd.read_csv('dataDay.csv',usecols = ['time_day','user_id','item_id','type_1',\
                                                    'type_2','type_3','type_4'], index_col = 'time_day',parse_dates = True)
dataDay_load.head()
            user_id    item_id  type_1  type_2  type_3  type_4
time_day
2014-11-18      492   76093985     1.0     0.0     0.0     0.0
2014-11-18      492  110036513     2.0     0.0     0.0     0.0
2014-11-18      492  176404510     1.0     0.0     0.0     0.0
2014-11-18      492  178412255     2.0     0.0     0.0     0.0
2014-11-18      492  335961429     1.0     0.0     0.0     0.0
dataDay_load.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 904397 entries, 2014-11-18 to 2014-12-18
    Data columns (total 6 columns):
    user_id    904397 non-null int64
    item_id    904397 non-null int64
    type_1     904397 non-null float64
    type_2     904397 non-null float64
    type_3     904397 non-null float64
    type_4     904397 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 48.3 MB
train_x = dataDay_load.loc['2014-12-16']  # Dec 16 interactions form the feature set
train_x.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 30183 entries, 2014-12-16 to 2014-12-16
    Data columns (total 6 columns):
    user_id    30183 non-null int64
    item_id    30183 non-null int64
    type_1     30183 non-null float64
    type_2     30183 non-null float64
    type_3     30183 non-null float64
    type_4     30183 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 1.6 MB
train_x.describe()
            user_id       item_id        type_1        type_2        type_3        type_4
count  3.018300e+04  3.018300e+04  30183.000000  30183.000000  30183.000000  30183.000000
mean   7.186918e+07  2.032869e+08      2.181890      0.036776      0.058278      0.026803
std    4.595509e+07  1.172341e+08      1.352044      0.189442      0.241651      0.174335
min    5.943600e+04  1.540200e+04      0.000000      0.000000      0.000000      0.000000
25%    3.000949e+07  1.014034e+08      1.000000      0.000000      0.000000      0.000000
50%    5.858117e+07  2.036895e+08      2.000000      0.000000      0.000000      0.000000
75%    1.178801e+08  3.056362e+08      3.000000      0.000000      0.000000      0.000000
max    1.424396e+08  4.045617e+08     17.000000      2.000000      4.000000      5.000000
train_y = dataDay_load.loc['2014-12-17',['user_id','item_id','type_4']]  # Dec 17 purchases serve as the classification labels
train_y.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
    Data columns (total 3 columns):
    user_id    29749 non-null int64
    item_id    29749 non-null int64
    type_4     29749 non-null float64
    dtypes: float64(1), int64(2)
    memory usage: 929.7 KB
train_y.describe()
            user_id       item_id        type_4
count  2.974900e+04  2.974900e+04  29749.000000
mean   6.997416e+07  2.016876e+08      0.024202
std    4.685978e+07  1.170012e+08      0.165070
min    5.943600e+04  6.619000e+03      0.000000
25%    2.783149e+07  9.903570e+07      0.000000
50%    5.562218e+07  2.005868e+08      0.000000
75%    1.176616e+08  3.039699e+08      0.000000
max    1.424157e+08  4.045616e+08      4.000000
dataSet = pd.merge(train_x,train_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)  # join features and labels into the training set
dataSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30183 entries, 0 to 30182
    Data columns (total 7 columns):
    user_id     30183 non-null int64
    item_id     30183 non-null int64
    type_1      30183 non-null float64
    type_2      30183 non-null float64
    type_3      30183 non-null float64
    type_4_x    30183 non-null float64
    type_4_y    30183 non-null float64
    dtypes: float64(5), int64(2)
    memory usage: 1.8 MB
dataSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y
count  3.018300e+04  3.018300e+04  30183.000000  30183.000000  30183.000000  30183.000000  30183.000000
mean   7.186918e+07  2.032869e+08      2.181890      0.036776      0.058278      0.026803      0.004705
std    4.595509e+07  1.172341e+08      1.352044      0.189442      0.241651      0.174335      0.075343
min    5.943600e+04  1.540200e+04      0.000000      0.000000      0.000000      0.000000      0.000000
25%    3.000949e+07  1.014034e+08      1.000000      0.000000      0.000000      0.000000      0.000000
50%    5.858117e+07  2.036895e+08      2.000000      0.000000      0.000000      0.000000      0.000000
75%    1.178801e+08  3.056362e+08      3.000000      0.000000      0.000000      0.000000      0.000000
max    1.424396e+08  4.045617e+08     17.000000      2.000000      4.000000      5.000000      3.000000
np.sign(dataSet.type_4_y.values).sum()
    129.0
np.sign(0.0)
    0.0
dataSet['labels'] = dataSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
dataSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 30183 entries, 0 to 30182
    Data columns (total 8 columns):
    user_id     30183 non-null int64
    item_id     30183 non-null int64
    type_1      30183 non-null float64
    type_2      30183 non-null float64
    type_3      30183 non-null float64
    type_4_x    30183 non-null float64
    type_4_y    30183 non-null float64
    labels      30183 non-null float64
    dtypes: float64(6), int64(2)
    memory usage: 2.1 MB
dataSet.head()
   user_id    item_id  type_1  type_2  type_3  type_4_x  type_4_y  labels
0    59436  184081436     4.0     0.0     0.0       0.0       0.0     0.0
1    61797   83261906     3.0     0.0     0.0       0.0       0.0     0.0
2   134211    6491625     2.0     0.0     0.0       0.0       0.0     0.0
3   134211   79679783     2.0     0.0     0.0       0.0       0.0     0.0
4   134211   96616269     2.0     0.0     0.0       0.0       0.0     0.0
np.sign(dataSet.type_3.values).sum()  # user-item pairs with an add-to-cart interaction
    1713.0
trainSet = dataSet.copy()  # rename and save the training set

trainSet.to_csv('trainSet.csv')
test_x = dataDay_load.loc['2014-12-17']  # Dec 17 features serve as the test inputs
test_x.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 29749 entries, 2014-12-17 to 2014-12-17
    Data columns (total 6 columns):
    user_id    29749 non-null int64
    item_id    29749 non-null int64
    type_1     29749 non-null float64
    type_2     29749 non-null float64
    type_3     29749 non-null float64
    type_4     29749 non-null float64
    dtypes: float64(4), int64(2)
    memory usage: 1.6 MB
test_x.head()
            user_id    item_id  type_1  type_2  type_3  type_4
time_day
2014-12-17    59436  238861461     2.0     0.0     0.0     0.0
2014-12-17    60723  202829025     2.0     0.0     0.0     0.0
2014-12-17    60723  371933634     2.0     0.0     0.0     0.0
2014-12-17   106362   38830684     1.0     0.0     0.0     0.0
2014-12-17   106362  149517272     2.0     0.0     0.0     0.0
test_y = dataDay_load.loc['2014-12-18',['user_id','item_id','type_4']]  # Dec 18 purchases serve as the test labels
testSet = pd.merge(test_x,test_y, on = ['user_id','item_id'],suffixes=('_x','_y'), how = 'left').fillna(0.0)  # assemble the test set
testSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 29749 entries, 0 to 29748
    Data columns (total 7 columns):
    user_id     29749 non-null int64
    item_id     29749 non-null int64
    type_1      29749 non-null float64
    type_2      29749 non-null float64
    type_3      29749 non-null float64
    type_4_x    29749 non-null float64
    type_4_y    29749 non-null float64
    dtypes: float64(5), int64(2)
    memory usage: 1.8 MB
testSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y
count  2.974900e+04  2.974900e+04  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000
mean   6.997416e+07  2.016876e+08      2.168241      0.038153      0.059296      0.024202      0.004336
std    4.685978e+07  1.170012e+08      1.334966      0.192093      0.242364      0.165070      0.069681
min    5.943600e+04  6.619000e+03      0.000000      0.000000      0.000000      0.000000      0.000000
25%    2.783149e+07  9.903570e+07      1.000000      0.000000      0.000000      0.000000      0.000000
50%    5.562218e+07  2.005868e+08      2.000000      0.000000      0.000000      0.000000      0.000000
75%    1.176616e+08  3.039699e+08      3.000000      0.000000      0.000000      0.000000      0.000000
max    1.424157e+08  4.045616e+08     17.000000      2.000000      3.000000      4.000000      3.000000
testSet['labels'] = testSet.type_4_y.map(lambda x: 1.0 if x > 0.0 else 0.0 )
testSet.describe()
            user_id       item_id        type_1        type_2        type_3      type_4_x      type_4_y        labels
count  2.974900e+04  2.974900e+04  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000  29749.000000
mean   6.997416e+07  2.016876e+08      2.168241      0.038153      0.059296      0.024202      0.004336      0.004101
std    4.685978e+07  1.170012e+08      1.334966      0.192093      0.242364      0.165070      0.069681      0.063909
min    5.943600e+04  6.619000e+03      0.000000      0.000000      0.000000      0.000000      0.000000      0.000000
25%    2.783149e+07  9.903570e+07      1.000000      0.000000      0.000000      0.000000      0.000000      0.000000
50%    5.562218e+07  2.005868e+08      2.000000      0.000000      0.000000      0.000000      0.000000      0.000000
75%    1.176616e+08  3.039699e+08      3.000000      0.000000      0.000000      0.000000      0.000000      0.000000
max    1.424157e+08  4.045616e+08     17.000000      2.000000      3.000000      4.000000      3.000000      1.000000
testSet.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 29749 entries, 0 to 29748
    Data columns (total 8 columns):
    user_id     29749 non-null int64
    item_id     29749 non-null int64
    type_1      29749 non-null float64
    type_2      29749 non-null float64
    type_3      29749 non-null float64
    type_4_x    29749 non-null float64
    type_4_y    29749 non-null float64
    labels      29749 non-null float64
    dtypes: float64(6), int64(2)
    memory usage: 2.0 MB
testSet['labels'].values.sum()  # 122 purchase examples
    122.0
testSet.to_csv('testSet.csv')

Step 7: Train models

Logistic regression model

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

model.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])  # features: the four interaction counts; target: labels

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)
model.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
    0.99565980850147429
train_y_est = model.predict(trainSet.iloc[:,2:6])
train_y_est.sum()
    2.0

Weighted logistic regression (cost-sensitive weighting for class imbalance)

lrW = LogisticRegression(class_weight = 'balanced')  # 'balanced' (formerly 'auto') reweights the classes to counter the sample imbalance
lrW.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])

trainLRW_y = lrW.predict(trainSet.iloc[:,2:6])

trainLRW_y.sum()
    4792.0
lrW.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
    0.84292482523274692
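Under the hood, class_weight='balanced' weights each class by n_samples / (n_classes * class_count), so the 129 positive pairs in trainSet carry roughly the same total weight as the ~30,000 negatives. The implied weights can be inspected directly:

# Inspect the per-class weights implied by class_weight='balanced'
from sklearn.utils.class_weight import compute_class_weight
y = trainSet.iloc[:,-1].values
classes = np.unique(y)
weights = compute_class_weight(class_weight = 'balanced',classes = classes,y = y)
print(dict(zip(classes, weights)))  # roughly {0.0: 0.50, 1.0: 117} for 129 positives out of 30183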

Compute training precision and recall

from sklearn.model_selection import train_test_split,cross_val_score
#precision

precisions = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
                             cv = 5,scoring = 'precision')

print("Precision:\n",np.mean(precisions))
    Precision: 0.0217883289288
#recall

recalls = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
                             cv = 5,scoring = 'recall')

print("Recall:\n",np.mean(recalls))
    Recall: 0.651692307692
#compute the combined f1 metric

f1 = cross_val_score(lrW,trainSet.iloc[:,2:6],trainSet.iloc[:,-1],\
                             cv = 5,scoring = 'f1')
print('F1 score:\n',np.mean(f1))
    F1 score:
    0.0421179159024

Compute the test f1 score

testLRW_y = lrW.predict(test_x.iloc[:,2:6])
precision_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'precision')

recall_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'recall')

f1_test = cross_val_score(lrW,testSet.iloc[:,2:6],testSet.iloc[:,-1],cv = 5,scoring = 'f1')

print('F1 score:\n',np.mean(f1_test))
    F1 score:
    0.0447302553442
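Note that cross_val_score here re-fits lrW on folds of the test set rather than scoring the model trained on the Dec 16 data. A more direct check (a sketch using sklearn.metrics) evaluates the already-fitted model on the held-out test features:

# Score the already-fitted weighted model directly on the test set
from sklearn.metrics import precision_score,recall_score,f1_score
test_pred = lrW.predict(testSet.iloc[:,2:6])  # type_1..type_4_x, the same features as training
test_true = testSet['labels'].values
print('precision:',precision_score(test_true,test_pred))
print('recall:',recall_score(test_true,test_pred))
print('f1:',f1_score(test_true,test_pred))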

Step 8: Predict user-item pairs for Dec 19

# build the input data

predict_x = dataDay_load.loc['2014-12-18']

predict_x.to_csv('predict_x.csv')
predict_x.info()
predict_x.describe()
            user_id       item_id        type_1        type_2        type_3        type_4
count  2.894900e+04  2.894900e+04  28949.000000  28949.000000  28949.000000  28949.000000
mean   7.057660e+07  2.041997e+08      2.172476      0.033784      0.065080      0.026978
std    4.636984e+07  1.167057e+08      1.317706      0.181628      0.254529      0.174941
min    1.342110e+05  2.934200e+04      0.000000      0.000000      0.000000      0.000000
25%    2.902911e+07  1.037920e+08      1.000000      0.000000      0.000000      0.000000
50%    5.540191e+07  2.049606e+08      2.000000      0.000000      0.000000      0.000000
75%    1.178487e+08  3.065798e+08      3.000000      0.000000      0.000000      0.000000
max    1.424116e+08  4.045373e+08     16.000000      2.000000      3.000000      4.000000
# predict

predict_y = lrW.predict(predict_x.iloc[:,2:])
predict_y.sum()  # the model predicts 4636 user-item pairs will purchase
    4636.0
user_item_19 = predict_x.loc[predict_y > 0.0,['user_id','item_id']]  # keep the pairs predicted to purchase (label 1) as the final submission
user_item_19.all()
    user_id    True
    item_id    True
    dtype: bool

user_item_19.info()
    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 4636 entries, 2014-12-18 to 2014-12-18
    Data columns (total 2 columns):
    user_id    4636 non-null int64
    item_id    4636 non-null int64
    dtypes: int64(2)
    memory usage: 108.7 KB
user_item_19.duplicated().sum()  # no duplicate rows

    0
# save

user_item_19.to_csv('E:/python/gbdt/predict/tianchi_mobile_recommendation_predict.csv',index = False,encoding = 'utf-8')

Applying other sklearn models

GBDT model

from sklearn.ensemble import GradientBoostingClassifier
gbdt = GradientBoostingClassifier(random_state = 10)
gbdt.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
trainGBDT_y = gbdt.predict(trainSet.iloc[:,2:6])
trainGBDT_y.sum()
    0.0
gbdt.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
    0.99572607096710064

Random forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])

trainRF_y = rf.predict(trainSet.iloc[:,2:6])
rf.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
    0.99575920219991387
trainRF_y.sum()
    1.0
trainRF_y
    array([ 0.,  0.,  0., ...,  0.,  0.,  0.])

SVM

from sklearn import svm
svc = svm.SVC()
svc.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])

trainSVC_y = svc.predict(trainSet.iloc[:,2:6])
svc.score(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
    0.99572607096710064
trainSVC_y.sum()
    0.0

Clearly, class imbalance hits these models hard: trained on the raw data, each of them predicts essentially no purchases.
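A cost-free first step is passing class_weight='balanced' to RandomForestClassifier or SVC as well. Resampling is another common remedy; here is a minimal random-oversampling sketch in plain pandas (rf_bal is a hypothetical refit, not a result reported above):

# Random oversampling: duplicate positive pairs until the classes are
# roughly balanced, then refit a model on the balanced frame
pos = trainSet[trainSet['labels'] == 1.0]
neg = trainSet[trainSet['labels'] == 0.0]
pos_over = pos.sample(n = len(neg),replace = True,random_state = 10)
balanced = pd.concat([neg,pos_over]).sample(frac = 1.0,random_state = 10)  # shuffle
rf_bal = RandomForestClassifier(random_state = 10)
rf_bal.fit(balanced.iloc[:,2:6],balanced.iloc[:,-1])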

More on class imbalance and how to address it:

An introduction to and comparison of algorithms for imbalanced data classification

Ensemble learning and the class-imbalance problem

On hyperparameter tuning:

Notes on tuning scikit-learn's gradient boosted trees (GBDT)
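As a starting point, here is a small grid search over the GBDT trained above (a sketch; the parameter ranges are illustrative, not taken from the linked post):

# Illustrative grid search over a few GBDT hyperparameters, scored by f1
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators':[50,100,200],'max_depth':[2,3,4],'learning_rate':[0.05,0.1]}
search = GridSearchCV(GradientBoostingClassifier(random_state = 10),param_grid,scoring = 'f1',cv = 5)
search.fit(trainSet.iloc[:,2:6],trainSet.iloc[:,-1])
print(search.best_params_,search.best_score_)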
