First, the function.
Purpose: generate the sample set and split it into a training set and a validation set.
def valid_train_set_construct(valid_ratio=0.5, valid_sub_ratio=0.5, train_np_ratio=1, train_sub_ratio=0.5):
    """Generate the training set and the validation set.

    @param valid_ratio:     float in [0, 1], fraction of the total set used as the
                            validation set; the rest becomes the training set
    @param valid_sub_ratio: float in (0, 1], random sub-sample ratio applied to the
                            validation set
    @param train_np_ratio:  value in (1, 1200), target N/P ratio used to sub-sample
                            negatives so the training set is balanced
    @param train_sub_ratio: float in (0, 1], random sub-sample ratio applied to the
                            training set after the N/P sub-sampling
    @return valid_X, valid_y, train_X, train_y
    """
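A typical call, matching how the tuning code later in this section invokes it:

valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio = 0.2, valid_sub_ratio = 1, train_np_ratio = 5, train_sub_ratio = 1)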
The total data set is first split according to the valid_ratio parameter. The split is random: np.random.rand produces an array of values uniformly distributed over [0, 1), one entry per row of the clustered label table (indexed by user_id and item_id, together with the item category and the purchase label), and the rows whose value falls below valid_ratio become the validation sample. Because the random values are roughly uniform, the selected fraction is close to valid_ratio, as required. This handles the sampling of the negative examples; the code is as follows:
msk_1 = np.random.rand(len(df_part_1_uic_label_cluster)) < valid_ratio
msk_2 = np.random.rand(len(df_part_2_uic_label_cluster)) < valid_ratio
valid_df_part_1_uic_label_cluster = df_part_1_uic_label_cluster.loc[msk_1]
valid_df_part_2_uic_label_cluster = df_part_2_uic_label_cluster.loc[msk_2]
Note that the validation split must cover not only the clusters of negative examples but also cluster 0, which holds the positive examples; the random-threshold mask above already splits cluster 0 along with the rest. The positives are then sub-sampled a second time with sample:
# sub-sample the positives: clustering put every positive into cluster 0 and kept
# them all, so they are down-sampled here
valid_part_1_uic_label = valid_df_part_1_uic_label_cluster[ valid_df_part_1_uic_label_cluster['class'] == 0 ].sample(frac = valid_sub_ratio)
valid_part_2_uic_label = valid_df_part_2_uic_label_cluster[ valid_df_part_2_uic_label_cluster['class'] == 0 ].sample(frac = valid_sub_ratio)
With the whole sample set labeled, each cluster (already "balanced" during clustering) is sampled in turn with sample and the pieces are concatenated. The code is as follows:
## constructing valid set
for i in range(1, 1001, 1):
    valid_part_1_uic_label_0_i = valid_df_part_1_uic_label_cluster[valid_df_part_1_uic_label_cluster['class'] == i]
    if len(valid_part_1_uic_label_0_i) != 0:
        valid_part_1_uic_label_0_i = valid_part_1_uic_label_0_i.sample(frac = valid_sub_ratio)
        valid_part_1_uic_label = pd.concat([valid_part_1_uic_label, valid_part_1_uic_label_0_i])

    valid_part_2_uic_label_0_i = valid_df_part_2_uic_label_cluster[valid_df_part_2_uic_label_cluster['class'] == i]
    if len(valid_part_2_uic_label_0_i) != 0:
        valid_part_2_uic_label_0_i = valid_part_2_uic_label_0_i.sample(frac = valid_sub_ratio)
        valid_part_2_uic_label = pd.concat([valid_part_2_uic_label, valid_part_2_uic_label_0_i])
The sampled uic_label table is then joined with the previously built feature tables, using user_id, item_id, item_category and their combinations as the join keys; finally the df_part_1 and df_part_2 results are concatenated. The code is as follows:
valid_part_1_df = pd.merge(valid_part_1_uic_label, df_part_1_U, how='left', on=['user_id'])
valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_I, how='left', on=['item_id'])
valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_C, how='left', on=['item_category'])
valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_IC, how='left', on=['item_id','item_category'])
valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_UI, how='left', on=['user_id','item_id','item_category','label'])
valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_UC, how='left', on=['user_id','item_category'])
valid_part_2_df = pd.merge(valid_part_2_uic_label, df_part_2_U, how='left', on=['user_id'])
valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_I, how='left', on=['item_id'])
valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_C, how='left', on=['item_category'])
valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_IC, how='left', on=['item_id','item_category'])
valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_UI, how='left', on=['user_id','item_id','item_category','label'])
valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_UC, how='left', on=['user_id','item_category'])
Sample format
After feature construction, the feature groups (U, I, C, UI, UC, IC) are merged to produce the data needed for training and prediction. A data sample has the following format:
| | Index | Features | Label |
|---|---|---|---|
| one sample row | user_id, item_id | ~100 feature values | classification result (0 = not purchased, 1 = purchased) |
Once the sample set is ready, model training and prediction can begin.
(N.B. The generated data reaches the 10 GB scale. Given the limited compute and storage of a single machine, the example program makes heavy use of chunked operations; an HDFS + MapReduce implementation is another option.)
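The chunked operations mentioned above typically rely on pandas' chunksize reader. A minimal sketch of the pattern follows; path_df and the per-chunk work are placeholders, not taken from the example program:

import pandas as pd

path_df = 'features_part.csv'  # placeholder path, not from the original pipeline
chunks = []
for chunk in pd.read_csv(path_df, chunksize=100000):  # stream 100k rows at a time
    # per-chunk filtering / feature computation would go here
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)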
Finally, the joined validation tables are concatenated into the training-sample format:

valid_df = pd.concat([valid_part_1_df, valid_part_2_df])
# fill the missing values with -1 (the missing values are time features);
# -1 is the usual sentinel for RF / GBDT models
valid_df.fillna(-1, inplace=True)
The validation set is now essentially built; the last step is to convert it into a matrix and separate the feature columns from the label column.
# using all the features for the valid rf model
# (DataFrame.as_matrix was removed in newer pandas; select the columns and take .values)
valid_X = valid_df[['u_b1_count_in_6','u_b2_count_in_6','u_b3_count_in_6','u_b4_count_in_6','u_b_count_in_6',
'u_b1_count_in_3','u_b2_count_in_3','u_b3_count_in_3','u_b4_count_in_3','u_b_count_in_3',
'u_b1_count_in_1','u_b2_count_in_1','u_b3_count_in_1','u_b4_count_in_1','u_b_count_in_1',
'u_b4_rate','u_b4_diff_hours',
'i_u_count_in_6','i_u_count_in_3','i_u_count_in_1',
'i_b1_count_in_6','i_b2_count_in_6','i_b3_count_in_6','i_b4_count_in_6','i_b_count_in_6',
'i_b1_count_in_3','i_b2_count_in_3','i_b3_count_in_3','i_b4_count_in_3','i_b_count_in_3',
'i_b1_count_in_1','i_b2_count_in_1','i_b3_count_in_1','i_b4_count_in_1','i_b_count_in_1',
'i_b4_rate','i_b4_diff_hours',
'c_u_count_in_6','c_u_count_in_3','c_u_count_in_1',
'c_b1_count_in_6','c_b2_count_in_6','c_b3_count_in_6','c_b4_count_in_6','c_b_count_in_6',
'c_b1_count_in_3','c_b2_count_in_3','c_b3_count_in_3','c_b4_count_in_3','c_b_count_in_3',
'c_b1_count_in_1','c_b2_count_in_1','c_b3_count_in_1','c_b4_count_in_1','c_b_count_in_1',
'c_b4_rate','c_b4_diff_hours',
'ic_u_rank_in_c','ic_b_rank_in_c','ic_b4_rank_in_c',
'ui_b1_count_in_6','ui_b2_count_in_6','ui_b3_count_in_6','ui_b4_count_in_6','ui_b_count_in_6',
'ui_b1_count_in_3','ui_b2_count_in_3','ui_b3_count_in_3','ui_b4_count_in_3','ui_b_count_in_3',
'ui_b1_count_in_1','ui_b2_count_in_1','ui_b3_count_in_1','ui_b4_count_in_1','ui_b_count_in_1',
'ui_b_count_rank_in_u','ui_b_count_rank_in_uc',
'ui_b1_last_hours','ui_b2_last_hours','ui_b3_last_hours','ui_b4_last_hours',
'uc_b1_count_in_6','uc_b2_count_in_6','uc_b3_count_in_6','uc_b4_count_in_6','uc_b_count_in_6',
'uc_b1_count_in_3','uc_b2_count_in_3','uc_b3_count_in_3','uc_b4_count_in_3','uc_b_count_in_3',
'uc_b1_count_in_1','uc_b2_count_in_1','uc_b3_count_in_1','uc_b4_count_in_1','uc_b_count_in_1',
'uc_b_count_rank_in_u',
'uc_b1_last_hours','uc_b2_last_hours','uc_b3_last_hours','uc_b4_last_hours']].values
Building the label column vector:
valid_y = valid_df['label'].values
At this point valid_X and valid_y are complete.
The training set is constructed much like the validation set above, except its split fraction is 1 minus the validation ratio, hence the following two lines:
### constructing training set
train_df_part_1_uic_label_cluster = df_part_1_uic_label_cluster.loc[~msk_1]
train_df_part_2_uic_label_cluster = df_part_2_uic_label_cluster.loc[~msk_2]
Again, cluster 0, i.e. the positives left after the split above, is randomly sub-sampled with sample:
train_part_1_uic_label = train_df_part_1_uic_label_cluster[ train_df_part_1_uic_label_cluster['class'] == 0 ].sample(frac = train_sub_ratio)
train_part_2_uic_label = train_df_part_2_uic_label_cluster[ train_df_part_2_uic_label_cluster['class'] == 0 ].sample(frac = train_sub_ratio)
Next, the negatives are sampled from their clusters. The per-cluster sampling fraction is the product of the N/P rebalancing factor and the sub-sample ratio, i.e. frac_ratio = train_sub_ratio * train_np_ratio / 1200 (1200 being the upper end of train_np_ratio's range in the parameter list above). For example, with train_np_ratio = 50 and train_sub_ratio = 0.5, frac_ratio = 0.5 * 50 / 1200 ≈ 0.021, so about 2.1% of each negative cluster is kept. The clusters are visited one by one:
frac_ratio = float(train_sub_ratio) * float(train_np_ratio)/float(1200)
for i in range(1, 1001, 1):
    train_part_1_uic_label_0_i = train_df_part_1_uic_label_cluster[train_df_part_1_uic_label_cluster['class'] == i]
    if len(train_part_1_uic_label_0_i) != 0:
        train_part_1_uic_label_0_i = train_part_1_uic_label_0_i.sample(frac = frac_ratio)
        train_part_1_uic_label = pd.concat([train_part_1_uic_label, train_part_1_uic_label_0_i])

    train_part_2_uic_label_0_i = train_df_part_2_uic_label_cluster[train_df_part_2_uic_label_cluster['class'] == i]
    if len(train_part_2_uic_label_0_i) != 0:
        train_part_2_uic_label_0_i = train_part_2_uic_label_0_i.sample(frac = frac_ratio)
        train_part_2_uic_label = pd.concat([train_part_2_uic_label, train_part_2_uic_label_0_i])
Next, the feature tables are joined on the index keys to assemble samples in the feature format described above:
# constructing training set
train_part_1_df = pd.merge(train_part_1_uic_label, df_part_1_U, how='left', on=['user_id'])
train_part_1_df = pd.merge(train_part_1_df, df_part_1_I, how='left', on=['item_id'])
train_part_1_df = pd.merge(train_part_1_df, df_part_1_C, how='left', on=['item_category'])
train_part_1_df = pd.merge(train_part_1_df, df_part_1_IC, how='left', on=['item_id','item_category'])
train_part_1_df = pd.merge(train_part_1_df, df_part_1_UI, how='left', on=['user_id','item_id','item_category','label'])
train_part_1_df = pd.merge(train_part_1_df, df_part_1_UC, how='left', on=['user_id','item_category'])
train_part_2_df = pd.merge(train_part_2_uic_label, df_part_2_U, how='left', on=['user_id'])
train_part_2_df = pd.merge(train_part_2_df, df_part_2_I, how='left', on=['item_id'])
train_part_2_df = pd.merge(train_part_2_df, df_part_2_C, how='left', on=['item_category'])
train_part_2_df = pd.merge(train_part_2_df, df_part_2_IC, how='left', on=['item_id','item_category'])
train_part_2_df = pd.merge(train_part_2_df, df_part_2_UI, how='left', on=['user_id','item_id','item_category','label'])
train_part_2_df = pd.merge(train_part_2_df, df_part_2_UC, how='left', on=['user_id','item_category'])
train_df = pd.concat([train_part_1_df, train_part_2_df])
# fill the missing value as -1 (missing value are time features)
train_df.fillna(-1, inplace=True)
# using all the features for training rf model
# (as above, column selection + .values replaces the removed as_matrix)
train_X = train_df[['u_b1_count_in_6','u_b2_count_in_6','u_b3_count_in_6','u_b4_count_in_6','u_b_count_in_6',
'u_b1_count_in_3','u_b2_count_in_3','u_b3_count_in_3','u_b4_count_in_3','u_b_count_in_3',
'u_b1_count_in_1','u_b2_count_in_1','u_b3_count_in_1','u_b4_count_in_1','u_b_count_in_1',
'u_b4_rate','u_b4_diff_hours',
'i_u_count_in_6','i_u_count_in_3','i_u_count_in_1',
'i_b1_count_in_6','i_b2_count_in_6','i_b3_count_in_6','i_b4_count_in_6','i_b_count_in_6',
'i_b1_count_in_3','i_b2_count_in_3','i_b3_count_in_3','i_b4_count_in_3','i_b_count_in_3',
'i_b1_count_in_1','i_b2_count_in_1','i_b3_count_in_1','i_b4_count_in_1','i_b_count_in_1',
'i_b4_rate','i_b4_diff_hours',
'c_u_count_in_6','c_u_count_in_3','c_u_count_in_1',
'c_b1_count_in_6','c_b2_count_in_6','c_b3_count_in_6','c_b4_count_in_6','c_b_count_in_6',
'c_b1_count_in_3','c_b2_count_in_3','c_b3_count_in_3','c_b4_count_in_3','c_b_count_in_3',
'c_b1_count_in_1','c_b2_count_in_1','c_b3_count_in_1','c_b4_count_in_1','c_b_count_in_1',
'c_b4_rate','c_b4_diff_hours',
'ic_u_rank_in_c','ic_b_rank_in_c','ic_b4_rank_in_c',
'ui_b1_count_in_6','ui_b2_count_in_6','ui_b3_count_in_6','ui_b4_count_in_6','ui_b_count_in_6',
'ui_b1_count_in_3','ui_b2_count_in_3','ui_b3_count_in_3','ui_b4_count_in_3','ui_b_count_in_3',
'ui_b1_count_in_1','ui_b2_count_in_1','ui_b3_count_in_1','ui_b4_count_in_1','ui_b_count_in_1',
'ui_b_count_rank_in_u','ui_b_count_rank_in_uc',
'ui_b1_last_hours','ui_b2_last_hours','ui_b3_last_hours','ui_b4_last_hours',
'uc_b1_count_in_6','uc_b2_count_in_6','uc_b3_count_in_6','uc_b4_count_in_6','uc_b_count_in_6',
'uc_b1_count_in_3','uc_b2_count_in_3','uc_b3_count_in_3','uc_b4_count_in_3','uc_b_count_in_3',
'uc_b1_count_in_1','uc_b2_count_in_1','uc_b3_count_in_1','uc_b4_count_in_1','uc_b_count_in_1',
'uc_b_count_rank_in_u',
'uc_b1_last_hours','uc_b2_last_hours','uc_b3_last_hours','uc_b4_last_hours']].values
train_y = train_df['label'].values
print("train subset is generated.")
At this point the training and validation sets are complete: valid_X, valid_y, train_X, train_y.
With the training and validation sets built, the next task is parameter tuning. Parameter tuning matters a great deal when training an RF (or GBDT). The parameters of an ensemble model are generally divided into two groups: process parameters and base-learner parameters; the usual practice is to tune the process parameters first (e.g. the number of base learners n_estimators for RF) and then the base-learner parameters (e.g. the decision tree's maximum depth max_depth).
Four parameters are tuned here, yielding four optimal values. The steps are:
(1). selection for best N/P ratio of subsample
(2). selection for best n_estimators for RF
(3). selection for best max_depth & min_samples_split & min_samples_leaf for RF (no code is shown for this step; see the sketch after this list)
(4). selection for best prediction cutoff for RF
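Since the source shows code for steps (1), (2) and (4) only, here is a minimal sketch of what step (3) could look like, reusing valid_train_set_construct and the same F1 evaluation loop; the candidate parameter values are illustrative assumptions, not the author's:

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio = 0.2, valid_sub_ratio = 1, train_np_ratio = 5, train_sub_ratio = 1)
best_f1, best_params = -1.0, None
for md in [10, 20, 30, 40]:           # candidate max_depth values (assumed)
    for mss in [2, 10, 50]:           # candidate min_samples_split values (assumed)
        for msl in [1, 5, 20]:        # candidate min_samples_leaf values (assumed)
            clf = RandomForestClassifier(n_estimators=100, max_depth=md,
                                         min_samples_split=mss, min_samples_leaf=msl,
                                         max_features="sqrt")
            clf.fit(train_X, train_y)
            f1 = metrics.f1_score(valid_y, clf.predict(valid_X))
            if f1 > best_f1:
                best_f1, best_params = f1, (md, mss, msl)
print('best f1 = %.4f at (max_depth, min_samples_split, min_samples_leaf) = %s' % (best_f1, best_params))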
First: tuning the N/P (negative-to-positive) sample imbalance ratio.
import time
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

f1_scores = []
np_ratios = []
for np_ratio in [1, 5, 10, 30, 50, 70, 100]:
    t1 = time.time()
    # generation of training and valid set
    valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio = 0.2, valid_sub_ratio = 1, train_np_ratio = np_ratio, train_sub_ratio = 1)
    # generation of rf model and fit
    rf_clf = RandomForestClassifier(max_depth=35, n_estimators=100, max_features="sqrt", verbose=True)
    rf_clf.fit(train_X, train_y)
    # validation and evaluation
    valid_y_pred = rf_clf.predict(valid_X)
    f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
    np_ratios.append(np_ratio)
    print('rf_clf [NP ratio = %d] is fitted' % np_ratio)
    t2 = time.time()
    print('time used %d s' % (t2 - t1))
For each N/P value a model is trained on the corresponding positive/negative mix and evaluated on the validation set, yielding one f1_score per N/P value; the results are plotted to locate the best N/P:
# plot the result
f1 = plt.figure(1)
plt.plot(np_ratios, f1_scores, label="md=35,nt=100")
plt.xlabel('NP ratio')
plt.ylabel('f1_score')
plt.title('f1_score as function of NP ratio - RF')
plt.legend(loc=4)
plt.grid(True, linewidth=0.3)
plt.show()
Second: tuning the size of the forest, i.e. the number of trees n_estimators. The process is as follows:
# 1.2 selection for best n_estimators of RF
# training and validating
f1_scores = []
n_trees = []
valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio = 0.2,
valid_sub_ratio = 1,
train_np_ratio = 5,
train_sub_ratio = 1)
for nt in [10, 20, 40, 80, 120, 160, 200, 300, 400, 500, 600, 700, 800]:
    t1 = time.time()
    # generation of RF model and fit
    RF_clf = RandomForestClassifier(n_estimators=nt,
                                    max_depth=35,
                                    max_features="sqrt",
                                    verbose=True)
    RF_clf.fit(train_X, train_y)
    # validation and evaluation
    valid_y_pred = RF_clf.predict(valid_X)
    f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
    n_trees.append(nt)
    print('RF_clf [n_estimators = %d] is fitted' % nt)
    t2 = time.time()
    print('time used %d s' % (t2 - t1))
# plot the result
f1 = plt.figure(1)
plt.plot(n_trees, f1_scores, label="md=35,np_ratio=5")
plt.xlabel('n_trees')
plt.ylabel('f1_score')
plt.title('f1_score as function of RF n_trees')
plt.legend(loc=4)
plt.grid(True, linewidth=0.3)
plt.show()
Third: tuning the prediction probability cutoff (step (4) in the list above). The trained model outputs a purchase probability for each sample, and the default classification threshold is 0.5; sweeping the cutoff changes the condition under which a (user, item) sample is labeled as a purchase, so the threshold that maximizes F1 can be found. The code is as follows:
## 1.4 selection for best cutoff in np.arange(0.1, 1, 0.05) of RF
# training and validating
f1_scores = []
cut_offs = []
valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio = 0.2,
                                                               valid_sub_ratio = 1,
                                                               train_np_ratio = 10,
                                                               train_sub_ratio = 1)
# generation of RF model and fit
RF_clf = RandomForestClassifier(max_depth=30, n_estimators=150,max_features="sqrt", verbose=True)
RF_clf.fit(train_X, train_y)
for cutoff in np.arange(0.1, 1, 0.05):
    # validation and evaluation
    valid_y_pred = (RF_clf.predict_proba(valid_X)[:,1] > cutoff).astype(int)
    f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
    cut_offs.append(cutoff)
    print('RF_clf [cutoff = %.2f] is fitted' % cutoff)
# plot the result
f1 = plt.figure(1)
plt.plot(cut_offs, f1_scores, label="np_ratio=10,nt=150,md=30")
plt.xlabel('cut_offs')
plt.ylabel('f1_score')
plt.title('f1_score as function of RF cut_offs')
plt.legend(loc=4)
plt.grid(True, linewidth=0.3)
plt.show()
Putting it all together, the optimal parameters are:
max_depth = 20 (>= 10)
n_estimators = 300
cutoffs = 0.55 (0.45 ~ 0.65)
N/P ratio = 50
An RF model is then built with these tuned parameters and trained on the full df_part_1 and df_part_2 data, with no further train/validation split. The df_part_3 data is assembled into the same sample feature format via the same key joins and fed into the model for prediction. Finally, the predictions are intersected with the given candidate item subset to obtain the final result.
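The full-data training and part-3 prediction code is not shown in this section; the following is a minimal sketch under the tuned parameters. train_X / train_y (built from all of parts 1 and 2) and pred_X / pred_df (the feature matrix and its user_id/item_id index built from df_part_3 by the merges shown earlier) are assumed names, not taken from the source:

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=300, max_depth=20, max_features="sqrt")
rf_clf.fit(train_X, train_y)  # fit on the full part-1 + part-2 training set

# apply the tuned probability cutoff instead of the default 0.5
pred_y = (rf_clf.predict_proba(pred_X)[:, 1] > 0.55).astype(int)

# keep only the (user_id, item_id) pairs predicted as purchases
pred_df.loc[pred_y == 1, ['user_id', 'item_id']].to_csv(path_df_result_tmp, header=False, index=False)

The intersection with the candidate item subset P is then: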
# loading data
df_P = df_read(path_df_P)
df_P_item = df_P.drop_duplicates(['item_id'])[['item_id']]
df_pred = pd.read_csv(open(path_df_result_tmp,'r'), index_col=False, header=None)
df_pred.columns = ['user_id', 'item_id']
# output result
df_pred_P = pd.merge(df_pred, df_P_item, on=['item_id'], how='inner')[['user_id', 'item_id']]
df_pred_P.to_csv(path_df_result, index=False)