The RF Model Training Process in a Mobile Recommendation Algorithm

First, the set-construction function:

Purpose: generation and splitting of the training set & validation set

def valid_train_set_construct(valid_ratio=0.5, valid_sub_ratio=0.5, train_np_ratio=1, train_sub_ratio=0.5):
    # generation of train & valid set
    @param valid_ratio:     float ~ [0~1], the valid set ratio in the total set; the rest is the train set
    @param valid_sub_ratio: float ~ (0~1), random sample ratio of the valid set
    @param train_np_ratio:  int ~ (1~1200), the N/P sub-sample ratio of the training set, used for class balance
    @param train_sub_ratio: float ~ (0~1), random sample ratio of the train set after the N/P subsample

    @return valid_X, valid_y, train_X, train_y

The total data set is first split according to the valid_ratio parameter. The split is random: np.random.rand generates an array of values uniform in [0, 1), one per row of the clustered label table (indexed by user_id, item_id, item_category and the purchase label); the rows whose value falls below valid_ratio are taken as the validation sample. Because the random values are roughly uniformly distributed, the split hits the requested ratio. The discussion so far concerns the sampling of the negative samples; the code is as follows:

    msk_1 = np.random.rand(len(df_part_1_uic_label_cluster)) < valid_ratio
    msk_2 = np.random.rand(len(df_part_2_uic_label_cluster)) < valid_ratio

    valid_df_part_1_uic_label_cluster = df_part_1_uic_label_cluster.loc[msk_1]
    valid_df_part_2_uic_label_cluster = df_part_2_uic_label_cluster.loc[msk_2]
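As a toy illustration of this mask-based split (the DataFrame below is made up; only the technique matches the code above):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# hypothetical stand-in for df_part_1_uic_label_cluster
df = pd.DataFrame({'user_id': range(1000), 'class': np.random.randint(0, 5, 1000)})

valid_ratio = 0.3
msk = np.random.rand(len(df)) < valid_ratio   # True with probability ~valid_ratio
valid_part = df.loc[msk]
train_part = df.loc[~msk]                     # the complement later becomes the training split

# every row lands in exactly one split, at roughly the requested ratio
assert len(valid_part) + len(train_part) == len(df)
print(round(len(valid_part) / len(df), 2))    # close to 0.3
```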

Note that the validation split must cover not only the negative-sample clusters but also cluster 0, which holds the positive samples; the random-threshold mask above already splits cluster 0 as well. Next, the positive samples are sub-sampled a second time with sample. The code is:

    # sub-sample the positive samples: clustering put all positives into cluster 0 and kept them all,
    # so a further down-sample is taken here
    valid_part_1_uic_label = valid_df_part_1_uic_label_cluster[valid_df_part_1_uic_label_cluster['class'] == 0].sample(frac=valid_sub_ratio)
    valid_part_2_uic_label = valid_df_part_2_uic_label_cluster[valid_df_part_2_uic_label_cluster['class'] == 0].sample(frac=valid_sub_ratio)

After the overall samples are marked, each cluster is sampled in turn with sample at the already "balanced" rate, and the results are concatenated. The code is:

    ## constructing valid set
    for i in range(1, 1001, 1):
        valid_part_1_uic_label_0_i = valid_df_part_1_uic_label_cluster[valid_df_part_1_uic_label_cluster['class'] == i]
        if len(valid_part_1_uic_label_0_i) != 0:
            valid_part_1_uic_label_0_i = valid_part_1_uic_label_0_i.sample(frac=valid_sub_ratio)
            valid_part_1_uic_label = pd.concat([valid_part_1_uic_label, valid_part_1_uic_label_0_i])

        valid_part_2_uic_label_0_i = valid_df_part_2_uic_label_cluster[valid_df_part_2_uic_label_cluster['class'] == i]
        if len(valid_part_2_uic_label_0_i) != 0:
            valid_part_2_uic_label_0_i = valid_part_2_uic_label_0_i.sample(frac=valid_sub_ratio)
            valid_part_2_uic_label = pd.concat([valid_part_2_uic_label, valid_part_2_uic_label_0_i])

The sampled uic_label table is then joined with the previously built feature tables, using user_id, item_id, item_category and their combinations as join keys; finally the df_part_1 and df_part_2 results are merged. The code is:

    valid_part_1_df = pd.merge(valid_part_1_uic_label, df_part_1_U, how='left', on=['user_id'])
    valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_I,  how='left', on=['item_id'])
    valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_C,  how='left', on=['item_category'])
    valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_IC, how='left', on=['item_id','item_category'])
    valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_UI, how='left', on=['user_id','item_id','item_category','label'])
    valid_part_1_df = pd.merge(valid_part_1_df, df_part_1_UC, how='left', on=['user_id','item_category'])
    
    valid_part_2_df = pd.merge(valid_part_2_uic_label, df_part_2_U, how='left', on=['user_id'])
    valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_I,  how='left', on=['item_id'])
    valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_C,  how='left', on=['item_category'])
    valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_IC, how='left', on=['item_id','item_category'])
    valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_UI, how='left', on=['user_id','item_id','item_category','label'])
    valid_part_2_df = pd.merge(valid_part_2_df, df_part_2_UC, how='left', on=['user_id','item_category'])

    

Sample format

After feature construction, the data needed for training and prediction is obtained by merging the major feature groups (U, I, C, UI, UC, IC). A sample row has the following format:

    # index             features               label
    user_id, item_id    ~100 feature values    class (0 = not purchased, 1 = purchased)

Once the sample sets are built, model training and prediction can begin.

(N.B. the generated data reaches the 10 GB scale; given the limited compute and storage of a single machine, the example program makes heavy use of chunked processing. An HDFS + MapReduce implementation is another option.)

Finally, the joined validation tables are concatenated into the training-sample format:

valid_df = pd.concat([valid_part_1_df, valid_part_2_df])

Missing values are filled with -1; this is the usual convention for both RF and GBDT models.

    # fill the missing value as -1 (missing value are time features)
    valid_df.fillna(-1, inplace=True)

The validation set is now essentially complete; the last step is to convert it into a matrix and separate the features from the label column.

# using all the features for valid rf model
    valid_X = valid_df.as_matrix(['u_b1_count_in_6','u_b2_count_in_6','u_b3_count_in_6','u_b4_count_in_6','u_b_count_in_6', 
                                  'u_b1_count_in_3','u_b2_count_in_3','u_b3_count_in_3','u_b4_count_in_3','u_b_count_in_3', 
                                  'u_b1_count_in_1','u_b2_count_in_1','u_b3_count_in_1','u_b4_count_in_1','u_b_count_in_1', 
                                  'u_b4_rate','u_b4_diff_hours',
                                  'i_u_count_in_6','i_u_count_in_3','i_u_count_in_1',
                                  'i_b1_count_in_6','i_b2_count_in_6','i_b3_count_in_6','i_b4_count_in_6','i_b_count_in_6', 
                                  'i_b1_count_in_3','i_b2_count_in_3','i_b3_count_in_3','i_b4_count_in_3','i_b_count_in_3',
                                  'i_b1_count_in_1','i_b2_count_in_1','i_b3_count_in_1','i_b4_count_in_1','i_b_count_in_1', 
                                  'i_b4_rate','i_b4_diff_hours',
                                  'c_u_count_in_6','c_u_count_in_3','c_u_count_in_1',
                                  'c_b1_count_in_6','c_b2_count_in_6','c_b3_count_in_6','c_b4_count_in_6','c_b_count_in_6',
                                  'c_b1_count_in_3','c_b2_count_in_3','c_b3_count_in_3','c_b4_count_in_3','c_b_count_in_3',
                                  'c_b1_count_in_1','c_b2_count_in_1','c_b3_count_in_1','c_b4_count_in_1','c_b_count_in_1',
                                  'c_b4_rate','c_b4_diff_hours',
                                  'ic_u_rank_in_c','ic_b_rank_in_c','ic_b4_rank_in_c', 
                                  'ui_b1_count_in_6','ui_b2_count_in_6','ui_b3_count_in_6','ui_b4_count_in_6','ui_b_count_in_6',
                                  'ui_b1_count_in_3','ui_b2_count_in_3','ui_b3_count_in_3','ui_b4_count_in_3','ui_b_count_in_3',
                                  'ui_b1_count_in_1','ui_b2_count_in_1','ui_b3_count_in_1','ui_b4_count_in_1','ui_b_count_in_1', 
                                  'ui_b_count_rank_in_u','ui_b_count_rank_in_uc',
                                  'ui_b1_last_hours','ui_b2_last_hours','ui_b3_last_hours','ui_b4_last_hours',
                                  'uc_b1_count_in_6','uc_b2_count_in_6','uc_b3_count_in_6','uc_b4_count_in_6','uc_b_count_in_6', 
                                  'uc_b1_count_in_3','uc_b2_count_in_3','uc_b3_count_in_3','uc_b4_count_in_3','uc_b_count_in_3', 
                                  'uc_b1_count_in_1','uc_b2_count_in_1','uc_b3_count_in_1','uc_b4_count_in_1','uc_b_count_in_1',
                                  'uc_b_count_rank_in_u',

                                  'uc_b1_last_hours','uc_b2_last_hours','uc_b3_last_hours','uc_b4_last_hours'])

Building the label column vector:

        valid_y = valid_df['label'].values

With this, valid_X and valid_y are complete.
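Note that `DataFrame.as_matrix`, used above, was removed in pandas 1.0; on a modern pandas the same feature/label separation is done by selecting the columns and calling `.to_numpy()`. A minimal sketch with a hypothetical two-feature frame:

```python
import pandas as pd

# hypothetical two-feature frame standing in for valid_df
df = pd.DataFrame({'f1': [1.0, 2.0], 'f2': [3.0, 4.0], 'label': [0, 1]})
feature_cols = ['f1', 'f2']        # stand-in for the ~100 real feature names

X = df[feature_cols].to_numpy()    # modern replacement for df.as_matrix(feature_cols)
y = df['label'].values

print(X.shape, y.shape)            # (2, 2) (2,)
```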

The training set is constructed just like the validation set above, except that its share of the data is 1 minus the validation ratio, i.e. the complement of the validation mask. Hence the following two lines:

    ### constructing training set
    train_df_part_1_uic_label_cluster = df_part_1_uic_label_cluster.loc[~msk_1]

    train_df_part_2_uic_label_cluster = df_part_2_uic_label_cluster.loc[~msk_2] 

Cluster 0, i.e. the positive samples within the training split, is again randomly sub-sampled with sample:

    train_part_1_uic_label = train_df_part_1_uic_label_cluster[train_df_part_1_uic_label_cluster['class'] == 0].sample(frac=train_sub_ratio)
    train_part_2_uic_label = train_df_part_2_uic_label_cluster[train_df_part_2_uic_label_cluster['class'] == 0].sample(frac=train_sub_ratio)

Next, the negative samples are drawn from the negative clusters. The sampling fraction is the product of the sub-sample ratio and the requested N/P balancing ratio divided by the raw imbalance of about 1200; each cluster is visited in turn:

    frac_ratio = float(train_sub_ratio) * float(train_np_ratio) / float(1200)
    for i in range(1, 1001, 1):
        train_part_1_uic_label_0_i = train_df_part_1_uic_label_cluster[train_df_part_1_uic_label_cluster['class'] == i]
        if len(train_part_1_uic_label_0_i) != 0:
            train_part_1_uic_label_0_i = train_part_1_uic_label_0_i.sample(frac=frac_ratio)
            train_part_1_uic_label = pd.concat([train_part_1_uic_label, train_part_1_uic_label_0_i])

        train_part_2_uic_label_0_i = train_df_part_2_uic_label_cluster[train_df_part_2_uic_label_cluster['class'] == i]
        if len(train_part_2_uic_label_0_i) != 0:
            train_part_2_uic_label_0_i = train_part_2_uic_label_0_i.sample(frac=frac_ratio)
            train_part_2_uic_label = pd.concat([train_part_2_uic_label, train_part_2_uic_label_0_i])
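The effect of frac_ratio can be checked on a toy clustered table (all names and sizes below are illustrative): cluster 0 keeps the fraction train_sub_ratio, while each negative cluster keeps only train_sub_ratio * train_np_ratio / 1200, pulling the negatives down toward the requested N/P ratio.

```python
import numpy as np
import pandas as pd

np.random.seed(1)
# hypothetical clustered label table: cluster 0 = positives, clusters 1..3 = negatives (~1200:1)
df = pd.DataFrame({'class': np.concatenate([np.zeros(100, dtype=int),
                                            np.random.randint(1, 4, 120000)])})

train_sub_ratio, train_np_ratio = 1.0, 100
frac_ratio = float(train_sub_ratio) * float(train_np_ratio) / float(1200)

parts = [df[df['class'] == 0].sample(frac=train_sub_ratio)]   # keep the positives
for i in range(1, 4):
    cluster_i = df[df['class'] == i]
    if len(cluster_i) != 0:
        parts.append(cluster_i.sample(frac=frac_ratio))       # down-sample each negative cluster
train = pd.concat(parts)

n_pos = int((train['class'] == 0).sum())
n_neg = int((train['class'] != 0).sum())
print(n_neg / n_pos)   # roughly train_np_ratio, i.e. about 100
```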

The feature tables are then joined on the index keys to assemble the sample-feature format, as follows:

 # constructing training set
    train_part_1_df = pd.merge(train_part_1_uic_label, df_part_1_U, how='left', on=['user_id'])
    train_part_1_df = pd.merge(train_part_1_df, df_part_1_I,  how='left', on=['item_id'])
    train_part_1_df = pd.merge(train_part_1_df, df_part_1_C,  how='left', on=['item_category'])
    train_part_1_df = pd.merge(train_part_1_df, df_part_1_IC, how='left', on=['item_id','item_category'])
    train_part_1_df = pd.merge(train_part_1_df, df_part_1_UI, how='left', on=['user_id','item_id','item_category','label'])
    train_part_1_df = pd.merge(train_part_1_df, df_part_1_UC, how='left', on=['user_id','item_category'])
    
    train_part_2_df = pd.merge(train_part_2_uic_label, df_part_2_U, how='left', on=['user_id'])
    train_part_2_df = pd.merge(train_part_2_df, df_part_2_I,  how='left', on=['item_id'])
    train_part_2_df = pd.merge(train_part_2_df, df_part_2_C,  how='left', on=['item_category'])
    train_part_2_df = pd.merge(train_part_2_df, df_part_2_IC, how='left', on=['item_id','item_category'])
    train_part_2_df = pd.merge(train_part_2_df, df_part_2_UI, how='left', on=['user_id','item_id','item_category','label'])
    train_part_2_df = pd.merge(train_part_2_df, df_part_2_UC, how='left', on=['user_id','item_category'])
    
    train_df = pd.concat([train_part_1_df, train_part_2_df])
    # fill the missing value as -1 (missing value are time features)
    train_df.fillna(-1, inplace=True)
    
  # using all the features for training rf model
 train_X = train_df.as_matrix(['u_b1_count_in_6','u_b2_count_in_6','u_b3_count_in_6','u_b4_count_in_6','u_b_count_in_6', 
                            'u_b1_count_in_3','u_b2_count_in_3','u_b3_count_in_3','u_b4_count_in_3','u_b_count_in_3', 
                             'u_b1_count_in_1','u_b2_count_in_1','u_b3_count_in_1','u_b4_count_in_1','u_b_count_in_1', 
                             'u_b4_rate','u_b4_diff_hours',
                             'i_u_count_in_6','i_u_count_in_3','i_u_count_in_1',
                             'i_b1_count_in_6','i_b2_count_in_6','i_b3_count_in_6','i_b4_count_in_6','i_b_count_in_6', 
                             'i_b1_count_in_3','i_b2_count_in_3','i_b3_count_in_3','i_b4_count_in_3','i_b_count_in_3',
                             'i_b1_count_in_1','i_b2_count_in_1','i_b3_count_in_1','i_b4_count_in_1','i_b_count_in_1', 
                             'i_b4_rate','i_b4_diff_hours',
                             'c_u_count_in_6','c_u_count_in_3','c_u_count_in_1',
                             'c_b1_count_in_6','c_b2_count_in_6','c_b3_count_in_6','c_b4_count_in_6','c_b_count_in_6',
                             'c_b1_count_in_3','c_b2_count_in_3','c_b3_count_in_3','c_b4_count_in_3','c_b_count_in_3',
                              'c_b1_count_in_1','c_b2_count_in_1','c_b3_count_in_1','c_b4_count_in_1','c_b_count_in_1',
                              'c_b4_rate','c_b4_diff_hours',
                              'ic_u_rank_in_c','ic_b_rank_in_c','ic_b4_rank_in_c', 
                              'ui_b1_count_in_6','ui_b2_count_in_6','ui_b3_count_in_6','ui_b4_count_in_6','ui_b_count_in_6',
                              'ui_b1_count_in_3','ui_b2_count_in_3','ui_b3_count_in_3','ui_b4_count_in_3','ui_b_count_in_3',
                              'ui_b1_count_in_1','ui_b2_count_in_1','ui_b3_count_in_1','ui_b4_count_in_1','ui_b_count_in_1', 
                              'ui_b_count_rank_in_u','ui_b_count_rank_in_uc',
                              'ui_b1_last_hours','ui_b2_last_hours','ui_b3_last_hours','ui_b4_last_hours',
                              'uc_b1_count_in_6','uc_b2_count_in_6','uc_b3_count_in_6','uc_b4_count_in_6','uc_b_count_in_6', 
                              'uc_b1_count_in_3','uc_b2_count_in_3','uc_b3_count_in_3','uc_b4_count_in_3','uc_b_count_in_3', 
                              'uc_b1_count_in_1','uc_b2_count_in_1','uc_b3_count_in_1','uc_b4_count_in_1','uc_b_count_in_1',
                               'uc_b_count_rank_in_u',
                               'uc_b1_last_hours','uc_b2_last_hours','uc_b3_last_hours','uc_b4_last_hours'])
    train_y = train_df['label'].values
    print("train subset is generated.")

At this point the construction of the training and validation sets is complete: valid_X, valid_y, train_X, train_y.

With the training and validation sets built, the next task is parameter tuning.

Parameter tuning matters a great deal when training RF (or GBDT) models. The parameters of an ensemble model generally fall into two groups: process parameters and base-learner parameters. As a rule, the process parameters (e.g. the number of base learners n_estimators for RF) are tuned first, followed by the base-learner parameters (e.g. the decision tree's maximum depth max_depth).

Here four parameters are tuned, yielding four optimal values. The process is:

        (1). selection for best N/P ratio of subsample
        (2). selection for best n_estimators for RF
        (3). selection for best max_depth & min_samples_split & min_samples_leaf for RF
        (4). selection for best prediction cutoff for RF


Step 1: tuning the negative/positive (N/P) sample ratio.

        f1_scores = []
        np_ratios = []
        for np_ratio in [1, 5, 10, 30, 50, 70, 100]:
            t1 = time.time()
            # generation of training and valid set
            valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio=0.2,
                                                                           valid_sub_ratio=1,
                                                                           train_np_ratio=np_ratio,
                                                                           train_sub_ratio=1)
            # generation of rf model and fit
            rf_clf = RandomForestClassifier(max_depth=35, n_estimators=100, max_features="sqrt", verbose=True)
            rf_clf.fit(train_X, train_y)
            # validation and evaluation
            valid_y_pred = rf_clf.predict(valid_X)
            f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
            np_ratios.append(np_ratio)
            print('rf_clf [NP ratio = %d] is fitted' % np_ratio)

            t2 = time.time()
            print('time used %d s' % (t2 - t1))
A model is trained on the sample mix produced by each N/P value and evaluated on the validation set, giving one f1_score per ratio; plotting the curve reveals the best N/P value.
        # plot the result
        f1 = plt.figure(1)
        plt.plot(np_ratios, f1_scores, label="md=35,nt=100")
        plt.xlabel('NP ratio')
        plt.ylabel('f1_score')
        plt.title('f1_score as function of NP ratio - RF')
        plt.legend(loc=4)
        plt.grid(True, linewidth=0.3)

        plt.show()

Step 2: tuning the forest size, i.e. the number of trees n_estimators. The process is:

      # 1.2 selection for best n_estimators of RF

      # training and validating
      f1_scores = []
      n_trees = []
      # generation of training and valid set (once, outside the loop)
      valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio=0.2,
                                                                     valid_sub_ratio=1,
                                                                     train_np_ratio=5,
                                                                     train_sub_ratio=1)
      for nt in [10, 20, 40, 80, 120, 160, 200, 300, 400, 500, 600, 700, 800]:
          t1 = time.time()
          # generation of RF model and fit
          RF_clf = RandomForestClassifier(n_estimators=nt,
                                          max_depth=35,
                                          max_features="sqrt",
                                          verbose=True)
          RF_clf.fit(train_X, train_y)
          # validation and evaluation
          valid_y_pred = RF_clf.predict(valid_X)
          f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
          n_trees.append(nt)
          print('RF_clf [n_estimators = %d] is fitted' % nt)
          t2 = time.time()
          print('time used %d s' % (t2 - t1))

      # plot the result
      f1 = plt.figure(1)
      plt.plot(n_trees, f1_scores, label="md=35,np_ratio=5")
      plt.xlabel('n_trees')
      plt.ylabel('f1_score')
      plt.title('f1_score as function of RF n_trees')
      plt.legend(loc=4)
      plt.grid(True, linewidth=0.3)
      plt.show()
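Step (3) of the tuning list (max_depth, min_samples_split, min_samples_leaf) is not shown in this write-up; by analogy with the loops above, a minimal hedged sketch might look like the following (the data is a synthetic stand-in and the parameter grid is illustrative):

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data; in the real pipeline these come from valid_train_set_construct(...)
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)
train_X, train_y = X[:750], y[:750]
valid_X, valid_y = X[750:], y[750:]

f1_scores, depths = [], []
for md in [5, 10, 20]:                                     # illustrative grid
    RF_clf = RandomForestClassifier(n_estimators=50,
                                    max_depth=md,
                                    min_samples_split=2,   # sweep these two the same way
                                    min_samples_leaf=1,
                                    max_features="sqrt",
                                    random_state=0)
    RF_clf.fit(train_X, train_y)
    f1_scores.append(metrics.f1_score(valid_y, RF_clf.predict(valid_X)))
    depths.append(md)

best_depth = depths[int(np.argmax(f1_scores))]
print('best max_depth:', best_depth)
```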


Step 3: tuning the prediction probability cutoff. The trained model outputs a probability for each sample; instead of the default threshold of 0.5, the cutoff is varied, which changes the predicted purchase label assigned to each user-item sample. The code is:

## 1.4 selection for best cutoff in np.arange(0.1, 1, 0.05) of RF

# training and validating
f1_scores = []
cut_offs = []
valid_X, valid_y, train_X, train_y = valid_train_set_construct(valid_ratio=0.2,
                                                               valid_sub_ratio=1,
                                                               train_np_ratio=10,
                                                               train_sub_ratio=1)
# generation of RF model and fit
RF_clf = RandomForestClassifier(max_depth=30, n_estimators=150, max_features="sqrt", verbose=True)
RF_clf.fit(train_X, train_y)
for cutoff in np.arange(0.1, 1, 0.05):
    # validation and evaluation
    valid_y_pred = (RF_clf.predict_proba(valid_X)[:, 1] > cutoff).astype(int)
    f1_scores.append(metrics.f1_score(valid_y, valid_y_pred))
    cut_offs.append(cutoff)
    print('RF_clf [cutoff = %.2f] is fitted' % cutoff)

# plot the result
f1 = plt.figure(1)
plt.plot(cut_offs, f1_scores, label="np_ratio=10,nt=150,md=30")
plt.xlabel('cut_offs')
plt.ylabel('f1_score')
plt.title('f1_score as function of RF cut_offs')
plt.legend(loc=4)
plt.grid(True, linewidth=0.3)
plt.show()

In summary, the optimal parameters obtained are:

                             max_depth = 20 (>= 10)
                             n_estimators = 300
                             cutoffs = 0.55 (0.45 ~ 0.65)
                             N/P ratio = 50 

An RF model is then built with the tuned parameters and trained on the full df_part_1 and df_part_2 data (no further train/valid split). The df_part_3 data is then assembled into the same sample-feature format by keyed table joins and fed to the model for prediction.
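This final training-and-prediction step can be sketched as follows; the stand-in data is synthetic, while the tuned parameter values come from the summary above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the full df_part_1 + df_part_2 training sample (no valid split here)
train_X, train_y = make_classification(n_samples=2000, n_features=20,
                                       weights=[0.95, 0.05], random_state=0)
# synthetic stand-in for the df_part_3 sample, already joined into the same feature format
pred_X, _ = make_classification(n_samples=500, n_features=20, random_state=1)

# final model with the tuned parameters from the summary above
RF_clf = RandomForestClassifier(n_estimators=300, max_depth=20,
                                max_features="sqrt", random_state=0)
RF_clf.fit(train_X, train_y)

# predict with the tuned probability cutoff instead of the default 0.5
cutoff = 0.55
pred_y = (RF_clf.predict_proba(pred_X)[:, 1] > cutoff).astype(int)
print(pred_y.sum(), 'positive predictions out of', len(pred_y))
```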

The predictions are then intersected with the given item subset to obtain the final result:

    # loading data
    df_P = df_read(path_df_P)
    df_P_item = df_P.drop_duplicates(['item_id'])[['item_id']]
    df_pred = pd.read_csv(open(path_df_result_tmp, 'r'), index_col=False, header=None)
    df_pred.columns = ['user_id', 'item_id']
    # output result
    df_pred_P = pd.merge(df_pred, df_P_item, on=['item_id'], how='inner')[['user_id', 'item_id']]
    df_pred_P.to_csv(path_df_result, index=False)
