Kaggle Competition in Practice 10: Feature Optimization

Feature optimization ideas:

Once the standard workflow is complete and you are unsure what to try next, consider further processing of the text or time-series features.

First, note that every credit-card transaction record carries a transaction timestamp, and generic batch feature-creation methods cannot fully mine the information hidden in time and text fields. We therefore derive additional features around the transaction time. Here we can build features that describe user behavior habits (repeated experiments show that user-behavior features are the most effective class of features for improving predictions), including the time gap between the most recent and the first transaction, the gap between the card activation date and the first transaction, the average interval between a user's consecutive transactions, and aggregations by transaction location or merchant category (with statistics such as mean and variance). Moreover, since behavior closer to the present carries more signal, we also pay special attention to the user's behavior over the most recent two months (the exact window is up to you), computing the same transaction-behavior statistics within that window and feeding them into the model.
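For intuition, here is a minimal sketch of two such habit features (assuming a hypothetical transactions DataFrame with card_id and purchase_date columns; the full implementation appears later in this section):

     tx = transactions.sort_values(['card_id', 'purchase_date'])
     g = tx.groupby('card_id')['purchase_date']

     habits = pd.DataFrame({
         # span between each user's first and most recent transaction
         'first_to_last_days': (g.max() - g.min()).dt.days,
         # average gap between consecutive transactions of the same user
         'mean_interval_days': g.apply(lambda s: s.diff().dt.days.mean()),
     })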

Second, second-order cross features. In earlier feature derivation we built cross features, but only first-order ones, e.g. total transaction amount per merchant category. We can go one step further and build second-order versions, e.g. total transaction amount per combination of categories. Higher-order crosses generally make the feature matrix sparser, and since every extra order spawns a large number of columns, they can easily cause a dimensionality explosion, so they must be used with care. Still, as noted above, user-behavior features have the largest impact on the model, so we derive second-order crosses around the behavior data only, and run feature selection before the modeling step.
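The difference between the two orders can be sketched in a few lines (same hypothetical transactions table as above; the real implementation is the successive_aggregates function shown later):

     # first order: total amount per card
     amount_per_card = transactions.groupby('card_id')['purchase_amount'].sum()

     # second order: total amount per (card, merchant category) pair,
     # then per-card statistics over those per-category totals
     pair_sum = (transactions
                 .groupby(['card_id', 'merchant_category_id'])['purchase_amount']
                 .sum()
                 .reset_index())
     second_order = (pair_sum
                     .groupby('card_id')['purchase_amount']
                     .agg(['mean', 'max', 'std']))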

Third, we need to think harder about the outliers, since most of the model error comes from them. The rating is in fact computed by some hand-crafted formula, so these outliers are very likely a marker for some special class of users. It is therefore worth modeling in two stages: first predict whether each input sample is an outlier, then run the subsequent regression according to the classification result.

And to help the first-stage classifier identify outlier users more accurately, we create aggregate fields based on the outlier users, such as their average number of transactions and average spend.

Model optimization:

We use CatBoost. CatBoost is a GBM implementation open-sourced by the Russian search company Yandex in July 2017, and it has been popular with practitioners ever since thanks to its strong results and fast execution. Among its advantages, the most notable is that it handles categorical features natively with a mixed strategy of one-hot encoding and mean (target) encoding; in other words, CatBoost folds some well-proven, broadly effective feature-engineering practice directly into model training. It also introduces a new gradient-boosting scheme (ordered boosting) that strikes a good balance between empirical and structural risk, i.e. it improves accuracy while guarding against overfitting.
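As a point of reference, a minimal CatBoost setup might look like the sketch below; the hyperparameters are illustrative placeholders rather than tuned values, and X_train / y_train / X_valid / y_valid / cat_cols are assumed to be prepared elsewhere:

     from catboost import CatBoostRegressor

     cat_model = CatBoostRegressor(
         iterations=2000,        # placeholder settings, not tuned for this competition
         learning_rate=0.05,
         depth=6,
         loss_function='RMSE',
         eval_metric='RMSE',
         random_seed=2020,
         verbose=200)
     # cat_features tells CatBoost which raw columns to encode internally
     # (one-hot for low-cardinality columns, target statistics for the rest)
     cat_model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=cat_cols)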

In the modeling work that follows, CatBoost replaces the random forest, and we ultimately feed CatBoost, XGBoost, and LightGBM into a three-model ensemble.

Also note that in the actual two-stage setup, every modeling stage needs its own cross validation and model fusion to get the most out of the approach. That means training three groups of models (with three corresponding rounds of fusion): the outlier classification task, the regression task on normal users, and the regression task on outlier users. The relationship between the three modeling rounds is shown below:

[Figure: the three modeling rounds (outlier classification, regression on normal users, regression on outlier users) and how their predictions are fused]

Clearly, the overall modeling and fusion pipeline becomes more complex, but this is also closer to real-world practice: the algorithms and fusion tricks themselves are just ingredients, and designing a more accurate, efficient training pipeline is what an advancing algorithm engineer really needs to master.
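To make the structure concrete, here is a minimal sketch of the two-stage flow under stated assumptions: the model classes and the blending rule are illustrative placeholders rather than the exact setup used later, and y_outlier_flag corresponds to the outliers flag built further below (1 when target < -30):

     from lightgbm import LGBMClassifier, LGBMRegressor

     # Stage 1: classify whether a sample looks like an outlier user
     clf = LGBMClassifier(n_estimators=500)
     clf.fit(X_train, y_outlier_flag)
     p_outlier = clf.predict_proba(X_test)[:, 1]

     # Stage 2: one regressor fit on normal users only, one fit on all users
     reg_normal = LGBMRegressor(n_estimators=500)
     reg_normal.fit(X_train[y_outlier_flag == 0], y_train[y_outlier_flag == 0])

     reg_all = LGBMRegressor(n_estimators=500)
     reg_all.fit(X_train, y_train)

     # fuse: trust the normal-user model unless the classifier flags the sample
     pred = (1 - p_outlier) * reg_normal.predict(X_test) + p_outlier * reg_all.predict(X_test)

In practice, each of the three fits would itself sit inside a cross-validation loop and be blended across CatBoost, XGBoost, and LightGBM, as described above.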

Now let's build it from the start.

Load the training and test sets and do basic preprocessing:

     # pd, np, datetime, gc, and tqdm_notebook are assumed to be imported in earlier installments,
     # e.g. import gc, datetime, numpy as np, pandas as pd; from tqdm import tqdm_notebook
     train = pd.read_csv('data/train.csv')
     test = pd.read_csv('data/test.csv')

     target = train['target']

     for df in [train, test]:
         # fill missing values with '0-0', split on '-', and take the year part
         df['year'] = df['first_active_month'].fillna('0-0').apply(lambda x: int(str(x).split('-')[0]))
         # 'YYYY-MM' strings become full dates (the day defaults to the 1st)
         df['first_active_month'] = pd.to_datetime(df['first_active_month'])
         # days elapsed up to 2018-03-01, since the dataset is cut off at February 2018
         df['elapsed_time'] = (datetime.date(2018, 3, 1) - df['first_active_month'].dt.date).dt.days
         df['weekofyear'] = df['first_active_month'].dt.weekofyear  # older pandas; on pandas >= 1.1 use .dt.isocalendar().week
         df['dayofyear'] = df['first_active_month'].dt.dayofyear
         df['month'] = df['first_active_month'].dt.month

     ## merge first_active_month into the transaction tables
     ## (historical_transactions and new_transactions are assumed to be loaded already)
     train_test = pd.concat([train[['card_id', 'first_active_month']], test[['card_id', 'first_active_month']]], axis=0, ignore_index=True)
     historical_transactions = historical_transactions.merge(train_test[['card_id', 'first_active_month']], on=['card_id'], how='left')
     new_transactions = new_transactions.merge(train_test[['card_id', 'first_active_month']], on=['card_id'], how='left')

Next, we expose the time fields at a finer granularity:

     def month_trans(x):
         return x // 30

     def week_trans(x):
         return x // 7

     def get_expand_common(df_):
         df = df_.copy()  # work on a copy, since we mutate the table in place


         df['category_2'].fillna(1.0,inplace=True) 
         df['category_3'].fillna('A',inplace=True) 
         df['category_3'] = df['category_3'].map({'A':0, 'B':1, 'C':2}) 
         df['merchant_id'].fillna('M_ID_00a6ca8a8a',inplace=True) 
         df['installments'].replace(-1, np.nan,inplace=True) 
         df['installments'].replace(999, np.nan,inplace=True) 
         df['installments'].replace(0, 1,inplace=True) 
          
         # undo the purchase_amount anonymization (community-discovered constants), then round
         df['purchase_amount'] = np.round(df['purchase_amount'] / 0.00150265118 + 497.06, 8)
         df['purchase_amount'] = df.purchase_amount.apply(lambda x: np.round(x))
          
         df['purchase_date']          =  pd.to_datetime(df['purchase_date'])  
         df['first_active_month']     =  pd.to_datetime(df['first_active_month'])  
         df['purchase_hour']          =  df['purchase_date'].dt.hour 
         df['year']                   = df['purchase_date'].dt.year 
         df['month']                  =  df['purchase_date'].dt.month 
         df['day']                    = df['purchase_date'].dt.day 
         df['hour']                   = df['purchase_date'].dt.hour 
         df['weekofyear'] = df['purchase_date'].dt.weekofyear 
         df['dayofweek']              =  df['purchase_date'].dt.dayofweek 
         df['weekend']                =  (df.purchase_date.dt.weekday >=5).astype(int)  
         df                           =  df.sort_values(['card_id','purchase_date'])  
         df['purchase_date_floorday'] =  df['purchase_date'].dt.floor('d')  # truncate the timestamp to day granularity
          
         # relative time since the activation day: 0, 1, 2, ..., max-act
         df['purchase_day_since_active_day']   = (df['purchase_date_floorday'] - df['first_active_month']).dt.days
         df['purchase_month_since_active_day'] = df['purchase_day_since_active_day'].agg(month_trans).values
         df['purchase_week_since_active_day']  = df['purchase_day_since_active_day'].agg(week_trans).values
          
         # relative time before each card's most recent purchase day: 0, 1, 2, ...
         ht_card_id_gp = df.groupby('card_id') 
         df['purchase_day_since_reference_day']   =  ht_card_id_gp['purchase_date_floorday'].transform('max') - df['purchase_date_floorday'] 
         df['purchase_day_since_reference_day']   =  df['purchase_day_since_reference_day'].dt.days 
         # coarser-grained versions (weeks / months since the most recent purchase)
         df['purchase_week_since_reference_day']  = df['purchase_day_since_reference_day'].agg(week_trans).values 
         df['purchase_month_since_reference_day'] = df['purchase_day_since_reference_day'].agg(month_trans).values 
          
         # gap between consecutive transactions of the same card; the per-card shift
         # keeps the diff from crossing card boundaries
         df['purchase_day_diff']   = (df['purchase_date_floorday'] - ht_card_id_gp['purchase_date_floorday'].shift()).dt.days
         df['purchase_week_diff']  = df['purchase_day_diff'].agg(week_trans).values
         df['purchase_month_diff'] = df['purchase_day_diff'].agg(month_trans).values
          
         # time-decayed purchase amounts (daily / weekly / monthly decay factors)
         df['purchase_amount_ddgd_98']  = df['purchase_amount'].values * df['purchase_day_since_reference_day'].apply(lambda x:0.98**x).values
         df['purchase_amount_ddgd_99']  = df['purchase_amount'].values * df['purchase_day_since_reference_day'].apply(lambda x:0.99**x).values     
         df['purchase_amount_wdgd_96']  = df['purchase_amount'].values * df['purchase_week_since_reference_day'].apply(lambda x:0.96**x).values  
         df['purchase_amount_wdgd_97']  = df['purchase_amount'].values * df['purchase_week_since_reference_day'].apply(lambda x:0.97**x).values  
         df['purchase_amount_mdgd_90']  = df['purchase_amount'].values * df['purchase_month_since_reference_day'].apply(lambda x:0.9**x).values 
         df['purchase_amount_mdgd_80']  = df['purchase_amount'].values * df['purchase_month_since_reference_day'].apply(lambda x:0.8**x).values  
          
         df = reduce_mem_usage(df) 
          
         return df 


     historical_transactions = get_expand_common(historical_transactions) 
     new_transactions        = get_expand_common(new_transactions) 
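reduce_mem_usage is the memory-reduction helper commonly used in Kaggle kernels and assumed to be carried over from an earlier installment; a typical version simply downcasts numeric columns, roughly as follows:

     def reduce_mem_usage(df):
         # downcast each numeric column to the smallest dtype that can hold its values
         for col in df.columns:
             if pd.api.types.is_integer_dtype(df[col]):
                 df[col] = pd.to_numeric(df[col], downcast='integer')
             elif pd.api.types.is_float_dtype(df[col]):
                 df[col] = pd.to_numeric(df[col], downcast='float')
         return df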

With data cleaning and time-field processing done, we can start the feature optimization itself. Following the ideas above, we first derive the basic behavioral feature fields:

Construct the basic statistical features:
     def aggregate_transactions(df_, prefix):  
          
         df = df_.copy() 
          
         # months elapsed since the transaction, shifted by month_lag
         # (roughly: months between the transaction and the card's reference month)
         df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30
         df['month_diff'] = df['month_diff'].astype(int)
         df['month_diff'] += df['month_lag']

         # per-installment price and amount-time interaction terms
         df['price'] = df['purchase_amount'] / df['installments']
         df['duration'] = df['purchase_amount'] * df['month_diff']
         df['amount_month_ratio'] = df['purchase_amount'] / df['month_diff']
          
         # convert purchase_date to a unix timestamp in seconds
         df.loc[:, 'purchase_date'] = pd.DatetimeIndex(df['purchase_date']).astype(np.int64) * 1e-9
          
         agg_func = { 
             'category_1':      ['mean'], 
             'category_2':      ['mean'], 
             'category_3':      ['mean'], 
             'installments':    ['mean', 'max', 'min', 'std'], 
             'month_lag':       ['nunique', 'mean', 'max', 'min', 'std'], 
             'month':           ['nunique', 'mean', 'max', 'min', 'std'], 
             'hour':            ['nunique', 'mean', 'max', 'min', 'std'], 
             'weekofyear':      ['nunique', 'mean', 'max', 'min', 'std'], 
             'dayofweek':       ['nunique', 'mean'], 
             'weekend':         ['mean'], 
             'year':            ['nunique'], 
             'card_id':         ['size','count'], 
             'purchase_date':   ['max', 'min'], 
             ### 
             'price':             ['mean','max','min','std'], 
             'duration':          ['mean','min','max','std','skew'], 
             'amount_month_ratio':['mean','min','max','std','skew'], 
             }  
          
         for col in ['category_2','category_3']: 
             df[col+'_mean'] = df.groupby([col])['purchase_amount'].transform('mean') 
             agg_func[col+'_mean'] = ['mean'] 
          
         agg_df = df.groupby(['card_id']).agg(agg_func) 
         agg_df.columns = [prefix + '_'.join(col).strip() for col in agg_df.columns.values] 
         agg_df.reset_index(drop=False, inplace=True) 
        
         return agg_df 
     # authorized_flag is assumed to have been mapped to 1/0 in earlier preprocessing
     print('generate auth statistics features...')
     auth_base_stat = aggregate_transactions(historical_transactions[historical_transactions['authorized_flag']==1], prefix='auth_')
     print('generate hist statistics features...')
     hist_base_stat = aggregate_transactions(historical_transactions[historical_transactions['authorized_flag']==0], prefix='hist_')
     print('generate new statistics features...')
     new_base_stat  = aggregate_transactions(new_transactions, prefix='new_')

    def get_quantile(x, percentiles = [0.1, 0.25, 0.75, 0.9]): 
         x_len = len(x) 
         x = np.sort(x) 
         sts_feas = []   
         for per_ in percentiles: 
             if per_ == 1: 
                 sts_feas.append(x[x_len - 1])  
             else: 
                 sts_feas.append(x[int(x_len * per_)])  
         return sts_feas  
      
     def get_cardf_tran(df_, month = 3, prefix = '_'): 
          
         df = df_.copy()  
         if prefix == 'hist_cardf_': 
             df['month_to_now']  =  (datetime.date(2018, month, 1) - df['purchase_date_floorday'].dt.date).dt.days 
          
         df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30 
         df['month_diff'] = df['month_diff'].astype(int) 
         df['month_diff'] += df['month_lag'] 
          
         print('*'*30,'Part1, whole data','*'*30) 
         cardid_features = pd.DataFrame() 
         cardid_features['card_id'] = df['card_id'].unique()    
         print( '*' * 30, 'Traditional Features', '*' * 30) 
         ht_card_id_gp = df.groupby('card_id')  
         cardid_features['card_id_cnt'] = ht_card_id_gp['authorized_flag'].count().values 
          
         if  prefix == 'hist_cardf_': 
             cardid_features['card_id_isau_mean'] = ht_card_id_gp['authorized_flag'].mean().values 
             cardid_features['card_id_isau_sum'] = ht_card_id_gp['authorized_flag'].sum().values  
          
         cardid_features['month_diff_mean']   = ht_card_id_gp['month_diff'].mean().values 
         cardid_features['month_diff_median'] = ht_card_id_gp['month_diff'].median().values 
          
         if prefix == 'hist_cardf_': 
             cardid_features['reference_day']           =  ht_card_id_gp['purchase_date_floorday'].max().values 
             cardid_features['first_day']               =  ht_card_id_gp['purchase_date_floorday'].min().values  
             cardid_features['activation_day']          =  ht_card_id_gp['first_active_month'].max().values 
             
             # first to activation day 
             cardid_features['first_to_activation_day']  =  (cardid_features['first_day'] - cardid_features['activation_day']).dt.days 
             # activation to reference day  
             cardid_features['activation_to_reference_day']  =  (cardid_features['reference_day'] - cardid_features['activation_day']).dt.days 
             # first to last day  
             cardid_features['first_to_reference_day']  =  (cardid_features['reference_day'] - cardid_features['first_day']).dt.days 
             # reference day to now   
             cardid_features['reference_day_to_now']  =  (datetime.date(2018, month, 1) - cardid_features['reference_day'].dt.date).dt.days  
             # first day to now 
             cardid_features['first_day_to_now']  =  (datetime.date(2018, month, 1) - cardid_features['first_day'].dt.date).dt.days  
              
             print('card_id(month_lag, min to reference day):min') 
             cardid_features['card_id_month_lag_min'] = ht_card_id_gp['month_lag'].agg('min').values    
             # is_purchase_before_activation,first_to_reference_day_divide_activation_to_reference_day 
             cardid_features['is_purchase_before_activation'] = cardid_features['first_to_activation_day'] < 0  
             cardid_features['is_purchase_before_activation'] = cardid_features['is_purchase_before_activation'].astype(int) 
             cardid_features['first_to_reference_day_divide_activation_to_reference_day'] = cardid_features['first_to_reference_day']  / (cardid_features['activation_to_reference_day']  + 0.01) 
             cardid_features['days_per_count'] = cardid_features['first_to_reference_day'].values / cardid_features['card_id_cnt'].values 
         
         if prefix == 'new_cardf_': 
             print(' Eight time features, ')  
             cardid_features['reference_day']           =  ht_card_id_gp['reference_day'].last().values 
             cardid_features['first_day']               =  ht_card_id_gp['purchase_date_floorday'].min().values  
             cardid_features['last_day']                =  ht_card_id_gp['purchase_date_floorday'].max().values 
             cardid_features['activation_day']          =  ht_card_id_gp['first_active_month'].max().values 
             # reference to first day 
             cardid_features['reference_day_to_first_day']  =  (cardid_features['first_day'] - cardid_features['reference_day']).dt.days 
             # reference to last day 
             cardid_features['reference_day_to_last_day']  =  (cardid_features['last_day'] - cardid_features['reference_day']).dt.days   
             # first to last day  
             cardid_features['first_to_last_day']  =  (cardid_features['last_day'] - cardid_features['first_day']).dt.days 
             # activation to first day  
             cardid_features['activation_to_first_day']  =  (cardid_features['first_day'] - cardid_features['activation_day']).dt.days 
             # activation to first day  
             cardid_features['activation_to_last_day']  =  (cardid_features['last_day'] - cardid_features['activation_day']).dt.days 
             # last day to now   
             cardid_features['reference_day_to_now']  =  (datetime.date(2018, month, 1) - cardid_features['reference_day'].dt.date).dt.days  
             # first day to now 
             cardid_features['first_day_to_now']  =  (datetime.date(2018, month, 1) - cardid_features['first_day'].dt.date).dt.days  
              
             print('card_id(month_lag, min to reference day):min') 
             cardid_features['card_id_month_lag_max'] = ht_card_id_gp['month_lag'].agg('max').values   
             cardid_features['first_to_last_day_divide_reference_to_last_day'] = cardid_features['first_to_last_day']  / (cardid_features['reference_day_to_last_day']  + 0.01) 
             cardid_features['days_per_count'] = cardid_features['first_to_last_day'].values / cardid_features['card_id_cnt'].values 
          
         for f in ['reference_day', 'first_day', 'last_day', 'activation_day']: 
             try: 
                 del cardid_features[f] 
             except: 
                print(f, 'not present, skipped')
      
         print('card id(city_id,installments,merchant_category_id,.......):nunique, cnt/nunique')  
         for col in tqdm_notebook(['category_1','category_2','category_3','state_id','city_id','installments','merchant_id', 'merchant_category_id','subsector_id','month_lag','purchase_date_floorday']): 
             cardid_features['card_id_%s_nunique'%col]            =  ht_card_id_gp[col].nunique().values 
             cardid_features['card_id_cnt_divide_%s_nunique'%col] =  cardid_features['card_id_cnt'].values / cardid_features['card_id_%s_nunique'%col].values 
               
         print('card_id(purchase_amount & degrade version ):mean,sum,std,median,quantile(10,25,75,90)')  
         for col in tqdm_notebook(['installments','purchase_amount','purchase_amount_ddgd_98','purchase_amount_ddgd_99','purchase_amount_wdgd_96','purchase_amount_wdgd_97','purchase_amount_mdgd_90','purchase_amount_mdgd_80']): 
             if col =='purchase_amount': 
                 for opt in ['sum','mean','std','median','max','min']: 
                     cardid_features['card_id_' +col+ '_' + opt] = ht_card_id_gp[col].agg(opt).values 
                  
                 cardid_features['card_id_' +col+ '_range'] =  cardid_features['card_id_' +col+ '_max'].values - cardid_features['card_id_' +col+ '_min'].values 
                 percentiles = ht_card_id_gp[col].apply(lambda x:get_quantile(x,percentiles = [0.025, 0.25, 0.75, 0.975]))  
      
                 cardid_features[col + '_2.5_quantile']  = percentiles.map(lambda x:x[0]).values 
                 cardid_features[col + '_25_quantile'] = percentiles.map(lambda x:x[1]).values 
                 cardid_features[col + '_75_quantile'] = percentiles.map(lambda x:x[2]).values 
                 cardid_features[col + '_97.5_quantile'] = percentiles.map(lambda x:x[3]).values 
                 cardid_features['card_id_' +col+ '_range2'] =  cardid_features[col+ '_97.5_quantile'].values - cardid_features[col+ '_2.5_quantile'].values 
                 del cardid_features[col + '_2.5_quantile'],cardid_features[col + '_97.5_quantile'] 
                 gc.collect() 
             else: 
                 for opt in ['sum']: 
                     cardid_features['card_id_' +col+ '_' + opt] = ht_card_id_gp[col].agg(opt).values           
          
         print( '*' * 30, 'Pivot Features', '*' * 30) 
         print('Count  Pivot')  # purchase_month_since_reference_day may duplicate month_lag; percentage pivots lowered the score and are skipped for now; (dayofweek, merchant_cate, state_id) pivots contributed little
         for pivot_col in tqdm_notebook(['category_1','category_2','category_3','month_lag','subsector_id','weekend']):
          
             tmp     = df.groupby(['card_id',pivot_col])['merchant_id'].count().to_frame(pivot_col + '_count') 
             tmp.reset_index(inplace =True)   
               
             tmp_pivot = pd.pivot_table(data=tmp,index = 'card_id',columns=pivot_col,values=pivot_col + '_count',fill_value=0) 
             tmp_pivot.columns = [tmp_pivot.columns.names[0] + '_cnt_pivot_'+ str(col) for col in tmp_pivot.columns] 
             tmp_pivot.reset_index(inplace = True) 
             cardid_features = cardid_features.merge(tmp_pivot, on = 'card_id', how='left') 
            
             if  pivot_col!='weekend' and  pivot_col!='installments': 
                 tmp            = df.groupby(['card_id',pivot_col])['purchase_date_floorday'].nunique().to_frame(pivot_col + '_purchase_date_floorday_nunique')  
                 tmp1           = df.groupby(['card_id'])['purchase_date_floorday'].nunique().to_frame('purchase_date_floorday_nunique')  
                 tmp.reset_index(inplace =True)   
                 tmp1.reset_index(inplace =True)    
                 tmp  = tmp.merge(tmp1, on ='card_id', how='left') 
                 tmp[pivot_col + '_day_nunique_pct'] = tmp[pivot_col + '_purchase_date_floorday_nunique'].values / tmp['purchase_date_floorday_nunique'].values 
               
                 tmp_pivot = pd.pivot_table(data=tmp,index = 'card_id',columns=pivot_col,values=pivot_col + '_day_nunique_pct',fill_value=0) 
                 tmp_pivot.columns = [tmp_pivot.columns.names[0] + '_day_nunique_pct_'+ str(col) for col in tmp_pivot.columns] 
                 tmp_pivot.reset_index(inplace = True) 
                 cardid_features = cardid_features.merge(tmp_pivot, on = 'card_id', how='left') 
          
         if prefix == 'new_cardf_':
             ######## transactions recorded before the card was activated ##############
             print('*'*30,'Part2, data with time less than activation day','*'*30)
             df_part = df.loc[df.purchase_date < df.first_active_month] 
      
             cardid_features_part = pd.DataFrame() 
             cardid_features_part['card_id'] = df_part['card_id'].unique()    
             ht_card_id_part_gp = df_part.groupby('card_id') 
             cardid_features_part['card_id_part_cnt'] = ht_card_id_part_gp['authorized_flag'].count().values 
      
             print('card_id(purchase_amount): sum')  
             for col in tqdm_notebook(['purchase_amount']):  
                 for opt in ['sum','mean']: 
                     cardid_features_part['card_id_part_' +col+ '_' + opt] = ht_card_id_part_gp[col].agg(opt).values 
      
             cardid_features = cardid_features.merge(cardid_features_part, on ='card_id', how='left') 
             cardid_features['card_id_part_purchase_amount_sum_percent'] = cardid_features['card_id_part_purchase_amount_sum'] / (cardid_features['card_id_purchase_amount_sum'] + 0.01) 
      
         cardid_features = reduce_mem_usage(cardid_features) 
          
         new_col_names = [] 
         for col in cardid_features.columns: 
             if col == 'card_id': 
                 new_col_names.append(col) 
             else: 
                 new_col_names.append(prefix + col) 
         cardid_features.columns = new_col_names 
          
         return cardid_features 
     print('auth...') 
     authorized_transactions = historical_transactions.loc[historical_transactions['authorized_flag'] == 1] 
     auth_cardf_tran = get_cardf_tran(authorized_transactions, 3, prefix='auth_cardf_') 
     print('hist...') 
     hist_cardf_tran = get_cardf_tran(historical_transactions, 3, prefix='hist_cardf_') 
     print('new...') 
     # for new transactions, take each card's last historical purchase date as its reference day
     reference_days = historical_transactions.groupby('card_id')['purchase_date'].last().to_frame('reference_day')
     reference_days.reset_index(inplace = True) 
     new_transactions = new_transactions.merge(reference_days, on ='card_id', how='left') 
     new_cardf_tran  = get_cardf_tran(new_transactions, 5, prefix='new_cardf_') 

Next, we look specifically at user behavior over the most recent two months:

     def get_cardf_tran_last2(df_, month = 3, prefix = 'last2_'):

         df = df_.loc[df_.month_lag >= -2].copy()  # keep only the most recent two months (month_lag >= -2)
         print('*'*30,'Part1, whole data','*'*30) 
         cardid_features = pd.DataFrame() 
         cardid_features['card_id'] = df['card_id'].unique()    
          
         df['month_diff'] = ((datetime.datetime.today() - df['purchase_date']).dt.days)//30 
         df['month_diff'] = df['month_diff'].astype(int) 
         df['month_diff'] += df['month_lag'] 
          
         print( '*' * 30, 'Traditional Features', '*' * 30) 
         ht_card_id_gp = df.groupby('card_id') 
         print(' card id : count') 
         cardid_features['card_id_cnt'] = ht_card_id_gp['authorized_flag'].count().values 
          
         cardid_features['card_id_isau_mean'] = ht_card_id_gp['authorized_flag'].mean().values  
         cardid_features['card_id_isau_sum']  = ht_card_id_gp['authorized_flag'].sum().values 
          
         cardid_features['month_diff_mean']   = ht_card_id_gp['month_diff'].mean().values 
      
         print('card id(city_id,installments,merchant_category_id,.......):nunique, cnt/nunique')  
         for col in tqdm_notebook(['state_id','city_id','installments','merchant_id', 'merchant_category_id','purchase_date_floorday']): 
             cardid_features['card_id_%s_nunique'%col] = ht_card_id_gp[col].nunique().values 
             cardid_features['card_id_cnt_divide_%s_nunique'%col] = cardid_features['card_id_cnt'].values / cardid_features['card_id_%s_nunique'%col].values 
               
         for col in tqdm_notebook(['purchase_amount','purchase_amount_ddgd_98','purchase_amount_wdgd_96','purchase_amount_mdgd_90','purchase_amount_mdgd_80']): #,'purchase_amount_ddgd_98','purchase_amount_ddgd_99','purchase_amount_wdgd_96','purchase_amount_wdgd_97','purchase_amount_mdgd_90','purchase_amount_mdgd_80']): 
             if col =='purchase_amount': 
                 for opt in ['sum','mean','std','median']: 
                     cardid_features['card_id_' +col+ '_' + opt] = ht_card_id_gp[col].agg(opt).values   
             else: 
                 for opt in ['sum']: 
                     cardid_features['card_id_' +col+ '_' + opt] = ht_card_id_gp[col].agg(opt).values  
          
         print( '*' * 30, 'Pivot Features', '*' * 30) 
         print('Count  Pivot')  # percentage pivots lowered the score and (dayofweek, merchant_cate, state_id) pivots contributed little, so they are skipped

         for pivot_col in tqdm_notebook(['category_1','category_2','category_3','month_lag','subsector_id','weekend']):
          
             tmp     = df.groupby(['card_id',pivot_col])['merchant_id'].count().to_frame(pivot_col + '_count') 
             tmp.reset_index(inplace =True)   
               
             tmp_pivot = pd.pivot_table(data=tmp,index = 'card_id',columns=pivot_col,values=pivot_col + '_count',fill_value=0) 
             tmp_pivot.columns = [tmp_pivot.columns.names[0] + '_cnt_pivot_'+ str(col) for col in tmp_pivot.columns] 
             tmp_pivot.reset_index(inplace = True) 
             cardid_features = cardid_features.merge(tmp_pivot, on = 'card_id', how='left') 
            
             if  pivot_col!='weekend' and  pivot_col!='installments': 
                 tmp            = df.groupby(['card_id',pivot_col])['purchase_date_floorday'].nunique().to_frame(pivot_col + '_purchase_date_floorday_nunique')  
                 tmp1           = df.groupby(['card_id'])['purchase_date_floorday'].nunique().to_frame('purchase_date_floorday_nunique')  
                 tmp.reset_index(inplace =True)   
                 tmp1.reset_index(inplace =True)    
                 tmp  = tmp.merge(tmp1, on ='card_id', how='left') 
                 tmp[pivot_col + '_day_nunique_pct'] = tmp[pivot_col + '_purchase_date_floorday_nunique'].values / tmp['purchase_date_floorday_nunique'].values 
               
                 tmp_pivot = pd.pivot_table(data=tmp,index = 'card_id',columns=pivot_col,values=pivot_col + '_day_nunique_pct',fill_value=0) 
                 tmp_pivot.columns = [tmp_pivot.columns.names[0] + '_day_nunique_pct_'+ str(col) for col in tmp_pivot.columns] 
                 tmp_pivot.reset_index(inplace = True) 
                 cardid_features = cardid_features.merge(tmp_pivot, on = 'card_id', how='left') 
           
         cardid_features = reduce_mem_usage(cardid_features) 
          
         new_col_names = [] 
         for col in cardid_features.columns: 
             if col == 'card_id': 
                 new_col_names.append(col) 
             else: 
                 new_col_names.append(prefix + col) 
         cardid_features.columns = new_col_names 
          
         return cardid_features   
      
     hist_cardf_tran_last2 = get_cardf_tran_last2(historical_transactions, month = 3, prefix = 'hist_last2_') 
And then derive the second-order cross features:

    def successive_aggregates(df_, prefix = 'levelAB_'): 
         df = df_.copy() 
         cardid_features = pd.DataFrame() 
         cardid_features['card_id'] = df['card_id'].unique()     
           
         level12_nunique = [('month_lag','state_id'),('month_lag','city_id'),('month_lag','subsector_id'),('month_lag','merchant_category_id'),('month_lag','merchant_id'),('month_lag','purchase_date_floorday'),
                            ('subsector_id','merchant_category_id'),('subsector_id','merchant_id'),('subsector_id','purchase_date_floorday'),('subsector_id','month_lag'),
                            ('merchant_category_id', 'merchant_id'),('merchant_category_id','purchase_date_floorday'),('merchant_category_id','month_lag'),
                            ('purchase_date_floorday', 'merchant_id'),('purchase_date_floorday','merchant_category_id'),('purchase_date_floorday','subsector_id')]
         for col_level1,col_level2 in tqdm_notebook(level12_nunique):   
              
             level1  = df.groupby(['card_id',col_level1])[col_level2].nunique().to_frame(col_level2 + '_nunique') 
             level1.reset_index(inplace =True)   
               
             level2 = level1.groupby('card_id')[col_level2 + '_nunique'].agg(['mean', 'max', 'std']) 
             level2 = pd.DataFrame(level2) 
             level2.columns = [col_level1 + '_' + col_level2 + '_nunique_' + col for col in level2.columns.values] 
             level2.reset_index(inplace = True) 
              
             cardid_features = cardid_features.merge(level2, on='card_id', how='left')  
          
         level12_count = ['month_lag','state_id','city_id','subsector_id','merchant_category_id','merchant_id','purchase_date_floorday'] 
         for col_level in tqdm_notebook(level12_count):  
          
             level1  = df.groupby(['card_id',col_level])['merchant_id'].count().to_frame(col_level + '_count') 
             level1.reset_index(inplace =True)   
               
             level2 = level1.groupby('card_id')[col_level + '_count'].agg(['mean', 'max', 'std']) 
             level2 = pd.DataFrame(level2) 
             level2.columns = [col_level + '_count_' + col for col in level2.columns.values] 
             level2.reset_index(inplace = True) 
              
             cardid_features = cardid_features.merge(level2, on='card_id', how='left')  
          
         level12_meansum = [('month_lag','purchase_amount'),('state_id','purchase_amount'),('city_id','purchase_amount'),('subsector_id','purchase_amount'),
                            ('merchant_category_id','purchase_amount'),('merchant_id','purchase_amount'),('purchase_date_floorday','purchase_amount')]
         for col_level1,col_level2 in tqdm_notebook(level12_meansum):  
          
             level1  = df.groupby(['card_id',col_level1])[col_level2].sum().to_frame(col_level2 + '_sum') 
             level1.reset_index(inplace =True)   
               
             level2 = level1.groupby('card_id')[col_level2 + '_sum'].agg(['mean', 'max', 'std']) 
             level2 = pd.DataFrame(level2) 
             level2.columns = [col_level1 + '_' + col_level2 + '_sum_' + col for col in level2.columns.values] 
             level2.reset_index(inplace = True) 
      
             cardid_features = cardid_features.merge(level2, on='card_id', how='left')            
          
         cardid_features = reduce_mem_usage(cardid_features) 
          
         new_col_names = [] 
         for col in cardid_features.columns: 
             if col == 'card_id': 
                 new_col_names.append(col) 
             else: 
                 new_col_names.append(prefix + col) 
         cardid_features.columns = new_col_names 
          
         return cardid_features   
      
     print('hist...') 
     hist_levelAB = successive_aggregates(historical_transactions, prefix = 'hist_levelAB_') 
Next, merge all of the derived features back into train and test:

     print('#_____ basic statistical features')
     train = pd.merge(train, auth_base_stat, on='card_id', how='left') 
     test  = pd.merge(test,  auth_base_stat, on='card_id', how='left') 
     train = pd.merge(train, hist_base_stat, on='card_id', how='left') 
     test  = pd.merge(test,  hist_base_stat, on='card_id', how='left') 
     train = pd.merge(train, new_base_stat , on='card_id', how='left') 
     test  = pd.merge(test,  new_base_stat , on='card_id', how='left') 
     print(train.shape) 
     print(test.shape) 
     print('#_____ global card_id features')
     train = pd.merge(train, auth_cardf_tran, on='card_id', how='left') 
     test  = pd.merge(test,  auth_cardf_tran, on='card_id', how='left') 
     train = pd.merge(train, hist_cardf_tran, on='card_id', how='left') 
     test  = pd.merge(test,  hist_cardf_tran, on='card_id', how='left') 
     train = pd.merge(train, new_cardf_tran , on='card_id', how='left') 
     test  = pd.merge(test,  new_cardf_tran , on='card_id', how='left') 
     print(train.shape) 
     print(test.shape) 
     print('#_____ last-two-months card_id features')
     train = pd.merge(train, hist_cardf_tran_last2, on='card_id', how='left') 
     test  = pd.merge(test,  hist_cardf_tran_last2, on='card_id', how='left') 
     print(train.shape) 
     print(test.shape) 
     print('#_____ supplementary second-order features')
     train = pd.merge(train, hist_levelAB, on='card_id', how='left') 
     test  = pd.merge(test,  hist_levelAB, on='card_id', how='left') 
     print(train.shape) 
     print(test.shape) 

On top of these, we add a handful of features derived through simple arithmetic combinations:

     train['outliers'] = 0
     train.loc[train['target'] < -30, 'outliers'] = 1  # flag the abnormal scores (target < -30)
     train['outliers'].value_counts()
     # mean-encode the outlier rate over each anonymous categorical feature
     for f in ['feature_1','feature_2','feature_3']:
         colname = f+'_outliers_mean'
         order_label = train.groupby([f])['outliers'].mean()
         for df in [train, test]:
             df[colname] = df[f].map(order_label)
      
     for df in [train, test]: 
          
         df['days_feature1'] = df['elapsed_time'] * df['feature_1'] 
         df['days_feature2'] = df['elapsed_time'] * df['feature_2'] 
         df['days_feature3'] = df['elapsed_time'] * df['feature_3'] 
      
         df['days_feature1_ratio'] = df['feature_1'] / df['elapsed_time'] 
         df['days_feature2_ratio'] = df['feature_2'] / df['elapsed_time'] 
         df['days_feature3_ratio'] = df['feature_3'] / df['elapsed_time'] 
      
         df['feature_sum'] = df['feature_1'] + df['feature_2'] + df['feature_3'] 
         df['feature_mean'] = df['feature_sum']/3 
         df['feature_max'] = df[['feature_1', 'feature_2', 'feature_3']].max(axis=1) 
         df['feature_min'] = df[['feature_1', 'feature_2', 'feature_3']].min(axis=1) 
         df['feature_var'] = df[['feature_1', 'feature_2', 'feature_3']].std(axis=1) 
          
         df['card_id_total'] = df['hist_card_id_size']+df['new_card_id_size'] 
         df['card_id_cnt_total'] = df['hist_card_id_count']+df['new_card_id_count'] 
         df['card_id_cnt_ratio'] = df['new_card_id_count']/df['hist_card_id_count'] 
         df['purchase_amount_total'] = df['hist_cardf_card_id_purchase_amount_sum']+df['new_cardf_card_id_purchase_amount_sum'] 
         df['purchase_amount_ratio'] = df['new_cardf_card_id_purchase_amount_sum']/df['hist_cardf_card_id_purchase_amount_sum'] 
         df['month_diff_ratio'] = df['new_cardf_month_diff_mean']/df['hist_cardf_month_diff_mean'] 
         df['installments_total'] = df['new_cardf_card_id_installments_sum']+df['auth_cardf_card_id_installments_sum'] 
         df['installments_ratio'] = df['new_cardf_card_id_installments_sum']/df['auth_cardf_card_id_installments_sum'] 
         df['price_total'] = df['purchase_amount_total']/df['installments_total'] 
         df['new_CLV'] = df['new_card_id_count'] * df['new_cardf_card_id_purchase_amount_sum'] / df['new_cardf_month_diff_mean'] 
         df['hist_CLV'] = df['hist_card_id_count'] * df['hist_cardf_card_id_purchase_amount_sum'] / df['hist_cardf_month_diff_mean'] 
         df['CLV_ratio'] = df['new_CLV'] / df['hist_CLV'] 

3. Feature Selection

With all features created, we can move on to feature selection. Here we filter manually, dropping some overly sparse features, and then save the result locally:

     del_cols = []
     for col in train.columns:
         if 'subsector_id_cnt_' in col and 'new_cardf' in col:
             del_cols.append(col)
     del_cols1 = []
     for col in train.columns:
         if 'subsector_id_cnt_' in col and 'hist_last2_' in col:
             del_cols1.append(col)
     del_cols2 = []
     for col in train.columns:
         if 'subsector_id_cnt_' in col and 'auth_cardf' in col:
             del_cols2.append(col)
     del_cols3 = []
     for col in train.columns:
         if 'merchant_category_id_month_lag_nunique_' in col and '_pivot_supp' in col:
             del_cols3.append(col)
         if 'city_id' in col and '_pivot_supp' in col:
             del_cols3.append(col)
         if 'month_diff' in col and 'hist_last2_' in col:
             del_cols3.append(col)
         if 'month_diff_std' in col or 'month_diff_gap' in col:
             del_cols3.append(col)
     # keep numeric, non-datetime columns that are not in any of the deletion lists
     fea_cols = [col for col in train.columns
                 if train[col].dtypes != 'object' and train[col].dtypes != '<M8[ns]'
                 and col not in del_cols and col not in del_cols1 and col not in del_cols2 and col not in del_cols3
                 and col not in ['target', 'target1', 'min_num', 'card_id_cnt_ht_pivot_supp']]
     print('before deletion:', train.shape[1])
     print('after deletion:', len(fea_cols))
      
     train = train[fea_cols+['target']] 
     fea_cols.remove('outliers') 
     test = test[fea_cols] 
      
     train.to_csv('./data/all_train_features.csv',index=False) 
     test.to_csv('./data/all_test_features.csv',index=False) 

In later runs, the saved features can be loaded back as follows:

     ## load all features 
     train = pd.read_csv('./data/all_train_features.csv') 
     test  = pd.read_csv('./data/all_test_features.csv') 
      
     # a couple of ratio features can produce inf; cap them at the largest finite value observed
     # (ntrain is built below from the cleaned train, so it inherits this replacement)
     inf_cols = ['new_cardf_card_id_cnt_divide_installments_nunique', 'hist_last2_card_id_cnt_divide_installments_nunique']
     train[inf_cols] = train[inf_cols].replace(np.inf, train[inf_cols].replace(np.inf, -99).max().max())
     test[inf_cols] = test[inf_cols].replace(np.inf, test[inf_cols].replace(np.inf, -99).max().max())
      
     # ## load sparse 
     # train_tags = sparse.load_npz('train_tags.npz') 
     # test_tags  = sparse.load_npz('test_tags.npz') 
      
     ## indices of the non-outlier samples
     normal_index = train[train['outliers']==0].index.tolist()
     ## training subset without outliers
     ntrain = train[train['outliers'] == 0]
      
     target        = train['target'].values 
     ntarget       = ntrain['target'].values 
     target_binary = train['outliers'].values 
     ### 
     y_train        = target 
     y_ntrain       = ntarget 
     y_train_binary = target_binary 
      
     print('train:',train.shape) 
     print('ntrain:',ntrain.shape) 
