UnionPay College Geek Challenge - Data Modeling Track Summary (Pre-Final)

Competition Review

Problem Description

The competition provides roughly 40,000 anonymized, sampled records of user consumption behavior; the task is to predict whether, over a future period, a user will purchase or favorite a given product.

Data Background

The dataset consists of anonymized, sampled consumption behavior of users on a website, covering roughly three kinds of information: basic user information, basic product information, and user behavior information. Apart from the user ID, all user fields are anonymized attributes; apart from the seller ID and product ID, all product fields are anonymized attributes; the behavior information contains two explicit user actions (favorite and purchase), and everything else is an anonymized behavior attribute.

Field List

Dates are uniformly numbered: the first day is 01, the second day 02, and so on. None of the features (other than those described above) has a specific stated meaning.

Column | Type | Description | Example
user_id | String | Unique user ID (anonymized) | abc123
Product_id | String | Product ID (anonymized) | abc123
seller | String | Seller ID (anonymized) | abc123
day | String | Date | 1,2,...,30
action_type | Int | Behavior-related variable (anonymized) | 1
ProductInfo_X | Int | Product-related variable | 1,50,100
WebInfo_X | Int | Web-behavior-related variable | 1,50,100
UserInfo_X | Int | User-related variable | 1,50,100
purchase | Int | Whether the user purchased | 0,1
favorite | Int | Whether the user favorited | 0,1


Evaluation Criteria

1. The evaluation metric is AUC (Area Under the Curve); an AUC is computed separately for the purchase predictions and the favorite predictions.

The final score is AUC* = (AUC(purchase) + AUC(favorite)) / 2.

2. AUC* is the contestant's score: the larger the AUC*, the better the result and the higher the team ranks.
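
For reference, a minimal sketch of computing this score with scikit-learn; the label and prediction arrays below are illustrative placeholders, not competition data.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def combined_auc(y_purchase, pred_purchase, y_favorite, pred_favorite):
        # AUC* = (AUC(purchase) + AUC(favorite)) / 2
        return (roc_auc_score(y_purchase, pred_purchase) +
                roc_auc_score(y_favorite, pred_favorite)) / 2

    # toy usage with random labels and scores
    rng = np.random.default_rng(0)
    y_p, y_f = rng.integers(0, 2, 1000), rng.integers(0, 2, 1000)
    print(combined_auc(y_p, rng.random(1000), y_f, rng.random(1000)))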

 

Data Processing and Exploratory Analysis

The training and test sets were first analyzed with pandas-profiling.
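
A minimal sketch of that profiling step, assuming the raw files are named train.csv and test.csv:

    import pandas as pd
    from pandas_profiling import ProfileReport

    train_data = pd.read_csv('train.csv')
    test_data = pd.read_csv('test.csv')

    # HTML reports for inspecting missing values, near-constant columns,
    # and differences between the train and test distributions
    ProfileReport(train_data, title='train').to_file('train_profile.html')
    ProfileReport(test_data, title='test').to_file('test_profile.html')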

  1. Missing values

    Two columns have a relatively large number of missing values; they are filled with the column mean, and the few negative entries are treated as outliers and replaced with their absolute values (shown here for UserInfo_92).

    # fill missing values with the column mean, then take absolute values to remove negative outliers
    train_test['UserInfo_92']=train_test['UserInfo_92'].fillna(train_test['UserInfo_92'].mean())
    train_test['UserInfo_92']=train_test['UserInfo_92'].abs()
  2. String type conversion: convert user_id, Product_id, seller, and day to numeric types
    # encode string features; factorize returns a (codes, uniques) tuple, keep the codes as unsigned ints
    import numpy as np

    train_test["user_id"] = pd.factorize(train_test["user_id"])[0].astype(np.uint16)
    train_test["Product_id"] = pd.factorize(train_test["Product_id"])[0].astype(np.uint16)
    train_test["seller"] = pd.factorize(train_test["seller"])[0].astype(np.uint16)
    train_test["day"] = pd.to_numeric(train_test["day"])

     

  3. Concatenate the training and test sets on their shared columns into a single train_test DataFrame, and keep the two labels separately
    # target labels
    lable_purchase = train_data['purchase']
    lable_favorite = train_data['favorite']
    
    # align train to the test columns, then concatenate train and test
    col_name=train_data.columns
    features=test_data.columns.tolist()
    train_data=train_data[features]
    train_test=pd.concat([train_data,test_data],sort=False)

     

 

Purchase Task

Preliminary Round

  1. Data preprocessing
  • Drop the string-type features
    train_test= train_test.drop(['user_id','Product_id','seller'], axis=1)

     

  2. Feature selection and crossing
  • Build statistical features from the remaining columns: for each column, take the value minus the column's standard deviation as a new feature A, and the absolute value of A as a new feature B

    col_name=data_test.columns.tolist()
    for i in col_name:
        data_test[i+'std']=data_test[i]-data_test[i].std()# difference from the column's standard deviation
        data_test[i+'std_abs']=data_test[i+'std'].abs()# absolute value of that difference
  3. Model selection
  • XGBoost is used as the model
    import xgboost as xgb
    def xgb_model(X_t, X_v, y_t, y_v, test):
        print("XGB model start")

        xgb_val = xgb.DMatrix(X_v, label=y_v)
        xgb_train = xgb.DMatrix(X_t, label=y_t)
        xgb_test = xgb.DMatrix(test)

        params = {
                  'booster': 'gbtree',
                  'objective': 'binary:logistic',
                  'eval_metric': 'auc',
                  'lambda': 10, # L2 regularization on weights; larger values make the model less prone to overfitting
                  'subsample': 0.7, # fraction of training rows sampled for each tree
                  'silent': True, # suppress the per-iteration log output
                  'eta': 0.01, # learning rate
                  'n_jobs': -1,
                  }

        plst = list(params.items())
        num_rounds = 5000 # maximum number of boosting rounds
        watchlist = [(xgb_train, 'train'), (xgb_val, 'val')]

        model = xgb.train(plst, xgb_train, num_rounds, watchlist, early_stopping_rounds=200)# train with early stopping on validation AUC

        res = model.predict(xgb_test)# predict on the test set
        print('test prediction finished')
        return res # return the test-set predictions

     

  4. Result fusion

Semifinal Round

  1. Data preprocessing
  • Drop columns dominated by a single value: 'ProductInfo_84' (83.1%), 'action_type' (96.1%), 'ProductInfo_107' (99.3%), plus 'UserInfo_112' and 'UserInfo_148'; see the sketch below
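A minimal sketch of this step (column names taken from the bullet above; the dominant-value check via value_counts is illustrative, not the author's exact code):

    # share of the most frequent value in each column, to spot near-constant features
    top_ratio = train_test.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
    print(top_ratio.sort_values(ascending=False).head(10))

    # drop the columns identified above
    train_test = train_test.drop(['ProductInfo_84', 'action_type', 'ProductInfo_107',
                                  'UserInfo_112', 'UserInfo_148'], axis=1)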
  2. Feature selection
  • Cross features: cross the four attributes 'user_id','Product_id','seller','day' (two-way and three-way combinations), then keep only the combinations that improve the model; a sketch follows below
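A minimal, illustrative sketch of generating such two-way and three-way groupby count features; the combinations that actually survived the screening are not listed in the write-up, so the loop below is an assumption:

    from itertools import combinations

    base_cols = ['user_id', 'Product_id', 'seller', 'day']
    for k in (2, 3):
        for cols in combinations(base_cols, k):
            # count over each 2-way / 3-way group as a candidate cross feature
            name = '_'.join(cols) + '_count'
            train_test[name] = train_test.groupby(list(cols))[cols[0]].transform('count')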
  • Purchase-rate encoding: for every int feature, compute the purchase rate of each category (trans_list), and keep for encoding the features where zero-purchase-rate categories make up a non-trivial share (the features removed from trans_list, whose zero-purchase-rate categories account for only 0.0% or 0.1%, form dlist). A category with purchase rate 0 is encoded as 0 and any other category as 1, turning each column into a binary purchased/not-purchased indicator to make the model more robust (the original features are not kept).
    trans_list=['UserInfo_70','UserInfo_266','UserInfo_244','UserInfo_246','ProductInfo_46','ProductInfo_53',
    'UserInfo_62','UserInfo_119','UserInfo_216','UserInfo_151','ProductInfo_102','UserInfo_167',
    'UserInfo_6','UserInfo_223','UserInfo_45','UserInfo_226','UserInfo_180','UserInfo_254',
    'UserInfo_166','UserInfo_168','ProductInfo_81','UserInfo_18','ProductInfo_19','ProductInfo_7',
    'ProductInfo_191','ProductInfo_73','WebInfo_2','ProductInfo_6','ProductInfo_37','UserInfo_22',
    'UserInfo_198','UserInfo_230','UserInfo_174','UserInfo_51','UserInfo_38','UserInfo_171',
    'UserInfo_116','UserInfo_145','ProductInfo_45','UserInfo_154','UserInfo_78','UserInfo_10',
    'UserInfo_123','UserInfo_107','UserInfo_50','ProductInfo_62','UserInfo_41','ProductInfo_125',
    'UserInfo_68','UserInfo_124','UserInfo_87','ProductInfo_94','ProductInfo_104','UserInfo_35',
    'UserInfo_239','UserInfo_89','ProductInfo_18','UserInfo_252','UserInfo_177','UserInfo_150',
    'UserInfo_212','UserInfo_236','UserInfo_257','UserInfo_55','UserInfo_164','UserInfo_40','ProductInfo_215',
    'ProductInfo_63','ProductInfo_90','UserInfo_26','UserInfo_120','UserInfo_221','UserInfo_247',
    'UserInfo_39','UserInfo_110','UserInfo_214','UserInfo_241','ProductInfo_10','ProductInfo_44',
    'UserInfo_32','UserInfo_111','UserInfo_170','ProductInfo_179','UserInfo_163','UserInfo_175',
    'UserInfo_182']# columns to encode: int features with many zero-purchase-rate categories
    dlist=['UserInfo_89','UserInfo_41','UserInfo_266','UserInfo_26','UserInfo_236',
    'UserInfo_230','UserInfo_221','UserInfo_198','UserInfo_174','UserInfo_170',
    'UserInfo_168','UserInfo_163','UserInfo_10','ProductInfo_6','ProductInfo_46','UserInfo_107',
    'UserInfo_110','UserInfo_116','UserInfo_119','UserInfo_120','UserInfo_123','UserInfo_124',
    'UserInfo_154','UserInfo_164','UserInfo_166','UserInfo_171','UserInfo_175','UserInfo_180',
    'UserInfo_182','UserInfo_214','UserInfo_223','UserInfo_241','UserInfo_252','UserInfo_254',
    'UserInfo_257','UserInfo_35','UserInfo_45','UserInfo_50','UserInfo_62','UserInfo_68','UserInfo_70',
    'UserInfo_78','UserInfo_87']# dropped: features whose zero-purchase-rate categories account for only 0.0% or 0.1%
    train_data['purchase']=lable_purchase # train_data has no purchase column at this point; add it temporarily and drop it when done
    for k in trans_list: # k is a column, j is a category value within that column
        feature=train_data[k].unique()
        for j in feature:
            # tmp is the purchase rate of category j
            tmp=train_data[train_data[k]==j]['purchase'].sum()/train_data[train_data[k]==j].shape[0]
            if tmp==0:# zero purchase rate -> encode as 0
                train_test.loc[(train_test[k]==j),k]=-9999 # sentinel values avoid clashing with existing 0/1 values in the column
            else:
                train_test.loc[(train_test[k]==j),k]=9999
        train_test.loc[(train_test[k]==9999),k]=1
        train_test.loc[(train_test[k]==-9999),k]=0
    train_data = train_data.drop('purchase', axis=1)

     

  • Statistical features: for every column, add (1) the difference between the value and the column's standard deviation and (2) the absolute value of (1) (same operation as the preliminary round)
  3. Model selection
  • Same as the preliminary round: XGBoost with essentially the same parameters
  4. Result fusion
  • Blend with UserInfo_82 at 1:9. Negating the UserInfo_82 column and min-max normalizing it gives an online AUC of 0.556 when submitted on its own, about the same as training XGBoost without the purchase-rate encoding, which suggests the negated, normalized column roughly reflects the purchase tendency. With the purchase-rate encoding, XGBoost scores 0.586, and blending that result with UserInfo_82 raises the online score to 0.594.
    purchase_result = xgb_model(X_train_p, X_valid_p, y_train_p, y_valid_p, test)# XGBoost predictions after the purchase-rate encoding
    
    # negate UserInfo_82, then min-max normalize it
    test_data['UserInfo_82']=-test_data['UserInfo_82']
    test_data['UserInfo_82']=(test_data['UserInfo_82']-test_data['UserInfo_82'].min())/(test_data['UserInfo_82'].max()-test_data['UserInfo_82'].min())
    u82=test_data['UserInfo_82'] # submitted alone: online AUC 0.556
    
    res=purchase_result*0.9+u82*0.1# blend the model output with u82 at 9:1

This raises a question: why negate the UserInfo_82 column before normalizing it?

Favorite Task

Preliminary Round

  1. Data preprocessing
  • Same as the preliminary purchase task
  2. Feature selection and crossing
  • Hand-picked cross features (all of them interpretable; in the semifinal round the features were crossed combinatorially, which also produced some cross features with no obvious meaning but good performance)
# hand-pick some interpretable cross features
train_test['browse_Product_un']=train_test.groupby("user_id").Product_id.transform('count')
train_test['browse_seller']=train_test.groupby("user_id").seller.transform('nunique')
train_test['browse_days']=train_test.groupby("user_id").day.transform('nunique')
train_test['last_days'] = train_test.groupby("user_id").day.transform('max')
train_test['later_days'] = train_test.groupby("user_id").day.transform('min')
train_test['Product_count']=train_test.groupby('Product_id')['Product_id'].transform('count')
train_test['user_action']=train_test.groupby('user_id')['action_type'].transform('nunique')
train_test['seller_seller']=train_test.groupby('seller')['seller'].transform('count')
train_test['day_seller_count']=train_test.groupby('day')['seller'].transform('count')
train_test['user_seller_user']=train_test.groupby(['user_id','seller'])['user_id'].transform('count')
train_test['Product_day_user']=train_test.groupby(['Product_id','day'])['user_id'].transform('count')
train_test['seller_day_user']=train_test.groupby(['seller','day'])['user_id'].transform('count')
train_test['day_user_count']=train_test.groupby('day')['day'].transform('count')
train_test['seller_user']=train_test.groupby(['seller'])['seller'].transform('count')
train_test['u123_']=train_test.groupby(['user_id'])['UserInfo_123'].transform('count')

# --- total web-behavior count
train_test['webaction1']=train_test['WebInfo_1']+train_test['WebInfo_2']+train_test['WebInfo_3']

# --- max and min Product_id per day
train_test['product_day_max']=train_test.groupby(['day'])['Product_id'].transform('max')
train_test['product_day_min']=train_test.groupby(['day'])['Product_id'].transform('min')
  • Statistical features
    # ~~~~~~~~~~~~~~~~~~~~ float64 column handling ~~~~~~~~~~~~~~~~~~~~~~~~
    col_name=train_test.columns.tolist()
    product_float = []
    user_float = []
    web_float = []
    pu_col = []
    
    product_int = []
    user_int = []
    web_int = []
    
    for i in range(0, len(col_name)):
        if train_test[col_name[i]].dtype == 'float64':
            pu_col.append(col_name[i])
            if col_name[i][0] =='P':
                product_float.append(col_name[i])
            elif col_name[i][0] =='U':
                user_float.append(col_name[i])
            elif col_name[i][0] =='W':# WebInfo has no float64 columns
                web_float.append(col_name[i])
        if train_test[col_name[i]].dtype == 'int64':
            if col_name[i][0] =='P':
                product_int.append(col_name[i])
            elif col_name[i][0] =='U':
                user_int.append(col_name[i])
            elif col_name[i][0] =='W':
                web_int.append(col_name[i])
    
    for i in product_float:
        train_test[i] = train_test[i].apply(lambda x:np.exp(x))
        train_test[i+'_mean']=train_test.groupby(['Product_id'])[i].transform('mean')
        train_test[i+"_min"]=train_test.groupby('Product_id')[i].transform("min")
        train_test[i+"_max"]=train_test.groupby('Product_id')[i].transform("max")
        train_test[i+"_median"]=train_test.groupby('Product_id')[i].transform("median")
    for i in user_float:
        train_test[i] = train_test[i].apply(lambda x:np.exp(x))
        train_test[i+'_mean']=train_test.groupby(['user_id'])[i].transform('mean')
        train_test[i+"_min"]=train_test.groupby('user_id')[i].transform("min")
        train_test[i+"_max"]=train_test.groupby('user_id')[i].transform("max")
        train_test[i+"_median"]=train_test.groupby('user_id')[i].transform("median")
        train_test['seller_product_score_2'] = train_test.groupby(['seller','Product_id'])[i].transform('mean')
    
    for i in user_int:# CV 0.633645, online 0.60074
        train_test[i+"u_cont"]=train_test.groupby(['user_id'])[i].transform('count') # 0.628459
        train_test[i+"u_contm"]=train_test.groupby(['user_id'])[i].transform('mean') # 0.632236
    
    for i in product_int:
        train_test[i+"p_cont"]=train_test.groupby(['Product_id'])[i].transform('count')
        train_test[i+"p_contm"]=train_test.groupby(['Product_id'])[i].transform('mean')
    
    
    # --- sum all ProductInfo columns (0.634057)
    train_test['psum'] = np.zeros(train_test.shape[0])
    for i in range(1, len(train_test.columns)):
        if train_test.columns[i][0]=='P':
            train_test['psum'] = train_test['psum']+train_test[train_test.columns[i]]
    train_test['psum'] = train_test['psum'].apply(lambda x:np.log(x+1))
    
    
    # int64 columns whose train/test distributions look inconsistent in the profiling plots
    d_col=['UserInfo_150','UserInfo_14','UserInfo_216','UserInfo_239','UserInfo_22','UserInfo_6','UserInfo_246','WebInfo_2']
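
The mismatch above was judged visually from the profiling plots; a small sketch of checking it numerically with a two-sample Kolmogorov-Smirnov test is shown below (it assumes, as in the concatenation step earlier, that the training rows come first in train_test):

    from scipy.stats import ks_2samp

    train_part = train_test.iloc[:train_data.shape[0]]
    test_part = train_test.iloc[train_data.shape[0]:]
    for col in d_col:
        # small KS statistic / large p-value -> similar train and test distributions
        stat, p_value = ks_2samp(train_part[col].dropna(), test_part[col].dropna())
        print(f'{col}: KS={stat:.3f}, p={p_value:.3g}')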

     

  3. Model selection
  • Same model as the preliminary purchase task: XGBoost

Semifinal Round

  1. Data preprocessing
  • Same as the semifinal purchase task
  2. Feature selection and crossing
  • The favorite model reuses some of the interpretable cross features from the preliminary round
  • Left as-is, the anonymized features hurt the model, so they are dropped; only 'user_id','seller','Product_id','day','action_type','WebInfo_1','WebInfo_3','UserInfo_123' are kept, and two-way and three-way cross features are built from them
  • Unlike the purchase task, the favorite cross features come from enumerating combinations and recursively keeping the ones that improve the model, so some of them have no real-world meaning.
    features=['user_id','seller','Product_id','day','action_type','WebInfo_1','WebInfo_3','UserInfo_123']
    train_data_f=train_data[features]
    test_data_f=test_data[features]
    train_test_f=pd.concat([train_data_f,test_data_f],sort=False)
    train_test_f=train_test_f.fillna(train_test_f.mean())
    
    # from the user's perspective
    train_test_f['u_p_count']=train_test_f.groupby("user_id").Product_id.transform('count')# 0.6900794 number of products a user browsed
    train_test_f['u_p_nunique']=train_test_f.groupby("user_id").Product_id.transform('nunique')# 0.735705288 number of distinct products a user browsed
    train_test_f['u_s_nunique']=train_test_f.groupby("user_id").seller.transform('nunique')## number of distinct sellers a user browsed
    train_test_f['u_d_nunique']=train_test_f.groupby("user_id").day.transform('nunique')# number of distinct days the user was active
    train_test_f['last_day'] = train_test_f.groupby("user_id").day.transform('max')# 0.691511 latest day the user browsed
    train_test_f['first_day'] = train_test_f.groupby("user_id").day.transform('min')# earliest day the user browsed
    train_test_f['during_days'] = train_test_f['last_day'] - train_test_f['first_day']# span of days the user kept browsing
    train_test_f['u_s_u_count']=train_test_f.groupby(['user_id','seller'])['user_id'].transform('count')# 0.734514318 strong feature
    
    # from the product's perspective
    train_test_f['p_p_count']=train_test_f.groupby('Product_id')['Product_id'].transform('count')# 0.709898 number of times a product appears; strong feature
    train_test_f['p_u_nunique']=train_test_f.groupby('Product_id')['user_id'].transform('nunique')# number of distinct users per product
    train_test_f['p_d_u_count']=train_test_f.groupby(['Product_id','day'])['user_id'].transform('count')#0.712725 daily user views per product
    train_test_f['p_u_count']=train_test_f.groupby(['Product_id'])['user_id'].transform('count')# total user views per product
    train_test_f['p_d_unique']=train_test_f.groupby(['Product_id']).day.transform('nunique')# 0.735715770
    train_test_f['p_u_nunique']=train_test_f.groupby(['Product_id'])['user_id'].transform('nunique') #0.73514088
    
    # from the seller's perspective
    train_test_f['s_s_count']=train_test_f.groupby('seller')['seller'].transform('count')# 0.7134725 number of times a seller appears
    train_test_f['s_u_count']=train_test_f.groupby('seller')['user_id'].transform('count')# user views per seller
    train_test_f['s_u_nunique']=train_test_f.groupby('seller')['user_id'].transform('nunique')# distinct users per seller
    train_test_f['s_d_u_count']=train_test_f.groupby(['seller','day'])['user_id'].transform('count')# 0.7252421 daily user views per seller
    train_test_f['s_p_count']=train_test_f.groupby(['seller'])['Product_id'].transform('count')# number of products per seller
    
    # from the day's perspective
    train_test_f['d_u_count']=train_test_f.groupby('day')['day'].transform('count')# 0.7256288 number of records on each day
    train_test_f['d_s_count']=train_test_f.groupby('day')['seller'].transform('count')# 0.7257619 seller occurrences per day
    train_test_f['d_u_count']=train_test_f.groupby('day')['user_id'].transform('count')# 0.7257619 user occurrences per day
    train_test_f['d_p_count']=train_test_f.groupby('day')['Product_id'].transform('count')# product occurrences per day
    
    # u123
    train_test_f['u_u123_count']=train_test_f.groupby(['user_id'])['UserInfo_123'].transform('count')# 0.735175687
    
    #======= three-way cross features: add them all, then keep what survives screening =======
    train_test_f['u_d_s_count']=train_test_f.groupby(['user_id','day'])['seller'].transform('count')# 0.741555
    train_test_f['u_s_p_unique']=train_test_f.groupby(['user_id','seller'])['Product_id'].transform('nunique')#0.7417228218
    train_test_f['u_s_d_mean']=train_test_f.groupby(['user_id','seller'])['day'].transform('mean')#0.7417228218
    train_test_f['u_p_d_mean']=train_test_f.groupby(['user_id','Product_id'])['day'].transform('mean')#0.7417228218
    train_test_f['s_u_d_mean']=train_test_f.groupby(['seller','user_id'])['day'].transform('mean')#0.7417228218
    train_test_f['p_s_d_mean']=train_test_f.groupby(['Product_id','seller'])['day'].transform('mean')#0.7417228218
    train_test_f['u_s_u_count']=train_test_f.groupby(['user_id','seller'])['user_id'].transform('count')
    train_test_f['u_p_u_count']=train_test_f.groupby(['user_id','Product_id'])['user_id'].transform('count')
    train_test_f['u_d_p_count']=train_test_f.groupby(['user_id','day'])['Product_id'].transform('count')#0.7464669548
    train_test_f['s_p_d_max']=train_test_f.groupby(['seller','Product_id'])['day'].transform('max')#0.7469330771
    train_test_f['u_s_u_count'] = train_test_f.groupby(['user_id','seller'])['user_id'].transform('count')#0.7469330771
    train_test_f['u_p_u_count'] = train_test_f.groupby(['user_id','Product_id'])['user_id'].transform('count')#0.7469330771
    train_test_f['u_d_u_count'] = train_test_f.groupby(['user_id','day'])['user_id'].transform('count')#0.7469734682
    train_test_f['u_s_a_nunique'] = train_test_f.groupby(['user_id', 'seller'])['action_type'].transform('nunique')#0.7504861997
    
    # the following block: 0.75052
    train_test_f['u_s_d_max'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('max')
    train_test_f['u_s_d_min'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('min')
    train_test_f['u_s_d_mean'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('mean')
    train_test_f['u_s_d_std'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('std')
    train_test_f['u_s_d_median'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('median')
    train_test_f['u_s_d_nunique'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('nunique')
    train_test_f['u_s_d_count'] = train_test_f.groupby(['user_id', 'seller'])['day'].transform('count')
    
    train_test_f['p_s_u_nunique'] = train_test_f.groupby(['Product_id','seller'])['user_id'].transform('nunique')#0.7506777329
    
    # 0.7504883551
    train_test_f['p_s_d_max'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('max')
    train_test_f['p_s_d_min'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('min')
    train_test_f['p_s_d_mean'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('mean')
    train_test_f['p_s_d_std'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('std')
    train_test_f['p_s_d_median'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('median')
    train_test_f['p_s_d_nunique'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('nunique')
    train_test_f['p_s_d_count'] = train_test_f.groupby(['seller', 'Product_id'])['day'].transform('count')
    
    train_test_f['p_s_a_nunique'] = train_test_f.groupby(['Product_id','seller'])['action_type'].transform('nunique')
    train_test_f['u_p_s_nunique'] = train_test_f.groupby(['user_id', 'Product_id'])['seller'].transform('nunique')

     

  3. Model selection
  • The favorite model is CatBoost, with 'user_id','Product_id','seller' passed in as categorical features; learning rate 0.04, 1000 iterations, early stopping after 100 rounds
    from catboost import CatBoostClassifier
    def cb_model(X_t, X_v, y_t, y_v, test):
        print('cb_model start')
        category=[] # indices of the categorical feature columns
        c_list = ['user_id','Product_id','seller']# ,'day'
        for index,value in enumerate(X_t.columns):
            if(value in c_list):
                category.append(index)
                continue
        model= CatBoostClassifier(iterations=1000,learning_rate=0.04,cat_features=category, loss_function='Logloss',
                                logging_level='Verbose',eval_metric='AUC')
        model.fit(X_t,y_t,eval_set=(X_v,y_v),early_stopping_rounds=100)
        res=model.predict_proba(test)[:,1]
        # importance=model.get_feature_importance(prettified=True)# show feature importances
        # print(importance)
        return res

     

  • 5-fold stratified cross-validation, averaging the fold predictions on the test set
    from sklearn.model_selection import StratifiedKFold

    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)# 5-fold stratified CV
    favorite_result = np.zeros(test_f.shape[0])# accumulator for the averaged fold predictions
    for train_index, test_index in kf.split(data_f,lable_favorite):
        # data_f is the favorite training set, lable_favorite is the label
        X_train_p, y_train_p = data_f.iloc[train_index], lable_favorite.iloc[train_index]
        X_valid_p, y_valid_p = data_f.iloc[test_index], lable_favorite.iloc[test_index]
        favorite_result +=cb_model(X_train_p, X_valid_p, y_train_p, y_valid_p, test_f)/5.0

     
