Tmall Repeat-Buyer Prediction Competition: Model Training, Validation, and Evaluation

Tianchi competition page: link

Theoretical background

  • Classification is a supervised learning task: given a large amount of labeled data, a model is trained to predict the labels of unseen samples. Problems are either binary or multi-class.

  • Logistic regression, despite its name, is a classification algorithm: it maps the output of a linear function through the sigmoid function to estimate a probability and assign a class.

  • Sigmoid function

    • A squashing (normalizing) function that maps a continuous value into the range 0 to 1, turning a continuous score into something that can be thresholded into a discrete class.

    • $f(x) = \dfrac{1}{1 + e^{-x}}$

    • from sklearn.linear_model import LogisticRegression
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split
      
      # features need to be standardized first
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      
      clf = LogisticRegression(random_state=2020, solver='lbfgs',
                               multi_class='multinomial').fit(X_train, y_train)
      
  • K-nearest-neighbor (KNN) classification

    • Compute the distance (e.g. Euclidean) between the query point and every point in the training data.

    • Take the class labels of the most similar (nearest) training samples.

    • Count how often each class appears among the k nearest points.

    • Return the most frequent class among the k nearest points as the prediction for the query point (a from-scratch sketch follows the snippet below).

    • from sklearn.neighbors import KNeighborsClassifier
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split
      
      # features need to be standardized first
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
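
    • A from-scratch sketch of the four steps above, using numpy (the names X_train_arr, y_train_arr, x_query, and k are illustrative, not from the original post):

    • import numpy as np
      from collections import Counter
      
      def knn_predict(X_train_arr, y_train_arr, x_query, k=3):
          # 1. Euclidean distance from the query point to every training sample
          dists = np.sqrt(((X_train_arr - x_query) ** 2).sum(axis=1))
          # 2./3. labels of the k nearest samples
          nearest_labels = y_train_arr[np.argsort(dists)[:k]]
          # 4. the most frequent class among the k nearest points is the prediction
          return Counter(nearest_labels).most_common(1)[0][0]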
      
  • Gaussian naive Bayes classifier

    • $P(A \mid B) = \dfrac{P(A, B)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

    • from sklearn.naive_bayes import GaussianNB
      from sklearn.preprocessing import StandardScaler
      from sklearn.model_selection import train_test_split
      
      # features need to be standardized first
      stdScaler = StandardScaler()
      X = stdScaler.fit_transform(train)
      X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.3, random_state=2020)
      clf = GaussianNB().fit(X_train, y_train)
      
  • Ensemble classification models

    • Bagging: draw m bootstrap samples, train a learner on each sample, and combine the learners with an aggregation strategy (e.g. voting).

    • Boosting: train on a weighted training set, update the sample weights according to the learner's error rate, and retrain iteratively.

    • Random forest

    • LightGBM

    • Extremely randomized trees (Extra Trees, ET)

      • Built from many decision trees.
      • A random forest uses bagging (bootstrap samples); extra trees fit every tree on all training samples.
      • A random forest searches for the best split attribute within a random feature subset; extra trees choose the split value completely at random (a quick comparison sketch follows this list).
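
    • A quick comparison sketch of the last two points, on synthetic data (make_classification and the hyperparameters here are illustrative only, not the competition data or the settings used later in this post):

    • from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
      from sklearn.model_selection import cross_val_score
      
      # synthetic stand-in for the competition features/labels
      X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=2020)
      for name, model in [('RandomForest', RandomForestClassifier(n_estimators=100, random_state=2020)),
                          ('ExtraTrees', ExtraTreesClassifier(n_estimators=100, random_state=2020))]:
          print(name, cross_val_score(model, X_demo, y_demo, cv=5).mean())
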
  • Model evaluation metrics

    • Metrics and the corresponding sklearn.metrics functions (a usage example follows at the end of this list):
      Accuracy                            accuracy_score
      Precision                           precision_score
      Recall                              recall_score
      F1 score                            f1_score
      Classification report               classification_report
      Confusion matrix                    confusion_matrix
      ROC curve                           roc_curve
      AUC (area under the ROC curve)      auc
    • Precision and recall

      • Imagine a slightly unreliable banknote validator: it is supposed to reject fakes and keep genuine notes, but it sometimes makes mistakes.
      • Precision: of the notes it kept, the fraction that are genuine = genuine notes kept / (genuine notes kept + fake notes kept).
      • Recall: of all genuine notes, the fraction that were kept = genuine notes kept / (genuine notes kept + genuine notes wrongly rejected).
    • F1 score

      • The weighted harmonic mean of precision and recall.
      • $F_a = \dfrac{(a^2 + 1)\,P\,R}{a^2\,P + R}$
      • With $a = 1$ this reduces to the familiar F1 score: $F_1 = \dfrac{2PR}{P + R}$
    • Classification report

      • Reports precision, recall, and F1 for each class in one summary.
    • Confusion matrix

      •                Predicted = 1          Predicted = 0
        Actual = 1     TP (True Positive)     FN (False Negative)
        Actual = 0     FP (False Positive)    TN (True Negative)
    • ROC curve

      • The x-axis is the FPR (False Positive Rate).
      • The y-axis is the TPR (True Positive Rate).
      • The ideal point is TPR = 1 and FPR = 0; the closer the ROC curve is to the (0, 1) corner and the further it is from the 45-degree diagonal, the better the classifier.
    • AUC

      • The area under the ROC curve.
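
    • Example: computing the metrics in the table above with sklearn.metrics (the labels and probabilities below are made up for illustration):

    • from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                                    classification_report, confusion_matrix, roc_curve, auc)
      
      y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # illustrative ground-truth labels
      y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # illustrative hard predictions
      y_prob = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]    # illustrative positive-class probabilities
      
      print(accuracy_score(y_true, y_pred))
      print(precision_score(y_true, y_pred))
      print(recall_score(y_true, y_pred))
      print(f1_score(y_true, y_pred))
      print(confusion_matrix(y_true, y_pred))
      print(classification_report(y_true, y_pred))
      fpr, tpr, thresholds = roc_curve(y_true, y_prob)     # ROC curve points
      print(auc(fpr, tpr))                                 # area under the ROC curve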

1. Setting up cross-validation

# 1. Simple cross-validation with cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0, n_jobs=-1)
scores = cross_val_score(clf, train, target, cv=5,scoring='f1_macro')
print(scores)

# 2. Split the data with ShuffleSplit
from sklearn.model_selection import ShuffleSplit
cv = ShuffleSplit(n_splits=5,test_size=0.3,random_state=2020)
scores = cross_val_score(clf, train, target, cv=cv)

# 3. Split the data with KFold
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k,(train_index,test_index) in enumerate(kf.split(train)):
  X_train,X_test,y_train,y_test = train[train_index], train[test_index], target[train_index], target[test_index]
  clf = clf.fit(X_train, y_train)
  print(k, clf.score(X_test, y_test))
  
# 4. Split the data with StratifiedKFold (folds preserve the label distribution)
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for k, (train_index, test_index) in enumerate(skf.split(train, target)):
    X_train, X_test, y_train, y_test = train[train_index], train[test_index], target[train_index], target[test_index]
    clf = clf.fit(X_train, y_train)
    print(k, clf.score(X_test, y_test))

2. Hyperparameter tuning

from sklearn.model_selection import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_jobs=-1)

parameters = {
  'n_estimators': [50, 100, 200],
  'max_depth': [2, 5]
}
clf = GridSearchCV(clf, param_grid=parameters, cv=5, scoring='precision_macro')
clf.fit(X_train, y_train)
print(clf.cv_results_)
print(clf.best_params_)
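
GridSearchCV refits the best parameter combination on the full training split by default (refit=True), so the tuned model can then be checked directly on the held-out split; a short sketch:

best_clf = clf.best_estimator_           # estimator refitted with clf.best_params_
print(best_clf.score(X_test, y_test))    # accuracy on the held-out split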

3. Different classification models

# Logistic regression (LR) model
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_train, y_train)
clf.score(X_test, y_test)

# K-nearest-neighbor (KNN) model
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
clf.score(X_test, y_test)

# Gaussian naive Bayes model
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
# standardize the features
stdScaler = StandardScaler()
X = stdScaler.fit_transform(train)
X_train, X_test, y_train, y_test = train_test_split(X, target, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
clf.score(X_test, y_test)

# Bagging model (bagged KNN base learners)
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Random forest model
from sklearn.ensemble import RandomForestClassifier
X_train, X_test, y_train, y_test = train_test_split(train, target, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
clf = clf.fit(X_train, y_train)
clf.score(X_test, y_test)

# Extra Trees (ET) model
from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)

# AdaBoost model
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier(n_estimators=100)

# GBDT (gradient boosting) model
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# LightGBM model (the lightgbm module itself is assigned to clf so its native API is used below)
import lightgbm
clf = lightgbm
train_matrix = clf.Dataset(X_train, label=y_train)
test_matrix = clf.Dataset(X_test, label=y_test)
params = {
          'boosting_type': 'gbdt',
          'objective': 'multiclass',
          'metric': 'multi_logloss',
          'min_child_weight': 1.5,
          'num_leaves': 2**5,
          'lambda_l2': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'learning_rate': 0.03,
          'seed': 2020,
          "num_class": 2,
          'silent': True,
          }
num_round = 10000
early_stopping_rounds = 100
model = clf.train(params, 
                  train_matrix,
                  num_round,
                  valid_sets=test_matrix,
                  early_stopping_rounds=early_stopping_rounds)
# X_valid here is a separately prepared hold-out feature matrix (not constructed in this snippet)
pre = model.predict(X_valid, num_iteration=model.best_iteration)

# XGBoost model (the xgboost module itself is assigned to clf so its native API is used below)
import xgboost
clf = xgboost
train_matrix = clf.DMatrix(X_train, label=y_train, missing=-1)
test_matrix = clf.DMatrix(X_test, label=y_test, missing=-1)
# X_valid / y_valid: a separately prepared hold-out set (not constructed in this snippet)
z = clf.DMatrix(X_valid, label=y_valid, missing=-1)
params = {'booster': 'gbtree',
          'objective': 'multi:softprob',
          'eval_metric': 'mlogloss',
          'gamma': 1,
          'min_child_weight': 1.5,
          'max_depth': 5,
          'lambda': 1,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.03,
          'tree_method': 'exact',
          'seed': 2020,
          "num_class": 2
          }

num_round = 10000
early_stopping_rounds = 100
watchlist = [(train_matrix, 'train'),
             (test_matrix, 'eval')]
model = clf.train(params,
                  train_matrix,
                  num_boost_round=num_round,
                  evals=watchlist,
                  early_stopping_rounds=early_stopping_rounds)
pre = model.predict(z,ntree_limit=model.best_ntree_limit)

4. Model fusion

def stacking_reg(clf, train_x, train_y, test_x, clf_name, kf):
    # out-of-fold predictions for the training set and per-fold predictions for the test set
    valid_y_pre = np.zeros((train_y.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))  # `splits` is the global fold count (see section 5)
    
    cv_scores = []
    for i ,(train_idx,test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        if clf_name in ['rf','ada','gb','et','lr','en','ls','kr1',]:
            clf.fit(tr_x,tr_y)
            te_y_pre = clf.predict(te_x).reshape(-1,1)
            valid_y_pre[test_idx] = te_y_pre
            test_y_pre_k[i,:] = clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y,te_y_pre))
        elif clf_name in ['xgb']:
            train_matrix = clf.DMatrix(tr_x,label=tr_y,missing=-1)
            test_matrix = clf.DMatrix(te_x,label=te_y,missing=-1)
            z = clf.DMatrix(test_x,missing=-1)
            params = {
                'booster': 'gbtree',         # tree-based booster
                'eval_metric': 'rmse',
                'gamma': 0.1,                # minimum loss reduction required to split; larger = more conservative
                'min_child_weight': 1,       # minimum sum of instance weight needed in a child; larger = more conservative
                'max_depth': 8,
                'lambda': 3,                 # L2 regularization
                'subsample': 0.8,            # row subsampling ratio per tree; smaller = more conservative
                'colsample_bytree': 0.8,     # column subsampling ratio per tree
                'colsample_bylevel': 0.8,    # column subsampling ratio per split level
                'eta': 0.03,                 # learning rate (shrinks feature weights each step)
                'tree_method': 'auto',       # tree construction algorithm
                'seed': 2020,
                'nthread': 8                 # number of threads
            }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix,'train'),(test_matrix,'eval')]
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_boost_round = num_round,
                                  evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(test_matrix,
                                         ntree_limit=model.best_ntree_limit).reshape(-1,1)
                valid_y_pre[test_idx] = te_y_pre
                
                test_y_pre_k[i,:] = model.predict(z,
                                                ntree_limit=model.best_ntree_limit).reshape(-1,1)
                
                cv_scores.append(mean_squared_error(te_y,te_y_pre))
        elif clf_name in ['lgb']:
            train_matrix = clf.Dataset(tr_x,label=tr_y)
            test_matrix = clf.Dataset(te_x,label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'regression_l2',
                'min_child_weight': 1,
                'metric': 'mse',
                'num_leaves': 31,
                'lambda_l2': 3,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'learning_rate': 0.01,
                'seed': 2020,
                'nthread': -1,
                'silent': True,  # suppress training output
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_round,
                                  valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(te_x,
                                         num_iteration = model.best_iteration).reshape(-1,1)
                valid_y_pre[test_idx] = te_y_pre
                
                test_y_pre_k[i,:] = model.predict(test_x,
                                        num_iteration = model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y,te_y_pre))
        else:
            raise IOError("please add new clf")
        print("%s now score is:" % clf_name, cv_scores)
        
    test[:] = test_y_pre_k.mean(axis=0)
    print("%s_score_list:" % clf_name,cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    #print('{}_mean_squared_error: {}'.format(clf_name,mean_squared_error(y_valid,test)))
    #opt_models_new[clf_name] = mean_squared_error(y_valid,test)
    return valid_y_pre.reshape(-1,1),test.reshape(-1,1)

def ls_reg(x_train, y_train, x_valid, kf):
    ls_reg = Lasso(alpha=0.0005)
    ls_train,ls_test = stacking_reg(ls_reg,x_train,y_train,x_valid,'ls',kf)   
    return ls_train,ls_test,'ls_reg'

def svr_reg(x_train, y_train, x_valid, kf):
    svr_reg = SVR(kernel='linear')
    svr_train,svr_test = stacking_reg(svr_reg,x_train,y_train,x_valid,'svr',kf)   
    return svr_train,svr_test,'svr_reg'

def lr_reg(x_train, y_train, x_valid, kf):
    lr_reg = LinearRegression(n_jobs=-1)
    lr_train,lr_test = stacking_reg(lr_reg,x_train,y_train,x_valid,'lr',kf)   
    return lr_train,lr_test,'lr_reg'

def en_reg(x_train, y_train, x_valid, kf):
    en_reg = ElasticNet(alpha=0.0005, l1_ratio=.9, )
    en_train,en_test = stacking_reg(en_reg,x_train,y_train,x_valid,'en',kf)   
    return en_train,en_test,'en_reg'

def gb_reg(x_train, y_train, x_valid, kf):
    gbdt = GradientBoostingRegressor(
                                     n_estimators=250,
                                     random_state=2020,
                                     max_features='auto',
                                     verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt,x_train,y_train,
                                         x_valid,"gb",kf)
    return gbdt_train, gbdt_test, "gb_reg"

def rf_reg(x_train, y_train, x_valid, kf):
    randomforest = RandomForestRegressor(
                                         n_estimators=350,
                                         n_jobs=-1,
                                         random_state=2020,
                                         max_features='auto',
                                         verbose=1)
    rf_train, rf_test = stacking_reg(randomforest,x_train,y_train,
                                     x_valid,"rf",kf)
    return rf_train, rf_test, "rf_reg"

def ada_reg(x_train, y_train, x_valid, kf):
    adaboost = AdaBoostRegressor(n_estimators=800,
                                 random_state=2020,
                                 learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost,x_train,y_train,
                                       x_valid,"ada",kf)
    return ada_train, ada_test, "ada_reg"

def et_reg(x_train, y_train, x_valid, kf):
    extratree = ExtraTreesRegressor(n_estimators=600,
                                    max_depth=32,
                                    max_features="auto",
                                    n_jobs=-1,
                                    random_state=2020,
                                    verbose=1)
    et_train, et_test = stacking_reg(extratree,x_train,y_train,
                                     x_valid,"et",kf)
    return et_train, et_test, "et_reg"

def xgb_reg(x_train, y_train, x_valid, kf):
    xgb_train, xgb_test = stacking_reg(xgboost,x_train,y_train,
                                       x_valid,"xgb",kf)
    return xgb_train, xgb_test, "xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, ):
    lgb_train, lgb_test = stacking_reg(lightgbm,x_train,y_train,
                                       x_valid,"lgb",kf)
    return lgb_train, lgb_test, "lgb_reg"

def stacking_clf(clf, train_x, train_y, test_x, clf_name, kf):
    # out-of-fold predictions for the training set and per-fold predictions for the test set
    valid_y_pre = np.zeros((train_y.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    test_y_pre_k = np.empty((splits, test_x.shape[0], 1))  # `splits` is the global fold count (see section 5)
    
    cv_scores = []
    for i ,(train_idx,test_idx) in enumerate(kf.split(train_x)):
        tr_x = train_x[train_idx]
        tr_y = train_y[train_idx]
        te_x = train_x[test_idx]
        te_y = train_y[test_idx]
        if clf_name in ["rf","ada","gb","et","lr","knn","gnb"]:
            clf.fit(tr_x,tr_y)
            te_y_pre = clf.predict(te_x).reshape(-1,1)
            valid_y_pre[test_idx] = te_y_pre
            test_y_pre_k[i,:] = clf.predict(test_x).reshape(-1,1)
            cv_scores.append(log_loss(te_y,te_y_pre))
        elif clf_name in ['xgb']:
            train_matrix = clf.DMatrix(tr_x,label=tr_y,missing=-1)
            test_matrix = clf.DMatrix(te_x,label=te_y,missing=-1)
            z = clf.DMatrix(test_x,missing=-1)
            params = {
                'booster': 'gbtree',         # tree-based booster
                'objective': 'multi:softprob',
                'eval_metric': 'mlogloss',
                'num_class': 2,
                'gamma': 1,                  # minimum loss reduction required to split; larger = more conservative
                'min_child_weight': 1.5,     # minimum sum of instance weight needed in a child; larger = more conservative
                'max_depth': 8,
                'lambda': 5,                 # L2 regularization
                'subsample': 0.8,            # row subsampling ratio per tree; smaller = more conservative
                'colsample_bytree': 0.8,     # column subsampling ratio per tree
                'colsample_bylevel': 0.8,    # column subsampling ratio per split level
                'eta': 0.03,                 # learning rate (shrinks feature weights each step)
                'tree_method': 'exact',      # tree construction algorithm
                'seed': 2020,
                'nthread': -1,               # number of threads
            }
            num_round = 10000
            early_stopping_rounds = 100
            watchlist = [(train_matrix,'train'),(test_matrix,'eval')]
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_boost_round = num_round,
                                  evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(test_matrix,
                                         ntree_limit=model.best_ntree_limit)
                # keep the probability of the positive class (column 1) so log_loss is computed correctly
                valid_y_pre[test_idx] = te_y_pre[:, 1].reshape(-1, 1)
                
                test_y_pre_k[i, :] = model.predict(z,
                                                   ntree_limit=model.best_ntree_limit)[:, 1].reshape(-1, 1)
                
                cv_scores.append(log_loss(te_y, valid_y_pre[test_idx]))
        elif clf_name in ['lgb']:
            train_matrix = clf.Dataset(tr_x,label=tr_y)
            test_matrix = clf.Dataset(te_x,label=te_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class':2,
                'metric': 'multi_logloss',
                'min_child_weight': 1.5,
                'num_leaves': 32,
                'lambda_l2': 5,
                'subsample': 0.8,
                'colsample_bytree': 0.8,
                'learning_rate': 0.01,
                'seed': 2020,
            }
            num_round = 10000
            early_stopping_rounds = 100
            if test_matrix:
                model = clf.train(params,train_matrix,
                                  num_round,
                                  valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds)
                
                te_y_pre = model.predict(te_x,
                                         num_iteration=model.best_iteration)
                # keep the probability of the positive class (column 1) so log_loss is computed correctly
                valid_y_pre[test_idx] = te_y_pre[:, 1].reshape(-1, 1)
                
                test_y_pre_k[i, :] = model.predict(test_x,
                                                   num_iteration=model.best_iteration)[:, 1].reshape(-1, 1)
                cv_scores.append(log_loss(te_y, valid_y_pre[test_idx]))
        else:
            raise IOError("please add new clf")
        print("%s now score is:" % clf_name, cv_scores)
        
    test[:] = test_y_pre_k.mean(axis=0)
    print("%s_score_list:" % clf_name,cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    #print('{}_mean_squared_error: {}'.format(clf_name,mean_squared_error(y_valid,test)))
    #opt_models_new[clf_name] = mean_squared_error(y_valid,test)
    return valid_y_pre.reshape(-1,1),test.reshape(-1,1)

def rf_clf(x_train, y_train, x_valid, kf):
    randomforest = RandomForestClassifier(n_estimators=1200, max_depth=20, n_jobs=-1, random_state=2020, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_clf(randomforest, x_train, y_train, x_valid, "rf", kf)
    return rf_train, rf_test,"rf"

def ada_clf(x_train, y_train, x_valid, kf):
    adaboost = AdaBoostClassifier(n_estimators=250, random_state=2020, learning_rate=0.01)
    ada_train, ada_test = stacking_clf(adaboost, x_train, y_train, x_valid, "ada", kf)
    return ada_train, ada_test,"ada"

def gb_clf(x_train, y_train, x_valid, kf):
    gbdt = GradientBoostingClassifier(learning_rate=0.04, n_estimators=300, subsample=0.8, random_state=2020,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_clf(gbdt, x_train, y_train, x_valid, "gb", kf)
    return gbdt_train, gbdt_test,"gb"

def et_clf(x_train, y_train, x_valid, kf):
    extratree = ExtraTreesClassifier(n_estimators=1200, max_depth=35, max_features="auto", n_jobs=-1, random_state=2017,verbose=1)
    et_train, et_test = stacking_clf(extratree, x_train, y_train, x_valid, "et", kf)
    return et_train, et_test,"et"

def xgb_clf(x_train, y_train, x_valid, kf):
    xgb_train, xgb_test = stacking_clf(xgboost, x_train, y_train, x_valid, "xgb", kf)
    return xgb_train, xgb_test,"xgb"

def lgb_clf(x_train, y_train, x_valid, kf):
    lgb_train, lgb_test = stacking_clf(lightgbm, x_train, y_train, x_valid, "lgb", kf)
    return lgb_train, lgb_test,"lgb"

def gnb_clf(x_train, y_train, x_valid, kf):
    gnb=GaussianNB()
    gnb_train, gnb_test = stacking_clf(gnb, x_train, y_train, x_valid, "gnb", kf)
    return gnb_train, gnb_test,"gnb"

def lr_clf(x_train, y_train, x_valid, kf):
    logisticregression = LogisticRegression(n_jobs=-1,random_state=2020, solver='lbfgs', multi_class='multinomial')
    lr_train, lr_test = stacking_clf(logisticregression, x_train, y_train, x_valid, "lr", kf)
    return lr_train, lr_test, "lr"

def knn_clf(x_train, y_train, x_valid, kf):
    kneighbors=KNeighborsClassifier(n_neighbors=150,n_jobs=-1)
    knn_train, knn_test = stacking_clf(kneighbors, x_train, y_train, x_valid, "knn", kf)
    return knn_train, knn_test, "knn"

5. Training and validation

from sklearn.model_selection import KFold, StratifiedKFold
splits = 3
kf = KFold(n_splits=3, shuffle=True, random_state=0)
sk = StratifiedKFold(n_splits=splits,shuffle=True,random_state=2020)
# clf_name = [rf_clf,ada_clf,lr_clf,gb_clf,et_clf,xgb_clf,lgb_clf,
#            lr_reg,en_reg,et_reg,ls_reg,rf_reg,gb_reg,xgb_reg,lgb_reg]
clf_name = [lgb_reg,xgb_reg]
test_pre_k = np.empty((len(clf_name),test.shape[0],1))
def model_bagging(X_train, y_train, test, clf_name):
    # run each first-level model and average their test-set predictions
    for k, clf in enumerate(clf_name):
        tmp_train, tmp_test, name = clf(X_train, y_train, test, sk)
        test_pre_k[k, :] = tmp_test
    test_pre = test_pre_k.mean(axis=0)
    return test_pre

test_pre = model_bagging(X, y, test, clf_name)
#print('bagging_mean_squared_error: %s' %mean_squared_error(Y_valid,test_pre))
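
The stacking_reg / stacking_clf wrappers from section 4 also return out-of-fold predictions for the training set (their first return value), which model_bagging above throws away. As an alternative to simple averaging, here is a minimal stacking sketch that feeds those predictions into a second-level learner (the LogisticRegression meta-model and the commented call at the bottom are illustrative assumptions, not part of the original code):

import numpy as np
from sklearn.linear_model import LogisticRegression

def model_stacking(X_train, y_train, test, clf_name):
    train_meta, test_meta = [], []
    for clf in clf_name:
        tmp_train, tmp_test, name = clf(X_train, y_train, test, sk)
        train_meta.append(tmp_train)    # out-of-fold predictions, shape (n_train, 1)
        test_meta.append(tmp_test)      # fold-averaged test predictions, shape (n_test, 1)
    train_meta = np.hstack(train_meta)  # one column per first-level model
    test_meta = np.hstack(test_meta)
    # second-level (meta) model trained on the first-level predictions
    meta = LogisticRegression().fit(train_meta, y_train)
    return meta.predict_proba(test_meta)[:, 1]

# test_pre_stack = model_stacking(X, y, test, [lgb_clf, xgb_clf])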

6. Preliminary results

(Screenshot of the preliminary result in the original post.)
