Steam Prediction Competition: Model Validation

Tianchi competition page: link

Theory

  • Underfitting (high bias)

    • Add extra features
    • Add polynomial features
    • Reduce the regularization penalty
  • Overfitting (high variance)

    • Collect more data
    • Use fewer features
    • Increase the regularization penalty
  • Generalization: how well the concepts a machine-learning model has learned carry over to samples it has never seen before

  • Regularization: guards against overfitting (see the first sketch after this list)

    • L1 regularization (Lasso)
      • The penalty is the sum of the absolute values of the coefficients
      • Forces the coefficients of weak features to exactly 0, so the learned model is often sparse
    • L2 regularization (Ridge regression)
      • The penalty uses the Euclidean norm: the square root of the sum of the squared coefficients
      • Shrinks all coefficients smoothly toward 0 rather than zeroing them out, so every feature stays in the model
  • Evaluation metrics for regression models (see the second sketch after this list)

    • Metric and sklearn function:
      Mean Absolute Error (MAE): sklearn.metrics.mean_absolute_error
      Mean Squared Error (MSE): sklearn.metrics.mean_squared_error
      Root Mean Squared Error (RMSE): the square root of MSE; np.sqrt(mean_squared_error(...))
      R-Squared: R² = 1 − Σ(y_true − y_pred)² / Σ(y_true − ȳ)²; sklearn.metrics.r2_score
  • Cross-validation

    • Simple hold-out validation: typically set aside 30% of the data as the test set

    • K-fold cross-validation (KFold)

      • Split the original data into K subsets; each subset serves once as the validation set while the remaining K−1 subsets form the training set; average the K results

      • from sklearn.model_selection import KFold
        kf = KFold(n_splits=10)
        for train_index, test_index in kf.split(data):
            # KFold yields index arrays, not the data itself
            X_train, X_test = data[train_index], data[test_index]
        
    • Leave-one-out cross-validation (LeaveOneOut)

      • The training set consists of every sample except one; the single held-out sample forms the validation set

      • from sklearn.model_selection import LeaveOneOut
        loo = LeaveOneOut()
        
    • Leave-P-out cross-validation (LeavePOut)

      • Remove p samples from the full dataset to form the validation set; the remaining samples form the training set

      • from sklearn.model_selection import LeavePOut
        lpo = LeavePOut(p=5)
        
    • Stratified cross-validation based on class labels

      • Addresses class-imbalance problems
      • Ensures the class frequencies of the full dataset are preserved in every training and validation fold
      • StratifiedKFold: a KFold variant that returns indices; each fold keeps roughly the same class proportions as the full dataset
      • StratifiedShuffleSplit: a ShuffleSplit variant; each split directly preserves the class proportions of the full dataset (see the third sketch after this list)
    • Cross-validation for grouped data

      • Guarantees that samples from the same group never end up in both the training set and the validation set
      • GroupKFold, LeaveOneGroupOut, LeavePGroupsOut, GroupShuffleSplit
    • Time-series split (TimeSeriesSplit)

      • For time-series samples, the temporal order must be preserved
      • You must not validate present data against future data, so the training folds always precede the validation fold in time
  • Model tuning

    • Balancing the bias and the variance of the overall model
    • Process-level parameters
      • Number of sub-models: n_estimators
      • Learning rate: learning_rate
    • Sub-model-level parameters
      • Maximum tree depth: max_depth
      • Split criterion: criterion
    • Bagging reduces variance
    • Boosting reduces bias
    • RandomForest
      • criterion, n_estimators, max_leaf_nodes, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf
    • Gradient-boosted trees
      • learning_rate, n_estimators, max_leaf_nodes, max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, max_features, subsample
    • Grid search: exhaustive search over a parameter grid
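
First sketch: a minimal illustration of the L1-sparsity vs. L2-shrinkage contrast above, on synthetic data (the feature count and alpha values are illustrative assumptions, not from the competition data).

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(200)

# L1 (Lasso) zeroes out the weak features; L2 (Ridge) only shrinks them
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))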
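
Second sketch: computing the four regression metrics listed above on toy arrays.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is just the square root of MSE
r2 = r2_score(y_true, y_pred)  # 1 - sum((y-yhat)^2) / sum((y-ybar)^2)
print(mae, mse, rmse, r2)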
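
Third sketch: the stratified, grouped, and time-series splitters side by side (labels, groups, and sizes are made-up toy values).

import numpy as np
from sklearn.model_selection import StratifiedKFold, GroupKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)
y = np.array([0] * 9 + [1] * 3)      # imbalanced labels, 3:1
groups = np.repeat([0, 1, 2, 3], 3)  # four groups of three samples

# every fold keeps roughly the 3:1 class ratio of the full data
for _, va in StratifiedKFold(n_splits=3).split(X, y):
    print("stratified validation labels:", y[va])

# a group never appears in both the training and validation indices
for _, va in GroupKFold(n_splits=4).split(X, y, groups):
    print("group held out:", np.unique(groups[va]))

# training indices always precede the validation indices in time
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    print("train up to index", tr.max(), "-> validate", va)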

1. Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import lightgbm as lgb

from sklearn.metrics import mean_squared_error

2. Load the data

train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

feature_columns = [col for col in train_data if col not in ['target']]

# fit the scaler on the training features only, then apply it to both sets
min_max_scaler = MinMaxScaler().fit(train_data[feature_columns])
train_data_scaler = min_max_scaler.transform(train_data[feature_columns])
test_data_scaler = min_max_scaler.transform(test_data[feature_columns])
train_data_scaler = pd.DataFrame(train_data_scaler, columns=feature_columns)
train_data_scaler['target'] = train_data['target']
test_data_scaler = pd.DataFrame(test_data_scaler, columns=feature_columns)

# keep the 16 leading principal components
pca = PCA(n_components=16)
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:, :-1])
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_test_pca_16 = pca.transform(test_data_scaler)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)

train = new_train_pca_16.fillna(0)[new_test_pca_16.columns]
target = new_train_pca_16['target']

# hold out 20% of the training data as a local test set
train_data, test_data, train_target, test_target = train_test_split(train, target, test_size=0.2, random_state=0)

3. Fit the data

# Underfitting. max_iter: maximum number of iterations; tol: stop once the loss improvement falls below this value
clf = SGDRegressor(max_iter=500,tol=1e-2)
clf.fit(train_data,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
# SGDRegressor train MSE:  0.1518963746787445
# SGDRegressor test MSE:  0.15617004077751775

# Overfitting: degree-5 polynomial features blow up the feature space
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(5)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE:  0.13239797454332378
# SGDRegressor test MSE:  0.14471496796078484

# A reasonable fit: degree-3 polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
clf = SGDRegressor(max_iter=1000,tol=1e-3)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE:  0.1339474648768595
# SGDRegressor test MSE:  0.14245462035372333

# Combined regularization (elastic net)
poly = PolynomialFeatures(3)
train_data_poly = poly.fit_transform(train_data)
test_data_poly = poly.transform(test_data)
# elasticnet combines weighted L1 and L2 penalties; l1_ratio is the L1 share
clf = SGDRegressor(max_iter=1000,tol=1e-3,penalty='elasticnet',l1_ratio=0.9,alpha=0.001)
clf.fit(train_data_poly,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data_poly)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data_poly)))
# SGDRegressor train MSE:  0.1415386321846727
# SGDRegressor test MSE:  0.1480928293310648

4. Cross-validation: KFold, LeaveOneOut, LeavePOut

# alternative splitters:
# from sklearn.model_selection import LeaveOneOut
# loo = LeaveOneOut()
# from sklearn.model_selection import LeavePOut
# lpo = LeavePOut(p=10)
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for k,(train_index,test_index) in enumerate(kf.split(train)):
    # KFold yields index arrays; slice the data with them
    train_data,test_data,train_target,test_target = train.values[train_index],train.values[test_index],\
    target.values[train_index],target.values[test_index]
    clf = SGDRegressor(max_iter=1000,tol=1e-3)
    clf.fit(train_data,train_target)
    print("Fold",k,"SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
    print("Fold",k,"SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
    
'''
Note that KFold returns indices
Fold 0 SGDRegressor train MSE:  0.15000344819294997
Fold 0 SGDRegressor test MSE:  0.10575417172621228
Fold 1 SGDRegressor train MSE:  0.13355864901674058
Fold 1 SGDRegressor test MSE:  0.18233565251295702
Fold 2 SGDRegressor train MSE:  0.1465085643709417
Fold 2 SGDRegressor test MSE:  0.13286058060702355
Fold 3 SGDRegressor train MSE:  0.14066988317392046
Fold 3 SGDRegressor test MSE:  0.1618205154729875
Fold 4 SGDRegressor train MSE:  0.13813813503634081
Fold 4 SGDRegressor test MSE:  0.16437792249692298
'''
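
As an aside, the same K-fold evaluation can be written more compactly with cross_val_score; a sketch (sklearn's 'neg_mean_squared_error' scorer returns the negated MSE, hence the sign flips):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(SGDRegressor(max_iter=1000, tol=1e-3),
                         train, target,
                         cv=5, scoring='neg_mean_squared_error')
print("per-fold test MSE:", -scores)
print("mean test MSE:", -scores.mean())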

5. Hyperparameter search: GridSearchCV

from sklearn.model_selection import GridSearchCV
randomForestRegressor = RandomForestRegressor()
# dictionary defining the hyperparameter search space
parameters = {'n_estimators':[100,200],'max_depth':[5,10]}
clf = GridSearchCV(randomForestRegressor,param_grid=parameters,cv=3)
clf.fit(train_data,train_target)

print("RandomForestRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("RandomForestRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
# cv_results_ holds fit times and validation-score statistics
print(sorted(clf.cv_results_))
print(clf.best_params_)

'''
RandomForestRegressor train MSE:  0.03937490165306472
RandomForestRegressor test MSE:  0.15825533758727703
['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']
{'max_depth': 10, 'n_estimators': 200}
'''
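
A convenient way to inspect cv_results_ is as a DataFrame; a small sketch (the column names are part of sklearn's documented cv_results_ schema, visible in the output above):

import pandas as pd
cv_df = pd.DataFrame(clf.cv_results_)
# one row per parameter combination, ranked by mean validation score
print(cv_df[['params', 'mean_test_score', 'rank_test_score']])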

# Randomized parameter search
# re-create the 80/20 split (the KFold loop above overwrote these variables)
train_data,test_data,train_target,test_target = train_test_split(train,target,test_size=0.2,random_state=0)

from sklearn.model_selection import RandomizedSearchCV
randomForestRegressor = RandomForestRegressor()
parameters = {'n_estimators':[50,100,150,200],'max_depth':[3,5,10,15]}
clf = RandomizedSearchCV(randomForestRegressor,param_distributions=parameters,cv=3)
clf.fit(train_data,train_target)
print("SGDRegressor train MSE: ",mean_squared_error(train_target,clf.predict(train_data)))
print("SGDRegressor test MSE: ",mean_squared_error(test_target,clf.predict(test_data)))
print(sorted(clf.cv_results_))
print(clf.best_params_)

'''
RandomForestRegressor train MSE:  0.022121433444846298
RandomForestRegressor test MSE:  0.15637853039323252
['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_n_estimators', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']
{'n_estimators': 200, 'max_depth': 15}
'''

6. LightGBM model with 5-fold cross-validation

train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"

train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

# use only the columns shared with the test file as features
train_data2_d = train_data2[test_data2.columns].values
train_data2_t = train_data2['target'].values

from sklearn.model_selection import KFold
import lightgbm as lgb
import numpy as np

kf = KFold(n_splits=5,shuffle=True,random_state=2020)
MSE_DICT={'train_mse':[],'test_mse':[]}

for k,(train_index,test_index) in enumerate(kf.split(train_data2_d)):
    lgb_reg = lgb.LGBMRegressor(
        learning_rate=0.01,
        max_depth=-1,
        n_estimators=5000,
        boosting_type='gbdt',
        random_state=2020,
        objective='regression',
    )
    X_train_KFold,X_test_KFold = train_data2_d[train_index],train_data2_d[test_index]
    Y_train_KFold,Y_test_KFold = train_data2_t[train_index],train_data2_t[test_index]
    
    # note: in LightGBM >= 4.0, early_stopping_rounds and verbose were removed
    # from fit(); use callbacks instead (see the sketch after the output below)
    lgb_reg.fit(X=X_train_KFold,
                y=Y_train_KFold,
                eval_set=[(X_train_KFold,Y_train_KFold),(X_test_KFold,Y_test_KFold)],
                eval_names=['train','test'],
                early_stopping_rounds=100,
                eval_metric='MSE',
                verbose = 50
    )
    y_train_KFold_predict = lgb_reg.predict(X_train_KFold,num_iteration=lgb_reg.best_iteration_)
    y_test_KFold_predict = lgb_reg.predict(X_test_KFold,num_iteration=lgb_reg.best_iteration_)
    
    train_mse = mean_squared_error(y_train_KFold_predict, Y_train_KFold)
    test_mse = mean_squared_error(y_test_KFold_predict, Y_test_KFold)
    MSE_DICT['train_mse'].append(train_mse)
    MSE_DICT['test_mse'].append(test_mse)
    print('Fold {}: train and test MSE'.format(k + 1))
    print('------\n', 'Train MSE\n', train_mse, '\n------')
    print('------\n', 'Test MSE\n', test_mse, '\n------\n')

print('------\n', 'Train MSE\n', MSE_DICT['train_mse'], '\n',np.mean(MSE_DICT['train_mse']), '\n------')
print('------\n', 'Test MSE\n', MSE_DICT['test_mse'], '\n',np.mean(MSE_DICT['test_mse']), '\n------')

'''
Training until validation scores don't improve for 100 rounds
[50]	train's l2: 0.425465	test's l2: 0.496533
[100]	train's l2: 0.220271	test's l2: 0.283883
[150]	train's l2: 0.134273	test's l2: 0.19732
[200]	train's l2: 0.0949213	test's l2: 0.158257
[250]	train's l2: 0.0741039	test's l2: 0.13845
[300]	train's l2: 0.0617553	test's l2: 0.127983
[350]	train's l2: 0.052972	test's l2: 0.122248
[400]	train's l2: 0.0463628	test's l2: 0.118739
[450]	train's l2: 0.0408907	test's l2: 0.116877
[500]	train's l2: 0.036397	test's l2: 0.115797
[550]	train's l2: 0.0326647	test's l2: 0.115223
[600]	train's l2: 0.0293509	test's l2: 0.114269
[650]	train's l2: 0.0265367	test's l2: 0.113535
[700]	train's l2: 0.0241231	test's l2: 0.112868
[750]	train's l2: 0.0219745	test's l2: 0.112329
[800]	train's l2: 0.0200866	test's l2: 0.111965
[850]	train's l2: 0.0184577	test's l2: 0.111526
[900]	train's l2: 0.0169686	test's l2: 0.111169
[950]	train's l2: 0.0156194	test's l2: 0.110916
[1000]	train's l2: 0.014383	test's l2: 0.1108
[1050]	train's l2: 0.0132859	test's l2: 0.110887
[1100]	train's l2: 0.0122815	test's l2: 0.110883
Early stopping, best iteration is:
[1002]	train's l2: 0.0143354	test's l2: 0.110787
Fold 1: train and test MSE
------
 Train MSE
 0.014335406865061465 
------
------
 Test MSE
 0.11078669083212028 
 
 ....
 
 Train MSE
 [0.014335406865061465, 0.00038001724874838936, 0.0066264997159705035, 0.002476702516409125, 0.006477785687227073] 
 0.006059282406683311 
------
------
 Test MSE
 [0.11078669083212028, 0.11085583591285612, 0.11445116914718123, 0.09379035773727774, 0.11315095015348814] 
 0.10860700075658469 
------
------
'''
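
For newer LightGBM releases (4.x), the fit call in step 6 needs the callback form instead; a sketch of the equivalent invocation:

lgb_reg.fit(X_train_KFold, Y_train_KFold,
            eval_set=[(X_train_KFold, Y_train_KFold), (X_test_KFold, Y_test_KFold)],
            eval_names=['train', 'test'],
            eval_metric='l2',
            callbacks=[lgb.early_stopping(stopping_rounds=100),
                       lgb.log_evaluation(period=50)])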

7. Learning curve: learning_curve

  • learning_curve reference: link
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import ShuffleSplit
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve
plt.rcParams.update({# Use mathtext, not LaTeX
                    'text.usetex': False,
                    # Use the Computer modern font
                    'font.family': 'serif',
                    'font.serif': 'cmr10',
                    'mathtext.fontset': 'cm',
})


def plot_learning_curve(estimator,title,X,y,ylim=None,cv=None,n_jobs=1,train_sizes=np.linspace(.1, 1.0, 7)):
    plt.figure(figsize=(9, 5), dpi=250)
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes,
                     train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std,
                     alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes,
                     test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std,
                     alpha=0.1,
                     color="g")
    plt.plot(train_sizes,
             train_scores_mean,
             'o-',
             color="r",
             label="Training score")
    plt.plot(train_sizes,
             test_scores_mean,
             'o-',
             color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"

train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

X = train_data2[test_data2.columns].values
Y = train_data2['target'].values

title = "LinearRegression"
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=2020)
estimator = SGDRegressor()
plot_learning_curve(estimator, title, X, Y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)

(figure: learning curve, training score vs. cross-validation score over training-set size)

8. Validation curve: validation_curve

  • A plot-based way of locating the best hyperparameter value
  • validation_curve reference: link
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import validation_curve
plt.rcParams.update({# Use mathtext, not LaTeX
                     'text.usetex': False,
                     # Use the Computer modern font
                     'font.family': 'serif',
                     'font.serif': 'cmr10',
                      'mathtext.fontset': 'cm',
})

train_data_file = "../datasets/zhengqi_train.txt"
test_data_file = "../datasets/zhengqi_test.txt"

train_data2 = pd.read_csv(train_data_file, sep='\t', encoding='utf-8')
test_data2 = pd.read_csv(test_data_file, sep='\t', encoding='utf-8')

X = train_data2[test_data2.columns].values
Y = train_data2['target'].values

# choose a sensible value for the regularization strength alpha
param_range = [0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001]
# validation_curve reference: https://blog.csdn.net/lvchunyang66/article/details/104411659
train_scores, test_scores = validation_curve(
             SGDRegressor(max_iter=1000,tol=1e-3,penalty='l1'),X,Y,
             param_name="alpha",
             param_range=param_range,
             cv=10,scoring='r2',n_jobs=1)

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(9, 5), dpi=150)
plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)

# semilogx plots with a log-scaled x axis (alpha spans several orders of magnitude)
plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
plt.fill_between(param_range,
            train_scores_mean - train_scores_std,
            train_scores_mean + train_scores_std,
            alpha=0.2,
            color="r")
plt.semilogx(param_range,test_scores_mean,label="Cross-validation score",color="g")
plt.fill_between(param_range,
            test_scores_mean - test_scores_std,
            test_scores_mean + test_scores_std,
            alpha=0.2,
            color="g")

plt.legend(loc="best")
plt.show()

(figure: validation curve for SGDRegressor, training vs. cross-validation score over alpha)
