Industrial Steam Volume Prediction

Competition Background
The basic principle of thermal power generation: fuel is burned to heat water into steam, the steam pressure drives a turbine, and the turbine drives a generator to produce electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e. how effectively the burning fuel heats the water into high-temperature, high-pressure steam. Combustion efficiency depends on many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, material-return air and feed-water flow, as well as the boiler's operating conditions, such as bed temperature and pressure, furnace temperature and pressure, and superheater temperature.
Problem Description
Given desensitized data collected from boiler sensors (sampled at minute-level frequency), predict the amount of steam produced from the boiler's operating conditions (a regression task).
Data Description
The data is split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0"-"V37" are the feature variables and "target" is the target variable. Contestants train a model on the training data and predict the target variable on the test data; submissions are ranked by the MSE (mean squared error) of the predictions.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
# data path
train_data_file = "./data/zhengqi_train.txt"
test_data_file = "./data/zhengqi_test.txt"

Exploratory Data Analysis (EDA)

Basic data exploration

train_data = pd.read_csv(train_data_file, sep="\t")
train_data.head()
V0V1V2V3V4V5V6V7V8V9...V29V30V31V32V33V34V35V36V37target
00.5660.016-0.1430.4070.452-0.901-1.812-2.360-0.436-2.114...0.1360.109-0.6150.327-4.627-4.789-5.101-2.608-3.5080.175
10.9680.4370.0660.5660.194-0.893-1.566-2.3600.332-2.114...-0.1280.1240.0320.600-0.8430.1600.364-0.335-0.7300.676
21.0130.5680.2350.3700.112-0.797-1.367-2.3600.396-2.114...-0.0090.3610.277-0.116-0.8430.1600.3640.765-0.5890.633
30.7330.3680.2830.1650.599-0.679-1.200-2.0860.403-2.114...0.0150.4170.2790.603-0.843-0.0650.3640.333-0.1120.206
40.6840.6380.2600.2090.337-0.454-1.073-2.0860.314-2.114...0.1831.0780.3280.418-0.843-0.2150.364-0.280-0.0280.384

5 rows × 39 columns

# no missing values
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB
train_data.describe().T
countmeanstdmin25%50%75%max
V02888.00.1230480.928031-4.335-0.297000.35900.726002.121
V12888.00.0560680.941515-5.122-0.226250.27250.599001.918
V22888.00.2897200.911236-3.420-0.313000.38600.918252.828
V32888.0-0.0677900.970298-3.956-0.65225-0.04450.624002.457
V42888.00.0129210.888377-4.742-0.385000.11000.550252.689
V52888.0-0.5585650.517957-2.182-0.85300-0.4660-0.154000.489
V62888.00.1828920.918054-4.576-0.310000.38800.831251.895
V72888.00.1161550.955116-5.048-0.295000.34400.782251.918
V82888.00.1778560.895444-4.692-0.159000.36200.726002.245
V92888.0-0.1694520.953813-12.891-0.390000.04200.042001.335
V102888.00.0343190.968272-2.584-0.420500.15700.619254.830
V112888.0-0.3644650.858504-3.160-0.80325-0.11200.247001.455
V122888.00.0231770.894092-5.165-0.419000.12300.616002.657
V132888.00.1957380.922757-3.675-0.398000.28950.864252.475
V142888.00.0160811.015585-2.455-0.66800-0.16100.829752.558
V152888.00.0961461.033048-2.903-0.66225-0.00050.730004.314
V162888.00.1135050.983128-5.981-0.300000.30600.774252.861
V172888.0-0.0434580.655857-2.224-0.366000.16500.430002.023
V182888.00.0550340.953466-3.582-0.367500.08200.513254.441
V192888.0-0.1148841.108859-3.704-0.98750-0.00050.737253.431
V202888.0-0.1862260.788511-3.402-0.67550-0.15650.304003.525
V212888.0-0.0565560.781471-2.643-0.51700-0.05650.431502.259
V222888.00.3028930.639186-1.375-0.063000.21650.872002.018
V232888.00.1559780.978757-5.5420.097250.33800.368251.906
V242888.0-0.0218131.033403-1.344-1.191000.09500.931252.423
V252888.0-0.0516790.915957-3.808-0.55725-0.07600.356007.284
V262888.00.0720920.889771-5.131-0.452000.07500.644252.980
V272888.00.2724070.270374-1.1640.157750.32500.442000.925
V282888.00.1377120.929899-2.435-0.45500-0.44700.730004.671
V292888.00.0976481.061200-2.912-0.66400-0.02300.745254.580
V302888.00.0554770.901934-4.507-0.283000.05350.488002.689
V312888.00.1277910.873028-5.859-0.170250.29950.635002.013
V322888.00.0208060.902584-4.053-0.407250.03900.557002.395
V332888.00.0078011.006995-4.627-0.49900-0.04000.462005.465
V342888.00.0067151.003291-4.789-0.290000.16000.273005.110
V352888.00.1977640.985675-5.695-0.202500.36400.602002.324
V362888.00.0306580.970812-2.608-0.413000.13700.644255.238
V372888.0-0.1303301.017196-3.630-0.79825-0.18550.495253.000
target2888.00.1263530.983966-3.044-0.350250.31300.793252.538
test_data = pd.read_csv(test_data_file, sep="\t")
test_data.head()
V0V1V2V3V4V5V6V7V8V9...V28V29V30V31V32V33V34V35V36V37
00.3680.380-0.225-0.0490.3790.0920.5500.5510.2440.904...-0.4490.0470.057-0.0420.8470.534-0.009-0.190-0.5670.388
10.1480.489-0.247-0.0490.122-0.2010.4870.493-0.1270.904...-0.4430.0470.5600.1760.5510.046-0.2200.008-0.2940.104
2-0.166-0.062-0.3110.046-0.0550.0630.4850.493-0.2270.904...-0.458-0.3980.1010.1990.6340.017-0.2340.0080.3730.569
30.1020.294-0.2590.051-0.1830.1480.4740.5040.0100.904...-0.456-0.3981.0070.1371.042-0.040-0.2900.008-0.6660.391
40.3000.4280.2080.051-0.0330.1160.4080.4970.1550.904...-0.458-0.7760.2910.3700.181-0.040-0.2900.008-0.140-0.497

5 rows × 38 columns

# no missing values
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 3   V3      1925 non-null   float64
 4   V4      1925 non-null   float64
 5   V5      1925 non-null   float64
 6   V6      1925 non-null   float64
 7   V7      1925 non-null   float64
 8   V8      1925 non-null   float64
 9   V9      1925 non-null   float64
 10  V10     1925 non-null   float64
 11  V11     1925 non-null   float64
 12  V12     1925 non-null   float64
 13  V13     1925 non-null   float64
 14  V14     1925 non-null   float64
 15  V15     1925 non-null   float64
 16  V16     1925 non-null   float64
 17  V17     1925 non-null   float64
 18  V18     1925 non-null   float64
 19  V19     1925 non-null   float64
 20  V20     1925 non-null   float64
 21  V21     1925 non-null   float64
 22  V22     1925 non-null   float64
 23  V23     1925 non-null   float64
 24  V24     1925 non-null   float64
 25  V25     1925 non-null   float64
 26  V26     1925 non-null   float64
 27  V27     1925 non-null   float64
 28  V28     1925 non-null   float64
 29  V29     1925 non-null   float64
 30  V30     1925 non-null   float64
 31  V31     1925 non-null   float64
 32  V32     1925 non-null   float64
 33  V33     1925 non-null   float64
 34  V34     1925 non-null   float64
 35  V35     1925 non-null   float64
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

Note: neither the training set nor the test set has missing values. The feature fields are desensitized and carry no concrete physical meaning; there are 38 features in total.

Visualizing outliers and inspecting the data distributions

# box plots of each column
fig = plt.figure(figsize=(80, 60))
feature_columns = train_data.columns.to_list()[:39]

for i, columns in enumerate(feature_columns):
    plt.subplot(7, 8, i+1) # note: subplot indices start at 1
    sns.boxplot(train_data[columns], orient="vertical", width=0.5)
    plt.ylabel(columns)
plt.show()

![](output_11_0.png)

Conclusion: most feature variables contain outliers, so outlier handling needs to be considered.

Model-based outlier detection

A model-based approach

def find_outliers(model, X, y, sigma=3):
    """Find outliers based on model residuals and visualize them."""

    # predict y with the given model; fit it first if it has not been fitted yet
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    except Exception:
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)

    print("MSE:{}, R^2:{}".format(round(mean_squared_error(y, y_pred), 4), round(model.score(X, y), 4)))
    # residual statistics
    resid = y - y_pred
    resid_mean = resid.mean()
    resid_std = resid.std()
    print("resid_mean:{}, resid_std:{}".format(round(resid_mean, 4), round(resid_std, 4)))

    # standardize the residuals (z-scores)
    z = (resid - resid_mean) / resid_std

    # outlier positions: samples whose residual deviates by more than sigma standard deviations
    outliers = z[abs(z) > sigma].index
    print("outlier index:", outliers.tolist())
    # visualize the outliers (the closer the points are to the diagonal, the better)
    plt.figure(figsize=(15, 3))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, ".", label="Accepted")
    plt.plot(y.loc[outliers], y_pred.loc[outliers], "ro", label="Outlier")
    plt.ylabel("y_pred")
    plt.xlabel("y")
    plt.legend()

    # residuals should cluster around zero
    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, ".", label="Accepted")
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], "ro", label="Outlier")
    plt.ylabel("y-y_pred")
    plt.xlabel("y")
    plt.legend()

    # histogram of the standardized residuals
    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(bins=50, color="r", ax=ax_133)
    plt.legend(["Accepted", "Outlier"])
    plt.xlabel("z")

    plt.savefig("./imgs/Outlier.png")

    return outliers
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge

X_train = train_data.iloc[:, 0:-1]
y_train = train_data.iloc[:, -1]

# Ridge regression
model = Ridge()
outliers = find_outliers(model, X_train, y_train)
MSE:0.1073, R^2:0.8891
resid_mean:0.0, resid_std:0.3277
outlier index: [321, 348, 376, 777, 884, 1145, 1164, 1310, 1458, 1466, 1484, 1523, 1704, 1874, 1879, 1979, 2002, 2279, 2528, 2620, 2645, 2647, 2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863]

![](output_15_1.png)

Histograms and Q-Q plots

# if the data follows a normal distribution, all points fall on the straight line
plt.figure(figsize=(10, 5))

ax = plt.subplot(1, 2, 1)
sns.distplot(train_data["V1"], fit=stats.norm)

ax = plt.subplot(1, 2, 2)
res = stats.probplot(train_data["V1"], plot=plt)

![](output_18_0.png)

# plot every column
train_cols = 6
train_rows = len(train_data.columns)

plt.figure(figsize=(4*train_cols, 4*train_rows))

i = 0
for col in train_data.columns:
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    sns.distplot(train_data[col], fit=stats.norm)
    
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    res = stats.probplot(train_data[col], plot=plt)
plt.tight_layout()


![](output_19_0.png)

Conclusion: the distributions of several feature variables (e.g. V1, V23, V24, V35) are clearly not normal.

KDE plots

Goal: compare the distribution of each feature variable in the training and test sets, which helps to spot features whose distributions differ between the two.

# kernel density estimate: a smoothed version of the histogram
plt.figure(figsize=(8, 4))

# check whether the feature distribution is the same in train and test
ax = sns.kdeplot(train_data["V1"], color="red", shade=True)
ax = sns.kdeplot(test_data["V1"], color="blue", shade=True)

ax.set_xlabel("V1")
ax.set_ylabel("Frequence")
ax.legend(["train", "test"])

![](output_22_1.png)

# plot every feature
dist_cols = 6
dist_rows = len(test_data.columns)

plt.figure(figsize=(4*dist_cols, 4*dist_rows))

i = 1
for col in test_data.columns:
    ax = plt.subplot(dist_rows, dist_cols, i)
    ax = sns.kdeplot(train_data[col], color="red", shade=True)
    ax = sns.kdeplot(test_data[col], color="blue", shade=True)
    ax.set_xlabel(col)
    ax.set_ylabel("Frequence")
    ax.legend(["train", "test"])
    i += 1
    
plt.tight_layout()

![](output_23_0.png)

Conclusion: the distributions of features V5, V9, V11, V17, V22 and V28 differ between the training and test sets, which would hurt the model's ability to generalize, so these features should be dropped.
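The KDE comparison above is purely visual. As a rough cross-check (my addition, not part of the original notebook), a two-sample Kolmogorov-Smirnov test can quantify how different the train/test distribution of each of these features is; scipy.stats is already imported above.

# hedged sketch: quantify the train/test distribution gap with a two-sample KS test
from scipy import stats

for col in ["V5", "V9", "V11", "V17", "V22", "V28"]:
    ks_stat, p_value = stats.ks_2samp(train_data[col], test_data[col])
    # a large statistic (tiny p-value) means the two samples are unlikely to share a distribution
    print("{}: ks_stat={:.3f}, p_value={:.3g}".format(col, ks_stat, p_value))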

Linear regression relationship plots (qualitative analysis)

Note: examine the relationship between each feature variable and the target.

fcols = 2
frows = 1

plt.figure(figsize=(8, 4))

ax = plt.subplot(frows, fcols, 1)
sns.regplot(x="V0", y="target", data=train_data, ax=ax
            , scatter_kws={"marker":".", "s":3, "alpha":0.3}
            , line_kws={"color":"k", "alpha":0.7}
           )

ax = plt.subplot(frows, fcols, 2)
sns.distplot(train_data["V0"].dropna())

plt.show()

![](output_26_0.png)

fcols = 6
frows = len(test_data.columns)

plt.figure(figsize=(5*fcols, 4*frows))

index  = 0
for col in test_data.columns:
    index += 1
    ax = plt.subplot(frows, fcols, index)
    sns.regplot(x=col, y="target", data=train_data, ax=ax
                , scatter_kws={"marker":".", "s":3, "alpha":0.3}
                , line_kws={"color":"k", "alpha":0.7}
               )
    index += 1
    ax = plt.subplot(frows, fcols, index)
    sns.distplot(train_data[col].dropna())

plt.show()

![](output_27_0.png)

Computing correlations between feature variables (quantitative analysis)

# drop the features whose train/test distributions differ
pd.set_option("display.max_columns", 10)
pd.set_option("display.max_rows", 10)
train_data1 = train_data.drop(columns=["V5", "V9", "V11", "V17", "V22", "V28"])
# compute the correlation matrix
train_corr = train_data1.corr()
train_corr
V0V1V2V3V4...V34V35V36V37target
V01.0000000.9086070.4636430.4095760.781212...-0.0193420.1389330.231417-0.4940760.873212
V10.9086071.0000000.5065140.3839240.657790...-0.0291150.1463290.235299-0.4940430.871846
V20.4636430.5065141.0000000.4101480.057697...-0.0256200.0436480.316462-0.7349560.638878
V30.4095760.3839240.4101481.0000000.315046...-0.0318980.0800340.324475-0.2296130.512074
V40.7812120.6577900.0576970.3150461.000000...0.0286590.1000100.113609-0.0310540.603984
....................................
V34-0.019342-0.029115-0.025620-0.0318980.028659...1.0000000.233616-0.019032-0.006854-0.006034
V350.1389330.1463290.0436480.0800340.100010...0.2336161.0000000.025401-0.0779910.140294
V360.2314170.2352990.3164620.3244750.113609...-0.0190320.0254011.000000-0.0394780.319309
V37-0.494076-0.494043-0.734956-0.229613-0.031054...-0.006854-0.077991-0.0394781.000000-0.565795
target0.8732120.8718460.6388780.5120740.603984...-0.0060340.1402940.319309-0.5657951.000000

33 rows × 33 columns

# heatmap: easier to inspect feature-feature and feature-target correlations
plt.figure(figsize=(20, 16))
sns.heatmap(train_corr, vmax=.8, square=True, annot=True)

![](output_31_1.png)

Selecting the K features most correlated with the target

k = 10
nlargest_f = train_corr.nlargest(k, columns="target")["target"]
nlargest_f
target    1.000000
V0        0.873212
V1        0.871846
V8        0.831904
V27       0.812585
V31       0.750297
V2        0.638878
V4        0.603984
V12       0.594189
V16       0.536748
Name: target, dtype: float64
# n_largest corr
cols = nlargest_f.index
cm = np.corrcoef(train_data[cols].values.T)
hm = plt.figure(figsize=(10, 10))
sns.heatmap(train_data[cols].corr(), annot=True, square=True)
plt.show()

![](output_34_0.png)

# include more variables by lowering the correlation threshold
threshold = 0.5

corrmat = train_data.corr()
corr_cols = corrmat[corrmat["target"] > threshold]["target"].index
hm = plt.figure(figsize=(10, 10))
sns.heatmap(train_data[corr_cols].corr(), annot=True, square=True, cmap="RdYlGn")  # colormap: RdYlGn
plt.show()

![](output_35_0.png)

# set a threshold and keep only sufficiently correlated feature variables
threshold = 0.5
corr_matrix = train_data1.corr().abs()
drop_cols  = corr_matrix[corr_matrix["target"] < threshold].index
train_data1 = train_data1.drop(columns=drop_cols)

Conclusion: the larger a feature's correlation coefficient with the target, the stronger its linear influence on the target variable can be assumed to be.

Box-Cox Transformation

Since linear regression assumes (approximately) normally distributed data, the data is often transformed to be closer to normal before statistical analysis. The Box-Cox transformation helps a linear regression model satisfy linearity, normality, independence and homoscedasticity without losing information.
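For reference, a minimal sketch (my illustration, not part of the original notebook) of what scipy.stats.boxcox does: it searches for the λ that makes (x**λ - 1)/λ (or log(x) when λ = 0) as close to normal as possible; the input must be strictly positive.

# hedged sketch: Box-Cox on a strictly positive, right-skewed toy sample
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strictly positive, right-skewed

# boxcox returns the transformed data and the fitted lambda:
# (x**lmbda - 1) / lmbda for lmbda != 0, log(x) for lmbda == 0
x_trans, lmbda = stats.boxcox(x)
print("fitted lambda:", round(lmbda, 3))
print("skew before:", round(stats.skew(x), 3), " skew after:", round(stats.skew(x_trans), 3))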

Combined data (train + test)

drop_columns=["V5", "V9", "V11", "V17", "V22", "V28"]
train_x = train_data.drop(columns="target")

# offline: concatenate the training and test sets; online: training set only
all_data = pd.concat([train_x, test_data])
all_data.drop(columns = drop_columns, inplace=True)
all_data.head()
V0V1V2V3V4...V33V34V35V36V37
00.5660.016-0.1430.4070.452...-4.627-4.789-5.101-2.608-3.508
10.9680.4370.0660.5660.194...-0.8430.1600.364-0.335-0.730
21.0130.5680.2350.3700.112...-0.8430.1600.3640.765-0.589
30.7330.3680.2830.1650.599...-0.843-0.0650.3640.333-0.112
40.6840.6380.2600.2090.337...-0.843-0.2150.364-0.280-0.028

5 rows × 32 columns

# normalize up front (over all data; recommended when the data volume is large)
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
all_cols = all_data.columns.tolist()
all_data  = mms.fit_transform(all_data)

all_data  = pd.DataFrame(all_data, columns=all_cols)
all_data.describe().T
countmeanstdmin25%50%75%max
V04813.00.6941720.1441980.00.6266760.7294880.7901951.0
V14813.00.7213570.1314430.00.6794160.7524970.7995531.0
V24813.00.6023000.1406280.00.5144140.6170720.7004641.0
V34813.00.6031390.1524620.00.5038880.6142700.7104741.0
V44813.00.5237430.1064300.00.4781820.5358660.5850361.0
...........................
V334813.00.4584930.0990950.00.4090370.4545180.5000001.0
V344813.00.4837900.1010200.00.4544900.4999490.5113651.0
V354813.00.7628730.1020370.00.7272730.8000200.8000201.0
V364813.00.3323850.1274560.00.2705840.3470560.4148611.0
V374813.00.5457950.1503560.00.4456470.5393170.6430611.0

32 rows × 8 columns

### Processing train and test separately
# normalize the training set and the test set separately
cols_numeric = test_data.columns
train_data_process = train_data[cols_numeric]
# min-max normalization of the training features
train_data_process = pd.DataFrame(mms.fit_transform(train_data_process), columns=cols_numeric)

test_data_process = test_data[cols_numeric]
test_data_process = pd.DataFrame(mms.fit_transform(test_data_process), columns=cols_numeric)
# Box-Cox transformation
cols_numeric_left = cols_numeric[:13]
cols_numeric_right = cols_numeric[13:]
train_data_process = pd.concat([train_data_process, train_data["target"]], axis=1)
fcols = 6
frows = len(cols_numeric_left)

plt.figure(figsize=(4*fcols, 4*frows))

i = 0

for var in cols_numeric_left:
    dat =  train_data_process[[var, "target"]].dropna()
    i += 1
    plt.subplot(frows, fcols, i)
    sns.distplot(dat[var], fit=stats.norm)
    plt.title(var + " Original")
    plt.xlabel("")
    
    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(dat[var], plot=plt)
    plt.xlabel("")
    plt.ylabel("")
    plt.title("skew=" + "{:.4f}".format(stats.skew(dat[var])))
    
    # correlation scatter plot (original feature vs. target)
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(dat[var], dat["target"], ".", alpha=0.8)
    plt.title("corr=" + "{:.2f}".format(np.corrcoef(dat[var], dat["target"])[0][1]))

    # after the Box-Cox transform
    i += 1
    plt.subplot(frows, fcols, i)
    trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)  # values must be strictly positive
    trans_var = pd.DataFrame(mms.fit_transform(trans_var.reshape(-1, 1)))[0]

    sns.distplot(trans_var, fit=stats.norm)
    plt.title(var + " Transformed")
    plt.xlabel("")

    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(trans_var, plot=plt)
    plt.xlabel("")
    plt.ylabel("")
    plt.title("skew=" + "{:.4f}".format(stats.skew(trans_var)))

    # correlation scatter plot (transformed feature vs. target)
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(trans_var, dat["target"], ".", alpha=0.8)
    plt.title("corr=" + "{:.2f}".format(np.corrcoef(trans_var, dat["target"])[0][1]))

![](output_48_0.png)

Key point
Skewness is measured relative to the normal distribution, whose skewness is 0, i.e. a symmetric distribution has skewness 0. A skewness greater than 0 means the distribution is right-skewed (it has a long tail on the right); a skewness less than 0 means it is left-skewed (a long tail on the left); and the larger the absolute skewness, the more severe the asymmetry. (>0 right-skewed, <0 left-skewed, =0 symmetric.) Note that "left-skewed" and "right-skewed" refer to the direction of the tail, not the position of the peak.
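A quick numeric check of this convention (a toy example, not from the competition data): an exponential sample has a long right tail, so its skewness is positive; mirroring it flips the sign.

# hedged sketch: sign of the skewness vs. direction of the tail
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
right_tailed = rng.exponential(size=10000)   # long tail on the right
left_tailed = -right_tailed                  # mirror image: long tail on the left

print("right-tailed skew:", round(stats.skew(right_tailed), 2))  # > 0
print("left-tailed  skew:", round(stats.skew(left_tailed), 2))   # < 0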

Feature Engineering

Outlier analysis

# box plot of all columns with reference lines at ±7.5
plt.figure(figsize=(18, 10))
plt.boxplot(train_data.values, labels=train_data.columns)
plt.hlines([-7.5, 7.5], 0, 40, "red")
plt.show()


![](output_52_0.png)

# feature V9 shows obvious outliers; drop those rows
train_data = train_data[train_data["V9"] > -7.5]
test_data = test_data[test_data["V9"] > -7.5]
display(train_data.describe())
display(test_data.describe())
V0V1V2V3V4...V34V35V36V37target
count2886.0000002886.0000002886.0000002886.0000002886.000000...2886.0000002886.0000002886.0000002886.0000002886.000000
mean0.1237250.0568560.290340-0.0683640.012254...0.0069590.1985130.030099-0.1319570.127451
std0.9279840.9412690.9112310.9703570.888037...1.0034110.9850580.9702581.0156660.983144
min-4.335000-5.122000-3.420000-3.956000-4.742000...-4.789000-5.695000-2.608000-3.630000-3.044000
25%-0.292000-0.224250-0.310000-0.652750-0.385000...-0.290000-0.199750-0.412750-0.798750-0.347500
50%0.3595000.2730000.386000-0.0450000.109500...0.1600000.3640000.137000-0.1860000.314000
75%0.7260000.5990000.9187500.6235000.550000...0.2730000.6020000.6437500.4930000.793750
max2.1210001.9180002.8280002.4570002.689000...5.1100002.3240005.2380003.0000002.538000

8 rows × 39 columns

V0V1V2V3V4...V33V34V35V36V37
count1925.0000001925.0000001925.0000001925.0000001925.000000...1925.0000001925.0000001925.0000001925.0000001925.000000
mean-0.184404-0.083912-0.4347620.101671-0.019172...-0.011433-0.009985-0.296895-0.0462700.195735
std1.0733331.0766700.9695411.0349251.147286...0.9897320.9952130.9468961.0408540.940599
min-4.814000-5.488000-4.283000-3.276000-4.921000...-4.627000-4.789000-7.477000-2.608000-3.346000
25%-0.664000-0.451000-0.978000-0.644000-0.497000...-0.460000-0.290000-0.349000-0.593000-0.432000
50%0.0650000.195000-0.2670000.2200000.118000...-0.0400000.160000-0.2700000.0830000.152000
75%0.5490000.5890000.2780000.7930000.610000...0.4190000.2730000.3640000.6510000.797000
max2.1000002.1200001.9460002.6030004.475000...5.4650005.1100001.6710002.8610003.021000

8 rows × 38 columns

Data normalization

# normalize the feature variables
feature_columns = test_data.columns.to_list()

mms = MinMaxScaler()
train_data_scalar = mms.fit_transform(train_data[feature_columns])
train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)
# use .values: train_data's index has gaps after the outlier rows were dropped
train_data_scalar["target"] = train_data["target"].values

test_data_scalar = mms.fit_transform(test_data[feature_columns])
test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

Checking the distributions of the dropped features

# the EDA stage showed that the distributions of "V5", "V9", "V11", "V17", "V22", "V28" differ substantially between train and test, so they are dropped
drop_cols = 6
drop_rows = 1

plt.figure(figsize=(5*drop_cols, 5*drop_rows))

for i, col in enumerate(["V5", "V9", "V11", "V17", "V22", "V28"]):
    
    ax = plt.subplot(drop_rows, drop_cols, i+1)
    ax = sns.kdeplot(train_data[col], shade=True, color="Red")
    ax = sns.kdeplot(test_data[col], shade=True, color="Blue")
    ax.set_ylabel("Frequency")
    ax.set_xlabel(col)
    ax = ax.legend(["train", "test"])

plt.show()

![](output_58_0.png)

Feature correlations

plt.figure(figsize=(20, 16))

columns = train_data_process.columns.to_list()
mcorr = train_data_scalar[columns].corr(method="spearman")
mask = np.zeros_like(mcorr, dtype=bool)  # np.bool is deprecated; plain bool works
mask[np.triu_indices_from(mask)] = True

cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt="0.2f")

plt.show()

![](output_60_0.png)

Feature dimensionality reduction

mcorr = mcorr.abs()
num_mcoor = mcorr[mcorr["target"] > 0.1]["target"]
index = num_mcoor.sort_values(ascending=False).index
index
Index(['target', 'V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30'],
      dtype='object')
train_data_scalar[index].corr(method="spearman")
targetV0V31V1V8...V18V13V17V22V30
target1.0000000.7124030.7116360.6829090.679469...0.1497410.1491990.126262-0.1127430.101378
V00.7124031.0000000.7391160.8941160.832151...0.1321430.1738610.055024-0.0766980.099242
V310.7116360.7391161.0000000.8075850.841469...0.0946780.0715170.115056-0.1064500.131453
V10.6829090.8941160.8075851.0000000.849034...0.0936880.1345950.081446-0.0728480.109216
V80.6794690.8321510.8414690.8490341.000000...0.0795920.1053800.102544-0.0783330.165204
....................................
V180.1497410.1321430.0946780.0936880.079592...1.0000000.242008-0.0736780.0168190.133708
V130.1491990.1738610.0715170.1345950.105380...0.2420081.000000-0.1080200.348432-0.097178
V170.1262620.0550240.1150560.0814460.102544...-0.073678-0.1080201.0000000.3637850.057480
V22-0.112743-0.076698-0.106450-0.072848-0.078333...0.0168190.3484320.3637851.000000-0.054570
V300.1013780.0992420.1314530.1092160.165204...0.133708-0.0971780.057480-0.0545701.000000

28 rows × 28 columns

## PCA dimensionality reduction

Goal: remove multicollinearity among the features.

from sklearn.decomposition import PCA

pca = PCA(n_components=0.9)  # keep 90% of the explained variance

# train 
train_data_pca_90 = pca.fit_transform(train_data_scalar.iloc[:, :-1])
train_data_pca_90 = pd.DataFrame(train_data_pca_90)
# use .values: train_data's index has gaps after the outlier rows were dropped
train_data_pca_90["target"] = train_data["target"].values

# test (note: PCA is re-fit on the test set here, so its component space differs from the training set's)
test_data_pca_90 = pca.fit_transform(test_data_scalar.iloc[:, :])
test_data_pca_90 = pd.DataFrame(test_data_pca_90)

display(train_data_pca_90)
display(test_data_pca_90)
01234...12131415target
00.1836260.0754440.0765020.6131280.609231...-0.0428310.015694-0.0964370.3204430.175
10.0127960.0265420.0113500.4055900.422800...0.1361430.0432240.0660510.1775210.676
20.034287-0.0547960.1232250.2776900.431320...0.1819250.0429510.1607000.1480860.633
30.141906-0.0891360.0785650.2629410.433194...0.1765890.0311080.0770630.0676500.206
40.1437490.0093860.2469910.1711650.327496...0.1418550.1158490.1572050.0291450.384
....................................
2881-0.1114950.1402410.200300-0.283676-0.281734...-0.0596680.007330-0.0033440.1155700.892
2882-0.1893280.1132810.247078-0.245414-0.226019...-0.1552420.0755380.0455010.0653430.812
28830.0775460.2012200.183691-0.277095-0.431218...-0.0956670.010671-0.0383390.0689130.235
28840.0219930.2244930.112763-0.199688-0.458057...-0.1336750.068059-0.007461-0.0276651.042
2885-0.0225760.1950850.194000-0.306873-0.299947...-0.0774140.093005-0.009916-0.0096090.005

2886 rows × 17 columns

01234...910111213
00.014675-0.3487510.237515-0.1912050.081670...-0.046190-0.0133700.0380450.099094-0.062470
1-0.056760-0.3577660.259784-0.2791750.155516...-0.071004-0.0240060.0479270.174885-0.117769
2-0.045538-0.3347890.183349-0.1451200.241489...-0.055299-0.0185120.0240060.133972-0.162479
3-0.1066630.025139-0.011816-0.196600-0.031321...-0.140342-0.075948-0.0683040.1193660.032868
4-0.179450-0.022747-0.127998-0.1861300.068751...-0.160952-0.0433510.0192400.141839-0.083498
....................................
19201.4411620.5806290.7518580.2027560.097667...-0.0296170.1223750.0839110.327381-0.380743
19211.4209700.8033160.922584-0.303722-0.108873...-0.1102000.0236440.0404850.172820-0.223728
19221.1611420.9840690.579733-0.370852-0.092587...-0.163546-0.037952-0.1771130.268723-0.001955
19231.1177360.9566060.529063-0.451184-0.031903...-0.129908-0.082961-0.1539170.265543-0.010352
19241.0246240.7807770.418843-0.2809300.135125...-0.254216-0.198561-0.0943390.373396-0.002877

1925 rows × 14 columns
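The number of retained components differs between the two tables above because the scaler and the PCA are re-fit on the test set. For a pipeline that has to score the test set with a model trained on the training set, it is usually safer to fit the transformers on the training data only and merely transform the test data. A minimal sketch of that variant (my assumption, not what this notebook does; it reuses feature_columns and the train/test frames defined earlier):

# hedged sketch: fit the scaler and PCA on train only, then transform test
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

mms = MinMaxScaler()
train_features = mms.fit_transform(train_data[feature_columns])

pca = PCA(n_components=0.9)
train_pca = pca.fit_transform(train_features)

# apply the *same* fitted transformers to the test set,
# so both sets live in exactly the same component space
test_pca = pca.transform(mms.transform(test_data[feature_columns]))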

Model Validation

Data processing

train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")

# normalize the data
feature_columns = test_data.columns.to_list()

mms = MinMaxScaler()
train_data_scalar = mms.fit_transform(train_data[feature_columns])
train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)
train_data_scalar["target"] = train_data["target"]

test_data_scalar = mms.fit_transform(test_data[feature_columns])
test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

# PCA dimensionality reduction
pca = PCA(n_components=16)  # keep 16 principal components

# train 
train_data_pca_16 = pca.fit_transform(train_data_scalar.iloc[:, :-1])
train_data_pca_16 = pd.DataFrame(train_data_pca_16)
train_data_pca_16["target"] = train_data["target"]

# test (re-fit on the test set, as in the original; see the note above about fitting on train only)
test_data_pca_16 = pca.fit_transform(test_data_scalar.iloc[:, :])
test_data_pca_16 = pd.DataFrame(test_data_pca_16)

# train/validation split
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

train_data_pca_16 = train_data_pca_16.fillna(0)
X = train_data_pca_16.iloc[:,:-1]
y = train_data_pca_16.iloc[:, -1]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

Models

# L2 regularization
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor
poly = PolynomialFeatures(degree=3)

Poly_X_train = poly.fit_transform(X_train)
Poly_X_val = poly.fit_transform(X_val)

clf = SGDRegressor(max_iter=1000, tol=0.001, penalty="L2", alpha=0.00001)
clf.fit(Poly_X_train, y_train)
SGDRegressor(alpha=1e-05, penalty='L2')
print("Training mse:", MSE(y_train, clf.predict(Poly_X_train)))
print("Testing mse:", MSE(y_val, clf.predict(Poly_X_val)))
Training mse: 0.13411813527459393
Testing mse: 0.14244273147224495
# L1 regularization
poly = PolynomialFeatures(degree=3)

Poly_X_train = poly.fit_transform(X_train)
Poly_X_val = poly.fit_transform(X_val)

clf = SGDRegressor(max_iter=1000, tol=0.001, penalty="L1", alpha=0.00001)
clf.fit(Poly_X_train, y_train)
SGDRegressor(alpha=1e-05, penalty='L1')
print("Training mse:", MSE(y_train, clf.predict(Poly_X_train)))
print("Testing mse:", MSE(y_val, clf.predict(Poly_X_val)))
Training mse: 0.13443580962219054
Testing mse: 0.14261036582932843

Offline model validation with LightGBM

# prepare the data
train_data_2_f = train_data.copy()
test_data_2_f = test_data.copy()

train_data_2 = train_data_2_f[test_data_2_f.columns]
train_data_2_target = train_data_2_f.iloc[:, -1]
# 5-fold cross-validation
from sklearn.model_selection import KFold
import lightgbm as lgb


Folds = 5
kf = KFold(n_splits=Folds, shuffle=True, random_state=2021)

# record the train/validation MSE of each fold
MSE_DICT = {"train mse":[], "test mse":[]}

# cross-validation loop
for i, (train_index, test_index) in enumerate(kf.split(train_data_2)):
    
    lgb_reg = lgb.LGBMRegressor(learning_rate=0.01
                            , max_depth=-1
                            , n_estimators=5000
                            , boosting_type="gbdt"
                            , random_state=2021
                            , objective="regression"
                           )
    
    X_train_KFold, X_test_KFold = train_data_2.iloc[train_index, :], train_data_2.iloc[test_index, :]
    Y_train_KFold, Y_test_KFold = train_data_2_target[train_index], train_data_2_target[test_index]
    
    # fit the model
    lgb_reg.fit(X=X_train_KFold, y=Y_train_KFold
            , eval_set=[(X_train_KFold, Y_train_KFold), (X_test_KFold, Y_test_KFold)]
            , eval_names=["Train", "Test"]
            , eval_metric="mse"
            , early_stopping_rounds=100
            , verbose=100
           )

    # evaluate the model
    X_train_KFold_pred = MSE(Y_train_KFold, lgb_reg.predict(X_train_KFold, num_iteration=lgb_reg.best_iteration_))
    X_test_KFold_pred = MSE(Y_test_KFold, lgb_reg.predict(X_test_KFold, num_iteration=lgb_reg.best_iteration_))

    print("Fold {}: train MSE / validation MSE".format(i+1))
    print("-------\ntrain MSE:{}".format(X_train_KFold_pred))
    print("validation MSE:{}\n-------\n".format(X_test_KFold_pred))

    # keep the per-fold scores and report their means
    MSE_DICT["train mse"].append(X_train_KFold_pred)
    MSE_DICT["test mse"].append(X_test_KFold_pred)
    print("-------\ntrain scores:", MSE_DICT["train mse"])
    print("test scores:", MSE_DICT["test mse"])
    print("train MSE (mean):{}".format(np.mean(MSE_DICT["train mse"])))
    print("test MSE (mean):{}\n-------\n".format(np.mean(MSE_DICT["test mse"])))

Learning curves

Note: a learning curve plots the model's score on the training set and on the cross-validation folds as the training-set size grows; it helps judge whether the model's bias or variance is too high and whether adding more training data would reduce overfitting.

def plot_learning_curve(estimator, title, X, y, 
                        ylim = None,
                        cv=None,
                        n_jobs = None
                       ):
    train_size, train_scores, test_scores = learning_curve(estimator=estimator
                                                           ,X = X, y = y, cv = cv, n_jobs=n_jobs)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.plot(train_size, train_scores_mean, "o-", label = "Training Score", color="r")
    plt.fill_between(train_size, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, color="r", alpha=0.1)
    
    plt.plot(train_size, test_scores_mean, "o-", label = "Cross_validation score", color="g")
    plt.fill_between(train_size, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, color="g", alpha=0.1)
    
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.ylabel("Score")
    plt.xlabel("Training examples")
    plt.grid()
    plt.legend(loc="best")
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import learning_curve

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=2021)
model = SGDRegressor()
plot_learning_curve(model, "LinerRegression", X, y, ylim=(0.7, 0.9), cv=cv, n_jobs=-1)


![](output_102_0.png)

Conclusion: increasing the amount of training data is an effective way to reduce overfitting.

Validation curves

Note: a validation curve evaluates how a single hyperparameter affects the model.

from sklearn.model_selection import validation_curve

X = train_data_2.copy()
y = train_data_2_target.copy()

model = SGDRegressor(max_iter=1000, tol=1e-3, penalty="L2")
# tune the regularization coefficient alpha
param_range = [0.1, 0.01, 0.001, 0.00001, 0.000001, 0.0000001]

train_scores, test_scores = validation_curve(model, X, y
                                             , param_name="alpha"
                                             , param_range=param_range
                                             , scoring="r2"
                                             , cv=10
                                             , n_jobs=-1
                                            )

train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)

test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)


plt.semilogx(param_range, train_scores_mean, "o-", label = "Training Score", color="r")
plt.fill_between(param_range, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, color="r", alpha=0.1)

plt.semilogx(param_range, test_scores_mean, "o-", label = "Cross-validation score", color="g")
plt.fill_between(param_range, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, color="g", alpha=0.1)

plt.title("Validation Curve with SGDRegressor")
plt.xlabel("alpha")
plt.ylabel("Score")
plt.ylim(0.6, 1.0)
plt.legend(loc="best")
plt.show()

![](output_105_0.png)

Model Ensembling

Loading the data

train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")
# # normalize the data
# feature_columns = test_data.columns.to_list()

# mms = MinMaxScaler()
# train_data_scalar = mms.fit_transform(train_data[feature_columns])
# train_data_scalar = pd.DataFrame(train_data_scalar, columns=feature_columns)

# test_data_scalar = mms.fit_transform(test_data[feature_columns])
# test_data_scalar = pd.DataFrame(test_data_scalar, columns=feature_columns)

# keep the features selected earlier by Spearman correlation with the target
# training
# train_data_scalar = train_data.drop(columns=["V5", "V9", "V11", "V17", "V22", "V28"])
corr_col = ['V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30']
train_data_scalar = train_data[corr_col].copy()
train_data_scalar["target"] = train_data["target"]

# testing
# test_data_scalar = test_data.drop(columns =["V5", "V9", "V11", "V17", "V22", "V28"] )
test_data_scalar = test_data[corr_col].copy()
train_data_scalar["origin"] = "train"
test_data_scalar["origin"] = "test"
data_all = pd.concat([train_data_scalar, test_data_scalar], axis=0, ignore_index=True)
def get_training_data(all_data=None):
    """Split the combined frame back into training data and make a train/validation split."""
    
    from sklearn.model_selection import train_test_split
    

    df_train = all_data[all_data["origin"] == "train"]
    df_train["label"] = all_data["target"]
    
    # assemble X and y
    y = df_train["target"]
    X = df_train.drop(columns=["label", "target", "origin"])
    X_train, X_val, y_train, y_val = train_test_split(X, y
                                                        , test_size=0.2
                                                        , random_state=2021
                                                       )
    
    return X_train, X_val, y_train, y_val
def get_test_data():
    """Return the test-set features."""
    
    # note: filter data_all on its own "origin" column (the earlier all_data frame has no such column)
    df_test = data_all[data_all["origin"] == "test"].reset_index(drop=True)
    
    return df_test.drop(columns=["origin"])

Evaluation metrics

from sklearn.metrics import make_scorer

def rmse(y_true, y_pred):
    
    diff = y_pred - y_true
    sum_sq = sum(diff**2)
    n = len(y_pred)
    
    return np.sqrt(sum_sq/n)

def mse(y_true, y_pred):
    
    return mean_squared_error(y_true, y_pred)
# scorers (negated because greater_is_better=False)
rmse_scorer = make_scorer(rmse, greater_is_better=False)
mse_scorer = make_scorer(mse, greater_is_better=False)

Data preparation

# use a Ridge model to detect and remove outliers
from sklearn.linear_model import Ridge
model = Ridge()
X_train, X_val, y_train, y_val = get_training_data(data_all)
outliers = find_outliers(model, X_train, y_train)
MSE:0.1111, R^2:0.8862
resid_mean:0.0, resid_std:0.3335
outlier index: [2620, 1310, 777, 1874, 2264, 2667, 2279, 348, 2767, 1905, 884, 1523, 2645, 2647, 2696, 1879, 2842, 1704, 2668, 1458, 2669]

![](output_124_1.png)

X_outliers=X_train.loc[outliers]
y_outliers=y_train.loc[outliers]
X_t=X_train.drop(index=outliers)
y_t=y_train.drop(index=outliers)
def get_trainning_data_omitoutliers():

    y=y_t.copy()
    X=X_t.copy()
    
    return X,y

Model-training helper

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import GridSearchCV, cross_val_score

def train_model(model, param_grid=[], X=[], y=[], 
                splits=5, repeats=5):

    # fetch the data (outliers removed) if not provided
    if len(y)==0:
        X,y = get_trainning_data_omitoutliers()

    # cross-validation scheme
    rkfold = RepeatedKFold(n_splits=splits, n_repeats=repeats)

    # grid-search the best hyperparameters
    if len(param_grid)>0:
        gsearch = GridSearchCV(model, param_grid, cv=rkfold,
                               scoring="neg_mean_squared_error",
                               verbose=1, return_train_score=True)

        # fit
        gsearch.fit(X,y)

        # keep the best model
        model = gsearch.best_estimator_
        best_idx = gsearch.best_index_

        # cross-validation scores of the best parameter setting
        grid_results = pd.DataFrame(gsearch.cv_results_)
        cv_mean = abs(grid_results.loc[best_idx,'mean_test_score'])
        cv_std = grid_results.loc[best_idx,'std_test_score']

    # no grid search
    else:
        # plain cross-validation
        grid_results = []
        cv_results = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=rkfold)
        cv_mean = abs(np.mean(cv_results))
        cv_std = np.std(cv_results)
    
    # collect the CV statistics
    cv_score = pd.Series({'mean':cv_mean,'std':cv_std})

    # predict on the training data
    y_pred = model.predict(X)

    # report model performance
    print('----------------------')
    print(model)
    print('----------------------')
    print('score=',model.score(X, y))
    print('rmse=',rmse(y, y_pred))
    print('mse=',mse(y, y_pred))
    print('cross_val: mean=',cv_mean,', std=',cv_std)

    # residual analysis and visualization
    y_pred = pd.Series(y_pred, index=y.index)
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    z = (resid - mean_resid)/std_resid    
    n_outliers = sum(abs(z)>3)
    outliers = z[abs(z)>3].index
    
    plt.figure(figsize=(15, 5))
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y,y_pred,'.')
#     plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y_pred');
    plt.title('corr = {:.3f}'.format(np.corrcoef(y,y_pred)[0][1]))
    
    ax_132=plt.subplot(1, 3, 2)
    plt.plot(y,y-y_pred,'.')
#     plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.xlabel('y')
    plt.ylabel('y - y_pred');
    plt.title('std resid = {:.3f}'.format(std_resid))
    
    ax_133=plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
#     z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.xlabel('z')
    plt.title('{:.0f} samples with z>3'.format(n_outliers))

    return model, cv_score, grid_results
opt_models = dict()
score_models = pd.DataFrame(columns=["mean", "std"])
splits = 5
repeats = 5

Single models

Ridge (L2 regularization)

model = "Ridge"

opt_models[model] = Ridge()

# tune the regularization strength alpha
alpha_range = np.arange(0.25, 6, 0.25)
param_grid = {"alpha": alpha_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(alpha_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )
plt.xlabel("alpha")
plt.ylabel("score")
----------------------
Ridge(alpha=0.5)
----------------------
score= 0.9030109271008152
rmse= 0.3043510385681425
mse= 0.09262955467750701
cross_val: mean= 0.0956725874884991 , std= 0.006621695786068565

![](output_132_5.png)

![](output_132_6.png)

mean_squared_error(y_val, opt_models["Ridge"].predict(X_val))
0.10705413929393254

Lasso (L1 regularization)

from sklearn.linear_model import Lasso

model = "Lasso"

opt_models[model] = Lasso()

# tune parameters (Lasso is very sensitive to alpha)
alpha_range = np.arange(1e-4, 1e-3, 4e-5)
param_grid = {"alpha": alpha_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)


plt.figure()
plt.errorbar(alpha_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("alpha")
plt.ylabel("score")
----------------------
Lasso(alpha=0.00022000000000000003)
----------------------
score= 0.9031250717142573
rmse= 0.3046339142675125
mse= 0.09280182172194627
cross_val: mean= 0.09659329022777374 , std= 0.006394970336172911

![](output_136_4.png)

![](output_136_5.png)

ElasticNet regression

from sklearn.linear_model import ElasticNet

model = "ElasticNet"

opt_models[model] = ElasticNet()

# tune parameters

param_grid = {"alpha": np.arange(1e-4, 1e-3, 1e-4), "l1_ratio": np.arange(0.1, 1.0, 0.1), "max_iter":[100000]}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
ElasticNet(alpha=0.00030000000000000003, l1_ratio=0.9, max_iter=100000)
----------------------
score= 0.903108934683549
rmse= 0.3046592855488974
mse= 0.09281728027116438
cross_val: mean= 0.09687683618331633 , std= 0.005227508188395059

![](output_138_3.png)

SVR regression (LinearSVR)

from sklearn.svm import LinearSVR
model = "LinearSVR"

opt_models[model] = LinearSVR()

# tune parameters
c_range = np.arange(0.1, 1, 0.1)
param_grid = {"C": c_range
              , "max_iter":[1000]
             }
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=repeats)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(c_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("C")
plt.ylabel("score")
----------------------
LinearSVR(C=0.7000000000000001)
----------------------
score= 0.9018854105633738
rmse= 0.30657684406544794
mse= 0.09398936131712997
cross_val: mean= 0.0968415393834378 , std= 0.006284252976090882

![](output_140_4.png)

![](output_140_5.png)

K-nearest neighbors

from sklearn.neighbors import KNeighborsRegressor

model = "KNeighbors"

opt_models[model] = KNeighborsRegressor()

# tune the number of neighbors
k_range = np.arange(3, 11, 1)
param_grid = {"n_neighbors": k_range}

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)

plt.figure()
plt.errorbar(k_range
             , abs(grid_results["mean_test_score"])
             , abs(grid_results["std_test_score"])/np.sqrt(splits*repeats)
            )

plt.xlabel("n_neighbors")
plt.ylabel("score")
----------------------
KNeighborsRegressor(n_neighbors=7)
----------------------
score= 0.8671509432245664
rmse= 0.35673998640960447
mse= 0.12726341790352505
cross_val: mean= 0.17977332255159276 , std= 0.015313175226783347

![](output_142_4.png)

![](output_142_5.png)

Boosting methods

GBDT

from sklearn.ensemble import GradientBoostingRegressor

model = "GBDT"

opt_models[model] = GradientBoostingRegressor()

# tune parameters
param_grid = {"n_estimators":[150, 250, 350]
              , "max_depth":[3, 4, 5, 6]
              , "subsample":[0.6, 0.7, 0.8]
              , "min_samples_split":[5, 6, 7]
             }

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
GradientBoostingRegressor(max_depth=4, min_samples_split=5, n_estimators=350,
                          subsample=0.7)
----------------------
score= 0.9908065332807071
rmse= 0.09370290133201399
mse= 0.008780233718037152
cross_val: mean= 0.09280908808257234 , std= 0.0026463691062500556

![](output_145_3.png)

mean_squared_error(y_val, opt_models["GBDT"].predict(X_val))
0.1144506129075825

XGB

from xgboost import XGBRegressor

model = "XGB"

opt_models[model] = XGBRegressor()

# tune parameters
param_grid = {"n_estimators":[100, 200, 300, 400, 500]
              , "max_depth":[1, 2, 3]
             }

opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=2,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=200, n_jobs=4, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)
----------------------
score= 0.9557014165347619
rmse= 0.2060001199760511
mse= 0.04243604943014754
cross_val: mean= 0.10432026650623527 , std= 0.007917387050248647

![](output_150_3.png)

LightGBM

  1. Decide the task type (regression, classification, ranking, ...) and the type of base learner (dart, gbdt, rf).
  2. Start with a fairly high learning rate, around 0.1, to speed up convergence; this makes the tuning rounds much faster.
  3. Tune the basic tree parameters.
  4. Tune the regularization parameters.
  5. Finally, lower the learning rate to squeeze out the last bit of accuracy.
from time import time
import lightgbm as lgb
import datetime
data_train = lgb.Dataset(X_train, y_train)
data_val = lgb.Dataset(X_val, y_val)
Initial state (untuned)
# parameters essentially left at their defaults
params1 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 1

    , "learning_rate": 0.1
    
    , "max_depth": 5
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs suggest keeping this below 2^max_depth
    , "subsample": 1.0  # row subsampling
    , "colsample_bytree": 1.0  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result1 = lgb.cv(params=params1, train_set=data_train
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , seed=0
                      )
print("参数处理时间:",datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
Tuned parameters (lower learning rate, shallower trees)
num_boost_round = 1000
params2 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    , "silent": True
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 1

    , "learning_rate": 0.01
    
    , "max_depth": 3
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs suggest keeping this below 2^max_depth
    , "subsample": 1.0  # row subsampling
    , "colsample_bytree": 1.0  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result2 = lgb.cv(params=params2, train_set=data_train
                    , num_boost_round= num_boost_round
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , seed=0
                      )
print("elapsed time:", datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
#  pick the best number of boosting rounds (estimators)
print("Best_n_estimators: %d\nBest_cv_score: %.4f" 
      % (np.array(list(cv_result2.values())).shape[1],
         min(np.array(list(cv_result2.values()))[0]))
     )  
Best_n_estimators: 170
Best_cv_score: 0.3413
Further tuned parameters (with row/feature subsampling)
params3 = {
      "boosting_type": "gbdt"
    , "objective": "regression"  # regression task
    , "metric": {"mse", "rmse"}
    
    , "nthread": 4
#     , "device": "gpu"
#     , "gpu_device_id": 1
    , "verbose": 0

    , "learning_rate": 0.01
    
    , "max_depth": 3
    , "num_leaves": 31  # LightGBM grows trees leaf-wise; the docs suggest keeping this below 2^max_depth
    , "subsample": 0.8  # row subsampling
    , "colsample_bytree": 0.8  # feature subsampling
    
    , 'reg_alpha': 0.0  # L1
    , 'reg_lambda': 0.0  # L2
}
t0 = time()
cv_result3 = lgb.cv(params=params3, train_set=data_train
                    , num_boost_round=num_boost_round 
                    , early_stopping_rounds = 20
                    , nfold=5
                    , stratified=False
                    , shuffle=True
#                     , metrics="binary_logloss",
                    , verbose_eval=True
                    , seed=0
                      )
print("elapsed time:", datetime.datetime.fromtimestamp(time()-t0).strftime("%M:%S:%f"))
Visualizing the metrics
RMSE metric
fig, ax = plt.subplots(1, 2, figsize = (14,5))

length1 = np.array(list(cv_result1.values())).shape[1]
length2 = np.array(list(cv_result2.values())).shape[1]
length3 = np.array(list(cv_result3.values())).shape[1]

ax[0].plot(range(length1), cv_result1[list(cv_result1.keys())[0]], label="param1", c="red")
ax[1].plot(range(length1), cv_result1[list(cv_result1.keys())[1]], label="param1", c="green")

ax[0].plot(range(length2), cv_result2[list(cv_result2.keys())[0]], label="param2", c="magenta")
ax[1].plot(range(length2), cv_result2[list(cv_result2.keys())[1]], label="param2", c="blue")

ax[0].plot(range(length3), cv_result3[list(cv_result3.keys())[0]], label="param3", c="yellow")
ax[1].plot(range(length3), cv_result3[list(cv_result3.keys())[1]], label="param3", c="c")

ax[0].set_xlabel("num_round", fontsize=12)
ax[1].set_xlabel("num_round", fontsize=12)
ax[0].set_ylabel(list(cv_result1.keys())[0], fontsize=12)
ax[1].set_ylabel(list(cv_result1.keys())[1], fontsize=12)
ax[0].set_ylim((0.09, 0.4))
ax[0].legend()
ax[1].legend()
plt.show()

![](output_167_0.png)

MSE metric
fig, ax = plt.subplots(1, 2, figsize = (14,5))

length1 = np.array(list(cv_result1.values())).shape[1]
length2 = np.array(list(cv_result2.values())).shape[1]
length3 = np.array(list(cv_result3.values())).shape[1]

ax[0].plot(range(length1), cv_result1[list(cv_result1.keys())[2]], label="param1", c="red")
ax[1].plot(range(length1), cv_result1[list(cv_result1.keys())[3]], label="param1", c="green")

ax[0].plot(range(length2), cv_result2[list(cv_result2.keys())[2]], label="param2", c="magenta")
ax[1].plot(range(length2), cv_result2[list(cv_result2.keys())[3]], label="param2", c="blue")

ax[0].plot(range(length3), cv_result3[list(cv_result3.keys())[2]], label="param3", c="yellow")
ax[1].plot(range(length3), cv_result3[list(cv_result3.keys())[3]], label="param3", c="c")

ax[0].set_xlabel("num_round", fontsize=12)
ax[1].set_xlabel("num_round", fontsize=12)
ax[0].set_ylabel(list(cv_result1.keys())[2], fontsize=12)
ax[1].set_ylabel(list(cv_result1.keys())[3], fontsize=12)
ax[0].set_ylim((0.1, 0.13))
ax[0].legend()
ax[1].legend()
plt.show()

![](output_169_0.png)

rs  = lgb.train(params3, num_boost_round=num_boost_round, train_set=data_train, valid_sets=[data_val])
mean_squared_error(y_val, rs.predict(X_val))
0.10717527815511706

Random forest

from sklearn.ensemble import RandomForestRegressor

model = "RandomForest"

opt_models[model] = RandomForestRegressor()

# tune parameters
param_grid = {"n_estimators":[100, 150, 200]
              , "max_features":[1, 2, 3]
#               , "max_depth": [2, 3, 4, 5]
              , "min_samples_split": [2, 4, 6]
             }
opt_models[model], cv_score, grid_results = train_model(opt_models[model], param_grid, splits=splits, repeats=1)

cv_score.name = model
score_models = score_models.append(cv_score)
----------------------
RandomForestRegressor(max_features=3, min_samples_split=4, n_estimators=200)
----------------------
score= 0.9796795029577018
rmse= 0.13930944030785755
mse= 0.01940712015888852
cross_val: mean= 0.10995452070949858 , std= 0.012431532597183166

![](output_175_3.png)

mean_squared_error(y_val, opt_models["RandomForest"].predict(X_val))
0.12410701776059764

Multi-model fusion (bagging / simple averaging)

def model_predict(test_data, test_y = [], stack=True):
    """Average the predictions of the tuned models (simple voting)."""
    
    i = 0
    y_predict_total = np.zeros((test_data.shape[0], ))
    
    for model in opt_models.keys():
        if model != "LinearSVR" and model != "KNeighbors":
            y_predict = opt_models[model].predict(test_data)
            y_predict_total += y_predict
            i += 1
            if len(test_y) > 0:
                print("{}-mse: ".format(model), mean_squared_error(test_y, y_predict))
    
    y_predict_mean = np.round(y_predict_total/i, 3)
    
    if len(test_y)>0:
        print("mean_mse: ", mean_squared_error(test_y, y_predict_mean))
    else:
#         y_predict_mean = pd.Series(y_predict_mean)
        return y_predict_mean
    
model_predict(X_val, y_val)

Stacking

Choosing the models

  • The first-layer base models should be strong models, while the second layer can use a simple learner to guard against overfitting.
  • The number of first-layer base models should not be too small, because it equals the feature dimension seen by the second-layer learner; loosely, you can think of it as the number of neurons in the first layer of a neural network, although more is not always better.
  • The first-layer base models must be "accurate but diverse": a single poorly performing model in the first layer can seriously hurt the whole ensemble. (A minimal stacking sketch follows this list.)
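As a reference point (my sketch, not the hand-rolled stacking implemented below), scikit-learn's StackingRegressor expresses the same idea in a few lines; the base estimators chosen here are only illustrative.

# hedged sketch: two strong first-layer models, a simple linear meta-learner on top
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=200, random_state=2021)),
        ("gbdt", GradientBoostingRegressor(random_state=2021)),
    ],
    final_estimator=Ridge(),   # simple meta-learner to limit overfitting
    cv=5,                      # out-of-fold predictions feed the meta-learner
)
# usage: stack.fit(X_train, y_train); stack.predict(X_val)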

Generating the training data for the second-level learner

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
from scipy import sparse
import xgboost
import lightgbm

from sklearn.ensemble import RandomForestRegressor,AdaBoostRegressor,GradientBoostingRegressor,ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
def stacking_reg(clf,train_x,train_y,test_x,clf_name,kf,label_split=None):
    train=np.zeros((train_x.shape[0],1))
    test=np.zeros((test_x.shape[0],1))
    folds = kf.get_n_splits()  # number of CV folds
    test_pre=np.empty((folds,test_x.shape[0],1))
    cv_scores=[]
    for i,(train_index,test_index) in enumerate(kf.split(train_x,label_split)):       
        tr_x=train_x[train_index]
        tr_y=train_y[train_index]
        te_x=train_x[test_index]
        te_y = train_y[test_index]
        
        if clf_name in ["rf","ada","gb","et","lr","lsvc","knn"]:
            clf.fit(tr_x,tr_y)
            pre=clf.predict(te_x).reshape(-1,1)
            train[test_index]=pre
            test_pre[i,:]=clf.predict(test_x).reshape(-1,1)
            cv_scores.append(mean_squared_error(te_y, pre))
        elif clf_name in ["xgb"]:
            train_matrix = clf.DMatrix(tr_x, label=tr_y, missing=-1)
            test_matrix = clf.DMatrix(te_x, label=te_y, missing=-1)
            z = clf.DMatrix(test_x, label=te_y, missing=-1)
            params = {'booster': 'gbtree',
                      'eval_metric': 'rmse',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.03,
                      'tree_method': 'exact',
                      'seed': 2021,
                      'nthread': 12
                      }
            num_round = 10000
            early_stopping_rounds = 20
            watchlist = [(train_matrix, 'train'),
                         (test_matrix, 'eval')
                         ]
            if test_matrix:
                model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(test_matrix,ntree_limit=model.best_ntree_limit).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))

        elif clf_name in ["lgb"]:
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            test_matrix = clf.Dataset(te_x, label=te_y)
            params = {
                      'boosting_type': 'gbdt',
                      'objective': 'regression_l2',
                      'metric': 'mse',
                      'min_child_weight': 1.5,
                      'num_leaves': 2**5,
                      'lambda_l2': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'learning_rate': 0.03,
                      'tree_method': 'exact',
                      'seed': 2021,
                      'nthread': 12,
                      'silent': True,
                      }
            num_round = 10000
            early_stopping_rounds = 20
            if test_matrix:
                model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                                  early_stopping_rounds=early_stopping_rounds
                                  )
                pre= model.predict(te_x,num_iteration=model.best_iteration).reshape(-1,1)
                train[test_index]=pre
                test_pre[i, :]= model.predict(test_x, num_iteration=model.best_iteration).reshape(-1,1)
                cv_scores.append(mean_squared_error(te_y, pre))
        else:
            raise ValueError("Unsupported clf_name; please add a branch for this model.")
        print("%s now score is:"%clf_name,cv_scores)
    test[:]=test_pre.mean(axis=0)
    print("%s_score_list:"%clf_name,cv_scores)
    print("%s_score_mean:"%clf_name,np.mean(cv_scores))
    return train.reshape(-1,1),test.reshape(-1,1)
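
As a quick sanity check of what stacking_reg returns, a hedged toy run on synthetic data (not the competition data): the out-of-fold predictions for the training rows and the fold-averaged test predictions both come back as column vectors. Note that the function reads the number of folds from the global variable folds.

# Toy shape check for stacking_reg (synthetic data; `folds` must match the KFold)
folds = 5
kf_demo = KFold(n_splits=folds, shuffle=True, random_state=2021)
x_tr, y_tr = np.random.rand(100, 8), np.random.rand(100)
x_te = np.random.rand(30, 8)
oof_train, oof_test = stacking_reg(RandomForestRegressor(n_estimators=50, random_state=2021),
                                   x_tr, y_tr, x_te, "rf", kf_demo)
print(oof_train.shape, oof_test.shape)   # expected: (100, 1) (30, 1)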

Defining the base learners

def rf_reg(x_train, y_train, x_valid, kf, label_split=None):
    randomforest = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2021, max_features="auto",verbose=1)
    rf_train, rf_test = stacking_reg(randomforest, x_train, y_train, x_valid, "rf", kf, label_split=label_split)
    return rf_train, rf_test,"rf_reg"

def ada_reg(x_train, y_train, x_valid, kf, label_split=None):
    adaboost = AdaBoostRegressor(n_estimators=30, random_state=2021, learning_rate=0.01)
    ada_train, ada_test = stacking_reg(adaboost, x_train, y_train, x_valid, "ada", kf, label_split=label_split)
    return ada_train, ada_test,"ada_reg"

def gb_reg(x_train, y_train, x_valid, kf, label_split=None):
    gbdt = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2021,max_depth=5,verbose=1)
    gbdt_train, gbdt_test = stacking_reg(gbdt, x_train, y_train, x_valid, "gb", kf, label_split=label_split)
    return gbdt_train, gbdt_test,"gb_reg"

def et_reg(x_train, y_train, x_valid, kf, label_split=None):
    extratree = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2021,verbose=1)
    et_train, et_test = stacking_reg(extratree, x_train, y_train, x_valid, "et", kf, label_split=label_split)
    return et_train, et_test,"et_reg"

def lr_reg(x_train, y_train, x_valid, kf, label_split=None):
    lr_reg=LinearRegression(n_jobs=-1)
    lr_train, lr_test = stacking_reg(lr_reg, x_train, y_train, x_valid, "lr", kf, label_split=label_split)
    return lr_train, lr_test, "lr_reg"

def xgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    xgb_train, xgb_test = stacking_reg(xgboost, x_train, y_train, x_valid, "xgb", kf, label_split=label_split)
    return xgb_train, xgb_test,"xgb_reg"

def lgb_reg(x_train, y_train, x_valid, kf, label_split=None):
    lgb_train, lgb_test = stacking_reg(lightgbm, x_train, y_train, x_valid, "lgb", kf, label_split=label_split)
    return lgb_train, lgb_test,"lgb_reg"

Prediction

def stacking_pred(x_train, y_train, x_valid, kf, clf_list, label_split=None, clf_fin="lgb", if_concat_origin=True):
    
    # First-level (base) learners
    column_list = []
    train_data_list = []
    test_data_list = []
    for clf in clf_list:
        # each wrapper returns out-of-fold train predictions and fold-averaged test predictions
        train_data, test_data, clf_name = clf(x_train, y_train, x_valid, kf, label_split=label_split)
        train_data_list.append(train_data)
        test_data_list.append(test_data)
        column_list.append("clf_%s" % (clf_name))
            
    # Assemble the training matrix for the second-level learner
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)
    
    # Empirically, concatenating the first-level predictions with the original features
    # for the second-level model often improves performance and helps guard against overfitting
    if if_concat_origin:
        train = np.concatenate([x_train, train], axis=1)
        test = np.concatenate([x_valid, test], axis=1)
    print(x_train.shape)
    print(train.shape)
    print(clf_name)
    print(clf_name in ["lgb"])
    
    # Second-level (meta) learner
    if clf_fin in ["rf","ada","gb","et","lr","lsvc","knn"]:
        if clf_fin in ["rf"]:
            clf = RandomForestRegressor(n_estimators=600, max_depth=20, n_jobs=-1, random_state=2021, max_features="auto",verbose=1)
        elif clf_fin in ["ada"]:
            clf = AdaBoostRegressor(n_estimators=30, random_state=2021, learning_rate=0.01)
        elif clf_fin in ["gb"]:
            clf = GradientBoostingRegressor(learning_rate=0.04, n_estimators=100, subsample=0.8, random_state=2021,max_depth=5,verbose=1)
        elif clf_fin in ["et"]:
            clf = ExtraTreesRegressor(n_estimators=600, max_depth=35, max_features="auto", n_jobs=-1, random_state=2021,verbose=1)
        elif clf_fin in ["lr"]:
            clf = LinearRegression(n_jobs=-1)
            
        clf.fit(train, y_train)
        pred = clf.predict(test).reshape(-1,1)
        return pred
    
    elif clf_fin in ["xgb"]:
        clf = xgboost
        train_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        # no labels exist for the test rows, so the watchlist falls back to the training matrix
        test_matrix = clf.DMatrix(train, label=y_train, missing=-1)
        # DMatrix wrapper for the rows we actually need to predict
        z = clf.DMatrix(test, missing=-1)
        params = {'booster': 'gbtree',
                  'eval_metric': 'rmse',
                  'gamma': 1,
                  
                  'min_child_weight': 1.5,
                  'max_depth': 5,
                  'lambda': 10,
                  
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'eta': 0.03,
                  'tree_method': 'exact',
                  'seed': 2021,
                  'nthread': 12
                  }
        num_round = 10000
        early_stopping_rounds = 20
        watchlist = [(train_matrix, 'train'),
                     (test_matrix, 'eval')
                     ]
        model = clf.train(params, train_matrix, num_boost_round=num_round,evals=watchlist,
                          early_stopping_rounds=early_stopping_rounds
                          )
        pre = model.predict(z, ntree_limit=model.best_ntree_limit).reshape(-1,1)
        return pre
    elif clf_fin in ["lgb"]:
        print(clf_name)
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=y_train)
        test_matrix = clf.Dataset(train, label=y_train)
        params = {
                  'boosting_type': 'gbdt',
                  'objective': 'regression_l2',
                  'metric': 'mse',
                  'min_child_weight': 1.5,
                  'num_leaves': 2**5,
                  'lambda_l2': 10,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'colsample_bylevel': 0.7,
                  'learning_rate': 0.03,
                  'tree_method': 'exact',
                  'seed': 2021,
                  'nthread': 12,
                  'silent': True,
                  }
        num_round = 10000
        early_stopping_rounds = 20
        model = clf.train(params, train_matrix,num_round,valid_sets=test_matrix,
                          early_stopping_rounds=early_stopping_rounds
                          )
        print('pred')
        pre = model.predict(test,num_iteration=model.best_iteration).reshape(-1,1)
        print(pre)
        return pre
# Load the data
train_data = pd.read_csv(train_data_file, sep="\t")
test_data = pd.read_csv(test_data_file, sep="\t")
corr_col = ['V0', 'V31', 'V1', 'V8', 'V27', 'V2', 'V16', 'V3', 'V4',
       'V12', 'V10', 'V36', 'V37', 'V24', 'V5', 'V6', 'V20', 'V11', 'V15',
       'V29', 'V7', 'V19', 'V18', 'V13', 'V17', 'V22', 'V30']

train_data_scalar = train_data[corr_col]
train_data_scalar = pd.DataFrame(mms.fit_transform(train_data_scalar), columns=corr_col)
train_data_scalar["target"] = train_data["target"]
test_data_scalar = test_data[corr_col]
test_data_scalar = pd.DataFrame(mms.fit_transform(test_data_scalar), columns=corr_col)
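# Note: mms is the scaler fitted earlier in the notebook; fit_transform here refits it
# on the test set, and a leakage-free alternative would reuse the train-fitted scaler,
# i.e. mms.transform(test_data[corr_col]).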
folds = 5
seed = 2021
kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

x_train = train_data_scalar[corr_col].values
x_val = test_data_scalar.values
y_train = train_data_scalar["target"].values

clf_list = [lgb_reg, gb_reg, rf_reg]

pred = stacking_pred(x_train, y_train, x_val, kf, clf_list, label_split=None, clf_fin="xgb", if_concat_origin=True)
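
Finally, the stacked predictions can be written out; a minimal sketch, assuming the submission format is one predicted value per line with no header (the file name below is a placeholder):

# Hedged sketch: save predictions, one value per line (assumed submission format;
# the file name is a placeholder).
pd.DataFrame(pred.ravel()).to_csv("stacking_submission.txt", index=False, header=False)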

