Alibaba Cloud Tianchi Steam Prediction (Part 1)

Background
Thermal power generation works as follows: burning fuel heats water into steam, the steam pressure drives a turbine, and the turbine drives a generator that produces electricity. In this chain of energy conversions, the key to generation efficiency is the boiler's combustion efficiency, i.e., how effectively the burning fuel heats water into high-temperature, high-pressure steam. Combustion efficiency depends on many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, return-material air, and feedwater flow, as well as operating conditions such as bed temperature and pressure, furnace temperature and pressure, and superheater temperature.
The data are anonymized boiler sensor readings (sampled at minute-level frequency); the task is to predict the amount of steam produced from the boiler's operating conditions.
The data are split into a training set (zhengqi_train.txt) and a test set (zhengqi_test.txt); the 38 fields "V0" through "V37" are the feature variables, and "target" is the target variable.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
custom_style = {'axes.labelcolor': 'black',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)

Data Exploration

1. Inspecting the Data

train_data_file = "zhengqi_train.txt"
test_data_file = "zhengqi_test.txt"
train_data = pd.read_csv(train_data_file,sep = '\t')
test_data = pd.read_csv(test_data_file,sep = '\t')
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 2   V2      2888 non-null   float64
 3   V3      2888 non-null   float64
 4   V4      2888 non-null   float64
 5   V5      2888 non-null   float64
 6   V6      2888 non-null   float64
 7   V7      2888 non-null   float64
 8   V8      2888 non-null   float64
 9   V9      2888 non-null   float64
 10  V10     2888 non-null   float64
 11  V11     2888 non-null   float64
 12  V12     2888 non-null   float64
 13  V13     2888 non-null   float64
 14  V14     2888 non-null   float64
 15  V15     2888 non-null   float64
 16  V16     2888 non-null   float64
 17  V17     2888 non-null   float64
 18  V18     2888 non-null   float64
 19  V19     2888 non-null   float64
 20  V20     2888 non-null   float64
 21  V21     2888 non-null   float64
 22  V22     2888 non-null   float64
 23  V23     2888 non-null   float64
 24  V24     2888 non-null   float64
 25  V25     2888 non-null   float64
 26  V26     2888 non-null   float64
 27  V27     2888 non-null   float64
 28  V28     2888 non-null   float64
 29  V29     2888 non-null   float64
 30  V30     2888 non-null   float64
 31  V31     2888 non-null   float64
 32  V32     2888 non-null   float64
 33  V33     2888 non-null   float64
 34  V34     2888 non-null   float64
 35  V35     2888 non-null   float64
 36  V36     2888 non-null   float64
 37  V37     2888 non-null   float64
 38  target  2888 non-null   float64
dtypes: float64(39)
memory usage: 880.1 KB
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 2   V2      1925 non-null   float64
 3   V3      1925 non-null   float64
 4   V4      1925 non-null   float64
 5   V5      1925 non-null   float64
 6   V6      1925 non-null   float64
 7   V7      1925 non-null   float64
 8   V8      1925 non-null   float64
 9   V9      1925 non-null   float64
 10  V10     1925 non-null   float64
 11  V11     1925 non-null   float64
 12  V12     1925 non-null   float64
 13  V13     1925 non-null   float64
 14  V14     1925 non-null   float64
 15  V15     1925 non-null   float64
 16  V16     1925 non-null   float64
 17  V17     1925 non-null   float64
 18  V18     1925 non-null   float64
 19  V19     1925 non-null   float64
 20  V20     1925 non-null   float64
 21  V21     1925 non-null   float64
 22  V22     1925 non-null   float64
 23  V23     1925 non-null   float64
 24  V24     1925 non-null   float64
 25  V25     1925 non-null   float64
 26  V26     1925 non-null   float64
 27  V27     1925 non-null   float64
 28  V28     1925 non-null   float64
 29  V29     1925 non-null   float64
 30  V30     1925 non-null   float64
 31  V31     1925 non-null   float64
 32  V32     1925 non-null   float64
 33  V33     1925 non-null   float64
 34  V34     1925 non-null   float64
 35  V35     1925 non-null   float64
 36  V36     1925 non-null   float64
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB

2. Visualizing the Data Distributions

2.1 Box Plots

column = train_data.columns.tolist()  # feature names, used to label each subplot
fig = plt.figure(figsize = (80,60),dpi = 100)
for i in range(38):
    plt.subplot(7,8,i+1)
    sns.boxplot(train_data[column[i]],orient = 'v',width = 0.5)
    plt.ylabel(column[i],fontsize = 36)

2.2 Detecting Outliers

#function to detect outliers based on the predictions of a model
def detect_outliers(model,X,y,sigma = 3):
    
    # predict y using the model; if it has not been fitted yet, fit it first
    try:
        y_pred = pd.Series(model.predict(X),index = y.index)
    except Exception:
        model.fit(X,y)
        y_pred = pd.Series(model.predict(X),index = y.index)
        
    #calculate the residuals between the prediction and true y values    
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()
    
    # calculate z statistics; define outliers as points where |z| > sigma
    z = (resid-mean_resid)/std_resid
    outliers = z[abs(z)>sigma].index
    
    #print and plot the results
    print("R2=",model.score(X,y))
    print("MSE=",mean_squared_error(y,y_pred))
    print('------------------------------------')
    
    print('mean of residuals=',mean_resid)
    print('std of residuals=',std_resid)
    print('-------------------------------------')
    
    print(len(outliers),'outliers')
    print(outliers.tolist())
    
    plt.figure(figsize=(15,5),dpi = 100)
    ax_131 = plt.subplot(1,3,1)
    plt.plot(y,y_pred,'.')
    plt.plot(y.loc[outliers],y_pred.loc[outliers],'ro')
    plt.legend(['Accepted','Outliers'])
    plt.xlabel('y')
    plt.ylabel('y_pred')
    
    ax_132 = plt.subplot(1,3,2)
    plt.plot(y,y-y_pred,'.')
    plt.plot(y.loc[outliers],y.loc[outliers]-y_pred.loc[outliers],'ro')
    plt.legend(['Accepted','Outliers'])
    plt.xlabel('y')
    plt.ylabel('y-y_pred')

    ax_133 = plt.subplot(1,3,3)
    z.plot.hist(bins = 50,ax = ax_133)
    z.loc[outliers].plot.hist(color = 'r',bins = 50,ax = ax_133)
    plt.legend(['Accepted','Outliers'])
    plt.xlabel('z')
        
    plt.savefig('outliers.png')
    
    return outliers

Use a Ridge regression model to find the outliers:

from sklearn.linear_model import Ridge
X_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1]
outliers = detect_outliers(Ridge(),X_train,y_train)
R2= 0.8890858938210386
MSE= 0.10734857773123631
------------------------------------
mean of residuals= -2.5295247583218593e-17
std of residuals= 0.3276976673193502
-------------------------------------
31 outliers
[321, 348, 376, 777, 884, 1145, 1164, 1310, 1458, 1466, 1484, 1523, 1704, 1874, 1879, 1979, 2002, 2279, 2528, 2620, 2645, 2647, 2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863]

[Figure: y vs. y_pred, residuals vs. y, and the z-score histogram, with the 31 outliers marked in red]
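
A natural follow-up, though not shown in the original notebook, is to drop the flagged rows into a separate cleaned frame so the exploration below still runs on the full data; a one-line sketch:

train_data_clean = train_data.drop(index = outliers)  # remove the 31 flagged rows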

2.3 Histograms and Q-Q Plots

# histogram with fitted normal curve and Q-Q plot for V0
plt.figure(figsize = (10,5))

ax = plt.subplot(1,2,1)
sns.distplot(train_data['V0'],fit = stats.norm,ax = ax)
ax = plt.subplot(1,2,2)
res = stats.probplot(train_data['V0'],plot = ax)

[Figure: histogram with fitted normal curve and Q-Q plot for V0]

train_cols = 6
train_rows = len(train_data.columns)

plt.figure(figsize=(4*train_cols,4*train_rows))
i = 0
for col in train_data.columns:
    i+=1
    ax = plt.subplot(train_rows,train_cols,i)
    sns.distplot(train_data[col],fit = stats.norm,ax = ax)
    
    i+=1
    ax = plt.subplot(train_rows,train_cols,i)
    res = stats.probplot(train_data[col],plot = ax)
plt.tight_layout()

[Figure: histograms and Q-Q plots for all features]

Clearly, many of the variables do not follow a normal distribution.

2.4 KDE Plots

A KDE (kernel density estimation) plot can be viewed as a smoothed histogram. Overlaying the training-set and test-set densities for each feature lets us compare their distributions and spot features that are distributed differently in the two sets.

# KDE comparison for V0
plt.figure(figsize = (10,5))
ax = sns.kdeplot(train_data['V0'],color = 'r',shade = True)
ax = sns.kdeplot(test_data['V0'],color = 'b',shade = True)
ax.set_xlabel('V0')
ax.set_ylabel('frequency')
ax.legend(['train','test'])

[Figure: train/test KDE comparison for V0]
The two distributions are essentially identical.

Plot the train/test KDE comparison for every feature:

train_cols = 6
train_rows = len(train_data.columns)

plt.figure(figsize=(4*train_cols,4*train_rows))
i = 0
for col in train_data.columns[:-1]:
    i+=1
    plt.subplot(train_rows,train_cols,i)
    sns.kdeplot(train_data[col],color = 'r',shade = True)
    sns.kdeplot(test_data[col],color = 'b',shade = True)
    plt.xlabel(col)
    plt.ylabel('frequency')
    plt.legend(['train','test'])

[Figure: train/test KDE comparisons for all features]

Features V5, V9, V11, V17, V22, and V28 are distributed differently in the training and test sets, so they are removed to protect the model's ability to generalize; see the list collected below.
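
These features go into a list that the later cells reference as cols_drop:

# features whose train/test distributions diverge in the KDE comparison
cols_drop = ['V5', 'V9', 'V11', 'V17', 'V22', 'V28']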

2.5 Linear Regression Plots

Examine the linear relationship between each feature and the target:

# V0 vs target
fcols = 2
frows = 1
plt.figure(figsize=(5*fcols,5*frows))
ax = plt.subplot(frows,fcols,1)
sns.regplot(x='V0',y='target',data=train_data,ax=ax,scatter_kws={'marker':'.','s':3,'alpha':0.3},line_kws={'color':'k'})
plt.xlabel('V0')
plt.ylabel('target')

ax = plt.subplot(frows,fcols,2)
sns.distplot(train_data['V0'].dropna())
plt.xlabel('V0')

[Figure: regression plot of V0 vs. target and the distribution of V0]

# regression plots and distributions for all features
fcols = 6
frows = len(test_data.columns)
plt.figure(figsize=(5*fcols,5*frows))
i = 0
for col in test_data.columns:
    i+=1
    ax = plt.subplot(frows,fcols,i)
    sns.regplot(x=col,y='target',data=train_data,ax=ax,scatter_kws={'marker':'.','s':3,'alpha':0.3},line_kws={'color':'k'})
    plt.xlabel(col)
    plt.ylabel('target')
    
    i+=1
    ax = plt.subplot(frows,fcols,i)
    sns.distplot(train_data[col].dropna())
    plt.xlabel(col)

[Figure: regression plots and distributions for all features]

3. Examining Correlations

3.1 Correlation Coefficients and Heatmap

#calculate the corr
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',10)
data_train1 = train_data.drop(cols_drop,axis = 1)
train_corr = data_train1.corr()
train_corr
              V0        V1        V2        V3        V4  ...       V34       V35       V36       V37    target
V0      1.000000  0.908607  0.463643  0.409576  0.781212  ... -0.019342  0.138933  0.231417 -0.494076  0.873212
V1      0.908607  1.000000  0.506514  0.383924  0.657790  ... -0.029115  0.146329  0.235299 -0.494043  0.871846
V2      0.463643  0.506514  1.000000  0.410148  0.057697  ... -0.025620  0.043648  0.316462 -0.734956  0.638878
V3      0.409576  0.383924  0.410148  1.000000  0.315046  ... -0.031898  0.080034  0.324475 -0.229613  0.512074
V4      0.781212  0.657790  0.057697  0.315046  1.000000  ...  0.028659  0.100010  0.113609 -0.031054  0.603984
...          ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
V34    -0.019342 -0.029115 -0.025620 -0.031898  0.028659  ...  1.000000  0.233616 -0.019032 -0.006854 -0.006034
V35     0.138933  0.146329  0.043648  0.080034  0.100010  ...  0.233616  1.000000  0.025401 -0.077991  0.140294
V36     0.231417  0.235299  0.316462  0.324475  0.113609  ... -0.019032  0.025401  1.000000 -0.039478  0.319309
V37    -0.494076 -0.494043 -0.734956 -0.229613 -0.031054  ... -0.006854 -0.077991 -0.039478  1.000000 -0.565795
target  0.873212  0.871846  0.638878  0.512074  0.603984  ... -0.006034  0.140294  0.319309 -0.565795  1.000000

33 rows × 33 columns

## plot the correlation heatmap
plt.figure(figsize=(20,16))
sns.heatmap(train_corr,square=True,annot=True)

[Figure: correlation heatmap of the retained features]

# find the k features most strongly correlated with target
k = 10
cols = train_corr.nlargest(k,'target')['target'].index

plt.figure(figsize=(10,8))
sns.heatmap(train_data[cols].corr(),square=True,annot=True)

[Figure: correlation heatmap of the 10 features most correlated with target]

# find the features whose absolute correlation with target exceeds 0.5
threshold = 0.5
corrmat = train_data.corr()
top_corr_features = corrmat.index[abs(corrmat['target'])>threshold]
plt.figure(figsize=(10,8))
sns.heatmap(train_data[top_corr_features].corr(),square=True,annot=True)

[Figure: correlation heatmap of features with |corr| > 0.5]

3.2 Box-Cox Transform
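
For reference: the Box-Cox transform maps a strictly positive x to (x^λ − 1)/λ for λ ≠ 0 and to ln(x) for λ = 0, with scipy.stats.boxcox choosing the λ that makes the result closest to normal. This is why the code below first min-max scales the data to [0, 1] and then shifts it by +1 before transforming, so every value is strictly positive.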

# concatenate the training and test sets
train_x = train_data.drop(['target'],axis = 1)
data_all = pd.concat([train_x,test_data])
data_all.drop(cols_drop,axis = 1,inplace = True)
data_all.head()
      V0     V1     V2     V3     V4  ...    V33    V34    V35    V36    V37
0  0.566  0.016 -0.143  0.407  0.452  ... -4.627 -4.789 -5.101 -2.608 -3.508
1  0.968  0.437  0.066  0.566  0.194  ... -0.843  0.160  0.364 -0.335 -0.730
2  1.013  0.568  0.235  0.370  0.112  ... -0.843  0.160  0.364  0.765 -0.589
3  0.733  0.368  0.283  0.165  0.599  ... -0.843 -0.065  0.364  0.333 -0.112
4  0.684  0.638  0.260  0.209  0.337  ... -0.843 -0.215  0.364 -0.280 -0.028

5 rows × 32 columns

# min-max scaling, column by column
cols_numeric = list(data_all.columns)
def scale_minmax(col):
    return (col-col.min())/(col.max()-col.min())
data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax,axis = 0)
data_all.describe()
                V0           V1           V2           V3           V4  ...          V33          V34          V35          V36          V37
count  4813.000000  4813.000000  4813.000000  4813.000000  4813.000000  ...  4813.000000  4813.000000  4813.000000  4813.000000  4813.000000
mean      0.694172     0.721357     0.602300     0.603139     0.523743  ...     0.458493     0.483790     0.762873     0.332385     0.545795
std       0.144198     0.131443     0.140628     0.152462     0.106430  ...     0.099095     0.101020     0.102037     0.127456     0.150356
min       0.000000     0.000000     0.000000     0.000000     0.000000  ...     0.000000     0.000000     0.000000     0.000000     0.000000
25%       0.626676     0.679416     0.514414     0.503888     0.478182  ...     0.409037     0.454490     0.727273     0.270584     0.445647
50%       0.729488     0.752497     0.617072     0.614270     0.535866  ...     0.454518     0.499949     0.800020     0.347056     0.539317
75%       0.790195     0.799553     0.700464     0.710474     0.585036  ...     0.500000     0.511365     0.800020     0.414861     0.643061
max       1.000000     1.000000     1.000000     1.000000     1.000000  ...     1.000000     1.000000     1.000000     1.000000     1.000000

8 rows × 32 columns

data_all.iloc[:2888]
            V0        V1        V2        V3        V4  ...       V33       V34       V35       V36       V37
0     0.775775  0.723449  0.582197  0.665193  0.571839  ...  0.000000  0.000000  0.242424  0.000000  0.018343
1     0.833742  0.778785  0.611588  0.689434  0.544381  ...  0.374950  0.499949  0.800020  0.289702  0.436025
2     0.840231  0.796004  0.635354  0.659552  0.535653  ...  0.374950  0.499949  0.800020  0.429901  0.457224
3     0.799856  0.769716  0.642104  0.628297  0.587484  ...  0.374950  0.477220  0.800020  0.374841  0.528943
4     0.792790  0.805205  0.638869  0.635005  0.559600  ...  0.374950  0.462067  0.800020  0.296712  0.541573
...        ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
2883  0.721557  0.718060  0.582900  0.627687  0.587590  ...  0.482957  0.481059  0.727273  0.405812  0.648925
2884  0.767267  0.794558  0.643932  0.631041  0.580140  ...  0.534086  0.534094  0.727273  0.254015  0.488648
2885  0.637347  0.626577  0.534102  0.615948  0.538208  ...  0.534086  0.534094  0.727273  0.453607  0.658247
2886  0.662581  0.684280  0.553931  0.595670  0.571520  ...  0.545482  0.545409  0.739414  0.294035  0.629229
2887  0.747224  0.771293  0.570665  0.595670  0.564070  ...  0.511395  0.482877  0.743496  0.260133  0.604120

2888 rows × 32 columns

# visualize each feature before and after the Box-Cox transform
train_data_process = pd.concat([data_all.iloc[:2888],train_data['target']],axis = 1)

fcols = 6
frows = len(train_data_process.columns)
plt.figure(figsize=(4*fcols,4*frows))

i = 0

for var in cols_numeric:
    dat = train_data_process[[var,'target']].dropna()
    
    i+=1
    plt.subplot(frows,fcols,i)
    sns.distplot(dat[var],fit = stats.norm)
    plt.title(var + ' Original')
    
    i+=1
    plt.subplot(frows,fcols,i)
    _ = stats.probplot(dat[var],plot = plt)
    plt.title('skew='+'{:.4f}'.format(stats.skew(dat[var])))
    plt.xlabel('')
    plt.ylabel('')
    
    i+=1
    plt.subplot(frows,fcols,i)
    plt.plot(dat[var],dat['target'],'.',alpha = 0.5)
    plt.title('corr='+'{:.2f}'.format(dat.corr().values[0][1]))
    
    # apply the Box-Cox transform; +1 shifts the min-max-scaled values to be strictly positive
    trans_var,lambda_var = stats.boxcox(dat[var].dropna() + 1)
    trans_var = scale_minmax(trans_var)
    
    i+=1
    plt.subplot(frows,fcols,i)
    sns.distplot(trans_var,fit = stats.norm)
    plt.title(var+' Transformed')
    plt.xlabel('')
    
    i+=1
    plt.subplot(frows,fcols,i)
    _ = stats.probplot(trans_var,plot = plt)
    plt.title('skew='+'{:.4f}'.format(stats.skew(trans_var)))
    plt.xlabel('')
    plt.ylabel('')
    
    i+=1
    plt.subplot(frows,fcols,i)
    plt.plot(trans_var,dat['target'],'.',alpha = 0.5)
    plt.title('corr='+'{:.2f}'.format(np.corrcoef(trans_var,dat['target'])[0][1]))
    
    

[Figure: for each feature, the original and Box-Cox-transformed distribution, Q-Q plot, and scatter against target]

Feature Engineering

Outlier Analysis

## box plots of all features, with reference lines at ±7.5
plt.figure(figsize=(18,10))
plt.boxplot(x = train_data.values,labels=train_data.columns)
plt.hlines([-7.5,7.5],0,40,colors = 'r')

[Figure: box plots of all features with reference lines at ±7.5]

# V9 has extreme low-end outliers (below -7.5); drop those rows from both sets
train_data = train_data[train_data['V9']>-7.5]
test_data = test_data[test_data['V9']>-7.5]
train_data.describe()
test_data.describe()
                V0           V1           V2           V3           V4  ...          V33          V34          V35          V36          V37
count  1925.000000  1925.000000  1925.000000  1925.000000  1925.000000  ...  1925.000000  1925.000000  1925.000000  1925.000000  1925.000000
mean     -0.184404    -0.083912    -0.434762     0.101671    -0.019172  ...    -0.011433    -0.009985    -0.296895    -0.046270     0.195735
std       1.073333     1.076670     0.969541     1.034925     1.147286  ...     0.989732     0.995213     0.946896     1.040854     0.940599
min      -4.814000    -5.488000    -4.283000    -3.276000    -4.921000  ...    -4.627000    -4.789000    -7.477000    -2.608000    -3.346000
25%      -0.664000    -0.451000    -0.978000    -0.644000    -0.497000  ...    -0.460000    -0.290000    -0.349000    -0.593000    -0.432000
50%       0.065000     0.195000    -0.267000     0.220000     0.118000  ...    -0.040000     0.160000    -0.270000     0.083000     0.152000
75%       0.549000     0.589000     0.278000     0.793000     0.610000  ...     0.419000     0.273000     0.364000     0.651000     0.797000
max       2.100000     2.120000     1.946000     2.603000     4.475000  ...     5.465000     5.110000     1.671000     2.861000     3.021000

8 rows × 38 columns

train_data.describe()
                V0           V1           V2           V3           V4  ...          V34          V35          V36          V37       target
count  2886.000000  2886.000000  2886.000000  2886.000000  2886.000000  ...  2886.000000  2886.000000  2886.000000  2886.000000  2886.000000
mean      0.123725     0.056856     0.290340    -0.068364     0.012254  ...     0.006959     0.198513     0.030099    -0.131957     0.127451
std       0.927984     0.941269     0.911231     0.970357     0.888037  ...     1.003411     0.985058     0.970258     1.015666     0.983144
min      -4.335000    -5.122000    -3.420000    -3.956000    -4.742000  ...    -4.789000    -5.695000    -2.608000    -3.630000    -3.044000
25%      -0.292000    -0.224250    -0.310000    -0.652750    -0.385000  ...    -0.290000    -0.199750    -0.412750    -0.798750    -0.347500
50%       0.359500     0.273000     0.386000    -0.045000     0.109500  ...     0.160000     0.364000     0.137000    -0.186000     0.314000
75%       0.726000     0.599000     0.918750     0.623500     0.550000  ...     0.273000     0.602000     0.643750     0.493000     0.793750
max       2.121000     1.918000     2.828000     2.457000     2.689000  ...     5.110000     2.324000     5.238000     3.000000     2.538000

8 rows × 39 columns

Min-Max Normalization
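
sklearn's MinMaxScaler applies the same mapping as the scale_minmax helper above, x' = (x − x_min) / (x_max − x_min), except that x_min and x_max are learned from the training set and then reused on the test set, so both are scaled on a consistent basis.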

## min-max scaling, fitted on the training set
from sklearn import preprocessing

features_columns = [col for col in test_data.columns]

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(train_data[features_columns])

train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])

train_data_scaler = pd.DataFrame(train_data_scaler,columns = features_columns)
test_data_scaler = pd.DataFrame(test_data_scaler,columns = features_columns)
# note: train_data kept its original index after the V9 filter while train_data_scaler
# has a fresh RangeIndex, so this assignment leaves two rows with NaN targets
# (hence the target count of 2884 below)
train_data_scaler['target'] = train_data['target']
display(train_data_scaler.describe())
display(test_data_scaler.describe())
                V0           V1           V2           V3           V4  ...          V34          V35          V36          V37       target
count  2886.000000  2886.000000  2886.000000  2886.000000  2886.000000  ...  2886.000000  2886.000000  2886.000000  2886.000000  2884.000000
mean      0.690633     0.735633     0.593844     0.606212     0.639787  ...     0.484489     0.734944     0.336235     0.527608     0.127274
std       0.143740     0.133703     0.145844     0.151311     0.119504  ...     0.101365     0.122840     0.123663     0.153192     0.983462
min       0.000000     0.000000     0.000000     0.000000     0.000000  ...     0.000000     0.000000     0.000000     0.000000    -3.044000
25%       0.626239     0.695703     0.497759     0.515087     0.586328  ...     0.454490     0.685279     0.279792     0.427036    -0.348500
50%       0.727153     0.766335     0.609155     0.609855     0.652873  ...     0.499949     0.755580     0.349860     0.519457     0.313000
75%       0.783922     0.812642     0.694422     0.714096     0.712152  ...     0.511365     0.785260     0.414447     0.621870     0.794250
max       1.000000     1.000000     1.000000     1.000000     1.000000  ...     1.000000     1.000000     1.000000     1.000000     2.538000

8 rows × 39 columns

                V0           V1           V2           V3           V4  ...          V33          V34          V35          V36          V37
count  1925.000000  1925.000000  1925.000000  1925.000000  1925.000000  ...  1925.000000  1925.000000  1925.000000  1925.000000  1925.000000
mean      0.642905     0.715637     0.477791     0.632726     0.635558  ...     0.457349     0.482778     0.673164     0.326501     0.577034
std       0.166253     0.152936     0.155176     0.161379     0.154392  ...     0.098071     0.100537     0.118082     0.132661     0.141870
min      -0.074195    -0.051989    -0.138124     0.106035    -0.024088  ...     0.000000     0.000000    -0.222222     0.000000     0.042836
25%       0.568618     0.663494     0.390845     0.516451     0.571256  ...     0.412901     0.454490     0.666667     0.256819     0.482353
50%       0.681537     0.755256     0.504641     0.651177     0.654017  ...     0.454518     0.499949     0.676518     0.342977     0.570437
75%       0.756506     0.811222     0.591869     0.740527     0.720226  ...     0.500000     0.511365     0.755580     0.415371     0.667722
max       0.996747     1.028693     0.858835     1.022766     1.240345  ...     1.000000     1.000000     0.918568     0.697043     1.003167

8 rows × 38 columns

Checking the Data Distributions

As the KDE plots in the data-exploration section showed, some features are distributed differently in the training and test sets; they would hurt the model's ability to generalize, so they are dropped.

cols_drop
['V5', 'V9', 'V11', 'V17', 'V22', 'V28']
# train_data_scaler.drop(cols_drop,axis = 1,inplace=True)
# test_data_scaler.drop(cols_drop,axis = 1,inplace=True)

Feature Correlation

plt.figure(figsize=(20,16))
column = train_data_scaler.columns.tolist()
mcorr = train_data_scaler.corr(method='spearman')
mask = np.zeros_like(mcorr,dtype = bool)
mask[np.triu_indices_from(mask)] = True  # triu_indices_from returns the indices of the upper triangle; masking them keeps only the lower triangle in the heatmap
cmap = sns.diverging_palette(220,10,as_cmap=True)
g = sns.heatmap(mcorr,mask=mask,cmap=cmap,square=True,annot=True,fmt='0.2f')

[Figure: lower-triangle Spearman correlation heatmap]



Feature Selection

## select features by their correlation with the target
mcorr = mcorr.abs()
numerical_corr = mcorr[mcorr['target'] > 0.1]['target']
numerical_corr.sort_values(ascending=False)
target    1.000000
V0        0.712403
V31       0.711636
V1        0.682909
V8        0.679469
            ...   
V18       0.149741
V13       0.149199
V17       0.126262
V22       0.112743
V30       0.101378
Name: target, Length: 28, dtype: float64
index0 = numerical_corr.index.tolist()

Multicollinearity Analysis

If strong multicollinearity is present, PCA can be used to reduce the dimensionality and remove it.
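
For reference: the VIF of feature i is 1 / (1 − R_i^2), where R_i^2 is the coefficient of determination from regressing feature i on all the other features; a common rule of thumb treats VIF > 10 as a sign of strong multicollinearity.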

from statsmodels.stats.outliers_influence import variance_inflation_factor
## VIF (variance inflation factor) for each selected feature
new_numerical = index0[:-1]  # index0 ends with 'target'; exclude it
X = np.asarray(train_data_scaler[new_numerical])
VIF_list = [variance_inflation_factor(X,i) for i in range(X.shape[1])]
VIF_list
[375.2588090598281,
 474.15498437652406,
 126.97684602301281,
 29.848200852284968,
 393.099679241423,
 92.49455965657118,
 606.8157116621701,
 397.34364499107465,
 543.375332085974,
 90.84960173250062,
 87.78896924869898,
 384.3905248315176,
 29.912661878759433,
 118.2994192741307,
 610.7245485191555,
 22.683529358626934,
 23.715095416410442,
 23.34500935943297,
 25.02316787437953,
 15.680507346101447,
 5.185851115708424,
 277.76658405576114,
 142.5865460050103,
 38.14831530855407,
 520.3089375604505,
 85.663077288378,
 50.933680077518105]

Nearly all VIFs are far above 10, indicating strong multicollinearity.

PCA

from sklearn.decomposition import PCA

pca = PCA(n_components = 0.9)  # keep enough components to retain 90% of the variance
new_train_pca_90 = pca.fit_transform(train_data_scaler.iloc[:,:-1])
new_test_pca_90 = pca.transform(test_data_scaler)
new_train_pca_90 = pd.DataFrame(new_train_pca_90)
new_test_pca_90 = pd.DataFrame(new_test_pca_90)
new_train_pca_90['target'] = train_data_scaler['target']
new_train_pca_90.describe()
                  0             1             2             3             4  ...            12            13            14            15       target
count  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  ...  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  2884.000000
mean   9.876864e-17 -1.559352e-16  1.269486e-17  6.520541e-18  5.512646e-17  ...  6.414750e-18 -2.975117e-17 -3.040995e-17  7.419579e-17     0.127274
std    3.998976e-01  3.500240e-01  2.938631e-01  2.728023e-01  2.077128e-01  ...  1.193301e-01  1.149758e-01  1.133507e-01  1.019259e-01     0.983462
min   -1.071795e+00 -9.429479e-01 -9.948314e-01 -7.103087e-01 -7.703987e-01  ... -4.175153e-01 -4.310613e-01 -4.170535e-01 -3.601627e-01    -3.044000
25%   -2.804085e-01 -2.613727e-01 -2.090797e-01 -1.945196e-01 -1.315620e-01  ... -7.139961e-02 -7.474073e-02 -7.709743e-02 -6.603914e-02    -0.348500
50%   -1.417104e-02 -1.277241e-02  2.112166e-02 -2.337401e-02 -5.122797e-03  ... -4.140670e-03  1.054915e-03 -1.758387e-03 -7.533392e-04     0.313000
75%    2.287306e-01  2.317720e-01  2.069571e-01  1.657590e-01  1.281660e-01  ...  6.786199e-02  7.574868e-02  7.116829e-02  6.357449e-02     0.794250
max    1.597730e+00  1.382802e+00  1.010250e+00  1.448007e+00  1.034061e+00  ...  5.156118e-01  4.978126e-01  4.673189e-01  4.570870e-01     2.538000

8 rows × 17 columns

pca = PCA(n_components = 16)  # keep 16 principal components
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_train_pca_16.describe()
                  0             1             2             3             4  ...            12            13            14            15       target
count  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  ...  2.886000e+03  2.886000e+03  2.886000e+03  2.886000e+03  2884.000000
mean   1.619532e-16  9.140298e-17  2.396635e-17  1.009818e-17  2.158126e-17  ...  2.862113e-17 -5.855984e-17 -4.845204e-17  7.810583e-17     0.127274
std    3.998976e-01  3.500240e-01  2.938631e-01  2.728023e-01  2.077128e-01  ...  1.193301e-01  1.149757e-01  1.133507e-01  1.019258e-01     0.983462
min   -1.071795e+00 -9.429479e-01 -9.948314e-01 -7.103087e-01 -7.704007e-01  ... -4.175059e-01 -4.310984e-01 -4.170395e-01 -3.601786e-01    -3.044000
25%   -2.804085e-01 -2.613727e-01 -2.090797e-01 -1.945196e-01 -1.315626e-01  ... -7.140021e-02 -7.482178e-02 -7.709831e-02 -6.606376e-02    -0.348500
50%   -1.417104e-02 -1.277241e-02  2.112170e-02 -2.337402e-02 -5.123358e-03  ... -4.140699e-03  1.070748e-03 -1.764363e-03 -8.300824e-04     0.313000
75%    2.287306e-01  2.317720e-01  2.069571e-01  1.657590e-01  1.281664e-01  ...  6.786778e-02  7.584871e-02  7.124562e-02  6.359870e-02     0.794250
max    1.597730e+00  1.382802e+00  1.010250e+00  1.448007e+00  1.034055e+00  ...  5.156127e-01  4.978497e-01  4.673024e-01  4.571319e-01     2.538000

8 rows × 17 columns
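
To check how many components the 0.9 threshold kept, or how much variance the 16-component fit retains, the fitted PCA object can be inspected; a quick sketch (output not from the original run):

print(pca.n_components_)                             # number of components retained
print(np.cumsum(pca.explained_variance_ratio_)[-1])  # cumulative variance explained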

Model Training

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import lightgbm as lgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

Splitting the Data

new_train_pca_16 = new_train_pca_16.fillna(0)  # fill the two NaN targets introduced by the index misalignment above
train = new_train_pca_16.iloc[:,:-1]
target = new_train_pca_16.iloc[:,-1]

# note: these names shadow the earlier train_data/test_data DataFrames
train_data,test_data,train_target,test_target = train_test_split(train,target,test_size=0.2,random_state=0)

Linear Regression

clf = LinearRegression()
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('LinearRegression:',score)
LinearRegression: 0.27169898675980547

K-Nearest Neighbors Regression

clf = KNeighborsRegressor(n_neighbors = 8)
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('KNeighborsRegressor:',score)
KNeighborsRegressor: 0.2734067076124567

Random Forest Regression

clf = RandomForestRegressor(n_estimators=200)
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('RandomForestRegressor:',score)
RandomForestRegressor: 0.2509856772478806

Decision Tree Regression
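
The original leaves this section empty; a minimal sketch following the same pattern as the other models (no score was recorded, so none is shown):

clf = DecisionTreeRegressor()
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('DecisionTreeRegressor:',score)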


LightGBM Regression

clf = lgb.LGBMRegressor(learning_rate=0.1,max_depth=-1,n_estimators=5000,
                        boosting_type='gbdt',random_state=2019,objective='regression')
# eval_metric and verbose only take effect when an eval_set is passed, so they are omitted
clf.fit(X = train_data,y = train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('lightGbm:',score)
lightGbm: 0.25057483646268536