Alibaba Cloud Tianchi Steam Prediction (Part 1)
Background:
The basic principle of thermal power generation is: fuel is burned to heat water into steam, the steam pressure drives a turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e. how effectively burning fuel heats water into high-temperature, high-pressure steam. Combustion efficiency depends on many factors, including adjustable boiler parameters such as fuel feed rate, primary and secondary air, induced draft, return-material air, and feed-water volume, as well as the boiler's operating conditions, such as bed temperature and pressure, furnace temperature and pressure, and superheater temperature.
Given desensitized data from the boiler's sensors (sampled at minute-level frequency), the task is to predict the amount of steam produced from the boiler's operating conditions.
The data is split into a training set (train.txt) and a test set (test.txt). The 38 fields "V0"-"V37" are the feature variables, and "target" is the target variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
custom_style = {'axes.labelcolor': 'black',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)
Data Exploration
1. Inspect the data
train_data_file = "zhengqi_train.txt"
test_data_file = "zhengqi_test.txt"
train_data = pd.read_csv(train_data_file,sep = '\t')
test_data = pd.read_csv(test_data_file,sep = '\t')
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2888 entries, 0 to 2887
Data columns (total 39 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
 0   V0      2888 non-null   float64
 1   V1      2888 non-null   float64
 ...          (V2-V36 follow the same pattern)
 37  V37     2888 non-null   float64
38 target 2888 non-null float64
dtypes: float64(39)
memory usage: 880.1 KB
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1925 entries, 0 to 1924
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
 0   V0      1925 non-null   float64
 1   V1      1925 non-null   float64
 ...          (V2-V36 follow the same pattern)
 37  V37     1925 non-null   float64
dtypes: float64(38)
memory usage: 571.6 KB
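The `info()` output above shows every column complete: no missing values and all `float64`. That can also be asserted programmatically. The sketch below uses a small stand-in frame so it runs on its own; in the notebook you would run the same two lines on `train_data` and `test_data`.

```python
import numpy as np
import pandas as pd

# small stand-in frame (hypothetical data); replace with train_data/test_data
df = pd.DataFrame(np.random.default_rng(0).normal(size=(10, 3)),
                  columns=['V0', 'V1', 'target'])

n_missing = int(df.isnull().sum().sum())                 # total missing cells
all_float = all(str(t) == 'float64' for t in df.dtypes)  # every column float64
```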
2. Visualize data distributions
2.1 Box plots
columns = train_data.columns.tolist()  # feature names V0-V37 (plus target)
fig = plt.figure(figsize=(80, 60), dpi=100)
for i in range(38):
    plt.subplot(7, 8, i + 1)
    sns.boxplot(y=train_data[columns[i]], orient='v', width=0.5)
    plt.ylabel(columns[i], fontsize=36)
2.2 Detect outliers
# function to detect outliers based on the predictions of a model
def detect_outliers(model, X, y, sigma=3):
    # predict y using the model
    try:
        y_pred = pd.Series(model.predict(X), index=y.index)
    # if predicting fails, fit the model first
    except Exception:
        model.fit(X, y)
        y_pred = pd.Series(model.predict(X), index=y.index)

    # residuals between the predictions and the true y values
    resid = y - y_pred
    mean_resid = resid.mean()
    std_resid = resid.std()

    # z statistics; define outliers as |z| > sigma
    z = (resid - mean_resid) / std_resid
    outliers = z[abs(z) > sigma].index

    # print and plot the results
    print("R2=", model.score(X, y))
    print("MSE=", mean_squared_error(y, y_pred))
    print('------------------------------------')
    print('mean of residuals=', mean_resid)
    print('std of residuals=', std_resid)
    print('-------------------------------------')
    print(len(outliers), 'outliers')
    print(outliers.tolist())

    plt.figure(figsize=(15, 5), dpi=100)
    ax_131 = plt.subplot(1, 3, 1)
    plt.plot(y, y_pred, '.')
    plt.plot(y.loc[outliers], y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outliers'])
    plt.xlabel('y')
    plt.ylabel('y_pred')

    ax_132 = plt.subplot(1, 3, 2)
    plt.plot(y, y - y_pred, '.')
    plt.plot(y.loc[outliers], y.loc[outliers] - y_pred.loc[outliers], 'ro')
    plt.legend(['Accepted', 'Outliers'])
    plt.xlabel('y')
    plt.ylabel('y - y_pred')

    ax_133 = plt.subplot(1, 3, 3)
    z.plot.hist(bins=50, ax=ax_133)
    z.loc[outliers].plot.hist(color='r', bins=50, ax=ax_133)
    plt.legend(['Accepted', 'Outliers'])
    plt.xlabel('z')

    plt.savefig('outliers.png')
    return outliers
Find the outliers using a Ridge regression model:
from sklearn.linear_model import Ridge
X_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1]
outliers = detect_outliers(Ridge(),X_train,y_train)
R2= 0.8890858938210386
MSE= 0.10734857773123631
------------------------------------
mean of residuals= -2.5295247583218593e-17
std of residuals= 0.3276976673193502
-------------------------------------
31 outliers
[321, 348, 376, 777, 884, 1145, 1164, 1310, 1458, 1466, 1484, 1523, 1704, 1874, 1879, 1979, 2002, 2279, 2528, 2620, 2645, 2647, 2667, 2668, 2669, 2696, 2767, 2769, 2807, 2842, 2863]
2.3 Histograms and Q-Q plots
# histogram and Q-Q plot for V0
plt.figure(figsize = (10,5))
ax = plt.subplot(1,2,1)
sns.distplot(train_data['V0'],fit = stats.norm,ax = ax)
ax = plt.subplot(1,2,2)
res = stats.probplot(train_data['V0'],plot = ax)
train_cols = 6
train_rows = len(train_data.columns)
plt.figure(figsize=(4 * train_cols, 4 * train_rows))
i = 0
for col in train_data.columns:
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    sns.distplot(train_data[col], fit=stats.norm, ax=ax)
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    res = stats.probplot(train_data[col], plot=ax)
plt.tight_layout()
Many of the variables clearly do not follow a normal distribution.
2.4 KDE plots
A KDE (kernel density estimation) plot can be thought of as a smoothed histogram. By overlaying the training-set and test-set densities for each feature, we can spot features whose distributions differ between the two sets.
# V0: train vs test
plt.figure(figsize = (10,5))
ax = sns.kdeplot(train_data['V0'],color = 'r',shade = True)
ax = sns.kdeplot(test_data['V0'],color = 'b',shade = True)
ax.set_xlabel('V0')
ax.set_ylabel('frequency')
ax.legend(['train','test'])
The train and test distributions of V0 are essentially the same.
Plot the KDE comparison for every feature:
train_cols = 6
train_rows = len(train_data.columns)
plt.figure(figsize=(4 * train_cols, 4 * train_rows))
i = 0
for col in train_data.columns[:-1]:
    i += 1
    plt.subplot(train_rows, train_cols, i)
    sns.kdeplot(train_data[col], color='r', shade=True)
    sns.kdeplot(test_data[col], color='b', shade=True)
    plt.xlabel(col)
    plt.ylabel('frequency')
    plt.legend(['train', 'test'])
V5, V9, V11, V17, V22, and V28 are distributed differently in the training and test sets, so they will be dropped to avoid hurting the model's ability to generalize.
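The visual judgement above ("these distributions differ") can be backed by a two-sample Kolmogorov-Smirnov test. This is a sketch on synthetic stand-in columns, not part of the original notebook; in practice you would run `ks_2samp` on each train/test feature pair.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train_col = rng.normal(0.0, 1.0, 1000)  # stand-in for a stable feature
test_col = rng.normal(0.8, 1.0, 1000)   # stand-in for a shifted feature like V5/V9

stat, p = stats.ks_2samp(train_col, test_col)
shifted = p < 0.01  # tiny p-value -> the two distributions differ
```

Running this per feature and dropping the flagged ones would reproduce the KDE-based decision with a numeric criterion.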
2.5 Linear-regression relationship plots
Examine the linear relationship between each feature and the target.
# V0 vs target
fcols = 2
frows = 1
plt.figure(figsize=(5*fcols,5*frows))
ax = plt.subplot(frows,fcols,1)
sns.regplot(x='V0',y='target',data=train_data,ax=ax,scatter_kws={'marker':'.','s':3,'alpha':0.3},line_kws={'color':'k'})
plt.xlabel('V0')
plt.ylabel('target')
ax = plt.subplot(frows,fcols,2)
sns.distplot(train_data['V0'].dropna())
plt.xlabel('V0')
# all features vs target
fcols = 6
frows = len(test_data.columns)
plt.figure(figsize=(5 * fcols, 5 * frows))
i = 0
for col in test_data.columns:
    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.regplot(x=col, y='target', data=train_data, ax=ax,
                scatter_kws={'marker': '.', 's': 3, 'alpha': 0.3},
                line_kws={'color': 'k'})
    plt.xlabel(col)
    plt.ylabel('target')
    i += 1
    ax = plt.subplot(frows, fcols, i)
    sns.distplot(train_data[col].dropna())
    plt.xlabel(col)
3. Correlation analysis
3.1 Correlation matrix and heatmap
# calculate the correlation matrix
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
cols_drop = ['V5', 'V9', 'V11', 'V17', 'V22', 'V28']  # features with train/test distribution shift (see 2.4)
data_train1 = train_data.drop(cols_drop, axis=1)
train_corr = data_train1.corr()
train_corr
V0 | V1 | V2 | V3 | V4 | ... | V34 | V35 | V36 | V37 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
V0 | 1.000000 | 0.908607 | 0.463643 | 0.409576 | 0.781212 | ... | -0.019342 | 0.138933 | 0.231417 | -0.494076 | 0.873212 |
V1 | 0.908607 | 1.000000 | 0.506514 | 0.383924 | 0.657790 | ... | -0.029115 | 0.146329 | 0.235299 | -0.494043 | 0.871846 |
V2 | 0.463643 | 0.506514 | 1.000000 | 0.410148 | 0.057697 | ... | -0.025620 | 0.043648 | 0.316462 | -0.734956 | 0.638878 |
V3 | 0.409576 | 0.383924 | 0.410148 | 1.000000 | 0.315046 | ... | -0.031898 | 0.080034 | 0.324475 | -0.229613 | 0.512074 |
V4 | 0.781212 | 0.657790 | 0.057697 | 0.315046 | 1.000000 | ... | 0.028659 | 0.100010 | 0.113609 | -0.031054 | 0.603984 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
V34 | -0.019342 | -0.029115 | -0.025620 | -0.031898 | 0.028659 | ... | 1.000000 | 0.233616 | -0.019032 | -0.006854 | -0.006034 |
V35 | 0.138933 | 0.146329 | 0.043648 | 0.080034 | 0.100010 | ... | 0.233616 | 1.000000 | 0.025401 | -0.077991 | 0.140294 |
V36 | 0.231417 | 0.235299 | 0.316462 | 0.324475 | 0.113609 | ... | -0.019032 | 0.025401 | 1.000000 | -0.039478 | 0.319309 |
V37 | -0.494076 | -0.494043 | -0.734956 | -0.229613 | -0.031054 | ... | -0.006854 | -0.077991 | -0.039478 | 1.000000 | -0.565795 |
target | 0.873212 | 0.871846 | 0.638878 | 0.512074 | 0.603984 | ... | -0.006034 | 0.140294 | 0.319309 | -0.565795 | 1.000000 |
33 rows × 33 columns
## plot the heatmap
plt.figure(figsize=(20,16))
sns.heatmap(train_corr,square=True,annot=True)
# find the K features most correlated with target
k = 10
cols = train_corr.nlargest(k,'target')['target'].index
plt.figure(figsize=(10,8))
sns.heatmap(train_data[cols].corr(),square=True,annot=True)
# find the features whose absolute correlation with target exceeds 0.5
threshold = 0.5
corrmat = train_data.corr()
top_corr_features = corrmat.index[abs(corrmat['target'])>threshold]
plt.figure(figsize=(10,8))
sns.heatmap(train_data[top_corr_features].corr(),square=True,annot=True)
3.2 Box-Cox transform
# merge the training and test sets
train_x = train_data.drop(['target'],axis = 1)
data_all = pd.concat([train_x,test_data])
data_all.drop(cols_drop,axis = 1,inplace = True)
data_all.head()
V0 | V1 | V2 | V3 | V4 | ... | V33 | V34 | V35 | V36 | V37 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.566 | 0.016 | -0.143 | 0.407 | 0.452 | ... | -4.627 | -4.789 | -5.101 | -2.608 | -3.508 |
1 | 0.968 | 0.437 | 0.066 | 0.566 | 0.194 | ... | -0.843 | 0.160 | 0.364 | -0.335 | -0.730 |
2 | 1.013 | 0.568 | 0.235 | 0.370 | 0.112 | ... | -0.843 | 0.160 | 0.364 | 0.765 | -0.589 |
3 | 0.733 | 0.368 | 0.283 | 0.165 | 0.599 | ... | -0.843 | -0.065 | 0.364 | 0.333 | -0.112 |
4 | 0.684 | 0.638 | 0.260 | 0.209 | 0.337 | ... | -0.843 | -0.215 | 0.364 | -0.280 | -0.028 |
5 rows × 32 columns
# min-max normalization
cols_numeric = list(data_all.columns)
def scale_minmax(col):
return (col-col.min())/(col.max()-col.min())
data_all[cols_numeric] = data_all[cols_numeric].apply(scale_minmax,axis = 0)
data_all.describe()
V0 | V1 | V2 | V3 | V4 | ... | V33 | V34 | V35 | V36 | V37 | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 4813.000000 | 4813.000000 | 4813.000000 | 4813.000000 | 4813.000000 | ... | 4813.000000 | 4813.000000 | 4813.000000 | 4813.000000 | 4813.000000 |
mean | 0.694172 | 0.721357 | 0.602300 | 0.603139 | 0.523743 | ... | 0.458493 | 0.483790 | 0.762873 | 0.332385 | 0.545795 |
std | 0.144198 | 0.131443 | 0.140628 | 0.152462 | 0.106430 | ... | 0.099095 | 0.101020 | 0.102037 | 0.127456 | 0.150356 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.626676 | 0.679416 | 0.514414 | 0.503888 | 0.478182 | ... | 0.409037 | 0.454490 | 0.727273 | 0.270584 | 0.445647 |
50% | 0.729488 | 0.752497 | 0.617072 | 0.614270 | 0.535866 | ... | 0.454518 | 0.499949 | 0.800020 | 0.347056 | 0.539317 |
75% | 0.790195 | 0.799553 | 0.700464 | 0.710474 | 0.585036 | ... | 0.500000 | 0.511365 | 0.800020 | 0.414861 | 0.643061 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 32 columns
data_all.iloc[:2888]
V0 | V1 | V2 | V3 | V4 | ... | V33 | V34 | V35 | V36 | V37 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.775775 | 0.723449 | 0.582197 | 0.665193 | 0.571839 | ... | 0.000000 | 0.000000 | 0.242424 | 0.000000 | 0.018343 |
1 | 0.833742 | 0.778785 | 0.611588 | 0.689434 | 0.544381 | ... | 0.374950 | 0.499949 | 0.800020 | 0.289702 | 0.436025 |
2 | 0.840231 | 0.796004 | 0.635354 | 0.659552 | 0.535653 | ... | 0.374950 | 0.499949 | 0.800020 | 0.429901 | 0.457224 |
3 | 0.799856 | 0.769716 | 0.642104 | 0.628297 | 0.587484 | ... | 0.374950 | 0.477220 | 0.800020 | 0.374841 | 0.528943 |
4 | 0.792790 | 0.805205 | 0.638869 | 0.635005 | 0.559600 | ... | 0.374950 | 0.462067 | 0.800020 | 0.296712 | 0.541573 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2883 | 0.721557 | 0.718060 | 0.582900 | 0.627687 | 0.587590 | ... | 0.482957 | 0.481059 | 0.727273 | 0.405812 | 0.648925 |
2884 | 0.767267 | 0.794558 | 0.643932 | 0.631041 | 0.580140 | ... | 0.534086 | 0.534094 | 0.727273 | 0.254015 | 0.488648 |
2885 | 0.637347 | 0.626577 | 0.534102 | 0.615948 | 0.538208 | ... | 0.534086 | 0.534094 | 0.727273 | 0.453607 | 0.658247 |
2886 | 0.662581 | 0.684280 | 0.553931 | 0.595670 | 0.571520 | ... | 0.545482 | 0.545409 | 0.739414 | 0.294035 | 0.629229 |
2887 | 0.747224 | 0.771293 | 0.570665 | 0.595670 | 0.564070 | ... | 0.511395 | 0.482877 | 0.743496 | 0.260133 | 0.604120 |
2888 rows × 32 columns
# Box-Cox transform of each feature
train_data_process = pd.concat([data_all.iloc[:2888], train_data['target']], axis=1)
fcols = 6
frows = len(train_data_process.columns)
plt.figure(figsize=(4 * fcols, 4 * frows))
i = 0
for var in cols_numeric:
    dat = train_data_process[[var, 'target']].dropna()

    # original distribution
    i += 1
    plt.subplot(frows, fcols, i)
    sns.distplot(dat[var], fit=stats.norm)
    plt.title(var + ' Original')
    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(dat[var], plot=plt)
    plt.title('skew=' + '{:.4f}'.format(stats.skew(dat[var])))
    plt.xlabel('')
    plt.ylabel('')
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(dat[var], dat['target'], '.', alpha=0.5)
    plt.title('corr=' + '{:.2f}'.format(dat.corr().values[0][1]))

    # Box-Cox transform (+1 shift so every value is strictly positive)
    trans_var, lambda_var = stats.boxcox(dat[var].dropna() + 1)
    trans_var = scale_minmax(trans_var)
    i += 1
    plt.subplot(frows, fcols, i)
    sns.distplot(trans_var, fit=stats.norm)
    plt.title(var + ' Transformed')
    plt.xlabel('')
    i += 1
    plt.subplot(frows, fcols, i)
    _ = stats.probplot(trans_var, plot=plt)
    plt.title('skew=' + '{:.4f}'.format(stats.skew(trans_var)))
    plt.xlabel('')
    plt.ylabel('')
    i += 1
    plt.subplot(frows, fcols, i)
    plt.plot(trans_var, dat['target'], '.', alpha=0.5)
    plt.title('corr=' + '{:.2f}'.format(np.corrcoef(trans_var, dat['target'])[0][1]))
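The effect the loop above visualizes can also be checked numerically: applying Box-Cox to a right-skewed, strictly positive sample pulls its skewness toward zero. A minimal self-contained sketch on synthetic data (not the competition features):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # strongly right-skewed, positive

x_trans, lam = stats.boxcox(x)  # boxcox estimates lambda by maximum likelihood
skew_before = stats.skew(x)
skew_after = stats.skew(x_trans)  # much closer to 0 than skew_before
```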
Feature Engineering
Outlier analysis
## box plots of every feature
plt.figure(figsize=(18,10))
plt.boxplot(x = train_data.values,labels=train_data.columns)
plt.hlines([-7.5,7.5],0,40,colors = 'r')
train_data = train_data[train_data['V9']>-7.5]
test_data = test_data[test_data['V9']>-7.5]
train_data.describe()
test_data.describe()
V0 | V1 | V2 | V3 | V4 | ... | V33 | V34 | V35 | V36 | V37 | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | ... | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 |
mean | -0.184404 | -0.083912 | -0.434762 | 0.101671 | -0.019172 | ... | -0.011433 | -0.009985 | -0.296895 | -0.046270 | 0.195735 |
std | 1.073333 | 1.076670 | 0.969541 | 1.034925 | 1.147286 | ... | 0.989732 | 0.995213 | 0.946896 | 1.040854 | 0.940599 |
min | -4.814000 | -5.488000 | -4.283000 | -3.276000 | -4.921000 | ... | -4.627000 | -4.789000 | -7.477000 | -2.608000 | -3.346000 |
25% | -0.664000 | -0.451000 | -0.978000 | -0.644000 | -0.497000 | ... | -0.460000 | -0.290000 | -0.349000 | -0.593000 | -0.432000 |
50% | 0.065000 | 0.195000 | -0.267000 | 0.220000 | 0.118000 | ... | -0.040000 | 0.160000 | -0.270000 | 0.083000 | 0.152000 |
75% | 0.549000 | 0.589000 | 0.278000 | 0.793000 | 0.610000 | ... | 0.419000 | 0.273000 | 0.364000 | 0.651000 | 0.797000 |
max | 2.100000 | 2.120000 | 1.946000 | 2.603000 | 4.475000 | ... | 5.465000 | 5.110000 | 1.671000 | 2.861000 | 3.021000 |
8 rows × 38 columns
train_data.describe()
V0 | V1 | V2 | V3 | V4 | ... | V34 | V35 | V36 | V37 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | ... | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 |
mean | 0.123725 | 0.056856 | 0.290340 | -0.068364 | 0.012254 | ... | 0.006959 | 0.198513 | 0.030099 | -0.131957 | 0.127451 |
std | 0.927984 | 0.941269 | 0.911231 | 0.970357 | 0.888037 | ... | 1.003411 | 0.985058 | 0.970258 | 1.015666 | 0.983144 |
min | -4.335000 | -5.122000 | -3.420000 | -3.956000 | -4.742000 | ... | -4.789000 | -5.695000 | -2.608000 | -3.630000 | -3.044000 |
25% | -0.292000 | -0.224250 | -0.310000 | -0.652750 | -0.385000 | ... | -0.290000 | -0.199750 | -0.412750 | -0.798750 | -0.347500 |
50% | 0.359500 | 0.273000 | 0.386000 | -0.045000 | 0.109500 | ... | 0.160000 | 0.364000 | 0.137000 | -0.186000 | 0.314000 |
75% | 0.726000 | 0.599000 | 0.918750 | 0.623500 | 0.550000 | ... | 0.273000 | 0.602000 | 0.643750 | 0.493000 | 0.793750 |
max | 2.121000 | 1.918000 | 2.828000 | 2.457000 | 2.689000 | ... | 5.110000 | 2.324000 | 5.238000 | 3.000000 | 2.538000 |
8 rows × 39 columns
Min-max normalization
## normalize both sets with a MinMaxScaler fitted on the training data
from sklearn import preprocessing
features_columns = [col for col in test_data.columns]
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(train_data[features_columns])
train_data_scaler = min_max_scaler.transform(train_data[features_columns])
test_data_scaler = min_max_scaler.transform(test_data[features_columns])
train_data_scaler = pd.DataFrame(train_data_scaler,columns = features_columns)
test_data_scaler = pd.DataFrame(test_data_scaler,columns = features_columns)
# note: train_data kept its original index after the V9 filter while
# train_data_scaler has a fresh RangeIndex, so this assignment misaligns a
# couple of rows and leaves NaN in target (count 2884 below); the NaNs are
# filled before modeling
train_data_scaler['target'] = train_data['target']
display(train_data_scaler.describe())
display(test_data_scaler.describe())
V0 | V1 | V2 | V3 | V4 | ... | V34 | V35 | V36 | V37 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | ... | 2886.000000 | 2886.000000 | 2886.000000 | 2886.000000 | 2884.000000 |
mean | 0.690633 | 0.735633 | 0.593844 | 0.606212 | 0.639787 | ... | 0.484489 | 0.734944 | 0.336235 | 0.527608 | 0.127274 |
std | 0.143740 | 0.133703 | 0.145844 | 0.151311 | 0.119504 | ... | 0.101365 | 0.122840 | 0.123663 | 0.153192 | 0.983462 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -3.044000 |
25% | 0.626239 | 0.695703 | 0.497759 | 0.515087 | 0.586328 | ... | 0.454490 | 0.685279 | 0.279792 | 0.427036 | -0.348500 |
50% | 0.727153 | 0.766335 | 0.609155 | 0.609855 | 0.652873 | ... | 0.499949 | 0.755580 | 0.349860 | 0.519457 | 0.313000 |
75% | 0.783922 | 0.812642 | 0.694422 | 0.714096 | 0.712152 | ... | 0.511365 | 0.785260 | 0.414447 | 0.621870 | 0.794250 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.538000 |
8 rows × 39 columns
V0 | V1 | V2 | V3 | V4 | ... | V33 | V34 | V35 | V36 | V37 | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | ... | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 | 1925.000000 |
mean | 0.642905 | 0.715637 | 0.477791 | 0.632726 | 0.635558 | ... | 0.457349 | 0.482778 | 0.673164 | 0.326501 | 0.577034 |
std | 0.166253 | 0.152936 | 0.155176 | 0.161379 | 0.154392 | ... | 0.098071 | 0.100537 | 0.118082 | 0.132661 | 0.141870 |
min | -0.074195 | -0.051989 | -0.138124 | 0.106035 | -0.024088 | ... | 0.000000 | 0.000000 | -0.222222 | 0.000000 | 0.042836 |
25% | 0.568618 | 0.663494 | 0.390845 | 0.516451 | 0.571256 | ... | 0.412901 | 0.454490 | 0.666667 | 0.256819 | 0.482353 |
50% | 0.681537 | 0.755256 | 0.504641 | 0.651177 | 0.654017 | ... | 0.454518 | 0.499949 | 0.676518 | 0.342977 | 0.570437 |
75% | 0.756506 | 0.811222 | 0.591869 | 0.740527 | 0.720226 | ... | 0.500000 | 0.511365 | 0.755580 | 0.415371 | 0.667722 |
max | 0.996747 | 1.028693 | 0.858835 | 1.022766 | 1.240345 | ... | 1.000000 | 1.000000 | 0.918568 | 0.697043 | 1.003167 |
8 rows × 38 columns
Check data distributions
In the data-exploration section, the KDE plots showed that several features are distributed differently in the training and test sets. These features would hurt the model's ability to generalize, so they are removed.
cols_drop
['V5', 'V9', 'V11', 'V17', 'V22', 'V28']
# train_data_scaler.drop(cols_drop,axis = 1,inplace=True)
# test_data_scaler.drop(cols_drop,axis = 1,inplace=True)
Feature correlation
plt.figure(figsize=(20, 16))
column = train_data_scaler.columns.tolist()
mcorr = train_data_scaler.corr(method='spearman')
mask = np.zeros_like(mcorr, dtype=bool)  # np.bool is deprecated; use the builtin bool
mask[np.triu_indices_from(mask)] = True  # triu_indices_from gives the upper-triangle indices; masking them hides the redundant half of the symmetric matrix
cmap = sns.diverging_palette(220, 10, as_cmap=True)
g = sns.heatmap(mcorr, mask=mask, cmap=cmap, square=True, annot=True, fmt='0.2f')
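What the mask does, in miniature: `np.triu_indices_from` returns the indices of the upper triangle (diagonal included), and setting those entries to `True` tells `sns.heatmap` to hide them.

```python
import numpy as np

m = np.zeros((3, 3), dtype=bool)
m[np.triu_indices_from(m)] = True
# upper triangle (including the diagonal) is now True:
# [[ True  True  True]
#  [False  True  True]
#  [False False  True]]
```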
Feature selection
## select features by their correlation with target
mcorr = mcorr.abs()
numerical_corr = mcorr[mcorr['target'] > 0.1]['target']
numerical_corr.sort_values(ascending=False)
target 1.000000
V0 0.712403
V31 0.711636
V1 0.682909
V8 0.679469
...
V18 0.149741
V13 0.149199
V17 0.126262
V22 0.112743
V30 0.101378
Name: target, Length: 28, dtype: float64
index0 = numerical_corr.index.tolist()
Multicollinearity analysis
If strong multicollinearity exists, PCA can be used to reduce dimensionality and remove it.
from statsmodels.stats.outliers_influence import variance_inflation_factor
## VIF (variance inflation factor)
new_numerical = index0[:-1]
X = train_data_scaler[new_numerical].values  # np.matrix is deprecated; use a plain ndarray
VIF_list = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
VIF_list
[375.2588090598281,
474.15498437652406,
126.97684602301281,
29.848200852284968,
393.099679241423,
92.49455965657118,
606.8157116621701,
397.34364499107465,
543.375332085974,
90.84960173250062,
87.78896924869898,
384.3905248315176,
29.912661878759433,
118.2994192741307,
610.7245485191555,
22.683529358626934,
23.715095416410442,
23.34500935943297,
25.02316787437953,
15.680507346101447,
5.185851115708424,
277.76658405576114,
142.5865460050103,
38.14831530855407,
520.3089375604505,
85.663077288378,
50.933680077518105]
Almost every VIF is far above 10, indicating strong multicollinearity among the features.
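To see why values like these signal trouble, here is a numpy-only sketch of the VIF idea on hypothetical data (the notebook itself uses `statsmodels`' `variance_inflation_factor`): when one column is nearly a linear copy of another, its VIF 1/(1-R²) explodes.

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns, return 1/(1-R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(2)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.05, size=500)  # almost collinear with x1
x3 = rng.normal(size=500)                   # independent of the others
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, i) for i in range(3)]  # vifs[0], vifs[1] huge; vifs[2] near 1
```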
PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 0.9)  # keep 90% of the variance
new_train_pca_90 = pca.fit_transform(train_data_scaler.iloc[:,:-1])
new_test_pca_90 = pca.transform(test_data_scaler)
new_train_pca_90 = pd.DataFrame(new_train_pca_90)
new_test_pca_90 = pd.DataFrame(new_test_pca_90)
new_train_pca_90['target'] = train_data_scaler['target']
new_train_pca_90.describe()
0 | 1 | 2 | 3 | 4 | ... | 12 | 13 | 14 | 15 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | ... | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2884.000000 |
mean | 9.876864e-17 | -1.559352e-16 | 1.269486e-17 | 6.520541e-18 | 5.512646e-17 | ... | 6.414750e-18 | -2.975117e-17 | -3.040995e-17 | 7.419579e-17 | 0.127274 |
std | 3.998976e-01 | 3.500240e-01 | 2.938631e-01 | 2.728023e-01 | 2.077128e-01 | ... | 1.193301e-01 | 1.149758e-01 | 1.133507e-01 | 1.019259e-01 | 0.983462 |
min | -1.071795e+00 | -9.429479e-01 | -9.948314e-01 | -7.103087e-01 | -7.703987e-01 | ... | -4.175153e-01 | -4.310613e-01 | -4.170535e-01 | -3.601627e-01 | -3.044000 |
25% | -2.804085e-01 | -2.613727e-01 | -2.090797e-01 | -1.945196e-01 | -1.315620e-01 | ... | -7.139961e-02 | -7.474073e-02 | -7.709743e-02 | -6.603914e-02 | -0.348500 |
50% | -1.417104e-02 | -1.277241e-02 | 2.112166e-02 | -2.337401e-02 | -5.122797e-03 | ... | -4.140670e-03 | 1.054915e-03 | -1.758387e-03 | -7.533392e-04 | 0.313000 |
75% | 2.287306e-01 | 2.317720e-01 | 2.069571e-01 | 1.657590e-01 | 1.281660e-01 | ... | 6.786199e-02 | 7.574868e-02 | 7.116829e-02 | 6.357449e-02 | 0.794250 |
max | 1.597730e+00 | 1.382802e+00 | 1.010250e+00 | 1.448007e+00 | 1.034061e+00 | ... | 5.156118e-01 | 4.978126e-01 | 4.673189e-01 | 4.570870e-01 | 2.538000 |
8 rows × 17 columns
pca = PCA(n_components = 16)  # keep 16 principal components
new_train_pca_16 = pca.fit_transform(train_data_scaler.iloc[:,:-1])
new_test_pca_16 = pca.transform(test_data_scaler)
new_train_pca_16 = pd.DataFrame(new_train_pca_16)
new_test_pca_16 = pd.DataFrame(new_test_pca_16)
new_train_pca_16['target'] = train_data_scaler['target']
new_train_pca_16.describe()
0 | 1 | 2 | 3 | 4 | ... | 12 | 13 | 14 | 15 | target | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | ... | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2.886000e+03 | 2884.000000 |
mean | 1.619532e-16 | 9.140298e-17 | 2.396635e-17 | 1.009818e-17 | 2.158126e-17 | ... | 2.862113e-17 | -5.855984e-17 | -4.845204e-17 | 7.810583e-17 | 0.127274 |
std | 3.998976e-01 | 3.500240e-01 | 2.938631e-01 | 2.728023e-01 | 2.077128e-01 | ... | 1.193301e-01 | 1.149757e-01 | 1.133507e-01 | 1.019258e-01 | 0.983462 |
min | -1.071795e+00 | -9.429479e-01 | -9.948314e-01 | -7.103087e-01 | -7.704007e-01 | ... | -4.175059e-01 | -4.310984e-01 | -4.170395e-01 | -3.601786e-01 | -3.044000 |
25% | -2.804085e-01 | -2.613727e-01 | -2.090797e-01 | -1.945196e-01 | -1.315626e-01 | ... | -7.140021e-02 | -7.482178e-02 | -7.709831e-02 | -6.606376e-02 | -0.348500 |
50% | -1.417104e-02 | -1.277241e-02 | 2.112170e-02 | -2.337402e-02 | -5.123358e-03 | ... | -4.140699e-03 | 1.070748e-03 | -1.764363e-03 | -8.300824e-04 | 0.313000 |
75% | 2.287306e-01 | 2.317720e-01 | 2.069571e-01 | 1.657590e-01 | 1.281664e-01 | ... | 6.786778e-02 | 7.584871e-02 | 7.124562e-02 | 6.359870e-02 | 0.794250 |
max | 1.597730e+00 | 1.382802e+00 | 1.010250e+00 | 1.448007e+00 | 1.034055e+00 | ... | 5.156127e-01 | 4.978497e-01 | 4.673024e-01 | 4.571319e-01 | 2.538000 |
8 rows × 17 columns
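A quick check of why PCA removes the multicollinearity found by the VIF analysis: the projected component scores are mutually uncorrelated. This sketch uses synthetic data and an eigendecomposition instead of `sklearn`, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=300)
X = np.column_stack([x1,
                     x1 + rng.normal(scale=0.1, size=300),  # collinear pair
                     rng.normal(size=300)])

# PCA via eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
scores = Xc @ eigvecs[:, order[:2]]  # keep the top-2 components

corr_01 = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]  # essentially zero
```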
Model Training
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
Split the data
new_train_pca_16 = new_train_pca_16.fillna(0)
train = new_train_pca_16.iloc[:,:-1]
target = new_train_pca_16.iloc[:,-1]
# note: this reuses the names train_data/test_data, shadowing the raw frames loaded earlier
train_data,test_data,train_target,test_target = train_test_split(train,target,test_size=0.2,random_state=0)
Linear regression
clf = LinearRegression()
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('LinearRegression:',score)
LinearRegression: 0.27169898675980547
K-nearest neighbors regression
clf = KNeighborsRegressor(n_neighbors = 8)
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('KNeighborsRegressor:',score)
KNeighborsRegressor: 0.2734067076124567
Random forest regression
clf = RandomForestRegressor(n_estimators=200)
clf.fit(train_data,train_target)
score = mean_squared_error(test_target,clf.predict(test_data))
print('RandomForestRegressor:',score)
RandomForestRegressor: 0.2509856772478806
Decision tree regression
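The decision-tree heading has no accompanying code in the original; the sketch below follows the same pattern as the other models. It uses synthetic stand-in data so the block runs on its own; in the notebook you would pass `train_data`/`train_target` from the split above, and `max_depth=5` is an illustrative choice, not a tuned value.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in for the PCA features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = DecisionTreeRegressor(max_depth=5, random_state=0)
clf.fit(X_tr, y_tr)
score = mean_squared_error(y_te, clf.predict(X_te))
print('DecisionTreeRegressor:', score)
```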
LightGBM regression
clf = lgb.LGBMRegressor(learning_rate=0.1,max_depth=-1,n_estimators=5000,boosting_type='gbdt',random_state=2019,objective='regression')
clf.fit(X = train_data,y = train_target)  # eval_metric/verbose only take effect when an eval_set is supplied
score = mean_squared_error(test_target,clf.predict(test_data))
print('lightGbm:',score)
lightGbm: 0.25057483646268536