2020-11-15

qq_44162075

Category: Beginner; Tag: machine learning

Article link: https://blog.csdn.net/qq_44162075/article/details/109703310

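Practice: Industrial Steam Volume Prediction

Competition link: https://tianchi.aliyun.com/competition/entrance/231693/introduction

I recently spent some time studying Tang Yudi's machine learning course but felt I had not mastered it well, so I took this problem as practice. Before starting I also consulted the approaches of several experienced people online. Some problems I ran into when running the code remain unsolved for now.

I. Overall approach:

The training and test sets come from the official competition site. The data has already been preprocessed: V0 to V37 are the 38 feature columns, and target is the target column.

First preprocess the data and drop some irrelevant features. Then split the training set into a new training set and a held-out test set, evaluate the model on each of them, and finally do some simple parameter tuning based on the results.

II. Detailed steps

2.1 Importing libraries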

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame
%matplotlib inline

2.2 Loading and inspecting the data

train=pd.read_table('F:/机器学习/zhengqi_train.txt')
test_x=pd.read_table('F:/机器学习/zhengqi_test.txt')
train.head()

Result:

	V0	V1	V2	V3	V4	V5	V6	V7	V8	V9	...	V29	V30	V31	V32	V33	V34	V35	V36	V37	target
0	0.566	0.016	-0.143	0.407	0.452	-0.901	-1.812	-2.360	-0.436	-2.114	...	0.136	0.109	-0.615	0.327	-4.627	-4.789	-5.101	-2.608	-3.508	0.175
1	0.968	0.437	0.066	0.566	0.194	-0.893	-1.566	-2.360	0.332	-2.114	...	-0.128	0.124	0.032	0.600	-0.843	0.160	0.364	-0.335	-0.730	0.676
2	1.013	0.568	0.235	0.370	0.112	-0.797	-1.367	-2.360	0.396	-2.114	...	-0.009	0.361	0.277	-0.116	-0.843	0.160	0.364	0.765	-0.589	0.633
3	0.733	0.368	0.283	0.165	0.599	-0.679	-1.200	-2.086	0.403	-2.114	...	0.015	0.417	0.279	0.603	-0.843	-0.065	0.364	0.333	-0.112	0.206
4	0.684	0.638	0.260	0.209	0.337	-0.454	-1.073	-2.086	0.314	-2.114	...	0.183	1.078	0.328	0.418	-0.843	-0.215	0.364	-0.280	-0.028	0.384

5 rows × 39 columns

2.3 Checking for missing values

# Count the missing values in each column
train.isnull().sum()

V0        0
V1        0
V2        0
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
target    0
dtype: int64

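2.4 Plotting the distributions of the training and test sets

# Compare the distribution of each feature in the training set and the test set
# (sns.distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) or sns.kdeplot replace it)
train_x=train.drop(columns=['target'])
train_y=train['target']
sns.distplot(train_y)
plt.show()
for c in train_x.columns:
    sns.distplot(train_x[c])
    sns.distplot(test_x[c])
    plt.show()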

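The plots show that the distributions of 'V5', 'V9', 'V11', 'V14', 'V17', 'V19', 'V20', 'V21', 'V22' and 'V28' differ considerably between the training set and the test set, so these features are dropped.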

train_x=train_x.drop(['V5', 'V9' , 'V11' , 'V14','V17','V19','V20','V21','V22', 'V28'], axis=1)
test_x=test_x.drop(['V5', 'V9' , 'V11' , 'V14','V17','V19','V20','V21','V22', 'V28'], axis=1)
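Judging the difference by eye from 38 plots is subjective. One way to back it up with a number is the two-sample Kolmogorov-Smirnov test. The sketch below is not part of the original workflow: it runs on synthetic data, and `ks_report`, `demo_train`, `demo_test` and the 0.2 threshold are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

# Synthetic stand-ins for train_x / test_x (not the competition files)
rng = np.random.default_rng(0)
demo_train = pd.DataFrame({
    "A": rng.normal(0.0, 1.0, 2000),   # same distribution in both sets
    "B": rng.normal(0.0, 1.0, 2000),   # shifted in the test set below
})
demo_test = pd.DataFrame({
    "A": rng.normal(0.0, 1.0, 2000),
    "B": rng.normal(1.5, 1.0, 2000),
})

def ks_report(train_df, test_df):
    """Return the two-sample KS statistic and p-value for every shared column."""
    rows = []
    for col in train_df.columns.intersection(test_df.columns):
        stat, p_value = ks_2samp(train_df[col], test_df[col])
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value})
    report = pd.DataFrame(rows).sort_values("ks_stat", ascending=False)
    return report.reset_index(drop=True)

report = ks_report(demo_train, demo_test)

# Hypothetical threshold: flag features whose KS statistic exceeds 0.2
shifted = report.loc[report["ks_stat"] > 0.2, "feature"].tolist()
print(report)
print("features flagged as shifted:", shifted)
```

On the real data the same function could be called as `ks_report(train_x, test_x)` before dropping any columns. The threshold would need to be chosen by looking at the resulting statistics.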

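2.5 Correlation check

Here I keep the features whose correlation coefficient with target is at least 0.5 in absolute value. Note that the result is only printed for reference: it is computed on all 38 original features, and the later steps do not use it to select features.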

corr = train.drop('target', axis=1).corrwith(train.target)
corr = corr[np.abs(corr) >= 0.5]
print(corr)

V0     0.873212
V1     0.871846
V2     0.638878
V3     0.512074
V4     0.603984
V8     0.831904
V12    0.594189
V16    0.536748
V27    0.812585
V31    0.750297
V37   -0.565795
dtype: float64
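If the correlation result were to be used for feature selection rather than only printed, the index of the filtered Series gives the column names to keep. The following is a minimal sketch on synthetic data. The frame `demo` and its column names are made up for illustration, and the 0.5 threshold is the one used above.

```python
import numpy as np
import pandas as pd

# Synthetic data: two informative features and one pure-noise feature
rng = np.random.default_rng(0)
n = 1000
signal = rng.normal(size=n)
demo = pd.DataFrame({
    "strong": signal + 0.3 * rng.normal(size=n),     # strongly positively correlated
    "weak": rng.normal(size=n),                      # unrelated to the target
    "negative": -signal + 0.5 * rng.normal(size=n),  # strongly negatively correlated
})
demo["target"] = signal

# Pearson correlation of every feature with the target
corr_demo = demo.drop(columns="target").corrwith(demo["target"])

# Keep the features with |correlation| >= 0.5 and subset the frame with them
selected = corr_demo[corr_demo.abs() >= 0.5].index.tolist()
demo_selected = demo[selected]

print(corr_demo)
print("selected features:", selected)
```

Pearson correlation only measures linear association, so a feature with a low coefficient can still be useful to a tree-based model. Dropping features on this basis is a judgment call.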

2.6 Splitting the data

# Split the labelled data into a training set and a held-out test set (75% / 25%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train_x,train_y, test_size = 0.25, random_state = 0)

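2.7 Feature standardization

Standardization rescales every feature to zero mean and unit variance so that all features share a common scale. For scale-sensitive models this can speed up training and improve accuracy. Tree-based models such as random forests are largely insensitive to it.

# Standardize the features: fit the scaler on the training set only, then apply it to the test set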
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train = ss.fit_transform(x_train)
x_test = ss.transform(x_test)
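To make explicit what `fit_transform` and `transform` do here, the sketch below repeats the step on synthetic arrays and checks it against the formula (x - mean) / std, where the mean and standard deviation come from the training data only. The arrays `demo_train` and `demo_test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales
rng = np.random.default_rng(0)
demo_train = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(500, 2))
demo_test = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(200, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(demo_train)  # learns mean/std from the training data
test_scaled = scaler.transform(demo_test)        # reuses the training mean/std

# The same transformation written out by hand (population std, ddof=0)
manual = (demo_test - demo_train.mean(axis=0)) / demo_train.std(axis=0)

print("train mean after scaling:", train_scaled.mean(axis=0))
print("train std after scaling:", train_scaled.std(axis=0))
print("matches manual formula:", np.allclose(manual, test_scaled))
```

The scaled test set will generally not have exactly zero mean and unit variance, because it is scaled with the training statistics. That is intended: the test data must not influence the preprocessing.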

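2.8 Model selection

After looking at the models other people used, I decided on the random forest algorithm.

Advantages of random forests:

Each tree is trained on a random sample of the rows and considers random subsets of the features, which makes the ensemble robust to noise and stable in performance;
Because each tree sees only part of the samples and part of the features, overfitting is reduced to some extent.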

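from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=50)
model.fit(x_train, y_train)
# For regressors, score() returns the coefficient of determination R^2
print("Random forest score on the training set:", model.score(x_train, y_train))
print("Random forest score on the test set:", model.score(x_test, y_test))

Result:

Random forest score on the training set: 0.9821838347924899
Random forest score on the test set: 0.8710006940490087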

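The model does very well on the training set (R² ≈ 0.98) and acceptably on the held-out test set (R² ≈ 0.87). The gap between the two suggests some overfitting.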
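A single train/test split gives only one estimate, and R² is not the only regression metric. The sketch below also reports mean squared error and uses 5-fold cross-validation for a more stable estimate. It runs on synthetic data from `make_regression`, not on the competition data, and the sample sizes and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem standing in for the steam data
X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_tr, y_tr)

# Mean squared error on the training part and on the held-out part
train_mse = mean_squared_error(y_tr, rf.predict(X_tr))
val_mse = mean_squared_error(y_val, rf.predict(X_val))

# scikit-learn maximizes scores, so MSE is exposed as a negative value; flip the sign
cv_mse = -cross_val_score(rf, X_tr, y_tr, cv=5, scoring="neg_mean_squared_error")

print("train MSE:", train_mse)
print("held-out MSE:", val_mse)
print("5-fold CV MSE: mean", cv_mse.mean(), "std", cv_mse.std())
```

A training error far below the cross-validated error points to overfitting, just like the gap between the two R² scores.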

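2.9 Writing the results

# The model was fitted on standardized features, so the real test set must pass through the same scaler before prediction
test_y_pred=model.predict(ss.transform(test_x))
result=pd.DataFrame(test_y_pred)
result.to_csv('F:/机器学习/zhengqi_test_y.txt',index=False,header=False)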
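Features passed to `predict` need the same preprocessing as the features used in `fit`. A scikit-learn `Pipeline` bundles the scaler and the model so that this cannot be forgotten. The sketch below uses synthetic data, and the variable names are hypothetical. It is one possible arrangement, not the original author's code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first 250 rows play the labelled set, the rest the unlabelled test set
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_fit, X_new = X[:250], X[250:]
y_fit = y[:250]

# The pipeline scales and then predicts in a single call
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)
pipe.fit(X_fit, y_fit)
pred_pipe = pipe.predict(X_new)

# Equivalent manual version: the scaler has to be applied explicitly both times
scaler = StandardScaler().fit(X_fit)
manual_model = RandomForestRegressor(n_estimators=50, random_state=0)
manual_model.fit(scaler.transform(X_fit), y_fit)
pred_manual = manual_model.predict(scaler.transform(X_new))

print("same predictions:", np.allclose(pred_pipe, pred_manual))
```

With a pipeline, the prediction step on the real test set would reduce to a single `pipe.predict(test_x)` call on the unscaled frame.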

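III. Summary

The score is not high, and there is still room for improvement: the feature processing is insufficient, I did not compare several models, and with my current skill level I was not able to tune the model.

I plan to make these improvements next time.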
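As a possible starting point for those improvements, the sketch below compares two models with cross-validation and then runs a small grid search over random forest parameters. Everything in it is an assumption made for illustration: the data is synthetic, `Ridge` is just one example of an alternative model, and the parameter grid is deliberately tiny.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem standing in for the steam data
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# 1) Compare candidate models by cross-validated mean squared error
candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
cv_mse = {
    name: -cross_val_score(est, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for name, est in candidates.items()
}
for name, mse in cv_mse.items():
    print(name, "CV MSE:", mse)

# 2) Tune the random forest with a small, hypothetical parameter grid
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 8]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV MSE:", -search.best_score_)
```

Which model wins on the synthetic data says nothing about the steam data. The point is only the workflow: score every candidate the same way, then tune the most promising one.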
