# summarize the first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
Load the saved 'pollution.csv' file and plot each column separately, except for the categorical feature 'wnd_dir' (wind direction).
dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure(figsize=(10, 10))
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
Running the code above plots each of the seven variables over the five-year span.
Use sklearn's preprocessing module to integer-encode the categorical feature 'wnd_dir' (wind direction); one-hot encoding the feature would also work (a sketch follows the code below). Then normalize all the features, transform the dataset into a supervised learning problem, and remove the weather features for the hour to be predicted (t). The code is as follows:
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # convert series to supervised learning
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
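To make the supervised framing concrete, here is a small illustration (not part of the original pipeline) of what series_to_supervised produces for a hypothetical two-variable series with n_in=1, n_out=1:

# hypothetical toy data: two variables observed over five time steps
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]})
print(series_to_supervised(toy, n_in=1, n_out=1))
# each surviving row pairs var1(t-1), var2(t-1) with var1(t), var2(t);
# the first row is dropped because shift(1) leaves NaN values there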
# load dataset
dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
print(values[:, 4])
values[:, 4] = encoder.fit_transform(values[:, 4])
print(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
print(reframed.head())
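As noted above, one-hot encoding is an alternative to integer-encoding 'wnd_dir'. A minimal sketch using pandas' get_dummies, assuming the column layout of pollution.csv; it widens the feature matrix, so the column indices dropped above (9 through 15) would need recomputing:

# one-hot encode wind direction instead of integer-encoding it
dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
dataset = pd.get_dummies(dataset, columns=['wnd_dir'])
# the dummy columns are appended at the end, changing the column order
values = dataset.values.astype('float32')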
First, we need to split the processed dataset into training and test sets. To speed up training, only the first year of data is used for training; the remaining four years are used for evaluation.
The code below splits the dataset, separates the training and test sets into input and output variables, and finally reshapes the inputs (X) into the 3D format expected by the LSTM, i.e. [samples, timesteps, features].
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
Running the code above prints the shapes of the training and test inputs and outputs: about 9K hours of data (8,760) for training and about 35K hours (35,039) for testing.
(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)
Now we can build the LSTM model. The hidden LSTM layer has 50 neurons and the output layer has a single neuron (this is a regression problem); the input is the feature vector at one time step (t-1). The loss function is Mean Absolute Error (MAE) and the optimizer is Adam; the model is trained for 300 epochs with a batch size of 64, with a ModelCheckpoint callback that keeps the best weights and a ReduceLROnPlateau callback that halves the learning rate when the validation loss stops improving.
Finally, set the validation_data argument in fit() to record the loss on both the training and test sets, and plot the two loss curves after training finishes.
checkpointer = ModelCheckpoint(filepath='best_model.hdf5', monitor='val_loss', verbose=1, save_best_only=True,
                               mode='min')
reduce = ReduceLROnPlateau(monitor='val_loss', patience=10, verbose=1, factor=0.5, min_lr=1e-6)
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=300, batch_size=64, validation_data=(test_X, test_y), verbose=1,
                    callbacks=[checkpointer, reduce],
                    shuffle=True)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Next, we evaluate the model.
Note that the forecast has to be combined with part of the test inputs before the scaling can be inverted, and the expected values on the test set have to be inverse-scaled in the same way.
Why invert the scaling at all? Because the original data, including the target y, was normalized during preprocessing, the loss so far has been computed on the transformed values; to report the error in the original units, the values must be transformed back. One practical tip: the matrix passed to the inverse transform must have exactly the same shape (number of columns) as the matrix the scaler was fitted on, otherwise it raises an error.
After this inversion, compute the loss as the RMSE (root mean squared error).
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:, 0]
# invert scaling for actual: pair the true targets with the same input
# columns so the matrix shape matches what the scaler was fitted on
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:, 0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
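Since the ModelCheckpoint callback saved the best weights to best_model.hdf5, it is usually worth reloading them before computing the test RMSE; a minimal sketch, assuming the standard Keras load_model API:

from tensorflow.keras.models import load_model

# reload the weights with the lowest validation loss before running
# the evaluation code above, instead of using whatever state the
# final epoch left in memory
model = load_model('best_model.hdf5')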
The complete code for the data preparation steps is collected below:
import pandas as pd
from datetime import datetime
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from numpy import concatenate
from math import sqrt
# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')

def read_raw():
    dataset = pd.read_csv('raw.csv', parse_dates=[['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
    dataset.drop('No', axis=1, inplace=True)
    # manually specify column names
    dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
    dataset.index.name = 'date'
    # mark all NA values with 0
    dataset['pollution'].fillna(0, inplace=True)
    # drop the first 24 hours
    dataset = dataset[24:]
    # summarize first 5 rows
    print(dataset.head(5))
    # save to file
    dataset.to_csv('pollution.csv')
def drow_pollution():
    dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values
    # specify columns to plot
    groups = [0, 1, 2, 3, 5, 6, 7]
    i = 1
    # plot each column
    pyplot.figure(figsize=(10, 10))
    for group in groups:
        pyplot.subplot(len(groups), 1, i)
        pyplot.plot(values[:, group])
        pyplot.title(dataset.columns[group], y=0.5, loc='right')
        i += 1
    pyplot.show()
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    # convert series to supervised learning
    n_vars = 1 if type(data) is list else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j + 1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j + 1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j + 1, i)) for j in range(n_vars)]
    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
def cs_to_sl():
    # load dataset
    dataset = pd.read_csv('pollution.csv', header=0, index_col=0)
    values = dataset.values
    # integer encode direction
    encoder = LabelEncoder()
    print(values[:, 4])
    values[:, 4] = encoder.fit_transform(values[:, 4])
    print(values[:, 4])
    # ensure all data is float
    values = values.astype('float32')
    # normalize features
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaled = scaler.fit_transform(values)
    # frame as supervised learning
    reframed = series_to_supervised(scaled, 1, 1)
    # drop columns we don't want to predict
    reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
    print(reframed.head())
    return reframed, scaler
def train_test(reframed):
    # split into train and test sets
    values = reframed.values
    n_train_hours = 365 * 24
    train = values[:n_train_hours, :]
    test = values[n_train_hours:, :]
    # split into input and outputs
    train_X, train_y = train[:, :-1], train[:, -1]
    test_X, test_y = test[:, :-1], test[:, -1]
    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    return train_X, train_y, test_X, test_y
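Finally, a minimal driver sketch (not from the original post) showing how the helper functions above fit together; read_raw only needs to run once to create pollution.csv from raw.csv:

read_raw()                                               # build pollution.csv from raw.csv
drow_pollution()                                         # visualize the series
reframed, scaler = cs_to_sl()                            # encode, scale, frame as supervised
train_X, train_y, test_X, test_y = train_test(reframed)  # split and reshape for the LSTM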