How to Predict Air Pollution with Recurrent Neural Networks

After the citizen science project of Curieuze Neuzen, I wanted to learn more about air pollution to see if I could make a data science project out of it. On the website of the European Environment Agency, you can find a huge amount of data and information about air pollution.

In this notebook, we will focus on the air quality in Belgium, and more specifically on the pollution by sulphur dioxide (SO2). The data can be downloaded via https://www.eea.europa.eu/data-and-maps/data/aqereporting-2/be.

The zip file contains separate files for different air pollutants and aggregation levels. The first number in the file name is the pollutant ID as described in the vocabulary. The file used in this notebook is BE_1_2013-2015_aggregated_timeseries.csv. It contains the SO2 pollution data for Belgium, but you can find similar data for other European countries.
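The naming convention described above can be unpacked programmatically. The following is a minimal sketch, assuming that every file follows the pattern `<country>_<pollutant id>_<start>-<end>_aggregated_timeseries.csv` seen in this single example. The helper `parse_eea_filename` and the `EeaFile` tuple are hypothetical names introduced for illustration and are not part of any EEA tooling.

```python
import re
from typing import NamedTuple


class EeaFile(NamedTuple):
    """Parts of an aggregated time series file name (hypothetical helper type)."""
    country: str
    pollutant_id: int
    start_year: int
    end_year: int


# Assumed pattern, inferred only from BE_1_2013-2015_aggregated_timeseries.csv.
_PATTERN = re.compile(
    r"^([A-Z]{2})_(\d+)_(\d{4})-(\d{4})_aggregated_timeseries\.csv$"
)


def parse_eea_filename(name: str) -> EeaFile:
    """Split a file name into country code, pollutant ID and year range."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    country, pollutant, start, end = match.groups()
    return EeaFile(country, int(pollutant), int(start), int(end))


info = parse_eea_filename("BE_1_2013-2015_aggregated_timeseries.csv")
print(info)
```

If the files for other countries or pollutants deviate from this pattern, the function raises a `ValueError` instead of silently returning wrong parts.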

Descriptions of the fields in the CSV files are available on the data download page. More background information on air pollutants can be found on Wikipedia.

Project Set-up

# Importing packages
from pathlib import Path
import pandas as pd
import numpy as np
import pandas_profiling
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)
from sklearn.preprocessing import MinMaxScaler

from keras.preprocessing.sequence import TimeseriesGenerator
from keras.models import Sequential
from keras.layers import Dense, LSTM, SimpleRNN
from keras.optimizers import RMSprop
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import model_from_json

# Setting the project directory
project_dir = Path('/Users/bertcarremans/Data Science/Projecten/air_pollution_forecasting')

Loading the data

date_vars = ['DatetimeBegin','DatetimeEnd']

agg_ts = pd.read_csv(project_dir / 'data/raw/BE_1_2013-2015_aggregated_timeseries.csv', sep='\t', parse_dates=date_vars)
meta = pd.read_csv(project_dir / 'data/raw/BE_2013-2015_metadata.csv', sep='\t')

print('aggregated timeseries shape:{}'.format(agg_ts.shape))
print('metadata shape:{}'.format(meta.shape))

Data Exploration

Let’s use pandas_profiling to inspect the data.

pandas_profiling.ProfileReport(agg_ts)

I won’t show the output of pandas_profiling in this story in order not to clutter it with charts. But you can find it in my GitHub repo.
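For readers who cannot run the profiling report, a few of its headline checks can be approximated with plain pandas. This is only a sketch under stated assumptions: `sample` is a tiny synthetic stand-in for `agg_ts` with made-up values, and `ConstantColumn` is a hypothetical column name used to illustrate a constant variable. The other column names are taken from the report findings.

```python
import pandas as pd

# Tiny synthetic stand-in for agg_ts; all values are made up for illustration.
sample = pd.DataFrame({
    "AirQualityStation": ["STA-1", "STA-1", "STA-2", "STA-2"],
    "AirPollutionLevel": [0.0, 2.5, 3.1, 250.0],
    "DataAggregationProcess": ["P1D", "P1D", "P1D", "P1Y"],
    "ConstantColumn": ["x", "x", "x", "x"],  # hypothetical constant variable
})

# Columns with a single distinct value carry no information.
n_unique = sample.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()

# Number of missing values per column.
missing_per_col = sample.isna().sum()

# Summary statistics help to spot zeroes and extreme values.
level_summary = sample["AirPollutionLevel"].describe()

print("constant columns:", constant_cols)
print("total missing values:", int(missing_per_col.sum()))
print(level_summary[["min", "max"]])
```

On the real data, replacing `sample` with `agg_ts` should give the same kind of overview, although the profiling report covers far more than these three checks.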

The pandas_profiling report shows us the following:

  • There are 6 constant variables. We can remove these from the data set.

  • No missing values exist, so we will probably not need to apply imputation.

  • AirPollutionLevel has some zeroes, but this could be perfectly normal. On the other hand, this variable also has some extreme values, which might be incorrect recordings of air pollution.

  • There are 53 AirQualityStations, which are probably the same as the SamplingPoints. AirQualityStationEoICode is simply a shorter code for the AirQualityStation, so that variable can also be removed.

  • There are 3 values for AirQualityNetwork (Brussels, Flanders and Wallonia). Most measurements come from Flanders.

  • DataAggregationProcess: most rows contain data aggregated as the 24-hour mean of one day of measurements (P1D). More information on the other values can be found here. In this project, we will only consider P1D values.

  • DataCapture: Proportion of valid measurement time relative to the total measured time (time coverage) in the averaging period, expressed as a percentage. Almost all rows have about 100% of valid measurement time. Some rows have a DataCapture that is slightly lower than 100%.
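The findings in the list above suggest a few cleaning steps: drop the constant variables, drop AirQualityStationEoICode, and keep only the P1D rows. A minimal sketch of those steps follows, assuming a frame shaped like `agg_ts`. The `sample` data is synthetic, `ConstantColumn` is a hypothetical column name, and the function `clean_timeseries` is introduced here for illustration and is not the author's code.

```python
import pandas as pd

# Synthetic stand-in for agg_ts; all values are made up for illustration.
sample = pd.DataFrame({
    "AirQualityStation": ["STA-1", "STA-1", "STA-2", "STA-2"],
    "AirQualityStationEoICode": ["S1", "S1", "S2", "S2"],
    "DataAggregationProcess": ["P1D", "P1Y", "P1D", "P1D"],
    "AirPollutionLevel": [1.0, 1.5, 0.0, 4.0],
    "DataCapture": [100.0, 99.0, 100.0, 97.5],
    "ConstantColumn": ["x"] * 4,  # hypothetical constant variable
})


def clean_timeseries(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps suggested by the profiling findings."""
    # 1. Constant variables carry no information.
    constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
    # 2. The EoI code is just a shorter code for the station.
    to_drop = sorted(set(constant_cols) | {"AirQualityStationEoICode"})
    out = df.drop(columns=to_drop, errors="ignore")
    # 3. Keep only the daily (24-hour mean) aggregates.
    out = out[out["DataAggregationProcess"] == "P1D"]
    # After filtering, the aggregation column is constant, so drop it as well.
    return out.drop(columns=["DataAggregationProcess"]).reset_index(drop=True)


cleaned = clean_timeseries(sample)
print(cleaned)
```

Constant columns are detected before filtering on P1D, so a column that only becomes constant after the filter (other than DataAggregationProcess itself) would need a second pass.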
