python线性回归模型预测北京市未来12小时PM2.5值

项目源码请见:https://github.com/dennis0818/weather_data_analysis

一、分析目标

通过北京市历史24小时气象和大气污染物数据,采用普通线性回归、岭回归和Lasso回归进行建模分析,对比模型预测效果选择最优模型,预测之后12小时天气数据中PM2.5的值

二、项目背景

在北京,冬天最令人头疼的就是雾霾问题,每当雾霾天气来临,那种灰蒙蒙的空气和将口鼻掩盖在厚厚的口罩下呼吸困难的感觉,让人情绪低落。而雾霾的罪魁祸首就是PM2.5。

本次分析主要是想要使用线性回归模型对PM2.5值进行预测

8月8日已经对(北京)历史24小时即08/07日22时-08/0821时共24小时数据进行了采集,今天先来采集新生成的数据做测试数据集,由于已经过去了12个小时,所以共有12组数据可用,即08/08日22时-08/09日09时

三、 数据来源

本次分析数据来自心知天气网,该网站可以通过Restful风格URL直接获取Json格式气象和大气数据,获取方式较简单。

四、 数据分析

1. 数据规整

import pandas as pd
import numpy as np
from io import StringIO
from urllib import request
import json
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

通过请求request从网络获取数据并进行整理

url_beijing_w = 'https://api.seniverse.com/v3/weather/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&unit=c'
url_beijing_p = 'https://api.seniverse.com/v3/air/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&scope=city'
s_p = request.urlopen(url_beijing_p).read().decode('utf8')
s_w = request.urlopen(url_beijing_w).read().decode('utf8')
data_dict_p = json.loads(s_p)
data_dict_w = json.loads(s_w)

获取的数据是Json格式的,先将其解析为字典,在根据字典键值存入DataFrame中

def gen_table_p(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value['city']
    return data_dict
def gen_table_w(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value
    return data_dict

将气象和大气污染物数据转换成DataFrame表格

data_list_p = data_dict_p['results'][0]['hourly_history']
data_list_p = gen_table_p(data_list_p)
data_p = pd.DataFrame(data_list_p)
data_list_w = data_dict_w['results'][0]['hourly_history']
data_list_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(data_list_w)
data_p = data_p.T
data_w = data_w.T

调整时间格式,用datetime格式将日期转换成想要的格式,再存回表格中。删除几个不重要的特征变量

def adjust_time(data):
    time = data['last_update'].astype(str)
    time = time.str[:19]
    time = time.str.replace('T', ' ')
    time = time.map(lambda x : parse(x))
    time = time.dt.strftime('%H-%m/%d')
    data['last_update'] = time
    return data
data_p = adjust_time(data_p)
data_w = adjust_time(data_w)
data = pd.merge(data_p, data_w, on = 'last_update')
data_test = data[:12]
data_test.head()
aqicolast_updateno2o3pm10pm25qualityso2clouds...feels_likehumiditypressuretemperaturetextvisibilitywind_directionwind_direction_degreewind_scalewind_speed
01050.81714-08/19182098148轻度污染550...2946100430多云19.4188319.44
1970.90813-08/19191978449650...3048100430多云18.5西248318.0
2690.87512-08/19261628749450...2853100529多云13.7西250315.84
3740.80811-08/19411149851390...285310052811.0西南228211.16
4740.80010-08/1951779848290...27561005289.9188314.04

5 rows × 22 columns

data_test = data_test.drop(['dew_point', 'wind_direction', 'wind_direction_degree', 'text', 'code', 'wind_scale','last_update', 'quality'], axis = 1)
data_test
aqicono2o3pm10pm25so2cloudsfeels_likehumiditypressuretemperaturevisibilitywind_speed
01050.817182098148550294610043019.419.44
1970.908191978449650304810043018.518.0
2690.875261628749450285310052913.715.84
3740.808411149851390285310052811.011.16
4740.8005177984829027561005289.914.04
5730.7755650954619026601005269.612.24
6670.7505628834019025641005258.96.12
7610.675581772341023711005239.06.12
8590.6505420683320217510042210.14.68
9590.6084826673320217810042111.66.48
10580.5924832663120217910042112.47.56
11590.6174344683230227610042212.56.84

2.生成训练数据集合测试数据集

数据备份,生成训练数据集和测试数据集

#data_test.to_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')
data_train = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_all(22-21).xlsx')
data_test = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
data_train.head()
Unnamed: 0aqicolast_updateno2o3pm10pm25so2stationcloudsfeels_likehumiditypressuretemperaturevisibilitywind_speed
00490.721-08/082911149313万寿西宫90297110022912.19.0
11510.721-08/081110852333定陵90297110022912.19.0
22520.821-08/081912453303东四90297110022912.19.0
33480.621-08/08208846333天坛90297110022912.19.0
44550.721-08/082311859323农展馆90297110022912.19.0
data_train.shape
(288, 17)

提取出各个站的aqi数据

aqi_all_station = data_train[['last_update','aqi', 'station']].set_index(['station', 'last_update']).unstack(0)
aqi_all_station = aqi_all_station.replace({0: np.nan}).fillna(aqi_all_station.median())
aqi_all_station
aqi
station万寿西宫东四农展馆古城天坛奥体中心官园定陵怀柔镇昌平镇海淀区万柳顺义新城
last_update
00-08/0855.062.030.036.028.052.530.035.022.031.028.060.0
01-08/0845.051.039.031.040.045.040.022.024.022.026.050.0
02-08/0860.053.039.031.052.051.043.026.024.023.035.060.0
03-08/0865.054.042.029.057.045.032.023.026.023.044.060.0
04-08/0864.058.038.026.049.045.036.025.024.022.049.058.0
05-08/0850.048.049.026.042.043.039.025.026.024.054.049.0
06-08/0839.050.045.022.054.046.051.029.031.030.057.036.0
07-08/0853.062.042.054.058.056.063.033.029.035.045.043.0
08-08/0864.062.069.063.069.060.065.039.035.038.052.051.0
09-08/0870.077.077.053.079.052.060.031.024.039.054.063.0
10-08/0873.082.083.052.085.067.062.032.029.027.060.065.0
11-08/0864.074.074.044.088.053.065.037.040.036.058.063.0
12-08/0864.073.067.050.060.057.068.038.040.033.052.058.0
13-08/0864.094.075.065.065.078.099.048.045.044.083.079.0
14-08/0884.0111.0104.088.0100.088.0111.053.050.077.094.089.0
15-08/0877.0122.0114.098.0106.0107.0118.074.085.0100.097.092.0
16-08/0857.0112.0108.084.089.0101.0109.0102.0111.0113.0101.0109.0
17-08/0848.088.088.062.060.072.075.0120.0112.0130.063.094.0
18-08/0850.085.080.050.065.053.063.0128.0108.0127.047.083.0
19-08/0848.074.070.052.055.060.054.0122.084.082.048.065.0
20-08/0846.052.046.059.051.052.047.057.057.048.053.054.0
21-08/0849.052.055.072.048.049.051.051.060.053.052.062.0
22-08/0740.059.059.041.043.051.040.024.027.041.061.054.0
23-08/0755.062.037.051.036.052.539.035.024.037.039.060.0
plt.rcParams['font.sans-serif']=['SimHei'] #正常显示中文
aqi_all_station.plot(figsize = (20, 5))
plt.xticks(np.arange(24), aqi_all_station.index.values, fontsize = 15, rotation = 30)
plt.grid()
plt.show()

在这里插入图片描述
"aqi"表示空气质量指数(数值越低空气质量越好),从上图可以看出,各个城区全天24小时的aqi指数变化趋势,总体的最值大约出现在下午15时至18时,18时之后快速下降。其中,“万寿西宫,怀柔和昌平”aqi最高值出现在下午17时-18时时段,除此之外的其他城区aqi最高值均出现在下午15时。

计算各城区一天24小时aqi指数平均值

plt.figure(figsize = (12,6))
aqi_all_station.mean().reset_index(level = 0, drop = True).sort_values().plot.barh()
plt.xlabel('aqi', fontsize = 20)
Text(0.5, 0, 'aqi')

在这里插入图片描述
从图上可以看出,空气质量指数值最高的前五名分别为:东四、顺义新城、农展馆、天坛、官园,除顺义新城外其余四处均在三环以内,这其中天坛、官园和东四又均地处二环以内,这说明越靠近市中心,空气质量越差

data_test.head()
Unnamed: 0last_updateaqicono2o3pm10pm25so2cloudsfeels_likehumiditypressuretemperaturevisibilitywind_speed
0009-08/08780.8172961615715028771001283.911.16
1108-08/08800.8502949595915027791001273.78.64
2207-08/08830.8423039626125026831001273.18.64
3306-08/08840.8172844616225025851001263.16.12
4405-08/08800.7922949625925025861001253.19.00
data_train = data_train.drop(['Unnamed: 0', ], axis = 1)
data_test = data_test.drop('Unnamed: 0', axis = 1)
def every_station_plot(station):
    data_wanshou = data_train[data_train['station'] == station]
    fig, ax = plt.subplots(2,1, figsize = (20,12))

    data_wanshou = data_wanshou.set_index('last_update')
    data_wanshou = data_wanshou.replace({0:np.nan}).fillna(data_wanshou.median())
    data_wanshou.drop(['co', 'pressure', 'so2', 'visibility', 'wind_speed', 'clouds'], axis = 1).sort_values(by = 'last_update').plot(ax = ax[0], marker = 'o')
    data_wanshou[['co', 'so2', 'visibility', 'wind_speed']].sort_values(by = 'last_update').plot(ax = ax[1], marker = 'o')

    ax[0].set_title('aqi, no2, o3, pm10, pm25, feels_like, humidity, temperature')
    ax[1].set_title('co, so2, visibility, wind_speed')

    plt.suptitle(station, fontsize = 20)

根据不同的城区,分别绘制各城区全天24小时各项污染物随时间变化的趋势曲线图
由于co、so2、能见度、风速四项指标的数量级与其他指标相差较多,为使图形能够明显反映出各污染物变化趋势,故将这四项指标分开作图

for st in data_train['station'].unique():
    every_station_plot(st)

在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
在这里插入图片描述
从图中可以看出,各城区各项污染物中:

  1. 臭氧变化起伏最大,主要是因为臭氧属于二次污染物,是复杂光化学反应的特征产物,收太阳辐射和温度等因素影响,从图中可以看到,各城区臭氧浓度普遍在早上07时即日出后达到最低值,在下午15时即正午后达到最高值
  2. 其次pm10变化起伏较大,各城区普遍在早上08-10时和晚间18-22时出现两次高峰,与城市早、晚高峰时间段近似
  3. 氮氧化物和一氧化碳两项指标变化幅度较小
  4. pm2.5逐渐成为我国大部分地区面临的首要大气问题,对人类的健康和大气环境都会产生严重的危害。由图可以看出pm2.5的变化趋势并不随其他任何污染物和气象指标的趋势变化,相关性均不明显。pm2.5的成分和影响因素复杂,我们可以用建模的手段来讨论各个指标与其之间的相关性

五、 建模预测

各特征变量间相关性热力图

plt.figure(figsize = (15, 15))
sns.heatmap(data_train.corr(), annot = True, cmap='RdYlGn',linewidths=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x2771d039ac8>

在这里插入图片描述
提取北京市各监测站历史24小时监测数据作为训练集

data_train = data_train.drop(['last_update', 'station'], axis = 1)
data_train = data_train.replace({0:np.nan}).fillna(data_train.median())
data_train.head()
aqicono2o3pm10pm25so2cloudsfeels_likehumiditypressuretemperaturevisibilitywind_speed
049.00.729.0111.049.031.03.090.029.071.01002.029.012.19.0
151.00.711.0108.052.033.03.090.029.071.01002.029.012.19.0
252.00.819.0124.053.030.03.090.029.071.01002.029.012.19.0
348.00.620.088.046.033.03.090.029.071.01002.029.012.19.0
455.00.723.0118.059.032.03.090.029.071.01002.029.012.19.0
data_test = data_test.drop(['last_update'], axis = 1)
data_test
aqicono2o3pm10pm25so2cloudsfeels_likehumiditypressuretemperaturevisibilitywind_speed
0780.8172961615715028771001283.911.16
1800.8502949595915027791001273.78.64
2830.8423039626125026831001273.18.64
3840.8172844616225025851001263.16.12
4800.7922949625925025861001253.19.00
5800.7752854655925026821001263.59.00
6750.7672761645525026791001274.69.72
7690.8082573655025027791002274.39.36
8670.8172581694825027791002274.19.00
9620.7672688674425027791002275.17.56
10590.75825103604229028781002286.26.48
11560.73324114613929028741002288.310.08
y_train = data_train['pm25'].values
x_train = data_train.drop(['pm25'], axis = 1).values

把PM2.5提取出来做真实值

y_true = data_test['pm25'].values
x_test = data_test.drop(['pm25'], axis = 1).values

先用普通线性回归模型预测

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
y_esti = model.predict(x_test)

这是预测结果值

y_esti
array([41.62610649, 43.71780508, 45.51334405, 44.44962464, 39.64758352,
       40.84696283, 39.40450324, 41.28101583, 41.70510816, 38.61434584,
       38.2644185 , 33.7721591 ])

这是真实值

y_true
array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39], dtype=int64)

用残差平方和查看预测效果,公式如下: s c o r e = 1 n ∑ 1 n ( y i − y ∗ ) 2 score = \frac{1}{n}\sum_1^n(y_i-y^*)^2 score=n11n(yiy)2

((y_esti - y_true)**2).mean()
179.24659005909845
model.score(x_test, y_true)
-2.051366469855796
def gen_polynomial(x, power):
    p = x[:,0].reshape((len(x), 1))
    a = x[:,1].reshape((len(x), 1))
    s = x[:,2].reshape((len(x), 1))
    for i in range(2, power + 1):
        for j in range(i + 1):
            for k in range(i - j + 1):
                #print(f'{j}{k}{i-j-k}')
                add = (p**j) * (a**k) * (s**(i-j-k))
                x = np.concatenate([x, add], axis = 1)
    return x
x_train_poly = gen_polynomial(x_train, 3)
x_test_poly = gen_polynomial(x_test, 3)
model2 = LinearRegression()
model2.fit(x_train_poly, y_train)
y_esti_poly = model2.predict(x_test_poly)
((y_esti_poly - y_true)**2).mean()
35.94599749989958
model2.score(x_test_poly, y_true)
0.38808090318175426

再用Ridge岭回归模型建模

Ridge岭回归模型适用于解决两类问题:一是样本少于变量个数,二是变量间存在多重共线性

from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
reg = Ridge(alpha = 0.000001)
reg.fit(x_train_poly, y_train)
D:\ProgramData\lib\site-packages\sklearn\linear_model\ridge.py:125: LinAlgWarning: Ill-conditioned matrix (rcond=4.03488e-18): result may not be accurate.
  overwrite_a=True).T





Ridge(alpha=1e-06, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)
y_esti_reg_poly = reg.predict(x_test_poly)
((y_esti_reg_poly - y_true)**2).mean()
35.9459549360529

用RidgeCV模型

多个alpha值输入,得出多个对应最佳的 θ \theta θ,然后计算最佳的 θ \theta θ及对应的alpha

reg_cv = RidgeCV(alphas=[0.001, 10, 10000])
reg_cv.fit(x_train_poly, y_train)
RidgeCV(alphas=array([1.e-03, 1.e+01, 1.e+04]), cv=None, fit_intercept=True,
    gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
y_esti_regcv_poly = reg_cv.predict(x_test_poly)
((y_esti_regcv_poly - y_true)**2).mean()
42.06173546500281

再用Lasso回归建模,用于估计稀疏系数的线性模型

Lasso模型适用于参数少的情况,产生稀疏矩阵

from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.00001, max_iter = 1000000)
lasso.fit(x_train_poly, y_train)
Lasso(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=1000000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)
y_esti_lasso_poly = lasso.predict(x_test_poly)
((y_esti_lasso_poly - y_true)**2).mean()
35.944247424908795
lasso.score(x_test_poly, y_true)
0.3881106952137526
plt.figure(figsize = (14,8))
plt.plot(y_true, '--', label = 'real')
plt.plot(y_esti, label = 'normal')
plt.plot(y_esti_poly, label = 'normal_poly', marker =  'o')
plt.plot(y_esti_reg_poly, label = 'ridge_poly')

plt.plot(y_esti_regcv_poly, label = 'ridgeCV_poly')
plt.plot(y_esti_lasso_poly, label = 'lasso_poly')

plt.legend()
uu = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
plt.xticks(np.arange(12), uu.last_update.values, rotation = 30, fontsize = 12)
plt.grid()
plt.title('预测值对比')
plt.show()

在这里插入图片描述
从上图可以看出各种模型建预测值的对比,误差最大的是用原数据进行普通线型回归的结果,其他四种方式的训练集加入了多项式,准确度大幅提高且结果相近,其中普通线性回归和ridge回归结果几乎重合

各模型残差平方和评分如下:

compare_table = pd.DataFrame(np.zeros((5,2)), index = ['normal', 'normal_poly', 'ridge_poly', 'ridgeCV_poly', 'lasso'], columns = ['MSE', 'score'])
mse = np.array([((y_esti - y_true)**2).mean(), 
               ((y_esti_poly - y_true)**2).mean(), 
               ((y_esti_reg_poly - y_true)**2).mean(), 
               ((y_esti_regcv_poly - y_true)**2).mean(), 
               ((y_esti_lasso_poly - y_true)**2).mean()])
score = np.array([model.score(x_test, y_true), 
                 model2.score(x_test_poly, y_true), 
                 reg.score(x_test_poly, y_true), 
                 reg_cv.score(x_test_poly, y_true), 
                 lasso.score(x_test_poly, y_true)])
compare_table['MSE'] = mse
compare_table['score'] = score
compare_table
MSEscore
normal179.246590-2.051366
normal_poly35.9459970.388081
ridge_poly35.9459550.388082
ridgeCV_poly42.0617350.283971
lasso45.3716000.388111

五、结论

预测值有一定偏差,但基本反映了变化趋势。经过几个模型的筛选,最终还是普通线性回归模型效果稍好。


  • 9
    点赞
  • 112
    收藏
    觉得还不错? 一键收藏
  • 2
    评论
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值