Project source code: https://github.com/dennis0818/weather_data_analysis
1. Analysis Objective
Using Beijing's historical 24-hour weather and air-pollutant data, build models with ordinary linear regression, ridge regression, and Lasso regression, compare their predictive performance to pick the best model, and predict the PM2.5 values in the following 12 hours of weather data.
2. Project Background
In Beijing, the biggest winter headache is smog. Whenever a smog episode rolls in, the gray, murky air and the suffocating feeling of breathing through a thick mask are dispiriting, and the chief culprit is PM2.5.
This analysis uses linear regression models to predict PM2.5 values.
On August 8 I collected Beijing's historical 24-hour data, i.e. the 24 hours from 22:00 on 08/07 through 21:00 on 08/08. Today I collect the newly generated records as the test set; since 12 hours have passed, 12 records are available, covering 22:00 on 08/08 through 09:00 on 08/09.
3. Data Source
The data come from the Seniverse weather service (心知天气), whose RESTful URLs return weather and air-quality data directly as JSON, so retrieval is straightforward.
4. Data Analysis
1. Data wrangling
import pandas as pd
import numpy as np
from io import StringIO
from urllib import request
import json
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Fetch the data over HTTP with urllib's request module and tidy it up.
url_beijing_w = 'https://api.seniverse.com/v3/weather/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&unit=c'
url_beijing_p = 'https://api.seniverse.com/v3/air/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&scope=city'
s_p = request.urlopen(url_beijing_p).read().decode('utf8')
s_w = request.urlopen(url_beijing_w).read().decode('utf8')
data_dict_p = json.loads(s_p)
data_dict_w = json.loads(s_w)
The responses are JSON: parse each into a dict with json.loads, then use the dict's keys and values to populate a DataFrame.
def gen_table_p(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value['city']
    return data_dict

def gen_table_w(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value
    return data_dict
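The two helpers build an index-keyed dict so that pd.DataFrame plus a transpose yields one row per hour. As a side note, pd.DataFrame also accepts a list of dicts directly, one row per dict; a minimal sketch with a hypothetical sample shaped like the API's hourly_history entries:

```python
import pandas as pd

# hypothetical records shaped like the API's hourly_history list
hourly_history = [
    {"city": {"aqi": "105", "pm25": "48", "last_update": "2019-08-08T14:00:00+08:00"}},
    {"city": {"aqi": "97", "pm25": "49", "last_update": "2019-08-08T13:00:00+08:00"}},
]

# one row per dict, no transpose needed
data_p = pd.DataFrame([item["city"] for item in hourly_history])
print(data_p.shape)  # (2, 3)
```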
Convert the weather and pollutant records into DataFrame tables.
data_list_p = data_dict_p['results'][0]['hourly_history']
data_list_p = gen_table_p(data_list_p)
data_p = pd.DataFrame(data_list_p)
data_list_w = data_dict_w['results'][0]['hourly_history']
data_list_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(data_list_w)
data_p = data_p.T
data_w = data_w.T
Adjust the time format: parse last_update into datetime, reformat it as desired, and write it back. Then drop a few unimportant feature variables.
def adjust_time(data):
    time = data['last_update'].astype(str)
    time = time.str[:19]
    time = time.str.replace('T', ' ')
    time = time.map(lambda x : parse(x))
    time = time.dt.strftime('%H-%m/%d')
    data['last_update'] = time
    return data
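The chain inside adjust_time can be checked on a single value. Assuming the API returns ISO-8601 timestamps like the hypothetical one below, the steps produce an hour-month/day label:

```python
from dateutil.parser import parse

raw = "2019-08-08T14:00:00+08:00"      # hypothetical API timestamp
trimmed = raw[:19].replace("T", " ")   # "2019-08-08 14:00:00", timezone stripped
label = parse(trimmed).strftime("%H-%m/%d")
print(label)  # "14-08/08"
```

Putting the hour first means string sorting orders rows by hour of day, which the later per-station plots rely on.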
data_p = adjust_time(data_p)
data_w = adjust_time(data_w)
data = pd.merge(data_p, data_w, on = 'last_update')
data_test = data[:12]
data_test.head()
  | aqi | co | last_update | no2 | o3 | pm10 | pm25 | quality | so2 | clouds | ... | feels_like | humidity | pressure | temperature | text | visibility | wind_direction | wind_direction_degree | wind_scale | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 105 | 0.817 | 14-08/19 | 18 | 209 | 81 | 48 | 轻度污染 | 5 | 50 | ... | 29 | 46 | 1004 | 30 | 多云 | 19.4 | 南 | 188 | 3 | 19.44 |
1 | 97 | 0.908 | 13-08/19 | 19 | 197 | 84 | 49 | 良 | 6 | 50 | ... | 30 | 48 | 1004 | 30 | 多云 | 18.5 | 西 | 248 | 3 | 18.0 |
2 | 69 | 0.875 | 12-08/19 | 26 | 162 | 87 | 49 | 良 | 4 | 50 | ... | 28 | 53 | 1005 | 29 | 多云 | 13.7 | 西 | 250 | 3 | 15.84 |
3 | 74 | 0.808 | 11-08/19 | 41 | 114 | 98 | 51 | 良 | 3 | 90 | ... | 28 | 53 | 1005 | 28 | 阴 | 11.0 | 西南 | 228 | 2 | 11.16 |
4 | 74 | 0.800 | 10-08/19 | 51 | 77 | 98 | 48 | 良 | 2 | 90 | ... | 27 | 56 | 1005 | 28 | 阴 | 9.9 | 南 | 188 | 3 | 14.04 |
5 rows × 22 columns
data_test = data_test.drop(['dew_point', 'wind_direction', 'wind_direction_degree', 'text', 'code', 'wind_scale','last_update', 'quality'], axis = 1)
data_test
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 105 | 0.817 | 18 | 209 | 81 | 48 | 5 | 50 | 29 | 46 | 1004 | 30 | 19.4 | 19.44 |
1 | 97 | 0.908 | 19 | 197 | 84 | 49 | 6 | 50 | 30 | 48 | 1004 | 30 | 18.5 | 18.0 |
2 | 69 | 0.875 | 26 | 162 | 87 | 49 | 4 | 50 | 28 | 53 | 1005 | 29 | 13.7 | 15.84 |
3 | 74 | 0.808 | 41 | 114 | 98 | 51 | 3 | 90 | 28 | 53 | 1005 | 28 | 11.0 | 11.16 |
4 | 74 | 0.800 | 51 | 77 | 98 | 48 | 2 | 90 | 27 | 56 | 1005 | 28 | 9.9 | 14.04 |
5 | 73 | 0.775 | 56 | 50 | 95 | 46 | 1 | 90 | 26 | 60 | 1005 | 26 | 9.6 | 12.24 |
6 | 67 | 0.750 | 56 | 28 | 83 | 40 | 1 | 90 | 25 | 64 | 1005 | 25 | 8.9 | 6.12 |
7 | 61 | 0.675 | 58 | 17 | 72 | 34 | 1 | 0 | 23 | 71 | 1005 | 23 | 9.0 | 6.12 |
8 | 59 | 0.650 | 54 | 20 | 68 | 33 | 2 | 0 | 21 | 75 | 1004 | 22 | 10.1 | 4.68 |
9 | 59 | 0.608 | 48 | 26 | 67 | 33 | 2 | 0 | 21 | 78 | 1004 | 21 | 11.6 | 6.48 |
10 | 58 | 0.592 | 48 | 32 | 66 | 31 | 2 | 0 | 21 | 79 | 1004 | 21 | 12.4 | 7.56 |
11 | 59 | 0.617 | 43 | 44 | 68 | 32 | 3 | 0 | 22 | 76 | 1004 | 22 | 12.5 | 6.84 |
2. Generating the training and test sets
Back up the data, then load the training and test sets.
#data_test.to_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')
data_train = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_all(22-21).xlsx')
data_test = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
data_train.head()
  | Unnamed: 0 | aqi | co | last_update | no2 | o3 | pm10 | pm25 | so2 | station | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 49 | 0.7 | 21-08/08 | 29 | 111 | 49 | 31 | 3 | 万寿西宫 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
1 | 1 | 51 | 0.7 | 21-08/08 | 11 | 108 | 52 | 33 | 3 | 定陵 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
2 | 2 | 52 | 0.8 | 21-08/08 | 19 | 124 | 53 | 30 | 3 | 东四 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
3 | 3 | 48 | 0.6 | 21-08/08 | 20 | 88 | 46 | 33 | 3 | 天坛 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
4 | 4 | 55 | 0.7 | 21-08/08 | 23 | 118 | 59 | 32 | 3 | 农展馆 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
data_train.shape
(288, 17)
Extract each station's aqi readings.
aqi_all_station = data_train[['last_update','aqi', 'station']].set_index(['station', 'last_update']).unstack(0)
aqi_all_station = aqi_all_station.replace({0: np.nan}).fillna(aqi_all_station.median())
aqi_all_station
aqi by station:

last_update | 万寿西宫 | 东四 | 农展馆 | 古城 | 天坛 | 奥体中心 | 官园 | 定陵 | 怀柔镇 | 昌平镇 | 海淀区万柳 | 顺义新城
---|---|---|---|---|---|---|---|---|---|---|---|---
00-08/08 | 55.0 | 62.0 | 30.0 | 36.0 | 28.0 | 52.5 | 30.0 | 35.0 | 22.0 | 31.0 | 28.0 | 60.0 |
01-08/08 | 45.0 | 51.0 | 39.0 | 31.0 | 40.0 | 45.0 | 40.0 | 22.0 | 24.0 | 22.0 | 26.0 | 50.0 |
02-08/08 | 60.0 | 53.0 | 39.0 | 31.0 | 52.0 | 51.0 | 43.0 | 26.0 | 24.0 | 23.0 | 35.0 | 60.0 |
03-08/08 | 65.0 | 54.0 | 42.0 | 29.0 | 57.0 | 45.0 | 32.0 | 23.0 | 26.0 | 23.0 | 44.0 | 60.0 |
04-08/08 | 64.0 | 58.0 | 38.0 | 26.0 | 49.0 | 45.0 | 36.0 | 25.0 | 24.0 | 22.0 | 49.0 | 58.0 |
05-08/08 | 50.0 | 48.0 | 49.0 | 26.0 | 42.0 | 43.0 | 39.0 | 25.0 | 26.0 | 24.0 | 54.0 | 49.0 |
06-08/08 | 39.0 | 50.0 | 45.0 | 22.0 | 54.0 | 46.0 | 51.0 | 29.0 | 31.0 | 30.0 | 57.0 | 36.0 |
07-08/08 | 53.0 | 62.0 | 42.0 | 54.0 | 58.0 | 56.0 | 63.0 | 33.0 | 29.0 | 35.0 | 45.0 | 43.0 |
08-08/08 | 64.0 | 62.0 | 69.0 | 63.0 | 69.0 | 60.0 | 65.0 | 39.0 | 35.0 | 38.0 | 52.0 | 51.0 |
09-08/08 | 70.0 | 77.0 | 77.0 | 53.0 | 79.0 | 52.0 | 60.0 | 31.0 | 24.0 | 39.0 | 54.0 | 63.0 |
10-08/08 | 73.0 | 82.0 | 83.0 | 52.0 | 85.0 | 67.0 | 62.0 | 32.0 | 29.0 | 27.0 | 60.0 | 65.0 |
11-08/08 | 64.0 | 74.0 | 74.0 | 44.0 | 88.0 | 53.0 | 65.0 | 37.0 | 40.0 | 36.0 | 58.0 | 63.0 |
12-08/08 | 64.0 | 73.0 | 67.0 | 50.0 | 60.0 | 57.0 | 68.0 | 38.0 | 40.0 | 33.0 | 52.0 | 58.0 |
13-08/08 | 64.0 | 94.0 | 75.0 | 65.0 | 65.0 | 78.0 | 99.0 | 48.0 | 45.0 | 44.0 | 83.0 | 79.0 |
14-08/08 | 84.0 | 111.0 | 104.0 | 88.0 | 100.0 | 88.0 | 111.0 | 53.0 | 50.0 | 77.0 | 94.0 | 89.0 |
15-08/08 | 77.0 | 122.0 | 114.0 | 98.0 | 106.0 | 107.0 | 118.0 | 74.0 | 85.0 | 100.0 | 97.0 | 92.0 |
16-08/08 | 57.0 | 112.0 | 108.0 | 84.0 | 89.0 | 101.0 | 109.0 | 102.0 | 111.0 | 113.0 | 101.0 | 109.0 |
17-08/08 | 48.0 | 88.0 | 88.0 | 62.0 | 60.0 | 72.0 | 75.0 | 120.0 | 112.0 | 130.0 | 63.0 | 94.0 |
18-08/08 | 50.0 | 85.0 | 80.0 | 50.0 | 65.0 | 53.0 | 63.0 | 128.0 | 108.0 | 127.0 | 47.0 | 83.0 |
19-08/08 | 48.0 | 74.0 | 70.0 | 52.0 | 55.0 | 60.0 | 54.0 | 122.0 | 84.0 | 82.0 | 48.0 | 65.0 |
20-08/08 | 46.0 | 52.0 | 46.0 | 59.0 | 51.0 | 52.0 | 47.0 | 57.0 | 57.0 | 48.0 | 53.0 | 54.0 |
21-08/08 | 49.0 | 52.0 | 55.0 | 72.0 | 48.0 | 49.0 | 51.0 | 51.0 | 60.0 | 53.0 | 52.0 | 62.0 |
22-08/07 | 40.0 | 59.0 | 59.0 | 41.0 | 43.0 | 51.0 | 40.0 | 24.0 | 27.0 | 41.0 | 61.0 | 54.0 |
23-08/07 | 55.0 | 62.0 | 37.0 | 51.0 | 36.0 | 52.5 | 39.0 | 35.0 | 24.0 | 37.0 | 39.0 | 60.0 |
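The replace-then-fill idiom above treats a zero reading as missing and imputes it; a minimal sketch on toy data (computing the median after the zero-to-NaN replacement, as below, keeps the zeros from dragging the median down):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 2, 4], "b": [1, 0, 3]})  # 0 marks a failed reading
clean = df.replace({0: np.nan})
clean = clean.fillna(clean.median())  # impute each column's median
print(clean["a"].tolist())  # [3.0, 2.0, 4.0]
```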
plt.rcParams['font.sans-serif'] = ['SimHei']  # so Chinese labels render correctly
aqi_all_station.plot(figsize = (20, 5))
plt.xticks(np.arange(24), aqi_all_station.index.values, fontsize = 15, rotation = 30)
plt.grid()
plt.show()
"aqi" is the air quality index (lower means cleaner air). The plot shows each district's aqi over the full 24 hours: overall the daily peak falls roughly between 15:00 and 18:00 and drops quickly after 18:00. In 万寿西宫, 怀柔, and 昌平 the peak occurs in the 17:00-18:00 window; in every other district it occurs at 15:00.
Compute each district's mean aqi over the 24 hours.
plt.figure(figsize = (12,6))
aqi_all_station.mean().reset_index(level = 0, drop = True).sort_values().plot.barh()
plt.xlabel('aqi', fontsize = 20)
Text(0.5, 0, 'aqi')
The chart shows the five districts with the highest mean aqi: 东四, 顺义新城, 农展馆, 天坛, and 官园. All but 顺义新城 lie inside the Third Ring Road, and 天坛, 官园, and 东四 lie inside the Second Ring, suggesting that air quality worsens toward the city center.
data_test.head()
  | Unnamed: 0 | last_update | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 09-08/08 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 1 | 08-08/08 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 2 | 07-08/08 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 3 | 06-08/08 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 4 | 05-08/08 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
data_train = data_train.drop(['Unnamed: 0', ], axis = 1)
data_test = data_test.drop('Unnamed: 0', axis = 1)
def every_station_plot(station):
    data_station = data_train[data_train['station'] == station]
    fig, ax = plt.subplots(2, 1, figsize = (20, 12))
    data_station = data_station.set_index('last_update')
    # treat zeros as missing and impute the column medians
    data_station = data_station.replace({0: np.nan}).fillna(data_station.median())
    data_station.drop(['co', 'pressure', 'so2', 'visibility', 'wind_speed', 'clouds'], axis = 1).sort_values(by = 'last_update').plot(ax = ax[0], marker = 'o')
    data_station[['co', 'so2', 'visibility', 'wind_speed']].sort_values(by = 'last_update').plot(ax = ax[1], marker = 'o')
    ax[0].set_title('aqi, no2, o3, pm10, pm25, feels_like, humidity, temperature')
    ax[1].set_title('co, so2, visibility, wind_speed')
    plt.suptitle(station, fontsize = 20)
For each district, plot how every pollutant varies across the 24 hours.
Because co, so2, visibility, and wind speed are on a much smaller scale than the other indicators, they are plotted separately so each trend remains visible.
for st in data_train['station'].unique():
every_station_plot(st)
The per-district plots show that:
- Ozone fluctuates the most. As a secondary pollutant, it is the signature product of complex photochemical reactions and is driven by solar radiation and temperature; in every district its concentration generally bottoms out around 07:00, just after sunrise, and peaks around 15:00, just after midday.
- pm10 fluctuates the next most: most districts show two peaks, around 08:00-10:00 and 18:00-22:00, roughly matching the morning and evening rush hours.
- Nitrogen dioxide and carbon monoxide vary comparatively little.
- pm2.5 has become the leading air-quality problem across much of China and seriously harms both human health and the atmosphere. Its curve does not follow any other pollutant or weather indicator, and no clear correlation is visible. Because the composition and drivers of pm2.5 are complex, we turn to modeling to examine how each indicator relates to it.
5. Modeling and Prediction
Correlation heat map of the feature variables
plt.figure(figsize = (15, 15))
sns.heatmap(data_train.corr(), annot = True, cmap='RdYlGn',linewidths=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x2771d039ac8>
Use the historical 24-hour records from all Beijing monitoring stations as the training set.
data_train = data_train.drop(['last_update', 'station'], axis = 1)
data_train = data_train.replace({0:np.nan}).fillna(data_train.median())
data_train.head()
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 49.0 | 0.7 | 29.0 | 111.0 | 49.0 | 31.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
1 | 51.0 | 0.7 | 11.0 | 108.0 | 52.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
2 | 52.0 | 0.8 | 19.0 | 124.0 | 53.0 | 30.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
3 | 48.0 | 0.6 | 20.0 | 88.0 | 46.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
4 | 55.0 | 0.7 | 23.0 | 118.0 | 59.0 | 32.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
data_test = data_test.drop(['last_update'], axis = 1)
data_test
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
5 | 80 | 0.775 | 28 | 54 | 65 | 59 | 2 | 50 | 26 | 82 | 1001 | 26 | 3.5 | 9.00 |
6 | 75 | 0.767 | 27 | 61 | 64 | 55 | 2 | 50 | 26 | 79 | 1001 | 27 | 4.6 | 9.72 |
7 | 69 | 0.808 | 25 | 73 | 65 | 50 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.3 | 9.36 |
8 | 67 | 0.817 | 25 | 81 | 69 | 48 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.1 | 9.00 |
9 | 62 | 0.767 | 26 | 88 | 67 | 44 | 2 | 50 | 27 | 79 | 1002 | 27 | 5.1 | 7.56 |
10 | 59 | 0.758 | 25 | 103 | 60 | 42 | 2 | 90 | 28 | 78 | 1002 | 28 | 6.2 | 6.48 |
11 | 56 | 0.733 | 24 | 114 | 61 | 39 | 2 | 90 | 28 | 74 | 1002 | 28 | 8.3 | 10.08 |
y_train = data_train['pm25'].values
x_train = data_train.drop(['pm25'], axis = 1).values
Extract the PM2.5 column from the test set as the ground truth.
y_true = data_test['pm25'].values
x_test = data_test.drop(['pm25'], axis = 1).values
Start with an ordinary linear regression model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
y_esti = model.predict(x_test)
The predicted values:
y_esti
array([41.62610649, 43.71780508, 45.51334405, 44.44962464, 39.64758352,
40.84696283, 39.40450324, 41.28101583, 41.70510816, 38.61434584,
38.2644185 , 33.7721591 ])
The true values:
y_true
array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39], dtype=int64)
Evaluate the predictions with the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i^{*}\right)^2$$
((y_esti - y_true)**2).mean()
179.24659005909845
model.score(x_test, y_true)
-2.051366469855796
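The hand-rolled error above is the MSE, and model.score returns the R² coefficient of determination, which goes negative when the model fits worse than simply predicting the mean. Both can be cross-checked with sklearn's metrics; a sketch on hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_demo = np.array([57.0, 59.0, 61.0])
y_pred_demo = np.array([41.6, 43.7, 45.5])

mse = mean_squared_error(y_true_demo, y_pred_demo)
r2 = r2_score(y_true_demo, y_pred_demo)
# mse matches the manual formula; r2 < 0 because the error exceeds the variance of y
assert np.isclose(mse, ((y_pred_demo - y_true_demo) ** 2).mean())
print(mse, r2)
```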
To capture nonlinear effects, generate polynomial cross terms from the first three features and append them to the design matrix:
def gen_polynomial(x, power):
    # take the first three feature columns as the polynomial basis
    p = x[:, 0].reshape((len(x), 1))
    a = x[:, 1].reshape((len(x), 1))
    s = x[:, 2].reshape((len(x), 1))
    # append every cross term p**j * a**k * s**(i-j-k) of total degree i
    for i in range(2, power + 1):
        for j in range(i + 1):
            for k in range(i - j + 1):
                add = (p**j) * (a**k) * (s**(i - j - k))
                x = np.concatenate([x, add], axis = 1)
    return x
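gen_polynomial hand-builds cross terms of the first three columns only. For reference (not the code used above), sklearn's PolynomialFeatures expands every column the same way; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree-2 expansion without the constant column: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
print(x_poly[0])  # [1. 2. 1. 2. 4.]
```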
x_train_poly = gen_polynomial(x_train, 3)
x_test_poly = gen_polynomial(x_test, 3)
model2 = LinearRegression()
model2.fit(x_train_poly, y_train)
y_esti_poly = model2.predict(x_test_poly)
((y_esti_poly - y_true)**2).mean()
35.94599749989958
model2.score(x_test_poly, y_true)
0.38808090318175426
Next, build a ridge regression model.
Ridge regression suits two situations: when there are fewer samples than features, and when the features are multicollinear.
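A small sketch of why ridge helps with multicollinearity, on synthetic data (all values below are made up for illustration): as alpha grows, the penalty shrinks the coefficient vector that ordinary least squares would otherwise inflate on nearly collinear columns.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x1 = rng.rand(50)
x2 = x1 + 0.01 * rng.rand(50)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.rand(50)

# stronger penalties shrink the coefficient norm
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (1e-6, 1.0, 100.0)]
print(norms)
```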
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
reg = Ridge(alpha = 0.000001)
reg.fit(x_train_poly, y_train)
D:\ProgramData\lib\site-packages\sklearn\linear_model\ridge.py:125: LinAlgWarning: Ill-conditioned matrix (rcond=4.03488e-18): result may not be accurate.
overwrite_a=True).T
Ridge(alpha=1e-06, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
y_esti_reg_poly = reg.predict(x_test_poly)
((y_esti_reg_poly - y_true)**2).mean()
35.9459549360529
Now use RidgeCV:
Given several candidate alpha values, RidgeCV fits the best $\theta$ for each alpha, then selects the $\theta$, and its alpha, that scores best under cross-validation.
reg_cv = RidgeCV(alphas=[0.001, 10, 10000])
reg_cv.fit(x_train_poly, y_train)
RidgeCV(alphas=array([1.e-03, 1.e+01, 1.e+04]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
y_esti_regcv_poly = reg_cv.predict(x_test_poly)
((y_esti_regcv_poly - y_true)**2).mean()
42.06173546500281
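RidgeCV records the chosen penalty in its alpha_ attribute, which is worth inspecting after fitting; a self-contained sketch on synthetic data (not the weather set):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(1)
X = rng.rand(40, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.randn(40)

reg_cv_demo = RidgeCV(alphas=[0.001, 10, 10000]).fit(X, y)
print(reg_cv_demo.alpha_)  # the alpha with the best leave-one-out score
```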
Finally, build a Lasso regression model, a linear model that estimates sparse coefficients.
Lasso suits cases where only a few features matter, since it drives many coefficients to exactly zero.
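The sparsity can be observed directly: on synthetic data where only two of ten features carry signal, Lasso zeroes out the rest (a sketch, not the weather model):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(2)
X = rng.randn(100, 10)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3]      # only features 0 and 3 carry signal

lasso_demo = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(n_zero)  # most irrelevant coefficients are exactly zero
```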
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.00001, max_iter = 1000000)
lasso.fit(x_train_poly, y_train)
Lasso(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=1000000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
y_esti_lasso_poly = lasso.predict(x_test_poly)
((y_esti_lasso_poly - y_true)**2).mean()
35.944247424908795
lasso.score(x_test_poly, y_true)
0.3881106952137526
plt.figure(figsize = (14,8))
plt.plot(y_true, '--', label = 'real')
plt.plot(y_esti, label = 'normal')
plt.plot(y_esti_poly, label = 'normal_poly', marker = 'o')
plt.plot(y_esti_reg_poly, label = 'ridge_poly')
plt.plot(y_esti_regcv_poly, label = 'ridgeCV_poly')
plt.plot(y_esti_lasso_poly, label = 'lasso_poly')
plt.legend()
uu = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
plt.xticks(np.arange(12), uu.last_update.values, rotation = 30, fontsize = 12)
plt.grid()
plt.title('预测值对比')
plt.show()
The plot compares the models' predictions. The largest error comes from ordinary linear regression on the raw features; the other four models train on polynomial features, which improves accuracy substantially, and their results are close, with ordinary linear regression and ridge regression almost overlapping.
MSE and R² score for each model:
compare_table = pd.DataFrame(np.zeros((5,2)), index = ['normal', 'normal_poly', 'ridge_poly', 'ridgeCV_poly', 'lasso'], columns = ['MSE', 'score'])
mse = np.array([((y_esti - y_true)**2).mean(),
((y_esti_poly - y_true)**2).mean(),
((y_esti_reg_poly - y_true)**2).mean(),
((y_esti_regcv_poly - y_true)**2).mean(),
((y_esti_lasso_poly - y_true)**2).mean()])
score = np.array([model.score(x_test, y_true),
model2.score(x_test_poly, y_true),
reg.score(x_test_poly, y_true),
reg_cv.score(x_test_poly, y_true),
lasso.score(x_test_poly, y_true)])
compare_table['MSE'] = mse
compare_table['score'] = score
compare_table
  | MSE | score
---|---|---
normal | 179.246590 | -2.051366 |
normal_poly | 35.945997 | 0.388081 |
ridge_poly | 35.945955 | 0.388082 |
ridgeCV_poly | 42.061735 | 0.283971 |
lasso | 45.371600 | 0.388111 |
6. Conclusion
The predictions show some bias but capture the underlying trend well. After comparing the models, ordinary linear regression (on the polynomial features) comes out slightly ahead.