Project source code: https://github.com/dennis0818/weather_data_analysis
1. Analysis Objective
Using Beijing's historical 24-hour weather and air-pollutant data, build models with ordinary linear regression, ridge regression, and Lasso regression, compare their predictive performance to pick the best model, and predict the PM2.5 values in the following 12 hours of weather data.
2. Project Background
In Beijing, the biggest winter headache is smog. Whenever a smog episode rolls in, the gray, murky air and the suffocating feeling of breathing through a thick mask are dispiriting, and the chief culprit is PM2.5.
This analysis uses linear regression models to predict PM2.5 values.
On August 8 I collected Beijing's historical 24-hour data, i.e. the 24 hours from 22:00 on 08/07 through 21:00 on 08/08. Today I collect the newly generated records as the test set; since 12 hours have passed, 12 records are available, covering 22:00 on 08/08 through 09:00 on 08/09.
3. Data Source
The data come from the Seniverse weather service (心知天气), whose RESTful URLs return weather and air-quality data directly as JSON, so retrieval is straightforward.
4. Data Analysis
1. Data wrangling
import pandas as pd
import numpy as np
from io import StringIO
from urllib import request
import json
from dateutil.parser import parse
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
Fetch the data over HTTP with urllib's request module and tidy it up.
url_beijing_w = 'https://api.seniverse.com/v3/weather/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&unit=c'
url_beijing_p = 'https://api.seniverse.com/v3/air/hourly_history.json?key=Sz6GmmiQ6SAjYTKbc&location=beijing&language=zh-Hans&scope=city'
s_p = request.urlopen(url_beijing_p).read().decode('utf8')
s_w = request.urlopen(url_beijing_w).read().decode('utf8')
data_dict_p = json.loads(s_p)
data_dict_w = json.loads(s_w)
The responses are JSON: parse each into a dict with json.loads, then use the dict's keys and values to populate a DataFrame.
def gen_table_p(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value['city']
    return data_dict

def gen_table_w(list2):
    data_dict = {}
    for i, value in enumerate(list2):
        data_dict[i] = value
    return data_dict
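The two helpers build an index-keyed dict so that pd.DataFrame plus a transpose yields one row per hour. As a side note, pd.DataFrame also accepts a list of dicts directly, one row per dict; a minimal sketch with a hypothetical sample shaped like the API's hourly_history entries:

```python
import pandas as pd

# hypothetical records shaped like the API's hourly_history list
hourly_history = [
    {"city": {"aqi": "105", "pm25": "48", "last_update": "2019-08-08T14:00:00+08:00"}},
    {"city": {"aqi": "97", "pm25": "49", "last_update": "2019-08-08T13:00:00+08:00"}},
]

# one row per dict, no transpose needed
data_p = pd.DataFrame([item["city"] for item in hourly_history])
print(data_p.shape)  # (2, 3)
```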
Convert the weather and pollutant records into DataFrame tables.
data_list_p = data_dict_p['results'][0]['hourly_history']
data_list_p = gen_table_p(data_list_p)
data_p = pd.DataFrame(data_list_p)
data_list_w = data_dict_w['results'][0]['hourly_history']
data_list_w = gen_table_w(data_list_w)
data_w = pd.DataFrame(data_list_w)
data_p = data_p.T
data_w = data_w.T
Adjust the time format: parse last_update into datetime, reformat it as desired, and write it back. Then drop a few unimportant feature variables.
def adjust_time(data):
    time = data['last_update'].astype(str)
    time = time.str[:19]
    time = time.str.replace('T', ' ')
    time = time.map(lambda x : parse(x))
    time = time.dt.strftime('%H-%m/%d')
    data['last_update'] = time
    return data
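The chain inside adjust_time can be checked on a single value. Assuming the API returns ISO-8601 timestamps like the hypothetical one below, the steps produce an hour-month/day label:

```python
from dateutil.parser import parse

raw = "2019-08-08T14:00:00+08:00"      # hypothetical API timestamp
trimmed = raw[:19].replace("T", " ")   # "2019-08-08 14:00:00", timezone stripped
label = parse(trimmed).strftime("%H-%m/%d")
print(label)  # "14-08/08"
```

Putting the hour first means string sorting orders rows by hour of day, which the later per-station plots rely on.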
data_p = adjust_time(data_p)
data_w = adjust_time(data_w)
data = pd.merge(data_p, data_w, on = 'last_update')
data_test = data[:12]
data_test.head()
  | aqi | co | last_update | no2 | o3 | pm10 | pm25 | quality | so2 | clouds | ... | feels_like | humidity | pressure | temperature | text | visibility | wind_direction | wind_direction_degree | wind_scale | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 105 | 0.817 | 14-08/19 | 18 | 209 | 81 | 48 | 轻度污染 | 5 | 50 | ... | 29 | 46 | 1004 | 30 | 多云 | 19.4 | 南 | 188 | 3 | 19.44 |
1 | 97 | 0.908 | 13-08/19 | 19 | 197 | 84 | 49 | 良 | 6 | 50 | ... | 30 | 48 | 1004 | 30 | 多云 | 18.5 | 西 | 248 | 3 | 18.0 |
2 | 69 | 0.875 | 12-08/19 | 26 | 162 | 87 | 49 | 良 | 4 | 50 | ... | 28 | 53 | 1005 | 29 | 多云 | 13.7 | 西 | 250 | 3 | 15.84 |
3 | 74 | 0.808 | 11-08/19 | 41 | 114 | 98 | 51 | 良 | 3 | 90 | ... | 28 | 53 | 1005 | 28 | 阴 | 11.0 | 西南 | 228 | 2 | 11.16 |
4 | 74 | 0.800 | 10-08/19 | 51 | 77 | 98 | 48 | 良 | 2 | 90 | ... | 27 | 56 | 1005 | 28 | 阴 | 9.9 | 南 | 188 | 3 | 14.04 |
5 rows × 22 columns
data_test = data_test.drop(['dew_point', 'wind_direction', 'wind_direction_degree', 'text', 'code', 'wind_scale','last_update', 'quality'], axis = 1)
data_test
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 105 | 0.817 | 18 | 209 | 81 | 48 | 5 | 50 | 29 | 46 | 1004 | 30 | 19.4 | 19.44 |
1 | 97 | 0.908 | 19 | 197 | 84 | 49 | 6 | 50 | 30 | 48 | 1004 | 30 | 18.5 | 18.0 |
2 | 69 | 0.875 | 26 | 162 | 87 | 49 | 4 | 50 | 28 | 53 | 1005 | 29 | 13.7 | 15.84 |
3 | 74 | 0.808 | 41 | 114 | 98 | 51 | 3 | 90 | 28 | 53 | 1005 | 28 | 11.0 | 11.16 |
4 | 74 | 0.800 | 51 | 77 | 98 | 48 | 2 | 90 | 27 | 56 | 1005 | 28 | 9.9 | 14.04 |
5 | 73 | 0.775 | 56 | 50 | 95 | 46 | 1 | 90 | 26 | 60 | 1005 | 26 | 9.6 | 12.24 |
6 | 67 | 0.750 | 56 | 28 | 83 | 40 | 1 | 90 | 25 | 64 | 1005 | 25 | 8.9 | 6.12 |
7 | 61 | 0.675 | 58 | 17 | 72 | 34 | 1 | 0 | 23 | 71 | 1005 | 23 | 9.0 | 6.12 |
8 | 59 | 0.650 | 54 | 20 | 68 | 33 | 2 | 0 | 21 | 75 | 1004 | 22 | 10.1 | 4.68 |
9 | 59 | 0.608 | 48 | 26 | 67 | 33 | 2 | 0 | 21 | 78 | 1004 | 21 | 11.6 | 6.48 |
10 | 58 | 0.592 | 48 | 32 | 66 | 31 | 2 | 0 | 21 | 79 | 1004 | 21 | 12.4 | 7.56 |
11 | 59 | 0.617 | 43 | 44 | 68 | 32 | 3 | 0 | 22 | 76 | 1004 | 22 | 12.5 | 6.84 |
2. Generating the training and test sets
Back up the data, then load the training and test sets.
#data_test.to_excel('D:/python/practise/sample/weather/data_test(22-09).xlsx')
data_train = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_all(22-21).xlsx')
data_test = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
data_train.head()
  | Unnamed: 0 | aqi | co | last_update | no2 | o3 | pm10 | pm25 | so2 | station | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 49 | 0.7 | 21-08/08 | 29 | 111 | 49 | 31 | 3 | 万寿西宫 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
1 | 1 | 51 | 0.7 | 21-08/08 | 11 | 108 | 52 | 33 | 3 | 定陵 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
2 | 2 | 52 | 0.8 | 21-08/08 | 19 | 124 | 53 | 30 | 3 | 东四 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
3 | 3 | 48 | 0.6 | 21-08/08 | 20 | 88 | 46 | 33 | 3 | 天坛 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
4 | 4 | 55 | 0.7 | 21-08/08 | 23 | 118 | 59 | 32 | 3 | 农展馆 | 90 | 29 | 71 | 1002 | 29 | 12.1 | 9.0 |
data_train.shape
(288, 17)
Extract each station's aqi readings.
aqi_all_station = data_train[['last_update','aqi', 'station']].set_index(['station', 'last_update']).unstack(0)
aqi_all_station = aqi_all_station.replace({0: np.nan}).fillna(aqi_all_station.median())
aqi_all_station
aqi by station:

last_update | 万寿西宫 | 东四 | 农展馆 | 古城 | 天坛 | 奥体中心 | 官园 | 定陵 | 怀柔镇 | 昌平镇 | 海淀区万柳 | 顺义新城
---|---|---|---|---|---|---|---|---|---|---|---|---
00-08/08 | 55.0 | 62.0 | 30.0 | 36.0 | 28.0 | 52.5 | 30.0 | 35.0 | 22.0 | 31.0 | 28.0 | 60.0 |
01-08/08 | 45.0 | 51.0 | 39.0 | 31.0 | 40.0 | 45.0 | 40.0 | 22.0 | 24.0 | 22.0 | 26.0 | 50.0 |
02-08/08 | 60.0 | 53.0 | 39.0 | 31.0 | 52.0 | 51.0 | 43.0 | 26.0 | 24.0 | 23.0 | 35.0 | 60.0 |
03-08/08 | 65.0 | 54.0 | 42.0 | 29.0 | 57.0 | 45.0 | 32.0 | 23.0 | 26.0 | 23.0 | 44.0 | 60.0 |
04-08/08 | 64.0 | 58.0 | 38.0 | 26.0 | 49.0 | 45.0 | 36.0 | 25.0 | 24.0 | 22.0 | 49.0 | 58.0 |
05-08/08 | 50.0 | 48.0 | 49.0 | 26.0 | 42.0 | 43.0 | 39.0 | 25.0 | 26.0 | 24.0 | 54.0 | 49.0 |
06-08/08 | 39.0 | 50.0 | 45.0 | 22.0 | 54.0 | 46.0 | 51.0 | 29.0 | 31.0 | 30.0 | 57.0 | 36.0 |
07-08/08 | 53.0 | 62.0 | 42.0 | 54.0 | 58.0 | 56.0 | 63.0 | 33.0 | 29.0 | 35.0 | 45.0 | 43.0 |
08-08/08 | 64.0 | 62.0 | 69.0 | 63.0 | 69.0 | 60.0 | 65.0 | 39.0 | 35.0 | 38.0 | 52.0 | 51.0 |
09-08/08 | 70.0 | 77.0 | 77.0 | 53.0 | 79.0 | 52.0 | 60.0 | 31.0 | 24.0 | 39.0 | 54.0 | 63.0 |
10-08/08 | 73.0 | 82.0 | 83.0 | 52.0 | 85.0 | 67.0 | 62.0 | 32.0 | 29.0 | 27.0 | 60.0 | 65.0 |
11-08/08 | 64.0 | 74.0 | 74.0 | 44.0 | 88.0 | 53.0 | 65.0 | 37.0 | 40.0 | 36.0 | 58.0 | 63.0 |
12-08/08 | 64.0 | 73.0 | 67.0 | 50.0 | 60.0 | 57.0 | 68.0 | 38.0 | 40.0 | 33.0 | 52.0 | 58.0 |
13-08/08 | 64.0 | 94.0 | 75.0 | 65.0 | 65.0 | 78.0 | 99.0 | 48.0 | 45.0 | 44.0 | 83.0 | 79.0 |
14-08/08 | 84.0 | 111.0 | 104.0 | 88.0 | 100.0 | 88.0 | 111.0 | 53.0 | 50.0 | 77.0 | 94.0 | 89.0 |
15-08/08 | 77.0 | 122.0 | 114.0 | 98.0 | 106.0 | 107.0 | 118.0 | 74.0 | 85.0 | 100.0 | 97.0 | 92.0 |
16-08/08 | 57.0 | 112.0 | 108.0 | 84.0 | 89.0 | 101.0 | 109.0 | 102.0 | 111.0 | 113.0 | 101.0 | 109.0 |
17-08/08 | 48.0 | 88.0 | 88.0 | 62.0 | 60.0 | 72.0 | 75.0 | 120.0 | 112.0 | 130.0 | 63.0 | 94.0 |
18-08/08 | 50.0 | 85.0 | 80.0 | 50.0 | 65.0 | 53.0 | 63.0 | 128.0 | 108.0 | 127.0 | 47.0 | 83.0 |
19-08/08 | 48.0 | 74.0 | 70.0 | 52.0 | 55.0 | 60.0 | 54.0 | 122.0 | 84.0 | 82.0 | 48.0 | 65.0 |
20-08/08 | 46.0 | 52.0 | 46.0 | 59.0 | 51.0 | 52.0 | 47.0 | 57.0 | 57.0 | 48.0 | 53.0 | 54.0 |
21-08/08 | 49.0 | 52.0 | 55.0 | 72.0 | 48.0 | 49.0 | 51.0 | 51.0 | 60.0 | 53.0 | 52.0 | 62.0 |
22-08/07 | 40.0 | 59.0 | 59.0 | 41.0 | 43.0 | 51.0 | 40.0 | 24.0 | 27.0 | 41.0 | 61.0 | 54.0 |
23-08/07 | 55.0 | 62.0 | 37.0 | 51.0 | 36.0 | 52.5 | 39.0 | 35.0 | 24.0 | 37.0 | 39.0 | 60.0 |
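The replace-then-fill idiom above treats a zero reading as missing and imputes it; a minimal sketch on toy data (computing the median after the zero-to-NaN replacement, as below, keeps the zeros from dragging the median down):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [0, 2, 4], "b": [1, 0, 3]})  # 0 marks a failed reading
clean = df.replace({0: np.nan})
clean = clean.fillna(clean.median())  # impute each column's median
print(clean["a"].tolist())  # [3.0, 2.0, 4.0]
```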
plt.rcParams['font.sans-serif'] = ['SimHei']  # so Chinese labels render correctly
aqi_all_station.plot(figsize = (20, 5))
plt.xticks(np.arange(24), aqi_all_station.index.values, fontsize = 15, rotation = 30)
plt.grid()
plt.show()
"aqi" is the air quality index (lower means cleaner air). The plot shows each district's aqi over the full 24 hours: overall the daily peak falls roughly between 15:00 and 18:00 and drops quickly after 18:00. In 万寿西宫, 怀柔, and 昌平 the peak occurs in the 17:00-18:00 window; in every other district it occurs at 15:00.
Compute each district's mean aqi over the 24 hours.
plt.figure(figsize = (12,6))
aqi_all_station.mean().reset_index(level = 0, drop = True).sort_values().plot.barh()
plt.xlabel('aqi', fontsize = 20)
Text(0.5, 0, 'aqi')
The chart shows the five districts with the highest mean aqi: 东四, 顺义新城, 农展馆, 天坛, and 官园. All but 顺义新城 lie inside the Third Ring Road, and 天坛, 官园, and 东四 lie inside the Second Ring, suggesting that air quality worsens toward the city center.
data_test.head()
  | Unnamed: 0 | last_update | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0 | 09-08/08 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 1 | 08-08/08 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 2 | 07-08/08 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 3 | 06-08/08 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 4 | 05-08/08 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
data_train = data_train.drop(['Unnamed: 0', ], axis = 1)
data_test = data_test.drop('Unnamed: 0', axis = 1)
def every_station_plot(station):
    data_station = data_train[data_train['station'] == station]
    fig, ax = plt.subplots(2, 1, figsize = (20, 12))
    data_station = data_station.set_index('last_update')
    # treat zeros as missing and impute the column medians
    data_station = data_station.replace({0: np.nan}).fillna(data_station.median())
    data_station.drop(['co', 'pressure', 'so2', 'visibility', 'wind_speed', 'clouds'], axis = 1).sort_values(by = 'last_update').plot(ax = ax[0], marker = 'o')
    data_station[['co', 'so2', 'visibility', 'wind_speed']].sort_values(by = 'last_update').plot(ax = ax[1], marker = 'o')
    ax[0].set_title('aqi, no2, o3, pm10, pm25, feels_like, humidity, temperature')
    ax[1].set_title('co, so2, visibility, wind_speed')
    plt.suptitle(station, fontsize = 20)
For each district, plot how every pollutant varies across the 24 hours.
Because co, so2, visibility, and wind speed are on a much smaller scale than the other indicators, they are plotted separately so each trend remains visible.
for st in data_train['station'].unique():
every_station_plot(st)
The per-district plots show that:
- Ozone fluctuates the most. As a secondary pollutant, it is the signature product of complex photochemical reactions and is driven by solar radiation and temperature; in every district its concentration generally bottoms out around 07:00, just after sunrise, and peaks around 15:00, just after midday.
- pm10 fluctuates the next most: most districts show two peaks, around 08:00-10:00 and 18:00-22:00, roughly matching the morning and evening rush hours.
- Nitrogen dioxide and carbon monoxide vary comparatively little.
- pm2.5 has become the leading air-quality problem across much of China and seriously harms both human health and the atmosphere. Its curve does not follow any other pollutant or weather indicator, and no clear correlation is visible. Because the composition and drivers of pm2.5 are complex, we turn to modeling to examine how each indicator relates to it.
5. Modeling and Prediction
Correlation heat map of the feature variables
plt.figure(figsize = (15, 15))
sns.heatmap(data_train.corr(), annot = True, cmap='RdYlGn',linewidths=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x2771d039ac8>
Use the historical 24-hour records from all Beijing monitoring stations as the training set.
data_train = data_train.drop(['last_update', 'station'], axis = 1)
data_train = data_train.replace({0:np.nan}).fillna(data_train.median())
data_train.head()
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 49.0 | 0.7 | 29.0 | 111.0 | 49.0 | 31.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
1 | 51.0 | 0.7 | 11.0 | 108.0 | 52.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
2 | 52.0 | 0.8 | 19.0 | 124.0 | 53.0 | 30.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
3 | 48.0 | 0.6 | 20.0 | 88.0 | 46.0 | 33.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
4 | 55.0 | 0.7 | 23.0 | 118.0 | 59.0 | 32.0 | 3.0 | 90.0 | 29.0 | 71.0 | 1002.0 | 29.0 | 12.1 | 9.0 |
data_test = data_test.drop(['last_update'], axis = 1)
data_test
  | aqi | co | no2 | o3 | pm10 | pm25 | so2 | clouds | feels_like | humidity | pressure | temperature | visibility | wind_speed
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 78 | 0.817 | 29 | 61 | 61 | 57 | 1 | 50 | 28 | 77 | 1001 | 28 | 3.9 | 11.16 |
1 | 80 | 0.850 | 29 | 49 | 59 | 59 | 1 | 50 | 27 | 79 | 1001 | 27 | 3.7 | 8.64 |
2 | 83 | 0.842 | 30 | 39 | 62 | 61 | 2 | 50 | 26 | 83 | 1001 | 27 | 3.1 | 8.64 |
3 | 84 | 0.817 | 28 | 44 | 61 | 62 | 2 | 50 | 25 | 85 | 1001 | 26 | 3.1 | 6.12 |
4 | 80 | 0.792 | 29 | 49 | 62 | 59 | 2 | 50 | 25 | 86 | 1001 | 25 | 3.1 | 9.00 |
5 | 80 | 0.775 | 28 | 54 | 65 | 59 | 2 | 50 | 26 | 82 | 1001 | 26 | 3.5 | 9.00 |
6 | 75 | 0.767 | 27 | 61 | 64 | 55 | 2 | 50 | 26 | 79 | 1001 | 27 | 4.6 | 9.72 |
7 | 69 | 0.808 | 25 | 73 | 65 | 50 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.3 | 9.36 |
8 | 67 | 0.817 | 25 | 81 | 69 | 48 | 2 | 50 | 27 | 79 | 1002 | 27 | 4.1 | 9.00 |
9 | 62 | 0.767 | 26 | 88 | 67 | 44 | 2 | 50 | 27 | 79 | 1002 | 27 | 5.1 | 7.56 |
10 | 59 | 0.758 | 25 | 103 | 60 | 42 | 2 | 90 | 28 | 78 | 1002 | 28 | 6.2 | 6.48 |
11 | 56 | 0.733 | 24 | 114 | 61 | 39 | 2 | 90 | 28 | 74 | 1002 | 28 | 8.3 | 10.08 |
y_train = data_train['pm25'].values
x_train = data_train.drop(['pm25'], axis = 1).values
Extract the PM2.5 column from the test set as the ground truth.
y_true = data_test['pm25'].values
x_test = data_test.drop(['pm25'], axis = 1).values
Start with an ordinary linear regression model.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
y_esti = model.predict(x_test)
The predicted values:
y_esti
array([41.62610649, 43.71780508, 45.51334405, 44.44962464, 39.64758352,
40.84696283, 39.40450324, 41.28101583, 41.70510816, 38.61434584,
38.2644185 , 33.7721591 ])
The true values:
y_true
array([57, 59, 61, 62, 59, 59, 55, 50, 48, 44, 42, 39], dtype=int64)
Evaluate the predictions with the mean squared error (MSE):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i^{*}\right)^2$$
((y_esti - y_true)**2).mean()
179.24659005909845
model.score(x_test, y_true)
-2.051366469855796
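The hand-rolled error above is the MSE, and model.score returns the R² coefficient of determination, which goes negative when the model fits worse than simply predicting the mean. Both can be cross-checked with sklearn's metrics; a sketch on hypothetical numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true_demo = np.array([57.0, 59.0, 61.0])
y_pred_demo = np.array([41.6, 43.7, 45.5])

mse = mean_squared_error(y_true_demo, y_pred_demo)
r2 = r2_score(y_true_demo, y_pred_demo)
# mse matches the manual formula; r2 < 0 because the error exceeds the variance of y
assert np.isclose(mse, ((y_pred_demo - y_true_demo) ** 2).mean())
print(mse, r2)
```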
To capture nonlinear effects, generate polynomial cross terms from the first three features and append them to the design matrix:
def gen_polynomial(x, power):
    # take the first three feature columns as the polynomial basis
    p = x[:, 0].reshape((len(x), 1))
    a = x[:, 1].reshape((len(x), 1))
    s = x[:, 2].reshape((len(x), 1))
    # append every cross term p**j * a**k * s**(i-j-k) of total degree i
    for i in range(2, power + 1):
        for j in range(i + 1):
            for k in range(i - j + 1):
                add = (p**j) * (a**k) * (s**(i - j - k))
                x = np.concatenate([x, add], axis = 1)
    return x
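gen_polynomial hand-builds cross terms of the first three columns only. For reference (not the code used above), sklearn's PolynomialFeatures expands every column the same way; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree-2 expansion without the constant column: x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)
print(x_poly[0])  # [1. 2. 1. 2. 4.]
```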
x_train_poly = gen_polynomial(x_train, 3)
x_test_poly = gen_polynomial(x_test, 3)
model2 = LinearRegression()
model2.fit(x_train_poly, y_train)
y_esti_poly = model2.predict(x_test_poly)
((y_esti_poly - y_true)**2).mean()
35.94599749989958
model2.score(x_test_poly, y_true)
0.38808090318175426
Next, build a ridge regression model.
Ridge regression suits two situations: when there are fewer samples than features, and when the features are multicollinear.
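A small sketch of why ridge helps with multicollinearity, on synthetic data (all values below are made up for illustration): as alpha grows, the penalty shrinks the coefficient vector that ordinary least squares would otherwise inflate on nearly collinear columns.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x1 = rng.rand(50)
x2 = x1 + 0.01 * rng.rand(50)          # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.rand(50)

# stronger penalties shrink the coefficient norm
norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (1e-6, 1.0, 100.0)]
print(norms)
```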
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
reg = Ridge(alpha = 0.000001)
reg.fit(x_train_poly, y_train)
D:\ProgramData\lib\site-packages\sklearn\linear_model\ridge.py:125: LinAlgWarning: Ill-conditioned matrix (rcond=4.03488e-18): result may not be accurate.
overwrite_a=True).T
Ridge(alpha=1e-06, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001)
y_esti_reg_poly = reg.predict(x_test_poly)
((y_esti_reg_poly - y_true)**2).mean()
35.9459549360529
Now use RidgeCV:
Given several candidate alpha values, RidgeCV fits the best $\theta$ for each alpha, then selects the $\theta$, and its alpha, that scores best under cross-validation.
reg_cv = RidgeCV(alphas=[0.001, 10, 10000])
reg_cv.fit(x_train_poly, y_train)
RidgeCV(alphas=array([1.e-03, 1.e+01, 1.e+04]), cv=None, fit_intercept=True,
gcv_mode=None, normalize=False, scoring=None, store_cv_values=False)
y_esti_regcv_poly = reg_cv.predict(x_test_poly)
((y_esti_regcv_poly - y_true)**2).mean()
42.06173546500281
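RidgeCV records the chosen penalty in its alpha_ attribute, which is worth inspecting after fitting; a self-contained sketch on synthetic data (not the weather set):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.RandomState(1)
X = rng.rand(40, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.randn(40)

reg_cv_demo = RidgeCV(alphas=[0.001, 10, 10000]).fit(X, y)
print(reg_cv_demo.alpha_)  # the alpha with the best leave-one-out score
```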
Finally, build a Lasso regression model, a linear model that estimates sparse coefficients.
Lasso suits cases where only a few features matter, since it drives many coefficients to exactly zero.
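The sparsity can be observed directly: on synthetic data where only two of ten features carry signal, Lasso zeroes out the rest (a sketch, not the weather model):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(2)
X = rng.randn(100, 10)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3]      # only features 0 and 3 carry signal

lasso_demo = Lasso(alpha=0.1).fit(X, y)
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(n_zero)  # most irrelevant coefficients are exactly zero
```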
from sklearn.linear_model import Lasso
lasso = Lasso(alpha = 0.00001, max_iter = 1000000)
lasso.fit(x_train_poly, y_train)
Lasso(alpha=1e-05, copy_X=True, fit_intercept=True, max_iter=1000000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
y_esti_lasso_poly = lasso.predict(x_test_poly)
((y_esti_lasso_poly - y_true)**2).mean()
35.944247424908795
lasso.score(x_test_poly, y_true)
0.3881106952137526
plt.figure(figsize = (14,8))
plt.plot(y_true, '--', label = 'real')
plt.plot(y_esti, label = 'normal')
plt.plot(y_esti_poly, label = 'normal_poly', marker = 'o')
plt.plot(y_esti_reg_poly, label = 'ridge_poly')
plt.plot(y_esti_regcv_poly, label = 'ridgeCV_poly')
plt.plot(y_esti_lasso_poly, label = 'lasso_poly')
plt.legend()
uu = pd.read_excel('D:/Python/exercise/samples/WeatherData/backup/data_test(22-09).xlsx')
plt.xticks(np.arange(12), uu.last_update.values, rotation = 30, fontsize = 12)
plt.grid()
plt.title('预测值对比')
plt.show()
The plot compares the models' predictions. The largest error comes from ordinary linear regression on the raw features; the other four models train on polynomial features, which improves accuracy substantially, and their results are close, with ordinary linear regression and ridge regression almost overlapping.
MSE and R² score for each model:
compare_table = pd.DataFrame(np.zeros((5,2)), index = ['normal', 'normal_poly', 'ridge_poly', 'ridgeCV_poly', 'lasso'], columns = ['MSE', 'score'])
mse = np.array([((y_esti - y_true)**2).mean(),
((y_esti_poly - y_true)**2).mean(),
((y_esti_reg_poly - y_true)**2).mean(),
((y_esti_regcv_poly - y_true)**2).mean(),
((y_esti_lasso_poly - y_true)**2).mean()])
score = np.array([model.score(x_test, y_true),
model2.score(x_test_poly, y_true),
reg.score(x_test_poly, y_true),
reg_cv.score(x_test_poly, y_true),
lasso.score(x_test_poly, y_true)])
compare_table['MSE'] = mse
compare_table['score'] = score
compare_table
  | MSE | score
---|---|---
normal | 179.246590 | -2.051366 |
normal_poly | 35.945997 | 0.388081 |
ridge_poly | 35.945955 | 0.388082 |
ridgeCV_poly | 42.061735 | 0.283971 |
lasso | 45.371600 | 0.388111 |
6. Conclusion
The predictions show some bias but capture the underlying trend well. After comparing the models, ordinary linear regression (on the polynomial features) comes out slightly ahead.