使用fbprophet分析世界疫情感染人数
import pandas as pd
from fbprophet import Prophet
pred = pd.read_csv("../kaggle3/covid-19-all.csv")
pred.head()
Country/Region | Province/State | Latitude | Longitude | Confirmed | Recovered | Deaths | Date | |
---|---|---|---|---|---|---|---|---|
0 | China | Anhui | 31.8257 | 117.2264 | 1.0 | NaN | NaN | 2020-01-22 |
1 | China | Beijing | 40.1824 | 116.4142 | 14.0 | NaN | NaN | 2020-01-22 |
2 | China | Chongqing | 30.0572 | 107.8740 | 6.0 | NaN | NaN | 2020-01-22 |
3 | China | Fujian | 26.0789 | 117.9874 | 1.0 | NaN | NaN | 2020-01-22 |
4 | China | Gansu | 37.8099 | 101.0583 | NaN | NaN | NaN | 2020-01-22 |
pred = pred.fillna(0)
predgrp = pred.groupby("Date")[["Confirmed","Recovered","Deaths"]].sum().reset_index()
predgrp.head()
Date | Confirmed | Recovered | Deaths | |
---|---|---|---|---|
0 | 2020-01-22 | 555.0 | 28.0 | 17.0 |
1 | 2020-01-23 | 653.0 | 30.0 | 18.0 |
2 | 2020-01-24 | 941.0 | 36.0 | 26.0 |
3 | 2020-01-25 | 1438.0 | 39.0 | 42.0 |
4 | 2020-01-26 | 2118.0 | 52.0 | 56.0 |
观察前面五行,发现单日新增感染人数只有几百,再观察最近的感染人数,达到了恐怖的18万人之多!
predgrp.tail()
Date | Confirmed | Recovered | Deaths | |
---|---|---|---|---|
87 | 2020-04-18 | 2317759.0 | 592319.0 | 159510.0 |
88 | 2020-04-19 | 2401379.0 | 623903.0 | 165044.0 |
89 | 2020-04-20 | 2472259.0 | 645738.0 | 169986.0 |
90 | 2020-04-21 | 2549294.0 | 679819.0 | 176583.0 |
91 | 2020-04-22 | 2623415.0 | 709694.0 | 183027.0 |
pred_cnfrm = predgrp.loc[:,["Date","Confirmed"]]
pred_cnfrm.shape
(92, 2)
pr_data = pred_cnfrm
pr_data.columns = ['ds','y']
pr_data.head()
m=Prophet()
m.fit(pr_data)
future=m.make_future_dataframe(periods=15) #预测15天
forecast=m.predict(future)
forecast.head().T
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
ds | 2020-01-22 00:00:00 | 2020-01-23 00:00:00 | 2020-01-24 00:00:00 | 2020-01-25 00:00:00 | 2020-01-26 00:00:00 |
trend | -1231.05 | -218.877 | 793.296 | 1805.47 | 3092.11 |
trend_lower | -1231.05 | -218.877 | 793.296 | 1805.47 | 3092.11 |
trend_upper | -1231.05 | -218.877 | 793.296 | 1805.47 | 3092.11 |
yhat_lower | -11158.3 | -8236.24 | -3166.07 | 247.023 | -982.845 |
yhat_upper | 1602.52 | 5214.46 | 10051.4 | 12530.2 | 11597.8 |
additive_terms | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
additive_terms_lower | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
additive_terms_upper | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
multiplicative_terms | 0 | 0 | 0 | 0 | 0 |
multiplicative_terms_lower | 0 | 0 | 0 | 0 | 0 |
multiplicative_terms_upper | 0 | 0 | 0 | 0 | 0 |
weekly | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
weekly_lower | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
weekly_upper | -3974.05 | -571.137 | 2528.97 | 4295.44 | 2357.41 |
yhat | -5205.09 | -790.014 | 3322.27 | 6100.91 | 5449.52 |
y_ped = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
y_ped.head()
ds | yhat | yhat_lower | yhat_upper | |
---|---|---|---|---|
0 | 2020-01-22 | -5205.094924 | -11158.349533 | 1602.523589 |
1 | 2020-01-23 | -790.014418 | -8236.237859 | 5214.464592 |
2 | 2020-01-24 | 3322.270436 | -3166.065919 | 10051.441647 |
3 | 2020-01-25 | 6100.908463 | 247.022750 | 12530.151876 |
4 | 2020-01-26 | 5449.521809 | -982.844950 | 11597.844552 |
m.plot(forecast,xlabel='Date',ylabel='Confirmed Count', uncertainty=True)
这里是区间预测,但是由于数据比较大,所以看到点线近似重合了,所以认为误差可以被接受。
如果想查看预测的成分分析,可以使用 Prophet.plot_components 方法。
默认情况下,将展示趋势、时间序列的年度季节性和周季节性。 如果之前包含了节假日,也会展示出来。
m.plot_components(forecast)
误差分析:主要看绝对值误差,因为均方误差一般针对回归问题。
用sklearn库的内置函数计算误差
from sklearn.metrics import mean_squared_error # 均方误差
from sklearn.metrics import mean_absolute_error # 平方绝对误差
y = y_ped["yhat"].loc[0:91]
观察前几个数据的预测效果,发现差异比较大。
(y-pred_cnfrm["Confirmed"]).head()
0 -5760.094924
1 -1443.014418
2 2381.270436
3 4662.908463
4 3331.521809
dtype: float64
MSE(Mean Squared Error)均方误差
mean_squared_error(pred_cnfrm["Confirmed"],y)
24838490.129674848
MAE (Mean absolute Error)平均绝对误差
mean_absolute_error(pred_cnfrm["Confirmed"],y)
3423.4048112033311
利用numpy计算误差
MSE(Mean Squared Error)均方误差
import numpy as np
mse_test=np.sum((y-pred_cnfrm["Confirmed"])**2)/len(y) #跟数学公式一样的
mse_test
24838490.129674848
RMSE(Root Mean Squared Error)均方根误差
rmse_test=mse_test ** 0.5
rmse_test
4983.8228429263863
MAE (Mean absolute Error)平均绝对误差
mae_test=np.sum(abs(y-pred_cnfrm["Confirmed"]))/len(y) #跟数学公式一样的
mae_test
3423.4048112033311