Can Prophet Accurately Forecast Web Page Views?

Forecasting web page views can be tricky, because page views tend to show significant “spikes”: periods in which the number of views is much higher than average.

Consider page views for the term “earthquake”, based on statistics available from Wikimedia Toolforge for January 2016 to August 2020.

While there appears to be a generally decreasing trend, there are still large spikes in page views at particular points. This conforms to our expectations: it is natural to expect spikes for the term “earthquake” at times when an earthquake has actually occurred.

While a time series model cannot predict such spikes, it can still be used to forecast the trend more generally.

To that end, a Prophet model is built to forecast page views for this search term.

Model Configuration

Prophet is a forecasting model developed by Facebook that models time series with explicit adjustments for factors such as seasonality, holiday periods, and changepoints.

Here is a decomposition of the above time series, produced in Python:

[Figure: decomposition of the time series. Source: Jupyter Notebook Output]

The decomposition shows a strongly decreasing trend over time. Moreover, a visual inspection of the seasonality graph indicates yearly fluctuations.

An additive model with yearly seasonality and a Fourier order of 10 is defined as follows:

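from prophet import Prophet  # published as `fbprophet` before version 1.0

prophet_basic = Prophet(seasonality_mode='additive')
# Prophet expresses `period` in days, so a yearly cycle on daily data is 365.25
prophet_basic.add_seasonality('yearly_seasonality', period=365.25, fourier_order=10)
prophet_basic.fit(train_dataset)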
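
To make the Fourier order concrete: Prophet represents a seasonal component as a sum of sine and cosine terms, one pair per harmonic, so an order of 10 gives 20 regressors. The sketch below builds those terms by hand for a single day. The function name `fourier_features` and the 365.25-day period are assumptions chosen for illustration, and this is not Prophet's internal code.

```python
import math

def fourier_features(t_days, period, order):
    """Return [sin, cos] pairs for harmonics 1..order at time t_days."""
    features = []
    for n in range(1, order + 1):
        angle = 2.0 * math.pi * n * t_days / period
        features.append(math.sin(angle))
        features.append(math.cos(angle))
    return features

# Assumption: daily observations, so one year is about 365.25 days.
YEARLY_PERIOD_DAYS = 365.25

row = fourier_features(t_days=30, period=YEARLY_PERIOD_DAYS, order=10)
print(len(row))  # one sine and one cosine column per harmonic
```

A higher order lets the seasonal curve bend more sharply, at the cost of a greater risk of fitting noise.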

Here is a breakdown of the components fitted by Prophet:

# `forecast` is the DataFrame returned by prophet_basic.predict()
fig1 = prophet_basic.plot_components(forecast)
[Figure: Prophet component plots. Source: Jupyter Notebook Output]

Now that the trend and seasonality components have been identified, the next task is to choose the number of changepoints in the data.

Simply put, a changepoint is a point at which the trajectory of a time series changes significantly. Identifying changepoints properly can help improve model accuracy.
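
As a rough illustration of what a changepoint does, Prophet's default trend is piecewise linear: the growth rate is allowed to shift at each changepoint while the trend line itself stays continuous. The sketch below evaluates such a trend by hand. The function name, the changepoint location and the slope values are made up for the example and are not taken from the fitted model.

```python
def piecewise_linear_trend(t, base_rate, base_offset, changepoints, rate_deltas):
    """Evaluate a continuous piecewise-linear trend at time t.

    At each changepoint s the slope changes by delta; the offset is
    adjusted by -s * delta so the line has no jump at s.
    """
    rate = base_rate
    offset = base_offset
    for s, delta in zip(changepoints, rate_deltas):
        if t >= s:
            rate += delta
            offset -= s * delta
    return rate * t + offset

# Hypothetical trend: slope 1.0 until t = 10, then slope 0.5 afterwards.
for t in (0, 10, 20):
    print(t, piecewise_linear_trend(t, 1.0, 0.0, [10], [-0.5]))
```

With more changepoints the trend can follow the data more closely, which is why the number of changepoints is worth tuning.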

To do so, the number of changepoints was varied and the RMSE (root mean squared error) on the test set was calculated for each setting:

[Image: RMSE by number of changepoints. Source: Author’s Calculations]

The RMSE is minimised with 6 changepoints, so the model is configured with that number.
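
The search over changepoint counts can be sketched as a simple loop that scores each candidate by RMSE and keeps the lowest. The helper names below (`select_n_changepoints`, `prophet_forecast`, `toy_forecast`) are hypothetical, and it is an assumption that the article varied Prophet's `n_changepoints` argument. The toy forecast is a stand-in so the selection logic runs without Prophet installed.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def select_n_changepoints(candidates, forecast_fn, actual):
    """Return (best_n, scores); scores maps each candidate to its RMSE."""
    scores = {n: rmse(actual, forecast_fn(n)) for n in candidates}
    best_n = min(scores, key=scores.get)
    return best_n, scores

def prophet_forecast(n_changepoints, train_df, periods):
    """Hypothetical helper: fit Prophet and forecast `periods` days ahead."""
    from prophet import Prophet  # lazy import; needs the prophet package
    model = Prophet(seasonality_mode='additive', n_changepoints=n_changepoints)
    model.fit(train_df)
    future = model.make_future_dataframe(periods=periods)
    return model.predict(future)['yhat'].tail(periods).tolist()

# Toy stand-in: the error is smallest at n = 6, purely for illustration.
toy_actual = [10.0, 12.0, 11.0, 13.0]

def toy_forecast(n):
    bias = abs(n - 6)
    return [a + bias for a in toy_actual]

best_n, scores = select_n_changepoints(range(2, 11, 2), toy_forecast, toy_actual)
print(best_n, scores)
```

In real use, `forecast_fn` would wrap `prophet_forecast` with the training data and the test horizon. Because the count is chosen on the test set here, the reported RMSE is likely to be somewhat optimistic; a separate validation slice would avoid that.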

Here is a plot of the changepoints:

import matplotlib.pyplot as plt

figure = pro_change.plot(forecast)
# Draw a dashed vertical line at each changepoint
for changepoint in pro_change.changepoints:
    plt.axvline(changepoint, ls='--', lw=1)
[Figure: forecast with changepoints marked. Source: Jupyter Notebook Output]

Results

The Prophet model above was trained on the first 80% of the dataset (the training set). Predictions are now generated for the latter 20% (the test set), and the root mean squared error and the mean forecast error are calculated.

>>> import numpy as np
>>> from math import sqrt
>>> from sklearn.metrics import mean_squared_error
>>> mse = mean_squared_error(actual, predicted)
>>> rmse = sqrt(mse)
>>> print('RMSE: %f' % rmse)
RMSE: 1747.629375
>>> forecast_error = (actual-predicted)
>>> mean_forecast_error = np.mean(forecast_error)
>>> mean_forecast_error
-681.8562874251497
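
The evaluation steps can also be sketched on toy numbers without any third-party libraries: a chronological 80/20 split, followed by the two error measures. The data, the forecast values and the function names below are invented for illustration only.

```python
import math

def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered list into (train, test) without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mean_forecast_error(actual, predicted):
    """Mean of (actual - predicted): a signed measure of bias."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

# Made-up daily page views, including one spike.
views = [120, 110, 115, 100, 95, 90, 400, 85, 80, 75]
train, test = chronological_split(views)

predicted = [90, 90]  # hypothetical forecast for the two test days
print(rmse(test, predicted))
print(mean_forecast_error(test, predicted))
```

Because the error is defined as actual minus predicted, a negative mean forecast error, as in the session above, means the forecasts were on average higher than the observed values.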

With a root mean squared error of 1,747 and a maximum value of 9,699 in the test set, the error seems reasonable in comparison: it is approximately 18% of the maximum value.

Taking the 90th percentile as the maximum instead, in the interests of excluding abnormally large values, gives a value of 5,378.

>>> np.quantile(actual, 0.9)
5378.3

Assuming this value to be the maximum, the root mean squared error amounts to approximately 32% of it, which is significantly higher.
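
The two percentages come from dividing the reported RMSE by each reference value. A quick check, using only the figures quoted in this section (the variable names are illustrative):

```python
rmse = 1747.629375   # reported test-set RMSE
test_max = 9699      # maximum value in the test set
test_p90 = 5378.3    # 90th percentile of the test set

share_of_max = rmse / test_max
share_of_p90 = rmse / test_p90

print(f"RMSE relative to the maximum: {share_of_max:.0%}")
print(f"RMSE relative to the 90th percentile: {share_of_p90:.0%}")
```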

Here is a plot of the predicted vs actual values:

[Figure: predicted vs actual page views. Source: Jupyter Notebook Output]

We can see that while the Prophet model generally captures the overall trend, it fails to predict the large “spikes” in the data, which contributes to a higher prediction error.

From this standpoint, while Prophet can still be useful in determining a long-term trend for the data, predicting anomalies can be accomplished more effectively with a Monte Carlo simulation.

In particular, web page view statistics appear to follow a Pareto distribution: for example, in any given 100 days, roughly 80% of the page views will be recorded on 20% of the days. If you are interested, another article on this topic is included under the References section.
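
One way to explore the Pareto remark is a small Monte Carlo experiment: draw many 100-day windows from a Pareto distribution and measure what share of the total falls on the busiest 20% of days. This is only a sketch under stated assumptions. The shape parameter 1.16 is the textbook value associated with an 80/20 split, not one fitted to the earthquake data, and the function names are invented for the example.

```python
import random

def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest `top_fraction` of values."""
    ordered = sorted(values, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)

def simulate_top_shares(n_days=100, alpha=1.16, n_runs=1000, seed=42):
    """Simulate `n_runs` windows of Pareto-distributed daily views."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    shares = []
    for _ in range(n_runs):
        views = [rng.paretovariate(alpha) for _ in range(n_days)]
        shares.append(top_share(views))
    return shares

shares = simulate_top_shares()
print(min(shares), sum(shares) / len(shares), max(shares))
```

The spread between the minimum and maximum share shows how much a single extreme day can dominate a finite window, which is the behaviour a trend model such as Prophet does not capture.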

Conclusion

In this example, we have seen that Prophet can be a useful time series tool for predicting the overall trend of a series and identifying changepoints, that is, significant and abrupt shifts in a time series.

However, web page views are somewhat unusual in that they are subject to a high degree of anomalies, and Prophet can be limited in predicting extreme values that do not necessarily depend on seasonality or other time-dependent factors.

Many thanks for reading; any questions or feedback are very welcome. You can also find the code and datasets for this example at the relevant MGCodesandStats GitHub repository under the References section.

Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.

Translated from: https://towardsdatascience.com/can-prophet-accurately-forecast-web-page-views-3537fe72e11b
