使用机器学习算法预测时间序列时注意落入陷阱

最新推荐文章于 2024-03-24 13:52:39 发布

Andrew_jdw

最新推荐文章于 2024-03-24 13:52:39 发布

阅读量1.5k

点赞数

分类专栏： Python学习笔记

Python学习笔记专栏收录该内容

39 篇文章 0 订阅

订阅专栏

Data Scientist, AI/Machine Learning & Advanced Analytics at Kongsberg Digital

Time series forecasting is an important area of machine learning. It is important because there are so many prediction problems that involve a time component. However, while the time component adds additional information, it also makes time series problems more difficult to handle compared to many other prediction tasks.

This post will go through the task of time series forecasting using machine learning, and how to avoid some of the common pitfalls. Through a concrete example, I will demonstrate how one could seemingly have a good model and decide to put it into production, whereas in reality, the model might have no predictive power whatsoever, More specifically, I will focus on how to evaluate your model accuracy, and show how relying simply on common error metrics such as mean percentage error, R2 score etc. can be very misleading if they are applied without caution.

This is simply WRONG....

From the above figures and calculated error metrics, the model is apparently giving accurate predictions. However, this is not the case at all, and is simply an example of how choosing the wrong accuracy metric can be very misleading when evaluating model performance. In this example, for the sake of illustration, the data was explicitly chosen to represent data that actually cannot be predicted. More specifically, the data I called "stock index", was actually modeled using a random walk process. As the name indicates, a random walk is a completely stochastic process. Due to this, the idea of using historical data as a training set in order to learn the behavior and predict future outcomes is simply not possible. Given this, how could it then be that the model is seemingly giving us such accurate predictions? As I will get back to in more detail, it all comes down to the (wrong) choice of accuracy metric.

How to implement the models using open source software libraries

I usually define my neural network type of models using Keras, which is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. For other types of models I usually use Scikit-Learn, which is a free software machine learning library, It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to inter-operate with the Python numerical and scientific libraries NumPy and SciPy.

However, the main topic of this article is not how to implement a time series forecasting model, but rather how to evaluate the model predictions. Due to this, I will not go into the details of model building etc., as there are plenty of other blog posts and articles covering those subjects.

Accuracy metrics can be very misleading when used incorrectly

This means that when evaluating the model in terms of its ability of predicting the value directly, common error metrics such as mean percentage error and R2 score both indicate a high prediction accuracy. However, as the example data is generated through a random walk process, the model cannot possibly predict future outcomes. This underlines the important fact that simply evaluating the models predictive powers through directly calculating common error metrics can be very misleading, and one can easily be fooled into being overly confident in the model accuracy.

Is your time series a random walk?

Your time series may actually be a random walk, and some ways to check this are as follows:

The time series shows a strong temporal dependence (autocorrelation) that decays linearly or in a similar pattern.
The time series is non-stationary and making it stationary shows no obviously learnable structure in the data.
The persistence model (using the observation at the previous time step as what will happen in the next time step) provides the best source of reliable predictions.

This last point is key for time series forecasting. Baseline forecasts with the persistence model quickly indicate whether you can do significantly better. If you can’t, you’re probably dealing with a random walk (or close to it). The human mind is hardwired to look for patterns everywhere and we must be vigilant that we are not fooling ourselves and wasting time by developing elaborate models for random walk processes.

Summary

The main point I would like to emphasize through this article, is to be very careful when evaluating your model performance in terms of prediction accuracy. As shown through the above example, even for a completely random process, where predicting future outcomes is by definition impossible, one can easily be fooled. By simply defining a model, making some predictions and calculating common accuracy metrics, one could seemingly have a good model and decide to put it into production. Whereas, in reality, the model might have no predictive power whatsoever.

原文链接:https://www.kognifai.com/blog/avoid-the-pitfalls

python实现persistence model:https://machinelearningmastery.com/persistence-time-series-forecasting-with-python/