5月20日股市行情预测分析
机器学习与线性回归 (Machine Learning & Linear Regression)
Predictive modeling using machine learning comes with a trick to generalize new cases and not merely memorizing past cases. In order to achieve that, the ML algorithm must look through multiple rows of data, and different features which have significant correlations with target variable.
使用机器学习P redictive模型配备了一个绝招来概括新病例,而不仅仅是记忆过去的案例。 为了实现这一点,ML算法必须浏览多行数据以及与目标变量具有显着相关性的不同特征。
Most of the online resources which are available, where we can find that, the prediction problem ends with validating on test set. Very few resources are available which clearly shows the actual prediction report with future dates and foretasted prices.
我们可以找到的大多数在线资源都可以通过对测试集进行验证来结束预测问题。 几乎没有可用资源清楚地显示带有未来日期和预定价格的实际预测报告。
Here, our exercise in this article will not only validate the model, but show how to use the developed model to predict future prices. We will use simple linear regression to predict daily closing price of bitcoin based on over 2000 days of historical data. The goal here to predict future 15 days prices which are unknown.
在这里,我们在本文中的练习不仅将验证模型,而且还将展示如何使用已开发的模型来预测未来价格。 我们将使用简单的线性回归,根据超过2000天的历史数据来预测比特币的每日收盘价。 这里的目标是预测未知的未来15天价格。
We picked up crypto currency data. The motivation here is that, despite its high volatility, the price of BTC has been an area where significant efforts for price forecast are going on. Here we loading Bitcoin daily data into pandas data frame.
我们提取了加密货币数据。 这样做的动机是,尽管BTC的价格具有很高的波动性,但它一直是价格预测工作的重中之重。 在这里,我们将比特币每日数据加载到熊猫数据框中。
可视化 (Visualization)
Daily candlestick shows the open, high, low, and close price for the day. This real body represents the price range between the open and close of that day’s trading. When the real body is filled in red or green, it means the close was lower than the open. Here, we have taken last 30 days data to have a clear visualization.
每日烛台显示当天的开盘价,最高价,最低价和收盘价。 该实体代表了当天交易的开盘价与收盘价之间的价格范围。 当实物用红色或绿色填充时,表示收盘价低于开盘价。 在这里,我们使用了过去30天的数据来获得清晰的可视化效果。
统计 (Statistics)
观察结果 (Observations)
- BTC closing price was not over $4342 for almost half of the time (we can see that from mean value of close price). blue dashed line represents the median/mean line. BTC收盘价在几乎一半的时间内都没有超过$ 4342(我们可以从收盘价的平均值中看到)。 蓝色虚线表示中位数/均值线。
技术指标 (Technical indicators)
We have used technical analysis to create a number of features (73) from the existing data set. However, we will not use all the features here; rather we will use top 20 most correlated feature with our target price (‘Close’). By using the below function, we have :
我们已经使用技术分析从现有数据集中创建了许多功能(73)。 但是,我们不会在这里使用所有功能。 相反,我们将使用前20个最相关的功能与我们的目标价格(“关闭”)。 通过使用以下功能,我们可以:
- created multiple features using ohlcv 使用ohlcv创建了多个功能
- dropped open, high, low, volume columns from the data frame keeping the close as our target variable. 从数据框中删除开盘,高盘,低盘,成交量列,将收盘价作为我们的目标变量。
- Based on correlation (Pearson) function, we have identified top 20 most correlated features with target variable (‘Close’). [ 20 features are taken as an example; where n-number of features can be taken based on the problem statement] 基于相关(皮尔逊)函数,我们确定了目标变量(“关闭”)中最相关的前20个特征。 [以20个特征为例; 可以根据问题陈述获取n个特征]
- Finally target variable (‘Close’) is shifted for the 15 days that we want to predict the price for. 最后,将目标变量(“关闭”)转移了15天,我们希望以此预测价格。
创建目标变量 (Creating target variable)
多列火车测试拆分 (Multiple train test split)
We are dealing with time-series data here and need to split the data observed at fixed time intervals, in train/test sets. In each split, test indices would be higher than before. To simplify, we are here repeating the process of splitting the time series into train and test sets multiple times. Here, multiple models to be trained and evaluated at additional computational expense but in return we will have a robust estimate on unseen data.
我们在这里处理时间序列数据,需要将在固定时间间隔观察到的数据拆分为训练/测试集中的数据。 在每个分组中,测试指数将比以前更高。 为简化起见,我们在这里重复多次将时间序列分为训练集和测试集的过程。 在这里,要训练和评估多个模型需要付出额外的计算费用,但作为回报,我们将对看不见的数据进行可靠的估计。
Considering our data size is small, we have used 2 splits. The training set has size i * n_samples // (n_splits + 1) + n_samples % (n_splits + 1) in the i``th split, with a test set of size ``n_samples//(n_splits + 1), where n_samples is the number of samples. More details on this can be found at scikit learn user guide.
考虑到我们的数据量很小,我们使用了2个拆分。 训练集在第i个拆分中具有大小i * n_samples //(n_splits + 1)+ n_samples%(n_splits + 1) ,测试集的大小为`` n_samples //(n_splits + 1) ,其中n_samples是样本数。 有关更多详细信息,请参见scikit Learn用户指南。
线性回归 (Linear Regression)
We see here that, model accuracy dropped from 92.67% to 73.7% on test data. Let us view the residuals to ensure that, linear model fits our given data set.
我们在这里看到,根据测试数据,模型的准确性从92.67%下降到73.7%。 让我们查看残差以确保线性模型适合我们的给定数据集。
残差图 (Residuals Plot)
Residuals are the difference between the observed value of the target variable. The residuals plot shows the difference between residuals on the y- axis and the dependent variable on the x-axis. A common use of the residuals plot is to analyze the variance of the error of the model.
残差是目标变量的观察值之间的差。 残差图显示y轴上的残差与x轴上的因变量之间的差异。 残差图的常见用法是分析模型误差的方差。
If the points are randomly dispersed around the x-axis, a linear regression model is usually appropriate for the data; otherwise, a non-linear model is more appropriate. Here, we see a fairly random, uniform distribution of the residuals against the target in two dimensions. This seems to indicate that our linear model is performing well. We can also see from the histogram that our error is normally distributed around zero, indicating a well fitted model.
如果这些点围绕x轴随机散布,则通常适用于数据的线性回归模型; 否则,非线性模型更为合适。 在这里,我们在二维图中看到相对于目标的残差相当随机,均匀的分布。 这似乎表明我们的线性模型运行良好。 我们还可以从直方图中看到,我们的误差通常在零附近分布,表明模型拟合良好。
In regard to improving the performance on test data, few things can be done:
关于提高测试数据的性能,可以做的事情很少:
- add more training samples, 添加更多的训练样本,
- experiment with additional features or reduce number of features, 尝试其他功能或减少功能数量,
- may experiment with different set of features, 可以尝试不同的功能集,
- try other regressor such as RANSAC or Huber which can deal with outliers. 尝试使用其他可处理异常值的回归器,例如RANSAC或Huber。
预测 (Forecast)
Below we have used our separately kept dateset (X_future_prediction) for 15 days future prediction.
下面,我们将单独保存的日期集(X_future_prediction)用于未来15天的预测。
未来价格可视化 (Future price visualization)
However, considering the high volatility and random walk movement of Bitcoin prices, 73% accuracy may not be a bad model either.
但是,考虑到比特币价格的高波动性和随机走动,73%的准确性也可能不是一个坏模型。
结论 (Conclusion)
We have shown a simplified version of how to predict future 15 days expected price using historical data of over 2000 days. Because of its highly volatile nature, a good prediction model is required on which to base investment decisions. Different other algorithms such as Random Forest, XGBoost, Quadratic Discriminant Analysis, Support Vector Machine, Long Short-term Memory can also be experimented using the similar techniques.
我们已经展示了如何使用超过2000天的历史数据来预测未来15天的预期价格的简化版本。 由于其高度易变性,因此需要一个良好的预测模型作为投资决策的基础。 也可以使用类似的技术尝试其他不同的算法,例如随机森林,XGBoost,二次判别分析,支持向量机,长短期记忆。
I can be reached here.
我可以在 这里 到达 。
5月20日股市行情预测分析