Forecasting Tesla's Stock Price Using Autoregression

Latest recommended article published on 2022-01-23 14:32:17

weixin_26752765

Original article: https://towardsdatascience.com/forecasting-teslas-stock-price-using-autoregression-52e7908d34b6

Tesla has been making waves in financial markets over the last few months. Previously named the most shorted stock in the US [1], Tesla has since seen its stock price catapult the electric carmaker to a market capitalization of $278 billion [2]. Its latest quarterly results suggest that it is now eligible to be added to the S&P 500, of which it is not currently a member despite being the 12th largest company in the US [3].

Amid market volatility, various trading strategies and a sense of “FOMO” (fear of missing out), predicting the returns of Tesla’s stock is a difficult task. However, we are going to use Python to forecast Tesla’s stock price returns using autoregression.

Exploring the data

First, we need to import the data. We may use historical stock price data downloaded from Yahoo Finance. We’re going to use the “Close” price for this analysis.

import pandas as pd

# Load the Yahoo Finance export; the first column (the date) becomes the index
df = pd.read_csv("TSLA.csv", index_col=0, parse_dates=[0])
df.head()

To determine the order of the ARMA model, we can first plot the partial autocorrelation function (PACF). For each lag, this shows the correlation between the series and that lag of itself that is not explained by correlations at all lower-order lags.

From the PACF below, we can see that the significance of the lags cuts off after lag 1, which suggests we should use an autoregressive (AR) model [4].

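# Plot PACF
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import acf, pacf

# nlags=40 returns 41 values (lags 0 to 40), matching the 41 bar positions
plt.bar(x=np.arange(0, 41), height=pacf(df.Close, nlags=40))
plt.title("PACF")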
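
One way to judge where the PACF "cuts off" is to compare each lag with an approximate 95% band of ±1.96/√N. The following is a minimal sketch of that idea, assuming a synthetic AR(1) series in place of the TSLA data. The names `pacf_ols` and `sim`, and the coefficient 0.6, are illustrative choices and not part of the original analysis. It estimates the PACF by successive regressions, which is one of several possible estimators.

```python
import numpy as np

def pacf_ols(series, nlags):
    """Partial autocorrelations via successive OLS fits (lag 0 is 1 by definition)."""
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    values = [1.0]
    for k in range(1, nlags + 1):
        # Regress y_t on y_{t-1}, ..., y_{t-k}; the last coefficient is the PACF at lag k
        target = y[k:]
        lags = np.column_stack([y[k - j:len(y) - j] for j in range(1, k + 1)])
        coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
        values.append(coefs[-1])
    return np.array(values)

# Synthetic AR(1) series standing in for real data (phi = 0.6 is an arbitrary choice)
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
sim = np.zeros(2000)
for t in range(1, 2000):
    sim[t] = 0.6 * sim[t - 1] + noise[t]

pacf_values = pacf_ols(sim, nlags=5)
band = 1.96 / np.sqrt(len(sim))  # approximate 95% bound under the white-noise null
significant_lags = [k for k in range(1, 6) if abs(pacf_values[k]) > band]
print(significant_lags)
```

For a series that really is AR(1), lag 1 should stand well outside the band while higher lags mostly fall inside it, which is the pattern the article reads off the PACF plot.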

Plotting the autocorrelation function (ACF) gives a slightly different picture. The ACF does not cut off; it decays slowly across the lags, which suggests an AR or ARMA model [4]. Taking both the PACF and the ACF into account, we are going to use an AR model.

# Plot ACF
# nlags=40 returns 41 values (lags 0 to 40), matching the 41 bar positions
plt.bar(x=np.arange(0, 41), height=acf(df.Close, nlags=40))
plt.title("ACF")
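
The reasoning behind reading an AR model from these two plots can be made concrete for the simplest case. As a sketch, assuming a stationary AR(1) process (the symbols c, φ, ρ and α are notation introduced here, not taken from the original article), the theoretical ACF decays geometrically while the PACF is zero beyond lag 1:

```latex
\begin{aligned}
Y_t &= c + \phi Y_{t-1} + \varepsilon_t, \qquad |\phi| < 1 \\
\rho_k &= \phi^{k}, \qquad k = 0, 1, 2, \dots \\
\alpha_1 &= \phi, \qquad \alpha_k = 0 \ \text{for } k \ge 2
\end{aligned}
```

Here ρ_k is the autocorrelation at lag k and α_k is the partial autocorrelation at lag k. A slowly decaying ACF combined with a PACF that cuts off after lag 1 is therefore consistent with an AR(1) specification.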

Pre-processing the data

Before we run the model we must make sure we are using stationary data. A series is stationary when its statistical properties, such as its mean and variance, do not change over time. Looking at the raw stock price seen earlier in the article, it is clear that the series is not stationary: the price increases over time in a seemingly exponential manner.

Therefore, to make the series stationary we difference it, which means subtracting each day's value from the next day's value. The resulting series fluctuates around a roughly constant mean near 0 and gives us the daily change in the stock price, which this article treats as the stock return, instead of the price level.

We are also going to lag the differenced series by 1, which brings yesterday’s value forward to today. This is so we can obtain our AR term (Yt-1).

After putting these values into the same DataFrame, we split the data into training and testing sets. In the code, the first 200 observations form the training set and the remainder form the test set, which is roughly an 80:20 split.

# Make the data stationary by differencing
tsla = df.Close.diff().fillna(0)

# Create lag
tsla_lag_1 = tsla.shift(1).fillna(0)

# Put all into one DataFrame
df_regression = pd.DataFrame(tsla)
df_regression["Lag1"] = tsla_lag_1

# Split into train and test data
df_regression_train = df_regression.iloc[0:200]
df_regression_test = df_regression.iloc[200:]

tsla.plot()
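
To see what the differencing and lagging steps do to individual values, here is a small worked example on a toy price series. The four prices are invented purely for illustration; only `diff`, `shift` and `fillna` mirror the calls used above.

```python
import pandas as pd

# Toy closing prices, invented purely for illustration
close = pd.Series([10.0, 12.0, 11.0, 15.0], name="Close")

change = close.diff().fillna(0)   # today's close minus yesterday's close
lag1 = change.shift(1).fillna(0)  # yesterday's change, aligned with today

toy = pd.DataFrame({"Close": change, "Lag1": lag1})
print(toy)
```

The changes are 0, 2, -1, 4 and the lagged column is 0, 0, 2, -1. Note that the leading missing values are filled with 0 rather than dropped, so they enter any later regression as genuine zeros; dropping them instead would be a reasonable alternative.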

Forming the AR model

Now, how many values should we use to predict the next observation? Using all 200 past values may not give a good estimate. Intuitively, stock price activity from 200 days ago is unlikely to have a significant effect on today's value, as numerous factors may have changed since then, including earnings, competition, seasonality and more. Therefore, to find the optimal window of observations to use in the regression, one method is to run a regression with an expanding window. This method, detailed in the code below, runs a regression on the most recent past observation, records the r-squared value (goodness of fit), and then repeats the process, expanding the window by one observation each time. To keep the result economically interpretable, I've set the limit on the size of the window at 30 days.

# Run expanding window regression to find optimal window
import statsmodels.api as sm

n = 0
rsquared = []

while n <= 30:
    # Note: iloc[-0:] equals iloc[0:], so n = 0 uses the full training set
    y = df_regression_train["Close"].iloc[-n:]
    x = df_regression_train["Lag1"].iloc[-n:]
    x = sm.add_constant(x)

    model = sm.OLS(y, x)
    results = model.fit()

    rsquared.append(results.rsquared)
    n += 1
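
The expanding-window idea can also be sketched without statsmodels. The version below is a self-contained illustration, assuming synthetic data in place of the differenced TSLA series. The names `ols_r_squared`, `changes` and `lagged` are hypothetical, and the window starts at 3 observations rather than 1, which is a deliberate departure from the code above.

```python
import numpy as np

def ols_r_squared(x, y):
    """R-squared of a simple regression y = a + b*x fitted by least squares."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    total = y - y.mean()
    return 1.0 - residuals.dot(residuals) / total.dot(total)

rng = np.random.default_rng(1)
changes = rng.normal(size=200)  # hypothetical stand-in for the differenced series
lagged = np.roll(changes, 1)
lagged[0] = 0.0                 # mirror shift(1).fillna(0)

# Start at 3 observations: with 2 points a line fits perfectly and R-squared is trivially 1
window_sizes = range(3, 31)
r2_by_window = {n: ols_r_squared(lagged[-n:], changes[-n:]) for n in window_sizes}
best_window = max(r2_by_window, key=r2_by_window.get)
print(best_window, r2_by_window[best_window])
```

Because r-squared is mechanically inflated in very small samples, picking the window by its maximum alone will tend to favour short windows. That is why some judgement about sample size is still needed when reading the resulting plot.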

Looking at the r-squared plot of each iteration, we can see that it is high at around 1–5 iterations and also has a peak at 13 past values. It may seem tempting to choose one of the values between 1 and 5. However, with such a small sample the regression estimates are unreliable, so this wouldn't give us the best result. Therefore, let's choose the second peak at 13 observations, as this is a more adequate sample size. It gives an r-squared of around 0.437 (i.e. the model explains about 43.7% of the variation in the data).

Running the AR model on the training data

The next step is to use our window of 13 past observations to fit the AR(1) model. We may do this using the OLS function in statsmodels. Code below:

# AR(1) model with static coefficients
import statsmodels.api as sm

y = df_regression_train["Close"].iloc[-13:]
x = df_regression_train["Lag1"].iloc[-13:]
x = sm.add_constant(x)

model = sm.OLS(y, x)
results = model.fit()
results.summary()

As we can see in the statistical summary, both the constant and the first lag are statistically significant at the 10% significance level. Looking at the signs of the coefficients, the positive constant suggests that, all else being equal, the expected stock return is positive. The negative coefficient on the first lag suggests that, ceteris paribus, returns tend to partially reverse: a higher return on one day tends to be followed by a lower return the next day.

Great, now let’s use those coefficients to find the fitted value for Tesla’s stock returns so we can plot the model against the original data. Our model may now be specified as:
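
In general form, an AR(1) model with a constant, evaluated at the estimated coefficients, can be written as follows. This is a sketch of the specification only: the symbols are notation introduced here, and the numerical estimates are not reproduced.

```latex
\hat{Y}_t = \hat{c} + \hat{\phi}\, Y_{t-1}
```

Here \hat{Y}_t is the fitted value for day t, \hat{c} is the estimated constant and \hat{\phi} is the estimated coefficient on the `Lag1` column. Both estimates can be read from `results.params` after fitting.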

Plot Residuals (Actual - Fitted)

The residuals suggest that the model performed better in 2019, but in 2020, as volatility increased, it performed considerably worse (the residuals are larger). This is intuitive, as the volatility experienced in the March 2020 selloff had a large impact on US stocks, while the quick and sizeable rebound was particularly felt by tech stocks. This, along with increased betting on Tesla stock by retail traders on platforms such as Robinhood, has increased price volatility and made the stock harder to predict.

Given these factors, along with our earlier r-squared of around 0.437, we would not expect our AR(1) model to predict the exact stock return. Instead, we can test the model's accuracy by calculating its “hit rate”, i.e. the share of observations where the model predicted a positive value and the actual value was also positive, or predicted a negative value and the actual value was also negative. Summing the true positives and true negatives and dividing by the number of observations, the accuracy of our model comes out at around 55%, which is fairly good for this simple model.
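
The hit-rate calculation can be illustrated on a handful of values. This is a minimal sketch: the coefficients `const` and `phi` and the six data points are hypothetical, chosen only to make the arithmetic easy to follow, and are not the article's estimates.

```python
import numpy as np

def hit_rate(fitted, actual):
    """Share of observations where fitted and actual values have the same sign."""
    fitted = np.asarray(fitted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    true_pos = np.sum((fitted > 0) & (actual > 0))
    true_neg = np.sum((fitted < 0) & (actual < 0))
    return (true_pos + true_neg) / len(actual)

# Hypothetical coefficients and data, for illustration only
const, phi = 0.5, -0.4
actual = np.array([2.0, -1.0, 3.0, -2.0, 1.0, -4.0])
lag1 = np.array([0.0, 2.0, -1.0, 3.0, -2.0, 1.0])  # previous day's value
fitted = const + phi * lag1

print(hit_rate(fitted, actual))
```

Five of the six fitted values share the sign of the actual value, so the hit rate is 5/6. Observations where either value is exactly zero count as misses under this definition.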

Fit model to the test data

Now, let’s apply the same methodology to the test data to see how our model performs out-of-sample.

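# Calculate hit rate
import numpy as np

# df_2_test (built in the full notebook) is assumed to hold the
# "Fitted Value" and "Actual" columns for the test period
true_neg_test = np.sum((df_2_test["Fitted Value"] < 0) & (df_2_test["Actual"] < 0))
true_pos_test = np.sum((df_2_test["Fitted Value"] > 0) & (df_2_test["Actual"] > 0))

accuracy = (true_neg_test + true_pos_test) / len(df_2_test)
print(accuracy)

# Output: 0.6415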

Our hit rate improved to 64% when applying the model to the test data, which is a promising result! Next steps to improve its accuracy may include running a rolling regression, where the coefficients are re-estimated at each step, or perhaps incorporating a moving average (MA) element into the model.
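
As a rough sketch of the rolling-regression idea, the function below refits an AR(1) model on the most recent window before each forecast. It is an assumption-laden illustration rather than the author's method: `rolling_ar1_forecasts` is a hypothetical helper, the data are synthetic, and the window of 13 simply reuses the value chosen earlier.

```python
import numpy as np

def rolling_ar1_forecasts(series, window=13):
    """One-step-ahead AR(1) forecasts, refitting on the latest `window` pairs each step."""
    y = np.asarray(series, dtype=float)
    forecasts = np.full(len(y), np.nan)
    for t in range(window + 1, len(y)):
        past_y = y[t - window:t]            # targets y_{t-window} .. y_{t-1}
        past_lag = y[t - window - 1:t - 1]  # their one-step lags
        phi, const = np.polyfit(past_lag, past_y, deg=1)
        forecasts[t] = const + phi * y[t - 1]
    return forecasts

rng = np.random.default_rng(2)
sim_changes = rng.normal(size=100)  # synthetic stand-in for the differenced series
preds = rolling_ar1_forecasts(sim_changes, window=13)
print(preds[14:19])
```

Each forecast uses only data available before day t, so the first `window + 1` entries are left as NaN. The resulting forecasts could then be scored with the same hit-rate measure.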

Thanks for reading! Please feel free to leave a comment with any insights you may have. The full Jupyter Notebook containing the source code I used for this project can be found on my GitHub repository.
