Code from the course slides plus my own notes. I have not fully figured everything out yet, so updates should follow. Questions and discussion are welcome!
Data Description
When Elon Musk tweeted "Gamestonk!!" and a link to the WallStreetBets Reddit thread, GameStop shares surged.
Objective
Use an Interrupted Time Series analysis to investigate how Elon Musk's tweet affected GME's share price.
Metrics:
- Level Change = Start Level of Post Interval - End Level of Pre Interval
- i.e. Post-Interval-pred[0] – Pre-Interval-pred[-1]
- Slope Change = Slope of Post Interval - Slope of Pre Interval
- i.e. Post-Interval-Coefficient – Pre-Interval-Coefficient
About Level
Note: according to some definitions, Level is the average of the values in an interval. However, by the definition used here, Level is the first and last value of the fitted model. This definition is closer to the immediate effect.
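To make the two definitions concrete, a toy sketch (hypothetical numbers, not the GME data):

import numpy as np

pre_fitted = np.array([10.0, 12.0, 14.0, 16.0])   # fitted values in the pre interval
post_fitted = np.array([40.0, 38.0, 36.0])        # fitted values in the post interval

mean_based = post_fitted.mean() - pre_fitted.mean()   # "average of the interval" definition: 25.0
immediate = post_fitted[0] - pre_fitted[-1]           # definition used here (immediate effect): 24.0
print(mean_based, immediate)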
The Algorithm (Step by Step)
- Do a separate linear regression in each interval (Pre Interval and Post Interval)
- Linear regression: the simplest, most basic form for quantifying the change
- Check for autocorrelation in the residuals of the linear model in each interval with the Durbin-Watson test. If no autocorrelation is found, jump straight to the level/slope step below.
- If there is autocorrelation in the residuals of a model, some complexity in the real data has not been captured by it → the model has to be replaced
- If any autocorrelation was found, then for that interval we put the linear model aside and try SARIMAX instead
- Check for autocorrelation in the residuals of the SARIMAX model in each interval with the Durbin-Watson test.
- If autocorrelation is still found, we put SARIMAX aside as well and conclude that a linear description is not a viable option for that interval
- In the accepted model (either the linear model or SARIMAX), we take pred[0] (the first fitted value) and pred[-1] (the last fitted value) as the levels at the beginning and end of the interval, and take the coefficient of the "row number" variable as the slope of the model in that interval.
- Calculating (see the sketch after this list):
➢Level Change = Post-Interval-pred[0] – Pre-Interval-pred[-1]
➢Slope Change = Post-Interval-Coefficient – Pre-Interval-Coefficient
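A minimal sketch of this decision flow for a single interval (an illustration, not the original notebook code): it assumes a DataFrame df with Close and row_number columns and uses a placeholder non-seasonal order, whereas the notebook picks the orders with auto_arima further below.

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_interval(df, sarimax_order=(1, 1, 1)):
    """Return (level_start, level_end, slope) for one interval, falling back to SARIMAX if needed."""
    X = sm.add_constant(df['row_number'])            # adds the intercept column, like patsy does
    ols_res = sm.OLS(df['Close'], X).fit()
    if 1.5 < durbin_watson(ols_res.resid) < 2.5:     # residuals look uncorrelated -> keep the linear model
        preds = ols_res.predict(X)
        return preds.iloc[0], preds.iloc[-1], ols_res.params['row_number']
    # Otherwise retry with SARIMAX, keeping row_number as an exogenous regressor
    sar_res = SARIMAX(df['Close'], exog=X, order=sarimax_order).fit(disp=False)
    if not (1.5 < durbin_watson(sar_res.resid) < 2.5):
        raise ValueError('Residuals still autocorrelated: a linear description does not fit this interval')
    preds = sar_res.predict(start=df.index[0], end=df.index[-1])
    return preds.iloc[0], preds.iloc[-1], sar_res.params['row_number']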
Import data and visualize it
import yfinance as yf
import matplotlib.pyplot as plt
from pandas_datareader import data as web
yf.pdr_override() # needed so that pandas_datareader pulls the data through yfinance
start_date = '2020-10-01'
end_date = '2021-02-11'
data = web.DataReader('GME', data_source='yahoo', start=start_date, end=end_date)
close = data['Close']
ax = close.plot(title='GameStop Share Price')
ax.set_xlabel('Date')
ax.set_ylabel('Close (US$)')
ax.grid()
plt.show()
[*********************100%***********************] 1 of 1 completed
Get the level and slope of the Pre Interval and the Post Interval separately
Slicing Data into Pre and Post Intervals
import numpy as np
data_pre = data['2020-10-01':'2021-01-26']
data_pre = data_pre.copy() # without the copy, the column assignment below triggers a red SettingWithCopyWarning (see note below)
data_pre['row_number'] = np.arange(data_pre.shape[0])
# row_number plays the role of x in the linear regression
data_pre
| Date | Open | High | Low | Close | Adj Close | Volume | row_number |
|---|---|---|---|---|---|---|---|
| 2020-10-01 | 10.090000 | 10.250000 | 9.690000 | 9.770000 | 9.770000 | 4554100 | 0 |
| 2020-10-02 | 9.380000 | 9.780000 | 9.300000 | 9.390000 | 9.390000 | 4340500 | 1 |
| 2020-10-05 | 9.440000 | 9.590000 | 9.250000 | 9.460000 | 9.460000 | 2805000 | 2 |
| 2020-10-06 | 9.560000 | 9.840000 | 9.100000 | 9.130000 | 9.130000 | 4535400 | 3 |
| 2020-10-07 | 9.230000 | 9.560000 | 9.170000 | 9.360000 | 9.360000 | 3308600 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-01-20 | 37.369999 | 41.189999 | 36.060001 | 39.119999 | 39.119999 | 33471800 | 75 |
| 2021-01-21 | 39.230000 | 44.750000 | 37.000000 | 43.029999 | 43.029999 | 56216900 | 76 |
| 2021-01-22 | 42.590000 | 76.760002 | 42.320000 | 65.010002 | 65.010002 | 197157900 | 77 |
| 2021-01-25 | 96.730003 | 159.179993 | 61.130001 | 76.790001 | 76.790001 | 177874000 | 78 |
| 2021-01-26 | 88.559998 | 150.000000 | 80.199997 | 147.979996 | 147.979996 | 178588000 | 79 |

80 rows × 7 columns
row_number plays the role of x in the linear regression.

data_pre = data_pre.copy() is needed because the date slice may be a view of data; without the copy, adding the row_number column triggers this red warning:

<ipython-input-14-a1b62b3d68ff>:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
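As the warning itself suggests, the ambiguity comes from the slice possibly being a view of data; an equivalent sketch that avoids the warning is to slice with .loc and copy before adding the column:

data_pre = data.loc['2020-10-01':'2021-01-26'].copy()  # explicit, independent copy of the slice
data_pre['row_number'] = np.arange(len(data_pre))      # safe to add columns now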
data_post = data['2021-01-27':'2021-02-11']
data_post = data_post.copy()
data_post['row_number'] = np.arange(data_post.shape[0])
data_post
| Date | Open | High | Low | Close | Adj Close | Volume | row_number |
|---|---|---|---|---|---|---|---|
| 2021-01-27 | 354.829987 | 380.000000 | 249.000000 | 347.510010 | 347.510010 | 93396700 | 0 |
| 2021-01-28 | 265.000000 | 483.000000 | 112.250000 | 193.600006 | 193.600006 | 58815800 | 1 |
| 2021-01-29 | 379.709991 | 413.980011 | 250.000000 | 325.000000 | 325.000000 | 50566100 | 2 |
| 2021-02-01 | 316.559998 | 322.000000 | 212.000000 | 225.000000 | 225.000000 | 37382200 | 3 |
| 2021-02-02 | 140.759995 | 158.000000 | 74.220001 | 90.000000 | 90.000000 | 78183100 | 4 |
| 2021-02-03 | 112.010002 | 113.400002 | 85.250000 | 92.410004 | 92.410004 | 42698500 | 5 |
| 2021-02-04 | 91.190002 | 91.500000 | 53.330002 | 53.500000 | 53.500000 | 62427300 | 6 |
| 2021-02-05 | 54.040001 | 95.000000 | 51.090000 | 63.770000 | 63.770000 | 81345000 | 7 |
| 2021-02-08 | 72.410004 | 72.660004 | 58.020000 | 60.000000 | 60.000000 | 25687300 | 8 |
| 2021-02-09 | 56.610001 | 57.000000 | 46.520000 | 50.310001 | 50.310001 | 26843100 | 9 |
| 2021-02-10 | 50.770000 | 62.830002 | 46.549999 | 51.200001 | 51.200001 | 36455000 | 10 |
Pre Interval
Linear Regression
from patsy import dmatrices
expr = 'Close' + ' ~ ' + 'row_number' # the patsy formula_like string (explained below)
y_train, x_train = dmatrices(expr,data_pre,return_type='dataframe')
y_train
| Date | Close |
|---|---|
| 2020-10-01 | 9.770000 |
| 2020-10-02 | 9.390000 |
| 2020-10-05 | 9.460000 |
| 2020-10-06 | 9.130000 |
| 2020-10-07 | 9.360000 |
| ... | ... |
| 2021-01-20 | 39.119999 |
| 2021-01-21 | 43.029999 |
| 2021-01-22 | 65.010002 |
| 2021-01-25 | 76.790001 |
| 2021-01-26 | 147.979996 |

80 rows × 1 columns
expr = 'Close' + ' ~ ' + 'row_number' is the formula_like argument.
dmatrices():
- Construct two design matrices given a formula_like and data.
- By convention, the first matrix is the "outcome" or "y" data, and the second is the "predictor" or "x" data.
So y_train corresponds to 'Close', and x_train corresponds to 'row_number' (plus an Intercept column).
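For comparison only (this is not in the original notebook), the same two matrices can be built by hand, which makes the all-ones Intercept column explicit:

import numpy as np
import pandas as pd

y_manual = data_pre[['Close']]                                 # the "outcome" matrix, same as y_train
x_manual = pd.DataFrame({'Intercept': np.ones(len(data_pre)),  # patsy adds this column automatically
                         'row_number': data_pre['row_number'].to_numpy()},
                        index=data_pre.index)                  # the "predictor" matrix, same as x_train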
x_train
| Date | Intercept | row_number |
|---|---|---|
| 2020-10-01 | 1.0 | 0.0 |
| 2020-10-02 | 1.0 | 1.0 |
| 2020-10-05 | 1.0 | 2.0 |
| 2020-10-06 | 1.0 | 3.0 |
| 2020-10-07 | 1.0 | 4.0 |
| ... | ... | ... |
| 2021-01-20 | 1.0 | 75.0 |
| 2021-01-21 | 1.0 | 76.0 |
| 2021-01-22 | 1.0 | 77.0 |
| 2021-01-25 | 1.0 | 78.0 |
| 2021-01-26 | 1.0 | 79.0 |

80 rows × 2 columns
After fitting, the intercept term of the model = this Intercept column (all 1.0) × the estimated intercept coefficient.
Linear Model
Any Auto-Correlation in the Residuals: Durbin Watson Test
from statsmodels.regression import linear_model
olsr_results = linear_model.OLS(y_train, x_train).fit()
fs = olsr_results.summary()
fs
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | R-squared: | 0.287 |
| Model: | OLS | Adj. R-squared: | 0.278 |
| Method: | Least Squares | F-statistic: | 31.45 |
| Date: | Sun, 30 Jan 2022 | Prob (F-statistic): | 2.98e-07 |
| Time: | 14:38:15 | Log-Likelihood: | -332.31 |
| No. Observations: | 80 | AIC: | 668.6 |
| Df Residuals: | 78 | BIC: | 673.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.3585 | 3.457 | 0.682 | 0.497 | -4.523 | 9.240 |
| row_number | 0.4237 | 0.076 | 5.608 | 0.000 | 0.273 | 0.574 |

| | | | |
|---|---|---|---|
| Omnibus: | 122.674 | Durbin-Watson: | 0.312 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3971.926 |
| Skew: | 5.078 | Prob(JB): | 0.00 |
| Kurtosis: | 35.992 | Cond. No. | 90.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
AutocorrelationValue_For_OLS_Errors = float(fs.tables[2].data[0][3])
# the Durbin-Watson statistic, read out of the summary table
preds = olsr_results.predict(x_train)
print(AutocorrelationValue_For_OLS_Errors)
0.312
OLS: ordinary least squares
Review: p-value > 0.05 → not significant
If the Durbin-Watson statistic lies between 1.5 and 2.5, there is no autocorrelation in the residuals and the model is acceptable.
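The cell below reads the Durbin-Watson value out of the rendered summary table; as a cross-check, the same statistic can also be computed directly on the residuals:

from statsmodels.stats.stattools import durbin_watson
print(durbin_watson(olsr_results.resid))  # ≈ 0.31, well below 1.5 → residuals are positively autocorrelated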
import pandas as pd

if 1.5 < AutocorrelationValue_For_OLS_Errors < 2.5:
    """no autocorrelation in the residuals"""
    Level_Start = preds[0]
    Level_End = preds[-1]
    Slope = olsr_results.params.row_number
    result_tab1_html = olsr_results.summary().tables[1].as_html()
    result_tab1_pandas = pd.read_html(result_tab1_html, header=0, index_col=0)[0]
    if (result_tab1_pandas["P>|t|"].iloc[0] < 0.05) and (result_tab1_pandas["P>|t|"].iloc[1] < 0.05):
        Significance_Status = 3
        """
        Which means both intercept and coefficient are significant,
        so Significance_Status = 3 is ideal
        """
    elif result_tab1_pandas["P>|t|"].iloc[1] < 0.05:
        """coefficient is significant"""
        Significance_Status = 2
    elif result_tab1_pandas["P>|t|"].iloc[0] < 0.05:
        """intercept is significant"""
        Significance_Status = 1
    else:
        Significance_Status = 0
else:
    """has autocorrelation, not acceptable"""
    print("We have to go for SARIMAX model")
We have to go for SARIMAX model
Understanding:

elif result_tab1_pandas["P>|t|"].iloc[1] < 0.05: ("coefficient is significant")
- the slope part of the fitted model is row_number × the estimated coefficient of row_number
- this branch means row_number is significant (its p-value in this fit is < 0.05)

elif result_tab1_pandas["P>|t|"].iloc[0] < 0.05: ("intercept is significant")
- the intercept part of the fitted model is the Intercept column (all 1.0) × the estimated intercept
- this branch means the intercept is significant (its p-value in this fit is < 0.05)

Question: why is the coefficient treated as more important than the intercept?
Go to SARIMAX
Find best Parameters
from pmdarima.arima import auto_arima
stepwise_fit = auto_arima(y_train, m=12, seasonal=True, d=None, D=1, trace=True, error_action='ignore', suppress_warnings=True, stepwise=True)
Performing stepwise search to minimize aic
ARIMA(2,2,2)(1,1,1)[12] : AIC=456.685, Time=0.26 sec
ARIMA(0,2,0)(0,1,0)[12] : AIC=471.333, Time=0.01 sec
ARIMA(1,2,0)(1,1,0)[12] : AIC=460.265, Time=0.11 sec
ARIMA(0,2,1)(0,1,1)[12] : AIC=469.429, Time=0.05 sec
ARIMA(2,2,2)(0,1,1)[12] : AIC=454.714, Time=0.21 sec
ARIMA(2,2,2)(0,1,0)[12] : AIC=453.718, Time=0.09 sec
ARIMA(2,2,2)(1,1,0)[12] : AIC=455.137, Time=0.15 sec
ARIMA(1,2,2)(0,1,0)[12] : AIC=455.900, Time=0.06 sec
ARIMA(2,2,1)(0,1,0)[12] : AIC=456.842, Time=0.04 sec
ARIMA(3,2,2)(0,1,0)[12] : AIC=455.130, Time=0.10 sec
ARIMA(2,2,3)(0,1,0)[12] : AIC=inf, Time=0.20 sec
ARIMA(1,2,1)(0,1,0)[12] : AIC=459.986, Time=0.03 sec
ARIMA(1,2,3)(0,1,0)[12] : AIC=455.256, Time=0.14 sec
ARIMA(3,2,1)(0,1,0)[12] : AIC=453.133, Time=0.05 sec
ARIMA(3,2,1)(1,1,0)[12] : AIC=452.891, Time=0.15 sec
ARIMA(3,2,1)(2,1,0)[12] : AIC=453.425, Time=0.44 sec
ARIMA(3,2,1)(1,1,1)[12] : AIC=453.520, Time=0.30 sec
ARIMA(3,2,1)(0,1,1)[12] : AIC=451.675, Time=0.19 sec
ARIMA(3,2,1)(0,1,2)[12] : AIC=453.401, Time=0.39 sec
ARIMA(3,2,1)(1,1,2)[12] : AIC=455.210, Time=0.54 sec
ARIMA(2,2,1)(0,1,1)[12] : AIC=457.550, Time=0.11 sec
ARIMA(3,2,0)(0,1,1)[12] : AIC=451.074, Time=0.11 sec
ARIMA(3,2,0)(0,1,0)[12] : AIC=451.292, Time=0.04 sec
ARIMA(3,2,0)(1,1,1)[12] : AIC=453.066, Time=0.15 sec
ARIMA(3,2,0)(0,1,2)[12] : AIC=453.061, Time=0.20 sec
ARIMA(3,2,0)(1,1,0)[12] : AIC=451.767, Time=0.09 sec
ARIMA(3,2,0)(1,1,2)[12] : AIC=inf, Time=0.81 sec
ARIMA(2,2,0)(0,1,1)[12] : AIC=460.167, Time=0.10 sec
ARIMA(4,2,0)(0,1,1)[12] : AIC=451.863, Time=0.19 sec
ARIMA(4,2,1)(0,1,1)[12] : AIC=453.671, Time=0.27 sec
ARIMA(3,2,0)(0,1,1)[12] intercept : AIC=452.064, Time=0.13 sec
Best model: ARIMA(3,2,0)(0,1,1)[12]
Total fit time: 5.740 seconds
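As a side note, the selected orders do not have to be retyped by hand; they can be read off the fitted pmdarima object returned above:

order = stepwise_fit.order                    # (3, 2, 0)
seasonal_order = stepwise_fit.seasonal_order  # (0, 1, 1, 12)
print(order, seasonal_order)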
SARIMAX
Any Auto-Correlation in the Residuals: Durbin Watson Test
from statsmodels.tsa.statespace.sarimax import SARIMAX
sarimax_model = SARIMAX(endog=y_train, exog=x_train, order=(3,2,0),seasonal_order=(0,1,1,12),enforce_stationarity=False)
results = sarimax_model.fit()
from statsmodels.stats.stattools import durbin_watson
AutocorrelationValue_For_ARIMAX_Errors = durbin_watson(results.resid)
AutocorrelationValue_For_ARIMAX_Errors
1.5611495127853794
1.5 < Autocorrelation Value = 1.56 < 2.5
➔ the model has been successful enough not to leave any meaningful data in the residuals.
exog=x_train
- exogenous variable = x_train (mainly row_number)
- this is the extra argument compared with the model used while searching for the best parameters
- before: SARIMA
- now: SARIMAX, where X stands for the exogenous variable
Why?
- A time series has its own internal structure → lags
- The linear regression contributes the exogenous variable → row_number (i.e. x)
- → SARIMAX thus combines the time-series structure with the linear-regression part
results.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | No. Observations: | 80 |
| Model: | SARIMAX(3, 2, 0)x(0, 1, [1], 12) | Log Likelihood | -178.900 |
| Date: | Sun, 30 Jan 2022 | AIC | 371.799 |
| Time: | 14:38:22 | BIC | 385.591 |
| Sample: | 0 | HQIC | 377.103 |
| | - 80 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.13e-08 | 5.77e+05 | 1.06e-13 | 1.000 | -1.13e+06 | 1.13e+06 |
| row_number | 5.887e-07 | 7.23e+05 | 8.14e-13 | 1.000 | -1.42e+06 | 1.42e+06 |
| ar.L1 | -0.3810 | 0.337 | -1.129 | 0.259 | -1.042 | 0.280 |
| ar.L2 | 1.0854 | 0.188 | 5.761 | 0.000 | 0.716 | 1.455 |
| ar.L3 | 1.1965 | 0.262 | 4.559 | 0.000 | 0.682 | 1.711 |
| ma.S.L12 | -0.5392 | 0.565 | -0.954 | 0.340 | -1.647 | 0.568 |
| sigma2 | 48.5960 | 6.597 | 7.367 | 0.000 | 35.666 | 61.526 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.18 | Jarque-Bera (JB): | 199.64 |
| Prob(Q): | 0.67 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 47.87 | Skew: | 0.91 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 12.33 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.26e+16. Standard errors may be unstable.
The p-values of Intercept and row_number are both > 0.05
→ the intercept and the coefficient of the linear part are not significant
→ take Slope = 0 and Intercept = 0
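A sketch of the same conclusion done programmatically (Slope_pre and Intercept_pre are illustrative names, not from the original notebook):

Slope_pre = results.params['row_number'] if results.pvalues['row_number'] < 0.05 else 0.0
Intercept_pre = results.params['Intercept'] if results.pvalues['Intercept'] < 0.05 else 0.0
print(Slope_pre, Intercept_pre)  # both 0.0 here, since the p-values are ≈ 1.0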
Getting the results of pre interval
preds = results.predict(start=min(y_train.index), end=max(y_train.index))
Level_Start = preds[0]
Level_End = preds[-1]
print('Level_Start:',Level_Start)
print('Level_End:',Level_End)
Level_Start: 6.129752604487791e-08
Level_End: 116.31853590382661
Post Interval
Linear Regression
expr = 'Close' + ' ~ ' + 'row_number' # same patsy formula_like as for the pre interval
y_train, x_train = dmatrices(expr,data_post,return_type='dataframe')
y_train
| Date | Close |
|---|---|
| 2021-01-27 | 347.510010 |
| 2021-01-28 | 193.600006 |
| 2021-01-29 | 325.000000 |
| 2021-02-01 | 225.000000 |
| 2021-02-02 | 90.000000 |
| 2021-02-03 | 92.410004 |
| 2021-02-04 | 53.500000 |
| 2021-02-05 | 63.770000 |
| 2021-02-08 | 60.000000 |
| 2021-02-09 | 50.310001 |
| 2021-02-10 | 51.200001 |
Linear Model
Any Auto-Correlation in the Residuals: Durbin Watson Test
olsr_results = linear_model.OLS(y_train, x_train).fit()
fs = olsr_results.summary()
fs
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.703 |
| Method: | Least Squares | F-statistic: | 24.66 |
| Date: | Sun, 30 Jan 2022 | Prob (F-statistic): | 0.000774 |
| Time: | 14:38:22 | Log-Likelihood: | -59.834 |
| No. Observations: | 11 | AIC: | 123.7 |
| Df Residuals: | 9 | BIC: | 124.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 286.9668 | 34.752 | 8.257 | 0.000 | 208.352 | 365.582 |
| row_number | -29.1697 | 5.874 | -4.966 | 0.001 | -42.458 | -15.881 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.306 | Durbin-Watson: | 1.819 |
| Prob(Omnibus): | 0.520 | Jarque-Bera (JB): | 0.734 |
| Skew: | 0.117 | Prob(JB): | 0.693 |
| Kurtosis: | 1.756 | Cond. No. | 11.3 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- row_number's p-value < 0.05 → row_number is significant → slope = the coefficient of row_number
- 1.5 < Durbin-Watson (1.819) < 2.5 → no autocorrelation in the residuals → the linear model is acceptable
preds = olsr_results.predict(x_train)
Level_Start = preds[0]
Level_End = preds[-1]
print('Level_Start:',Level_Start)
print('Level_End:',Level_End)
Level_Start: 286.9668230576948
Level_End: -4.730455398559627
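The slope used in the results below is the row_number coefficient of this post-interval OLS fit, extracted the same way as for the pre interval (Slope_post is an illustrative name):

Slope_post = olsr_results.params['row_number']
print('Slope:', Slope_post)  # ≈ -29.17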
Results
Level Change = Start Level of Post Interval - End Level of Pre Interval
Level Change = 286.97 - 116.32 = 170.65
Overall, GME's close price jumped by about 170.65 immediately after the tweet.
Slope Change = Slope of Post Interval - Slope of Pre Interval
Slope Change = -29.17 - 0 = -29.17
The slope change is negative because the tweet raised the price immediately and the price then fell,
i.e. a short-term success but a long-term failure.
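Putting the pieces together as code (a sketch; the variable names are illustrative and the values are copied from the outputs above):

pre_level_end = 116.3185      # Level_End of the pre-interval SARIMAX fit
pre_slope = 0.0               # linear part of the SARIMAX model was not significant
post_level_start = 286.9668   # Level_Start of the post-interval OLS fit
post_slope = -29.1697         # row_number coefficient of the post-interval OLS fit

level_change = post_level_start - pre_level_end   # ≈ 170.65: immediate jump after the tweet
slope_change = post_slope - pre_slope             # ≈ -29.17: the trend turns sharply downward afterwards
print(level_change, slope_change)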