Code from the course slides plus my own notes. I have not fully figured everything out yet, so updates should follow. Questions and discussion are welcome!
Data Description
When Elon Musk tweeted "Gamestonk!!" and a link to the WallStreetBets Reddit thread, GameStop shares surged.
Objective
Use an Interrupted Time Series analysis to investigate how Elon Musk's tweet affected GME's share price.
Metrics:
- Level Change = Start Level of Post Interval - End Level of Pre Interval
- i.e. Post-Interval-pred[0] – Pre-Interval-pred[-1]
- Slope Change = Slope of Post Interval - Slope of Pre Interval
- i.e. Post-Interval-Coefficient – Pre-Interval-Coefficient
About Level
Note: according to some definitions, Level is the average of the values in an interval. However, by the definition used here, Level is the first and last value of the fitted model. This definition is closer to the immediate effect.
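To make the two definitions concrete, a toy sketch (hypothetical numbers, not the GME data):

import numpy as np

pre_fitted = np.array([10.0, 12.0, 14.0, 16.0])   # fitted values in the pre interval
post_fitted = np.array([40.0, 38.0, 36.0])        # fitted values in the post interval

mean_based = post_fitted.mean() - pre_fitted.mean()   # "average of the interval" definition: 25.0
immediate = post_fitted[0] - pre_fitted[-1]           # definition used here (immediate effect): 24.0
print(mean_based, immediate)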
The Algorithm (Step by Step)
- Do a separate linear regression in each interval (Pre Interval and Post Interval)
- Linear regression: the simplest, most basic form for quantifying the change
- Check for autocorrelation in the residuals of the linear model in each interval with the Durbin-Watson test. If no autocorrelation is found, jump straight to the level/slope step below.
- If there is autocorrelation in the residuals of a model, some complexity in the real data has not been captured by it → the model has to be replaced
- If any autocorrelation was found, then for that interval we put the linear model aside and try SARIMAX instead
- Check for autocorrelation in the residuals of the SARIMAX model in each interval with the Durbin-Watson test.
- If autocorrelation is still found, we put SARIMAX aside as well and conclude that a linear description is not a viable option for that interval
- In the accepted model (either the linear model or SARIMAX), we take pred[0] (the first fitted value) and pred[-1] (the last fitted value) as the levels at the beginning and end of the interval, and take the coefficient of the "row number" variable as the slope of the model in that interval.
- Calculating (see the sketch after this list):
➢Level Change = Post-Interval-pred[0] – Pre-Interval-pred[-1]
➢Slope Change = Post-Interval-Coefficient – Pre-Interval-Coefficient
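A minimal sketch of this decision flow for a single interval (an illustration, not the original notebook code): it assumes a DataFrame df with Close and row_number columns and uses a placeholder non-seasonal order, whereas the notebook picks the orders with auto_arima further below.

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.statespace.sarimax import SARIMAX

def fit_interval(df, sarimax_order=(1, 1, 1)):
    """Return (level_start, level_end, slope) for one interval, falling back to SARIMAX if needed."""
    X = sm.add_constant(df['row_number'])            # adds the intercept column, like patsy does
    ols_res = sm.OLS(df['Close'], X).fit()
    if 1.5 < durbin_watson(ols_res.resid) < 2.5:     # residuals look uncorrelated -> keep the linear model
        preds = ols_res.predict(X)
        return preds.iloc[0], preds.iloc[-1], ols_res.params['row_number']
    # Otherwise retry with SARIMAX, keeping row_number as an exogenous regressor
    sar_res = SARIMAX(df['Close'], exog=X, order=sarimax_order).fit(disp=False)
    if not (1.5 < durbin_watson(sar_res.resid) < 2.5):
        raise ValueError('Residuals still autocorrelated: a linear description does not fit this interval')
    preds = sar_res.predict(start=df.index[0], end=df.index[-1])
    return preds.iloc[0], preds.iloc[-1], sar_res.params['row_number']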
Import data and visualize it
import yfinance as yf
import matplotlib.pyplot as plt
from pandas_datareader import data as web
yf.pdr_override() # needed so that pandas_datareader pulls the data through yfinance
start_date = '2020-10-01'
end_date = '2021-02-11'
data = web.DataReader('GME', data_source='yahoo', start=start_date, end=end_date)
close = data['Close']
ax = close.plot(title='GameStop Share Price')
ax.set_xlabel('Date')
ax.set_ylabel('Close (US$)')
ax.grid()
plt.show()
[*********************100%***********************] 1 of 1 completed
Get the level and slope of the Pre Interval and the Post Interval separately
Slicing Data into Pre and Post Intervals
import numpy as np
data_pre = data['2020-10-01':'2021-01-26']
data_pre = data_pre.copy() # without the copy, the column assignment below triggers a red SettingWithCopyWarning (see note below)
data_pre['row_number'] = np.arange(data_pre.shape[0])
# row_number plays the role of x in the linear regression
data_pre
| Date | Open | High | Low | Close | Adj Close | Volume | row_number |
|---|---|---|---|---|---|---|---|
| 2020-10-01 | 10.090000 | 10.250000 | 9.690000 | 9.770000 | 9.770000 | 4554100 | 0 |
| 2020-10-02 | 9.380000 | 9.780000 | 9.300000 | 9.390000 | 9.390000 | 4340500 | 1 |
| 2020-10-05 | 9.440000 | 9.590000 | 9.250000 | 9.460000 | 9.460000 | 2805000 | 2 |
| 2020-10-06 | 9.560000 | 9.840000 | 9.100000 | 9.130000 | 9.130000 | 4535400 | 3 |
| 2020-10-07 | 9.230000 | 9.560000 | 9.170000 | 9.360000 | 9.360000 | 3308600 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-01-20 | 37.369999 | 41.189999 | 36.060001 | 39.119999 | 39.119999 | 33471800 | 75 |
| 2021-01-21 | 39.230000 | 44.750000 | 37.000000 | 43.029999 | 43.029999 | 56216900 | 76 |
| 2021-01-22 | 42.590000 | 76.760002 | 42.320000 | 65.010002 | 65.010002 | 197157900 | 77 |
| 2021-01-25 | 96.730003 | 159.179993 | 61.130001 | 76.790001 | 76.790001 | 177874000 | 78 |
| 2021-01-26 | 88.559998 | 150.000000 | 80.199997 | 147.979996 | 147.979996 | 178588000 | 79 |

80 rows × 7 columns
row_number plays the role of x in the linear regression.

data_pre = data_pre.copy() is needed because the date slice may be a view of data; without the copy, adding the row_number column triggers this red warning:

<ipython-input-14-a1b62b3d68ff>:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
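As the warning itself suggests, the ambiguity comes from the slice possibly being a view of data; an equivalent sketch that avoids the warning is to slice with .loc and copy before adding the column:

data_pre = data.loc['2020-10-01':'2021-01-26'].copy()  # explicit, independent copy of the slice
data_pre['row_number'] = np.arange(len(data_pre))      # safe to add columns now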
data_post = data['2021-01-27':'2021-02-11']
data_post = data_post.copy()
data_post['row_number'] = np.arange(data_post.shape[0])
data_post
| Date | Open | High | Low | Close | Adj Close | Volume | row_number |
|---|---|---|---|---|---|---|---|
| 2021-01-27 | 354.829987 | 380.000000 | 249.000000 | 347.510010 | 347.510010 | 93396700 | 0 |
| 2021-01-28 | 265.000000 | 483.000000 | 112.250000 | 193.600006 | 193.600006 | 58815800 | 1 |
| 2021-01-29 | 379.709991 | 413.980011 | 250.000000 | 325.000000 | 325.000000 | 50566100 | 2 |
| 2021-02-01 | 316.559998 | 322.000000 | 212.000000 | 225.000000 | 225.000000 | 37382200 | 3 |
| 2021-02-02 | 140.759995 | 158.000000 | 74.220001 | 90.000000 | 90.000000 | 78183100 | 4 |
| 2021-02-03 | 112.010002 | 113.400002 | 85.250000 | 92.410004 | 92.410004 | 42698500 | 5 |
| 2021-02-04 | 91.190002 | 91.500000 | 53.330002 | 53.500000 | 53.500000 | 62427300 | 6 |
| 2021-02-05 | 54.040001 | 95.000000 | 51.090000 | 63.770000 | 63.770000 | 81345000 | 7 |
| 2021-02-08 | 72.410004 | 72.660004 | 58.020000 | 60.000000 | 60.000000 | 25687300 | 8 |
| 2021-02-09 | 56.610001 | 57.000000 | 46.520000 | 50.310001 | 50.310001 | 26843100 | 9 |
| 2021-02-10 | 50.770000 | 62.830002 | 46.549999 | 51.200001 | 51.200001 | 36455000 | 10 |
Pre Interval
Linear Regression
from patsy import dmatrices
expr = 'Close' + ' ~ ' + 'row_number' # the patsy formula_like string (explained below)
y_train, x_train = dmatrices(expr,data_pre,return_type='dataframe')
y_train
| Date | Close |
|---|---|
| 2020-10-01 | 9.770000 |
| 2020-10-02 | 9.390000 |
| 2020-10-05 | 9.460000 |
| 2020-10-06 | 9.130000 |
| 2020-10-07 | 9.360000 |
| ... | ... |
| 2021-01-20 | 39.119999 |
| 2021-01-21 | 43.029999 |
| 2021-01-22 | 65.010002 |
| 2021-01-25 | 76.790001 |
| 2021-01-26 | 147.979996 |

80 rows × 1 columns
expr = 'Close' + ' ~ ' + 'row_number' is the formula_like argument.
dmatrices():
- Construct two design matrices given a formula_like and data.
- By convention, the first matrix is the "outcome" or "y" data, and the second is the "predictor" or "x" data.
So y_train corresponds to 'Close', and x_train corresponds to 'row_number' (plus an Intercept column).
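For comparison only (this is not in the original notebook), the same two matrices can be built by hand, which makes the all-ones Intercept column explicit:

import numpy as np
import pandas as pd

y_manual = data_pre[['Close']]                                 # the "outcome" matrix, same as y_train
x_manual = pd.DataFrame({'Intercept': np.ones(len(data_pre)),  # patsy adds this column automatically
                         'row_number': data_pre['row_number'].to_numpy()},
                        index=data_pre.index)                  # the "predictor" matrix, same as x_train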
x_train
| Date | Intercept | row_number |
|---|---|---|
| 2020-10-01 | 1.0 | 0.0 |
| 2020-10-02 | 1.0 | 1.0 |
| 2020-10-05 | 1.0 | 2.0 |
| 2020-10-06 | 1.0 | 3.0 |
| 2020-10-07 | 1.0 | 4.0 |
| ... | ... | ... |
| 2021-01-20 | 1.0 | 75.0 |
| 2021-01-21 | 1.0 | 76.0 |
| 2021-01-22 | 1.0 | 77.0 |
| 2021-01-25 | 1.0 | 78.0 |
| 2021-01-26 | 1.0 | 79.0 |

80 rows × 2 columns
After fitting, the intercept term of the model = this Intercept column (all 1.0) × the estimated intercept coefficient.
Linear Model
Any Auto-Correlation in the Residuals: Durbin Watson Test
from statsmodels.regression import linear_model
olsr_results = linear_model.OLS(y_train, x_train).fit()
fs = olsr_results.summary()
fs
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | R-squared: | 0.287 |
| Model: | OLS | Adj. R-squared: | 0.278 |
| Method: | Least Squares | F-statistic: | 31.45 |
| Date: | Sun, 30 Jan 2022 | Prob (F-statistic): | 2.98e-07 |
| Time: | 14:38:15 | Log-Likelihood: | -332.31 |
| No. Observations: | 80 | AIC: | 668.6 |
| Df Residuals: | 78 | BIC: | 673.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 2.3585 | 3.457 | 0.682 | 0.497 | -4.523 | 9.240 |
| row_number | 0.4237 | 0.076 | 5.608 | 0.000 | 0.273 | 0.574 |

| | | | |
|---|---|---|---|
| Omnibus: | 122.674 | Durbin-Watson: | 0.312 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 3971.926 |
| Skew: | 5.078 | Prob(JB): | 0.00 |
| Kurtosis: | 35.992 | Cond. No. | 90.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
AutocorrelationValue_For_OLS_Errors = float(fs.tables[2].data[0][3])
# the Durbin-Watson statistic, read out of the summary table
preds = olsr_results.predict(x_train)
print(AutocorrelationValue_For_OLS_Errors)
0.312
OLS: ordinary least squares
Review: p-value > 0.05 → not significant
If the Durbin-Watson statistic lies between 1.5 and 2.5, there is no autocorrelation in the residuals and the model is acceptable.
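The cell below reads the Durbin-Watson value out of the rendered summary table; as a cross-check, the same statistic can also be computed directly on the residuals:

from statsmodels.stats.stattools import durbin_watson
print(durbin_watson(olsr_results.resid))  # ≈ 0.31, well below 1.5 → residuals are positively autocorrelated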
import pandas as pd

if 1.5 < AutocorrelationValue_For_OLS_Errors < 2.5:
    """no autocorrelation in the residuals"""
    Level_Start = preds[0]
    Level_End = preds[-1]
    Slope = olsr_results.params.row_number
    result_tab1_html = olsr_results.summary().tables[1].as_html()
    result_tab1_pandas = pd.read_html(result_tab1_html, header=0, index_col=0)[0]
    if (result_tab1_pandas["P>|t|"].iloc[0] < 0.05) and (result_tab1_pandas["P>|t|"].iloc[1] < 0.05):
        Significance_Status = 3
        """
        Which means both intercept and coefficient are significant,
        so Significance_Status = 3 is ideal
        """
    elif result_tab1_pandas["P>|t|"].iloc[1] < 0.05:
        """coefficient is significant"""
        Significance_Status = 2
    elif result_tab1_pandas["P>|t|"].iloc[0] < 0.05:
        """intercept is significant"""
        Significance_Status = 1
    else:
        Significance_Status = 0
else:
    """has autocorrelation, not acceptable"""
    print("We have to go for SARIMAX model")
We have to go for SARIMAX model
Understanding:

elif result_tab1_pandas["P>|t|"].iloc[1] < 0.05: ("coefficient is significant")
- the slope part of the fitted model is row_number × the estimated coefficient of row_number
- this branch means row_number is significant (its p-value in this fit is < 0.05)

elif result_tab1_pandas["P>|t|"].iloc[0] < 0.05: ("intercept is significant")
- the intercept part of the fitted model is the Intercept column (all 1.0) × the estimated intercept
- this branch means the intercept is significant (its p-value in this fit is < 0.05)

Question: why is the coefficient treated as more important than the intercept?
Go to SARIMAX
Find best Parameters
from pmdarima.arima import auto_arima
stepwise_fit = auto_arima(y_train, m=12, seasonal=True, d=None, D=1, trace=True, error_action='ignore', suppress_warnings=True, stepwise=True)
Performing stepwise search to minimize aic
ARIMA(2,2,2)(1,1,1)[12] : AIC=456.685, Time=0.26 sec
ARIMA(0,2,0)(0,1,0)[12] : AIC=471.333, Time=0.01 sec
ARIMA(1,2,0)(1,1,0)[12] : AIC=460.265, Time=0.11 sec
ARIMA(0,2,1)(0,1,1)[12] : AIC=469.429, Time=0.05 sec
ARIMA(2,2,2)(0,1,1)[12] : AIC=454.714, Time=0.21 sec
ARIMA(2,2,2)(0,1,0)[12] : AIC=453.718, Time=0.09 sec
ARIMA(2,2,2)(1,1,0)[12] : AIC=455.137, Time=0.15 sec
ARIMA(1,2,2)(0,1,0)[12] : AIC=455.900, Time=0.06 sec
ARIMA(2,2,1)(0,1,0)[12] : AIC=456.842, Time=0.04 sec
ARIMA(3,2,2)(0,1,0)[12] : AIC=455.130, Time=0.10 sec
ARIMA(2,2,3)(0,1,0)[12] : AIC=inf, Time=0.20 sec
ARIMA(1,2,1)(0,1,0)[12] : AIC=459.986, Time=0.03 sec
ARIMA(1,2,3)(0,1,0)[12] : AIC=455.256, Time=0.14 sec
ARIMA(3,2,1)(0,1,0)[12] : AIC=453.133, Time=0.05 sec
ARIMA(3,2,1)(1,1,0)[12] : AIC=452.891, Time=0.15 sec
ARIMA(3,2,1)(2,1,0)[12] : AIC=453.425, Time=0.44 sec
ARIMA(3,2,1)(1,1,1)[12] : AIC=453.520, Time=0.30 sec
ARIMA(3,2,1)(0,1,1)[12] : AIC=451.675, Time=0.19 sec
ARIMA(3,2,1)(0,1,2)[12] : AIC=453.401, Time=0.39 sec
ARIMA(3,2,1)(1,1,2)[12] : AIC=455.210, Time=0.54 sec
ARIMA(2,2,1)(0,1,1)[12] : AIC=457.550, Time=0.11 sec
ARIMA(3,2,0)(0,1,1)[12] : AIC=451.074, Time=0.11 sec
ARIMA(3,2,0)(0,1,0)[12] : AIC=451.292, Time=0.04 sec
ARIMA(3,2,0)(1,1,1)[12] : AIC=453.066, Time=0.15 sec
ARIMA(3,2,0)(0,1,2)[12] : AIC=453.061, Time=0.20 sec
ARIMA(3,2,0)(1,1,0)[12] : AIC=451.767, Time=0.09 sec
ARIMA(3,2,0)(1,1,2)[12] : AIC=inf, Time=0.81 sec
ARIMA(2,2,0)(0,1,1)[12] : AIC=460.167, Time=0.10 sec
ARIMA(4,2,0)(0,1,1)[12] : AIC=451.863, Time=0.19 sec
ARIMA(4,2,1)(0,1,1)[12] : AIC=453.671, Time=0.27 sec
ARIMA(3,2,0)(0,1,1)[12] intercept : AIC=452.064, Time=0.13 sec
Best model: ARIMA(3,2,0)(0,1,1)[12]
Total fit time: 5.740 seconds
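As a side note, the selected orders do not have to be retyped by hand; they can be read off the fitted pmdarima object returned above:

order = stepwise_fit.order                    # (3, 2, 0)
seasonal_order = stepwise_fit.seasonal_order  # (0, 1, 1, 12)
print(order, seasonal_order)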
SARIMAX
Any Auto-Correlation in the Residuals: Durbin Watson Test
from statsmodels.tsa.statespace.sarimax import SARIMAX
sarimax_model = SARIMAX(endog=y_train, exog=x_train, order=(3,2,0),seasonal_order=(0,1,1,12),enforce_stationarity=False)
results = sarimax_model.fit()
from statsmodels.stats.stattools import durbin_watson
AutocorrelationValue_For_ARIMAX_Errors = durbin_watson(results.resid)
AutocorrelationValue_For_ARIMAX_Errors
1.5611495127853794
1.5 < Autocorrelation Value = 1.56 < 2.5
➔ the model has been successful enough not to leave any meaningful data in the residuals.
exog=x_train
- exogenous variable = x_train (mainly row_number)
- this is the extra argument compared with the model used while searching for the best parameters
- before: SARIMA
- now: SARIMAX, where X stands for the exogenous variable
Why?
- A time series has its own internal structure → lags
- The linear regression contributes the exogenous variable → row_number (i.e. x)
- → SARIMAX thus combines the time-series structure with the linear-regression part
results.summary()
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | No. Observations: | 80 |
| Model: | SARIMAX(3, 2, 0)x(0, 1, [1], 12) | Log Likelihood | -178.900 |
| Date: | Sun, 30 Jan 2022 | AIC | 371.799 |
| Time: | 14:38:22 | BIC | 385.591 |
| Sample: | 0 | HQIC | 377.103 |
| | - 80 | | |
| Covariance Type: | opg | | |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.13e-08 | 5.77e+05 | 1.06e-13 | 1.000 | -1.13e+06 | 1.13e+06 |
| row_number | 5.887e-07 | 7.23e+05 | 8.14e-13 | 1.000 | -1.42e+06 | 1.42e+06 |
| ar.L1 | -0.3810 | 0.337 | -1.129 | 0.259 | -1.042 | 0.280 |
| ar.L2 | 1.0854 | 0.188 | 5.761 | 0.000 | 0.716 | 1.455 |
| ar.L3 | 1.1965 | 0.262 | 4.559 | 0.000 | 0.682 | 1.711 |
| ma.S.L12 | -0.5392 | 0.565 | -0.954 | 0.340 | -1.647 | 0.568 |
| sigma2 | 48.5960 | 6.597 | 7.367 | 0.000 | 35.666 | 61.526 |

| | | | |
|---|---|---|---|
| Ljung-Box (L1) (Q): | 0.18 | Jarque-Bera (JB): | 199.64 |
| Prob(Q): | 0.67 | Prob(JB): | 0.00 |
| Heteroskedasticity (H): | 47.87 | Skew: | 0.91 |
| Prob(H) (two-sided): | 0.00 | Kurtosis: | 12.33 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.26e+16. Standard errors may be unstable.
The p-values of Intercept and row_number are both > 0.05
→ the intercept and the coefficient of the linear part are not significant
→ take Slope = 0 and Intercept = 0
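A sketch of the same conclusion done programmatically (Slope_pre and Intercept_pre are illustrative names, not from the original notebook):

Slope_pre = results.params['row_number'] if results.pvalues['row_number'] < 0.05 else 0.0
Intercept_pre = results.params['Intercept'] if results.pvalues['Intercept'] < 0.05 else 0.0
print(Slope_pre, Intercept_pre)  # both 0.0 here, since the p-values are ≈ 1.0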
Getting the results of pre interval
preds = results.predict(start=min(y_train.index), end=max(y_train.index))
Level_Start = preds[0]
Level_End = preds[-1]
print('Level_Start:',Level_Start)
print('Level_End:',Level_End)
Level_Start: 6.129752604487791e-08
Level_End: 116.31853590382661
Post Interval
Linear Regression
expr = 'Close' + ' ~ ' + 'row_number' # same patsy formula_like as for the pre interval
y_train, x_train = dmatrices(expr,data_post,return_type='dataframe')
y_train
| Date | Close |
|---|---|
| 2021-01-27 | 347.510010 |
| 2021-01-28 | 193.600006 |
| 2021-01-29 | 325.000000 |
| 2021-02-01 | 225.000000 |
| 2021-02-02 | 90.000000 |
| 2021-02-03 | 92.410004 |
| 2021-02-04 | 53.500000 |
| 2021-02-05 | 63.770000 |
| 2021-02-08 | 60.000000 |
| 2021-02-09 | 50.310001 |
| 2021-02-10 | 51.200001 |
Linear Model
Any Auto-Correlation in the Residuals: Durbin Watson Test
olsr_results = linear_model.OLS(y_train, x_train).fit()
fs = olsr_results.summary()
fs
| | | | |
|---|---|---|---|
| Dep. Variable: | Close | R-squared: | 0.733 |
| Model: | OLS | Adj. R-squared: | 0.703 |
| Method: | Least Squares | F-statistic: | 24.66 |
| Date: | Sun, 30 Jan 2022 | Prob (F-statistic): | 0.000774 |
| Time: | 14:38:22 | Log-Likelihood: | -59.834 |
| No. Observations: | 11 | AIC: | 123.7 |
| Df Residuals: | 9 | BIC: | 124.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 286.9668 | 34.752 | 8.257 | 0.000 | 208.352 | 365.582 |
| row_number | -29.1697 | 5.874 | -4.966 | 0.001 | -42.458 | -15.881 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.306 | Durbin-Watson: | 1.819 |
| Prob(Omnibus): | 0.520 | Jarque-Bera (JB): | 0.734 |
| Skew: | 0.117 | Prob(JB): | 0.693 |
| Kurtosis: | 1.756 | Cond. No. | 11.3 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- row_number's p-value < 0.05 → row_number is significant → slope = the coefficient of row_number
- 1.5 < Durbin-Watson (1.819) < 2.5 → no autocorrelation in the residuals → the linear model is acceptable
preds = olsr_results.predict(x_train)
Level_Start = preds[0]
Level_End = preds[-1]
print('Level_Start:',Level_Start)
print('Level_End:',Level_End)
Level_Start: 286.9668230576948
Level_End: -4.730455398559627
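The slope used in the results below is the row_number coefficient of this post-interval OLS fit, extracted the same way as for the pre interval (Slope_post is an illustrative name):

Slope_post = olsr_results.params['row_number']
print('Slope:', Slope_post)  # ≈ -29.17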
Results
Level Change = Start Level of Post Interval - End Level of Pre Interval
Level Change = 286.97 - 116.32 = 170.65
Overall, GME's close price jumped by about 170.65 immediately after the tweet.
Slope Change = Slope of Post Interval - Slope of Pre Interval
Slope Change = -29.17 - 0 = -29.17
The slope change is negative because the tweet raised the price immediately and the price then fell,
i.e. a short-term success but a long-term failure.
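Putting the pieces together as code (a sketch; the variable names are illustrative and the values are copied from the outputs above):

pre_level_end = 116.3185      # Level_End of the pre-interval SARIMAX fit
pre_slope = 0.0               # linear part of the SARIMAX model was not significant
post_level_start = 286.9668   # Level_Start of the post-interval OLS fit
post_slope = -29.1697         # row_number coefficient of the post-interval OLS fit

level_change = post_level_start - pre_level_end   # ≈ 170.65: immediate jump after the tweet
slope_change = post_slope - pre_slope             # ≈ -29.17: the trend turns sharply downward afterwards
print(level_change, slope_change)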