python数据分析--- ch14-15 python计量回归模型

本文链接：https://blog.csdn.net/s1164548515/article/details/139766730

python数据分析--- ch14-15 python计量回归模型

1.Ch14--python 回归分析
2. Ch15--python 时间序列分析

1.Ch14–python 回归分析

statsmodels

1.1 相关分析

1.1.1 相关系数的概念

$\frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

1.1.2 使用模拟数据计算变量之间的相关系数和绘图

# 导入包
import numpy as np
import statsmodels.tsa.stattools as sts
import matplotlib.pyplot as plt#绘图
import pandas as pd
import seaborn as sns#绘图
import statsmodels.api as sm

#生成随机数并绘制图形
X = np.random.randn(1000)
Y = np.random.randn(1000)
plt.scatter(X,Y)
plt.show()
print("correlation of X and Y is ")
np.corrcoef(X,Y)[0,1]

在这里插入图片描述
output
correlation of X and Y is
-0.0070692509611224395

X = np.random.randn(1000)
Y = X + np.random.normal(0,0.1,1000)

plt.scatter(X,Y)
plt.show()
print("correlation of X and Y is ")
np.corrcoef(X,Y)[0,1]

在这里插入图片描述
output
correlation of X and Y is
0.9953589318626792

X = np.random.randn(1000)
Y = -2*X + np.random.normal(0,0.1,1000)
plt.scatter(X,Y)
plt.show()
print("correlation of X and Y is ")
np.corrcoef(X,Y)[0,1]

在这里插入图片描述

output
correlation of X and Y is
-0.9987776408463028

X = np.random.randn(1000)
Y = -X*X-2*X + np.random.normal(0,0.1,1000)
plt.scatter(X,Y)
plt.show()
print("correlation of X and Y is ")
np.corrcoef(X,Y)[0,1]

在这里插入图片描述

output
correlation of X and Y is
-0.8357355999573256

1.1.3 使用本地数据计算变量之间的相关系数和绘图

import pandas as pd
import numpy as np
import scipy.stats.stats as stats

$\frac{\sum (x - m_x) (y - m_y)} {\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$

?stats.pearsonr

import numpy as np
from scipy import stats
x, y = [1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2]
res = stats.pearsonr(x, y)
res

output
PearsonRResult(statistic=-0.8285038835884277, pvalue=0.021280260007523356)
数据文件“ch14_1.xls”下载

#读取数据并创建数据表，名称为data。
data=pd.DataFrame(pd.read_excel('./data/ch14_1.xls'))
#查看数据表前5行的内容
print(data.head())
#取adv和sale数据
x = np.array(data['adv'])
y = np.array(data['sale'])

r=stats.pearsonr(x,y)[0]  
print(r)

output
time adv sale
0 1 35 50
1 2 50 100
2 3 56 120
3 4 68 180
4 5 70 175
0.9636816932556833

1.1.4 使用网上数据计算变量之间的相关系数和绘图

import pandas as pd
import numpy as np

# 创建一个包含10支股票的DataFrame
np.random.seed(42)  # 为了可重复性设置随机种子
stock_data = pd.DataFrame(np.random.randn(100, 10), columns=[f'Stock_{i}' for i in range(1, 11)])

# 将DataFrame转换为CSV文件
stock_data.to_csv('./data/ch14_2_stock_prices.csv', index=False)
print('CSV文件已生成。')

output
CSV文件已生成。

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('./data/ch14_2_stock_prices.csv')
df.head()

output

	Stock_1	Stock_2	Stock_3	Stock_4	Stock_5	Stock_6	Stock_7	Stock_8	Stock_9	Stock_10
0	0.496714	-0.138264	0.647689	1.523030	-0.234153	-0.234137	1.579213	0.767435	-0.469474	0.542560
1	-0.463418	-0.465730	0.241962	-1.913280	-1.724918	-0.562288	-1.012831	0.314247	-0.908024	-1.412304
2	1.465649	-0.225776	0.067528	-1.424748	-0.544383	0.110923	-1.150994	0.375698	-0.600639	-0.291694
3	-0.601707	1.852278	-0.013497	-1.057711	0.822545	-1.220844	0.208864	-1.959670	-1.328186	0.196861
4	0.738467	0.171368	-0.115648	-0.301104	-1.478522	-0.719844	-0.460639	1.057122	0.343618	-1.763040


correlation_matrix = df.corr()
plt.figure(figsize=(10, 8)) # 可以根据需要调整图形大小
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Heatmap of Stock Price Correlation')
plt.show()

output
在这里插入图片描述

1.2 一元回归分析

1.2.1 应用Python-statsmodels工具作一元线性回归分析

数据文件“ch14_3.xls”下载

import pandas as pd
import numpy as np
import scipy.stats as ss
#读取数据并创建数据表，名称为data。
data=pd.DataFrame(pd.read_excel('./data/ch14_3.xls'))
print(data.head())
print(data.describe())
#此命令的含义是对销售额xse、人数rs等变量进行描述性统计分析。

#对数据进行相关分析
x = np.array(data['rs'])
y = np.array(data['xse'])

r=ss.pearsonr(x,y)[0]
#本命令的含义是对新产品销售额、销售人员人数等变量进行相关性分析 
print ('相关系数为:',round(r,4))
print('相关系数矩阵为:\n', data.corr())

output

   dq  xse  rs
0   1  385  17
1   2  251  10
2   3  701  44
3   4  479  30
4   5  433  22
             dq         xse         rs
count  10.00000   10.000000  10.000000
mean    5.50000  446.600000  22.100000
std     3.02765  160.224287  12.705642
min     1.00000  217.000000   5.000000
25%     3.25000  362.500000  12.000000
50%     5.50000  422.000000  19.500000
75%     7.75000  555.500000  30.750000
max    10.00000  701.000000  44.000000
相关系数为: 0.9699
相关系数矩阵为:
            dq       xse        rs
dq   1.000000  0.218510  0.085207
xse  0.218510  1.000000  0.969906
rs   0.085207  0.969906  1.000000

# model matrix with intercept
X = sm.add_constant(x)
#least squares fit
model = sm.OLS(y, X)
fit = model.fit()
print(fit.summary())

output

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.933
Method:                 Least Squares   F-statistic:                     126.9
Date:                Wed, 12 Jun 2024   Prob (F-statistic):           3.46e-06
Time:                        11:37:48   Log-Likelihood:                -50.301
No. Observations:                  10   AIC:                             104.6
Df Residuals:                       8   BIC:                             105.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        176.2952     27.327      6.451      0.000     113.279     239.311
x1            12.2310      1.086     11.267      0.000       9.728      14.734
==============================================================================
Omnibus:                        0.718   Durbin-Watson:                   1.407
Prob(Omnibus):                  0.698   Jarque-Bera (JB):                0.588
Skew:                          -0.198   Prob(JB):                        0.745
Kurtosis:                       1.879   Cond. No.                         52.6
==============================================================================

#画线性回归图
import matplotlib.pyplot as plt
plt.scatter(x, y)
plt.plot(x, fit.fittedvalues)

在这里插入图片描述

1.2.2 应用Python-sklearn工具作一元线性回归分析

数据文件“ch14_3.xls”下载

import numpy as np
from sklearn import linear_model
import pandas as pd

# 假设data是已经定义并包含所需列的DataFrame
data=pd.DataFrame(pd.read_excel('./data/ch14_3.xls'))

# 确保data变量已经被定义且包含'rs'和'xse'列
x = np.array(data[['rs']])
y = np.array(data[['xse']])
x

output
array([[17],
[10],
[44],
[30],
[22],
[15],
[11],
[ 5],
[31],
[36]], dtype=int64)

data['rs']

output
0 17
1 10
2 44
3 30
4 22
5 15
6 11
7 5
8 31
9 36
Name: rs, dtype: int64

# 创建线性回归模型并拟合数据
clf = linear_model.LinearRegression()
clf.fit(x, y)

# 打印回归系数和截距
print("回归系数:", clf.coef_)
print("截距:", clf.intercept_)

# 计算并打印R²分数
r_squared = clf.score(x, y)
print("R-squared:", r_squared)

# 输入自变量值进行预测
# 假设你想预测'rs'为40时的'xse'值
prediction = clf.predict([[40]])
print("预测rs = 40时xse的值为 :", prediction[0])

output
回归系数: [[12.2309863]]
截距: [176.2952027]
R-squared: 0.9407180505879883
预测rs = 40时xse的值为 : [665.53465483]

1.3 多元线性回归数据分析

数据文件“ch14_4.xls”下载

import pandas as pd
import numpy as np
import scipy.stats as ss
#读取数据并创建数据表，名称为data。
data=pd.DataFrame(pd.read_excel('./data/ch14_4.xls'))
print(data.head())
print(data.describe())

output

      TC  Q    PL    PF   PK
0  0.082  2  2.09  17.9  183
1  0.661  3  2.05  35.1  174
2  0.990  4  2.05  35.1  171
3  0.315  4  1.83  32.2  166
4  0.197  5  2.12  28.6  233
               TC             Q          PL          PF          PK
count  145.000000    145.000000  145.000000  145.000000  145.000000
mean    12.976097   2133.082759    1.972069   26.176552  174.496552
std     19.794577   2931.942131    0.236807    7.876071   18.209477
min      0.082000      2.000000    1.450000   10.300000  138.000000
25%      2.382000    279.000000    1.760000   21.300000  162.000000
50%      6.754000   1109.000000    2.040000   26.900000  170.000000
75%     14.132000   2507.000000    2.190000   32.200000  183.000000
max    139.422000  16719.000000    2.320000   42.800000  233.000000

#用数组对数据做相关分析
y= np.array(data['TC'])
x1= np.array(data['Q'])
x2= np.array(data['PL'])
x3= np.array(data['PF'])
x4= np.array(data['PK'])

r1=ss.pearsonr(x1,y)[0]
r2=ss.pearsonr(x2,y)[0]
r3=ss.pearsonr(x3,y)[0]
r4=ss.pearsonr(x4,y)[0]
print(f"r1={round(r1,4)};r2={round(r2,4)};r3={round(r3,4)};r4={round(r4,4)}")

output

r1=0.9525;r2=0.2513;r3=0.0339;r4=0.0272

#用数据框对数据做相关分析
corr_matrix = data.corr()
print('相关系数矩阵为:\n', data.corr())

plt.figure(figsize=(10, 8)) # 可以根据需要调整图形大小
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title('Heatmap of data Correlation')
plt.show()

相关系数矩阵为:
           TC         Q        PL        PF        PK
TC  1.000000  0.952504  0.251338  0.033935  0.027202
Q   0.952504  1.000000  0.171450 -0.077349  0.002869
PL  0.251338  0.171450  1.000000  0.313703 -0.178145
PF  0.033935 -0.077349  0.313703  1.000000  0.125428
PK  0.027202  0.002869 -0.178145  0.125428  1.000000

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

1.3.1 多元回归分析的Python的statsmodels工具应用

import statsmodels.api as sm
from patsy import dmatrices
vars = ['TC','Q','PL','PF','PK']
df=data[vars]
#显示最后5条记录数据
print (df.tail())

output

          TC      Q    PL    PF   PK
140   44.894   9956  1.68  28.8  203
141   67.120  11477  2.24  26.5  151
142   73.050  11796  2.12  28.6  148
143  139.422  14359  2.31  33.5  212
144  119.939  16719  2.30  23.6  162

y,X=dmatrices('TC~Q+PL+PF+PK',data=data,return_type='dataframe')
print (y.head())
print (X.head())
model = sm.OLS(y, X)
fit = model.fit()
print (fit.summary())

output

      TC
0  0.082
1  0.661
2  0.990
3  0.315
4  0.197
   Intercept    Q    PL    PF     PK
0        1.0  2.0  2.09  17.9  183.0
1        1.0  3.0  2.05  35.1  174.0
2        1.0  4.0  2.05  35.1  171.0
3        1.0  4.0  1.83  32.2  166.0
4        1.0  5.0  2.12  28.6  233.0
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     TC   R-squared:                       0.923
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                     418.1
Date:                Wed, 12 Jun 2024   Prob (F-statistic):           9.26e-77
Time:                        11:38:18   Log-Likelihood:                -452.47
No. Observations:                 145   AIC:                             914.9
Df Residuals:                     140   BIC:                             929.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -22.2210      6.587     -3.373      0.001     -35.245      -9.197
Q              0.0064      0.000     39.258      0.000       0.006       0.007
PL             5.6552      2.176      2.598      0.010       1.352       9.958
PF             0.2078      0.064      3.242      0.001       0.081       0.335
PK             0.0284      0.027      1.073      0.285      -0.024       0.081
==============================================================================
Omnibus:                      135.057   Durbin-Watson:                   1.560
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4737.912
Skew:                           2.907   Prob(JB):                         0.00
Kurtosis:                      30.394   Cond. No.                     5.29e+04
==============================================================================

y,X=dmatrices('TC~Q+PL+PF',data=df,return_type='dataframe')
import statsmodels.api as sm
model = sm.OLS(y, X)
fit = model.fit()
print (fit.summary())

output

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     TC   R-squared:                       0.922
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                     556.5
Date:                Wed, 12 Jun 2024   Prob (F-statistic):           6.39e-78
Time:                        11:38:19   Log-Likelihood:                -453.06
No. Observations:                 145   AIC:                             914.1
Df Residuals:                     141   BIC:                             926.0
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -16.5443      3.928     -4.212      0.000     -24.309      -8.780
Q              0.0064      0.000     39.384      0.000       0.006       0.007
PL             5.0978      2.115      2.411      0.017       0.917       9.278
PF             0.2217      0.063      3.528      0.001       0.097       0.346
==============================================================================
Omnibus:                      142.387   Durbin-Watson:                   1.590
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5466.347
Skew:                           3.134   Prob(JB):                         0.00
Kurtosis:                      32.419   Cond. No.                     3.42e+04
==============================================================================

1.3.2 用scikit-learn工具作多元回归分析

import pandas as pd

print(data.shape)
print(data.head())

output

(145, 5)
      TC  Q    PL    PF   PK
0  0.082  2  2.09  17.9  183
1  0.661  3  2.05  35.1  174
2  0.990  4  2.05  35.1  171
3  0.315  4  1.83  32.2  166
4  0.197  5  2.12  28.6  233

feature_cols = ['Q', 'PL', 'PF', 'PK']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# check the type and shape of X
print (type(X))
print (X.shape)

# select a Series from the DataFrame
y = data['TC']
# print the first 5 values
print (y.head())

output

<class 'pandas.core.frame.DataFrame'>
(145, 4)
0    0.082
1    0.661
2    0.990
3    0.315
4    0.197
Name: TC, dtype: float64

# import seaborn as sns   #seaborn程序包需要先安装
# #安装命令：pip install seaborn
# sns.pairplot(data, x_vars=['Q', 'PL', 'PF', 'PK'], y_vars='TC', height=7, aspect=0.8)

import seaborn as sns
import matplotlib.pyplot as plt

    # 使用lmplot绘制每个自变量与因变量之间的关系
for col in ['Q', 'PL', 'PF', 'PK']:
    sns.lmplot(x=col, y='TC', data=data, height=7, aspect=0.8, scatter_kws={'alpha':0.5})
plt.show()

在这里插入图片描述

1.4 线性回归模型

#构造训练集和测试集
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

#线性回归分析的Scikit-learn工具应用
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1)
print(linreg.intercept_)
print(linreg.coef_)

# pair the feature names with the coefficients
zip(feature_cols, linreg.coef_)

#预测
y_pred = linreg.predict(X_test)
print(y_pred.round(decimals=4))

output

(108, 4)
(108,)
(37, 4)
(37,)
-15.546621299685215
[5.98807135e-03 5.68597073e+00 1.69334275e-01 9.00966074e-05]
[ 0.4819  4.1567  2.1491 32.6262  6.6678  1.4221  1.9241 17.8204  0.7097
  9.0636  4.1637  8.5665  5.6732  8.2366  3.96   25.3146 -0.9507  3.0294
  4.5084  2.7402 31.2073 12.9369  2.1043  3.8415 12.741  55.895  26.3795
 18.7322  6.7021  0.5052  1.4885 18.4282 13.5925 10.1431  2.7995  7.1969
 89.2625]

1.5 回归问题的评价测度

from sklearn import metrics
import numpy as np

# calculate MAE by hand
print("MAE by hand:",(10 + 0 + 20 + 10)/4)
# calculate MAE using scikit-learn
print ("MAE:",metrics.mean_absolute_error(y_test, y_pred))

# calculate MSE by hand
print ("MSE by hand:",(10**2 + 0**2 + 20**2 + 10**2)/4.)
# calculate MSE using scikit-learn
print ("MSE:",metrics.mean_squared_error(y_test, y_pred))

# calculate RMSE by hand
print ("RMSE by hand:",np.sqrt((10**2 + 0**2 + 20**2 + 10**2)/4.))
# calculate RMSE using scikit-learn
print ("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

output

MAE by hand: 10.0
MAE: 3.559135213796553
MSE by hand: 150.0
MSE: 77.98080985164937
RMSE by hand: 12.24744871391589
RMSE: 8.830674371283846

2. Ch15–python 时间序列分析

2.1 准备工作

2.1.1 加载包

# 加载包
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
from datetime import datetime

#全局配置
#中文字符设定 plt.rcParams属性总结
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

2.1.2 数据读取及预处理

数据文件“csh601318_processed.csv”下载

df = pd.read_csv('./data/sh601318_processed.csv',index_col='Date')
df.head()

output

	Close
Date
2015-01-05	76.16
2015-01-06	73.73
2015-01-07	73.41
2015-01-08	71.08
2015-01-09	72.84

#读取已处理好的数据

df_601318 = df[['Close']]
df_601318.index = pd.to_datetime(df_601318.index, format='%Y-%m-%d') # 将Date列转换为datetime格式

# 查看数据的基本信息
print(df_601318.head())

output

            Close
Date             
2015-01-05  76.16
2015-01-06  73.73
2015-01-07  73.41
2015-01-08  71.08
2015-01-09  72.84

print(df_601318.info())

output

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2033 entries, 2015-01-05 to 2023-05-15
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Close   2033 non-null   float64
dtypes: float64(1)
memory usage: 31.8 KB
None

print(df_601318.describe())

output

             Close
count  2033.000000
mean     59.039838
std      18.942363
min      25.110000
25%      42.840000
50%      58.380000
75%      76.520000
max      93.380000

2.2 可视化股票价格

#对数据进行可视化，以了解股票价格的趋势和变化。
plt.figure(figsize=(10, 6))
plt.plot(df_601318['Close'])
plt.title('ZGPA收盘价')
plt.xlabel('日期')
plt.ylabel('收盘价格(元)')
plt.show()

在这里插入图片描述

2.3 单位根检验

# 进行ADF检验
result = sm.tsa.stattools.adfuller(df_601318['Close'])
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

output

ADF Statistic: -1.894778
p-value: 0.334522
Critical Values:
	1%: -3.434
	5%: -2.863
	10%: -2.568

无法拒绝原假设，认为数据是非平稳的。

2.4 差分

2.4.1 一阶差分

# 进行一阶差分
diff1 = df_601318['Close'].diff().dropna()

2.4.2 一阶差分ADF检验

# 进行ADF检验
result = sm.tsa.stattools.adfuller(diff1)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

output

ADF Statistic: -8.640510
p-value: 0.000000
Critical Values:
	1%: -3.434
	5%: -2.863
	10%: -2.568

一阶差分后序列平稳

2.4.3 绘制一阶差分后的时序图

# 绘制差分后的时间序列图
fig, ax = plt.subplots(2, 1, figsize=(8, 12))
ax[0].plot(df_601318.index, df_601318['Close'])
ax[0].set_title('原始数据')
ax[1].plot(diff1.index, diff1)
ax[1].set_title('一阶差分')
# ax[2].plot(diff2.index, diff2)
# ax[2].set_title('二阶差分')
plt.show()

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

2.5 绘制自相关图和偏自相关图

2.5.1 绘制原序列自相关图和偏自相关图

# 绘制自相关图和偏自相关图
fig, axes = plt.subplots(2, 1, figsize=(10,8))
sm.graphics.tsa.plot_acf(df_601318['Close'], lags=50, ax=axes[0])
sm.graphics.tsa.plot_pacf(df_601318['Close'], lags=30, ax=axes[1])
plt.show()

在这里插入图片描述

2.5.2 绘制一阶差分后自相关图和偏自相关图

# 绘制自相关图和偏自相关图
fig, axes = plt.subplots(2, 1, figsize=(10,8))
sm.graphics.tsa.plot_acf(diff1, lags=30, ax=axes[0])
sm.graphics.tsa.plot_pacf(diff1, lags=30, ax=axes[1])
plt.show()

在这里插入图片描述

2.6 划分训练集和测试集

2.6.1 按照指定的百分比划分，如：8:2,7:3等

# 假设df是处理好的股票数据
train_size = int(len(df_601318) * 0.8)
train_df = df_601318[:train_size]
test_df = df_601318[train_size:]
# 输出训练集和测试集的数据量
print("训练集数据量：", len(train_df))
print("测试集数据量：", len(test_df))

output

训练集数据量： 1626
测试集数据量： 407

2.6.2 按照指定的日期划分，如：2022-12-31

# 假设df是处理好的股票数据
split_date = datetime(2021, 12, 31)
train_df = df_601318[df_601318.index <= split_date]
test_df = df_601318[df_601318.index > split_date]

# 输出训练集和测试集的数据量
print("训练集数据量：", len(train_df))
print("测试集数据量：", len(test_df))

output

训练集数据量： 1705
测试集数据量： 328

test_df.head()

output

	Close
Date
2022-01-04	51.00
2022-01-05	52.07
2022-01-06	51.30
2022-01-07	52.94
2022-01-10	53.08

2.7 模型拟合

我们可以使用 ARIMA 函数来拟合 ARIMA 模型并进行预测。首先，需要确定 ARIMA 模型的参数。可以通过观察 ACF 和 PACF 图来初步选择参数。在这里，我们将选择 ARIMA(1,1,1) 模型。

# 拟合ARIMA(1,1,1)模型并预测训练集数据
model = ARIMA(train_df, order=(1, 1, 1))
model_fit = model.fit()
# 模型概述
print(model_fit.summary())

output

                               SARIMAX Results                                
==============================================================================
Dep. Variable:                  Close   No. Observations:                 1705
Model:                 ARIMA(1, 1, 1)   Log Likelihood               -3295.505
Date:                Wed, 12 Jun 2024   AIC                           6597.010
Time:                        11:41:15   BIC                           6613.333
Sample:                             0   HQIC                          6603.052
                               - 1705                                         
Covariance Type:                  opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.7986      0.117     -6.807      0.000      -1.029      -0.569
ma.L1          0.8396      0.105      8.020      0.000       0.634       1.045
sigma2         2.8013      0.017    160.271      0.000       2.767       2.836
===================================================================================
Ljung-Box (L1) (Q):                   2.02   Jarque-Bera (JB):           5344375.75
Prob(Q):                              0.16   Prob(JB):                         0.00
Heteroskedasticity (H):               0.29   Skew:                           -10.32
Prob(H) (two-sided):                  0.00   Kurtosis:                       276.58
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).


train_pred = model_fit.predict(typ='levels')

# 预测测试集数据
test_pred = model_fit.forecast(steps=len(test_df))

# 输出训练集和测试集的预测数据
print("训练集预测数据：\n", train_pred)
print("测试集预测数据：\n", test_pred)

output

训练集预测数据：
 Date
2015-01-05     0.000000
2015-01-06    76.160008
2015-01-07    73.639795
2015-01-08    73.473245
2015-01-09    70.935963
                ...    
2021-12-27    49.766963
2021-12-28    50.031091
2021-12-29    50.863590
2021-12-30    50.812922
2021-12-31    50.391148
Name: predicted_mean, Length: 1705, dtype: float64
测试集预测数据：
 1705    50.425828
1706    50.413187
1707    50.423283
1708    50.415220
1709    50.421659
          ...    
2028    50.418800
2029    50.418800
2030    50.418800
2031    50.418800
2032    50.418800
Name: predicted_mean, Length: 328, dtype: float64

# 可视化训练集和测试集的实际数据和预测数据
plt.figure(figsize=(12,6))
plt.plot(train_df.index, train_df.values, label='训练数据')
plt.plot(train_df.index, train_pred.values, label='训练数据预测值')
plt.plot(test_df.index, test_df.values, label='测试数据')
plt.plot(test_df.index, test_pred.values, label='测试数据预测值')
plt.legend()
plt.title('ARIMA(1,1,1)模型拟合结果')
plt.xlabel('日期')
plt.ylabel('收盘价格')
plt.show()

在这里插入图片描述

思考：ARIMA在训练集上拟合的比较好，但在测试集上，所有的预测值几乎都相等，为什么？问题出在哪？怎么解决？

2.8 模型评价

2.8.1 计算评估指标

为了评估模型的预测效果，我们需要计算一些常用的评估指标，包括均方误差（Mean Squared Error，MSE）、均方根误差（Root Mean Squared Error，RMSE）、平均绝对误差（Mean Absolute Error，MAE）和平均绝对百分误差（Mean Absolute Percentage Error，MAPE）等。

这些指标的计算公式如下：

$\frac{1}{n} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

$\sqrt{MSE}$

$\frac{1}{n} \sum_{i=1}^{n}|y_i - \hat{y}_i|$

$\frac{1}{n} \sum_{i=1}^{n}|\frac{y\_i - \hat{y}_i}{y_i}| \times 100%$

其中， $y_i$ 表示真实值， $\hat{y}_i$ 表示预测值， $n$ 表示样本数量。

代码如下：

# 计算评估指标
mse = mean_squared_error(test_df.values, test_pred.values)
rmse = np.sqrt(mse)
mae = mean_absolute_error(test_df.values, test_pred.values)
mape = np.mean(np.abs((test_df.values - test_pred.values) / test_df.values)) * 100

# 打印评估指标
print('MSE: %.2f' % mse)
print('RMSE: %.2f' % rmse)
print('MAE: %.2f' % mae)
print('MAPE: %.2f%%' % mape)

output

MSE: 33.86
RMSE: 5.82
MAE: 4.92
MAPE: 11.28%

接下来，我们可以绘制模型的残差图（Residual Plot），来评估模型是否存在系统性误差。如果模型的残差是随机的、没有规律性，那么说明模型拟合得比较好；如果残差呈现出一定的规律性，那么说明模型还存在一些问题。

2.8.2 残差检验

(1) 绘制残差图

# 计算训练集和测试集的残差
train_resid = train_df['Close'] - train_pred.values
test_resid = test_df['Close'] - test_pred.values

# 绘制训练集和测试集的残差图
plt.figure(figsize=(12,10))
plt.subplot(211)
plt.plot(train_resid)
plt.title('训练数据残差图')
plt.xlabel('年份')
plt.ylabel('残差')

plt.subplot(212)
plt.plot(test_resid)
plt.title('测试数据残差图')
plt.xlabel('年份')
plt.ylabel('残差')
plt.show()

在这里插入图片描述

(2) 对残差进行平稳性检验

# 对训练集残差进行平稳性检验
result3 = sm.tsa.stattools.adfuller(train_resid)
print('训练集残差ADF检验结果：')
print('ADF Statistic: %f' % result3[0])
print('p-value: %f' % result3[1])
print('Critical Values:')
for key, value in result3[4].items():
    print('\t%s: %.3f' % (key, value))
    
# 对测试集残差进行平稳性检验
result4 = sm.tsa.stattools.adfuller(test_resid)
print("")
print('测试集残差ADF检验结果：')
print('ADF Statistic: %f' % result4[0])
print('p-value: %f' % result4[1])
print('Critical Values:')
for key, value in result4[4].items():
    print('\t%s: %.3f' % (key, value))

output

训练集残差ADF检验结果：
ADF Statistic: -7.788862
p-value: 0.000000
Critical Values:
	1%: -3.434
	5%: -2.863
	10%: -2.568

测试集残差ADF检验结果：
ADF Statistic: -1.734568
p-value: 0.413396
Critical Values:
	1%: -3.451
	5%: -2.870
	10%: -2.572

根据结果，训练集残差ADF 统计量为-8.386454，远小于 1% 的临界值-3.616，因此我们可以拒绝原假设(序列具有单位根非平稳性)，接受备择假设，即序列是平稳的。 P 值为 0，说明序列的单位根非常显著地低于 5% 的显著性水平，进一步支持序列的平稳性。