这一次更新的是统计计算相关的项目,用序贯蒙特卡洛算法来预测股票市场的波动率。波动率的预测在现实中还是很有意义的,可以是市场风险的测度,也被拿来做一些金融衍生产品,而且波动率相关的模型大都是基于数理统计方面的理论,求解也都是根据一些相关的算法来编程求解。这一点和一些机器学习模型还是很类似的。
具体的数学模型如下:
在得到股票的价格P之后(也可以是股票指数),可以得到对应的对数收益率。这里关心的变量有隐变量即无法观测的方差以及可观测变量即对数收益率。这里的假设主要在两方面,一方面是等式关系上的假设,具体就是方差的对数服从一个AR(1)模型,另一方面是分布上的假设,具体就是假设给定方差后对数收益率的条件分布是正态分布,此外还假设了状态方程中的noise为正态分布。这里序贯蒙特卡洛的任务就是根据这些假设以及观测到的对数收益率y来估计和预测无法观测的波动率即方差。
这里介绍一下算法具体的流程,背后的理论推导就不详细展开了,主要就是序贯重要性抽样与系统重抽样的结合,有了很多方差的抽样之后就可以根据大数定律做推断了:
(1).t=0,初始状态:
(2).当t>0时,重复抽样并更新权重,然后根据大数定律估计和预测:
(3).然后进行重抽样,主要是为了解决权重退化的问题,就是可能绝大多数样本的权重很小,只有一小部分样本占着极高的权重,这会大大降低推断和预测的质量,本项目用的是系统性重抽样:
多步预测所用的表达式推导结果如下,证明的过程会在文章最后面列出:
当然还有一点很重要,就是之前展示的对数波动率AR(1)中的参数都是未知的,我们要估计波动率就必须知道这些参数,这里本项目的做法是先用观测到的y粗略地估一下波动率,然后用时间序列模型去估AR(1)中的系数,具体滑窗设置如下:
前面介绍完了模型本身,接下来从模拟和实际数据两方面来介绍一下算法的结果:
模拟:用计算机生成一组T为200,服从AR(1)分布的序列,然后取指数得到模拟实验中的真是方差值,具体设置如下:
其中ut是均值为0方差为1的噪声,然后根据观测方程生成yt的样本。得到yt后,根据上面介绍的算法对波动率进行预测,然后和生成的真实波动率进行比较,结果如下,具体实现可以参考后面的代码部分:
从结果可以看出预测值和真实值的拟合程度还是比较高的,但是一些高峰的预测值并没有真实值那么高。
实际数据:本项目选的数据是沪深300指数,它能反映我国股票市场整体的价格变动情况,时间跨度是从2021年5月6号到2023年5月5号,将每日的收盘价当作该日价格,下面分别是股票指数和对应的对数收益率,就是模型中的Pt和Yt:
然后就是根据序贯蒙特卡洛算法进行波动率的预测,下面是分别做了未来的一步预测和五步预测:
根据预测值可以推断未来几天应该不会有太大的波动,投资风险也是比较小的,这里附上未来几天观测值的图:
从模拟和实际数据的分析中可以看出序贯蒙特卡洛算法还是能一定程度上预测波动率的。
最后就是预测方程的推导,主要利用对数正态的性质以及数学归纳法:
模拟的代码:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from pylab import mpl
mpl.rcParams['font.sans-serif']=['SimHei']
mpl.rcParams['axes.unicode_minus']=False
# We try both using package and writting our own function to generate data, the results are the same.
def generate_ln_sigma2(alpha, beta, delta, T):
true_ln_sigma2 = []
true_ln_sigma2.append(np.random.normal(0, 1))
for t in range(1, T):
true_ln_sigma2.append(np.random.normal(alpha + beta*true_ln_sigma2[t-1], delta))
return true_ln_sigma2
plt.plot(generate_ln_sigma2(0, 0.7, 1, 200))
plt.show()
import statsmodels.api as sm
# generate log(sigma^2) and y
T = 200
true_alpha = 0
true_beta = 0.7
true_delta = 1
np.random.seed(12345)
true_ln_sigma2 = generate_ln_sigma2(true_alpha, true_beta, true_delta, T)
true_sigma2 = np.exp(true_ln_sigma2)
y = []
np.random.seed(12345)
for sig2 in true_sigma2:
y.append(np.random.normal(0, sig2**0.5))
# estimate sigma^2
sigma2_t_hat = []
gap = 1
for i in range(T):
if i < gap:
sigma2_t_hat.append(np.var(np.append(y[:i],y[i:2*gap+1]),ddof = 1))
elif i >= T - gap:
sigma2_t_hat.append(np.var(np.append(y[i-gap-1:i],y[i:]),ddof = 1))
else:
sigma2_t_hat.append(np.var(y[i-gap:i+gap+1],ddof = 1))
sigma2_t_hat = np.array(sigma2_t_hat)
log_sigma2_t_hat = np.log(sigma2_t_hat)
# estimate the state equation
from statsmodels.graphics.tsaplots import plot_pacf,plot_acf
plot_acf(log_sigma2_t_hat)
plt.show()
plot_pacf(log_sigma2_t_hat, method='ywm')
plt.show()
model = sm.tsa.arima.ARIMA(log_sigma2_t_hat,order=(1,0,0))
arima = model.fit()
mu = arima.params[0]
beta = arima.params[1]
alpha = mu*(1-beta)
delta2 = arima.params[2]
delta = delta2**0.5
# other preparation
def normal_pdf(x, mean, std):
return np.exp(-(x-mean)**2/(2*std**2))/(np.sqrt(2*np.pi)*std)
y = np.append([0],y) # set y0 to be 0 for easier index
# monte-carlo method
m = 10000
sigma2 = np.zeros((m, T+1))
W = np.zeros((m, T+1))
E_prediction = np.zeros(T+2)
for t in range(T+1):
# sampling
if t == 0:
sigma2[:, t] = np.random.lognormal(np.mean(log_sigma2_t_hat), np.std(log_sigma2_t_hat,ddof=1), m)
else:
sigma2[:, t] = np.random.lognormal(alpha + beta*np.log(sigma2[:,t-1]), delta, m)
# updating weights
if t == 0:
W[:, t] = 1
else:
W[:, t] = W[:, t-1]*normal_pdf(y[t], 0, sigma2[:, t])
# inference
if t != 0:
#prediction
E_prediction[t+1] = np.sum(W[:, t]*np.exp(alpha + beta*np.log(sigma2[:, t])+delta2/2))/np.sum(W[:, t])
# resampling
if t != T: # when all observed data are used, we don't resample because we needn't to sample anymore.
u1 = np.random.uniform(0,1/m)
um = u1 + (m-1)/m
U = np.linspace(u1, um, m)
P = np.cumsum(W[:, t]/np.sum(W[:, t]))
K = []
k = 0
for j in range(m):
while P[k] < U[j]:
k = k + 1
K.append(k)
sigma2 = sigma2[np.array(K)]
W[:, t] = np.mean(W[:, t])
plt.plot(E_prediction[2:], label = '预测值')
plt.plot(true_sigma2, label = '真实值')
plt.legend()
plt.title('模拟的样本量:'+str(T))
plt.show()
沪深300波动率预测的代码:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from pylab import mpl
mpl.rcParams['font.sans-serif']=['SimHei']
mpl.rcParams['axes.unicode_minus']=False
# data pre-processing and ploting
data = pd.read_csv('hs300.csv')
price = np.array(data['收盘'])[::-1]
time = data['日期'][::-1]
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
ax.plot(time, price, c = 'blue', alpha = 0.7)
plt.xlabel('日期')
plt.ylabel('价格')
plt.title('沪深300价格指数')
tick_spacing = 60
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.savefig('hs300price.png')
plt.show()
# y_t we observed
y = np.log(price[1:]/price[:-1])
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
ax.plot(time[1:], y, c = 'blue', alpha = 0.7)
plt.xlabel('日期')
plt.ylabel('对数回报率')
plt.title('沪深300对数回报率')
tick_spacing = 60
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.savefig('hs300logreturn.png')
plt.show()
# estimate sigma^2_t
sigma2_t_hat = []
gap = 1 # if the gap is large, log(sigma2) will be non-stationary, and SIS method will raise RuntimeWarning: overflow encountered error.
T = len(y)
for i in range(T):
if i < gap:
sigma2_t_hat.append(np.var(np.append(y[:i],y[i:2*gap+1]),ddof = 1))
elif i >= T - gap:
sigma2_t_hat.append(np.var(np.append(y[i-gap-1:i],y[i:]),ddof = 1))
else:
sigma2_t_hat.append(np.var(y[i-gap:i+gap+1],ddof = 1))
sigma2_t_hat = np.array(sigma2_t_hat)
# plt.plot(sigma2_t_hat)
# plt.title('sigma2_hat')
log_sigma2_t_hat = np.log(sigma2_t_hat)
# determine the coefficents of state equation
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_pacf,plot_acf
plt.plot(log_sigma2_t_hat)
plt.title('log(sigma^2)')
plt.savefig('ln.png')
plt.show()
plot_acf(log_sigma2_t_hat)
plt.savefig('acf.png')
plt.show()
plot_pacf(log_sigma2_t_hat, method='ywm')
plt.savefig('pacf.png')
plt.show()
model = sm.tsa.arima.ARIMA(log_sigma2_t_hat,order=(1,0,0))
arima = model.fit()
arima.summary()
# log(sigma^2_t) = alpha + beta*log(sigma^2_t-1) + u_t, where u_t is N(0, delta^2)
mu = arima.params[0]
beta = arima.params[1]
alpha = mu*(1-beta)
delta2 = arima.params[2]
delta = delta2**0.5
# since the log(sigma^2) is stationary, we assume log(sigma^2_0) follows N(mean, variance),
# where mean and variance is the mean and variance of series {log(sigma^2_t)}
# so sigma^2_0 follows LN.
def normal_pdf(x, mean, std):
return np.exp(-(x-mean)**2/(2*std**2))/(np.sqrt(2*np.pi)*std)
y = np.append([0],y) # set y0 to be 0 for easier index
# SIS let g(sigma^2_t|sigma^2_t-1) = q(sigma^2_t|sigma^2_t-1), which is log-normal.
m = 10000
sigma2 = np.zeros((m, T+1))
W = np.zeros((m, T+1))
n = 5 # steps we want to predict
E_prediction = np.zeros(T+1+n)
for t in range(T+1):
# sampling
if t == 0:
sigma2[:, t] = np.random.lognormal(np.mean(log_sigma2_t_hat), np.std(log_sigma2_t_hat,ddof=1), m)
else:
sigma2[:, t] = np.random.lognormal(alpha + beta*np.log(sigma2[:,t-1]), delta, m)
# updating weights
if t == 0:
W[:, t] = 1
else:
W[:, t] = W[:, t-1]*normal_pdf(y[t], 0, sigma2[:, t])
# inference
if t != 0:
#a = np.log(W[:, t])
#a_max = np.max(a)
#E_prediction[t+1] = np.sum(np.exp(a - a_max)*np.exp(alpha + beta*np.log(sigma2[:, t])+delta2/2))/np.sum(np.exp(a - a_max))
#prediction
E_prediction[t+1] = np.sum(W[:, t]*np.exp(alpha + beta*np.log(sigma2[:, t])+delta2/2))/np.sum(W[:, t])
# resampling
if t != T: # when all observed data are used, we don't resample because we needn't to sample anymore.
u1 = np.random.uniform(0,1/m)
um = u1 + (m-1)/m
U = np.linspace(u1, um, m)
P = np.cumsum(W[:, t]/np.sum(W[:, t]))
K = []
k = 0
for j in range(m):
while P[k] < U[j]:
k = k + 1
K.append(k)
sigma2 = sigma2[np.array(K)]
W[:, t] = np.mean(W[:, t])
# predict the future n step given observed data
def SUM(i): # calculate beta^0 + beta^2 + beta^4 + ...
s = 0
for j in range(i+2):
s = s + beta**(2*j)
return s
E_last_step = alpha + beta*np.log(sigma2[:, t]) #E(ln(sigma^2_t+1)|sigma^2_t)
for i in range(n-1):
E_prediction[T+2+i] = np.sum(W[:, t]*np.exp(alpha + beta*E_last_step + SUM(i)*delta2/2))/np.sum(W[:, t])
E_last_step = alpha + beta*E_last_step
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
ax.plot(np.append(time[2:], ['2023-5-8']), E_prediction[2:-n+1], label = '一步预测', c = 'blue', alpha = 0.7)
ax.plot(['2023-5-8','2023-5-9','2023-5-10','2023-5-11','2023-5-12'], E_prediction[-n:],
label = '五步预测', c = 'red', alpha = 0.7)
plt.title('风险曲线')
plt.legend()
tick_spacing = 60
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.savefig('riskcurve1.png')
plt.show()
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
ax.plot(time[2:], E_prediction[2:-n], c = 'blue', alpha = 0.7, label = '一步预测')
ax.plot(['2023-5-8'], E_prediction[-n:-n+1], c = 'red', marker='*', label = '2023-5-8预测值')
plt.title('风险曲线')
plt.legend()
tick_spacing = 60
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.savefig('riskcurve2.png')
plt.show()
new_data = pd.read_csv('new_data.csv')
new_price = np.array(new_data['收盘'])[::-1]
new_y = np.log(new_price[1:]/new_price[:-1])
fig = plt.figure(figsize=(10, 5))
ax = fig.add_subplot(111)
ax.plot(time[1:], y[1:], c = 'blue', label = '历史数据')
ax.plot(['2023-5-5','2023-5-8','2023-5-9','2023-5-10','2023-5-11','2023-5-12'], np.append(y[-1], new_y), c = 'red', label = '新观测数据')
plt.title('沪深300对数回报率')
plt.legend()
tick_spacing = 60
ax.xaxis.set_major_locator(ticker.MultipleLocator(tick_spacing))
plt.savefig('newdata.png')
plt.show()