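Overview
HMMs were first applied successfully to speech recognition. Speech is a time series and so are stock prices, so in principle the same approach carries over.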
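Content
Some time ago I read the article 从语音识别到股指预测—隐马尔科夫模型(HMM)的一种应用 ("From Speech Recognition to Stock Index Prediction: An Application of the Hidden Markov Model (HMM)") and found it quite good.
The key passage is this:
The idea behind the strategy is still simple: history keeps repeating itself. Price movements that happened in the past will recur in the future. So to predict tomorrow's market from the most recent movement, we simply find the most similar movement in history and look at what happened afterwards.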
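The quoted idea can be illustrated without any HMM at all. The sketch below is not the method of the cited article: it assumes a synthetic random-walk price series and uses plain Euclidean distance between windows of log returns to find the most similar past window, then reads off the return that followed it. The function `most_similar_window`, the window length, and the data are all hypothetical.

```python
import numpy as np

def most_similar_window(returns, window):
    """Return (start index, distance) of the past window closest to the latest one."""
    recent = returns[-window:]
    best_start, best_dist = None, np.inf
    # Candidate windows must end before the recent window starts, which also
    # guarantees that at least one later observation follows each candidate.
    last_start = len(returns) - 2 * window
    for start in range(last_start + 1):
        candidate = returns[start:start + window]
        dist = np.linalg.norm(candidate - recent)
        if dist < best_dist:
            best_start, best_dist = start, dist
    return best_start, best_dist

# Synthetic prices (made-up data): a geometric random walk.
rng = np.random.default_rng(0)
close_syn = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=300)))
log_ret = np.diff(np.log(close_syn))

window = 5
start, dist = most_similar_window(log_ret, window)
next_ret = log_ret[start + window]  # what happened right after the best match
print('best match starts at', start, 'distance', dist, 'next return', next_ret)
```

On real data the "next return" of the best match would serve as a naive forecast; the HMM approach replaces the raw distance with a comparison based on the fitted model.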
Below, I run this with hmmlearn.
# play a search
from hmmlearn import hmm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
# step1 import data
filename = 'data/stock_data.csv'
data = pd.read_csv(filename)
data.dtypes
# step2 derive features
# p1 - extract the columns, difference them, derive features, and align
volume = np.array(data.volume) # --- volume
close = np.array(data.close) # --- price
# intraday data
high = np.array(data.high)
low = np.array(data.low)
# features
logDel = np.log(high) - np.log(low) # intraday price range (log high minus log low)
logRet1 = np.log(close[1:]) - np.log(close[:-1]) # log return vs. the previous day
logRet5 = np.log(close[5:]) - np.log(close[:-5]) # log return vs. 5 days earlier
logVol5 = np.log(volume[5:]) - np.log(volume[:-5]) # log volume change vs. 5 days earlier
print('shape of logDel, logRet1, logRet5, logVol5 :', logDel.shape, logRet1.shape, logRet5.shape, logVol5.shape)
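# align: drop leading rows so that row k of every array refers to day k + 5
logDel = logDel[5:]
logRet1 = logRet1[4:]
# add the time axis (ordering) and the prediction target
close = close[5:]
date = pd.to_datetime(data.date[5:])
# the full dataset after alignment
print('shape of logDel, logRet1, logRet5, logVol5,close, date :', logDel.shape, logRet1.shape, logRet5.shape, logVol5.shape,close.shape,date.shape)
# 'bind' the data: each observation is a vector obs: logDel, logRet1, logRet5, logVol5; the target is close; the ordering is date
obs = np.column_stack([logDel, logRet1, logRet5, logVol5]) # Gaussian
# obs = np.column_stack([logDel]) # GMMHMM
# hold out the last 10 rows for testing BEFORE truncating the training set
obs_test = obs[-10:]
obs = obs[:-10]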
---
shape of logDel, logRet1, logRet5, logVol5 : (936,) (935,) (931,) (931,)
shape of logDel, logRet1, logRet5, logVol5,close, date : (931,) (931,) (931,) (931,) (931,) (931,)
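The trimming offsets used above follow from how each differenced array is indexed: `logRet1[i]` describes day `i + 1`, while `logRet5[i]` and `logVol5[i]` describe day `i + 5`. A minimal sketch with a synthetic price series (the data and the names `close_syn`, `n_days`, `features` are made up for illustration) checks that after trimming, row `k` of every array refers to day `k + 5`, and then splits off a small chronological hold-out set.

```python
import numpy as np

n_days = 20
rng = np.random.default_rng(42)
# Synthetic, strictly positive closing prices (made-up data).
close_syn = 10.0 + np.cumsum(rng.uniform(0.0, 1.0, size=n_days))

log_close = np.log(close_syn)
ret1 = log_close[1:] - log_close[:-1]   # ret1[i] is the return of day i + 1
ret5 = log_close[5:] - log_close[:-5]   # ret5[i] is the return of day i + 5

ret1_aligned = ret1[4:]                 # drop 4 rows so that row k -> day k + 5
close_aligned = close_syn[5:]           # drop 5 rows so that row k -> day k + 5

assert ret1_aligned.shape == ret5.shape == close_aligned.shape == (n_days - 5,)

# Row k of each aligned array should describe the same day k + 5.
for k in range(n_days - 5):
    day = k + 5
    assert np.isclose(ret1_aligned[k], log_close[day] - log_close[day - 1])
    assert np.isclose(ret5[k], log_close[day] - log_close[day - 5])
    assert close_aligned[k] == close_syn[day]

# Chronological hold-out: slice the test rows first, then truncate the rest.
features = np.column_stack([ret1_aligned, ret5])
test_part = features[-3:]
train_part = features[:-3]
print(train_part.shape, test_part.shape)
```

Slicing the test rows before truncating matters: if the training array is truncated first, a later `[-3:]` slice returns rows that are still inside the training set.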
# step3 - fit the model
# assumed number of hidden states
n_states = 6
np.random.seed(1234)
# model = hmm.GaussianHMM(n_components=n_states, covariance_type='full', n_iter = 100).fit(obs) # Gaussian
model = hmm.GMMHMM(n_components=n_states, n_mix=3, covariance_type='diag', n_iter = 100).fit(obs) #GMMHMM
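Some notes from the time:
Confirming the model parameters:
The five elements of an HMM: the hidden state set, the observation set, and the parameters A (state transition matrix), B (emission distribution of obs), and pi (initial state distribution).
This run assumes 6 hidden states. With multivariate normal emissions and a full covariance matrix (the commented-out GaussianHMM line), each state's distribution has cov_matrix.shape = (obs.dimension, obs.dimension).
obs.dimension = model.n_features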
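In standard notation these items can be written out as follows. Here $N$ is the number of hidden states (6 in this run), $d$ is the number of observed features (4 for `obs`), and $z_t$, $x_t$ are symbols introduced here for the hidden state and the observation vector at time $t$.

```latex
\begin{aligned}
\lambda &= (A, B, \pi) \\
A &= [a_{ij}] \in \mathbb{R}^{N \times N}, \quad a_{ij} = P(z_{t+1} = j \mid z_t = i), \quad \sum_{j=1}^{N} a_{ij} = 1 \\
\pi &= [\pi_i] \in \mathbb{R}^{N}, \quad \pi_i = P(z_1 = i), \quad \sum_{i=1}^{N} \pi_i = 1 \\
B &: \; b_j(x_t) = \mathcal{N}(x_t;\, \mu_j, \Sigma_j), \quad \mu_j \in \mathbb{R}^{d}, \quad \Sigma_j \in \mathbb{R}^{d \times d}
\end{aligned}
```

For the Gaussian-mixture variant, each state's emission density is instead a weighted sum of `n_mix` Gaussians, $b_j(x_t) = \sum_{m} w_{jm}\, \mathcal{N}(x_t;\, \mu_{jm}, \Sigma_{jm})$ with $\sum_m w_{jm} = 1$.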
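Decoding after fitting (inferring the hidden states of a sequence):
Method 1: model.predict(some_obs) returns a hidden state sequence. Method 2: log_prob, predict_states = model.decode(some_obs)
To get the probability of each state at every step of the sequence, use the predict_proba method.
The score method returns the log probability of the observations under the model.
Note on X: the documentation describes X as "Feature matrix of individual samples", which matches my understanding.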