Reading the hmmlearn Source Code
Introduction
hmmlearn is a Python implementation of hidden Markov models (HMMs). It can handle a set of independent sequences (multiple sequences), but this article only considers a single sequence.
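As a quick illustration of that distinction (a minimal sketch with made-up data; the model choice and parameter values are only for demonstration), hmmlearn fits multiple independent sequences by concatenating them and passing their lengths, while a single sequence is passed directly:

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.RandomState(0)
model = hmm.GaussianHMM(n_components=2, n_iter=50)

# multiple independent sequences: concatenate and pass the lengths
seqs = [rng.randn(30, 1), rng.randn(50, 1)]
model.fit(np.concatenate(seqs), lengths=[len(s) for s in seqs])

# a single sequence (the case considered in this article)
X = rng.randn(100, 1)
model.fit(X)
```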
Notation
- $X, Z$: observed variable, hidden variable
- $S$: state set, the sample space of $Z$
- $T$: number of samples
Variable/Term Glossary
Correspondence between the variables and terms that appear in the source code.
Basic variables
- X: observed variables
- state: a state in S
- n_components: number of states |S|
- n_samples: number of samples T
- n_features: number of features per sample
Advanced variables
| Variable: shape | HMM notation (before taking logs) | Term |
|---|---|---|
| fwdlattice: T × \|S\| | $\alpha_t(z):=P(x_{1:t},Z_t=z \mid \theta),\ t=1,\cdots,T,\ z\in S$ | forward probability |
| bwdlattice: T × \|S\| | $\beta_t(z):=P(x_{t+1:T} \mid Z_t=z,\theta)$ | backward probability |
| framelogprob: T × \|S\| | $p(x_t \mid z,\theta)$ | emission probability |
| logprob: scalar | $p(X=x \mid \theta)$ | likelihood |
| posteriors: T × \|S\| | $\gamma_t(i)=p(Z_t=i \mid X=x,\theta)\propto\alpha_t(i)\beta_t(i)$ | state posterior |
| viterbi_lattice: T × \|S\| | $\delta_t(i)=\max_{z_1,\cdots,z_{t-1}}P(z_t=i,z_{t-1},\cdots,z_1,x \mid \theta)$ | Viterbi lattice |
| xi: T × \|S\| × \|S\| | $\xi_t(i,j)=p(Z_t=i,Z_{t+1}=j \mid x,\theta)$ | pairwise state posterior |
Note: the variable names in the table may be substrings of the actual variable names in the source code.
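To make the shapes and quantities in the table concrete, here is a small log-space forward–backward sketch (my own code, not the library's Cython implementation):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_startprob, log_transmat, framelogprob):
    """framelogprob: (T, |S|) array of log p(x_t | z, theta)."""
    T, S = framelogprob.shape
    fwdlattice = np.zeros((T, S))   # log alpha_t(z)
    bwdlattice = np.zeros((T, S))   # log beta_t(z)

    fwdlattice[0] = log_startprob + framelogprob[0]
    for t in range(1, T):
        fwdlattice[t] = logsumexp(fwdlattice[t - 1][:, None] + log_transmat,
                                  axis=0) + framelogprob[t]

    for t in range(T - 2, -1, -1):
        bwdlattice[t] = logsumexp(log_transmat + framelogprob[t + 1]
                                  + bwdlattice[t + 1], axis=1)

    logprob = logsumexp(fwdlattice[-1])                  # log p(X = x | theta)
    log_posteriors = fwdlattice + bwdlattice - logprob   # log gamma_t(i)
    return logprob, fwdlattice, bwdlattice, log_posteriors
```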
Classes
The _BaseHMM class
- Main parameters:
  - n_components: number of states
  - startprob_prior: prior on the initial state distribution, $\pi\sim \mathrm{Dir}(\alpha)$, default $\alpha=1$
  - transmat_prior: prior on the transition matrix, $A(i,:)\sim \mathrm{Dir}(\alpha_i')$, default $\alpha'=1$
- Main attributes:
  - startprob_: initial state distribution
  - transmat_: transition matrix
- Main methods (a usage sketch follows this list):
  - _compute_log_likelihood: computes the log emission probabilities $\log p(X_t\mid z_t,\theta),\ t=1,\cdots,T$
  - _generate_sample_from_state: draws a sample from $P(X\mid Z=z,\theta)$
  - _score: computes the log-likelihood; calls _do_forward_pass, and _do_backward_pass / _compute_posteriors
  - _do_forward_pass: computes $\alpha$
  - _do_backward_pass: computes $\beta$
  - _compute_posteriors: computes the state posteriors $\gamma_t(i)\propto\alpha_t(i)\beta_t(i)$
  - decode/predict: estimates the state sequence $Z\mid X=x$; calls _decode_viterbi
  - _decode_map: greedy maximum-a-posteriori estimation
  - _decode_viterbi: the Viterbi algorithm
  - fit: the learning algorithm; calls _do_mstep
  - _accumulate_sufficient_statistics: updates the sufficient statistics from $\alpha,\beta,\gamma$; calls _hmmc._compute_log_xi_sum
  - sample: generates an $(X_t,Z_t)$ sequence; calls _generate_sample_from_state
  - _do_mstep: the M step; updates the initial distribution and the transition matrix
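A typical call sequence on the public wrappers of these methods (a sketch with toy data, only to show how score/decode/predict/sample relate to the private methods above):

```python
import numpy as np
from hmmlearn import hmm

X = np.random.randn(100, 1)                    # one observation sequence, shape (T, n_features)
model = hmm.GaussianHMM(n_components=3, n_iter=100).fit(X)

logprob = model.score(X)                       # log p(X | theta), via the forward pass
logprob, states = model.decode(X, algorithm="viterbi")  # most likely state sequence
states = model.predict(X)                      # decode, returning only the states
posteriors = model.predict_proba(X)            # gamma_t(i), shape (T, n_components)
X_new, Z_new = model.sample(50)                # generate an (X_t, Z_t) sequence
```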
- Sufficient statistics:
  - start: estimates the initial probabilities
  - trans: estimates the transition matrix

  Update of the initial distribution: $\pi \propto (\alpha-1+N_1)\lor 0$ (then normalized), where $N_1$ is the expected count collected in stats['start']; the transition-matrix update is analogous.
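A sketch of that MAP update (my own restatement of the formula; normalization is just division by the sum):

```python
import numpy as np

def map_update(prior, expected_counts):
    """pi is proportional to max(alpha - 1 + N_1, 0), then normalized;
    the same rule is applied to each row of the transition matrix."""
    unnormalized = np.maximum(prior - 1.0 + expected_counts, 0.0)
    return unnormalized / unnormalized.sum()

# with the default alpha = 1 this reduces to the usual normalized expected counts
startprob = map_update(1.0, np.array([3.0, 1.0, 0.0]))   # -> [0.75, 0.25, 0.0]
```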
What users mainly implement:
- _compute_log_likelihood
- _do_mstep
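A minimal sketch of what such a user implementation could look like (a hypothetical PoissonHMM; everything beyond the two overridden methods and the _BaseHMM interface described above is my own invention, and a complete model would also accumulate its own sufficient statistics):

```python
import numpy as np
from scipy.stats import poisson
from hmmlearn.base import _BaseHMM

class PoissonHMM(_BaseHMM):
    """Hypothetical HMM with Poisson emissions; X is a (T, 1) array of counts."""

    def __init__(self, n_components=1, **kwargs):
        super().__init__(n_components=n_components, **kwargs)
        self.lambdas_ = np.ones(n_components)      # one Poisson rate per state

    def _compute_log_likelihood(self, X):
        # framelogprob with shape (T, n_components): log p(x_t | z, theta)
        return poisson.logpmf(X, self.lambdas_)

    def _do_mstep(self, stats):
        super()._do_mstep(stats)                   # updates startprob_ and transmat_
        # a full implementation would re-estimate lambdas_ here from E-step statistics
```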
GaussianHMM
- Attributes:
  - means_: the mean of each state
  - covars_: the covariance of each state
  - covariance_type: covariance type ('spherical', 'diag', 'tied', 'full')
  - means_prior, means_weight: mean and precision of the normal prior on the means
  - covars_prior, covars_weight: mean and precision of the prior on the variances
  The emission probability is determined by each state's mean and covariance (rather than by an explicit emissionprob_ table).
- Sufficient statistics:
  - post, obs: used to update the means
  - obs**2, obs*obs.T: used to update the covariances
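How these statistics are built from the posteriors, as a sketch (it mirrors, but is not, the library code; `stats` stands for the dict created by _initialize_sufficient_statistics):

```python
import numpy as np

def accumulate_gaussian_stats(stats, X, posteriors):
    """X: (T, n_features); posteriors: (T, n_components) array of gamma_t(i)."""
    stats['post'] += posteriors.sum(axis=0)        # sum_t gamma_t(i)
    stats['obs'] += posteriors.T @ X               # sum_t gamma_t(i) x_t      -> mean update
    stats['obs**2'] += posteriors.T @ (X ** 2)     # sum_t gamma_t(i) x_t**2   -> diag/spherical covariances
    # for 'tied'/'full': stats['obs*obs.T'][i] += sum_t gamma_t(i) * outer(x_t, x_t)
```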
MultinomialHMM
A discrete-valued HMM.
- Attributes:
  - emissionprob_: emission probabilities
- Sufficient statistics:
  - obs: estimates the emission probabilities
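The corresponding update, sketched for clarity (assuming the observations are given as a 1-D array of symbol indices):

```python
import numpy as np

def multinomial_emission_update(X, posteriors, n_symbols):
    """X: length-T integer array of symbols; posteriors: (T, n_components) gamma."""
    n_components = posteriors.shape[1]
    obs = np.zeros((n_components, n_symbols))      # plays the role of stats['obs']
    for t, x in enumerate(X):
        obs[:, x] += posteriors[t]                 # sum_t gamma_t(i) * [x_t == k]
    return obs / obs.sum(axis=1, keepdims=True)    # row-normalized -> emissionprob_
```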
GMMHMM
Omitted here.
Helper functions
- _hmmc._compute_log_xi_sum: computes $\sum_t\xi_t$, which is used to estimate the transition matrix.
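A log-space sketch of the quantity it computes (my own code, not the Cython routine; the result, exponentiated, gives the expected transition counts):

```python
import numpy as np
from scipy.special import logsumexp

def log_xi_sum(fwdlattice, log_transmat, bwdlattice, framelogprob):
    """Returns log sum_t xi_t(i, j) as an (|S|, |S|) array."""
    logprob = logsumexp(fwdlattice[-1])            # log p(x | theta)
    # log xi_t(i, j) = log alpha_t(i) + log A(i, j)
    #                  + log p(x_{t+1} | j) + log beta_{t+1}(j) - log p(x)
    log_xi = (fwdlattice[:-1, :, None] + log_transmat[None, :, :]
              + framelogprob[1:, None, :] + bwdlattice[1:, None, :] - logprob)
    return logsumexp(log_xi, axis=0)
```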
Reading the source code
The code below has been simplified: some error checking and normalization has been removed, and we assume only a single sequence is learned.
The fit method: the learning algorithm (Bayesian Baum-Welch)
# initialization
self._init(X, lengths=lengths)
# EM loop: repeat the E and M steps until convergence
# E step
# compute the sufficient statistics (iterating over the set of sequences)
# initialize the sufficient statistics
stats = self._initialize_sufficient_statistics()
framelogprob = self._compute_log_likelihood(X)
logprob, fwdlattice = self._do_forward_pass(framelogprob)
bwdlattice = self._do_backward_pass(framelogprob)
posteriors = self._compute_posteriors(fwdlattice, bwdlattice)
# accumulate (update) the sufficient statistics
self._accumulate_sufficient_statistics(
    stats, X, framelogprob, posteriors, fwdlattice,
    bwdlattice)
# M step ('~' denotes assignment followed by normalization)
self.startprob_ ~ np.maximum(self.startprob_prior - 1 + stats['start'], 0)
self.transmat_ ~ np.maximum(self.transmat_prior - 1 + stats['trans'], 0)
The M step of GaussianHMM additionally updates the means and covariances (MAP updates with the priors described above):
def _do_mstep(self, stats):
    super()._do_mstep(stats)
    means_prior = self.means_prior
    means_weight = self.means_weight
    denom = stats['post'][:, None]
    # update the means
    self.means_ = ((means_weight * means_prior + stats['obs'])
                   / (means_weight + denom))
    # update the covariances
    covars_prior = self.covars_prior
    covars_weight = self.covars_weight
    meandiff = self.means_ - means_prior
    if self.covariance_type in {'spherical', 'diag'}:  # diagonal covariance: features are independent
        # (c0 + w0*(mu - mu0)**2 + sum_t p_t*x_t**2 - 2*mu*sum_t p_t*x_t + mu**2*sum_t p_t)
        #     / (max(cw - 1, 0) + sum_t p_t)
        c_n = (means_weight * meandiff**2
               + stats['obs**2']
               - 2 * self.means_ * stats['obs']
               + self.means_**2 * denom)
        c_d = max(covars_weight - 1, 0) + denom
        self._covars_ = (covars_prior + c_n) / np.maximum(c_d, 1e-5)
        if self.covariance_type == 'spherical':
            self._covars_ = np.tile(self._covars_.mean(1)[:, None],
                                    (1, self._covars_.shape[1]))
    elif self.covariance_type in ('tied', 'full'):
        c_n = np.empty((self.n_components, self.n_features,
                        self.n_features))
        for c in range(self.n_components):
            obsmean = np.outer(stats['obs'][c], self.means_[c])
            c_n[c] = (means_weight * np.outer(meandiff[c], meandiff[c])
                      + stats['obs*obs.T'][c]
                      - obsmean - obsmean.T
                      + np.outer(self.means_[c], self.means_[c])
                      * stats['post'][c])
        cvweight = max(covars_weight - self.n_features, 0)
        if self.covariance_type == 'tied':
            self._covars_ = ((covars_prior + c_n.sum(axis=0)) /
                             (cvweight + stats['post'].sum()))
        elif self.covariance_type == 'full':
            self._covars_ = ((covars_prior + c_n) /
                             (cvweight + stats['post'][:, None, None]))
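To see the update in action, a toy fit with two well-separated one-dimensional regimes makes the learned parameters easy to read off (my own example; the printed values are approximate and the state order may differ):

```python
import numpy as np
from hmmlearn import hmm

rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(200, 1)),
                    rng.normal(5.0, 0.5, size=(200, 1))])

model = hmm.GaussianHMM(n_components=2, covariance_type='diag', n_iter=100)
model.fit(X)
print(model.means_)    # roughly 0 and 5
print(model.covars_)   # roughly 1 and 0.25 on the diagonals
```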
Demo
A small script that trains a Word2Vec model on a corpus, fits a GaussianHMM to the word vectors, and samples a new "composition".
#!/usr/bin/env python3
"""AI composing by HMM
Word2Vec -> HMM
"""
import pathlib
import joblib
import numpy as np
from hmmlearn import hmm
from scipy.spatial import distance
from gensim.models import word2vec, Word2Vec
CORPUS_PATH = pathlib.Path('...') # the path of your corpus, could be a folder of txt files.
WV_PATH = pathlib.Path('wv.model')
HMM_PATH = pathlib.Path('hmm.model')
def read(path, level='p'):
"""Read text from files
"""
if isinstance(path, str):
path = pathlib.Path(path)
if path.is_dir():
if level == 'd':
return [read(f, level=level) for f in path.iterdir()]
elif level == 'p':
res = []
for f in path.iterdir():
if f.suffix in {'.txt', '.md'}:
res.extend(read(f, level=level))
return res
else:
if level == 'd':
return path.read_text().rpartition('---')[-1]
elif level == 'p':
return [p for p in path.read_text().rpartition('---')[-1].split('\n') if len(p)>2]
def get_corpus(path=CORPUS_PATH):
# get corpus (for gensim)
# a corpus is a 2-order list of words [['words']]
import jieba
import jieba.posseg as pseg
import logging
jieba.setLogLevel(logging.INFO)
def cut_flag(s):
words = pseg.cut(s)
return [f'{word_flag.flag}' if word_flag.flag in {'u', 'w', 'x', 'y', 'z', 'un', 'o', 'p', 'uj'}
else word_flag.word for word_flag in words]
s_s = read(path)
return [cut_flag(s) for s in s_s]
def ezfit(model, Xs):
lengths = list(map(len, Xs))
model.fit(np.vstack(Xs), lengths)
def similar_word(x, wv_model):
k = np.argmin([distance.cosine(x, wv_model.wv[w]) for w in wv_model.wv.key_to_index])
return wv_model.wv.index_to_key[k]
n_features = 20  # word-vector dimensionality
o = np.random.random(n_features)  # fallback vector for out-of-vocabulary words
def train(model, path=CORPUS_PATH):
# set of files -> set of sequences
docs = get_corpus(path=path)
if WV_PATH.exists():
wv_model = Word2Vec.load('wv.model')
else:
        # the Word2Vec constructor already builds the vocabulary and trains when given sentences
        wv_model = word2vec.Word2Vec(docs, vector_size=n_features, workers=4)
        wv_model.save('wv.model')
Xs = [[(wv_model.wv[w] if (w in wv_model.wv.key_to_index) else o) for w in ws] for ws in docs]
return ezfit(model, Xs)
if HMM_PATH.exists():
model = joblib.load('hmm.model')
else:
model = hmm.GaussianHMM(n_components=20, n_iter=1000, tol=0.0001)
train(model)
joblib.dump(model, 'hmm.model')
Y, Z = model.sample(n_samples=200)
wv_model = Word2Vec.load('wv.model')
for y in Y:
print(similar_word(y,wv_model), end=' ')
Appendix