Reading the hmmlearn Source Code

Introduction

hmmlearn: a Python implementation of Hidden Markov Models

hmmlearn can handle a set of independent sequences (multiple sequences), but this article only considers a single sequence.

Notation

  • $X, Z$: observed variable, hidden variable
  • $S$: state set, i.e. the sample space of $Z$
  • $T$: number of samples

Variable and Terminology Glossary

A mapping between the variables/terms appearing in the source code and standard HMM notation.

Basic variables

X: observed variable

state: a state in S

n_components: number of states |S|

n_samples: number of samples T

n_features: number of features per sample

Advanced variables

| Variable | Shape | HMM notation (before taking logs) | Term |
|---|---|---|---|
| fwdlattice | T × \|S\| | $\alpha_t(z):=P(x_{1:t},Z_t=z\mid\theta),\ t=1,\cdots,T,\ z\in S$ | forward probability |
| bwdlattice | T × \|S\| | $\beta_t(z):=P(x_{t+1:T}\mid Z_t=z,\theta)$ | backward probability |
| framelogprob | T × \|S\| | $p(x_t\mid z,\theta)$ | emission probability |
| logprob | scalar | $p(X=x\mid\theta)$ | likelihood |
| posteriors | T × \|S\| | $\gamma_t(i)=p(Z_t=i\mid X=x,\theta)\propto\alpha_t(i)\beta_t(i)$ | state posterior |
| viterbi_lattice | T × \|S\| | $\delta_t(i)=\max_{z_1,\cdots,z_{t-1}}P(z_t=i,z_{t-1},\cdots,z_1,x\mid\theta)$ | Viterbi probability |
| xi | T × \|S\| × \|S\| | $\xi_t(i,j)=p(Z_t=i,Z_{t+1}=j\mid x,\theta)$ | pairwise posterior |

The variable names in the table may appear as substrings of the actual variable names in the source code.
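Several of these quantities are reachable through hmmlearn's public API, so you can inspect them without calling the private methods. A minimal sketch (the model and data here are random toy values, only illustrating the call pattern):

```python
import numpy as np
from hmmlearn import hmm

# Assumed setup: a fitted model and a (T, n_features) observation array.
model = hmm.GaussianHMM(n_components=3).fit(np.random.randn(100, 2))
X = np.random.randn(50, 2)

logprob, posteriors = model.score_samples(X)  # log p(x|theta) and the gammas
viterbi_logprob, states = model.decode(X)     # Viterbi path, built from the deltas
```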

The _BaseHMM class

  • Main parameters:

    • n_components: number of states
    • startprob_prior: prior on the initial state distribution, $\pi\sim\mathrm{Dir}(\alpha)$, default $\alpha=1$
    • transmat_prior: prior on the rows of the transition matrix, $A(i,:)\sim\mathrm{Dir}(\alpha_i')$, default $\alpha'=1$
  • Main attributes:

    • startprob_: initial state distribution

    • transmat_: transition matrix

  • Main methods:

    • _compute_log_likelihood: computes the log likelihoods / emission probabilities $\log p(X_t\mid z_t,\theta),\ t=1,\cdots,T$
    • _generate_sample_from_state: draws a sample from the distribution $P(X\mid Z=z,\theta)$
    • _score: computes the log likelihood; calls _do_forward_pass, _do_backward_pass/_compute_posteriors
    • _do_forward_pass: computes $\alpha$
    • _do_backward_pass: computes $\beta$
    • _compute_posteriors: computes the state posteriors $\gamma_t(i)\propto\alpha_t(i)\beta_t(i)$
    • decode/predict: estimates the state sequence $Z\mid X=x$; calls _decode_viterbi
    • _decode_map: greedy maximum a posteriori estimation
    • _decode_viterbi: the Viterbi algorithm
    • fit: the learning algorithm; calls _do_mstep
    • _accumulate_sufficient_statistics: updates the sufficient statistics using $\alpha,\beta,\gamma$; calls _hmmc._compute_log_xi_sum
    • sample: generates an $(X_t,Z_t)$ sequence; calls _generate_sample_from_state
    • _do_mstep: the M step; updates the initial distribution and the transition matrix
  • Sufficient statistics:

    • start: for estimating the initial probabilities
    • trans: for estimating the transition matrix

Initial distribution update: $\pi \propto (\alpha-1+N_1)\lor 0$ (then normalized), where $N_1$ is the accumulated stats['start']; the transition matrix update is analogous. A numeric sketch follows.
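A minimal numeric sketch of this MAP update, assuming the default flat prior $\alpha=1$ and a made-up stats['start']:

```python
import numpy as np

alpha = 1.0                                  # default flat Dirichlet prior
start_stats = np.array([0.2, 0.5, 0.3])      # hypothetical stats['start']
pi = np.maximum(alpha - 1 + start_stats, 0)  # (alpha - 1 + N_1) ∨ 0
pi /= pi.sum()                               # normalize to a distribution
print(pi)                                    # [0.2 0.5 0.3] under the flat prior
```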

What the user mainly implements

_compute_log_likelihood

_do_mstep
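As an illustration, here is a minimal, hypothetical subclass that overrides _compute_log_likelihood for a toy emission model (state $k$ emits uniformly on $[0, s_k)$). The class name and the scales parameter are invented for this sketch, and the exact base-class import path depends on the hmmlearn version; a real model would also override the statistics methods and _do_mstep so that fit can learn the emission parameters.

```python
import numpy as np
from hmmlearn.base import _BaseHMM  # may be `BaseHMM` in newer versions

class UniformHMM(_BaseHMM):
    """Toy sketch: state k emits uniformly on [0, scales[k])."""

    def __init__(self, scales, **kwargs):
        super().__init__(n_components=len(scales), **kwargs)
        self.scales_ = np.asarray(scales, dtype=float)

    def _compute_log_likelihood(self, X):
        # Returns the (n_samples, n_components) matrix log p(x_t | z, theta).
        inside = (X >= 0) & (X < self.scales_)       # (T, 1) vs (K,) -> (T, K)
        return np.where(inside, -np.log(self.scales_), -np.inf)
```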

GaussianHMM

  • Attributes

    emissionprob_: emission probabilities, determined by the means and covariances

    means_: the mean for each state

    covars_: the covariance for each state

    covariance_type: the covariance type ('spherical', 'diag', 'tied', or 'full')

    means_prior, means_weight: mean and precision of the normal prior on the means

    covars_prior, covars_weight: mean and precision of the prior on the variances

  • Sufficient statistics:

    post, obs: used to update the means

    obs**2, obs*obs.T: used to update the covariances
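A sketch of how these statistics relate to the posteriors, using the names from the table above (gamma and X are assumed, pre-existing arrays; this is meant to mirror the diag/spherical accumulation):

```python
import numpy as np

# Assumed inputs: gamma is the (T, K) posterior matrix, X the (T, d) observations.
stats = {
    'post':   gamma.sum(axis=0),   # sum_t gamma_t(k)
    'obs':    gamma.T @ X,         # sum_t gamma_t(k) * x_t
    'obs**2': gamma.T @ X**2,      # sum_t gamma_t(k) * x_t^2 (diag/spherical case)
}
```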

MultinomialHMM

The HMM with discrete observations.

emissionprob_: emission probabilities

  • Sufficient statistics:

    obs: for estimating the emission probabilities
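A toy usage sketch, assuming the hmmlearn version contemporary with this post, in which MultinomialHMM consumes a column of symbol indices:

```python
import numpy as np
from hmmlearn import hmm

# One sequence of symbol indices in {0, 1, 2}.
X = np.array([[0], [1], [2], [1], [0], [2], [2], [1]])
model = hmm.MultinomialHMM(n_components=2, n_iter=100)
model.fit(X)
print(model.emissionprob_)  # shape (n_components, n_features)
```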

GMMHMM

Helper functions

_hmmc._compute_log_xi_sum: computes $\sum_t\xi_t$, used to estimate the transition matrix
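For reference, the summand is the standard pairwise posterior, computed in log space by the C helper:

$$\xi_t(i,j) \propto \alpha_t(i)\,A(i,j)\,p(x_{t+1}\mid Z_{t+1}=j,\theta)\,\beta_{t+1}(j).$$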

Source code walkthrough

The code below has been simplified: some sanity-check and normalization code is removed, and we assume a single sequence is being learned.

The fit method: the learning algorithm (Bayesian Baum-Welch)

```python
# Initialization
self._init(X, lengths=lengths)

for _ in range(self.n_iter):
    # E step: compute the sufficient statistics (iterating over the sequences)
    stats = self._initialize_sufficient_statistics()
    framelogprob = self._compute_log_likelihood(X)
    logprob, fwdlattice = self._do_forward_pass(framelogprob)
    bwdlattice = self._do_backward_pass(framelogprob)
    posteriors = self._compute_posteriors(fwdlattice, bwdlattice)
    # Accumulate the statistics
    self._accumulate_sufficient_statistics(
        stats, X, framelogprob, posteriors, fwdlattice, bwdlattice)
    # M step (`~` means: proportional to, then normalized)
    self.startprob_ ~ np.maximum(self.startprob_prior - 1 + stats['start'], 0)
    self.transmat_ ~ np.maximum(self.transmat_prior - 1 + stats['trans'], 0)
```
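Wrapping this loop through the public API, a minimal usage sketch (the data is random and only illustrates the call pattern):

```python
import numpy as np
from hmmlearn import hmm

X = np.random.randn(100, 2)                        # toy (T, n_features) sequence
model = hmm.GaussianHMM(n_components=3, n_iter=50)
model.fit(X)                                       # single sequence: no `lengths` needed
print(model.monitor_.converged, model.score(X))    # EM convergence flag, log-likelihood
```
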
GaussianHMM overrides _do_mstep to also update the means and covariances (simplified):

```python
def _do_mstep(self, stats):
    super()._do_mstep(stats)
    means_prior = self.means_prior
    means_weight = self.means_weight
    denom = stats['post'][:, None]

    # Update the means
    self.means_ = (means_weight * means_prior + stats['obs']) / (means_weight + denom)

    # Update the covariances
    covars_prior = self.covars_prior
    covars_weight = self.covars_weight
    meandiff = self.means_ - means_prior

    if self.covariance_type in {'spherical', 'diag'}:  # diagonal: features are independent
        # {c0 + w0*(mu-mu0)^2 + sum_t p_t*x_t^2 - 2*mu*sum_t p_t*x_t + mu^2*sum_t p_t}
        #     / {((cw - 1) ∨ 0) + sum_t p_t}
        c_n = (means_weight * meandiff**2
               + stats['obs**2']
               - 2 * self.means_ * stats['obs']
               + self.means_**2 * denom)
        c_d = max(covars_weight - 1, 0) + denom
        self._covars_ = (covars_prior + c_n) / np.maximum(c_d, 1e-5)
        if self.covariance_type == 'spherical':
            self._covars_ = np.tile(self._covars_.mean(1)[:, None],
                                    (1, self._covars_.shape[1]))
    elif self.covariance_type in ('tied', 'full'):
        c_n = np.empty((self.n_components, self.n_features,
                        self.n_features))
        for c in range(self.n_components):
            obsmean = np.outer(stats['obs'][c], self.means_[c])
            c_n[c] = (means_weight * np.outer(meandiff[c], meandiff[c])
                      + stats['obs*obs.T'][c]
                      - obsmean - obsmean.T
                      + np.outer(self.means_[c], self.means_[c])
                      * stats['post'][c])
        cvweight = max(covars_weight - self.n_features, 0)
        if self.covariance_type == 'tied':
            self._covars_ = ((covars_prior + c_n.sum(axis=0)) /
                             (cvweight + stats['post'].sum()))
        elif self.covariance_type == 'full':
            self._covars_ = ((covars_prior + c_n) /
                             (cvweight + stats['post'][:, None, None]))
```
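The means update is the usual MAP estimate under a normal prior with mean $\mu_0$ and precision (weight) $w_0$; in the code, stats['obs'][k] $=\sum_t\gamma_t(k)\,x_t$ and stats['post'][k] $=\sum_t\gamma_t(k)$:

$$\mu_k = \frac{w_0\,\mu_0 + \sum_t \gamma_t(k)\,x_t}{w_0 + \sum_t \gamma_t(k)}.$$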

Demo

```python
#!/usr/bin/env python3

"""AI composing by HMM

Word2Vec -> HMM
"""

import pathlib
import joblib

import numpy as np
from hmmlearn import hmm
from scipy.spatial import distance
from gensim.models import word2vec, Word2Vec


CORPUS_PATH = pathlib.Path('...') # the path of your corpus, could be a folder of txt files.
WV_PATH = pathlib.Path('wv.model')
HMM_PATH = pathlib.Path('hmm.model')

def read(path, level='p'):
    """Read text from files
    """
    if isinstance(path, str):
        path = pathlib.Path(path)
    if path.is_dir():
        if level == 'd':
            return [read(f, level=level) for f in path.iterdir()]
        elif level == 'p':
            res = []
            for f in path.iterdir():
                if f.suffix in {'.txt', '.md'}:
                    res.extend(read(f, level=level))
            return res
    else:
        if level == 'd':
            return path.read_text().rpartition('---')[-1]
        elif level == 'p':
            return [p for p in path.read_text().rpartition('---')[-1].split('\n') if len(p)>2]


def get_corpus(path=CORPUS_PATH):
    # Get the corpus (for gensim).
    # A corpus is a nested list of words: [['word', ...], ...]
    import jieba
    import jieba.posseg as pseg
    import logging
    jieba.setLogLevel(logging.INFO)

    def cut_flag(s):
        # Replace function words (identified by POS flag) with the flag symbol; keep other words.
        words = pseg.cut(s)
        return [word_flag.flag if word_flag.flag in {'u', 'w', 'x', 'y', 'z', 'un', 'o', 'p', 'uj'}
                else word_flag.word for word_flag in words]

    s_s = read(path)
    return [cut_flag(s) for s in s_s]

def ezfit(model, Xs):
    lengths = list(map(len, Xs))
    model.fit(np.vstack(Xs), lengths)

def similar_word(x, wv_model):
    k = np.argmin([distance.cosine(x, wv_model.wv[w]) for w in wv_model.wv.key_to_index])
    return wv_model.wv.index_to_key[k]

n_features = 20
o = np.random.random(n_features)  # fallback vector for out-of-vocabulary words

def train(model, path=CORPUS_PATH):
    # set of files -> set of sequences
    docs = get_corpus(path=path)

    if WV_PATH.exists():
        wv_model = Word2Vec.load(str(WV_PATH))
    else:
        # Passing `docs` to the constructor already builds the vocabulary and trains,
        # so a separate build_vocab call is not needed (it would reset the trained weights).
        wv_model = word2vec.Word2Vec(docs, vector_size=n_features, workers=4)
        wv_model.save(str(WV_PATH))
    Xs = [[(wv_model.wv[w] if (w in wv_model.wv.key_to_index) else o) for w in ws] for ws in docs]
    return ezfit(model, Xs)

if HMM_PATH.exists():
    model = joblib.load(HMM_PATH)
else:
    model = hmm.GaussianHMM(n_components=20, n_iter=1000, tol=0.0001)
    train(model)
    joblib.dump(model, HMM_PATH)


Y, Z = model.sample(n_samples=200)
wv_model = Word2Vec.load('wv.model')
for y in Y:
    print(similar_word(y, wv_model), end=' ')
```


Appendix

A PR I submitted has been accepted.
