Hidden Markov Models Explained (in English)

HMMs and MEMMs are both sequence classifiers. A sequence classifier or sequence labeler is a model whose job is to assign some label or class to each unit in a sequence.

Hidden Markov Models

Markov chain

We can view a Markov chain as a kind of probabilistic graphical model: a way of representing probabilistic assumptions in a graph. A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the previous state.

Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$

A Markov chain is specified by the following components:

  • $Q = q_1 q_2 \ldots q_N$ — a set of $N$ states

  • $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$ — a transition probability matrix $A$, each entry $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$

  • $q_0, q_F$ — a special start state and end state which are not associated with observations

A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world. A hidden Markov model allows us to talk about both observed events and hidden events that we think of as causal factors in our probabilistic model.
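As a concrete illustration, the probability of a state sequence under a first-order Markov chain is just a chain of transition probabilities. Here is a minimal Python sketch with an invented two-state weather chain (the states and numbers are made up for illustration):

```python
import numpy as np

# Hypothetical two-state weather chain: 0 = HOT, 1 = COLD.
# A[i, j] is the probability of moving from state i to state j.
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])  # start probabilities, a_{0j}

def chain_probability(states):
    """P(q_1 ... q_T) = pi[q_1] * prod_t A[q_{t-1}, q_t]."""
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev, cur]
    return p

print(chain_probability([0, 0, 1]))  # P(HOT, HOT, COLD) = 0.5 * 0.7 * 0.3
```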

Hidden Markov Models

An HMM is specified by the following components (a concrete array sketch follows the list):

  • $Q = q_1 q_2 \ldots q_N$ — a set of $N$ states
  • $A = a_{01} a_{02} \ldots a_{n1} \ldots a_{nn}$ — a transition probability matrix $A$, each entry $a_{ij}$ representing the probability of moving from state $i$ to state $j$, s.t. $\sum_{j=1}^{n} a_{ij} = 1 \;\; \forall i$
  • $O = o_1 o_2 \ldots o_T$ — a sequence of $T$ observations, each one drawn from a vocabulary $V = v_1, v_2, \ldots, v_V$
  • $B = b_i(o_t)$ — a sequence of observation likelihoods, also called emission probabilities, each expressing the probability of an observation $o_t$ being generated from a state $i$
  • $q_0, q_F$ — a special start state and end state which are not associated with observations, together with transition probabilities $a_{01} a_{02} \ldots a_{0n}$ out of the start state and $a_{1F} a_{2F} \ldots a_{nF}$ into the end state
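A minimal way to hold these components in code is a handful of NumPy arrays. In the sketch below the sizes and numbers are invented for illustration, and the non-emitting states $q_0$ and $q_F$ are folded into a start vector `pi` and an end vector `eta`:

```python
import numpy as np

N, V = 2, 3  # hypothetical sizes: 2 hidden states, 3 vocabulary symbols

A   = np.array([[0.6, 0.3],        # a_ij: transitions among emitting states
                [0.3, 0.6]])
pi  = np.array([0.8, 0.2])         # a_0j: transitions out of the start state q_0
eta = np.array([0.1, 0.1])         # a_iF: transitions into the end state q_F
B   = np.array([[0.5, 0.4, 0.1],   # b_i(v_k): emission probabilities
                [0.1, 0.3, 0.6]])

# With an explicit end state, each row of A plus the matching a_iF sums to 1.
assert np.allclose(A.sum(axis=1) + eta, 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```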

A first-order Hidden Markov Model instantiates two simplifying assumptions (illustrated in the sketch after this list):

  • Markov Assumption: $P(q_i \mid q_1 \ldots q_{i-1}) = P(q_i \mid q_{i-1})$ — as with a first-order Markov chain, the probability of a particular state depends only on the previous state
  • Output Independence Assumption: $P(o_i \mid q_1 \ldots q_i, \ldots, q_T, o_1, \ldots, o_i, \ldots, o_T) = P(o_i \mid q_i)$ — the probability of an output observation $o_i$ depends only on the state that produced the observation, $q_i$, and not on any other states or any other observations
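Together, the two assumptions mean that the joint probability of a state sequence $Q$ and an observation sequence $O$ factorizes into one transition term and one emission term per time step. A minimal sketch, reusing the hypothetical `A`, `B`, `pi`, `eta` arrays from the previous block:

```python
def joint_probability(Q, O, A, B, pi, eta):
    """P(O, Q | lambda): one transition and one emission factor per step,
    which is exactly what the two assumptions above license."""
    p = pi[Q[0]] * B[Q[0], O[0]]              # leave q_0, emit o_1
    for t in range(1, len(Q)):
        p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
    return p * eta[Q[-1]]                     # transition into q_F

# e.g. joint_probability([0, 1, 1], [2, 0, 1], A, B, pi, eta)
```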

Types of HMM:

  • Fully-connected or ergodic HMM: there is a non-zero probability of transitioning between any two states
  • Bakis HMM: many of the transitions between states have zero probability and the state transitions proceed from left to right

Hidden Markov Models should be characterized by three fundamental problems:

  • Problem 1 (Computing Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$
  • Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$
  • Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$

Computing Likelihood: the Forward Algorithm

An efficient ($O(N^2 T)$) algorithm called the forward algorithm is a kind of dynamic programming. It computes the observation probability by summing over the probabilities of all possible hidden state paths that could generate the observation sequence, but it does so efficiently by implicitly folding each of these paths into a single forward trellis.

Each cell of the forward algorithm trellis, $\alpha_t(j)$, represents the probability of being in state $j$ after seeing the first $t$ observations, given the automaton $\lambda$:

$\alpha_t(j) = P(o_1, o_2, \ldots o_t, q_t = j \mid \lambda) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$

  • $\alpha_{t-1}(i)$ — the previous forward path probability from the previous time step
  • $a_{ij}$ — the transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ — the state observation likelihood of the observation symbol $o_t$ given the current state $j$

Formal definition of the forward algorithm

  1. Initialization: $\alpha_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$
  2. Recursion: $\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$
  3. Termination: $P(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^{N} \alpha_T(i)\, a_{iF}$

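These three steps translate almost line for line into NumPy. The sketch below assumes the hypothetical `A`, `B`, `pi`, `eta` arrays from the earlier block and stores the trellis as a $T \times N$ array:

```python
import numpy as np

def forward(O, A, B, pi, eta):
    """Forward algorithm: P(O | lambda) in O(N^2 T) time.
    O is a list of observation indices."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialization: a_0j b_j(o_1)
    for t in range(1, T):                           # recursion:
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]  #   sum_i alpha_{t-1}(i) a_ij b_j(o_t)
    return alpha[-1] @ eta                          # termination: sum_i alpha_T(i) a_iF

# e.g. forward([0, 2, 1], A, B, pi, eta)
```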

Decoding: the Viterbi Algorithm

For any model, such as an HMM, that contains hidden variables, the task of determining which sequence of variables is the underlying source of some sequence of observations is called the decoding task.

Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$.

The most common decoding algorithm for HMMs is the Viterbi algorithm. Like the forward algorithm, Viterbi is a kind of dynamic programming and makes use of a dynamic programming trellis.

Each cell of the Viterbi trellis, $v_t(j)$, represents the probability that the HMM is in state $j$ after seeing the first $t$ observations and passing through the most probable state sequence $q_0, q_1, \ldots, q_{t-1}$, given the automaton $\lambda$:

$v_t(j) = \max\limits_{q_0, q_1, \ldots, q_{t-1}} P(q_0, q_1, \ldots q_{t-1}, o_1, o_2, \ldots o_t, q_t = j \mid \lambda) = \max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)$

  • $v_{t-1}(i)$ — the previous Viterbi path probability from the previous time step
  • $a_{ij}$ — the transition probability from previous state $q_i$ to current state $q_j$
  • $b_j(o_t)$ — the state observation likelihood of the observation symbol $o_t$ given the current state $j$

Note that the Viterbi algorithm is identical to the forward algorithm except that it takes the max over the previous path probabilities where the forward algorithm takes the sum. The Viterbi algorithm also keeps backpointers: it computes the best state sequence by keeping track of the path of hidden states that led to each state and then, at the end, tracing back the best path to the beginning (the Viterbi backtrace).

Formal definition of the Viterbi algorithm

  1. Initialization:

    $v_1(j) = a_{0j}\, b_j(o_1), \;\; 1 \leq j \leq N$

    $bt_1(j) = 0$

  2. Recursion (recall that states $q_0$ and $q_F$ are non-emitting):

    $v_t(j) = \max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$

    $bt_t(j) = \arg\max\limits_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t), \;\; 1 \leq j \leq N,\; 1 < t \leq T$

  3. Termination:

    The best score: $P^{*} = v_T(q_F) = \max\limits_{i=1}^{N} v_T(i) \cdot a_{i,F}$

    The start of backtrace: $q_T^{*} = bt_T(q_F) = \arg\max\limits_{i=1}^{N} v_T(i) \cdot a_{i,F}$
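The Viterbi definition translates into code just as directly; this sketch (same hypothetical array conventions as before) keeps a backpointer table and reconstructs the best path at the end:

```python
import numpy as np

def viterbi(O, A, B, pi, eta):
    """Viterbi decoding: returns (best state sequence, best score)."""
    N, T = A.shape[0], len(O)
    v = np.zeros((T, N))
    bt = np.zeros((T, N), dtype=int)            # backpointers
    v[0] = pi * B[:, O[0]]                      # initialization
    for t in range(1, T):                       # recursion: max where forward summed
        scores = v[t - 1][:, None] * A          # scores[i, j] = v_{t-1}(i) * a_ij
        bt[t] = scores.argmax(axis=0)
        v[t] = scores.max(axis=0) * B[:, O[t]]
    best_last = int(np.argmax(v[-1] * eta))     # start of the backtrace
    best_score = v[-1, best_last] * eta[best_last]
    path = [best_last]
    for t in range(T - 1, 0, -1):               # Viterbi backtrace
        path.append(int(bt[t, path[-1]]))
    return path[::-1], best_score
```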

Training HMMs: the Forward-Backward Algorithm

Learning: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.

The standard algorithm for HMM training is the forward-backward or Baum-Welch algorithm, a special case of the Expectation-Maximization (EM) algorithm. The algorithm will let us train both the transition probabilities $A$ and the emission probabilities $B$ of the HMM.

Let us begin by considering the much simpler case of training a Markov chain rather than an HMM. Since the states in a Markov chain are observed and it has no emission probabilities $B$, we can view a Markov chain as a degenerate HMM where all the $b$ probabilities are 1.0 for the observed symbol and 0 for all other symbols. Thus the only probabilities we need to train are those in the transition probability matrix $A$.

We get the maximum likelihood estimate of the probability $a_{ij}$ of a particular transition between states $i$ and $j$ by counting the number of times the transition was taken, which we call $C(i \rightarrow j)$, and then normalizing by the total count of all the times we took any transition from state $i$:

$a_{ij} = \dfrac{C(i \rightarrow j)}{\sum_{q \in Q} C(i \rightarrow q)}$
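In code, this maximum likelihood estimate is just two counting passes. A minimal sketch over fully observed state sequences (the input format here is invented for illustration):

```python
from collections import Counter

def mle_transitions(sequences):
    """a_ij = C(i -> j) / sum_q C(i -> q), from observed state sequences."""
    bigram, unigram = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            bigram[(prev, cur)] += 1            # C(i -> j)
            unigram[prev] += 1                  # sum_q C(i -> q)
    return {(i, j): c / unigram[i] for (i, j), c in bigram.items()}

print(mle_transitions([["HOT", "HOT", "COLD"], ["COLD", "HOT", "HOT"]]))
# e.g. a(HOT -> HOT) = 2/3, a(HOT -> COLD) = 1/3, a(COLD -> HOT) = 1.0
```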

For an HMM we cannot compute these counts directly from an observation sequence, since we don’t know which path of states was taken through the machine for a given input.

The Baum-Welch algorithm uses two neat intuitions to solve this problem.

  1. The first idea is to iteratively estimate the counts. We will start with an estimate for the transition and observation probabilities, and then use these estimated probabilities to derive better and better probabilities.
  2. The second idea is that we get our estimated probabilities by computing the forward probability for an observation and then dividing that probability mass among all the different paths that contributed to this forward probability.

Backward probability

The backward probability $\beta$ is the probability of seeing the observations from time $t+1$ to the end, given that we are in state $i$ at time $t$ (and given the automaton $\lambda$):

$\beta_t(i) = P(o_{t+1}, o_{t+2} \ldots o_T \mid q_t = i, \lambda)$

Formal definition of the backward algorithm

  1. Initialization: $\beta_T(i) = a_{i,F}, \;\; 1 \leq i \leq N$

  2. Recursion (again, since states $q_0$ and $q_F$ are non-emitting):

    $\beta_t(i) = \sum\limits_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \;\; 1 \leq i \leq N,\; 1 \leq t < T$

  3. Termination:

    $P(O \mid \lambda) = \alpha_T(q_F) = \beta_1(0) = \sum\limits_{j=1}^{N} a_{0j}\, b_j(o_1)\, \beta_1(j)$
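Mirroring the forward sketch earlier, the backward lattice is filled from right to left; again this assumes the hypothetical `A`, `B`, `pi`, `eta` arrays:

```python
import numpy as np

def backward(O, A, B, pi, eta):
    """Backward algorithm: returns the beta lattice and P(O | lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = eta                                   # initialization: beta_T(i) = a_iF
    for t in range(T - 2, -1, -1):                   # recursion, right to left:
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1]) #   sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    p_obs = pi @ (B[:, O[0]] * beta[0])              # termination: sum_j a_0j b_j(o_1) beta_1(j)
    return beta, p_obs
```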

We are now ready to understand how the forward and backward probabilities can help us compute the transition probability a i j a_{ij} aij and the observation probability b i ( o t ) b_i(o_t) bi(ot) from an observation sequence, even though the actual path taken through the machine is hidden.

Transition Probability Matrix

Let’s begin by showing how to estimate $\hat{a}_{ij}$:

$\hat{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$

How do we compute the numerator? Here is the intuition. Suppose we had some estimate of the probability that a given transition $i \rightarrow j$ was taken at a particular point in time $t$ in the observation sequence. If we knew this probability for each particular time $t$, we could sum over all times $t$ to estimate the total count for the transition $i \rightarrow j$.

Formally, let’s define the probability $\xi_t$ as the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the observation sequence and, of course, the model:

$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$

In detail:

$\left.\begin{aligned} &\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)\\ &\text{not-quite-}\xi_t(i, j) = P(q_t = i, q_{t+1} = j, O \mid \lambda) = \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)\\ &P(O \mid \lambda) = \alpha_T(N) = \beta_T(1) = \sum\limits_{j=1}^{N} \alpha_t(j)\, \beta_t(j)\\ &\text{laws of probability: } P(Q \mid O, \lambda) = \frac{P(Q, O \mid \lambda)}{P(O \mid \lambda)} \end{aligned}\right\} \Rightarrow \xi_t(i, j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\alpha_T(N)}$

The expected number of transitions from state $i$ to state $j$ is then the sum over all $t$ of $\xi$, so here is the final formula for $\hat{a}_{ij}$:

$\hat{a}_{ij} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}$

Observation Probability Matrix

This is the probability of a given symbol $v_k$ from the observation vocabulary $V$, given a state $j$: $\hat{b}_j(v_k)$.

$\hat{b}_j(v_k) = \dfrac{\text{expected number of times in state } j \text{ and observing symbol } v_k}{\text{expected number of times in state } j}$

For this we will need to know the probability of being in state $j$ at time $t$, which we call $\gamma_t(j)$:

$\gamma_t(j) = P(q_t = j \mid O, \lambda) = \dfrac{P(q_t = j, O \mid \lambda)}{P(O \mid \lambda)} = \dfrac{\alpha_t(j)\, \beta_t(j)}{P(O \mid \lambda)}$

We are now ready to compute $\hat{b}$. For the numerator, we sum $\gamma_t(j)$ for all time steps $t$ in which the observation $o_t$ is the symbol $v_k$ that we are interested in. For the denominator, we sum $\gamma_t(j)$ over all time steps $t$. The result is the percentage of the times that we were in state $j$ and saw symbol $v_k$:

$\hat{b}_j(v_k) = \dfrac{\sum_{t=1 \text{ s.t. } o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$

We now have ways to re-estimate the transition probabilities $A$ and the observation probabilities $B$ from an observation sequence $O$, assuming that we already have a previous estimate of $A$ and $B$.

The Forward-Backward algorithm

The forward-backward algorithm starts with some initial estimate of the HMM parameters $\lambda = (A, B)$ and then iteratively runs two steps. Like other instances of the EM algorithm, the forward-backward algorithm alternates an expectation step (the E-step) and a maximization step (the M-step).

In the E-step, we compute the expected state occupancy count $\gamma$ and the expected state transition count $\xi$ from the current $A$ and $B$ probabilities. In the M-step, we use $\gamma$ and $\xi$ to recompute new $A$ and $B$ probabilities. A sketch of one such iteration follows.
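Putting the pieces together, one E-step plus M-step can be written as a single function. The sketch below handles a single observation sequence and, for simplicity, holds `pi` and `eta` fixed (they can be re-estimated in the same way); it reuses the hypothetical array conventions from the earlier sketches:

```python
import numpy as np

def baum_welch_step(O, A, B, pi, eta):
    """One EM iteration: E-step computes gamma and xi under the current
    parameters; M-step re-estimates A and B from those expected counts."""
    N, T = A.shape[0], len(O)
    O = np.asarray(O)
    # E-step: forward and backward lattices under the current parameters.
    alpha, beta = np.zeros((T, N)), np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    beta[-1] = eta
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    PO = alpha[-1] @ eta                        # P(O | lambda)
    gamma = alpha * beta / PO                   # gamma_t(j), shape (T, N)
    # xi_t(i, j) = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O | lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, O[1:]].T * beta[1:])[:, None, :]) / PO
    # M-step: a_ij <- expected i->j transitions / expected transitions out of i,
    #         b_j(v_k) <- expected time in j seeing v_k / expected time in j.
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.stack([gamma[O == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]
    return A_new, B_new
```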
P.S. These notes are mainly for my own review, transcribed from a book on ASR. I think it is the clearest explanation of HMMs I have read.

Thanks.
