Links to previous posts

Table of contents

- Links to previous posts
- Conditional independence
- Settings of the Hidden Markov Model (HMM)
- Useful probabilities $p(z_k \mid x)$ and $p(z_{k+1}, z_k \mid x)$
- Three fundamental problems of HMM
  - Problem 1 (Likelihood)
  - Problem 2 (Learning)
  - Problem 3 (Inference)
- Links to previous posts

Links to previous posts
Before reading this post, make sure you are familiar with the EM algorithm and have a decent amount of knowledge of convex optimization. If not, please check out my previous posts.
Let’s get started!
Conditional independence
$A$ and $B$ are conditionally independent given $C$ if and only if, given the knowledge that $C$ occurs, knowing whether $A$ occurs provides no information on the likelihood of $B$ occurring, and knowing whether $B$ occurs provides no information on the likelihood of $A$ occurring.

Formally, if we denote the conditional independence of $A$ and $B$ given $C$ by $(A \perp\!\!\!\perp B) \mid C$, then by definition we have

$$(A \perp\!\!\!\perp B) \mid C \quad \iff \quad P(A, B \mid C) = P(A \mid C) \cdot P(B \mid C)$$

Given the knowledge that $C$ occurs, to show that knowing whether $B$ occurs provides no information on the likelihood of $A$ occurring, we have

$$\begin{aligned} P(A \mid B, C) &= \frac{P(A, B, C)}{P(B, C)} \\ &= \frac{P(A, B \mid C) \cdot P(C)}{P(B, C)} \\ &= \frac{P(A \mid C) \cdot P(B \mid C) \cdot P(C)}{P(B \mid C) \cdot P(C)} \\ &= P(A \mid C) \end{aligned}$$
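The derivation above can be checked numerically. Below is a minimal sketch on a toy joint distribution whose values are hypothetical, constructed so that $A$ and $B$ are conditionally independent given $C$:

```python
# Toy conditional distributions (hypothetical numbers).
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}  # p_a_given_c[c][a]
p_b_given_c = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}  # p_b_given_c[c][b]

# Joint P(A, B, C) built to satisfy the conditional-independence factorization.
joint = {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

def p_a_given_bc(a, b, c):
    # P(A | B, C) = P(A, B, C) / P(B, C)
    p_bc = sum(joint[(a2, b, c)] for a2 in (0, 1))
    return joint[(a, b, c)] / p_bc

# The derivation predicts P(A | B, C) = P(A | C) for every a, b, c.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            assert abs(p_a_given_bc(a, b, c) - p_a_given_c[c][a]) < 1e-12
```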
Two classical cases where $X$ and $Z$ are conditionally independent

Case 1:
![](https://i-blog.csdnimg.cn/blog_migrate/7cdca8f1e7df4488deb0ca2fa9930f23.png)
From the above directed graph, we have $P(X, Y, Z) = P(X) \cdot P(Y \mid X) \cdot P(Z \mid Y)$. Hence we have

$$\begin{aligned} P(Z \mid X, Y) &= \frac{P(X, Y, Z)}{P(X, Y)} \\ &= \frac{P(X) \cdot P(Y \mid X) \cdot P(Z \mid Y)}{P(X) \cdot P(Y \mid X)} \\ &= P(Z \mid Y) \end{aligned}$$

Therefore, $X$ and $Z$ are conditionally independent given $Y$.
Case 2:
![](https://i-blog.csdnimg.cn/blog_migrate/52d1a7d0bdbe6032cf4a88661ee277dd.png)
From the above directed graph, we have $P(X, Y, Z) = P(Y) \cdot P(X \mid Y) \cdot P(Z \mid Y)$. Hence we have

$$\begin{aligned} P(Z \mid X, Y) &= \frac{P(X, Y, Z)}{P(X, Y)} \\ &= \frac{P(Y) \cdot P(X \mid Y) \cdot P(Z \mid Y)}{P(Y) \cdot P(X \mid Y)} \\ &= P(Z \mid Y) \end{aligned}$$

Therefore, $X$ and $Z$ are conditionally independent given $Y$.
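The chain case can also be checked numerically. This sketch builds a joint for the Case-1 graph $X \to Y \to Z$ from hypothetical conditional probability tables and confirms that $P(Z \mid X, Y)$ collapses to $P(Z \mid Y)$:

```python
# Hypothetical CPTs for the chain X -> Y -> Z.
p_x = {0: 0.3, 1: 0.7}
p_y_given_x = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.2, 1: 0.8}}  # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_z_given_y[y][z]

# Joint P(X, Y, Z) = P(X) * P(Y|X) * P(Z|Y), as read off the graph.
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def p_z_given_xy(x, y, z):
    # P(Z | X, Y) = P(X, Y, Z) / P(X, Y)
    p_xy = sum(joint[(x, y, z2)] for z2 in (0, 1))
    return joint[(x, y, z)] / p_xy

# As derived, P(Z | X, Y) = P(Z | Y) regardless of x.
for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            assert abs(p_z_given_xy(x, y, z) - p_z_given_y[y][z]) < 1e-12
```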
Settings of the Hidden Markov Model (HMM)
The HMM is based on augmenting the Markov chain. A Markov chain is a model that tells us about the probabilities of sequences of random variables, or states, each of which can take on a value from some set. A Markov chain makes a very strong assumption: to predict the future of the sequence, all that matters is the current state.
To put it formally, suppose we have a sequence of state variables $z_1, z_2, \ldots, z_n$. Then the Markov assumption is

$$p(z_n \mid z_1 z_2 \ldots z_{n-1}) = p(z_n \mid z_{n-1})$$
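The Markov assumption shows up directly when sampling from a chain: drawing $z_n$ only requires $z_{n-1}$, so one row of the transition matrix suffices at each step. A minimal sketch with a hypothetical two-state transition matrix:

```python
import random

A = [[0.9, 0.1],   # A[i][j] = p(z_{t+1} = j | z_t = i), hypothetical values
     [0.3, 0.7]]
pi = [0.5, 0.5]    # initial distribution

def sample_chain(n, rng):
    """Sample n states from the two-state Markov chain (A, pi)."""
    z = rng.choices([0, 1], weights=pi)[0]
    states = [z]
    for _ in range(n - 1):
        # Only the current state matters -- this is the Markov assumption.
        z = rng.choices([0, 1], weights=A[z])[0]
        states.append(z)
    return states

states = sample_chain(10, random.Random(0))
assert len(states) == 10 and all(s in (0, 1) for s in states)
```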
A Markov chain is useful when we need to compute a probability for a sequence of observable events. However, in many cases the events we are interested in are hidden. For example we don’t normally observe part-of-speech (POS) tags in a text. Rather, we see words, and must infer the tags from the word sequence. We call the tags hidden because they are not observed.
![](https://i-blog.csdnimg.cn/blog_migrate/d41721784ba7b61c7ab64e8aa1e8a54e.png)
A hidden Markov model (HMM) allows us to talk about both observed events (like words that we see in the input) and hidden events (like part-of-speech tags) that we think of as causal factors in our probabilistic model. An HMM is specified by the following components:
- A sequence of hidden states $z$, where $z_k$ takes values from the set of all possible hidden states $Z = \{1, 2, \ldots, m\}$.
- A sequence of observations $x = (x_1, x_2, \ldots, x_n)$, each drawn from a vocabulary $V$.
- A transition probability matrix $A$, an $m \times m$ matrix where $A_{ij}$ represents the probability of moving from state $i$ to state $j$: $A_{ij} = p(z_{t+1} = j \mid z_t = i)$, with $\sum_{j=1}^{m} A_{ij} = 1$ for all $i$.
- An emission probability matrix $B$, an $m \times |V|$ matrix where $B_{ij}$ represents the probability of observation $V_j$ being generated from state $i$: $B_{ij} = P(x_t = V_j \mid z_t = i)$.
- An initial probability distribution $\pi = (\pi_1, \pi_2, \ldots, \pi_m)$ over states, where $\pi_i$ is the probability that the Markov chain starts in state $i$, with $\sum_{i=1}^{m} \pi_i = 1$.
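The components above can be written down as plain data structures. This sketch uses hypothetical numbers, with $m = 2$ hidden states and a three-symbol vocabulary, and checks the row-sum constraints on $A$, $B$, and $\pi$:

```python
m = 2                      # number of hidden states
V = ["a", "b", "c"]        # vocabulary

A = [[0.7, 0.3],           # transition matrix, A[i][j] = p(z_{t+1}=j | z_t=i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],      # emission matrix, B[i][j] = p(x_t = V[j] | z_t=i)
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]            # initial state distribution

# Each row of A, each row of B, and pi must be a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in A)
assert all(abs(sum(row) - 1.0) < 1e-12 for row in B)
assert abs(sum(pi) - 1.0) < 1e-12
```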
Given a sequence $x$ and the corresponding hidden states $z$ (like the one in the picture above), we have

$$P(x, z \mid \theta) = p(z_1) \cdot \left[ p(z_2 \mid z_1) \cdot p(z_3 \mid z_2) \cdots p(z_n \mid z_{n-1}) \right] \cdot \left[ p(x_1 \mid z_1) \cdot p(x_2 \mid z_2) \cdots p(x_n \mid z_n) \right] \tag{0}$$

We get $p(z_1)$ from $\pi$, $p(z_{k+1} \mid z_k)$ from $A$, and $p(x_k \mid z_k)$ from $B$.
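Equation $(0)$ is just a product of one $\pi$ entry, $n-1$ entries of $A$, and $n$ entries of $B$. A minimal sketch, using the same kind of hypothetical parameters as above:

```python
V = ["a", "b", "c"]                     # hypothetical vocabulary
A = [[0.7, 0.3], [0.4, 0.6]]            # A[i][j] = p(z_{t+1}=j | z_t=i)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # B[i][j] = p(x_t = V[j] | z_t=i)
pi = [0.6, 0.4]

def joint_prob(x, z):
    """P(x, z | theta) per equation (0): p(z_1) from pi, transitions from A,
    emissions from B."""
    p = pi[z[0]] * B[z[0]][V.index(x[0])]
    for k in range(1, len(x)):
        p *= A[z[k - 1]][z[k]] * B[z[k]][V.index(x[k])]
    return p

p = joint_prob(["a", "b"], [0, 1])
# = pi[0] * B[0][a] * A[0][1] * B[1][b] = 0.6 * 0.5 * 0.3 * 0.3
assert abs(p - 0.027) < 1e-12
```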
Useful probabilities $p(z_k \mid x)$ and $p(z_{k+1}, z_k \mid x)$
$p(z_k \mid x)$ and $p(z_{k+1}, z_k \mid x)$ are useful probabilities that we are going to use later.

Intuition: once we have a sequence $x$, we might be interested in finding the probability of any hidden state $z_k$, i.e., the probabilities $p(z_k = 1 \mid x), p(z_k = 2 \mid x), \ldots, p(z_k = m \mid x)$. We have the following:

$$\begin{aligned} p(z_k \mid x) &= \frac{p(z_k, x)}{p(x)} && (1) \\ &\propto p(z_k, x) && (2) \end{aligned}$$

Note that from $(1)$ to $(2)$: since $p(x)$ does not change over the values of $z_k$, $p(z_k \mid x)$ is proportional to $p(z_k, x)$.
$$\begin{aligned} p(z_k = i, x) &= p(z_k = i, x_{1:k}, x_{k+1:n}) \\ &= p(z_k = i, x_{1:k}) \cdot p(x_{k+1:n} \mid z_k = i, x_{1:k}) && (3) \\ &= p(z_k = i, x_{1:k}) \cdot p(x_{k+1:n} \mid z_k = i) && (4.1) \\ &= \alpha_k(z_k = i) \cdot \beta_k(z_k = i) && (4.11) \end{aligned}$$
![](https://i-blog.csdnimg.cn/blog_migrate/431dde733ce39883a738159b30c2c31a.png)
From the above graph, we see that the second factor in $(3)$ matches the second classical case, so $x_{k+1:n}$ and $x_{1:k}$ are conditionally independent given $z_k$. This is why we can go from $(3)$ to $(4.1)$. We are going to use the Forward Algorithm to compute $p(z_k, x_{1:k})$, and the Backward Algorithm to compute $p(x_{k+1:n} \mid z_k)$, later.
We denote $p(z_k, x_{1:k})$ by $\alpha_k(z_k)$ and $p(x_{k+1:n} \mid z_k)$ by $\beta_k(z_k)$.
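The split in $(4.1)$ can be verified by brute force on a tiny HMM: sum equation $(0)$ over all hidden sequences with $z_k = i$, and compare against $\alpha_k(i) \cdot \beta_k(i)$, each computed by direct marginalization. All parameters below are hypothetical:

```python
from itertools import product

V = ["a", "b", "c"]
A = [[0.7, 0.3], [0.4, 0.6]]            # A[i][j] = p(z_{t+1}=j | z_t=i)
B = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # B[i][j] = p(x_t = V[j] | z_t=i)
pi = [0.6, 0.4]

def joint_prob(x, z):
    """P(x, z | theta) per equation (0)."""
    p = pi[z[0]] * B[z[0]][V.index(x[0])]
    for t in range(1, len(x)):
        p *= A[z[t - 1]][z[t]] * B[z[t]][V.index(x[t])]
    return p

x = ["a", "c", "b"]
k, i = 2, 1  # condition on z_k = i with k = 2 (1-based)

# Left-hand side: p(z_k=i, x), marginalizing over every z with z_k = i.
lhs = sum(joint_prob(x, z) for z in product((0, 1), repeat=len(x))
          if z[k - 1] == i)

# alpha_k(i) = p(z_k=i, x_{1:k}), marginalizing the length-k prefix.
alpha = sum(joint_prob(x[:k], z) for z in product((0, 1), repeat=k)
            if z[k - 1] == i)

# beta_k(i) = p(x_{k+1:n} | z_k=i): sum transition/emission products
# over all suffix state sequences, starting from z_k = i.
beta = 0.0
for z in product((0, 1), repeat=len(x) - k):
    p, prev = 1.0, i
    for t, zt in enumerate(z):
        p *= A[prev][zt] * B[zt][V.index(x[k + t])]
        prev = zt
    beta += p

# The conditional-independence argument says these agree exactly.
assert abs(lhs - alpha * beta) < 1e-12
```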