Hidden Markov Model
Model Definition
Hidden Markov Model(HMM) contains three different sets of parameters:
Transition Matrix A i j ∈ R N × N A_{ij}\in \mathbb{R}^{N\times N} Aij∈RN×N,
Emission Matrix B j k ∈ R N × M B_{jk}\in \mathbb{R}^{N\times M} Bjk∈RN×M,
Initial Probability Distribution π i ∈ R N × 1 \pi_i\in \mathbb{R}^{N\times 1} πi∈RN×1,
where N N N denotes #hidden states z z z, M M M denotes #observable states x x x.
In an HMM, we model the sequence as the result of a series of observations of a series of hidden states. Throughout the process, transition among hidden states are modeled by transition matrix A A A, and follows Markov Chain properties. In an observation of state z j z_j zj, there is a chance of B j k B_{jk} Bjk that we will observe the k k k-th observable state. The initial probability is described by π \pi π.
Maths and Viterbi Algorithm
A common question raised in the setting of an HMM is, Given a series of observations, what are the most likely sequence of the hidden states? To answer this question, we first recall Bayesian Rules, i.e.
P ( z ∣ x ) = P ( x , z ) P ( x ) ∝ P ( x , z ) = P ( x ∣ z ) P ( z ) \mathcal{P}(z|x) = \frac{\mathcal{P}(x, z)}{\mathcal{P}(x)}\propto\mathcal{P}(x, z)=\mathcal{P}(x|z)\mathcal{P}(z) P(z∣x)=P(x)P(x,z)∝P(x,z)=P(x∣z)P(z)
Therefore, as we want to use Maximize A Posteriori(MAP) estimation, we need to maximize the R.H.S. Based on laws derived on Probability Graph Models, for an HMM, we have
P ( z ) = π z 1 Π i = 1 T − 1 A z i z i + 1 \mathcal{P}(z)=\pi_{z_1}\Pi_{i=1}^{T-1}A_{z_i z_{i+1}} P(z)=πz1Πi=1T−1Azizi+1
, where T T T is the length of the sequence, and
P ( x ∣ z ) = Π i = 1 T B z i x i \mathcal{P}(x|z) = \Pi_{i=1}^T B_{z_i x_i} P(x∣z)=Πi=1TBzixi
. Thus, our optimization goal is
max z π z 1 Π i = 1 T − 1 A z i z i + 1 Π i = 1 T B z i x i \max_z \pi_{z_1}\Pi_{i=1}^{T-1}A_{z_i z_{i+1}}\Pi_{i=1}^T B_{z_i x_i} zmaxπz1Πi=1T−1Azizi+1Πi=1TBzixi
With some insights, this optimization can be solved using the methodology of Dynamic Programming(DP), since it satisfies the two conditions of DP, i.e. optimal substructure and overlapping subproblems.
Consider the prefix of the whole sequence of length t > 1 t>1 t>1, we rewrite the subproblem as
max z π z 1 Π i = 1 t − 1 A z i z i + 1 Π i = 1 t B z i x i = max z t , z t − 1 { A z t − 1 z t B z t x t max z 1 : t − 1 { π z 1 Π i = 1 t − 2 A z i z i + 1 Π i = 1 t − 1 B z i x i } } \max_z \pi_{z_1}\Pi_{i=1}^{t-1}A_{z_i z_{i+1}}\Pi_{i=1}^t B_{z_i x_i} = \max_{z_t, z_{t-1}} \{A_{z_{t-1}z_t}B_{z_t x_t}\max_{z_{1:t-1}} \{\pi_{z_1}\Pi_{i=1}^{t-2}A_{z_i z_{i+1}}\Pi_{i=1}^{t-1} B_{z_i x_i}\}\} zmaxπz1Πi=1t−1Azizi+1Πi=1tBzixi=zt,zt−1max{Azt−1ztBztxtz1:t−1max{πz1Πi=1t−2Azizi+1Πi=1t−1Bzixi}}
Hence we get an iterative form of the optimization problem. For the sake of both computability and compactness, we take the logarithm on R.H.S, and get
max z π z 1 Π i = 1 t − 1 A z i z i + 1 Π i = 1 t B z i x i = max z t , z t − 1 { log A z t − 1 z t + log B z t x t + max z 1 : t − 1 { log π z 1 + Σ i = 1 t − 2 log A z i z i + 1 + Σ i = 1 t − 1 log B z i x i } } \max_z \pi_{z_1}\Pi_{i=1}^{t-1}A_{z_i z_{i+1}}\Pi_{i=1}^t B_{z_i x_i} = \max_{z_t, z_{t-1}} \{\log A_{z_{t-1}z_t}+\log B_{z_t x_t} + \max_{z_{1:t-1}} \{\log \pi_{z_1}+\Sigma_{i=1}^{t-2}\log A_{z_i z_{i+1}}+\Sigma_{i=1}^{t-1} \log B_{z_i x_i}\}\} zmaxπz1Πi=1t−1Azizi+1Πi=1tBzixi=zt,zt−1max{logAzt−1zt+logBztxt+z1:t−1max{logπz1+Σi=1t−2logAzizi+1+Σi=1t−1logBzixi}}
This is an apposite form for solving the problem. The process of iteratively solve the problem is named after Viterbi.
Implementation of Hidden Markov Model
We implement the basic behavior of an HMM, which is the generation of a succession of hidden states and observations, and the Viterbi Algorithm, which is the basic solution to an HMM.
For the generation, all we need to do is to do experiments from a series of multinoulli distributions portrayed by the three parameters.
For the Viterbi Algorithm, we utilize two matrices, one recording the maximum log-likelihood of a hidden state sequence ending at time t t t with z t = n z_t=n zt=n, we denote this matrix as H n t ∈ R N × T \mathcal{H}_{nt}\in\mathbb{R}^{N\times T} Hnt∈RN×T. The other matrix is prepared for recording the precursors of each hidden state at time t t t in order to reach the maximum log-likelihood at the very time-state pair, we denote this matrix as R n t ∈ R N × T \mathcal{R}_{nt}\in\mathbb{R}^{N\times T} Rnt∈RN×T.
Now we are ready to implement the algorithm.
Generation:
def generate(self, T, series_idx):
self.sample[series_idx] = {'hidden': np.zeros(T, dtype=np.int32),
'observed': np.zeros(T, dtype=np.int32)}
cur_state = np.random.choice(np.arange(self.N), p=self.pi)
self.sample[series_idx]['hidden'][0] = cur_state
self.sample[series_idx]['observed'][0] = np.random.choice(np.arange(self.M), p=self.B[cur_state])
for t in range(1, T):
cur_state = np.random.choice(np.arange(self.N), p=self.A[cur_state])
self.sample[series_idx]['hidden'][t] = cur_state
self.sample[series_idx]['observed'][t] = np.random.choice(np.arange(self.M), p=self.B[cur_state])
Viterbi:
def viterbi(self, O, exp_idx):
"""Viterbi Algorithm
Args:
O (np.array): 1d array, length of T
exp_idx (string): index of experiment
"""
T = O.shape[0]
self.H[exp_idx] = np.zeros((self.N, T), dtype=np.int32)
self.R[exp_idx] = np.zeros((self.N, T), dtype=np.int32)
self.optimal[exp_idx] = np.zeros(T, dtype=np.int32)
# Initialization
self.H[exp_idx][:, 0] = self.log_pi + self.log_B[:, O[0]]
self.R[exp_idx][:, 0] = -1
# Iteration
for t in range(1, T):
for j in range(self.N):
tmp = self.H[exp_idx][:, t-1] + self.log_A[:, j] + self.log_B[j, O[t]]
self.H[exp_idx][j, t] = np.max(tmp)
self.R[exp_idx][j, t] = np.argmax(tmp)
print(f"Hidden State Matrix: \n{self.H[exp_idx]}")
print(f"Route Matrix: \n{self.R[exp_idx]}")
# Termination
self.optimal[exp_idx][-1] = np.argmax(self.H[exp_idx][:, -1])
for t in range(T-2, -1, -1):
self.optimal[exp_idx][t] = self.R[exp_idx][self.optimal[exp_idx][t+1], t+1]
print(f"Optimal Hidden State Sequence: {self.optimal[exp_idx]}")
return self.optimal[exp_idx]
We consider a simple example, where
A = [ 0.7 0.3 0.2 0.8 ] A = \left[\begin{matrix} 0.7& 0.3\\ 0.2& 0.8 \end{matrix}\right] A=[0.70.20.30.8], B = [ 0.4 0.1 0.4 0.1 0.1 0.4 0.1 0.4 ] B = \left[\begin{matrix} 0.4& 0.1& 0.4& 0.1\\ 0.1& 0.4& 0.1& 0.4 \end{matrix}\right] B=[0.40.10.10.40.40.10.10.4], π = [ 0.5 0.5 ] \pi = \left[\begin{matrix} 0.5& 0.5 \end{matrix}\right] π=[0.50.5], o b s e r v a t i o n = [ 0 1 3 ] \mathrm{observation} = \left[\begin{matrix} 0& 1& 3 \end{matrix}\right] observation=[013]
The following derivation is adapted from the course slides of Media and Cognition, EE, Tsinghua, so variables may have different letters. Here, δ \delta δ denotes likelihood, φ \varphi φ stands for precursor, O O O stands for observation, P P P stands for transition probability, and q ∗ q^* q∗ is the solution.
δ 0 ( 0 ) = π 0 b 0 ( O 0 ) = 0.2 δ 0 ( 1 ) = π 1 b 1 ( O 0 ) = 0.05 φ 0 ( 0 ) = − 1 φ 0 ( 1 ) = − 1 δ 1 ( 0 ) = max { δ 0 ( 0 ) P ( S 0 ∣ S 0 ) , δ 0 ( 1 ) P ( S 0 ∣ S 1 ) } b 0 ( O 1 ) = max { 0.2 × 0.7 , 0.05 × 0.2 } × 0.1 = 0.014 δ 1 ( 1 ) = max { δ 0 ( 0 ) P ( S 1 ∣ S 0 ) , δ 0 ( 1 ) P ( S 1 ∣ S 1 ) } b 1 ( O 1 ) = max { 0.2 × 0.3 , 0.05 × 0.8 } × 0.4 = 0.024 φ 1 ( 0 ) = arg max { δ 0 ( 0 ) P ( S 0 ∣ S 0 ) , δ 0 ( 1 ) P ( S 0 ∣ S 1 ) } = 0 φ 1 ( 1 ) = 0 δ 2 ( 0 ) = max { δ 1 ( 0 ) P ( S 0 ∣ S 0 ) , δ 1 ( 1 ) P ( S 0 ∣ S 1 ) } b 0 ( O 2 ) = max { 0.014 × 0.7 , 0.024 × 0.2 } × 0.1 = 0.00098 δ 2 ( 1 ) = max { δ 1 ( 0 ) P ( S 1 ∣ S 0 ) , δ 1 ( 1 ) P ( S 1 ∣ S 1 ) } b 1 ( O 2 ) = max { 0.014 × 0.3 , 0.024 × 0.8 } × 0.4 = 0.00768 φ 2 ( 0 ) = 0 φ 2 ( 1 ) = 1 P ∗ = max { δ 2 ( 0 ) , δ 2 ( 1 ) } = δ 2 ( 1 ) = 0.00768 q 2 ∗ = 1 , q 1 ∗ = φ 2 ( 1 ) = 1 , q 0 ∗ = φ 1 ( 1 ) = 0 \begin{aligned} & \delta_0(0)=\pi_0 b_0\left(O_0\right)=0.2 \\ & \delta_0(1)=\pi_1 b_1\left(O_0\right)=0.05 \\ & \varphi_0(0)=-1 \\ & \varphi_0(1)=-1 \\ & \delta_1(0)=\max \left\{\delta_0(0) P\left(S_0 \mid S_0\right), \delta_0(1) P\left(S_0 \mid S_1\right)\right\} b_0\left(O_1\right)=\max \{0.2 \times 0.7,0.05 \times 0.2\} \times 0.1=0.014 \\ & \delta_1(1)=\max \left\{\delta_0(0) P\left(S_1 \mid S_0\right), \delta_0(1) P\left(S_1 \mid S_1\right)\right\} b_1\left(O_1\right)=\max \{0.2 \times 0.3,0.05 \times 0.8\} \times 0.4=0.024 \\ & \varphi_1(0)=\arg \max \left\{\delta_0(0) P\left(S_0 \mid S_0\right), \delta_0(1) P\left(S_0 \mid S_1\right)\right\}=0 \\ & \varphi_1(1)=0 \\ & \delta_2(0)=\max \left\{\delta_1(0) P\left(S_0 \mid S_0\right), \delta_1(1) P\left(S_0 \mid S_1\right)\right\} b_0\left(O_2\right)=\max \{0.014 \times 0.7,0.024 \times 0.2\} \times 0.1=0.00098 \\ & \delta_2(1)=\max \left\{\delta_1(0) P\left(S_1 \mid S_0\right), \delta_1(1) P\left(S_1 \mid S_1\right)\right\} b_1\left(O_2\right)=\max \{0.014 \times 0.3,0.024 \times 0.8\} \times 0.4=0.00768 \\ & \varphi_2(0)=0 \\ & \varphi_2(1)=1 \\ & P^*=\max \left\{\delta_2(0), \delta_2(1)\right\}=\delta_2(1)=0.00768 \\ & q_2{ }^*=1, \quad q_1{ }^*=\varphi_2(1)=1, \quad q_0{ }^*=\varphi_1(1)=0 \end{aligned} δ0(0)=π0b0(O0)=0.2δ0(1)=π1b1(O0)=0.05φ0(0)=−1φ0(1)=−1δ1(0)=max{δ0(0)P(S0∣S0),δ0(1)P(S0∣S1)}b0(O1)=max{0.2×0.7,0.05×0.2}×0.1=0.014δ1(1)=max{δ0(0)P(S1∣S0),δ0(1)P(S1∣S1)}b1(O1)=max{0.2×0.3,0.05×0.8}×0.4=0.024φ1(0)=argmax{δ0(0)P(S0∣S0),δ0(1)P(S0∣S1)}=0φ1(1)=0δ2(0)=max{δ1(0)P(S0∣S0),δ1(1)P(S0∣S1)}b0(O2)=max{0.014×0.7,0.024×0.2}×0.1=0.00098δ2(1)=max{δ1(0)P(S1∣S0),δ1(1)P(S1∣S1)}b1(O2)=max{0.014×0.3,0.024×0.8}×0.4=0.00768φ2(0)=0φ2(1)=1P∗=max{δ2(0),δ2(1)}=δ2(1)=0.00768q2∗=1,q1∗=φ2(1)=1,q0∗=φ1(1)=0
So we have the optimal hidden state sequence is [0, 1, 1]
By running the codes, we get the following results.
====================
Hidden: [1 1 1 1 1]
--------------------
Observed: [1 1 2 1 3]
====================
Hidden State Matrix:
[[0.2 0.014 0.00098]
[0.05 0.024 0.00768]]
Route Matrix:
[[-1 0 0]
[-1 0 1]]
Optimal Hidden State Sequence: [0 1 1]
The first part(between the lines of “==”) records the generated hidden sequence and the observed sequence. The second part denotes the solving steps of the Viterbi Algorithm, which is in sync with previous calculation. This verifies the correctness of our program.