A Hidden Markov Model (HMM) is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state an outcome or observation can be generated according to the associated probability distribution. Only the outcome, not the state, is visible to an external observer; the states are therefore ``hidden'' from the outside, hence the name Hidden Markov Model.
An HMM is completely defined by the following elements.
- The number of states of the model, $N$.
- The number of observation symbols in the alphabet, $M$. If the observations are continuous, then $M$ is infinite.
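- A set of state transition probabilities $A = \{a_{ij}\}$,
$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}, \quad 1 \le i, j \le N,$$
where $q_t$ denotes the current state.
Transition probabilities should satisfy the normal stochastic constraints,
$$a_{ij} \ge 0, \quad 1 \le i, j \le N, \qquad \text{and} \qquad \sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N.$$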
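- A probability distribution in each of the states, $B = \{b_j(k)\}$,
$$b_j(k) = p\{o_t = v_k \mid q_t = j\}, \quad 1 \le j \le N, \quad 1 \le k \le M,$$
where $v_k$ denotes the $k$-th observation symbol in the alphabet, and $o_t$ the current parameter vector.
The following stochastic constraints must be satisfied:
$$b_j(k) \ge 0, \quad 1 \le j \le N, \quad 1 \le k \le M, \qquad \text{and} \qquad \sum_{k=1}^{M} b_j(k) = 1, \quad 1 \le j \le N.$$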
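If the observations are continuous, then we have to use a continuous probability density function instead of a set of discrete probabilities. In this case we specify the parameters of the probability density function. Usually the probability density is approximated by a weighted sum of $M$ Gaussian distributions $\mathcal{N}$ (here $M$ denotes the number of mixture components, not the alphabet size),
$$b_j(o_t) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(\mu_{jm}, \Sigma_{jm}, o_t),$$
where $c_{jm}$ are the weighting coefficients, $\mu_{jm}$ the mean vectors, and $\Sigma_{jm}$ the covariance matrices.
The coefficients $c_{jm}$ should satisfy the stochastic constraints,
$$c_{jm} \ge 0, \quad 1 \le j \le N, \quad 1 \le m \le M, \qquad \text{and} \qquad \sum_{m=1}^{M} c_{jm} = 1, \quad 1 \le j \le N.$$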
- The initial state distribution, $\pi = \{\pi_i\}$, where
$$\pi_i = p\{q_1 = i\}, \quad 1 \le i \le N.$$
Therefore we can use the compact notation
$$\lambda = (A, B, \pi)$$
to denote an HMM with discrete probability distributions, while
$$\lambda = (A, c_{jm}, \mu_{jm}, \Sigma_{jm}, \pi)$$
denotes one with continuous densities.
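The stochastic constraints on a discrete HMM can be checked mechanically. The following is a minimal sketch, assuming the model is stored as plain nested Python lists; the names `is_stochastic_row` and `validate_hmm`, and the example numbers, are illustrative assumptions and do not come from the text.

```python
def is_stochastic_row(row, tol=1e-9):
    """True if every entry is non-negative and the entries sum to 1."""
    return all(p >= 0.0 for p in row) and abs(sum(row) - 1.0) <= tol


def validate_hmm(A, B, pi, tol=1e-9):
    """Check shapes and stochastic constraints of a discrete HMM (A, B, pi)."""
    N = len(pi)
    # A must be N x N.
    if len(A) != N or any(len(row) != N for row in A):
        return False
    # B must have N rows, all with the same number M of symbols.
    if len(B) != N or len({len(row) for row in B}) != 1:
        return False
    # Every row of A, every row of B, and pi itself must be a distribution.
    rows = list(A) + list(B) + [pi]
    return all(is_stochastic_row(r, tol) for r in rows)


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

print(validate_hmm(A, B, pi))
```

A tolerance is used for the row sums because the probabilities are floating point numbers.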
For the sake of mathematical and computational tractability, the following assumptions are made in the theory of HMMs.
(1) The Markov assumption
As given in the definition of HMMs, transition probabilities are defined as
$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}.$$
In other words, it is assumed that the next state depends only upon the current state. This is called the Markov assumption, and the resulting model is a first order HMM.
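In general, however, the next state may depend on the past $k$ states, and it is possible to obtain such a model, called a $k$-th order HMM, by defining the transition probabilities as follows:
$$a_{i_1 i_2 \dots i_k j} = p\{q_{t+1} = j \mid q_t = i_1, q_{t-1} = i_2, \dots, q_{t-k+1} = i_k\}, \quad 1 \le i_1, i_2, \dots, i_k, j \le N.$$
A higher order HMM has a higher complexity, since the number of transition probabilities grows as $N^{k+1}$. Even though first order HMMs are the most common, some attempts have been made to use higher order HMMs too.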
(2) The stationarity assumption
Here it is assumed that the state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,
$$p\{q_{t_1+1} = j \mid q_{t_1} = i\} = p\{q_{t_2+1} = j \mid q_{t_2} = i\}$$
for any $t_1$ and $t_2$.
(3) The output independence assumption
This is the assumption that the current output (observation) is statistically independent of the previous outputs (observations). We can formulate this assumption mathematically by considering a sequence of observations,
$$O = o_1, o_2, \dots, o_T.$$
Then, according to the assumption, for an HMM $\lambda$,
$$p\{O \mid q_1, q_2, \dots, q_T, \lambda\} = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda).$$
However, unlike the other two, this assumption has very limited validity. In some cases it does not hold well and therefore becomes a severe weakness of HMMs.
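Under the output independence assumption, the probability of an observation sequence given a known state sequence is just a product of per-state emission probabilities. The snippet below is a small sketch of that product for a discrete HMM; the function name `likelihood_given_states` and the numbers in `B` are assumptions made for illustration.

```python
def likelihood_given_states(B, states, obs):
    """p(O | q_1..q_T, lambda) as a product of emission probabilities.

    B[j][k] is the probability of emitting symbol k in state j.
    """
    if len(states) != len(obs):
        raise ValueError("states and obs must have the same length")
    p = 1.0
    for q, o in zip(states, obs):
        p *= B[q][o]  # each observation depends only on its own state
    return p


# Hypothetical emission matrix: 2 states, 3 symbols.
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]

print(likelihood_given_states(B, states=[0, 0, 1], obs=[0, 1, 2]))
```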
Once we have an HMM, there are three problems of interest.
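(1) The Evaluation Problem
- Given an HMM $\lambda$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, what is the probability that the observations are generated by the model, $p\{O \mid \lambda\}$?
(2) The Decoding Problem
- Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, what is the most likely state sequence in the model that produced the observations? In other words, compute the hidden state sequence that most likely corresponds to the observed sequence.
(3) The Learning Problem
- Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, how should we adjust the model parameters $\{A, B, \pi\}$ in order to maximize $p\{O \mid \lambda\}$? In other words, how can the model be improved so that the probability of the observed sequence is maximized?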
The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.
We have a model $\lambda = (A, B, \pi)$ and a sequence of observations $O = o_1, o_2, \dots, o_T$, and $p\{O \mid \lambda\}$ must be found. We can calculate this quantity using simple probabilistic arguments, but this calculation involves a number of operations of the order of $N^T$. This is very large even if the length of the sequence, $T$, is moderate. Therefore we have to look for another method for this calculation. Fortunately there exists one which has considerably lower complexity and makes use of an auxiliary variable, $\alpha_t(i)$, called the forward variable.
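For reference, the direct calculation can be sketched as a sum over every possible state sequence. This is only an illustration of why the cost grows exponentially with the sequence length; the function name `brute_force_likelihood` and the example model are assumptions, not part of the text.

```python
from itertools import product


def brute_force_likelihood(A, B, pi, obs):
    """p(O | lambda) by summing over all N**T state sequences."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    N = len(pi)
    total = 0.0
    for path in product(range(N), repeat=len(obs)):
        # Probability of this state path jointly with the observations.
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

print(brute_force_likelihood(A, B, pi, [0, 1, 2]))
```

With only 2 states the loop visits 8 paths for 3 observations, but the count doubles with every additional observation.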
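The forward variable is defined as the probability of the partial observation sequence $o_1, o_2, \dots, o_t$, when it terminates at the state $i$. Mathematically,
$$\alpha_t(i) = p\{o_1, o_2, \dots, o_t, q_t = i \mid \lambda\}.$$
That is, $\alpha_t(i)$ is the probability of observing $o_1, o_2, \dots, o_t$ and being in state $i$ at time $t$. It is advanced forward in $t$; when $t = T$ the whole observation sequence has been seen, so summing the forward variables over all states at time $T$ gives the probability of the observation sequence.
Then it is easy to see that the following recursive relationship holds:
$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i) \, a_{ij}, \quad 1 \le j \le N, \quad 1 \le t \le T-1,$$
where
$$\alpha_1(j) = \pi_j \, b_j(o_1), \quad 1 \le j \le N.$$
Using this recursion we can calculate
$$\alpha_T(i), \quad 1 \le i \le N,$$
and then the required probability is given by
$$p\{O \mid \lambda\} = \sum_{i=1}^{N} \alpha_T(i).$$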
The complexity of this method, known as the forward algorithm, is proportional to $N^2 T$, which is linear with respect to $T$, whereas the direct calculation mentioned earlier has exponential complexity.
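The forward recursion translates almost line by line into code. Below is a minimal sketch for a discrete HMM stored as nested lists; the function names `forward` and `likelihood` and the example parameters are illustrative assumptions. No scaling is applied, so for long sequences the values may underflow; practical implementations usually rescale the forward variables or work with logarithms.

```python
def forward(A, B, pi, obs):
    """Return alpha, where alpha[t][i] is the forward variable at step t (0-based)."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    N = len(pi)
    # Initialisation: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    # Induction: alpha_{t+1}(j) = b_j(o_{t+1}) * sum_i alpha_t(i) * a_ij
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
            for j in range(N)
        ])
    return alpha


def likelihood(A, B, pi, obs):
    """p(O | lambda) = sum_i alpha_T(i)."""
    return sum(forward(A, B, pi, obs)[-1])


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

print(likelihood(A, B, pi, obs))
```

Each time step does $N$ sums of $N$ terms, which is where the $N^2 T$ cost comes from.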
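In a similar way we can define the backward variable $\beta_t(i)$ as the probability of the partial observation sequence $o_{t+1}, o_{t+2}, \dots, o_T$, given that the current state is $i$. Mathematically,
$$\beta_t(i) = p\{o_{t+1}, o_{t+2}, \dots, o_T \mid q_t = i, \lambda\}.$$
As in the case of $\alpha_t(i)$, there is a recursive relationship which can be used to calculate $\beta_t(i)$ efficiently:
$$\beta_t(i) = \sum_{j=1}^{N} \beta_{t+1}(j) \, a_{ij} \, b_j(o_{t+1}), \quad 1 \le i \le N, \quad 1 \le t \le T-1,$$
where
$$\beta_T(i) = 1, \quad 1 \le i \le N.$$
Further we can see that
$$\alpha_t(i) \, \beta_t(i) = p\{O, q_t = i \mid \lambda\}, \quad 1 \le i \le N, \quad 1 \le t \le T.$$
Therefore this gives another way to calculate $p\{O \mid \lambda\}$, by using both forward and backward variables as given in eqn. 1.7:
$$p\{O \mid \lambda\} = \sum_{i=1}^{N} p\{O, q_t = i \mid \lambda\} = \sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i). \tag{1.7}$$
Eqn. 1.7 is very useful, especially in deriving the formulas required for gradient based training.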
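The relationship in eqn. 1.7 can be checked numerically: the sum of forward times backward variables should give the same value at every time step. The sketch below assumes a discrete HMM stored as nested lists; the function names and the example parameters are illustrative assumptions.

```python
def forward(A, B, pi, obs):
    """alpha[t][i]: forward variable at step t (0-based)."""
    N = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
            for j in range(N)
        ])
    return alpha


def backward(A, B, obs):
    """beta[t][i]: backward variable at step t (0-based)."""
    N, T = len(A), len(obs)
    # Initialisation: beta_T(i) = 1
    beta = [[1.0] * N for _ in range(T)]
    # Induction, going backwards: beta_t(i) = sum_j beta_{t+1}(j) a_ij b_j(o_{t+1})
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(N)
            )
    return beta


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]
obs = [0, 1, 2]

alpha = forward(A, B, pi, obs)
beta = backward(A, B, obs)
# Eqn. 1.7 evaluated at every t; all entries should agree.
per_step = [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(len(obs))]
print(per_step)
```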
In this case we want to find the most likely state sequence for a given sequence of observations, $O = o_1, o_2, \dots, o_T$, and a model, $\lambda = (A, B, \pi)$.
The solution to this problem depends upon the way ``most likely state sequence'' is defined. One approach is to find the most likely state $q_t$ at each time $t$ and to concatenate all such $q_t$'s. But sometimes this method does not give a physically meaningful state sequence, for example when it joins two states whose transition probability is zero. Therefore we use another method which has no such problem.
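The first approach, picking the individually most likely state at each time, can be sketched with the forward and backward variables, since $\alpha_t(i)\,\beta_t(i)$ is proportional to the probability of being in state $i$ at time $t$ given the observations. This is a minimal illustration under the assumption of a discrete HMM stored as nested lists; all function names and numbers are hypothetical.

```python
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            B[j][obs[t]] * sum(prev[i] * A[i][j] for i in range(N))
            for j in range(N)
        ])
    return alpha


def backward(A, B, obs):
    N, T = len(A), len(obs)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(
                beta[t + 1][j] * A[i][j] * B[j][obs[t + 1]] for j in range(N)
            )
    return beta


def individually_most_likely_states(A, B, pi, obs):
    """At each time t, pick argmax_i alpha_t(i) * beta_t(i) independently."""
    alpha = forward(A, B, pi, obs)
    beta = backward(A, B, obs)
    path = []
    for a_t, b_t in zip(alpha, beta):
        scores = [a * b for a, b in zip(a_t, b_t)]
        path.append(max(range(len(scores)), key=lambda i: scores[i]))
    return path


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

print(individually_most_likely_states(A, B, pi, [0, 1, 2]))
```

Nothing in this function checks that consecutive chosen states are connected by a nonzero transition probability, which is why the result is not guaranteed to be a valid path through the model.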
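In this method, commonly known as the Viterbi algorithm, the whole state sequence with the maximum likelihood is found. In order to facilitate the computation we define an auxiliary variable,
$$\delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} p\{q_1, q_2, \dots, q_{t-1}, q_t = i, o_1, o_2, \dots, o_t \mid \lambda\},$$
which gives the highest probability that the partial observation sequence and state sequence up to time $t$ can have, when the current state is $i$.
It is easy to observe that the following recursive relationship holds:
$$\delta_{t+1}(j) = b_j(o_{t+1}) \left[ \max_{1 \le i \le N} \delta_t(i) \, a_{ij} \right], \quad 1 \le j \le N, \quad 1 \le t \le T-1, \tag{1.8}$$
where
$$\delta_1(j) = \pi_j \, b_j(o_1), \quad 1 \le j \le N.$$
By the Markov and output independence assumptions above, the transition from time $t$ to time $t+1$ does not depend on the observations already seen. We therefore only need to keep the maximum probability at each state, multiply it by the probability of moving from that state to the next state, and take the largest of the resulting $N$ values as the probability of the best path at time $t+1$.
So the procedure to find the most likely state sequence starts from the calculation of $\delta_T(j)$, $1 \le j \le N$, using the recursion in eqn. 1.8, while always keeping a pointer to the ``winning state'' in the maximum finding operation. Finally the state $j^*$ is found, where
$$j^* = \arg\max_{1 \le j \le N} \delta_T(j),$$
and starting from this state, the sequence of states is back-tracked as the pointer in each state indicates. This gives the required set of states.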
This whole algorithm can be interpreted as a search in a graph whose nodes are formed by the states of the HMM at each time instant $t$, $1 \le t \le T$.
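The procedure described above, recursion with back-pointers followed by back-tracking, can be sketched as follows. This is a minimal illustration for a discrete HMM stored as nested lists, not a definitive implementation; the function name `viterbi` and the example parameters are assumptions. Probabilities are multiplied directly, so long sequences would in practice call for logarithms.

```python
def viterbi(A, B, pi, obs):
    """Return (most likely state path, its joint probability with obs)."""
    if not obs:
        raise ValueError("obs must contain at least one observation")
    N = len(pi)
    # Initialisation: delta_1(j) = pi_j * b_j(o_1)
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    backptr = []  # backptr[t-1][j] = winning predecessor of state j at step t
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for j in range(N):
            # Winning state in the maximisation over predecessors i.
            best_i = max(range(N), key=lambda i: delta[i] * A[i][j])
            ptr.append(best_i)
            new_delta.append(B[j][obs[t]] * delta[best_i] * A[best_i][j])
        delta = new_delta
        backptr.append(ptr)
    # Termination: best final state, then follow the pointers backwards.
    last = max(range(N), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical example: N = 2 states, M = 3 observation symbols.
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]
pi = [0.6, 0.4]

best_path, best_prob = viterbi(A, B, pi, [0, 1, 2])
print(best_path, best_prob)
```

Only the current column of $\delta$ values is kept; the back-pointers are the only per-time-step storage needed to recover the path.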