Advantages of CRF
CRFs are widely used in classification problems, for five important reasons:
1. The chief advantage of a CRF is that it models the conditional distribution $P(y|x)$ rather than the joint distribution $P(y,x)$.
2. A CRF can be used to capture arbitrary dependencies among the components of $x$ and $y$.
3. The major difference between a CRF and some other existing methods is that it is a "global" model: it considers all residues of the protein as a whole, rather than focusing merely on a local window around the tag to be labeled.
4. During inference, the states of all tags are predicted simultaneously, in a way that maximizes the overall likelihood. The interdependence between the states of adjacent tags is also explicitly exploited through the doublet feature functions used in the model.
5. For cases that a chain CRF cannot handle, we can use a higher-order CRF, whose higher-order features describe long-range interactions.
Chain CRF
Definition of Chain CRF
In the chain CRF, all the nodes in the graph form a linear chain. By the factorization property of undirected graphical models,

$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{i} \psi_i(Y_i, X)\, \phi_i(Y_i, Y_{i-1}, X),$$

where $\psi_i(Y_i, X)$ acts over single labels and $\phi_i(Y_i, Y_{i-1}, X)$ acts over edges.
After merging all parameters into a single vector $\Lambda$,

$$P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \Lambda^T f(y_i, y_{i-1}, x, i) \right) = \frac{1}{Z_x} \exp\left( \Lambda^T \mathcal{F}\, \mathbf{1} \right),$$

where, in the $k \times n$ matrix $\mathcal{F}$, $\mathcal{F}_{ji} = f_j(y_i, y_{i-1}, x, i)$, and $\Lambda$ is a $k \times 1$ parameter vector.
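To make this concrete, here is a minimal numpy sketch of the unnormalized log-score $\Lambda^T F(y, x)$. It assumes, purely as an illustration, that the weighted feature sums collapse into per-position unary log-potentials plus a shared transition matrix; the names `sequence_score`, `unary`, and `trans` are ours, not part of any library:

```python
import numpy as np

def sequence_score(y, unary, trans):
    """Unnormalized log-score Lambda^T F(y, x) of one labeling y.

    Assumed simplification: the weighted features collapse into
    per-position unary log-potentials unary[i, y_i] (shape n x N)
    and a shared transition matrix trans[y_prev, y_cur] (shape N x N).
    """
    s = unary[0, y[0]]
    for i in range(1, len(y)):
        s += trans[y[i - 1], y[i]] + unary[i, y[i]]
    return s

# Example: 4 positions, 3 labels, random log-potentials.
rng = np.random.default_rng(0)
unary, trans = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(sequence_score([0, 2, 2, 1], unary, trans))
```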
The forward and backward vectors
We use dynamic programming to calculate $Z_x$:

$$\alpha(i, y) = \sum_{y'} \alpha(i-1, y')\, \exp\!\big(\Lambda^T f(y, y', x, i)\big), \qquad \beta(i, y) = \sum_{y'} \exp\!\big(\Lambda^T f(y', y, x, i+1)\big)\, \beta(i+1, y'),$$

where $f(\cdot, \cdot, \cdot, i)$ is the feature vector evaluated at the $i$th sequence position. $\alpha$ and $\beta$ are called the forward and backward vectors respectively, and each can be computed in $O(nN^2)$ time, where $n$ is the sequence length and $N$ is the number of labels.
Then we can write the marginals and partition function as below:

$$P(Y_i = y \mid x) = \frac{\alpha(i, y)\, \beta(i, y)}{Z_x}, \qquad Z_x = \sum_{y} \alpha(n, y).$$
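A minimal sketch of these recursions in log space, under the same simplified unary/transition setup as above (the function name `forward_backward` is ours):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(unary, trans):
    """Forward-backward recursions in log space, O(n N^2).

    unary (n, N) and trans (N, N) are the assumed log-potentials.
    Returns log Z_x and the log alpha / log beta vectors.
    """
    n, N = unary.shape
    log_alpha = np.empty((n, N))
    log_beta = np.zeros((n, N))        # base case: beta(n, y) = 1
    log_alpha[0] = unary[0]
    for i in range(1, n):
        # alpha(i, y) = sum_{y'} alpha(i-1, y') * exp(trans[y', y] + unary[i, y])
        log_alpha[i] = logsumexp(log_alpha[i - 1][:, None] + trans, axis=0) + unary[i]
    for i in range(n - 2, -1, -1):
        # beta(i, y) = sum_{y'} exp(trans[y, y'] + unary[i+1, y']) * beta(i+1, y')
        log_beta[i] = logsumexp(trans + unary[i + 1] + log_beta[i + 1], axis=1)
    log_Z = logsumexp(log_alpha[-1])   # log Z_x
    return log_Z, log_alpha, log_beta

# Node marginals P(Y_i = y | x) = alpha(i, y) * beta(i, y) / Z_x:
# log_Z, la, lb = forward_backward(unary, trans)
# marginals = np.exp(la + lb - log_Z)
```

Working in log space avoids the numerical underflow that raw products of exponentials would cause on long sequences.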
Inference of Chain CRF
Inference in a linear-chain CRF uses the Viterbi algorithm:

$$\delta(i, y) = \max_{y'}\, \delta(i-1, y')\, \exp\!\big(\Lambda^T f(y, y', x, i)\big),$$

where the normalized probability of the best labeling is given by $\max_y \delta(n, y) / Z_x$, and the best labeling itself is $\operatorname{argmax}_y \delta(n, y)$, recovered by backtracking.
The time complexity is $O(nN^2)$.
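A sketch of the Viterbi recursion with backtracking, again under the simplified unary/transition setup (the name `viterbi` is ours):

```python
import numpy as np

def viterbi(unary, trans):
    """Best labeling argmax_y delta(n, y), computed in O(n N^2).

    Same assumed unary (n, N) / trans (N, N) log-potentials as above.
    """
    n, N = unary.shape
    delta = np.empty((n, N))
    backptr = np.zeros((n, N), dtype=int)
    delta[0] = unary[0]
    for i in range(1, n):
        scores = delta[i - 1][:, None] + trans      # scores[y_prev, y]
        backptr[i] = np.argmax(scores, axis=0)      # best predecessor of each y
        delta[i] = np.max(scores, axis=0) + unary[i]
    best = [int(np.argmax(delta[-1]))]              # argmax_y delta(n, y)
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))      # walk the back-pointers
    return best[::-1]
```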
We can also use pseudo-likelihood to perform approximate inference.
Training of Chain CRF
CRFs, too, suffer from the bane of overfitting, so we impose a penalty on large parameter values. The penalized log-likelihood is given by:

$$L(\Lambda) = \sum_{k} \left[ \Lambda^T F(y_k, x_k) - \log Z_{x_k} \right] - \frac{\lVert \Lambda \rVert^2}{2\sigma^2},$$

where $(x_k, y_k)$ ranges over the training sequences,
and the gradient is given by:

$$\nabla L = \sum_{k} \left[ F(y_k, x_k) - E_{P(y \mid x_k)}[F(y, x_k)] \right] - \frac{\Lambda}{\sigma^2},$$
where

$$E_{P(y \mid x_k)}[F(y, x_k)] = \frac{1}{Z_{x_k}} \sum_{i} \sum_{y, y'} \alpha(i-1, y')\, f(y, y', x_k, i)\, \exp\!\big(\Lambda^T f(y, y', x_k, i)\big)\, \beta(i, y),$$

so the expected feature counts can be computed from the same forward and backward vectors.
The gradient descent method (covered in the last blog) can then be used to train all the parameters.
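As a sketch of how this gradient is assembled in the simplified setup used throughout (a hypothetical minimal model: discrete observations $x_i \in \{0, \dots, V-1\}$, emission weights `emit_w[x_i, y_i]`, transition weights `trans_w`, and `forward_backward` from the earlier sketch), the observed-minus-expected feature counts plus the Gaussian penalty term are:

```python
import numpy as np

def penalized_gradient(x, y, emit_w, trans_w, sigma2=10.0):
    """Gradient of the penalized log-likelihood for one pair (x, y).

    Observed feature counts minus expected counts (from the
    forward-backward marginals), minus Lambda / sigma^2 for the penalty.
    Reuses forward_backward() from the earlier sketch.
    """
    n = len(x)
    unary = emit_w[x]                              # (n, N) log-potentials
    log_Z, la, lb = forward_backward(unary, trans_w)
    node_marg = np.exp(la + lb - log_Z)            # P(Y_i = y | x)
    g_emit = -emit_w / sigma2                      # penalty term -Lambda / sigma^2
    g_trans = -trans_w / sigma2
    for i in range(n):
        g_emit[x[i], y[i]] += 1.0                  # observed unary counts
        g_emit[x[i]] -= node_marg[i]               # expected unary counts
        if i > 0:
            # Edge marginals P(Y_{i-1}=a, Y_i=b | x) from alpha/beta.
            edge = np.exp(la[i - 1][:, None] + trans_w + unary[i] + lb[i] - log_Z)
            g_trans[y[i - 1], y[i]] += 1.0
            g_trans -= edge
    return g_emit, g_trans
```

A training step then ascends this gradient, e.g. `emit_w += lr * g_emit`, which is equivalent to gradient descent on the negative penalized log-likelihood.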