Advantages of CRF
CRFs are widely used in classification problems, for five important reasons:
1. The chief advantage of a CRF is that it models the conditional distribution $P(y|x)$ rather than the joint distribution $P(y,x)$.
2. A CRF can be used to capture arbitrary dependencies among the components of $x$ and $y$.
3. The major difference between a CRF and some other existing methods is that it is a "global" model: it considers all residues of the protein as a whole, rather than focusing merely on a local window around the tag to be labeled.
4. During inference, the states of all tags are predicted simultaneously, in a way that maximizes the overall likelihood. The interdependence between the states of adjacent tags is also explicitly exploited through the doublet feature functions used in the model.
5. For cases that a chain CRF cannot handle, we can use a higher-order CRF, whose higher-order features describe long-range interactions.
Chain CRF
Definition of Chain CRF
In the chain CRF, all the nodes in the graph form a linear chain. By the factorization property of undirected graphical models,

$$P(Y \mid X) = \frac{1}{Z(X)} \prod_{i} \psi_i(Y_i, X)\, \phi_i(Y_i, Y_{i-1}, X),$$

where $\psi_i(Y_i, X)$ acts over single labels and $\phi_i(Y_i, Y_{i-1}, X)$ acts over edges.
After merging all parameters into a single vector $\Lambda$,

$$P_\Lambda(y \mid x) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \Lambda^T f(y_i, y_{i-1}, x, i) \right) = \frac{1}{Z_x} \exp\left( \Lambda^T \mathcal{F}\, \mathbf{1} \right),$$

where, in the $k \times n$ matrix $\mathcal{F}$, $\mathcal{F}_{ji} = f_j(y_i, y_{i-1}, x, i)$, and $\Lambda$ is a $k \times 1$ parameter vector.
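To make this concrete, here is a minimal numpy sketch of the unnormalized log-score $\Lambda^T F(y, x)$. It assumes, purely as an illustration, that the weighted feature sums collapse into per-position unary log-potentials plus a shared transition matrix; the names `sequence_score`, `unary`, and `trans` are ours, not part of any library:

```python
import numpy as np

def sequence_score(y, unary, trans):
    """Unnormalized log-score Lambda^T F(y, x) of one labeling y.

    Assumed simplification: the weighted features collapse into
    per-position unary log-potentials unary[i, y_i] (shape n x N)
    and a shared transition matrix trans[y_prev, y_cur] (shape N x N).
    """
    s = unary[0, y[0]]
    for i in range(1, len(y)):
        s += trans[y[i - 1], y[i]] + unary[i, y[i]]
    return s

# Example: 4 positions, 3 labels, random log-potentials.
rng = np.random.default_rng(0)
unary, trans = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(sequence_score([0, 2, 2, 1], unary, trans))
```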
The forward and backward vectors
We use dynamic programming to calculate $Z_x$:

$$\alpha(i, y) = \sum_{y'} \alpha(i-1, y')\, \exp\!\big(\Lambda^T f(y, y', x, i)\big), \qquad \beta(i, y) = \sum_{y'} \exp\!\big(\Lambda^T f(y', y, x, i+1)\big)\, \beta(i+1, y'),$$

where $f(\cdot, \cdot, \cdot, i)$ is the feature vector evaluated at the $i$th sequence position. $\alpha$ and $\beta$ are called the forward and backward vectors respectively, and each can be computed in $O(nN^2)$ time, where $n$ is the sequence length and $N$ is the number of labels.
Then we can write the marginals and partition function as below:

$$P(Y_i = y \mid x) = \frac{\alpha(i, y)\, \beta(i, y)}{Z_x}, \qquad Z_x = \sum_{y} \alpha(n, y).$$
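A minimal sketch of these recursions in log space, under the same simplified unary/transition setup as above (the function name `forward_backward` is ours):

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(unary, trans):
    """Forward-backward recursions in log space, O(n N^2).

    unary (n, N) and trans (N, N) are the assumed log-potentials.
    Returns log Z_x and the log alpha / log beta vectors.
    """
    n, N = unary.shape
    log_alpha = np.empty((n, N))
    log_beta = np.zeros((n, N))        # base case: beta(n, y) = 1
    log_alpha[0] = unary[0]
    for i in range(1, n):
        # alpha(i, y) = sum_{y'} alpha(i-1, y') * exp(trans[y', y] + unary[i, y])
        log_alpha[i] = logsumexp(log_alpha[i - 1][:, None] + trans, axis=0) + unary[i]
    for i in range(n - 2, -1, -1):
        # beta(i, y) = sum_{y'} exp(trans[y, y'] + unary[i+1, y']) * beta(i+1, y')
        log_beta[i] = logsumexp(trans + unary[i + 1] + log_beta[i + 1], axis=1)
    log_Z = logsumexp(log_alpha[-1])   # log Z_x
    return log_Z, log_alpha, log_beta

# Node marginals P(Y_i = y | x) = alpha(i, y) * beta(i, y) / Z_x:
# log_Z, la, lb = forward_backward(unary, trans)
# marginals = np.exp(la + lb - log_Z)
```

Working in log space avoids the numerical underflow that raw products of exponentials would cause on long sequences.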
Inference of Chain CRF
Inference in a linear-chain CRF uses the Viterbi algorithm:

$$\delta(i, y) = \max_{y'}\, \delta(i-1, y')\, \exp\!\big(\Lambda^T f(y, y', x, i)\big),$$

where the normalized probability of the best labeling is given by $\max_y \delta(n, y) / Z_x$, and the best labeling itself is $\operatorname{argmax}_y \delta(n, y)$, recovered by backtracking.
The time complexity is $O(nN^2)$.
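A sketch of the Viterbi recursion with backtracking, again under the simplified unary/transition setup (the name `viterbi` is ours):

```python
import numpy as np

def viterbi(unary, trans):
    """Best labeling argmax_y delta(n, y), computed in O(n N^2).

    Same assumed unary (n, N) / trans (N, N) log-potentials as above.
    """
    n, N = unary.shape
    delta = np.empty((n, N))
    backptr = np.zeros((n, N), dtype=int)
    delta[0] = unary[0]
    for i in range(1, n):
        scores = delta[i - 1][:, None] + trans      # scores[y_prev, y]
        backptr[i] = np.argmax(scores, axis=0)      # best predecessor of each y
        delta[i] = np.max(scores, axis=0) + unary[i]
    best = [int(np.argmax(delta[-1]))]              # argmax_y delta(n, y)
    for i in range(n - 1, 0, -1):
        best.append(int(backptr[i, best[-1]]))      # walk the back-pointers
    return best[::-1]
```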
We can also use pseudo-likelihood to perform approximate inference.
Training of Chain CRF
CRFs, too, suffer from the bane of overfitting, so we impose a penalty on large parameter values. The penalized log-likelihood is given by:

$$L(\Lambda) = \sum_{k} \left[ \Lambda^T F(y_k, x_k) - \log Z_{x_k} \right] - \frac{\lVert \Lambda \rVert^2}{2\sigma^2},$$

where $(x_k, y_k)$ ranges over the training sequences,
and the gradient is given by:

$$\nabla L = \sum_{k} \left[ F(y_k, x_k) - E_{P(y \mid x_k)}[F(y, x_k)] \right] - \frac{\Lambda}{\sigma^2},$$
where

$$E_{P(y \mid x_k)}[F(y, x_k)] = \frac{1}{Z_{x_k}} \sum_{i} \sum_{y, y'} \alpha(i-1, y')\, f(y, y', x_k, i)\, \exp\!\big(\Lambda^T f(y, y', x_k, i)\big)\, \beta(i, y),$$

so the expected feature counts can be computed from the same forward and backward vectors.
The gradient descent method (covered in the last blog) can then be used to train all the parameters.
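As a sketch of how this gradient is assembled in the simplified setup used throughout (a hypothetical minimal model: discrete observations $x_i \in \{0, \dots, V-1\}$, emission weights `emit_w[x_i, y_i]`, transition weights `trans_w`, and `forward_backward` from the earlier sketch), the observed-minus-expected feature counts plus the Gaussian penalty term are:

```python
import numpy as np

def penalized_gradient(x, y, emit_w, trans_w, sigma2=10.0):
    """Gradient of the penalized log-likelihood for one pair (x, y).

    Observed feature counts minus expected counts (from the
    forward-backward marginals), minus Lambda / sigma^2 for the penalty.
    Reuses forward_backward() from the earlier sketch.
    """
    n = len(x)
    unary = emit_w[x]                              # (n, N) log-potentials
    log_Z, la, lb = forward_backward(unary, trans_w)
    node_marg = np.exp(la + lb - log_Z)            # P(Y_i = y | x)
    g_emit = -emit_w / sigma2                      # penalty term -Lambda / sigma^2
    g_trans = -trans_w / sigma2
    for i in range(n):
        g_emit[x[i], y[i]] += 1.0                  # observed unary counts
        g_emit[x[i]] -= node_marg[i]               # expected unary counts
        if i > 0:
            # Edge marginals P(Y_{i-1}=a, Y_i=b | x) from alpha/beta.
            edge = np.exp(la[i - 1][:, None] + trans_w + unary[i] + lb[i] - log_Z)
            g_trans[y[i - 1], y[i]] += 1.0
            g_trans -= edge
    return g_emit, g_trans
```

A training step then ascends this gradient, e.g. `emit_w += lr * g_emit`, which is equivalent to gradient descent on the negative penalized log-likelihood.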