Table of Contents

- Links to previous articles
- Log-Linear model
- Conditional Random Fields (CRF)
- Log-linear model to linear-CRF
- Inference problem for CRF
- Learning problem for CRF
- Links to previous articles
Links to previous articles
Log-Linear model
Let $x$ be an example, and let $y$ be a possible label for it. A log-linear model assumes that

$$p(y \mid x; w) = \frac{\exp\left[\sum_{j=1}^J w_j F_j(x, y)\right]}{Z(x, w)}$$
where the partition function
$$Z(x, w) = \sum_{y'} \exp\left[\sum_{j=1}^J w_j F_j(x, y')\right]$$
Note that $\sum_{y'}$ sums over all possible labels $y'$. Therefore, given $x$, the label predicted by the model is

$$\hat{y} = \underset{y}{\operatorname{argmax}} \; p(y \mid x; w) = \underset{y}{\operatorname{argmax}} \sum_{j=1}^J w_j F_j(x, y)$$

Each expression $F_j(x, y)$ is called a feature function. You can think of it as the $j$-th feature extracted from the pair $(x, y)$.
Remarks on the log-linear model:

- The linear combination $\sum_{j=1}^J w_j F_j(x, y)$ can take any positive or negative real value; the exponential makes it positive.
- The division by $Z(x, w)$ makes the result $p(y \mid x; w)$ lie between 0 and 1, i.e. makes the outputs valid probabilities.
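To make this concrete, here is a minimal NumPy sketch of a log-linear classifier. The label set, feature functions, and weights are invented for illustration (they are not from this post); the point is to show the score $\sum_j w_j F_j(x, y)$, the normalization by $Z(x, w)$, and the fact that prediction needs only the unnormalized scores.

```python
import numpy as np

# A minimal sketch of a log-linear model. The label set, feature
# functions, and weights below are toy assumptions for illustration.

LABELS = ["NUM", "LONG", "OTHER"]

def feature_vector(x, y):
    """Return [F_1(x, y), ..., F_J(x, y)] for a string x and label y."""
    return np.array([
        float(any(ch.isdigit() for ch in x) and y == "NUM"),   # F_1
        float(len(x) > 5 and y == "LONG"),                     # F_2
        float(y == "OTHER"),                                   # F_3
    ])

def p_y_given_x(x, w):
    """p(y | x; w) for every label: exponentiate scores, divide by Z(x, w)."""
    scores = np.array([w @ feature_vector(x, y) for y in LABELS])
    scores -= scores.max()                 # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # the division by Z(x, w)

def predict(x, w):
    """argmax_y sum_j w_j F_j(x, y); Z(x, w) is never needed here."""
    return max(LABELS, key=lambda y: float(w @ feature_vector(x, y)))

w = np.array([2.0, 1.0, 0.5])                       # toy weights
print(dict(zip(LABELS, p_y_given_x("abc123", w))))  # valid probabilities
print(predict("abc123", w))                         # -> "NUM"
```

Since $Z(x, w)$ is identical for every candidate label, `predict` skips it entirely; this is exactly why the argmax above drops the partition function.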
Conditional Random Fields (CRF)
Last time, we talked about Markov Random Fields. In this post, we are going to discuss Conditional Random Fields, an important special case of Markov Random Fields that arises when they are applied to model a conditional probability distribution $p(y \mid x)$, where $x$ and $y$ are vector-valued variables.
![](https://i-blog.csdnimg.cn/blog_migrate/72c166697402d788c592b46826d84d4b.png)
Formal definition of CRF
Formally, a CRF is a Markov network which specifies a conditional distribution
$$P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} \phi_c(x_c, y_c)$$
with partition function
$$Z(x) = \sum_{y \in \mathcal{Y}} \prod_{c \in C} \phi_c(x_c, y_c)$$
We further assume that the factors $\phi_c(x_c, y_c)$, one per maximal clique, are of the form

$$\phi_c(x_c, y_c) = \exp\left[w_c^T f_c(x_c, y_c)\right]$$

Since we require the potential functions $\phi_c$ to be non-negative, it is natural to use the exponential function. Here $f_c(x_c, y_c)$ can be an arbitrary set of features describing the compatibility between $x_c$ and $y_c$. Note that these feature functions can be designed by manual feature engineering or learned, e.g. by deep models such as LSTMs.
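As a sanity check on this definition, the sketch below evaluates $P(y \mid x)$ for a tiny chain-structured CRF by brute force: the maximal cliques are the pairs $(y_{i-1}, y_i)$, each clique potential is $\phi_c = \exp(w_c^T f_c)$, and $Z(x)$ is computed by enumerating every label sequence. The transition and emission features here are toy assumptions, not anything defined in the post.

```python
import itertools
import numpy as np

# A toy chain-structured CRF: cliques are the pairs (y_{i-1}, y_i),
# and each clique potential is phi_c = exp(w_c^T f_c).

def log_potential(y_prev, y_curr, x, i, w):
    """w^T f for one clique, with a toy transition and emission feature."""
    f = np.array([
        float(y_prev == y_curr),   # transition: consecutive labels agree
        float(x[i] == y_curr),     # emission: observation matches label
    ])
    return float(w @ f)

def crf_prob(y, x, w, label_set):
    """P(y | x) = prod_c phi_c(x_c, y_c) / Z(x), with Z(x) by enumeration."""
    def log_score(ys):
        return sum(log_potential(ys[i - 1], ys[i], x, i, w)
                   for i in range(1, len(x)))
    # Z(x): sum of prod_c phi_c over all label sequences (in log space)
    log_Z = np.logaddexp.reduce(
        [log_score(ys) for ys in itertools.product(label_set, repeat=len(x))])
    return float(np.exp(log_score(y) - log_Z))

labels = ("A", "B")
x = ("A", "B", "B")          # observations (same alphabet, for simplicity)
w = np.array([1.0, 2.0])
print(crf_prob(("A", "B", "B"), x, w, labels))  # a valid probability
```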
Log-linear model to linear-CRF
As a reminder, let $x$ be an example, and let $y$ be a possible label for it. Then a log-linear model assumes that

$$p(y \mid x; w) = \frac{\exp\left[\sum_{j=1}^J w_j F_j(x, y)\right]}{Z(x, w)}$$
From now on, we use the bar notation for sequences. Then, for a linear-CRF, we write the above equation as
$$\begin{aligned} p(\bar y \mid \bar x; w) &= \frac{\exp\left[\sum_{j=1}^J w_j F_j(\bar x, \bar y)\right]}{Z(\bar x, w)} \\ &= \frac{\exp\left[\sum_{j=1}^J w_j \sum_{i=2}^{T} f_j(y_{i-1}, y_i, \bar x)\right]}{Z(\bar x, w)} \qquad (1) \end{aligned}$$
where each $y_i$ can take values from $\{1, 2, \dots, m\}$. Here is an example:
Assume we have a sequence $\bar x = (x_1, x_2, x_3, x_4)$ and the corresponding hidden sequence $\bar y = (y_1, y_2, y_3, y_4)$.
![](https://i-blog.csdnimg.cn/blog_migrate/a64ed0056628c2c76243b43cfbdf1d91.png)
We can divide each feature function $F_j(\bar x, \bar y)$ into functions over the maximal cliques. That is,
$$F_j(\bar x, \bar y) = \sum_{i=2}^{T} f_j(y_{i-1}, y_i, \bar x) \tag{1.1}$$
In particular, since the figure above contains $3$ maximal cliques, we get

$$F_j(\bar x, \bar y) = f_j(y_1, y_2, \bar x) + f_j(y_2, y_3, \bar x) + f_j(y_3, y_4, \bar x)$$
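Putting equations (1) and (1.1) together, here is a short sketch for the 4-step example above: each $F_j(\bar x, \bar y)$ is accumulated clique by clique, and $Z(\bar x, w)$ is obtained by enumerating all $m^T$ label sequences. The two feature functions, $T = 4$, and $m = 2$ labels are all toy assumptions for illustration.

```python
import itertools
import numpy as np

# Equations (1) and (1.1) for the 4-step example: each F_j is a sum of
# low-level f_j over the cliques (y_{i-1}, y_i).

T, LABELS = 4, (1, 2)

def f(j, y_prev, y_curr, x_bar):
    """Toy low-level feature functions f_j(y_{i-1}, y_i, xbar)."""
    if j == 0:
        return float(y_prev == y_curr)             # label persistence
    return float(y_curr == 1 and "a" in x_bar)     # label-observation feature

def F(j, x_bar, y_bar):
    """F_j(xbar, ybar) = sum_{i=2}^{T} f_j(y_{i-1}, y_i, xbar) -- eq. (1.1)."""
    return sum(f(j, y_bar[i - 1], y_bar[i], x_bar) for i in range(1, T))

def p(y_bar, x_bar, w):
    """p(ybar | xbar; w) of eq. (1), with Z(xbar, w) by brute force."""
    def score(ys):
        return sum(w[j] * F(j, x_bar, ys) for j in range(len(w)))
    log_Z = np.logaddexp.reduce(
        [score(ys) for ys in itertools.product(LABELS, repeat=T)])
    return float(np.exp(score(y_bar) - log_Z))

w = np.array([0.5, 1.5])
print(p((1, 1, 2, 2), "a b a b", w))  # probability of one label sequence
```

Enumerating all $m^T$ sequences is only viable at toy sizes; the inference section below replaces this brute force with dynamic programming.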