[Conditional Random Field] Linear Chain CRF: Theory and Implementation (I)

Before this, my knowledge of CRFs was limited to having heard of them, which is honestly not something an NLP PhD student should admit 😄. I recently needed a CRF in a project, so I went through the theory and code of the linear chain CRF following [1].

For the theory of the linear chain CRF, I recommend reading [2] first. If the book feels notation-heavy and hard to digest, you can read it alongside this post. On the implementation side, PyTorch provides an official tutorial implementation [1], but its correspondence to the theory is explained only briefly, so readers who have just finished the book may still find that code hard to follow. The goal of this post is to walk readers who roughly understand the linear chain CRF through every detail of the implementation, building a linear chain CRF from start to finish.

This first post is based on the code in [1]: it goes through that code alongside the theory, explaining each detail and reconstructing the order in which the implementation is built. Since the CRF implementation in [1] only handles a single sample and not a batch, the follow-up post implements a CRF layer that processes a batch of samples, explains how to handle masks when sequences in a batch have different lengths, and provides the corresponding code.

The symbols in this post and the variable names in the code are kept consistent with [2] and [1] as far as possible.

Notations

  • $x_i$: the input sequence $x_1, \dots, x_n$
  • $y_i$: the tag sequence $y_1, \dots, y_n$, with $y_i \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$
  • $h_i$: the hidden representation of each $x_i$, $h_i \in \mathbb{R}^{\lvert T \rvert}$
  • $T$: the tag set, including $start$ and $stop$
  • $start, stop$: two additional special tags, $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$
  • $P$: the (only) trainable parameter within the CRF layer, $P \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$

In code, $P$ is self.transitions:
self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))
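To make the later snippets self-contained, here is a minimal skeleton of the module they live in. The class name CRF and the bare constructor are illustrative assumptions (the tutorial in [1] embeds all of this in a BiLSTM_CRF model), but the transition parameter and the two constraint lines follow [1]:

import torch
import torch.nn as nn

START_TAG = "<START>"
STOP_TAG = "<STOP>"

class CRF(nn.Module):
    # minimal skeleton; the encoder that produces the emission scores h_i ("feats") lives elsewhere
    def __init__(self, tag_to_ix):
        super().__init__()
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)
        # P[j][k]: score of transitioning TO tag j FROM tag k
        self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))
        # nothing may transition to START, and nothing may transition from STOP
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000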

Basics

Linear-chain CRF: compute a conditional probability $P(y \mid x)$ given $y$ (the tag sequence) and $x$ (the input sequence of tokens).

Estimate $P(y \mid x)$: for each $y$, take the sum of the values of the feature functions as the estimated $Score(x,y)$, where $Score(x,y) \propto \log P(y \mid x) = \log P(y_1, \dots, y_n \mid x)$, and $P(y \mid x) = \frac{\exp(Score(x,y))}{\sum_{y'}\exp(Score(x,y'))}$.

  • The feature functions for each position $1 \leq i \leq n$
    • Emit score $h_i[y_i]$
      • Captures the semantic feature of this time step
    • Transition score $P[y_i][y_{i-1}]$
      • Captures the local feature between adjacent tags, regardless of the absolute position
      • When $i=1$, the transition score is computed specially, see _score_sentence
  • The feature function for the final transition
    • Transition score $P[y_{n+1}][y_n]$
    • It is computed specially, see _score_sentence
  • $Score(x, y) = \sum_{i=1}^{n}\left(h_i[y_i]+P[y_i][y_{i-1}]\right) + P[y_{n+1}][y_n]$

Train: optimize the model by minimizing $-\log P(y \mid x)$
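Throughout the implementation below, two small helper functions from [1], argmax and log_sum_exp, are used; they are reproduced here so that the later snippets are self-contained:

def argmax(vec):
    # return the index of the maximum entry of a 1 x tagset_size row vector, as a python int
    _, idx = torch.max(vec, 1)
    return idx.item()

def log_sum_exp(vec):
    # numerically stable log(sum(exp(vec))) over a 1 x tagset_size row vector
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))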

Part 1: _score_sentence

Calculate the score of a specific sample, i.e. estimate $\log P(y_0=start, y_1, \dots, y_n, y_{n+1}=stop \mid x)$ given $y, x$

  • Add $y_0=start$ and $y_{n+1}=stop$ to the tag sequence, where $start, stop \in \lbrace 0, 1, \dots, \lvert T \rvert - 1 \rbrace$

    • START_TAG = "<START>"
      STOP_TAG = "<STOP>"
      tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

      self.tag_to_ix = tag_to_ix
      self.tagset_size = len(tag_to_ix)

      # prepend START, so that tags[i] holds y_i with y_0 = start
      tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
      
    • After update: tags.shape = (seq_len+1,)

  • Score = sum of the values of the feature functions

    • For position $1 \leq i \leq n$
      • Emit score $h_i[y_i]$
      • Transition score $P[y_i][y_{i-1}]$
    • For position $i = n+1$
      • Only the transition score $P[stop][y_n]$
    • $Score(x, y) = \sum_{i=1}^{n+1}\left(h_i[y_i]+P[y_i][y_{i-1}]\right)$, where $h_{n+1}[y_{n+1}]$ is taken to be $0$ since there is no emission at the stop position
score = torch.zeros(1)
# feats[i] is h_{i+1}; after prepending START, tags[i + 1] = y_{i+1} and tags[i] = y_i
for i, feat in enumerate(feats):
    score = score + \
        self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
# final transition y_n -> stop
score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
  • $Score(x, y) \propto \log P(y \mid x)$
    • After a softmax over $y$, $Score(x, y)$ becomes $P(y \mid x)$
  • Returns: $Score(x, y) \propto \log P(y \mid x)$
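Putting the snippets above together, the whole method reads roughly as follows (this mirrors _score_sentence in [1]; feats are the emission scores $h_1, \dots, h_n$ produced by the encoder):

def _score_sentence(self, feats, tags):
    # feats: (seq_len, tagset_size) emission scores; tags: (seq_len,) gold tags y_1..y_n
    score = torch.zeros(1)
    # prepend START so that tags[i] = y_i with y_0 = start
    tags = torch.cat([torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags])
    for i, feat in enumerate(feats):
        # transition y_i -> y_{i+1}, plus emission h_{i+1}[y_{i+1}]
        score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # final transition y_n -> stop
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score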

Part 2: _forward_alg

Calculate the total score over all possible $y$ given $x$, i.e. estimate $\log P(y_0=start, y_{n+1}=stop \mid x)$

  • The step matrices $M_i \in \mathbb{R}^{\lvert T \rvert \times \lvert T \rvert}$

    • For position $1 \leq i \leq n$, $M_i[y_i][y_{i-1}] = h_i[y_i] + P[y_i][y_{i-1}]$
    • For position $i = n+1$, $M_i[y_i][y_{i-1}] = P[y_i][y_{i-1}]$
    • $Score(x, y) = \sum_{i=1}^{n+1} M_i[y_i][y_{i-1}]$
  • Building up the forward recursion step by step

    • Consider the estimation of $P(y_0 \mid y_0=start, x)$

      • $\alpha_0[y_0] = \begin{cases}0, & y_0=start \\ -\infty, & \text{otherwise}\end{cases}$

      • $\alpha_0 \propto \log P(y_0 \mid y_0=start, x)$

      • $\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: init_alphas, the initial forward_var

      • init_alphas = torch.full((1, self.tagset_size), -10000.)
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
        forward_var = init_alphas
        
    • Consider the estimation of $P(y_1 \mid y_0=start, x)$

      • $\forall y_1, y_0 \in T$: $Score(y_1, y_0) = \alpha_0[y_0] + M_1[y_1][y_0] \propto \log P(y_1, y_0 \mid y_0=start, x)$
      • Marginalize over $y_0$ with log-sum-exp: $\alpha_1[y_1] = Score(y_1) = \log \sum_{y_0}\exp(Score(y_1, y_0))$
    • Generalize to each time step $1 \leq i \leq n$

      • Please refer to the for loop in the code (the part labeled "Iterate through the sentence")

      • $\forall y_i, y_{i-1} \in T$: $Score(y_i, y_{i-1}) = \alpha_{i-1}[y_{i-1}] + M_i[y_i][y_{i-1}] \propto \log P(y_i, y_{i-1} \mid y_0=start, x)$

        • for next_tag in range(self.tagset_size):
              emit_score = feat[next_tag].view(
                  1, -1).expand(1, self.tagset_size)
              trans_score = self.transitions[next_tag].view(1, -1)
              next_tag_var = forward_var + trans_score + emit_score
          
      • Marginalize over $y_{i-1}$ with log-sum-exp: $\alpha_i[y_i] = Score(y_i) = \log \sum_{y_{i-1}}\exp(Score(y_i, y_{i-1}))$

        • alphas_t = []  
          for next_tag in range(self.tagset_size):
              alphas_t.append(log_sum_exp(next_tag_var).view(1))
          forward_var = torch.cat(alphas_t).view(1, -1)
          
    • For time step $i = n+1$

      • We only calculate $Score(stop, y_n) = Score(y_n) + M_{n+1}[stop][y_n]$

      • $Score(stop, y_n) \propto \log P(y_{n+1}=stop, y_n \mid y_0=start, x)$

      • $\log P(y_{n+1}=stop \mid y_0=start, x) = \log \sum_{y_n}\exp(Score(stop, y_n))$

      • terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        
    • $\alpha = \log P(y_{n+1}=stop \mid y_0=start, x)$

      • Since we always start from $y_0=start$, $\alpha = \log P(y_{n+1}=stop \mid x) = \log P(y_0=start, y_{n+1}=stop \mid x)$
    • Returns: $\alpha = \log P(y_0=start, y_{n+1}=stop \mid x)$
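Assembled, the forward algorithm reads roughly as follows (mirroring _forward_alg in [1]):

def _forward_alg(self, feats):
    # alpha_0: only START has score 0, everything else -10000
    init_alphas = torch.full((1, self.tagset_size), -10000.)
    init_alphas[0][self.tag_to_ix[START_TAG]] = 0.
    forward_var = init_alphas

    # Iterate through the sentence
    for feat in feats:
        alphas_t = []  # forward tensors at this time step
        for next_tag in range(self.tagset_size):
            # broadcast the emission score: it is the same regardless of the previous tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tagset_size)
            # trans_score[k]: score of transitioning to next_tag from tag k
            trans_score = self.transitions[next_tag].view(1, -1)
            next_tag_var = forward_var + trans_score + emit_score
            # log-sum-exp over the previous tag
            alphas_t.append(log_sum_exp(next_tag_var).view(1))
        forward_var = torch.cat(alphas_t).view(1, -1)

    # final transition into STOP
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha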

Part 3: neg_log_likelihood

The training objective of the CRF model: minimize the negative log likelihood of $P(y \mid x)$

  • Forward score $\alpha$

    • $\alpha = \log P(y_0=start, y_{n+1}=stop \mid x)$
  • Gold score $Score(x, y)$

    • $Score(x, y) = \log P(y_0=start, y_1, \dots, y_n, y_{n+1}=stop \mid x)$
  • The loss

    • $loss = \alpha - Score(x, y) = -\log P(y_1, \dots, y_n \mid y_0=start, y_{n+1}=stop, x)$
  • Since all sequences begin with $start$ and end with $stop$, $loss = -\log P(y_1, \dots, y_n \mid x)$

  • forward_score = self._forward_alg(feats)
    gold_score = self._score_sentence(feats, tags)
    return forward_score - gold_score
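For context, a minimal training-loop sketch is shown below. It assumes a model object exposing neg_log_likelihood(feats, tags) as above and a training_data iterable that already yields (feats, tags) tensor pairs; both names are illustrative assumptions, not part of [1] verbatim (the optimizer settings follow [1]):

import torch.optim as optim

# illustrative training loop; `model` and `training_data` are assumed to exist
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

for epoch in range(10):
    for feats, tags in training_data:
        model.zero_grad()
        # loss = forward_score - gold_score, as derived above
        loss = model.neg_log_likelihood(feats, tags)
        loss.backward()
        optimizer.step()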
    

Part 4: _viterbi_decode

Make predictions with the CRF model: $y^* = \underset{y}{\mathrm{argmax}}\ Score(x, y)$

  • Notations

    • $Score(x, y, i) = \sum_{j=1}^{i}\left(h_j[y_j]+P[y_j][y_{j-1}]\right)$, the score of the tag prefix $y_1, \dots, y_i$
    • $\alpha_i \in \mathbb{R}^{\lvert T \rvert}$, where $\alpha_i[t] = \max\ Score(x, y, i)\ \big|_{y_i=t}$ is the best score of any tag prefix of length $i$ that ends with tag $t$
  • The initial score $\alpha_0$

    • $\alpha_0[t] = Score(x, y, 0)\ \big|_{y_0=t}$

    • $\alpha_0[y_0] = \begin{cases}0, & y_0=start \\ -\infty, & \text{otherwise}\end{cases}$

    • $\alpha_0 \in \mathbb{R}^{\lvert T \rvert}$: init_vvars, the initial forward_var

  • Go forward: for each position $1 \leq i \leq n$

    • Please refer to the for loop in the code

    • Algorithm

      • Notice that $Score(x, y, i) = Score(x, y, i-1) + h_i[y_i] + P[y_i][y_{i-1}]$

      • $\alpha_i'[y_i] = \underset{y_{i-1}}{\max}\ \left(\alpha_{i-1}[y_{i-1}] + P[y_i][y_{i-1}]\right)$

        • next_tag_var = forward_var + self.transitions[next_tag]
          best_tag_id = argmax(next_tag_var)
          
      • $\alpha_i[y_i] = \alpha_i'[y_i] + h_i[y_i]$

        • for next_tag in range(self.tagset_size):
              # best_tag_id was computed for this next_tag in the same loop (see the snippet above)
              viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
          forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
          
    • Comments

      • The correctness follows from the decomposition above: the best path that ends at position $i$ with tag $y_i$ must extend a best path that ends at position $i-1$ with some tag $y_{i-1}$
      • Along the way, the information needed to recover the best path is recorded:
        • at each time step, bptrs_t$[y_i]=\underset{y_{i-1}}{\mathrm{argmax}}\ \left(\alpha_{i-1}[y_{i-1}]+P[y_i][y_{i-1}]\right)$, and the per-step bptrs_t are collected in backpointers
  • For time step $n+1$

    • $Score(x, y) = \underset{y_n}{\max}\ \left(\alpha_n[y_n] + P[stop][y_n]\right)$

    • terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
      best_tag_id = argmax(terminal_var)
      path_score = terminal_var[0][best_tag_id]
      
  • Returns: the path score and the best path

    • The best path is recovered by tracing back through backpointers, as sketched below
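The backtrace at the end of _viterbi_decode follows [1]: starting from the best final tag, walk the per-step back pointers from the last position to the first, then drop the START tag and reverse the path:

# trace back through the stored back pointers to recover the best path
best_path = [best_tag_id]
for bptrs_t in reversed(backpointers):
    best_tag_id = bptrs_t[best_tag_id]
    best_path.append(best_tag_id)
# pop off the START tag (it is not part of the returned prediction)
start = best_path.pop()
assert start == self.tag_to_ix[START_TAG]  # sanity check
best_path.reverse()
return path_score, best_path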

References


  1. Making dynamic decisions and the BiLSTM-CRF (PyTorch official tutorial)

  2. 《统计学习方法》 (Statistical Learning Methods), Chapter 11 "Conditional Random Fields", Sections 11.2, 11.3, and 11.5
