Neural Architectures for Named Entity Recognition
2.3 CRF Tagging Models
- Input sentence: $\mathbf{X}=(\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_n)$
- Matrix of scores output by the BiLSTM: $P\in\mathbb{R}^{n\times k}$
- Score of the $j$-th tag for the $i$-th word in the sentence: $P_{i,j}$
- Number of distinct tags: $k$
- Sequence of predictions: $\mathbf{y}=(y_1,y_2,\ldots,y_n)$
- Score of $\mathbf{y}$: $s(\mathbf{X},\mathbf{y})=\sum\limits_{i=0}^{n}A_{y_i,y_{i+1}}+\sum\limits_{i=1}^{n}P_{i,y_i}$
- Matrix of transition scores: $A\in\mathbb{R}^{(k+2)\times(k+2)}$
  - Score of a transition from tag $i$ to tag $j$: $A_{i,j}$
  - Start and end tags of the sentence: $y_0$ and $y_{n+1}$ (these two extra tags are why $A$ has size $k+2$)
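As a minimal sketch, the score $s(\mathbf{X},\mathbf{y})$ can be computed directly from $P$ and $A$. The toy sizes, the made-up score values, and the convention that the last two tag ids are reserved for the start and end tags are assumptions for illustration, not from the paper:

```python
import numpy as np

# Toy setup (assumed for illustration): n = 3 words, k = 2 real tags,
# with the two extra tag ids k and k+1 reserved for START and END.
n, k = 3, 2
START, END = k, k + 1

# Emission scores P (n x k), as a BiLSTM would output; values are made up.
P = np.array([[1.0, 0.2],
              [0.3, 1.5],
              [0.8, 0.1]])

# Transition scores A ((k+2) x (k+2)); A[i, j] scores moving from tag i to tag j.
rng = np.random.default_rng(0)
A = rng.normal(size=(k + 2, k + 2))

def score(P, A, y):
    """s(X, y) = sum_i A[y_i, y_{i+1}] + sum_i P[i, y_i],
    with y_0 = START and y_{n+1} = END padding the tag sequence."""
    padded = [START] + list(y) + [END]
    trans = sum(A[padded[i], padded[i + 1]] for i in range(len(padded) - 1))
    emit = sum(P[i, t] for i, t in enumerate(y))
    return trans + emit
```

Note that the transition sum has $n+1$ terms (start-to-first, $n-1$ internal, last-to-end) while the emission sum has $n$, matching the index ranges in the score formula.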
Probability of the sequence $\mathbf{y}$:

$$p(\mathbf{y}\mid\mathbf{X})=\frac{e^{s(\mathbf{X},\mathbf{y})}}{\sum_{\tilde{\mathbf{y}}\in\mathbf{Y}_{\mathbf{X}}}e^{s(\mathbf{X},\tilde{\mathbf{y}})}}$$
During training, maximize the log-probability of the correct tag sequence:

$$\log(p(\mathbf{y}\mid\mathbf{X}))=s(\mathbf{X},\mathbf{y})-\log\Bigl(\sum\limits_{\tilde{\mathbf{y}}\in\mathbf{Y}_{\mathbf{X}}}e^{s(\mathbf{X},\tilde{\mathbf{y}})}\Bigr)=s(\mathbf{X},\mathbf{y})-\operatorname{logadd}_{\tilde{\mathbf{y}}\in\mathbf{Y}_{\mathbf{X}}}\,s(\mathbf{X},\tilde{\mathbf{y}})$$
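The $\operatorname{logadd}$ term never has to be computed by enumerating all $k^n$ tag sequences: the standard forward recursion gives the same quantity in $O(nk^2)$. A sketch below checks the recursion against brute-force enumeration on a toy instance, assuming (as an illustration convention, not from the paper) that the last two tag ids are the start and end tags:

```python
import numpy as np
from itertools import product

def log_partition_bruteforce(P, A, start, end):
    """log of the sum over all k^n sequences of exp(s(X, y)) -- toy sizes only."""
    n, k = P.shape
    scores = []
    for y in product(range(k), repeat=n):
        padded = (start,) + y + (end,)
        s = sum(A[padded[i], padded[i + 1]] for i in range(n + 1))
        s += sum(P[i, t] for i, t in enumerate(y))
        scores.append(s)
    m = max(scores)  # subtract the max for numerical stability
    return m + np.log(sum(np.exp(s - m) for s in scores))

def log_partition_forward(P, A, start, end):
    """The same quantity in O(n k^2) via the forward recursion."""
    n, k = P.shape
    alpha = A[start, :k] + P[0]  # log-scores of all length-1 prefixes
    for i in range(1, n):
        # alpha_new[j] = logadd_{j'} (alpha[j'] + A[j', j]) + P[i, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + A[:k, :k], axis=0) + P[i]
    return np.logaddexp.reduce(alpha + A[:k, end])  # close every path with the end tag
```

The recursion works because $\operatorname{logadd}$ distributes over the chain structure of the score: paths sharing a prefix share one accumulated log-score.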
- All possible tag sequences for a sentence $\mathbf{X}$: $\mathbf{Y}_{\mathbf{X}}$
Predict the output sequence by:

$$\mathbf{y}^{*}=\operatorname*{argmax}_{\tilde{\mathbf{y}}\in\mathbf{Y}_{\mathbf{X}}}\,s(\mathbf{X},\tilde{\mathbf{y}})$$
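This $\operatorname*{argmax}$ is computed exactly with the Viterbi algorithm, the max-product analogue of the forward recursion. A sketch, again assuming the illustration-only convention that the last two tag ids are the start and end tags:

```python
import numpy as np

def viterbi(P, A, start, end):
    """Return (best score, best tag sequence), maximizing s(X, y) over all sequences."""
    n, k = P.shape
    delta = A[start, :k] + P[0]  # best score of a path ending in each tag at position 0
    backpointers = []
    for i in range(1, n):
        cand = delta[:, None] + A[:k, :k]  # cand[j', j]: extend a path ending in j' with tag j
        backpointers.append(cand.argmax(axis=0))  # best predecessor for each tag j
        delta = cand.max(axis=0) + P[i]
    delta = delta + A[:k, end]  # close every path with the end tag
    best = int(delta.argmax())
    path = [best]
    for bp in reversed(backpointers):  # walk the backpointers to recover the sequence
        path.append(int(bp[path[-1]]))
    path.reverse()
    return float(delta[best]), path
```

It is identical in shape to the forward recursion, with `max`/`argmax` replacing $\operatorname{logadd}$, so decoding is also $O(nk^2)$.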