Lecture 4: Word Window Classification and Neural Networks
Classification background
Notations
input: x_i — words / context windows / sentences / documents, etc.
output: y_i — labels such as sentiment / NER / other words, etc.
i = 1, 2, …, N
# For convenience, "cls" is used below as shorthand for "classification".
General cls method: assume x is fixed; train logistic regression weights W so that only the decision boundary is modified.
Goal:
p(y|x) = \frac{\exp(W_y x)}{\sum_{c=1}^C \exp(W_c x)}
We try to minimize the negative log probability of the true class:
-\log p(y|x) = -\log \frac{\exp(W_y x)}{\sum_{c=1}^C \exp(W_c x)}
This is exactly the cross entropy: because p is one-hot, the only term left is the negative log probability of the true label.
H(p,q) = -\sum_{c=1}^C p(c) \log q(c)
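A minimal numpy sketch of this collapse (the distributions are invented toy values): with a one-hot target p, H(p, q) reduces to the negative log probability of the true class.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.7, 0.2, 0.1])   # predicted distribution over C = 3 classes
p = np.array([0.0, 1.0, 0.0])   # one-hot target: true class is index 1

h = cross_entropy(p, q)
assert np.isclose(h, -np.log(q[1]))  # only the true-class term survives
```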
Thus, our final loss function can be written as
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)}
In practice, we usually add a regularization term to prevent the model from overfitting.
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)} + \lambda \sum_k \theta_k^2
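The regularized loss can be sketched in numpy as follows (W, X, y, and λ are toy placeholder values; the shifted-max trick inside the softmax is a standard numerical-stability detail, not part of the formula):

```python
import numpy as np

def softmax(f):
    f = f - f.max(axis=-1, keepdims=True)  # stabilize exp
    e = np.exp(f)
    return e / e.sum(axis=-1, keepdims=True)

def loss(W, X, y, lam):
    f = X @ W.T                       # scores f_c = W_c x, shape (N, C)
    probs = softmax(f)
    # mean negative log probability of the true class
    nll = -np.log(probs[np.arange(len(y)), y]).mean()
    return nll + lam * np.sum(W ** 2)  # plus lambda * sum_k theta_k^2

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))           # C = 3 classes, d = 4 features
X = rng.normal(size=(5, 4))           # N = 5 examples
y = np.array([0, 2, 1, 0, 2])
print(loss(W, X, y, lam=1e-3))
```

Note the regularizer only ever increases the loss, which is what pulls the weights toward zero during training.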
Updating word vectors for classification
Assume we have some pretrained word vectors and we want to use them in some new tasks.
In the training set we have ‘TV’ and ‘telly’, while in the test set we have ‘television’.
In the pretrained model they might be very similar, i.e., close in the vector space. But after training, ‘TV’ and ‘telly’ may move a lot in the vector space while ‘television’ does not, so the similarity structure changes.
So, if the dataset is small, we usually fix the word vectors. Otherwise, fine-tuning the word vectors may improve results.
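A toy illustration of this drift (all vectors and the update direction are invented): ‘TV’ and ‘telly’ appear in training and get updated, ‘television’ does not, so its similarity to its old neighbors drops.

```python
import numpy as np

vecs = {
    "TV":         np.array([1.0, 0.0]),
    "telly":      np.array([0.9, 0.1]),
    "television": np.array([0.95, 0.05]),  # out of training vocabulary
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

before = cos(vecs["TV"], vecs["television"])

# Pretend fine-tuning pushes the in-vocabulary words toward a task direction.
update = np.array([-1.0, 1.0])
for w in ["TV", "telly"]:              # 'television' never appears, never moves
    vecs[w] = vecs[w] + 0.8 * update

after = cos(vecs["TV"], vecs["television"])
assert after < before                  # similarity to 'television' has dropped
```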
Window classification & cross entropy error derivation tips
Window cls:
- Idea: cls a word in its context window of neighboring words.
- For example, named entity recognition (NER) into 4 classes:
- person, location, organization, none
Method:
- softmax classifier: assign a label to the center word, using the concatenation of all word vectors in its window
- example:
- … museums in Paris are amazing …
- x_{window} = [x_{museums}, x_{in}, x_{Paris}, x_{are}, x_{amazing}]
- x_{window} \in \mathbb{R}^{5d}
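Building x_window is just a slice-and-concatenate, sketched here with random stand-ins for real embeddings (d = 4 is an arbitrary choice):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=d) for w in
       ["museums", "in", "Paris", "are", "amazing"]}

sent = ["museums", "in", "Paris", "are", "amazing"]
center = 2                                 # index of the center word 'Paris'
window = sent[center - 2: center + 3]      # 2 neighbors on each side
x_window = np.concatenate([emb[w] for w in window])
assert x_window.shape == (5 * d,)          # x_window lives in R^{5d}
```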
Then we could use
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)} + \lambda \sum_k \theta_k^2
to update \theta.
A single layer neural network
Neuron:
h_{w,b}(x) = f(w^T x + b)
f(z) = \frac{1}{1 + e^{-z}}
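A single neuron as defined above, with toy illustrative weights:

```python
import numpy as np

def f(z):
    # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, b, x):
    # h_{w,b}(x) = f(w^T x + b)
    return f(w @ x + b)

w = np.array([0.5, -0.5])
x = np.array([1.0, 1.0])
assert neuron(w, 0.0, x) == 0.5   # w^T x + b = 0, and f(0) = 0.5
```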
A neural network = running several logistic regressions at the same time
For the first layer, we have
a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + W_{14}x_4 + b_1)
a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + W_{24}x_4 + b_2)
In matrix notation for a layer
\begin{aligned} z &= Wx + b \\ a &= f(z) \end{aligned}
f is the activation function, usually nonlinear. If f were linear, the multilayer NN would collapse into a single linear transform and thus gain no extra power.
Take window cls as an example:
z = Wx + b \\ a = f(z) \\ s = U^T a
x: the word vectors, e.g. x_{window} = [x_{museums}, x_{in}, x_{Paris}, x_{are}, x_{amazing}]; with d = 4 per word we assume x \in \mathbb{R}^{20\times 1}
W: parameters, W \in \mathbb{R}^{8\times 20}
U: parameters, U \in \mathbb{R}^{8\times 1}
s is the score.
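The full forward pass with those shapes can be sketched as follows (all weights are random placeholders, and sigmoid is assumed for f):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 1))   # concatenated 5-word window, 5d = 20
W = rng.normal(size=(8, 20))
b = rng.normal(size=(8, 1))
U = rng.normal(size=(8, 1))

f = lambda z: 1.0 / (1.0 + np.exp(-z))

z = W @ x + b                  # z in R^{8x1}
a = f(z)                       # a in R^{8x1}, each entry in (0, 1)
s = (U.T @ a).item()           # scalar score
print(s)
```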
Max-Margin loss and backprop
max-margin loss
s = score(museums in Paris are amazing)
s_c = score(not all museums in Paris)
We want s to be high and s_c to be low. Thus, we try to minimize
J = \max(0, 1 - s + s_c)
We call this the **max-margin** loss.
Each window with a location at its center should have a score +1 higher than any window without a location at its center.
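The hinge behavior is easy to see with invented scores: the loss is zero once the true window beats the corrupt one by the margin of 1, and grows linearly with any violation.

```python
def max_margin(s, s_c):
    # J = max(0, 1 - s + s_c)
    return max(0.0, 1.0 - s + s_c)

assert max_margin(3.0, 0.5) == 0.0   # margin satisfied: s >= s_c + 1
assert max_margin(1.0, 0.5) == 0.5   # margin violated by 0.5 -> loss 0.5
```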
For each true window, we use negative sampling to get a false window, and then sum over all training windows to get the final J.
s = U^T f(Wx + b) \\ s_c = U^T f(Wx_c + b)
Next, we will do Backpropagation.
If 1 - s + s_c < 0, then J = 0: the model already handles this true/false pair well, and we needn't do backprop. Otherwise, we update the parameters. After training for a long time, most training samples yield J = 0, and thus we do less computation than at the start.
\frac{\partial s}{\partial U} = a = f(Wx+b) \\ \frac{\partial s}{\partial W} = \big(U \circ f'(Wx+b)\big)\, x^T \\ f'(z) = f(z)(1-f(z))
And, elementwise,
\frac{\partial s}{\partial W_{ij}} = U_i f'(z_i) x_j = \delta_i x_j \\ \frac{\partial s}{\partial b_i} = \delta_i
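These analytic gradients can be verified against a central-difference numerical gradient on a toy network (shapes and values are arbitrary; sigmoid is assumed for f):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda W, b: U @ f(W @ x + b)     # s = U^T f(Wx + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))             # delta_i = U_i f'(z_i), f' = f(1-f)
dW = np.outer(delta, x)                   # ds/dW_ij = delta_i x_j
db = delta                                # ds/db_i  = delta_i

# Central-difference check of dW
eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(2):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num_dW[i, j] = (score(Wp, b) - score(Wm, b)) / (2 * eps)
assert np.allclose(dW, num_dW, atol=1e-6)
```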
For the word vector:
\begin{aligned} \frac{\partial s}{\partial x_j} &= \sum_{i=1}^2 \frac{\partial s}{\partial a_i}\,\frac{\partial a_i}{\partial x_j} \\ &= \sum_{i=1}^2 \frac{\partial U^T a}{\partial a_i}\,\frac{\partial f(W_{i\cdot} x + b_i)}{\partial x_j} \\ &= \sum_{i=1}^2 U_i\, f'(W_{i\cdot} x + b_i)\, W_{ij} \\ &= \sum_{i=1}^2 \delta_i W_{ij} \\ &= W_{\cdot j}^T \delta \end{aligned}
Thus,
\frac{\partial s}{\partial x} = W^T \delta
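The same numerical-check idea confirms this word-vector gradient on a toy network (arbitrary shapes, sigmoid assumed for f):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda x: U @ f(W @ x + b)        # s = U^T f(Wx + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))
dx = W.T @ delta                          # analytic: ds/dx = W^T delta

# Central-difference check along each coordinate direction
eps = 1e-6
num_dx = np.array([
    (score(x + eps * e) - score(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(dx, num_dx, atol=1e-6)
```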