CS224N notes_chapter4_Word window classification and Neural Network

Lecture 4: Word Window Classification and Neural Networks

Classification background

Notations
input: $x_i$ (words, context windows, sentences, documents, etc.)
output: $y_i$ (labels such as sentiment, NER tags, other words, etc.)
$i = 1, 2, \dots, N$
# For brevity, "cls" is used below as shorthand for "classification".
General cls method: assume x is fixed and train only the softmax/logistic-regression weights W, so that just the decision boundary is modified.
Goal: $p(y|x)=\frac{\exp(W_y x)}{\sum_{c=1}^{C}\exp(W_c x)}$
We try to minimize the negative log probability of the true class:
$-\log p(y|x) = -\log \frac{\exp(W_y x)}{\sum_{c=1}^{C}\exp(W_c x)}$
This is exactly the cross entropy between the target distribution $p$ and the predicted distribution $q$:
$H(p,q)=-\sum_{c=1}^{C} p(c)\log q(c)$
Because $p$ is one-hot, the only term left is the negative log probability of the true class.
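As a quick sanity check of the one-hot reduction above, here is a tiny NumPy snippet (the values are made up for illustration) showing that $H(p,q)$ with a one-hot $p$ equals $-\log q(y)$:

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])   # predicted class distribution over 3 classes
y = 1                           # index of the true class
p = np.eye(3)[y]                # one-hot target distribution

# both print 0.3566...: the full cross entropy vs. the single surviving term
print(-np.sum(p * np.log(q)), -np.log(q[y]))
```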
Thus, our final loss function can be written as
$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\exp(f_{y_i})}{\sum_{c=1}^{C}\exp(f_c)}$
where $f_c = W_c x$ is the score of class $c$.
In practice, we usually add a regularization term to prevent the model from overfitting:
$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\exp(f_{y_i})}{\sum_{c=1}^{C}\exp(f_c)} + \lambda\sum_k \theta_k^2$
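Below is a minimal NumPy sketch of this regularized softmax loss (not from the lecture; the function name, the regularization strength, and the toy data are my assumptions), using the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax_cross_entropy(W, X, y, lam=1e-3):
    """Average negative log-likelihood over N examples plus an L2 penalty on W.

    W: (C, d) class weights, X: (N, d) inputs, y: (N,) integer labels.
    """
    scores = X @ W.T                                  # class scores f = W x
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].mean()     # -log p(y_i | x_i), averaged
    return nll + lam * np.sum(W ** 2)                 # + lambda * sum(theta^2)

# toy usage: 4 classes, 20-dimensional inputs, 8 examples
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 20))
X = rng.normal(size=(8, 20))
y = rng.integers(0, 4, size=8)
print(softmax_cross_entropy(W, X, y))
```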

Updating word vectors for classification

Assume we have some pretrained word vectors and want to use them in a new task.
The training set contains 'TV' and 'telly', while the test set contains 'television'.
In the pretrained space the three words are very similar, i.e. close together in the vector space. But after training, 'TV' and 'telly' may move a long way while 'television' stays put, so the similarity structure changes.
So, if the dataset is small, we usually keep the word vectors fixed. Otherwise, fine-tuning (retraining) the word vectors may lead to better results.
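As an illustration only (the framework choice here is my assumption, not part of the notes), in PyTorch this fix-vs-fine-tune decision comes down to whether gradients are allowed to flow into the embedding table:

```python
import torch
import torch.nn as nn

pretrained = torch.randn(10_000, 300)   # stand-in for real pretrained vectors

# small dataset: keep the word vectors fixed
emb_fixed = nn.Embedding.from_pretrained(pretrained, freeze=True)

# large dataset: fine-tune the word vectors along with the rest of the model
emb_tuned = nn.Embedding.from_pretrained(pretrained, freeze=False)
```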

Window classification & cross entropy error derivation tips

Window cls:

  • Idea: classify a word in the context window of its neighboring words.
  • For example, named entity recognition (NER) with 4 classes:
    • person, location, organization, none

Method:

  • train a softmax classifier that assigns a label to the center word, taking the concatenation of all word vectors in the window as input
  • example:
    • … museums in Paris are amazing …
    • $x_{window}=[x_{museums},x_{in},x_{Paris},x_{are},x_{amazing}]$
    • $x_{window}\in \mathbb{R}^{5d}$

Then we can use
$J(\theta) = \frac{1}{N}\sum_{i=1}^{N} -\log\frac{\exp(f_{y_i})}{\sum_{c=1}^{C}\exp(f_c)} + \lambda\sum_k \theta_k^2$
to update $\theta$.
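A small sketch of the whole pipeline (a toy dimension $d=4$, random vectors, and hypothetical names, just to make the shapes concrete): concatenate the five word vectors of the window and feed the result to the softmax classifier.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                                    # toy word-vector dimension
words = ["museums", "in", "Paris", "are", "amazing"]     # window with center "Paris"
vocab = {w: i for i, w in enumerate(words)}
E = rng.normal(size=(len(vocab), d))                     # toy embedding matrix

x_window = np.concatenate([E[vocab[w]] for w in words])  # shape (5d,) = (20,)

W = rng.normal(size=(4, 5 * d))                          # 4 NER classes
f = W @ x_window                                         # unnormalized class scores
probs = np.exp(f - f.max())
probs /= probs.sum()                                     # softmax over the 4 classes
print(probs)
```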

A single layer neural network

Neuron:
$h_{w,b}(x)=f(w^T x+b),\qquad f(z)=\frac{1}{1+e^{-z}}$
A neural network = running several logistic regressions at the same time
For the first layer, we have
$a_1 = f(W_{11}x_1+W_{12}x_2+W_{13}x_3+W_{14}x_4+b_1)$
$a_2 = f(W_{21}x_1+W_{22}x_2+W_{23}x_3+W_{24}x_4+b_2)$
In matrix notation, for a whole layer:
$z = Wx+b,\qquad a=f(z)$
$f$ is the activation function, usually nonlinear. If $f$ were linear, a multilayer NN would collapse into a single linear transform and gain no extra expressive power.
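A quick numerical illustration of that claim (toy random matrices, my own example): with a linear "activation", two stacked layers collapse into a single linear map.

```python
import numpy as np

rng = np.random.default_rng(6)
W1, W2 = rng.normal(size=(8, 20)), rng.normal(size=(3, 8))
x = rng.normal(size=20)

two_layers = W2 @ (W1 @ x)      # "deep" network with a linear f
one_layer = (W2 @ W1) @ x       # one combined linear transform
print(np.allclose(two_layers, one_layer))   # True
```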
Take window classification as an example:
$z = Wx+b,\qquad a=f(z),\qquad s=U^T a$
x: the concatenated window of word vectors, e.g. $x_{window}=[x_{museums},x_{in},x_{Paris},x_{are},x_{amazing}]$; here we assume $x \in \mathbb{R}^{20\times 1}$ (five words, so $d=4$)
W: hidden-layer parameters, $W\in\mathbb{R}^{8\times 20}$
U: scoring parameters, $U\in\mathbb{R}^{8\times 1}$
s is the score of the window
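Here is a minimal forward-pass sketch with exactly these shapes (random values, and I assume the sigmoid from above as the activation $f$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(5)
x = rng.normal(size=(20, 1))      # concatenated 5-word window, x in R^{20x1}
W = rng.normal(size=(8, 20))      # hidden-layer weights, W in R^{8x20}
b = rng.normal(size=(8, 1))       # hidden-layer bias
U = rng.normal(size=(8, 1))       # scoring vector, U in R^{8x1}

z = W @ x + b                     # (8, 1)
a = sigmoid(z)                    # (8, 1) hidden activations
s = (U.T @ a).item()              # scalar window score
print(s)
```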

Max-Margin loss and backprop

max-margin loss
$s$ = score(museums in Paris are amazing)
$s_c$ = score(not all museums in Paris)
We want $s$ to be high and $s_c$ to be low. Thus, we try to minimize
$J=\max(0,\,1-s+s_c)$
We call this the **max-margin** loss.
Each window with a location at its center should score at least 1 higher than any window without a location at its center.
For each true window, we use negative sampling to obtain a corrupted window, and then sum over all training windows to get the final $J$.
$s=U^T f(Wx+b),\qquad s_c=U^T f(Wx_c+b)$
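A minimal sketch of this objective for one true/corrupted pair (random toy parameters; in practice $x_c$ would come from negative sampling as described above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(x, W, b, U):
    """s = U^T f(Wx + b) for one concatenated window vector x."""
    return U @ sigmoid(W @ x + b)

rng = np.random.default_rng(3)
W, b, U = rng.normal(size=(8, 20)), rng.normal(size=8), rng.normal(size=8)
x_true = rng.normal(size=20)       # window with a location at its center
x_corrupt = rng.normal(size=20)    # sampled window without one

s, s_c = score(x_true, W, b, U), score(x_corrupt, W, b, U)
J = max(0.0, 1.0 - s + s_c)        # zero once s beats s_c by a margin of 1
print(s, s_c, J)
```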
Next, we do backpropagation.
If $1-s+s_c<0$ (so $J=0$), the model already handles this true/corrupted pair well and no backprop update is needed. Otherwise, we update the parameters. After training for a while, most training samples give $J=0$, so we do far less computation than at the start.
$\frac{\partial s}{\partial U} = a = f(Wx+b)$
$\frac{\partial s}{\partial W} = \big(U \circ f'(Wx+b)\big)\,x^T$
$f'(z) = f(z)\big(1-f(z)\big)$ for the sigmoid
Elementwise,
$\frac{\partial s}{\partial W_{ij}}=U_i\, f'(z_i)\, x_j=\delta_i\, x_j,\qquad \frac{\partial s}{\partial b_i} = \delta_i$
where $\delta_i = U_i f'(z_i)$.
For the word vectors,
$$\begin{aligned} \frac{\partial s}{\partial x_{j}} &= \sum_{i=1}^2 \frac{\partial s}{\partial a_i} \frac{\partial a_i}{\partial x_j} \\ &= \sum_{i=1}^2 \frac{\partial U^T a}{\partial a_i}\, \frac{\partial f(W_{i\cdot} x + b_i)}{\partial x_j} \\ &= \sum_{i=1}^2 U_i\, f'(W_{i\cdot} x + b_i)\, W_{ij} \\ &= \sum_{i=1}^2 \delta_i\, W_{ij} \\ &= W_{\cdot j}^T\, \delta \end{aligned}$$
Thus,
$\frac{\partial s}{\partial x}=W^T \delta$
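The closed-form gradients above can be checked numerically. This is a hedged sketch with toy shapes (8 hidden units, a 20-dimensional window) comparing $\partial s/\partial x$ against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
W, b, U = rng.normal(size=(8, 20)), rng.normal(size=8), rng.normal(size=8)
x = rng.normal(size=20)

z = W @ x + b
a = sigmoid(z)
s = U @ a

delta = U * a * (1 - a)            # delta_i = U_i f'(z_i), with f' = f(1 - f)
grad_U = a                         # ds/dU = a
grad_W = np.outer(delta, x)        # ds/dW_ij = delta_i x_j
grad_x = W.T @ delta               # ds/dx = W^T delta

# finite-difference check of ds/dx for a single coordinate
eps, j = 1e-6, 5
x_plus = x.copy()
x_plus[j] += eps
s_plus = U @ sigmoid(W @ x_plus + b)
print(grad_x[j], (s_plus - s) / eps)   # the two numbers should nearly match
```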
