Lecture 4: Word Window Classification and Neural Networks
Classification background
Notations
input: x_i — words / context windows / sentences / documents, etc.
output: y_i — labels such as sentiment / NER / other words, etc.
i = 1, 2, …, N
# For convenience, "cls" is used below as shorthand for "classification".
General cls method: assume x is fixed; train logistic regression weights W so that only the decision boundary is modified.
Goal:
p(y|x) = \frac{\exp(W_y x)}{\sum_{c=1}^C \exp(W_c x)}
We try to minimize the negative log probability of the true class:
-\log p(y|x) = -\log \frac{\exp(W_y x)}{\sum_{c=1}^C \exp(W_c x)}
This is exactly the cross entropy: because p is one-hot, the only term left is the negative log probability of the true label.
H(p,q) = -\sum_{c=1}^C p(c) \log q(c)
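A minimal numpy sketch of this collapse (the distributions are invented toy values): with a one-hot target p, H(p, q) reduces to the negative log probability of the true class.

```python
import numpy as np

def cross_entropy(p, q):
    # H(p, q) = -sum_c p(c) log q(c)
    return -np.sum(p * np.log(q))

q = np.array([0.7, 0.2, 0.1])   # predicted distribution over C = 3 classes
p = np.array([0.0, 1.0, 0.0])   # one-hot target: true class is index 1

h = cross_entropy(p, q)
assert np.isclose(h, -np.log(q[1]))  # only the true-class term survives
```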
Thus, our final loss function can be written as
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)}
In practice, we usually add a regularization term to prevent the model from overfitting.
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)} + \lambda \sum_k \theta_k^2
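The regularized loss can be sketched in numpy as follows (W, X, y, and λ are toy placeholder values; the shifted-max trick inside the softmax is a standard numerical-stability detail, not part of the formula):

```python
import numpy as np

def softmax(f):
    f = f - f.max(axis=-1, keepdims=True)  # stabilize exp
    e = np.exp(f)
    return e / e.sum(axis=-1, keepdims=True)

def loss(W, X, y, lam):
    f = X @ W.T                       # scores f_c = W_c x, shape (N, C)
    probs = softmax(f)
    # mean negative log probability of the true class
    nll = -np.log(probs[np.arange(len(y)), y]).mean()
    return nll + lam * np.sum(W ** 2)  # plus lambda * sum_k theta_k^2

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))           # C = 3 classes, d = 4 features
X = rng.normal(size=(5, 4))           # N = 5 examples
y = np.array([0, 2, 1, 0, 2])
print(loss(W, X, y, lam=1e-3))
```

Note the regularizer only ever increases the loss, which is what pulls the weights toward zero during training.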
Updating word vectors for classification
Assume we have some pretrained word vectors and we want to use them in some new tasks.
In the training set we have ‘TV’ and ‘telly’, while in the test set we have ‘television’.
In the pretrained model they might be very similar, i.e., close in the vector space. But after training, ‘TV’ and ‘telly’ may move a lot in the vector space while ‘television’ does not, so the similarity structure changes.
So, if the dataset is small, we usually fix the word vectors. Otherwise, fine-tuning the word vectors may improve results.
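A toy illustration of this drift (all vectors and the update direction are invented): ‘TV’ and ‘telly’ appear in training and get updated, ‘television’ does not, so its similarity to its old neighbors drops.

```python
import numpy as np

vecs = {
    "TV":         np.array([1.0, 0.0]),
    "telly":      np.array([0.9, 0.1]),
    "television": np.array([0.95, 0.05]),  # out of training vocabulary
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

before = cos(vecs["TV"], vecs["television"])

# Pretend fine-tuning pushes the in-vocabulary words toward a task direction.
update = np.array([-1.0, 1.0])
for w in ["TV", "telly"]:              # 'television' never appears, never moves
    vecs[w] = vecs[w] + 0.8 * update

after = cos(vecs["TV"], vecs["television"])
assert after < before                  # similarity to 'television' has dropped
```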
Window classification & cross entropy error derivation tips
Window cls:
- Idea: cls a word in its context window of neighboring words.
- For example, named entity recognition (NER) into 4 classes:
- person, location, organization, none
Method:
- softmax classifier: assign a label to the center word, using the concatenation of all word vectors in its window
- example:
- … museums in Paris are amazing …
- x_{window} = [x_{museums}, x_{in}, x_{Paris}, x_{are}, x_{amazing}]
- x_{window} \in \mathbb{R}^{5d}
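Building x_window is just a slice-and-concatenate, sketched here with random stand-ins for real embeddings (d = 4 is an arbitrary choice):

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=d) for w in
       ["museums", "in", "Paris", "are", "amazing"]}

sent = ["museums", "in", "Paris", "are", "amazing"]
center = 2                                 # index of the center word 'Paris'
window = sent[center - 2: center + 3]      # 2 neighbors on each side
x_window = np.concatenate([emb[w] for w in window])
assert x_window.shape == (5 * d,)          # x_window lives in R^{5d}
```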
Then we could use
J(\theta) = \frac{1}{N} \sum_{i=1}^N -\log \frac{\exp(f_{y_i})}{\sum_{c=1}^C \exp(f_c)} + \lambda \sum_k \theta_k^2
to update \theta.
A single layer neural network
Neuron:
h_{w,b}(x) = f(w^T x + b)
f(z) = \frac{1}{1 + e^{-z}}
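A single neuron as defined above, with toy illustrative weights:

```python
import numpy as np

def f(z):
    # logistic (sigmoid) activation
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, b, x):
    # h_{w,b}(x) = f(w^T x + b)
    return f(w @ x + b)

w = np.array([0.5, -0.5])
x = np.array([1.0, 1.0])
assert neuron(w, 0.0, x) == 0.5   # w^T x + b = 0, and f(0) = 0.5
```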
A neural network = running several logistic regressions at the same time
For the first layer, we have
a_1 = f(W_{11}x_1 + W_{12}x_2 + W_{13}x_3 + W_{14}x_4 + b_1)
a_2 = f(W_{21}x_1 + W_{22}x_2 + W_{23}x_3 + W_{24}x_4 + b_2)
In matrix notation for a layer
\begin{aligned} z &= Wx + b \\ a &= f(z) \end{aligned}
f is the activation function, usually nonlinear. If f were linear, the multilayer NN would collapse into a single linear transform and thus gain no extra power.
Take window cls as an example:
z = Wx + b \\ a = f(z) \\ s = U^T a
x: the word vectors, e.g. x_{window} = [x_{museums}, x_{in}, x_{Paris}, x_{are}, x_{amazing}]; with d = 4 per word we assume x \in \mathbb{R}^{20\times 1}
W: parameters, W \in \mathbb{R}^{8\times 20}
U: parameters, U \in \mathbb{R}^{8\times 1}
s is the score.
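The full forward pass with those shapes can be sketched as follows (all weights are random placeholders, and sigmoid is assumed for f):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 1))   # concatenated 5-word window, 5d = 20
W = rng.normal(size=(8, 20))
b = rng.normal(size=(8, 1))
U = rng.normal(size=(8, 1))

f = lambda z: 1.0 / (1.0 + np.exp(-z))

z = W @ x + b                  # z in R^{8x1}
a = f(z)                       # a in R^{8x1}, each entry in (0, 1)
s = (U.T @ a).item()           # scalar score
print(s)
```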
Max-Margin loss and backprop
max-margin loss
s = score(museums in Paris are amazing)
s_c = score(not all museums in Paris)
We want s to be high and s_c to be low. Thus, we try to minimize
J = \max(0, 1 - s + s_c)
We call this the **max-margin** loss.
Each window with a location at its center should have a score +1 higher than any window without a location at its center.
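The hinge behavior is easy to see with invented scores: the loss is zero once the true window beats the corrupt one by the margin of 1, and grows linearly with any violation.

```python
def max_margin(s, s_c):
    # J = max(0, 1 - s + s_c)
    return max(0.0, 1.0 - s + s_c)

assert max_margin(3.0, 0.5) == 0.0   # margin satisfied: s >= s_c + 1
assert max_margin(1.0, 0.5) == 0.5   # margin violated by 0.5 -> loss 0.5
```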
For each true window, we use negative sampling to get a false window, and then sum over all training windows to get the final J.
s = U^T f(Wx + b) \\ s_c = U^T f(Wx_c + b)
Next, we will do Backpropagation.
If 1 - s + s_c < 0, then J = 0: the model already handles this true/false pair well, and we needn't do backprop. Otherwise, we update the parameters. After training for a long time, most training samples yield J = 0, and thus we do less computation than at the start.
\frac{\partial s}{\partial U} = a = f(Wx+b) \\ \frac{\partial s}{\partial W} = \big(U \circ f'(Wx+b)\big)\, x^T \\ f'(z) = f(z)(1-f(z))
And, elementwise,
\frac{\partial s}{\partial W_{ij}} = U_i f'(z_i) x_j = \delta_i x_j \\ \frac{\partial s}{\partial b_i} = \delta_i
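These analytic gradients can be verified against a central-difference numerical gradient on a toy network (shapes and values are arbitrary; sigmoid is assumed for f):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda W, b: U @ f(W @ x + b)     # s = U^T f(Wx + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))             # delta_i = U_i f'(z_i), f' = f(1-f)
dW = np.outer(delta, x)                   # ds/dW_ij = delta_i x_j
db = delta                                # ds/db_i  = delta_i

# Central-difference check of dW
eps = 1e-6
num_dW = np.zeros_like(W)
for i in range(2):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num_dW[i, j] = (score(Wp, b) - score(Wm, b)) / (2 * eps)
assert np.allclose(dW, num_dW, atol=1e-6)
```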
For the word vector:
\begin{aligned} \frac{\partial s}{\partial x_j} &= \sum_{i=1}^2 \frac{\partial s}{\partial a_i}\,\frac{\partial a_i}{\partial x_j} \\ &= \sum_{i=1}^2 \frac{\partial U^T a}{\partial a_i}\,\frac{\partial f(W_{i\cdot} x + b_i)}{\partial x_j} \\ &= \sum_{i=1}^2 U_i\, f'(W_{i\cdot} x + b_i)\, W_{ij} \\ &= \sum_{i=1}^2 \delta_i W_{ij} \\ &= W_{\cdot j}^T \delta \end{aligned}
Thus,
\frac{\partial s}{\partial x} = W^T \delta
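The same numerical-check idea confirms this word-vector gradient on a toy network (arbitrary shapes, sigmoid assumed for f):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3)); b = rng.normal(size=2)
U = rng.normal(size=2);      x = rng.normal(size=3)

f = lambda z: 1.0 / (1.0 + np.exp(-z))
score = lambda x: U @ f(W @ x + b)        # s = U^T f(Wx + b)

z = W @ x + b
delta = U * f(z) * (1 - f(z))
dx = W.T @ delta                          # analytic: ds/dx = W^T delta

# Central-difference check along each coordinate direction
eps = 1e-6
num_dx = np.array([
    (score(x + eps * e) - score(x - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(dx, num_dx, atol=1e-6)
```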