Question
The risk function of Softmax regression is

$$\mathcal{R}\left( \boldsymbol{W} \right) =-\frac{1}{N}\sum_{n=1}^N{\sum_{c=1}^C{y_{c}^{\left( n \right)}\log \hat{y}_{c}^{\left( n \right)}}} =-\frac{1}{N}\sum_{n=1}^N{\left( \boldsymbol{y}^{\left( n \right)} \right) ^T\log \hat{\boldsymbol{y}}^{\left( n \right)}}$$
What is the effect of adding a regularization term?
Analysis
Note that the $C$ weight vectors used in Softmax regression are redundant: subtracting the same vector $\boldsymbol{v}$ from all of the weight vectors does not change the output. For this reason, Softmax regression usually needs a regularization term to constrain its parameters. The same property can also be exploited to avoid numerical overflow when computing the Softmax function. Without regularization limiting the size of the weight vectors, the weights may grow without bound and cause overflow.
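As a minimal sketch (function names and data are illustrative, not from the book), the shift invariance of Softmax and the resulting overflow-safe implementation can be checked numerically:

```python
import numpy as np

def softmax(z):
    # Naive softmax: np.exp overflows once logits exceed roughly 709.
    e = np.exp(z)
    return e / e.sum()

def stable_softmax(z):
    # Subtracting the same constant from every logit leaves the output
    # unchanged, so shift by max(z) before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
# Shift invariance: the probabilities are unchanged by a constant shift.
print(np.allclose(stable_softmax(z), stable_softmax(z + 100.0)))  # True

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax(big))        # [nan nan nan] -- exp(1000) overflows
print(stable_softmax(big))     # valid probabilities, equals softmax([0, 1, 2])
```

The shift by $\max(z)$ is exactly the redundancy argument above: it changes nothing mathematically, but keeps every exponent at or below zero.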
Question
Verify that the computation of the average weight vector given in the averaged perceptron training algorithm (Algorithm 3.2) is equivalent to Equation (3.77).
Analysis
The averaged perceptron makes predictions of the form

$$\hat{y}=\mathrm{sgn}\left( \frac{1}{T}\sum_{k=1}^K{c_k\left( \boldsymbol{w}_{k}^{T}\boldsymbol{x} \right)} \right) =\mathrm{sgn}\left( \frac{1}{T}\left( \sum_{k=1}^K{c_k\boldsymbol{w}_k} \right) ^T\boldsymbol{x} \right) =\mathrm{sgn}\left( \left( \frac{1}{T}\sum_{t=1}^T{\boldsymbol{w}_t} \right) ^T\boldsymbol{x} \right) =\mathrm{sgn}\left( \bar{\boldsymbol{w}}^T\boldsymbol{x} \right)$$
where $T$ is the total number of iterations and $\bar{\boldsymbol{w}}$ is the average weight vector over the $T$ iterations. The method is simple: add a vector $\bar{\boldsymbol{w}}$ to Algorithm 3.1 and update it at every iteration.
Algorithm 3.2:
Suppose there are $K$ misclassified samples, $(x_1,y_1),(x_2,y_2),\cdots,(x_K,y_K)$, and let $t_k$ denote the iteration at which sample $k$ is selected. From the algorithm above:
$$w=x_1y_1+x_2y_2+\cdots+x_Ky_K$$
$$u=t_1x_1y_1+t_2x_2y_2+\cdots+t_Kx_Ky_K$$
Therefore,

$$\bar{w}=w-\frac{1}{T}u =x_1y_1+x_2y_2+\cdots+x_Ky_K-\frac{1}{T}\left( t_1x_1y_1+t_2x_2y_2+\cdots+t_Kx_Ky_K \right) =\frac{T-t_1}{T}x_1y_1+\frac{T-t_2}{T}x_2y_2+\cdots+\frac{T-t_K}{T}x_Ky_K$$
Equation (3.77):
$$w=\sum_{t=1}^T{w_t}\qquad w_t=\sum_{i=1}^{k\le t}{x_iy_i}\qquad \bar{w}=\frac{1}{T}w$$
Expanding the sum:

$$w=\left( x_1y_1+\cdots+x_1y_1 \right) +\left( x_1y_1+x_2y_2+\cdots+x_1y_1+x_2y_2 \right) +\cdots+\left( \sum_{i=1}^K{x_iy_i}+\cdots+\sum_{i=1}^K{x_iy_i} \right)$$
In the expansion above, the term $x_2y_2$ only starts being added once the second misclassified sample is selected, i.e. from iteration $t_2$ onward; similarly, each $x_iy_i$ starts being added at iteration $t_i$. Hence each $x_iy_i$ is added $T-t_i$ times in total, so:
$$w=\left( T-t_1 \right) x_1y_1+\left( T-t_2 \right) x_2y_2+\cdots+\left( T-t_K \right) x_Ky_K$$
Dividing by $T$ gives $\bar{w}$:

$$\bar{w}=\frac{T-t_1}{T}x_1y_1+\frac{T-t_2}{T}x_2y_2+\cdots+\frac{T-t_K}{T}x_Ky_K$$

which matches the result obtained from Algorithm 3.2, so the two computations are equivalent.
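The equivalence can also be checked numerically. The sketch below (toy data, all names illustrative) runs a perceptron loop that maintains both the auxiliary vector $u$ from Algorithm 3.2 and the direct running sum of $w_t$ from Equation (3.77), then compares $w-\frac{1}{T}u$ with $\frac{1}{T}\sum_t w_t$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data; labels come from a known linear separator (an assumption
# made only so the perceptron updates are well defined).
X = rng.normal(size=(20, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = np.sign(X @ true_w)

T = 50                 # total number of iterations
w = np.zeros(3)        # perceptron weights
u = np.zeros(3)        # Algorithm 3.2's auxiliary vector
w_sum = np.zeros(3)    # direct sum of w_t for Equation (3.77)

for t in range(1, T + 1):
    w_sum += w                      # w_t: weights at the start of iteration t
    i = (t - 1) % len(X)
    if y[i] * (w @ X[i]) <= 0:      # sample misclassified at iteration t
        w += y[i] * X[i]            # w accumulates x_k y_k
        u += t * y[i] * X[i]        # u accumulates t_k x_k y_k

w_bar_direct = w_sum / T            # Equation (3.77): mean of w_t
w_bar_alg = w - u / T               # Algorithm 3.2's formula
print(np.allclose(w_bar_direct, w_bar_alg))  # True
```

Note that $w_t$ is taken as the weight vector at the start of iteration $t$, so an update at iteration $t_k$ contributes to $w_{t_k+1},\dots,w_T$, i.e. exactly $T-t_k$ times, matching the derivation above.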