Original post: https://goodgoodstudy.blog.csdn.net/article/details/109179566
Prerequisites
- Transformation of probability distributions (verified numerically in the sketch after this list):
$$f_y(y)\,\partial y = f_x(x)\,\partial x$$
or
$$f_y(y) = f_x(x)\frac{\partial x}{\partial y} \tag{1}$$
(see the referenced proof)
- KL divergence, which measures the difference between two distributions:
$$D_{KL}(p_1 \| p_2) = \int p_1(y) \log\left(\frac{p_1(y)}{p_2(y)}\right) dy$$
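As a quick numerical check of (1), the following sketch assumes, purely for illustration, that $x \sim \mathcal{N}(0,1)$ and that $g$ is the sigmoid; the histogram of $y = g(x)$ should match the density given by the change-of-variables formula.

```python
import numpy as np

# Empirical check of Eq. (1): push x ~ N(0,1) through the sigmoid
# and compare the histogram of y against f_x(x) * dx/dy.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)
y = 1.0 / (1.0 + np.exp(-x))                        # monotone map y = g(x)

hist, edges = np.histogram(y, bins=50, range=(0.01, 0.99), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

x_of_y = np.log(centers / (1.0 - centers))          # g^{-1}(y)
f_x = np.exp(-x_of_y**2 / 2.0) / np.sqrt(2.0 * np.pi)
dx_dy = 1.0 / (centers * (1.0 - centers))           # inverse of dy/dx = y(1-y)
f_y = f_x * dx_dy                                   # Eq. (1)

print(np.max(np.abs(hist - f_y)))                   # small -> densities agree
```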
IP (Intrinsic Plasticity)
Problem Statement
Suppose the neuron's output equation is:
$$y = g(x)$$
where $x$ is the sum of all signals arriving at the neuron and follows the distribution $x \sim f_x(x)$, and $g(\cdot)$ is a nonlinear activation function, e.g. the sigmoid, which is monotonically increasing. From (1):
$$y \sim f_y(y) = f_x(x)\frac{\partial x}{\partial y}$$
Now we want the neuron's output $y$ to follow some particular target distribution $f_{exp}$, for example the exponential distribution:
$$f_{exp} = \frac{1}{\mu} \exp\left(-\frac{y}{\mu}\right)$$
or the Gaussian distribution:
$$f_{exp} = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right)$$
Gradient Descent
Change the neuron's output equation to:
$$y = g(ax+b)$$
By adjusting the values of $a$ and $b$, we can make the distribution of the output $y$ match the target distribution. Construct the loss function:
$$\begin{aligned} D_{KL}(f_y \| f_{exp}) &= \int f_y \log\left(\frac{f_y}{f_{exp}}\right) dy \\ &= E_y\left[\log(f_y)-\log(f_{exp})\right] \\ &= E_y\left[\log(f_x) - \log\left(\frac{\partial y}{\partial x}\right)-\log(f_{exp})\right] \end{aligned}$$
Minimizing this loss drives $f_y$ toward $f_{exp}$.
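Before deriving analytic gradients, it can help to see the objective as a sample average. The sketch below is a minimal Monte-Carlo estimate of the $(a,b)$-dependent part of the loss for the exponential target; the term $E_y[\log f_x]$ is constant in $(a,b)$ and is dropped, and the Gaussian input and $\mu = 0.2$ are illustrative assumptions, not part of the original post.

```python
import numpy as np

def ip_loss(a, b, x, mu=0.2):
    """Monte-Carlo estimate of the (a, b)-dependent part of
    D_KL(f_y || f_exp) for the exponential target; E[log f_x] is
    constant in (a, b) and omitted."""
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))
    # log(dy/dx) for the sigmoid (derived below as Eq. (2))
    log_dydx = np.log(a) + np.log(y) + np.log(1.0 - y)
    log_fexp = -np.log(mu) - y / mu                 # log f_exp(y)
    return np.mean(-log_dydx - log_fexp)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100_000)
print(ip_loss(1.0, 0.0, x))   # compare candidate settings of (a, b)
print(ip_loss(1.0, -2.0, x))  # shifting b left skews y toward 0, as f_exp prefers
```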
Computing the Gradients
Take $\displaystyle f_{exp} = \frac{1}{\mu} \exp\left(-\frac{y}{\mu}\right)$ as an example.
Suppose the activation function is the sigmoid, $\displaystyle g(x) = \frac{1}{1+\exp(-x)}$; then
$$y = g(ax+b) = \frac{1}{1+\exp(-ax-b)}$$
and thus:
$$\frac{\partial y}{\partial x} = ay(1-y) \tag{2}$$
$$\frac{\partial y}{\partial a} = xy(1-y) \tag{3}$$
$$\frac{\partial y}{\partial b} = y(1-y) \tag{4}$$
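A quick finite-difference check of (2), (3), and (4); the particular values of $a$, $b$, and $x$ are arbitrary.

```python
import numpy as np

# Finite-difference verification of Eqs. (2)-(4) at an arbitrary point.
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
a, b, x, eps = 1.3, -0.4, 0.7, 1e-6
y = sig(a * x + b)

assert np.isclose((sig(a * (x + eps) + b) - y) / eps, a * y * (1 - y))  # (2)
assert np.isclose((sig((a + eps) * x + b) - y) / eps, x * y * (1 - y))  # (3)
assert np.isclose((sig(a * x + (b + eps)) - y) / eps, y * (1 - y))      # (4)
print("Eqs. (2)-(4) check out")
```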
- Gradient with respect to $a$:
$$\begin{aligned} \frac{\partial}{\partial a} D_{KL}(f_y \| f_{exp}) &= \frac{\partial}{\partial a}E_y\left[\log(f_x) - \log\left(\frac{\partial y}{\partial x}\right)-\log(f_{exp})\right] \\ &= E_y\left[0-\frac{\partial}{\partial a}\left(\log a + \log y + \log(1-y)\right) - \frac{\partial}{\partial a}\left(-\log \mu - \frac{y}{\mu}\right)\right] \\ &= -\frac{1}{a} - E_y\left[\frac{1}{y}\frac{\partial y}{\partial a} - \frac{1}{1-y}\frac{\partial y}{\partial a} - \frac{1}{\mu}\frac{\partial y}{\partial a}\right] \\ &= -\frac{1}{a} - E_y\left[\left(\frac{1}{y} - \frac{1}{1-y} - \frac{1}{\mu}\right)\frac{\partial y}{\partial a}\right] \\ &= -\frac{1}{a} - E_y\left[x\left(1-2y-\frac{1}{\mu}y(1-y)\right)\right] \end{aligned}$$
- Gradient with respect to $b$:
$$\begin{aligned} \frac{\partial}{\partial b} D_{KL}(f_y \| f_{exp}) &= \frac{\partial}{\partial b}E_y\left[\log(f_x) - \log\left(\frac{\partial y}{\partial x}\right)-\log(f_{exp})\right] \\ &= E_y\left[0-\frac{\partial}{\partial b}\left(\log a + \log y + \log(1-y)\right) - \frac{\partial}{\partial b}\left(-\log \mu - \frac{y}{\mu}\right)\right] \\ &= -E_y\left[\frac{1}{y}\frac{\partial y}{\partial b} - \frac{1}{1-y}\frac{\partial y}{\partial b} - \frac{1}{\mu}\frac{\partial y}{\partial b}\right] \\ &= -E_y\left[\left(\frac{1}{y} - \frac{1}{1-y} - \frac{1}{\mu}\right)\frac{\partial y}{\partial b}\right] \\ &= -E_y\left[1-2y-\frac{1}{\mu}y(1-y)\right] \end{aligned}$$
Stochastic Gradient Descent
The gradients above are expectations over $y$; the actual implementation replaces them with single-sample estimates and updates the parameters by stochastic gradient descent:
$$a = a + \Delta a$$
$$b = b + \Delta b$$
where each update steps against the gradient, i.e. $\Delta b = -\eta\,\frac{\partial}{\partial b} D_{KL}(f_y \| f_{exp})$ evaluated on a single sample, and likewise for $a$:
$$\begin{aligned} \Delta b &= \eta\left[1-\left(2+\frac{1}{\mu}\right)y+\frac{y^2}{\mu}\right] \\ \Delta a &= \frac{\eta}{a} + x\eta\left[1-\left(2+\frac{1}{\mu}\right)y+\frac{y^2}{\mu}\right] \\ &= \frac{\eta}{a} + x\Delta b \end{aligned}$$
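A minimal single-neuron sketch of the resulting update rule, assuming a Gaussian input stream; the values of $\mu$ and $\eta$ are illustrative choices, not prescribed by the derivation.

```python
import numpy as np

def ip_update(a, b, x, mu=0.2, eta=0.001):
    """One stochastic IP step toward the exponential target with mean mu.
    x is the neuron's net input before gain a and bias b are applied."""
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))
    db = eta * (1.0 - (2.0 + 1.0 / mu) * y + y**2 / mu)  # Delta b
    da = eta / a + x * db                                # Delta a
    return a + da, b + db, y

# Adapt on a Gaussian input stream; the empirical mean of y should
# move toward mu as the output distribution approaches the target.
rng = np.random.default_rng(0)
a, b = 1.0, 0.0
outputs = []
for x in rng.normal(0.0, 1.0, 50_000):
    a, b, y = ip_update(a, b, x)
    outputs.append(y)
print(a, b, np.mean(outputs[-10_000:]))  # mean of recent outputs ~ mu
```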
References
- A Gradient Rule for the Plasticity of a Neuron’s Intrinsic Excitability
- Improving reservoirs using intrinsic plasticity
Experiments
The practical value of IP remains debatable; code: https://goodgoodstudy.blog.csdn.net/article/details/109226320
Shown below are the spatiotemporal distributions of the reservoir states before and after applying IP; the vertical axis indexes the 100 neurons and the horizontal axis is time:
- Before applying IP (figure)
- After applying IP (figure)
The IP rule drives each neuron's output toward a similar distribution; a rough sketch of this kind of experiment follows.
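The reservoir construction below (random recurrent weights rescaled to spectral radius 0.9, driven by Gaussian noise) is an assumption for illustration, not necessarily what the linked code does.

```python
import numpy as np
import matplotlib.pyplot as plt

# A 100-neuron sigmoid reservoir driven by noise, with per-neuron IP
# updates on the gains a_i and biases b_i (exponential target).
rng = np.random.default_rng(0)
N, T, mu, eta = 100, 1000, 0.2, 0.001
W = rng.normal(0.0, 1.0, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9
a, b = np.ones(N), np.zeros(N)
y = np.zeros(N)
states = np.zeros((T, N))
for t in range(T):
    x = W @ y + rng.normal(0.0, 0.5, N)          # net input to each neuron
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))
    db = eta * (1.0 - (2.0 + 1.0 / mu) * y + y**2 / mu)
    a += eta / a + x * db
    b += db
    states[t] = y

# Spatiotemporal plot in the same layout as the figures above:
# neurons on the vertical axis, time on the horizontal axis.
plt.imshow(states.T, aspect="auto", origin="lower")
plt.xlabel("time")
plt.ylabel("neuron")
plt.show()
```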