Math:
https://www.cnblogs.com/pinard/p/6422831.html
https://www.cnblogs.com/pinard/p/6494810.html
https://www.cnblogs.com/pinard/p/10750718.html
https://www.cnblogs.com/pinard/p/10773942.html
https://blog.csdn.net/qq_37951753/article/details/79672615
https://blog.csdn.net/evanxxxnnn/article/details/83552318
https://zhuanlan.zhihu.com/p/45310446
https://blog.csdn.net/qq_36342854/article/details/103863741
https://mp.weixin.qq.com/s/2xYgaeLlmmUfxiHCbCa8dQ
C++:
https://www.cnblogs.com/xuefeng00/p/11093425.html
http://www.cplusplus.com/reference/random/normal_distribution/
https://people.sc.fsu.edu/~jburkardt/cpp_src/truncated_normal/truncated_normal.html
https://www.cnblogs.com/jingshikongming/p/9037881.html
https://www.zhihu.com/question/63507542
https://blog.csdn.net/yahamatarge/article/details/89380164
https://www.cnblogs.com/jhmu0613/p/7750798.html
https://blog.csdn.net/qq_25175067/article/details/80266003
https://blog.csdn.net/lmb1612977696/article/details/80035487
http://blog.chinaunix.net/uid-20773165-id-1847733.html
https://baijiahao.baidu.com/s?id=1651645857687261494&wfr=spider&for=pc
https://blog.csdn.net/weixin_34007291/article/details/93528095
Theory
MLP forward propagation and its derivatives
$$z^l={w^{l}}a^{l-1}+b^l \\ a^l=\sigma(z^l)$$
Shape check, for a batch of $n$ samples and then for a single sample:
$$(m_l\times n)=\sigma\big((m_{l}\times m_{l-1})\,(m_{l-1}\times n) + (m_l\times 1)\big) \\ (m_l,)=\sigma\big((m_l, m_{l-1})\times(m_{l-1},)+(m_l,)\big)$$
MLP backpropagation
Loss function: $J(y, \hat{y})$
Gradient propagation:
$$
\begin{aligned}
\frac{\partial J(y, \hat{y})}{\partial w^l} &= \frac{\partial J(y, \hat{y})}{\partial z^l}\frac{\partial z^l}{\partial w^l} \\
\delta^l &:= \frac{\partial J(y, \hat{y})}{\partial z^l} \\
\frac{\partial J(y, \hat{y})}{\partial w^l} &= \delta^l\frac{\partial z^l}{\partial w^l}=\delta^l{a^{l-1}}^{T} \\
\delta^L = \frac{\partial J(y, \hat{y})}{\partial z^L} &= \nabla J(y,\hat{y}) \\
\delta^{L-1}= \frac{\partial J(y, \hat{y})}{\partial z^{L-1}} &= \frac{\partial J(y, \hat{y})}{\partial z^L}\frac{\partial z^L}{\partial a^{L-1}}\frac{\partial a^{L-1}}{\partial z^{L-1}}={w^L}^{T}\delta^L \odot \sigma'(z^{L-1}) \\
\frac{\partial J(y, \hat{y})}{\partial b^l}&=\sum_m\delta^l_m
\end{aligned}
$$
So each layer $l$ maintains $\frac{\partial z^{l}}{\partial w^l}=a^{l-1}$, and during backpropagation each layer $l$ returns $\frac{\partial J(y, \hat{y})}{\partial z^{l}}$.
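To make the bookkeeping concrete, here is a minimal single-sample sketch of a fully connected layer that follows these formulas. It is an illustration under assumptions, not the framework designed below: the name `LinearLayer`, the sigmoid activation, and the constant 0.01 weight init are all placeholders.

```cpp
#include <cmath>
#include <vector>

// Minimal single-sample fully connected layer: z = W a_prev + b, a = sigma(z).
// It caches a_prev (which is dz/dW) so the weight gradient can be formed in
// the backward pass, exactly as described above.
struct LinearLayer {
    int in, out;
    std::vector<std::vector<double>> W, dW;  // (out, in) weights and dJ/dW
    std::vector<double> b, db, a_prev, z;

    LinearLayer(int in_, int out_)
        : in(in_), out(out_),
          W(out_, std::vector<double>(in_, 0.01)),
          dW(out_, std::vector<double>(in_, 0.0)),
          b(out_, 0.0), db(out_, 0.0) {}

    static double sigma(double x) { return 1.0 / (1.0 + std::exp(-x)); }

    std::vector<double> forward(const std::vector<double>& a) {
        a_prev = a;                        // cache dz/dW = a^{l-1}
        z.assign(out, 0.0);
        std::vector<double> act(out);
        for (int i = 0; i < out; ++i) {
            z[i] = b[i];
            for (int j = 0; j < in; ++j) z[i] += W[i][j] * a[j];
            act[i] = sigma(z[i]);
        }
        return act;
    }

    // Takes dJ/da^l, forms delta^l = dJ/da^l (elementwise) sigma'(z^l),
    // accumulates dW = delta^l (a^{l-1})^T and db = delta^l, and returns
    // dJ/da^{l-1} = W^T delta^l for the previous layer.
    std::vector<double> backward(const std::vector<double>& da) {
        std::vector<double> delta(out), prev(in, 0.0);
        for (int i = 0; i < out; ++i) {
            double s = sigma(z[i]);
            delta[i] = da[i] * s * (1.0 - s);  // sigma'(z) = s(1-s)
            db[i] += delta[i];
            for (int j = 0; j < in; ++j) {
                dW[i][j] += delta[i] * a_prev[j];
                prev[j] += W[i][j] * delta[i];
            }
        }
        return prev;
    }
};
```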
CNN forward propagation
Convolution:
$$z^l=a^{l-1}*W^l+b^l \\ a^l = \sigma(z^l)$$
Pooling:
$$a^l = \mathrm{pooling}(a^{l-1})$$
CNN backpropagation
Convolution:
$$
\begin{aligned}
\frac{\partial J(y, \hat{y})}{\partial z^{l}} &=\left(\frac{\partial z^{l+1}}{\partial z^l}\right)^{T}\frac{\partial J(y, \hat{y})}{\partial z^{l+1}} \\
\frac{\partial J(y, \hat{y})}{\partial z^l} &=\delta^l \\
z^{l+1}&=a^{l}*W^{l+1}+b^{l+1}=\sigma(z^{l})*W^{l+1}+b^{l+1} \\
\frac{\partial z^{l+1}}{\partial z^l}&=\mathrm{rot}(W^{l+1}) \odot \sigma'(z^{l}) \\
\frac{\partial J(y, \hat{y})}{\partial z^{l}} &= \delta^l=\delta^{l+1}*\mathrm{rot}(W^{l+1})\odot \sigma'(z^{l}) \\
\frac{\partial J(y, \hat{y})}{\partial w^{l}} &= a^{l-1}*\delta^l \\
\frac{\partial J(y, \hat{y})}{\partial b^{l}} &=\sum_{i,j} \delta^l_{i,j}
\end{aligned}
$$
Pooling: backfill $\frac{\partial J(y, \hat{y})}{\partial a^{l}}$ into a matrix of the input's shape according to the pooling weights, which yields $\frac{\partial J(y, \hat{y})}{\partial a^{l-1}}$.
Each layer $l$ maintains the Jacobian product:
$$\frac{\partial J(y,\hat{y})}{\partial a^{l-1}}=\left(\frac{\partial z^{l}}{\partial a^{l-1}}\right)^{T}\frac{\partial J(y, \hat{y})}{\partial z^{l}} = \delta^{l}*\mathrm{rot}(W^{l})$$
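As a shape sanity check, a naive single-channel valid convolution (stride 1, no padding) is just a quadruple loop; the name `conv2d_valid` is mine, for illustration only. Deep-learning "convolution" is really cross-correlation, which is exactly why the 180°-rotated kernel $\mathrm{rot}(W)$ appears in the backward formulas above.

```cpp
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Naive 2-D valid cross-correlation:
// out[i][j] = sum_{u,v} in[i+u][j+v] * k[u][v] + bias.
// Output shape: (H - kh + 1) x (W - kw + 1), stride 1, no padding.
Mat conv2d_valid(const Mat& in, const Mat& k, double bias) {
    int H = (int)in.size(), W = (int)in[0].size();
    int kh = (int)k.size(), kw = (int)k[0].size();
    Mat out(H - kh + 1, std::vector<double>(W - kw + 1, bias));
    for (int i = 0; i + kh <= H; ++i)
        for (int j = 0; j + kw <= W; ++j)
            for (int u = 0; u < kh; ++u)
                for (int v = 0; v < kw; ++v)
                    out[i][j] += in[i + u][j + v] * k[u][v];
    return out;
}
```

The backward pass can reuse the same routine: $\frac{\partial J}{\partial W^l}$ is a valid convolution of $a^{l-1}$ with $\delta^l$ (bias 0), while the input gradient is a full (zero-padded) convolution of $\delta^{l+1}$ with the rotated kernel.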
BatchNormal forward propagation
Over a mini-batch of $m$ values (the standard definitions, which the derivation below relies on):
$$\mu = \frac{1}{m}\sum_{k}^{m} x_k, \quad \sigma^2 = \frac{1}{m}\sum_{k}^{m}(x_k-\mu)^2, \quad \hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^{2}+\epsilon}}, \quad y_i = \gamma\hat{x}_i + \beta$$
BatchNormal backpropagation
$$
\begin{aligned}
\frac{\partial J}{\partial \beta}&=\sum_{k}^{m}\frac{\partial J}{\partial y_k} \\
\frac{\partial J}{\partial \gamma}&=\sum_{k}^{m}\frac{\partial J}{\partial y_k} \hat{x}_k \\
\frac{\partial J}{\partial x_{i}} &=\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial x_{i}} \\
&=\frac{\partial J}{\partial \hat{x}_{i}} \frac{1}{\sqrt{\sigma^{2}+\epsilon}}+\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \sigma^{2}} \frac{\partial \sigma^{2}}{\partial x_{i}}+\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \mu} \frac{\partial \mu}{\partial x_{i}} \\
&=\frac{\partial J}{\partial \hat{x}_{i}} \frac{1}{\sqrt{\sigma^{2}+\epsilon}}+\frac{\partial \sigma^{2}}{\partial x_{i}} \cdot \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \sigma^{2}}+\frac{\partial \mu}{\partial x_{i}} \cdot \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \mu} \\
\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \sigma^{2}} &=\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\left[-\frac{1}{2} \frac{x_{k}-\mu}{(\sqrt{\sigma^{2}+\epsilon})^{3}}\right] =-\frac{1}{2} \frac{1}{(\sqrt{\sigma^{2}+\epsilon})^{3}} \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\left(x_{k}-\mu\right) \\
\frac{\partial \sigma^{2}}{\partial x_{i}} &=\frac{2}{m}\left(x_{i}-\mu\right), \qquad \frac{\partial \mu}{\partial x_{i}}=\frac{1}{m} \\
\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{\partial \hat{x}_{k}}{\partial \mu} &=\frac{-1}{\sqrt{\sigma^{2}+\epsilon}} \cdot \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \\
\frac{\partial J}{\partial x_{i}} &=\frac{\partial J}{\partial \hat{x}_{i}} \frac{1}{\sqrt{\sigma^{2}+\epsilon}}+\left[-\frac{1}{2} \frac{1}{(\sqrt{\sigma^{2}+\epsilon})^{3}} \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\left(x_{k}-\mu\right)\right] \frac{2}{m}\left(x_{i}-\mu\right)+\frac{-1}{\sqrt{\sigma^{2}+\epsilon}} \cdot \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \frac{1}{m} \\
&=\frac{\partial J}{\partial \hat{x}_{i}} \frac{1}{\sqrt{\sigma^{2}+\epsilon}}-\frac{1}{m} \frac{x_{i}-\mu}{(\sqrt{\sigma^{2}+\epsilon})^{3}} \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\left(x_{k}-\mu\right)-\frac{1}{m} \frac{1}{\sqrt{\sigma^{2}+\epsilon}} \cdot \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}} \\
&=\frac{1}{m} \frac{1}{\sqrt{\sigma^{2}+\epsilon}}\left\{m \frac{\partial J}{\partial \hat{x}_{i}}-\frac{x_{i}-\mu}{\sigma^{2}+\epsilon} \sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\left(x_{k}-\mu\right)-\sum_{k}^{m} \frac{\partial J}{\partial \hat{x}_{k}}\right\}
\end{aligned}
$$
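The last line transcribes directly into code. A sketch for a single feature over a batch of m values (function and variable names are mine; the running statistics and the γ/β updates are omitted):

```cpp
#include <cmath>
#include <vector>

// Batch-norm backward for one feature over a batch of m values, using the
// closed form derived above. dy is dJ/dy; dJ/dxhat_k = dy[k] * gamma.
std::vector<double> batchnorm_backward(const std::vector<double>& x,
                                       const std::vector<double>& dy,
                                       double gamma, double mu, double var,
                                       double eps = 1e-5) {
    int m = (int)x.size();
    std::vector<double> dxhat(m), dx(m);
    double sum_dxhat = 0.0, sum_dxhat_xc = 0.0;
    for (int k = 0; k < m; ++k) {
        dxhat[k] = dy[k] * gamma;
        sum_dxhat += dxhat[k];
        sum_dxhat_xc += dxhat[k] * (x[k] - mu);  // sum_k dJ/dxhat_k (x_k - mu)
    }
    double inv_std = 1.0 / std::sqrt(var + eps);
    for (int i = 0; i < m; ++i)
        dx[i] = inv_std / m * (m * dxhat[i]
                - (x[i] - mu) / (var + eps) * sum_dxhat_xc
                - sum_dxhat);
    return dx;
}
```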
softmax
Original formula:
$$y_i=\frac{e^{x_i}}{\sum_j e^{x_j}}$$
Problem: $e^n$ blows past floating-point range as soon as $n$ gets even moderately large.
Solution:
$$\log{\sum_ie^{x_i}}=a+\log{\sum_ie^{x_i-a}}$$
where $a$ is taken to be the maximum of the $x_i$.
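In code this is the usual max-subtraction trick; a minimal sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: subtracting max(x) cancels between numerator
// and denominator, but keeps every exponent <= 0 so exp() cannot overflow.
std::vector<double> softmax(const std::vector<double>& x) {
    double a = *std::max_element(x.begin(), x.end());
    std::vector<double> y(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - a);
        sum += y[i];
    }
    for (double& v : y) v /= sum;
    return y;
}
```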
Initialization
truncated normal
Generates truncated-normal random numbers restricted to [ mean - 2 * stddev, mean + 2 * stddev ].
It simply samples in a loop and discards values that fall outside the range.
Later, after getting over the firewall, I found a library for this (the truncated_normal page linked above).
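The loop-and-reject approach is only a few lines with `std::normal_distribution`; for a ±2σ window no external library is really needed (a sketch):

```cpp
#include <random>

// Truncated normal via rejection sampling: redraw until the sample lands in
// [mean - 2*stddev, mean + 2*stddev]. A 2-sigma window accepts ~95% of draws,
// so the loop rarely runs more than once or twice.
double truncated_normal(double mean, double stddev, std::mt19937& gen) {
    std::normal_distribution<double> dist(mean, stddev);
    double x;
    do {
        x = dist(gen);
    } while (x < mean - 2.0 * stddev || x > mean + 2.0 * stddev);
    return x;
}
```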
Practice
The overall goal is to run fast, possibly at the cost of code readability and safety (i.e., written quick and dirty).
Class design
Framework
- Tensor class: shape, computation, name, initialization
- Layer class: input, output, parameters, backpropagation
  Subclasses include CNN, fully connected, pooling, batchnormal, activation, and loss-function layers
- Net class: a computation DAG evaluated node by node in topological order; every node is a layer with its in-degree recorded; holds the loss-function layer and exposes train and test interfaces
Details
1. Tensor
Members
- data[], one-dimensional
- shape, a vector
- grad, null or the same length as data
- name, a char array, default null
Methods (a class skeleton follows this list)
- initialization
  - pass in a shape and initial values
  - pass in a shape, randomly initialized
- dot
  elementwise product: check shapes, then iterate
- mul
  matrix multiplication: check shapes
- add
  addition: check shapes
- sub
  subtraction: check shapes
- div
  division: check shapes
- print
  name (if present) and shape on one line
  data on one line
  grad (if present) on one line
  possibly implemented by overloading the stream operator
- setName(), getName()
- overloaded operators [] and ()
  to access data and grad
- reshape
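A possible skeleton of this interface; the method bodies are left as declarations, and everything here is one way to pin the list down, not a final design:

```cpp
#include <cstddef>
#include <vector>

// Skeleton of the Tensor above: flat storage plus a shape vector, an optional
// grad buffer, and an optional name.
struct Tensor {
    std::vector<double> data;    // one-dimensional, row-major
    std::vector<int> shape;
    std::vector<double> grad;    // empty, or the same length as data
    const char* name = nullptr;  // default null

    Tensor(const std::vector<int>& s, double init_value); // shape + fill value
    explicit Tensor(const std::vector<int>& s);           // shape + random init

    double& operator[](std::size_t i) { return data[i]; } // raw data access
    double& operator()(std::size_t i) { return grad[i]; } // raw grad access

    Tensor dot(const Tensor& o) const; // elementwise product, shape-checked
    Tensor mul(const Tensor& o) const; // matrix multiplication, shape-checked
    Tensor add(const Tensor& o) const;
    Tensor sub(const Tensor& o) const;
    Tensor div(const Tensor& o) const;
    void reshape(const std::vector<int>& s);
    void print() const;          // name/shape line, data line, grad line
    void setName(const char* n) { name = n; }
    const char* getName() const { return name; }
};
```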
2. Layer
Members
none
Methods
- forward
- backward
Subclass 1: linear
- number of neurons m, dimension of the input
- Tensor, 2-D, parameter W
- Tensor, parameter bias
- Tensor, input
Subclass 2: cnn
- m, n, h, w: number of filters, input channels, kernel height and width
- stride
- padding
- parameter W, shape (m, n, h, w)
- parameter bias, shape (m,)
Subclass 3: maxpooling
- h, w, stride
- a weight map (which input won each window), used for backpropagation
Subclass 4: softmax
Subclass 5: relu
Subclass 6: batchnormal
- momentum (mean = momentum * mean + (1.0 - momentum) * nowbatchmean)
- gamma, beta
- mean, var
- shapes depend on whether the previous layer is a CNN or a linear layer
Subclass 7: mse
Subclass 8: cross-entropy loss
Subclass 9: custom loss
A base-class sketch follows.
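This is one way the base class implied by the hierarchy could look, reusing the hypothetical Tensor above; each subclass caches what it needs in forward() (the linear layer its input, maxpooling its weight map) and consumes it in backward():

```cpp
// Abstract base: no data members, just the two passes.
struct Layer {
    virtual ~Layer() = default;
    virtual Tensor forward(const Tensor& input) = 0;
    // Receives the gradient from the next layer, accumulates parameter
    // gradients internally, and returns the gradient for the previous layer.
    virtual Tensor backward(const Tensor& grad_out) = 0;
};

// Shape of a concrete subclass, following "Subclass 1: linear".
struct Linear : Layer {
    Tensor W, bias, input;  // 2-D weights, bias, cached input
    Tensor forward(const Tensor& in) override;
    Tensor backward(const Tensor& grad_out) override;
};
```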
3. Net
Members
- the DAG itself; whether adjacency is stored as a linked forward-star or a vector is TBD
- each node is a layer, but it needs input/output buffers so that branches in the graph are not recomputed
Methods
- init: builds the full DAG up front, i.e., dynamic graphs are not supported (see the topological-sort sketch after this list)
- train: takes one batch
- test
- overloaded operator (): forward propagation
- save
- load
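The evaluation order described above (topological order over recorded in-degrees) is exactly Kahn's algorithm; a sketch independent of the final adjacency representation:

```cpp
#include <queue>
#include <vector>

// Kahn's topological sort: compute the layer evaluation order once at init
// time; train(), test(), and operator() then just walk this order.
std::vector<int> topo_order(const std::vector<std::vector<int>>& adj,
                            std::vector<int> indeg) {
    std::vector<int> order;
    std::queue<int> q;
    for (int v = 0; v < (int)indeg.size(); ++v)
        if (indeg[v] == 0) q.push(v);   // sources: the network inputs
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);
        for (int w : adj[u])
            if (--indeg[w] == 0) q.push(w);
    }
    return order;  // order.size() < node count would indicate a cycle
}
```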