学习笔记|Pytorch使用教程13
本学习笔记主要摘自“深度之眼”,做一个总结,方便查阅。
使用Pytorch版本为1.2
- 梯度消失与爆炸
- Xavier方法与Kaiming方法
- 常用初始化方法
一.梯度消失与爆炸
梯度推导公式:
H
2
=
H
1
∗
W
2
Δ
W
2
=
∂
L
0
s
s
∂
W
2
=
∂
L
0
s
s
∂
o
u
t
∗
∂
o
u
t
∂
H
2
∗
∂
H
2
∂
w
2
=
∂
L
0
s
s
∂
o
u
t
∗
∂
o
u
t
∂
H
2
∗
H
1
\begin{aligned} \mathrm{H}_{2}=& \mathrm{H}_{1} * \mathrm{W}_{2} \\ \Delta \mathrm{W}_{2} &=\frac{\partial \mathrm{L}_{0} \mathrm{s} s}{\partial \mathrm{W}_{2}}=\frac{\partial \mathrm{L}_{0} \mathrm{s} s}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \frac{\partial \mathrm{H}_{2}}{\partial \mathrm{w}_{2}} \\ &=\frac{\partial \mathrm{L}_{0} \mathrm{ss}}{\partial \mathrm{out}} * \frac{\partial \mathrm{out}}{\partial \mathrm{H}_{2}} * \mathrm{H}_{1} \end{aligned}
H2=ΔW2H1∗W2=∂W2∂L0ss=∂out∂L0ss∗∂H2∂out∗∂w2∂H2=∂out∂L0ss∗∂H2∂out∗H1
梯度消失:
H
1
→
0
⇒
Δ
W
2
→
0
\mathrm{H}_{1} \rightarrow 0 \Rightarrow \Delta \mathrm{W}_{2} \rightarrow 0
H1→0⇒ΔW2→0
梯度爆炸:
H
1
→
∞
⇒
Δ
W
2
→
∞
\mathrm{H}_{1} \rightarrow \infty \Rightarrow \Delta W_{2} \rightarrow \infty
H1→∞⇒ΔW2→∞
测试代码:
import os
import torch
import random
import numpy as np
import torch.nn as nn
from tools.common_tools import set_seed
set_seed(1) # 设置随机种子
class MLP(nn.Module):
def __init__(self, neural_num, layers):
super(MLP, self).__init__()
self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
self.neural_num = neural_num
def forward(self, x):
for (i, linear) in enumerate(self.linears):
x = linear(x)
"""
#x = torch.relu(x)
print("layer:{}, std:{}".format(i, x.std()))
if torch.isnan(x.std()):
print("output is nan in {} layers".format(i))
break
"""
return x
def initialize(self):
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.normal_(m.weight.data)
# nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num)) # normal: mean=0, std=1
# a = np.sqrt(6 / (self.neural_num + self.neural_num))
#
# tanh_gain = nn.init.calculate_gain('tanh')
# a *= tanh_gain
#
# nn.init.uniform_(m.weight.data, -a, a)
# nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)
# nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
# nn.init.kaiming_normal_(m.weight.data)
# flag = 0
flag = 1
if flag:
layer_nums = 100
neural_nums = 256
batch_size = 16
net = MLP(neural_nums, layer_nums)
net.initialize()
inputs = torch.randn((batch_size, neural_nums)) # normal: mean=0, std=1
output = net(inputs)
print(output)
输出:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], grad_fn=<MmBackward>)
可以发现这个网络的输出的值很大,出现了nan的情况。
查看在那一层出现了梯度爆炸,设置:
def forward(self, x):
for (i, linear) in enumerate(self.linears):
x = linear(x)
#x = torch.relu(x)
print("layer:{}, std:{}".format(i, x.std()))
if torch.isnan(x.std()):
print("output is nan in {} layers".format(i))
break
""""""
return x
输出:
layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875
layer:3, std:65576.8125
layer:4, std:1045011.875
layer:5, std:17110408.0
layer:6, std:275461440.0
layer:7, std:4402537984.0
layer:8, std:71323615232.0
layer:9, std:1148104736768.0
layer:10, std:17911758454784.0
layer:11, std:283574813065216.0
layer:12, std:4480599540629504.0
layer:13, std:7.196813845908685e+16
layer:14, std:1.1507761512626258e+18
layer:15, std:1.8531105202862293e+19
layer:16, std:2.9677722308204246e+20
layer:17, std:4.780375660819944e+21
layer:18, std:7.61322258007914e+22
layer:19, std:1.2092650667673597e+24
layer:20, std:1.923256845372055e+25
layer:21, std:3.134466694721031e+26
layer:22, std:5.014437175989598e+27
layer:23, std:8.066614199776408e+28
layer:24, std:1.2392660797937701e+30
layer:25, std:1.9455685681908206e+31
layer:26, std:3.02381787247178e+32
layer:27, std:4.950357261592001e+33
layer:28, std:8.150924034825315e+34
layer:29, std:1.3229830735592165e+36
layer:30, std:2.0786816651036685e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[ inf, -2.6817e+38, inf, ..., inf,
inf, inf],
[ -inf, -inf, 1.4387e+38, ..., -1.3409e+38,
-1.9660e+38, -inf],
[-1.5873e+37, inf, -inf, ..., inf,
-inf, 1.1484e+38],
...,
[ 2.7754e+38, -1.6783e+38, -1.5531e+38, ..., inf,
-9.9440e+37, -2.5132e+38],
[-7.7183e+37, -inf, inf, ..., -2.6505e+38,
inf, inf],
[ inf, inf, -inf, ..., -inf,
inf, 1.7432e+38]], grad_fn=<MmBackward>)
发现在31层的时候就出现了梯度爆炸。
相关公式推导:
期望:
E
(
X
)
,
E
(
Y
)
E(X),E(Y)
E(X),E(Y)
方差:
D
(
x
)
=
E
{
∑
[
X
−
E
(
X
)
]
2
}
D(x)=E\left\{\sum[X-E(X)]^{2}\right\}
D(x)=E{∑[X−E(X)]2}
则:
- E ( X ∗ Y ) = E ( X ) ∗ E ( Y ) \mathrm{E}(X * Y)=E(X) * E(Y) E(X∗Y)=E(X)∗E(Y)
- D ( X ) = E ( X 2 ) − [ E ( X ) ] 2 D(X)=E\left(X^{2}\right)-[E(X)]^{2} D(X)=E(X2)−[E(X)]2
- D ( X + Y ) = D ( X ) + D ( Y ) D(X+Y)=D(X)+D(Y) D(X+Y)=D(X)+D(Y)
- 1.2.3 ⇒ D ( X ∗ Y ) = D ( X ) ∗ D ( Y ) + D ( X ) ∗ [ E ( Y ) ] 2 + D ( Y ) ⋆ [ E ( X ) ] 2 1.2.3 \Rightarrow D(X * Y)=D(X) * D(Y)+D(X) *[E(Y)]^{2}+D(Y) \star[E(X)]^{2} 1.2.3⇒D(X∗Y)=D(X)∗D(Y)+D(X)∗[E(Y)]2+D(Y)⋆[E(X)]2
若
E
(
X
)
=
0
,
E
(
Y
)
=
0
E(X)=0,E(Y)=0
E(X)=0,E(Y)=0
则
D
(
X
∗
Y
)
=
D
(
X
)
∗
D
(
Y
)
D(X * Y)=D(X) * D(Y)
D(X∗Y)=D(X)∗D(Y)
以图中
H
11
H_{11}
H11节点为例:
由本次输入:inputs = torch.randn((batch_size, neural_nums)) # normal: mean=0, std=1
可知,期望(平均值)为0,方差为1,即满足条件:
E
(
X
)
=
0
,
E
(
Y
)
=
0
E(X)=0,E(Y)=0
E(X)=0,E(Y)=0,可得
D
(
X
∗
Y
)
=
D
(
X
)
∗
D
(
Y
)
D(X * Y)=D(X) * D(Y)
D(X∗Y)=D(X)∗D(Y)。
推导过程如下:
H
11
=
∑
i
=
0
n
X
i
∗
W
1
i
D
(
H
11
)
=
∑
i
=
0
n
D
(
X
i
)
∗
D
(
W
1
i
)
=
n
∗
(
1
∗
1
)
=
n
std
(
H
11
)
=
D
(
H
11
)
=
n
\begin{aligned} \mathrm{H}_{11}=& \sum_{i=0}^{n} X_{i} * W_{1 i} \quad \ \\ \mathrm{D}\left(\mathrm{H}_{11}\right) &=\sum_{i=0}^{n} D\left(X_{i}\right) * D\left(W_{1 i}\right) \\ &=n *(1 * 1) \\ &=n \\ \operatorname{std}\left(\mathrm{H}_{11}\right) &=\sqrt{\mathrm{D}\left(\mathrm{H}_{11}\right)}=\sqrt{\boldsymbol{n}} \end{aligned}
H11=D(H11)std(H11)i=0∑nXi∗W1i =i=0∑nD(Xi)∗D(W1i)=n∗(1∗1)=n=D(H11)=n
结论:
每经过一次前向传播,方差扩大
n
n
n倍,标准差扩大
n
\sqrt{n}
n倍。
验证:查看刚刚的输出
layer:0, std:15.959932327270508
layer:1, std:256.6237487792969
layer:2, std:4107.24560546875
layer2 的std 比 layer1的std 扩大了 16(
256
\sqrt{256}
256)倍
为了让网络层尺度不变,让方差一直等于1(只有1可以保证任意个数相乘还是为1),即满足条件:
D
(
H
1
)
=
n
∗
D
(
X
)
∗
D
(
W
)
=
1
\mathbf{D}\left(\mathrm{H}_{1}\right)=\boldsymbol{n} * \boldsymbol{D}(\boldsymbol{X}) * \boldsymbol{D}(\boldsymbol{W})=\mathbf{1}
D(H1)=n∗D(X)∗D(W)=1
可以推导出:
D
(
W
)
=
1
n
⇒
std
(
W
)
=
1
n
\boldsymbol{D}(\boldsymbol{W})=\frac{1}{n} \Rightarrow \operatorname{std}(W)=\sqrt{\frac{1}{n}}
D(W)=n1⇒std(W)=n1
即,初始化权重
W
W
W的标准差为
1
/
n
\sqrt{1/n}
1/n。
设置:nn.init.normal_(m.weight.data, std=np.sqrt(1/self.neural_num)) # normal: mean=0, std=sqrt(1/n)
输出:
layer:0, std:0.9974957704544067
layer:1, std:1.0024365186691284
layer:2, std:1.002745509147644
layer:3, std:1.0006227493286133
layer:4, std:0.9966009855270386
layer:5, std:1.019859790802002
layer:6, std:1.0261738300323486
layer:7, std:1.0250457525253296
layer:8, std:1.0378952026367188
layer:9, std:1.0441951751708984
layer:10, std:1.0181655883789062
......
layer:94, std:1.031973123550415
layer:95, std:1.0413124561309814
layer:96, std:1.0817031860351562
layer:97, std:1.1287994384765625
layer:98, std:1.1617799997329712
layer:99, std:1.2215300798416138
tensor([[-1.0696, -1.1373, 0.5047, ..., -0.4766, 1.5904, -0.1076],
[ 0.4572, 1.6211, 1.9660, ..., -0.3558, -1.1235, 0.0979],
[ 0.3909, -0.9998, -0.8680, ..., -2.4161, 0.5035, 0.2814],
...,
[ 0.1876, 0.7971, -0.5918, ..., 0.5395, -0.8932, 0.1211],
[-0.0102, -1.5027, -2.6860, ..., 0.6954, -0.1858, -0.8027],
[-0.5871, -1.3739, -2.9027, ..., 1.6734, 0.5094, -0.9986]],
grad_fn=<MmBackward>)
发现输出的值在合适的范围之内,并且标准差都在1左右,说明公式推导正确。
- 下面加入激活函数
设置:
def forward(self, x):
for (i, linear) in enumerate(self.linears):
x = linear(x)
x = torch.tanh(x)
print("layer:{}, std:{}".format(i, x.std()))
if torch.isnan(x.std()):
print("output is nan in {} layers".format(i))
break
""""""
return x
输出:
layer:0, std:0.6273701786994934
layer:1, std:0.48910173773765564
layer:2, std:0.4099564850330353
layer:3, std:0.35637012124061584
layer:4, std:0.32117360830307007
layer:5, std:0.2981105148792267
layer:6, std:0.27730831503868103
......
layer:94, std:0.07276967912912369
layer:95, std:0.07259567826986313
layer:96, std:0.07586522400379181
layer:97, std:0.07769151031970978
layer:98, std:0.07842091470956802
layer:99, std:0.08206240087747574
tensor([[-0.1103, -0.0739, 0.1278, ..., -0.0508, 0.1544, -0.0107],
[ 0.0807, 0.1208, 0.0030, ..., -0.0385, -0.1887, -0.0294],
[ 0.0321, -0.0833, -0.1482, ..., -0.1133, 0.0206, 0.0155],
...,
[ 0.0108, 0.0560, -0.1099, ..., 0.0459, -0.0961, -0.0124],
[ 0.0398, -0.0874, -0.2312, ..., 0.0294, -0.0562, -0.0556],
[-0.0234, -0.0297, -0.1155, ..., 0.1143, 0.0083, -0.0675]],
grad_fn=<TanhBackward>)
发现标准差,即数据越来越小,甚至到梯度消失的情况。
二.Xavier方法与Kaiming方法
1.Xavier初始化
- 方差一致性:保持数据尺度维持在恰当范围,通常方差为1
- 激活函数:饱和函数,如Sigmoid, Tanh
- 参考文献:《Understanding the difficulty of training deep feedforward neural networks》
考虑前向传播和反向传播,并结合方差一致性准则,得到等式:
n
i
∗
D
(
W
)
=
1
n_{i} * D(W)=1
ni∗D(W)=1
n
i
+
1
∗
D
(
W
)
=
1
n_{i+1} * D(W)=1
ni+1∗D(W)=1
n
i
n_{i}
ni是输入神经元个数,
n
i
+
1
n_{i+1}
ni+1是输出神经元个数。
可以得到:
⇒
D
(
W
)
=
2
n
i
+
n
i
+
1
\Rightarrow D(W)=\frac{2}{n_{i}+n_{i+1}}
⇒D(W)=ni+ni+12
Xavier通常采用均匀分布:
W
∼
U
[
−
a
,
a
]
W \sim U[-a, a]
W∼U[−a,a]
D
(
W
)
=
(
−
a
−
a
)
2
12
=
(
2
a
)
2
12
=
a
2
3
D(W)=\frac{(-a-a)^{2}}{12}=\frac{(2 a)^{2}}{12}=\frac{a^{2}}{3}
D(W)=12(−a−a)2=12(2a)2=3a2
把这里的
D
(
W
)
D(W)
D(W)和上面的
D
(
W
)
D(W)
D(W)相等,求出分布上限与下限:
2
n
i
+
n
i
+
1
=
a
2
3
⇒
a
=
6
n
i
+
n
i
+
1
\frac{2}{n_{i}+n_{i+1}}=\frac{a^{2}}{3} \Rightarrow a=\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}
ni+ni+12=3a2⇒a=ni+ni+16
⇒
W
∼
U
[
−
6
n
i
+
n
i
+
1
,
6
n
i
+
n
i
+
1
]
\Rightarrow W \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}, \frac{\sqrt{6}}{\sqrt{n_{i}+n_{i+1}}}\right]
⇒W∼U[−ni+ni+16,ni+ni+16]
设置:
a = np.sqrt(6 / (self.neural_num + self.neural_num))
tanh_gain = nn.init.calculate_gain('tanh')
a *= tanh_gain
nn.init.uniform_(m.weight.data, -a, a)
输出:
layer:0, std:0.7571136355400085
layer:1, std:0.6924336552619934
layer:2, std:0.6677976846694946
layer:3, std:0.6551960110664368
layer:4, std:0.655646800994873
layer:5, std:0.6536089777946472
layer:6, std:0.6500504612922668
......
layer:95, std:0.6516367793083191
layer:96, std:0.643530011177063
layer:97, std:0.6426344513893127
layer:98, std:0.6408163905143738
layer:99, std:0.6442267298698425
tensor([[ 0.1155, 0.1244, 0.8218, ..., 0.9404, -0.6429, 0.5177],
[-0.9576, -0.2224, 0.8576, ..., -0.2517, 0.9361, 0.0118],
[ 0.9484, -0.2239, 0.8746, ..., -0.9592, 0.7936, 0.6285],
...,
[ 0.7192, 0.0835, -0.4407, ..., -0.9590, 0.2557, 0.5419],
[-0.9546, 0.5104, -0.8002, ..., -0.4366, -0.6098, 0.9672],
[ 0.6085, 0.3967, 0.1099, ..., 0.3905, -0.5264, 0.0729]],
grad_fn=<TanhBackward>)
发现方差在0.65左右,不大不小,比较适中。
使用pytorch封装好的方法:tanh_gain = nn.init.calculate_gain('tanh') nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)
输出:
layer:0, std:0.7571136355400085
layer:1, std:0.6924336552619934
layer:2, std:0.6677976846694946
layer:3, std:0.6551960110664368
layer:4, std:0.655646800994873
layer:5, std:0.6536089777946472
layer:6, std:0.6500504612922668
......
layer:95, std:0.6516367793083191
layer:96, std:0.643530011177063
layer:97, std:0.6426344513893127
layer:98, std:0.6408163905143738
layer:99, std:0.6442267298698425
tensor([[ 0.1155, 0.1244, 0.8218, ..., 0.9404, -0.6429, 0.5177],
[-0.9576, -0.2224, 0.8576, ..., -0.2517, 0.9361, 0.0118],
[ 0.9484, -0.2239, 0.8746, ..., -0.9592, 0.7936, 0.6285],
...,
[ 0.7192, 0.0835, -0.4407, ..., -0.9590, 0.2557, 0.5419],
[-0.9546, 0.5104, -0.8002, ..., -0.4366, -0.6098, 0.9672],
[ 0.6085, 0.3967, 0.1099, ..., 0.3905, -0.5264, 0.0729]],
grad_fn=<TanhBackward>)
和手动计算的结果一致。注意:本初始化方法注意针对"饱和函数"的激活方法。如果采用非饱和函数(在前向传播的时候),会出现数据剧增,如在前向传播设置:x = torch.relu(x)
输出:
layer:0, std:0.9689465165138245
layer:1, std:1.0872339010238647
layer:2, std:1.2967970371246338
......
layer:95, std:3661650.25
layer:96, std:4741351.5
layer:97, std:5300344.0
layer:98, std:6797731.0
layer:99, std:7640649.5
tensor([[ 0.0000, 3028669.0000, 12379584.0000, ...,
3593904.7500, 0.0000, 24658918.0000],
[ 0.0000, 2758812.2500, 11016996.0000, ...,
2970391.2500, 0.0000, 23173852.0000],
[ 0.0000, 2909405.2500, 13117483.0000, ...,
3867146.2500, 0.0000, 28463464.0000],
...,
[ 0.0000, 3913313.2500, 15489625.0000, ...,
5777772.0000, 0.0000, 33226552.0000],
[ 0.0000, 3673757.2500, 12739668.0000, ...,
4193462.0000, 0.0000, 26862394.0000],
[ 0.0000, 1913936.2500, 10243701.0000, ...,
4573383.5000, 0.0000, 22720464.0000]],
grad_fn=<ReluBackward0>)
2.Kaiming初始化
- 方差一致性:保持数据尺度维持在恰当范围,通常方差为1
- 激活函数: ReLU及其变种
- 参考文献:《Delving deep into rectifiers: Surpassing human- level performance on ImageNet classification》
推导得到方差:
D
(
W
)
=
2
n
i
\mathrm{D}(W)=\frac{2}{n_{i}}
D(W)=ni2
针对ReLU的变种,负半轴斜率为a,ReLU负半轴斜率为0,因此:
D
(
W
)
=
2
(
1
+
a
2
)
⋅
n
i
\mathrm{D}(W)=\frac{2}{\left(1+\mathrm{a}^{2}\right) \cdot n_{i}}
D(W)=(1+a2)⋅ni2
std
(
W
)
=
2
(
1
+
a
2
)
⋅
n
i
\operatorname{std}(W)=\sqrt{\frac{2}{\left(1+\mathrm{a}^{2}\right) \cdot n_{i}}}
std(W)=(1+a2)⋅ni2
设置:nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
输出:
layer:0, std:0.826629638671875
layer:1, std:0.878681480884552
layer:2, std:0.9134420156478882
layer:3, std:0.8892467617988586
layer:4, std:0.8344276547431946
layer:5, std:0.87453693151474
......
layer:94, std:0.595414936542511
layer:95, std:0.6624482870101929
layer:96, std:0.6377813220024109
layer:97, std:0.6079217195510864
layer:98, std:0.6579239368438721
layer:99, std:0.6668398976325989
tensor([[0.0000, 1.3437, 0.0000, ..., 0.0000, 0.6444, 1.1867],
[0.0000, 0.9757, 0.0000, ..., 0.0000, 0.4645, 0.8594],
[0.0000, 1.0023, 0.0000, ..., 0.0000, 0.5147, 0.9196],
...,
[0.0000, 1.2873, 0.0000, ..., 0.0000, 0.6454, 1.1411],
[0.0000, 1.3588, 0.0000, ..., 0.0000, 0.6749, 1.2437],
[0.0000, 1.1807, 0.0000, ..., 0.0000, 0.5668, 1.0600]],
grad_fn=<ReluBackward0>)
数据不大不小,适中。
使用pytorch封装好的方法:nn.init.kaiming_normal_(m.weight.data)
输出:
layer:0, std:0.826629638671875
layer:1, std:0.878681480884552
layer:2, std:0.9134420156478882
layer:3, std:0.8892467617988586
layer:4, std:0.8344276547431946
layer:5, std:0.87453693151474
.......
layer:94, std:0.595414936542511
layer:95, std:0.6624482870101929
layer:96, std:0.6377813220024109
layer:97, std:0.6079217195510864
layer:98, std:0.6579239368438721
layer:99, std:0.6668398976325989
tensor([[0.0000, 1.3437, 0.0000, ..., 0.0000, 0.6444, 1.1867],
[0.0000, 0.9757, 0.0000, ..., 0.0000, 0.4645, 0.8594],
[0.0000, 1.0023, 0.0000, ..., 0.0000, 0.5147, 0.9196],
...,
[0.0000, 1.2873, 0.0000, ..., 0.0000, 0.6454, 1.1411],
[0.0000, 1.3588, 0.0000, ..., 0.0000, 0.6749, 1.2437],
[0.0000, 1.1807, 0.0000, ..., 0.0000, 0.5668, 1.0600]],
grad_fn=<ReluBackward0>)
和手动初始化结果一致。
三.常用初始化方法
- Xavier均匀分布
- Xavier正态分布
- Kaiming均匀分布
- Kaiming均匀分布
- 均匀分布
- 正态分布
- 常数分布
- 正交矩阵初始化
- 单位矩阵初始化
- 稀疏矩阵初始化
1.nn.init.calculate_gain
主要功能:计算激活函数的方差变化尺度
主要参数:
- nonlinearity:激活函数名称
- param:激活函数的参数,如Leaky ReLU的negative_ slop
测试代码
# ======================================= calculate gain =======================================
# flag = 0
flag = 1
if flag:
x = torch.randn(10000)
out = torch.tanh(x)
gain = x.std() / out.std()
print('gain:{}'.format(gain))
tanh_gain = nn.init.calculate_gain('tanh')
print('tanh_gain in PyTorch:', tanh_gain)
输出:
gain:1.5982500314712524
tanh_gain in PyTorch: 1.6666666666666667