Dive into Deep Learning 5.2 Parameter Management - Notes & Exercises (PyTorch)

The following are study notes based on Mu Li's course and the accompanying textbook, together with some thoughts on the end-of-section exercises. They are kept for my own review and shared for fellow learners to compare and discuss.

Video for this section: 参数管理_哔哩哔哩_bilibili

Textbook for this section: 5.2. 参数管理 — 动手学深度学习 2.0.0 documentation (d2l.ai)

Source notebook for this section: ...>d2l-zh>pytorch>chapter_multilayer-perceptrons>parameters.ipynb


Parameter Management

Once we have chosen an architecture and set our hyperparameters, we proceed to the training loop, where our goal is to find parameter values that minimize the loss function. After training, we will need these parameters in order to make future predictions. Additionally, we sometimes wish to extract the parameters, either to reuse them in some other context, to save the model to disk so that it can be executed in other software, or to examine it in the hope of gaining scientific understanding.

So far we have relied solely on the deep learning framework to do the work of training, ignoring the details of how parameters are manipulated. In this section, we cover the following:

  • Accessing parameters for debugging, diagnostics, and visualization;
  • Parameter initialization;
  • Sharing parameters across different model components.

(We start by focusing on an MLP with one hidden layer.)

import torch
from torch import nn

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
X = torch.rand(size=(2, 4))
net(X)

Output:

tensor([[0.2780],
[0.3197]], grad_fn=<AddmmBackward0>)

[Parameter Access]

We start with how to access parameters from the models we already know. When a model is defined via the Sequential class, we can access any layer by indexing into the model as if it were a list; each layer's parameters sit in its attributes. As shown below, we can inspect the parameters of the second fully connected layer.

print(net[2].state_dict())

Output:

OrderedDict([('weight', tensor([[-0.0732, 0.0553, -0.3387, -0.2949, -0.1384, 0.2179, -0.3526, -0.1084]])), ('bias', tensor([0.3457]))])

The output tells us a few important things. First, this fully connected layer contains two parameters, its weights and its biases. Both are stored as single-precision floats (float32). Note that the parameter names allow us to uniquely identify each parameter, even in a network containing hundreds of layers.
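
As a quick check (a small sketch added to these notes, not part of the original notebook), we can confirm the storage dtype and the registered parameter names directly:

# Confirm the dtype and the registered parameter names of the second fully connected layer
print(net[2].weight.dtype)                              # torch.float32
print([name for name, _ in net[2].named_parameters()])  # ['weight', 'bias']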

[Targeted Parameters]

Note that each parameter is represented as an instance of the parameter class. To do anything useful with the parameters, we first need to access the underlying numerical values. There are several ways to do this; some are simpler, others more general. The following code extracts the bias from the second fully connected layer (i.e., the third layer of the network), which returns a parameter class instance, and then further accesses that parameter's value.

print(type(net[2].bias))
print(net[2].bias)
print(net[2].bias.data)

Output:

<class 'torch.nn.parameter.Parameter'>
Parameter containing:
tensor([0.3457], requires_grad=True)
tensor([0.3457])

Parameters are complex objects containing values, gradients, and additional information. That is why we need to request the value explicitly. Besides the value, we can also access each parameter's gradient. Because we have not invoked backpropagation for this network yet, the gradient is still in its initial state.

net[2].weight.grad == None

Output:

True
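
For illustration (a sketch added for these notes, not from the original notebook), running a single backward pass populates the gradient; we clear it again afterwards so the later cells are unaffected:

# After a backward pass the gradient is no longer None
net(X).sum().backward()
print(net[2].weight.grad is None)    # False
print(net[2].weight.grad.shape)      # torch.Size([1, 8])
net.zero_grad()                      # clear the gradients again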

[All Parameters at Once]

When we need to perform operations on all parameters, accessing them one by one can grow tedious. The situation gets especially unwieldy when we work with more complex blocks (e.g., nested blocks), since we would have to recurse through the entire tree to extract each sub-block's parameters. Below we compare accessing the parameters of the first fully connected layer with accessing all layers at once.

print(*[(name, param.shape) for name, param in net[0].named_parameters()])
print(*[(name, param.shape) for name, param in net.named_parameters()])

Output:

('weight', torch.Size([8, 4])) ('bias', torch.Size([8]))
('0.weight', torch.Size([8, 4])) ('0.bias', torch.Size([8])) ('2.weight', torch.Size([1, 8])) ('2.bias', torch.Size([1]))

This provides us with another way of accessing the parameters of the network, as shown below.

net.state_dict()['2.bias'].data

Output:

tensor([0.3457])
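
The keys of state_dict mirror the fully qualified names returned by named_parameters, so listing them is another quick way to enumerate every parameter (a small sketch added here):

# state_dict maps fully qualified parameter names to their tensors
print(list(net.state_dict().keys()))   # ['0.weight', '0.bias', '2.weight', '2.bias']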

[Collecting Parameters from Nested Blocks]

Let us see how the parameter naming conventions work when we nest multiple blocks inside one another. We first define a function that produces blocks (a block factory, so to speak) and then combine these blocks into yet larger blocks.

def block1():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                         nn.Linear(8, 4), nn.ReLU())

def block2():
    net = nn.Sequential()
    for i in range(4):
        # Nest block1 here
        net.add_module(f'block {i}', block1())
    return net

rgnet = nn.Sequential(block2(), nn.Linear(4, 1))
rgnet(X)

Output:

tensor([[0.4464],
[0.4464]], grad_fn=<AddmmBackward0>)

[Now that we have designed the network, let us see how it is organized.]

print(rgnet)

Output:

Sequential(
  (0): Sequential(
    (block 0): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 1): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 2): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
    (block 3): Sequential(
      (0): Linear(in_features=4, out_features=8, bias=True)
      (1): ReLU()
      (2): Linear(in_features=8, out_features=4, bias=True)
      (3): ReLU()
    )
  )
  (1): Linear(in_features=4, out_features=1, bias=True)
)

Since the layers are hierarchically nested, we can also access them as though indexing through nested lists. Below, we access the bias of the first layer within the second sub-block of the first major block.

rgnet[0][1][0].bias.data

Output:

tensor([-0.1836, 0.2985, -0.4316, -0.0451, 0.2871, 0.4270, 0.2767, -0.2890])
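
Because the modules are registered hierarchically, the fully qualified parameter names also encode the nesting, which gives another way to locate a parameter (a small sketch added here):

# The dotted names encode outer index, block name, and inner index
for name, param in rgnet.named_parameters():
    print(name, param.shape)
# e.g. '0.block 0.0.weight' torch.Size([8, 4]) ... '1.weight' torch.Size([1, 4])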

Parameter Initialization

Now that we know how to access the parameters, let us look at how to initialize them properly. We discussed the need for proper initialization in Section 4.8. The deep learning framework provides default random initializations, but it also lets us create custom initializers to implement other initialization schemes for the weights.

By default, PyTorch initializes weight and bias matrices uniformly over a range that is computed from the input and output dimensions. PyTorch's nn.init module provides a variety of preset initialization methods.
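
As a quick empirical check (a sketch; the exact default may vary across PyTorch versions), the default weights of a freshly constructed nn.Linear should fall within about ±1/sqrt(fan_in):

import math

# Recent PyTorch versions draw nn.Linear weights uniformly from roughly
# [-1/sqrt(fan_in), 1/sqrt(fan_in)]; verify this empirically on a fresh layer
layer = nn.Linear(4, 8)
bound = 1 / math.sqrt(layer.in_features)
print(layer.weight.data.abs().max() <= bound)   # tensor(True)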

[Built-in Initialization]

Let us begin by calling on built-in initializers. The code below initializes all weight parameters as Gaussian random variables with standard deviation 0.01, while bias parameters are cleared to zero.

def init_normal(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, mean=0, std=0.01)
        nn.init.zeros_(m.bias)

net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

Output:

(tensor([-0.0135, 0.0024, 0.0041, 0.0077]), tensor(0.))

We can also initialize all the parameters to a given constant value, say, 1.

def init_constant(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 1)
        nn.init.zeros_(m.bias)
net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

Output:

(tensor([1., 1., 1., 1.]), tensor(0.))

We also have the flexibility to [apply different initializers to certain blocks]. For example, below we initialize the first layer with the Xavier initializer and initialize the third layer to a constant value of 42.

def init_xavier(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
def init_42(m):
    if type(m) == nn.Linear:
        nn.init.constant_(m.weight, 42)

net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data)
print(net[0].weight.data[0])
print(net[2].weight.data)

Output:

tensor([[ 0.2984, 0.6197, 0.2913, 0.3016],
[ 0.3407, -0.5672, 0.2265, -0.2300],
[ 0.3116, 0.3533, 0.3410, 0.1382],
[ 0.6002, -0.1407, 0.1687, 0.0764],
[-0.2376, 0.6743, 0.4148, -0.5530],
[ 0.0631, 0.0897, 0.3714, 0.4179],
[ 0.1419, -0.6549, -0.4194, -0.5257],
[-0.6068, 0.2096, -0.3073, 0.0447]])
tensor([0.2984, 0.6197, 0.2913, 0.3016])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])

[Custom Initialization]

Sometimes the deep learning framework does not provide the initializer we need. In the example below, we define an initializer for any weight parameter w using the following distribution:

$$
w \sim \begin{cases}
    U(5, 10)   & \text{with probability } \frac{1}{4} \\
    0          & \text{with probability } \frac{1}{2} \\
    U(-10, -5) & \text{with probability } \frac{1}{4}
\end{cases}
$$

Again, we implement a my_init function to apply to net.

def my_init(m):
    if type(m) == nn.Linear:
        # m.named_parameters() returns an iterator of (name, parameter) tuples; [0] selects the weight entry to print
        print("Init", *[(name, param.shape)
                        for name, param in m.named_parameters()][0])
        nn.init.uniform_(m.weight, -10, 10)
        # Keep only the weights with absolute value >= 5; the boolean mask zeroes out the rest (a form of sparsification)
        m.weight.data *= m.weight.data.abs() >= 5

net.apply(my_init)
net[0].weight[:2]

Output:

Init weight torch.Size([8, 4])
Init weight torch.Size([1, 8])
tensor([[ 8.5249, -7.3139, -5.3505, 7.4283],
[-0.0000, 0.0000, 0.0000, -0.0000]], grad_fn=<SliceBackward0>)

Note that we always have the option of setting parameters directly.

net[0].weight.data[:] += 1
net[0].weight.data[0, 0] = 42
net[0].weight.data[0]

Output:

tensor([42.0000, -6.3139, -4.3505, 8.4283])
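
An equivalent, slightly more idiomatic alternative (a sketch on a fresh layer, so the running example above is left untouched) is to modify the parameter in place inside torch.no_grad():

# In-place updates on the Parameter itself, without going through .data
layer = nn.Linear(4, 8)
with torch.no_grad():
    layer.weight += 1
    layer.weight[0, 0] = 42
print(layer.weight[0])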

[Tied Parameters]

Often, we want to share parameters across multiple layers. We can define a dense layer and then use its parameters to set those of another layer.

# We need to give the shared layer a name so that we can refer to its parameters
shared = nn.Linear(8, 8)
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))
net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the same value
print(net[2].weight.data[0] == net[4].weight.data[0])

Output:

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])

This example shows that the parameters of the third and fifth neural network layers are tied. They are not merely equal, they are represented by the same exact tensor; if we change one of the parameters, the other one changes as well. You might wonder: when parameters are tied, what happens to the gradients? Since the model parameters contain gradients, the gradients of the second hidden layer (i.e., the third network layer) and the third hidden layer (i.e., the fifth network layer) are added together during backpropagation.

In other words, the shared layers share everything, including the gradient tensor: when the shared layer appears twice, both occurrences see the same gradient, which equals the two contributions accumulated together. The experiment below compares a network that uses the shared layer once with one that uses it twice.

import torch.optim as optim

loss = nn.MSELoss()
optimizer = optim.Adam(net.parameters())
target = torch.randn(2, 1)

net1 = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

l1 = loss(net1(X), target)
optimizer.zero_grad()
l1.backward()

print(net[2].weight.grad)

net2 = nn.Sequential(nn.Linear(4, 8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(8, 1))

l2 = loss(net2(X), target)
optimizer.zero_grad()
l2.backward()

print(net[2].weight.grad)
print(net[4].weight.grad)

Output:

tensor([[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0096, 0.0598, 0.0000, 0.0000, 0.0000, 0.0764, 0.0067],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, -0.0012, -0.0086, 0.0000, 0.0000, 0.0000, -0.0064, -0.0067],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0018, 0.0109, 0.0000, 0.0000, 0.0000, 0.0140, 0.0012]])
tensor([[ 1.3191e+04, 9.2218e+03, 4.8035e+03, 9.7819e+03, 6.4717e+03,
1.2992e+04, 0.0000e+00, 7.0550e+01],
[-2.5508e+03, 0.0000e+00, -4.5458e+01, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, -2.7084e+01],
[-2.4279e+03, -1.6238e+01, -5.1312e+01, -1.7224e+01, -1.1396e+01,
-2.2877e+01, 0.0000e+00, -2.5657e+01],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[-5.2026e+03, 0.0000e+00, -9.2717e+01, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, -5.5242e+01],
[-7.7069e+03, -3.9781e+01, -1.5705e+02, -4.2197e+01, -2.7917e+01,
-5.6046e+01, 0.0000e+00, -8.1532e+01]])
tensor([[ 1.3191e+04, 9.2218e+03, 4.8035e+03, 9.7819e+03, 6.4717e+03,
1.2992e+04, 0.0000e+00, 7.0550e+01],
[-2.5508e+03, 0.0000e+00, -4.5458e+01, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, -2.7084e+01],
[-2.4279e+03, -1.6238e+01, -5.1312e+01, -1.7224e+01, -1.1396e+01,
-2.2877e+01, 0.0000e+00, -2.5657e+01],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, 0.0000e+00],
[-5.2026e+03, 0.0000e+00, -9.2717e+01, 0.0000e+00, 0.0000e+00,
0.0000e+00, 0.0000e+00, -5.5242e+01],
[-7.7069e+03, -3.9781e+01, -1.5705e+02, -4.2197e+01, -2.7917e+01,
-5.6046e+01, 0.0000e+00, -8.1532e+01]])
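
The same accumulation shows up in a minimal scalar example (a sketch added for these notes): a parameter that appears twice in the forward computation receives the sum of both gradient contributions.

# A parameter used twice receives the sum of both contributions: dy/dw = x + x
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)
y = w * x + w * x
y.backward()
print(w.grad)   # tensor(4.)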

Summary

  • We have several ways of accessing, initializing, and tying model parameters.
  • We can use custom initialization.

Exercises

  1. Using the FancyMLP model defined in Section 5.1, access the parameters of the various layers.

Solution:
The FancyMLP model follows https://www.bookstack.cn/read/Dive-into-DL-PyTorch/a3ef4377d3242797.md
The code is as follows:

class FancyMLP(nn.Module):
    def __init__(self, **kwargs):
        super(FancyMLP, self).__init__(**kwargs)
        self.rand_weight = torch.rand((20, 20), requires_grad=False) # untrainable parameter (a constant)
        self.linear = nn.Linear(20, 20)
    def forward(self, x):
        x = self.linear(x)
        # Use the constant parameter created above, together with the relu and mm functions from nn.functional
        x = nn.functional.relu(torch.mm(x, self.rand_weight.data) + 1)
        # Reuse the fully connected layer; this is equivalent to two fully connected layers sharing parameters
        x = self.linear(x)
        # Control flow: call item() to get a Python scalar for the comparison
        while x.norm().item() > 1:
            x /= 2
        if x.norm().item() < 0.8:
            x *= 10
        return x.sum()

net = FancyMLP()    
print(*[(name, param.shape) for name, param in net.named_parameters()])

Output:

('linear.weight', torch.Size([20, 20])) ('linear.bias', torch.Size([20]))

  2. Look at the initialization module documentation to explore the different initialization methods.

Solution:
The code is as follows:

dir(nn.init)

Output:

['Tensor',
'__builtins__',
'__cached__',
'__doc__',
'__file__',
'__loader__',
'__name__',
'__package__',
'__spec__',
'_calculate_correct_fan',
'_calculate_fan_in_and_fan_out',
'_make_deprecate',
'_no_grad_fill_',
'_no_grad_normal_',
'_no_grad_trunc_normal_',
'_no_grad_uniform_',
'_no_grad_zero_',
'calculate_gain',
'constant',
'constant_',
'dirac',
'dirac_',
'eye',
'eye_',
'kaiming_normal',
'kaiming_normal_',
'kaiming_uniform',
'kaiming_uniform_',
'math',
'normal',
'normal_',
'ones_',
'orthogonal',
'orthogonal_',
'sparse',
'sparse_',
'torch',
'trunc_normal_',
'uniform',
'uniform_',
'warnings',
'xavier_normal',
'xavier_normal_',
'xavier_uniform',
'xavier_uniform_',
'zeros_']

# Fills a 3/4/5-dimensional input Tensor with the Dirac delta function
# Lets a convolutional layer start out as an identity mapping, passing the input straight to the output without transforming it
help(nn.init.dirac_)

Output:

Help on function dirac_ in module torch.nn.init:

dirac_(tensor, groups=1)
Fills the {3, 4, 5}-dimensional input `Tensor` with the Dirac
delta function. Preserves the identity of the inputs in `Convolutional`
layers, where as many input channels are preserved as possible. In case
of groups>1, each group of channels preserves identity

Args:
tensor: a {3, 4, 5}-dimensional `torch.Tensor`
groups (int, optional): number of groups in the conv layer (default: 1)
Examples:
>>> w = torch.empty(3, 16, 5, 5)
>>> nn.init.dirac_(w)
>>> w = torch.empty(3, 24, 5, 5)
>>> nn.init.dirac_(w, 3)

# Fills a 2-dimensional input tensor with the identity matrix
# Like dirac_, this lets the layer start out as an identity mapping of its input
help(nn.init.eye_)

Output:

Help on function eye_ in module torch.nn.init:

eye_(tensor)
Fills the 2-dimensional input `Tensor` with the identity
matrix. Preserves the identity of the inputs in `Linear` layers, where as
many inputs are preserved as possible.

Args:
tensor: a 2-dimensional `torch.Tensor`

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.eye_(w)

# Commonly known as He initialization; fills the input tensor from a normal/uniform distribution
# Particularly well suited to deep networks that use ReLU or Leaky ReLU activations
help(nn.init.kaiming_normal_)
help(nn.init.kaiming_uniform_)

Output:

Help on function kaiming_normal_ in module torch.nn.init:

kaiming_normal_(tensor: torch.Tensor, a: float = 0, mode: str = 'fan_in', nonlinearity: str = 'leaky_relu')
Fills the input `Tensor` with values according to the method
described in `Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification` - He, K. et al. (2015), using a
normal distribution. The resulting tensor will have values sampled from
:math:`\mathcal{N}(0, \text{std}^2)` where

.. math::
\text{std} = \frac{\text{gain}}{\sqrt{\text{fan\_mode}}}

Also known as He initialization.

Args:
tensor: an n-dimensional `torch.Tensor`
a: the negative slope of the rectifier used after this layer (only
used with ``'leaky_relu'``)
mode: either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'``
preserves the magnitude of the variance of the weights in the
forward pass. Choosing ``'fan_out'`` preserves the magnitudes in the
backwards pass.
nonlinearity: the non-linear function (`nn.functional` name),
recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu')

Help on function kaiming_uniform_ in module torch.nn.init:

kaiming_uniform_(tensor: torch.Tensor, a: float = 0, mode: str = 'fan_in', nonlinearity: str = 'leaky_relu')
Fills the input `Tensor` with values according to the method
described in `Delving deep into rectifiers: Surpassing human-level
performance on ImageNet classification` - He, K. et al. (2015), using a
uniform distribution. The resulting tensor will have values sampled from
:math:`\mathcal{U}(-\text{bound}, \text{bound})` where

.. math::
\text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}}

Also known as He initialization.

Args:
tensor: an n-dimensional `torch.Tensor`
a: the negative slope of the rectifier used after this layer (only
used with ``'leaky_relu'``)
mode: either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'``
preserves the magnitude of the variance of the weights in the
forward pass. Choosing ``'fan_out'`` preserves the magnitudes in the
backwards pass.
nonlinearity: the non-linear function (`nn.functional` name),
recommended to use only with ``'relu'`` or ``'leaky_relu'`` (default).

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')

# Fills an input tensor that has at least two dimensions with a (semi) orthogonal matrix
# If the tensor has more than two dimensions, the trailing dimensions are flattened
# Orthogonal initialization is often used for deep linear networks; it keeps the rows (or columns) of the weight matrix orthogonal, which helps preserve gradient information in deep networks
# By reducing interference between different directions, it can improve learning efficiency
# It is usually less suitable for nonlinear networks, e.g. those using ReLU activations, because orthogonality is not preserved under nonlinear transformations
help(nn.init.orthogonal_)

Output:

Help on function orthogonal_ in module torch.nn.init:

orthogonal_(tensor, gain=1)
Fills the input `Tensor` with a (semi) orthogonal matrix, as
described in `Exact solutions to the nonlinear dynamics of learning in deep
linear neural networks` - Saxe, A. et al. (2013). The input tensor must have
at least 2 dimensions, and for tensors with more than 2 dimensions the
trailing dimensions are flattened.

Args:
tensor: an n-dimensional `torch.Tensor`, where :math:`n \geq 2`
gain: optional scaling factor

Examples:
>>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_LAPACK)
>>> w = torch.empty(3, 5)
>>> nn.init.orthogonal_(w)

# Fills a 2D input tensor as a sparse matrix whose non-zero elements are drawn from a normal distribution with mean 0 and standard deviation 0.01
# Reducing the number of non-zero elements lowers the effective number of parameters, cutting compute and storage costs
# Sparse representations can sometimes improve generalization, since they push the model toward more distributed feature representations
help(nn.init.sparse_)

Output:

Help on function sparse_ in module torch.nn.init:

sparse_(tensor, sparsity, std=0.01)
Fills the 2D input `Tensor` as a sparse matrix, where the
non-zero elements will be drawn from the normal distribution
:math:`\mathcal{N}(0, 0.01)`, as described in `Deep learning via
Hessian-free optimization` - Martens, J. (2010).

Args:
tensor: an n-dimensional `torch.Tensor`
sparsity: The fraction of elements in each column to be set to zero
std: the standard deviation of the normal distribution used to generate
the non-zero values

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.sparse_(w, sparsity=0.1)

# Initializes the input tensor with values drawn from a truncated normal distribution
# By default the distribution has mean 0 and standard deviation 1, and any draw outside the truncation range [-2, 2] is redrawn until it falls within the bounds
help(nn.init.trunc_normal_)

Output:

Help on function trunc_normal_ in module torch.nn.init:

trunc_normal_(tensor: torch.Tensor, mean: float = 0.0, std: float = 1.0, a: float = -2.0, b: float = 2.0) -> torch.Tensor
Fills the input Tensor with values drawn from a truncated
normal distribution. The values are effectively drawn from the
normal distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`
with values outside :math:`[a, b]` redrawn until they are within
the bounds. The method used for generating the random values works
best when :math:`a \leq \text{mean} \leq b`.

Args:
tensor: an n-dimensional `torch.Tensor`
mean: the mean of the normal distribution
std: the standard deviation of the normal distribution
a: the minimum cutoff value
b: the maximum cutoff value

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.trunc_normal_(w)

# Known as Glorot or Xavier initialization
# Fills the input tensor with values drawn from a normal/uniform distribution
# Helpful for networks with activation functions: it keeps the variance of inputs and outputs consistent, aiding stability early in training
# Properly scaled initial weights reduce early-training instability and can speed up convergence
help(nn.init.xavier_normal_)
help(nn.init.xavier_uniform_)

Output:

Help on function xavier_normal_ in module torch.nn.init:

xavier_normal_(tensor: torch.Tensor, gain: float = 1.0) -> torch.Tensor
Fills the input `Tensor` with values according to the method
described in `Understanding the difficulty of training deep feedforward
neural networks` - Glorot, X. & Bengio, Y. (2010), using a normal
distribution. The resulting tensor will have values sampled from
:math:`\mathcal{N}(0, \text{std}^2)` where

.. math::
\text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan\_in} + \text{fan\_out}}}

Also known as Glorot initialization.

Args:
tensor: an n-dimensional `torch.Tensor`
gain: an optional scaling factor

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.xavier_normal_(w)

Help on function xavier_uniform_ in module torch.nn.init:

xavier_uniform_(tensor: torch.Tensor, gain: float = 1.0) -> torch.Tensor
Fills the input `Tensor` with values according to the method
described in `Understanding the difficulty of training deep feedforward
neural networks` - Glorot, X. & Bengio, Y. (2010), using a uniform
distribution. The resulting tensor will have values sampled from
:math:`\mathcal{U}(-a, a)` where

.. math::
a = \text{gain} \times \sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}

Also known as Glorot initialization.

Args:
tensor: an n-dimensional `torch.Tensor`
gain: an optional scaling factor

Examples:
>>> w = torch.empty(3, 5)
>>> nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))

  3. Construct an MLP containing a shared-parameter layer and train it. During training, observe the parameters and gradients of each layer.

Solution:
The parameters and gradients of the shared layers stay identical throughout training.
The code is as follows:

shared = nn.Linear(4, 4)
net = nn.Sequential(nn.Linear(3, 4), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.Linear(4, 1))

X = torch.randn(2, 3)
y = torch.randn(2, 1)

loss = nn.MSELoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.1)

num_epochs = 10
for epoch in range(num_epochs):
    trainer.zero_grad()
    l = loss(net(X), y)
    l.backward()
    trainer.step() 

    print(f"Epoch {epoch+1}/{num_epochs}, \nLoss: {l.item()}")
    for i in [0, 2, 4, 6]:
        print(f"net {i}: \nWeight:{net[i].weight.data}, \nGrad:{net[i].weight.grad}, \nBias:{net[i].bias.data}")

Output:

Epoch 1/10,
Loss: 1.3032385110855103
net 0:
Weight:tensor([[ 0.3227, 0.1437, -0.2189],
[-0.2913, 0.2566, -0.0781],
[ 0.4490, 0.4446, 0.3157],
[ 0.2661, 0.0345, -0.0449]]),
Grad:tensor([[-0.0381, -0.0518, 0.1226],
[-0.0556, -0.0757, 0.1792],
[ 0.0000, 0.0000, 0.0000],
[ 0.0274, 0.0373, -0.0882]]),
Bias:tensor([ 0.1514, 0.4024, -0.1775, 0.4685])
net 2:
Weight:tensor([[ 0.4128, 0.4375, 0.2494, 0.0892],
[ 0.0211, -0.0903, -0.4176, 0.2948],
[-0.3988, 0.3721, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2749, -0.1355, -0.0097, -0.1496],
[ 0.8790, 0.3351, 0.0670, 0.1575],
[ 0.7636, 0.2539, 0.0718, 0.0152],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4690, -0.0460, 0.3919, -0.2411])
net 4:
Weight:tensor([[ 0.4128, 0.4375, 0.2494, 0.0892],
[ 0.0211, -0.0903, -0.4176, 0.2948],
[-0.3988, 0.3721, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2749, -0.1355, -0.0097, -0.1496],
[ 0.8790, 0.3351, 0.0670, 0.1575],
[ 0.7636, 0.2539, 0.0718, 0.0152],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4690, -0.0460, 0.3919, -0.2411])
net 6:
Weight:tensor([[-0.2198, 0.3985, 0.3971, 0.1381]]),
Grad:tensor([[1.5999, 0.1422, 0.4558, 0.0000]]),
Bias:tensor([0.2143])
Epoch 2/10,
Loss: 0.7786167860031128
net 0:
Weight:tensor([[ 0.3209, 0.1482, -0.2198],
[-0.2943, 0.2604, -0.0759],
[ 0.4490, 0.4446, 0.3157],
[ 0.2686, 0.0381, -0.0531]]),
Grad:tensor([[ 0.0178, -0.0455, 0.0088],
[ 0.0299, -0.0380, -0.0218],
[ 0.0000, 0.0000, 0.0000],
[-0.0248, -0.0361, 0.0822]]),
Bias:tensor([ 0.1503, 0.3998, -0.1775, 0.4722])
net 2:
Weight:tensor([[ 0.4307, 0.4420, 0.2494, 0.0868],
[ 0.0337, -0.0832, -0.4176, 0.3053],
[-0.3782, 0.3758, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1786, -0.0447, 0.0000, 0.0242],
[-0.1260, -0.0708, 0.0000, -0.1056],
[-0.2055, -0.0376, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4785, -0.0324, 0.4134, -0.2411])
net 4:
Weight:tensor([[ 0.4307, 0.4420, 0.2494, 0.0868],
[ 0.0337, -0.0832, -0.4176, 0.3053],
[-0.3782, 0.3758, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1786, -0.0447, 0.0000, 0.0242],
[-0.1260, -0.0708, 0.0000, -0.1056],
[-0.2055, -0.0376, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4785, -0.0324, 0.4134, -0.2411])
net 6:
Weight:tensor([[-0.2846, 0.3985, 0.4012, 0.1381]]),
Grad:tensor([[ 0.6481, 0.0000, -0.0411, 0.0000]]),
Bias:tensor([0.1560])
Epoch 3/10,
Loss: 0.705264687538147
net 0:
Weight:tensor([[ 0.3222, 0.1593, -0.2327],
[-0.2930, 0.2721, -0.0895],
[ 0.4490, 0.4446, 0.3157],
[ 0.2687, 0.0396, -0.0549]]),
Grad:tensor([[-0.0124, -0.1107, 0.1289],
[-0.0131, -0.1164, 0.1355],
[ 0.0000, 0.0000, 0.0000],
[-0.0017, -0.0153, 0.0178]]),
Bias:tensor([ 0.1542, 0.4039, -0.1775, 0.4728])
net 2:
Weight:tensor([[ 0.4598, 0.4554, 0.2494, 0.0945],
[ 0.0325, -0.0843, -0.4176, 0.3046],
[-0.4013, 0.3736, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2916, -0.1340, 0.0000, -0.0770],
[ 0.0116, 0.0108, 0.0000, 0.0070],
[ 0.2310, 0.0219, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4972, -0.0332, 0.3999, -0.2411])
net 4:
Weight:tensor([[ 0.4598, 0.4554, 0.2494, 0.0945],
[ 0.0325, -0.0843, -0.4176, 0.3046],
[-0.4013, 0.3736, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2916, -0.1340, 0.0000, -0.0770],
[ 0.0116, 0.0108, 0.0000, 0.0070],
[ 0.2310, 0.0219, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4972, -0.0332, 0.3999, -0.2411])
net 6:
Weight:tensor([[-0.3278, 0.3985, 0.4071, 0.1381]]),
Grad:tensor([[ 0.4322, 0.0000, -0.0590, 0.0000]]),
Bias:tensor([0.1226])
Epoch 4/10,
Loss: 0.6840156316757202
net 0:
Weight:tensor([[ 0.3167, 0.1663, -0.2287],
[-0.2995, 0.2777, -0.0822],
[ 0.4490, 0.4446, 0.3157],
[ 0.2703, 0.0444, -0.0625]]),
Grad:tensor([[ 0.0551, -0.0701, -0.0400],
[ 0.0652, -0.0563, -0.0724],
[ 0.0000, 0.0000, 0.0000],
[-0.0157, -0.0479, 0.0758]]),
Bias:tensor([ 0.1493, 0.3975, -0.1775, 0.4757])
net 2:
Weight:tensor([[ 0.4646, 0.4574, 0.2494, 0.0856],
[ 0.0458, -0.0761, -0.4176, 0.3153],
[-0.3678, 0.3804, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.0473, -0.0204, 0.0000, 0.0889],
[-0.1330, -0.0823, 0.0000, -0.1066],
[-0.3348, -0.0675, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4883, -0.0201, 0.4319, -0.2411])
net 4:
Weight:tensor([[ 0.4646, 0.4574, 0.2494, 0.0856],
[ 0.0458, -0.0761, -0.4176, 0.3153],
[-0.3678, 0.3804, 0.4372, -0.3398],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.0473, -0.0204, 0.0000, 0.0889],
[-0.1330, -0.0823, 0.0000, -0.1066],
[-0.3348, -0.0675, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4883, -0.0201, 0.4319, -0.2411])
net 6:
Weight:tensor([[-0.3470, 0.3985, 0.4117, 0.1381]]),
Grad:tensor([[ 0.1922, 0.0000, -0.0460, 0.0000]]),
Bias:tensor([0.1144])
Epoch 5/10,
Loss: 0.6664673089981079
net 0:
Weight:tensor([[ 0.3139, 0.1760, -0.2325],
[-0.2990, 0.2909, -0.0957],
[ 0.4490, 0.4446, 0.3157],
[ 0.2685, 0.0456, -0.0602]]),
Grad:tensor([[ 0.0278, -0.0965, 0.0377],
[-0.0049, -0.1322, 0.1347],
[ 0.0000, 0.0000, 0.0000],
[ 0.0177, -0.0114, -0.0234]]),
Bias:tensor([ 0.1482, 0.4010, -0.1775, 0.4739])
net 2:
Weight:tensor([[ 0.4841, 0.4681, 0.2497, 0.0883],
[ 0.0334, -0.0755, -0.4180, 0.3156],
[-0.3753, 0.3836, 0.4368, -0.3345],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1952, -0.1068, -0.0034, -0.0277],
[ 0.1242, -0.0057, 0.0039, -0.0031],
[ 0.0745, -0.0325, 0.0040, -0.0522],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4938, -0.0232, 0.4347, -0.2411])
net 4:
Weight:tensor([[ 0.4841, 0.4681, 0.2497, 0.0883],
[ 0.0334, -0.0755, -0.4180, 0.3156],
[-0.3753, 0.3836, 0.4368, -0.3345],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1952, -0.1068, -0.0034, -0.0277],
[ 0.1242, -0.0057, 0.0039, -0.0031],
[ 0.0745, -0.0325, 0.0040, -0.0522],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4938, -0.0232, 0.4347, -0.2411])
net 6:
Weight:tensor([[-0.3676, 0.3977, 0.4193, 0.1381]]),
Grad:tensor([[ 0.2056, 0.0084, -0.0761, 0.0000]]),
Bias:tensor([0.1059])
Epoch 6/10,
Loss: 0.6515052318572998
net 0:
Weight:tensor([[ 0.3127, 0.1911, -0.2445],
[-0.3038, 0.3008, -0.0958],
[ 0.4490, 0.4446, 0.3157],
[ 0.2696, 0.0503, -0.0666]]),
Grad:tensor([[ 0.0119, -0.1514, 0.1206],
[ 0.0480, -0.0989, 0.0009],
[ 0.0000, 0.0000, 0.0000],
[-0.0103, -0.0470, 0.0644]]),
Bias:tensor([ 0.1502, 0.3976, -0.1775, 0.4761])
net 2:
Weight:tensor([[ 0.4968, 0.4776, 0.2505, 0.0879],
[ 0.0336, -0.0752, -0.4180, 0.3156],
[-0.3909, 0.3798, 0.4359, -0.3405],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-1.2753e-01, -9.4444e-02, -7.8834e-03, 3.9911e-03],
[-1.8256e-03, -3.4052e-03, 0.0000e+00, 1.3338e-04],
[ 1.5611e-01, 3.8467e-02, 8.9926e-03, 5.9336e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]]),
Bias:tensor([ 0.4907, -0.0233, 0.4293, -0.2411])
net 4:
Weight:tensor([[ 0.4968, 0.4776, 0.2505, 0.0879],
[ 0.0336, -0.0752, -0.4180, 0.3156],
[-0.3909, 0.3798, 0.4359, -0.3405],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-1.2753e-01, -9.4444e-02, -7.8834e-03, 3.9911e-03],
[-1.8256e-03, -3.4052e-03, 0.0000e+00, 1.3338e-04],
[ 1.5611e-01, 3.8467e-02, 8.9926e-03, 5.9336e-02],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]]),
Bias:tensor([ 0.4907, -0.0233, 0.4293, -0.2411])
net 6:
Weight:tensor([[-0.3746, 0.3977, 0.4285, 0.1381]]),
Grad:tensor([[ 0.0702, 0.0000, -0.0919, 0.0000]]),
Bias:tensor([0.1103])
Epoch 7/10,
Loss: 0.6527106761932373
net 0:
Weight:tensor([[ 0.3044, 0.1992, -0.2362],
[-0.3128, 0.3070, -0.0843],
[ 0.4490, 0.4446, 0.3157],
[ 0.2711, 0.0560, -0.0751]]),
Grad:tensor([[ 0.0826, -0.0803, -0.0834],
[ 0.0899, -0.0622, -0.1147],
[ 0.0000, 0.0000, 0.0000],
[-0.0155, -0.0577, 0.0846]]),
Bias:tensor([ 0.1423, 0.3884, -0.1775, 0.4792])
net 2:
Weight:tensor([[ 0.4932, 0.4778, 0.2505, 0.0747],
[ 0.0067, -0.0720, -0.4180, 0.3280],
[-0.3519, 0.3886, 0.4359, -0.3405],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[ 0.0361, -0.0022, 0.0000, 0.1321],
[ 0.2683, -0.0313, 0.0000, -0.1239],
[-0.3896, -0.0877, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4705, -0.0390, 0.4658, -0.2411])
net 4:
Weight:tensor([[ 0.4932, 0.4778, 0.2505, 0.0747],
[ 0.0067, -0.0720, -0.4180, 0.3280],
[-0.3519, 0.3886, 0.4359, -0.3405],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[ 0.0361, -0.0022, 0.0000, 0.1321],
[ 0.2683, -0.0313, 0.0000, -0.1239],
[-0.3896, -0.0877, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4705, -0.0390, 0.4658, -0.2411])
net 6:
Weight:tensor([[-0.3769, 0.3972, 0.4372, 0.1381]]),
Grad:tensor([[ 0.0234, 0.0050, -0.0876, 0.0000]]),
Bias:tensor([0.1191])
Epoch 8/10,
Loss: 0.6448743939399719
net 0:
Weight:tensor([[ 0.3045, 0.2153, -0.2516],
[-0.3167, 0.3175, -0.0867],
[ 0.4490, 0.4446, 0.3157],
[ 0.2727, 0.0607, -0.0825]]),
Grad:tensor([[-0.0004, -0.1615, 0.1538],
[ 0.0394, -0.1052, 0.0236],
[ 0.0000, 0.0000, 0.0000],
[-0.0154, -0.0471, 0.0743]]),
Bias:tensor([ 0.1459, 0.3861, -0.1775, 0.4821])
net 2:
Weight:tensor([[ 0.5147, 0.4898, 0.2520, 0.0775],
[ 0.0070, -0.0717, -0.4180, 0.3280],
[-0.3746, 0.3836, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2152, -0.1202, -0.0143, -0.0276],
[-0.0027, -0.0036, 0.0000, -0.0008],
[ 0.2265, 0.0500, 0.0166, 0.0697],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4740, -0.0390, 0.4556, -0.2411])
net 4:
Weight:tensor([[ 0.5147, 0.4898, 0.2520, 0.0775],
[ 0.0070, -0.0717, -0.4180, 0.3280],
[-0.3746, 0.3836, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.2152, -0.1202, -0.0143, -0.0276],
[-0.0027, -0.0036, 0.0000, -0.0008],
[ 0.2265, 0.0500, 0.0166, 0.0697],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4740, -0.0390, 0.4556, -0.2411])
net 6:
Weight:tensor([[-0.3966, 0.3972, 0.4448, 0.1381]]),
Grad:tensor([[ 0.1962, 0.0000, -0.0759, 0.0000]]),
Bias:tensor([0.1141])
Epoch 9/10,
Loss: 0.6262081861495972
net 0:
Weight:tensor([[ 0.3001, 0.2296, -0.2565],
[-0.3209, 0.3310, -0.0914],
[ 0.4490, 0.4446, 0.3157],
[ 0.2718, 0.0635, -0.0834]]),
Grad:tensor([[ 0.0442, -0.1426, 0.0497],
[ 0.0417, -0.1343, 0.0468],
[ 0.0000, 0.0000, 0.0000],
[ 0.0084, -0.0272, 0.0095]]),
Bias:tensor([ 0.1439, 0.3842, -0.1775, 0.4817])
net 2:
Weight:tensor([[ 0.5263, 0.4994, 0.2520, 0.0757],
[ 0.0073, -0.0710, -0.4180, 0.3279],
[-0.3818, 0.3850, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1160, -0.0959, 0.0000, 0.0175],
[-0.0033, -0.0069, 0.0000, 0.0011],
[ 0.0718, -0.0139, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4659, -0.0392, 0.4603, -0.2411])
net 4:
Weight:tensor([[ 0.5263, 0.4994, 0.2520, 0.0757],
[ 0.0073, -0.0710, -0.4180, 0.3279],
[-0.3818, 0.3850, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[-0.1160, -0.0959, 0.0000, 0.0175],
[-0.0033, -0.0069, 0.0000, 0.0011],
[ 0.0718, -0.0139, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4659, -0.0392, 0.4603, -0.2411])
net 6:
Weight:tensor([[-0.3983, 0.3972, 0.4569, 0.1381]]),
Grad:tensor([[ 0.0177, 0.0000, -0.1207, 0.0000]]),
Bias:tensor([0.1247])
Epoch 10/10,
Loss: 0.616180956363678
net 0:
Weight:tensor([[ 0.2909, 0.2388, -0.2475],
[-0.3303, 0.3387, -0.0806],
[ 0.4490, 0.4446, 0.3157],
[ 0.2732, 0.0690, -0.0914]]),
Grad:tensor([[ 0.0921, -0.0922, -0.0904],
[ 0.0938, -0.0774, -0.1078],
[ 0.0000, 0.0000, 0.0000],
[-0.0139, -0.0554, 0.0792]]),
Bias:tensor([ 0.1352, 0.3749, -0.1775, 0.4846])
net 2:
Weight:tensor([[ 0.5252, 0.5015, 0.2520, 0.0625],
[ 0.0218, -0.0620, -0.4180, 0.3394],
[-0.3419, 0.3937, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[ 0.0117, -0.0209, 0.0000, 0.1325],
[-0.1450, -0.0902, 0.0000, -0.1149],
[-0.3988, -0.0872, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4449, -0.0266, 0.4985, -0.2411])
net 4:
Weight:tensor([[ 0.5252, 0.5015, 0.2520, 0.0625],
[ 0.0218, -0.0620, -0.4180, 0.3394],
[-0.3419, 0.3937, 0.4343, -0.3474],
[-0.0770, -0.1823, -0.4463, -0.3748]]),
Grad:tensor([[ 0.0117, -0.0209, 0.0000, 0.1325],
[-0.1450, -0.0902, 0.0000, -0.1149],
[-0.3988, -0.0872, 0.0000, 0.0000],
[ 0.0000, 0.0000, 0.0000, 0.0000]]),
Bias:tensor([ 0.4449, -0.0266, 0.4985, -0.2411])
net 6:
Weight:tensor([[-0.4012, 0.3972, 0.4694, 0.1381]]),
Grad:tensor([[ 0.0284, 0.0000, -0.1252, 0.0000]]),
Bias:tensor([0.1353])

  4. Why is sharing parameters a good idea?

Solution:

1) It reduces the total number of model parameters, which shrinks the model, lowers storage and compute requirements, and speeds up training (see the parameter-count sketch below).

2) It can be viewed as a form of regularization: forcing different parts of the network to use the same parameters can improve generalization.
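
As a rough illustration of the first point (a sketch added for these notes, with layer sizes chosen arbitrarily), parameters() counts a shared layer only once, so reusing it shrinks the total parameter count:

# Compare total parameter counts with and without sharing an 8x8 layer
shared = nn.Linear(8, 8)
tied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), shared, nn.ReLU(),
                     shared, nn.ReLU(), nn.Linear(8, 1))
untied = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8), nn.ReLU(),
                       nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
print(sum(p.numel() for p in tied.parameters()))     # 121: the shared layer is counted once
print(sum(p.numel() for p in untied.parameters()))   # 193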
