实验五前馈神经网络（2）自动梯度计算 & 优化问题

torch提供了torch.nn.Module类，来方便快速的实现自己的层和模型。模型和层都可以基于torch.nn.Moduler扩充实现，模型只是一种特殊的层。继承了torch.nn.Module类的算子中，可以在内部直接调用其它继承torch.nn.Module类的算子，torch框架会自动识别算子中内嵌的此类算子，并自动计算它们的梯度，并在优化时更新它们的参数。

4.3.1 利用预定义算子重新实现前馈神经网络

使用torch的预定义算子来重新实现二分类任务。主要使用到torch.nn.Module。

1. 使用pytorch的预定义算子来重新实现二分类任务。

import torch.nn as nn
import torch
from torch.nn.init import normal_,constant_
import torch.nn.functional as F

class Model_MLP_L2_V2(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V2, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        normal_(self.fc1.weight, mean=0., std=1.)
        constant_(self.fc1.bias, val=0.0)
        self.fc2 = nn.Linear(hidden_size, output_size)
        normal_(self.fc2.weight, mean=0., std=1.)
        constant_(self.fc2.bias, val=0.0)
        self.act_fn = torch.sigmoid

    # 前向计算
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2

2. 增加一个3个神经元的隐藏层，再次实现二分类。

class Model_MLP_L2_V2(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V2, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        normal_(self.fc1.weight, mean=0., std=1.)
        constant_(self.fc1.bias, val=0.0)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        normal_(self.fc2.weight, mean=0., std=1.)
        constant_(self.fc2.bias, val=0.0)
        self.act_fn = torch.sigmoid
        self.fc3 = nn.Linear(hidden_size, output_size)
        normal_(self.fc3.weight, mean=0., std=1.)
        constant_(self.fc3.bias, val=0.0)
        self.act_fn = torch.sigmoid

    # 前向计算
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        z3 = self.fc3(a2)
        a3 = self.act_fn(z3)
        return a3

4.3.2 完善Runner类

基于上一节实现的 RunnerV2_1 类，本节的 RunnerV2_2 类在训练过程中使用自动梯度计算；模型保存时，使用state_dict方法获取模型参数；模型加载时，使用set_state_dict方法加载模型参数.

class RunnerV2_2(object):
    def __init__(self, model, optimizer, metric, loss_fn, **kwargs):
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.metric = metric

        # 记录训练过程中的评估指标变化情况
        self.train_scores = []
        self.dev_scores = []

        # 记录训练过程中的评价指标变化情况
        self.train_loss = []
        self.dev_loss = []

    def train(self, train_set, dev_set, **kwargs):
        # 将模型切换为训练模式
        self.model.train()

        # 传入训练轮数，如果没有传入值则默认为0
        num_epochs = kwargs.get("num_epochs", 0)
        # 传入log打印频率，如果没有传入值则默认为100
        log_epochs = kwargs.get("log_epochs", 100)
        # 传入模型保存路径，如果没有传入值则默认为"best_model.pdparams"
        save_path = kwargs.get("save_path", "best_model.pdparams")

        # log打印函数，如果没有传入则默认为"None"
        custom_print_log = kwargs.get("custom_print_log", None)

        # 记录全局最优指标
        best_score = 0
        # 进行num_epochs轮训练
        for epoch in range(num_epochs):
            X, y = train_set
            # 获取模型预测
            logits = self.model(X)
            # 计算交叉熵损失
            trn_loss = self.loss_fn(logits, y)
            self.train_loss.append(trn_loss.item())
            # 计算评估指标
            trn_score = self.metric(logits, y).item()
            self.train_scores.append(trn_score)

            # 自动计算参数梯度
            trn_loss.backward()
            if custom_print_log is not None:
                # 打印每一层的梯度
                custom_print_log(self)

            # 参数更新
            self.optimizer.step()
            # 清空梯度
            self.optimizer.zero_grad()

            dev_score, dev_loss = self.evaluate(dev_set)
            # 如果当前指标为最优指标，保存该模型
            if dev_score > best_score:
                self.save_model(save_path)
                print(f"[Evaluate] best accuracy performence has been updated: {best_score:.5f} --> {dev_score:.5f}")
                best_score = dev_score

            if log_epochs and epoch % log_epochs == 0:
                print(f"[Train] epoch: {epoch}/{num_epochs}, loss: {trn_loss.item()}")

    # 模型评估阶段，使用'paddle.no_grad()'控制不计算和存储梯度
    @torch.no_grad()
    def evaluate(self, data_set):
        # 将模型切换为评估模式
        self.model.eval()

        X, y = data_set
        # 计算模型输出
        logits = self.model(X)
        # 计算损失函数
        loss = self.loss_fn(logits, y).item()
        self.dev_loss.append(loss)
        # 计算评估指标
        score = self.metric(logits, y).item()
        self.dev_scores.append(score)
        return score, loss

    def predict(self, X):
        # 将模型切换为评估模式
        self.model.eval()
        return self.model(X)

    # 使用'model.state_dict()'获取模型参数，并进行保存
    def save_model(self, saved_path):
        torch.save(self.model.state_dict(), saved_path)

    # 使用'model.set_state_dict'加载模型参数
    def load_model(self, model_path):
        state_dict = torch.load(model_path)
        self.model.set_state_dict(state_dict)

4.3.3 模型训练

实例化RunnerV2类，并传入训练配置，代码实现如下：

1. 使用pytorch的预定义算子来重新实现二分类任务。

# 设置模型
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V2(input_size=input_size, hidden_size=hidden_size, output_size=output_size)

# 设置损失函数
loss_fn = F.binary_cross_entropy

# 设置优化器
learning_rate = 0.2
optimizer = torch.optim.SGD(model.parameters(), learning_rate)

# 设置评价指标
metric = accuracy

# 其他参数
epoch_num = 1000
saved_path = 'best_model.pdparams'

# 实例化RunnerV2类，并传入训练配置
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch_num, log_epochs=50, save_path="best_model.pdparams")

结果：

[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.45000
[Train] epoch: 0/1000, loss: 0.8735500574111938
[Evaluate] best accuracy performence has been updated: 0.45000 --> 0.46250
[Evaluate] best accuracy performence has been updated: 0.46250 --> 0.56250
[Evaluate] best accuracy performence has been updated: 0.56250 --> 0.63750
[Evaluate] best accuracy performence has been updated: 0.63750 --> 0.73125
[Evaluate] best accuracy performence has been updated: 0.73125 --> 0.76250
[Evaluate] best accuracy performence has been updated: 0.76250 --> 0.78750
[Train] epoch: 50/1000, loss: 0.4802948832511902
[Evaluate] best accuracy performence has been updated: 0.78750 --> 0.79375
[Evaluate] best accuracy performence has been updated: 0.79375 --> 0.80000
[Evaluate] best accuracy performence has been updated: 0.80000 --> 0.80625
[Evaluate] best accuracy performence has been updated: 0.80625 --> 0.81250
[Evaluate] best accuracy performence has been updated: 0.81250 --> 0.81875
[Train] epoch: 100/1000, loss: 0.41373705863952637
[Evaluate] best accuracy performence has been updated: 0.81875 --> 0.82500
[Evaluate] best accuracy performence has been updated: 0.82500 --> 0.83125
[Train] epoch: 150/1000, loss: 0.378179132938385
[Train] epoch: 200/1000, loss: 0.35704904794692993
[Evaluate] best accuracy performence has been updated: 0.83125 --> 0.83750
[Train] epoch: 250/1000, loss: 0.343575119972229
[Train] epoch: 300/1000, loss: 0.33467379212379456
[Train] epoch: 350/1000, loss: 0.32868289947509766
[Evaluate] best accuracy performence has been updated: 0.83750 --> 0.84375
[Evaluate] best accuracy performence has been updated: 0.84375 --> 0.85000
[Train] epoch: 400/1000, loss: 0.32459431886672974
[Train] epoch: 450/1000, loss: 0.3217635750770569
[Train] epoch: 500/1000, loss: 0.31977084279060364
[Train] epoch: 550/1000, loss: 0.3183404505252838
[Train] epoch: 600/1000, loss: 0.31728997826576233
[Train] epoch: 650/1000, loss: 0.3164980411529541
[Train] epoch: 700/1000, loss: 0.31588321924209595
[Train] epoch: 750/1000, loss: 0.3153904974460602
[Evaluate] best accuracy performence has been updated: 0.85000 --> 0.85625
[Train] epoch: 800/1000, loss: 0.31498241424560547
[Evaluate] best accuracy performence has been updated: 0.85625 --> 0.86250
[Evaluate] best accuracy performence has been updated: 0.86250 --> 0.86875
[Train] epoch: 850/1000, loss: 0.3146332800388336
[Train] epoch: 900/1000, loss: 0.3143254518508911
[Train] epoch: 950/1000, loss: 0.31404656171798706

将训练过程中训练集与验证集的准确率变化情况进行可视化。

# 可视化观察训练集与验证集的指标变化情况
def plot(runner, fig_name):
    plt.figure(figsize=(10, 5))
    epochs = [i for i in range(len(runner.train_scores))]

    plt.subplot(1, 2, 1)
    plt.plot(epochs, runner.train_loss, color='#e4007f', label="Train loss")
    plt.plot(epochs, runner.dev_loss, color='#f19ec2', linestyle='--', label="Dev loss")
    # 绘制坐标轴和图例
    plt.ylabel("loss", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='upper right', fontsize='x-large')

    plt.subplot(1, 2, 2)
    plt.plot(epochs, runner.train_scores, color='#e4007f', label="Train accuracy")
    plt.plot(epochs, runner.dev_scores, color='#f19ec2', linestyle='--', label="Dev accuracy")
    # 绘制坐标轴和图例
    plt.ylabel("score", fontsize='large')
    plt.xlabel("epoch", fontsize='large')
    plt.legend(loc='lower right', fontsize='x-large')

    plt.savefig(fig_name)
    plt.show()


plot(runner, 'fw-acc.pdf')

# 模型评价
torch.load("best_model.pdparams")
score, loss = runner.evaluate([X_test, y_test])
print("[Test] score/loss: {:.4f}/{:.4f}".format(score, loss))

使用测试数据对训练完成后的最优模型进行评价，观察模型在测试集上的准确率以及loss情况。代码如下：

# 模型评价
torch.load("best_model.pdparams")
score, loss = runner.evaluate([X_test, y_test])
print("[Test] score/loss: {:.4f}/{:.4f}".format(score, loss))

从结果来看，模型在测试集上取得了较高的准确率。

2. 增加一个3个神经元的隐藏层，再次实现二分类。

# 设置模型
input_size = 2
hidden_size = 5
add_hidden_size = 3 
output_size = 1
model = Model_MLP_L2_V2(input_size=input_size, hidden_size=hidden_size, output_size=output_size,add_size = add_hidden_size)

# 设置损失函数
loss_fn = F.binary_cross_entropy

# 设置优化器
learning_rate = 0.2
optimizer = torch.optim.SGD(model.parameters(), learning_rate)

# 设置评价指标
metric = accuracy

# 其他参数
epoch_num = 1000
saved_path = 'best_model.pdparams'

# 实例化RunnerV2类，并传入训练配置
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=epoch_num, log_epochs=50, save_path="best_model.pdparams")

结果：

[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.46875
[Train] epoch: 0/1000, loss: 0.792105495929718
[Evaluate] best accuracy performence has been updated: 0.46875 --> 0.47500
[Evaluate] best accuracy performence has been updated: 0.47500 --> 0.48125
[Evaluate] best accuracy performence has been updated: 0.48125 --> 0.49375
[Evaluate] best accuracy performence has been updated: 0.49375 --> 0.50625
[Evaluate] best accuracy performence has been updated: 0.50625 --> 0.51250
[Evaluate] best accuracy performence has been updated: 0.51250 --> 0.51875
[Evaluate] best accuracy performence has been updated: 0.51875 --> 0.52500
[Evaluate] best accuracy performence has been updated: 0.52500 --> 0.53125
[Evaluate] best accuracy performence has been updated: 0.53125 --> 0.53750
[Evaluate] best accuracy performence has been updated: 0.53750 --> 0.54375
[Evaluate] best accuracy performence has been updated: 0.54375 --> 0.55000
[Evaluate] best accuracy performence has been updated: 0.55000 --> 0.56250
[Train] epoch: 50/1000, loss: 0.6632017493247986
[Evaluate] best accuracy performence has been updated: 0.56250 --> 0.56875
[Evaluate] best accuracy performence has been updated: 0.56875 --> 0.57500
[Evaluate] best accuracy performence has been updated: 0.57500 --> 0.59375
[Evaluate] best accuracy performence has been updated: 0.59375 --> 0.60000
[Evaluate] best accuracy performence has been updated: 0.60000 --> 0.61250
[Evaluate] best accuracy performence has been updated: 0.61250 --> 0.61875
[Evaluate] best accuracy performence has been updated: 0.61875 --> 0.62500
[Evaluate] best accuracy performence has been updated: 0.62500 --> 0.63125
[Evaluate] best accuracy performence has been updated: 0.63125 --> 0.63750
[Evaluate] best accuracy performence has been updated: 0.63750 --> 0.64375
[Evaluate] best accuracy performence has been updated: 0.64375 --> 0.65000
[Evaluate] best accuracy performence has been updated: 0.65000 --> 0.65625
[Train] epoch: 100/1000, loss: 0.612221360206604
[Evaluate] best accuracy performence has been updated: 0.65625 --> 0.66250
[Evaluate] best accuracy performence has been updated: 0.66250 --> 0.66875
[Evaluate] best accuracy performence has been updated: 0.66875 --> 0.67500
[Evaluate] best accuracy performence has been updated: 0.67500 --> 0.68125
[Evaluate] best accuracy performence has been updated: 0.68125 --> 0.68750
[Evaluate] best accuracy performence has been updated: 0.68750 --> 0.69375
[Evaluate] best accuracy performence has been updated: 0.69375 --> 0.70000
[Evaluate] best accuracy performence has been updated: 0.70000 --> 0.70625
[Evaluate] best accuracy performence has been updated: 0.70625 --> 0.71250
[Evaluate] best accuracy performence has been updated: 0.71250 --> 0.72500
[Evaluate] best accuracy performence has been updated: 0.72500 --> 0.73125
[Train] epoch: 150/1000, loss: 0.5504528284072876
[Evaluate] best accuracy performence has been updated: 0.73125 --> 0.73750
[Evaluate] best accuracy performence has been updated: 0.73750 --> 0.74375
[Evaluate] best accuracy performence has been updated: 0.74375 --> 0.75000
[Evaluate] best accuracy performence has been updated: 0.75000 --> 0.75625
[Evaluate] best accuracy performence has been updated: 0.75625 --> 0.76250
[Evaluate] best accuracy performence has been updated: 0.76250 --> 0.76875
[Evaluate] best accuracy performence has been updated: 0.76875 --> 0.77500
[Train] epoch: 200/1000, loss: 0.4901648461818695
[Evaluate] best accuracy performence has been updated: 0.77500 --> 0.78125
[Evaluate] best accuracy performence has been updated: 0.78125 --> 0.78750
[Evaluate] best accuracy performence has been updated: 0.78750 --> 0.79375
[Train] epoch: 250/1000, loss: 0.4414004385471344
[Evaluate] best accuracy performence has been updated: 0.79375 --> 0.80000
[Evaluate] best accuracy performence has been updated: 0.80000 --> 0.80625
[Evaluate] best accuracy performence has been updated: 0.80625 --> 0.81250
[Train] epoch: 300/1000, loss: 0.40371209383010864
[Evaluate] best accuracy performence has been updated: 0.81250 --> 0.81875
[Train] epoch: 350/1000, loss: 0.37464267015457153
[Train] epoch: 400/1000, loss: 0.3527087867259979
[Evaluate] best accuracy performence has been updated: 0.81875 --> 0.82500
[Evaluate] best accuracy performence has been updated: 0.82500 --> 0.83125
[Evaluate] best accuracy performence has been updated: 0.83125 --> 0.83750
[Train] epoch: 450/1000, loss: 0.33678582310676575
[Evaluate] best accuracy performence has been updated: 0.83750 --> 0.84375
[Train] epoch: 500/1000, loss: 0.32561513781547546
[Train] epoch: 550/1000, loss: 0.31791070103645325
[Evaluate] best accuracy performence has been updated: 0.84375 --> 0.85000
[Train] epoch: 600/1000, loss: 0.312591016292572
[Train] epoch: 650/1000, loss: 0.30886030197143555
[Train] epoch: 700/1000, loss: 0.3061731457710266
[Train] epoch: 750/1000, loss: 0.3041686415672302
[Train] epoch: 800/1000, loss: 0.30261164903640747
[Train] epoch: 850/1000, loss: 0.30135002732276917
[Train] epoch: 900/1000, loss: 0.3002856969833374
[Train] epoch: 950/1000, loss: 0.29935526847839355

通过结果可以发现，增加一个三个隐藏层的神经元后，误差减少，学习率有所提升。

【思考题】

自定义梯度计算和自动梯度计算：

从计算性能、计算结果等多方面比较，谈谈自己的看法。

自定义梯度计算：构建模型有时需要使用自定义的函数，为了不影响模型的反向传播，需要实现自动梯度计算，即把自定义函数嵌入计算图。

自动梯度计算：PyTorch提供的autograd包能够根据输入和前向传播过程自动构建计算图，并执行反向传播。Tensor是这个pytorch的自动求导部分的核心类，如果将其属性.requires_grad=True，它将开始追踪在该tensor上的所有操作，从而实现利用链式法则进行的梯度传播。完成计算后，可以调用.backward()来完成所有梯度计算。此Tensor的梯度将累积到.grad属性中。

对Model_MLP_L2_V2函数分别添加

def back(self, grads):
grad_input = torch.multiply(self.outputs, (1.0 - self.outputs))
return torch.multiply((grads, grad_input))
def back(self):
loss = -1.0 * (self.labels / self.predicts - (1 - self.labels) / (1 - self.predicts)) / self.num
self.model.zidingback(loss)

自定义梯度：

自动梯度：

对比可得自定义梯度计算准确率低但速度快。

4.4 优化问题

4.4.1 参数初始化

实现一个神经网络前，需要先初始化模型参数。

如果对每一层的权重和偏置都用0初始化，那么通过第一遍前向计算，所有隐藏层神经元的激活值都相同；在反向传播时，所有权重的更新也都相同，这样会导致隐藏层神经元没有差异性，出现对称权重现象。

接下来，将模型参数全都初始化为0，看实验结果。这里重新定义了一个类TwoLayerNet_Zeros，两个线性层的参数全都初始化为0。

class Model_MLP_L2_V4(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Model_MLP_L2_V4, self).__init__()
        # 使用'paddle.nn.Linear'定义线性层。
        # 其中in_features为线性层输入维度；out_features为线性层输出维度
        # weight_attr为权重参数属性
        # bias_attr为偏置参数属性

        
        self.fc1 = nn.Linear(input_size, hidden_size)
        constant_(self.fc1.weight, val=0.0)
        constant_(self.fc1.bias, val=0.0)
        self.fc2 = nn.Linear(hidden_size, output_size)
        constant_(self.fc2.weight,val=0.0)
        constant_(self.fc2.bias, val=0.0)
        self.act_fn = torch.sigmoid
        
        # 使用'paddle.nn.functional.sigmoid'定义 Logistic 激活函数
        self.act_fn = torch.sigmoid
        
    # 前向计算
    def forward(self, inputs):
        z1 = self.fc1(inputs)
        a1 = self.act_fn(z1)
        z2 = self.fc2(a1)
        a2 = self.act_fn(z2)
        return a2
def print_weights(runner):
    print('The weights of the Layers：')
    for item in runner.model.named_parameters():
        print(item)

利用Runner类训练模型：

# 设置模型
input_size = 2
hidden_size = 5
output_size = 1
model = Model_MLP_L2_V4(input_size=input_size, hidden_size=hidden_size, output_size=output_size)

# 设置损失函数
loss_fn = F.binary_cross_entropy

# 设置优化器
learning_rate = 0.2 #5e-2
optimizer = torch.optim.SGD(lr=learning_rate, params=model.parameters())

# 设置评价指标
metric = accuracy

# 其他参数
epoch = 2000
saved_path = 'best_model.pdparams'

# 实例化RunnerV2类，并传入训练配置
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

runner.train([X_train, y_train], [X_dev, y_dev], num_epochs=5, log_epochs=50, save_path="best_model.pdparams",custom_print_log=print_weights)

可视化训练和验证集上的主准确率和loss变化：

plot(runner, "fw-zero.pdf")

从输出结果看，二分类准确率为50%左右，说明模型没有学到任何内容。训练和验证loss几乎没有怎么下降。

为了避免对称权重现象，可以使用高斯分布或均匀分布初始化神经网络的参数。

高斯分布和均匀分布采样的实现和可视化代码如下：

# 使用'paddle.normal'实现高斯分布采样，其中'mean'为高斯分布的均值，'std'为高斯分布的标准差，'shape'为输出形状
gausian_weights = torch.normal(mean=0.0, std=1.0, size=[10000])
# 使用'paddle.uniform'实现在[min,max)范围内的均匀分布采样，其中'shape'为输出形状
uniform_weights = torch.Tensor(10000)
uniform_weights.uniform_(-1,1)
print(uniform_weights)
# 绘制两种参数分布
plt.figure()
plt.subplot(1,2,1)
plt.title('Gausian Distribution')
plt.hist(gausian_weights, bins=200, density=True, color='#f19ec2')
plt.subplot(1,2,2)
plt.title('Uniform Distribution')
plt.hist(uniform_weights, bins=200, density=True, color='#e4007f')
plt.savefig('fw-gausian-uniform.pdf')
plt.show()

4.4.2 梯度消失问题

在神经网络的构建过程中，随着网络层数的增加，理论上网络的拟合能力也应该是越来越好的。但是随着网络变深，参数学习更加困难，容易出现梯度消失问题。

由于Sigmoid型函数的饱和性，饱和区的导数更接近于0，误差经过每一层传递都会不断衰减。当网络层数很深时，梯度就会不停衰减，甚至消失，使得整个网络很难训练，这就是所谓的梯度消失问题。
在深度神经网络中，减轻梯度消失问题的方法有很多种，一种简单有效的方式就是使用导数比较大的激活函数，如：ReLU。

下面通过一个简单的实验观察前馈神经网络的梯度消失现象和改进方法。

4.4.2.1 模型构建

定义一个前馈神经网络，包含4个隐藏层和1个输出层，通过传入的参数指定激活函数。代码实现如下：

# 定义多层前馈神经网络
class Model_MLP_L5(nn.Module):
    def __init__(self, input_size, output_size, act='sigmoid', w_init=torch.normal(mean=torch.tensor(0.0), std=torch.tensor(0.01)), b_init=torch.tensor(1.0)):
        super(Model_MLP_L5, self).__init__()
        self.fc1 = torch.nn.Linear(input_size, 3)
        self.fc2 = torch.nn.Linear(3, 3)
        self.fc3 = torch.nn.Linear(3, 3)
        self.fc4 = torch.nn.Linear(3, 3)
        self.fc5 = torch.nn.Linear(3, output_size)
        # 定义网络使用的激活函数
        if act == 'sigmoid':
            self.act = F.sigmoid
        elif act == 'relu':
            self.act = F.relu
        elif act == 'lrelu':
            self.act = F.leaky_relu
        else:
            raise ValueError("Please enter sigmoid relu or lrelu!")
        # 初始化线性层权重和偏置参数
        self.init_weights(w_init, b_init)

    # 初始化线性层权重和偏置参数
    def init_weights(self, w_init, b_init):
        # 使用'named_sublayers'遍历所有网络层
        for n, m in self.named_parameters():
            # 如果是线性层，则使用指定方式进行参数初始化
            if isinstance(m, nn.Linear):
                w_init(m.weight)
                b_init(m.bias)

    def forward(self, inputs):
        outputs = self.fc1(inputs)
        outputs = self.act(outputs)
        outputs = self.fc2(outputs)
        outputs = self.act(outputs)
        outputs = self.fc3(outputs)
        outputs = self.act(outputs)
        outputs = self.fc4(outputs)
        outputs = self.act(outputs)
        outputs = self.fc5(outputs)
        outputs = F.sigmoid(outputs)
        return outputs

4.4.2.2 使用Sigmoid型函数进行训练

使用Sigmoid型函数作为激活函数，为了便于观察梯度消失现象，只进行一轮网络优化。代码实现如下：
定义梯度打印函数

def print_grads(runner):
    # 打印每一层的权重的模
    print('The gradient of the Layers：')
    for name,item in runner.model.named_parameters():
        if(len(item.size())==2):
             print(item)
             print(name,torch.norm(input=item,p=2))
             # 学习率大小
lr = 0.01
# 定义网络，激活函数使用sigmoid
model =  Model_MLP_L5(input_size=2, output_size=1, act='sigmoid')
# 定义优化器
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# 定义损失函数，使用交叉熵损失函数
loss_fn = F.binary_cross_entropy
# 定义评价指标
metric = accuracy
# 指定梯度打印函数
custom_print_log=print_grads

实例化RunnerV2_2类，并传入训练配置。代码实现如下：

# 实例化Runner类
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

模型训练，打印网络每层梯度值的ℓ2范数。代码实现如下：

# 启动训练
runner.train([X_train, y_train], [X_dev, y_dev], 
            num_epochs=1, log_epochs=None, 
            save_path="best_model.pdparams", 
            custom_print_log=custom_print_log)

观察实验结果可以发现，梯度经过每一个神经层的传递都会不断衰减，最终传递到第一个神经层时，梯度几乎完全消失。

4.4.2.3 使用ReLU函数进行模型训练

lr = 0.01  # 学习率大小

# 定义网络，激活函数使用relu
model =  Model_MLP_L5(input_size=2, output_size=1, act='relu')

# 定义优化器
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())

# 定义损失函数
# 定义损失函数，这里使用交叉熵损失函数
loss_fn = F.binary_cross_entropy

# 定义评估指标
metric = accuracy

# 实例化Runner
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

# 启动训练
runner.train([X_train, y_train], [X_dev, y_dev], 
            num_epochs=1, log_epochs=None, 
            save_path="best_model.pdparams", 
            custom_print_log=custom_print_log)

The gradient of the Layers：
Parameter containing:
tensor([[ 0.4356, -0.1862],
        [-0.1747, -0.5807],
        [ 0.5868,  0.3485]], requires_grad=True)
fc1.weight tensor(1.0286, grad_fn=<NormBackward1>)
Parameter containing:
tensor([[-0.5669,  0.1052, -0.4021],
        [-0.2340,  0.2295,  0.5551],
        [ 0.1985, -0.2634,  0.0469]], requires_grad=True)
fc2.weight tensor(1.0103, grad_fn=<NormBackward1>)
Parameter containing:
tensor([[ 0.1177, -0.4179,  0.3368],
        [-0.3646, -0.3053, -0.0222],
        [-0.5638,  0.2787,  0.3291]], requires_grad=True)
fc3.weight tensor(1.0161, grad_fn=<NormBackward1>)
Parameter containing:
tensor([[ 0.2918, -0.2805, -0.1391],
        [-0.2274,  0.3770, -0.2773],
        [-0.3378, -0.2125,  0.3613]], requires_grad=True)
fc4.weight tensor(0.8624, grad_fn=<NormBackward1>)
Parameter containing:
tensor([[-0.1999,  0.0174, -0.0853]], requires_grad=True)
fc5.weight tensor(0.2180, grad_fn=<NormBackward1>)
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.53125

图4.4 展示了使用不同激活函数时，网络每层梯度值的ℓ2ℓ2范数情况。从结果可以看到，5层的全连接前馈神经网络使用Sigmoid型函数作为激活函数时，梯度经过每一个神经层的传递都会不断衰减，最终传递到第一个神经层时，梯度几乎完全消失。改为ReLU激活函数后，梯度消失现象得到了缓解，每一层的参数都具有梯度值。

4.4.3 死亡ReLU问题

ReLU激活函数可以一定程度上改善梯度消失问题，但是在某些情况下容易出现死亡ReLU问题，使得网络难以训练。

这是由于当x<0x<0时，ReLU函数的输出恒为0。在训练过程中，如果参数在一次不恰当的更新后，某个ReLU神经元在所有训练数据上都不能被激活（即输出为0），那么这个神经元自身参数的梯度永远都会是0，在以后的训练过程中永远都不能被激活。

一种简单有效的优化方式就是将激活函数更换为Leaky ReLU、ELU等ReLU的变种。

4.4.3.1 使用ReLU进行模型训练

使用第4.4.2节中定义的多层全连接前馈网络进行实验，使用ReLU作为激活函数，观察死亡ReLU现象和优化方法。当神经层的偏置被初始化为一个相对于权重较大的负值时，可以想像，输入经过神经层的处理，最终的输出会为负值，从而导致死亡ReLU现象。

# 定义网络，并使用较大的负值来初始化偏置
model =  Model_MLP_L5(input_size=2, output_size=1, act='relu', b_init=torch.tensor(-8.0))

实例化RunnerV2类，启动模型训练，打印网络每层梯度值的ℓ2范数。代码实现如下：

# 实例化Runner类
runner = RunnerV2_2(model, optimizer, metric, loss_fn)

# 启动训练
runner.train([X_train, y_train], [X_dev, y_dev], 
            num_epochs=1, log_epochs=0, 
            save_path="best_model.pdparams", 
            custom_print_log=custom_print_log)

从输出结果可以发现，使用 ReLU 作为激活函数，当满足条件时，会发生死亡ReLU问题，网络训练过程中 ReLU 神经元的梯度始终为0，参数无法更新。

针对死亡ReLU问题，一种简单有效的优化方式就是将激活函数更换为Leaky ReLU、ELU等ReLU 的变种。接下来，观察将激活函数更换为 Leaky ReLU时的梯度情况。

4.4.3.2 使用Leaky ReLU进行模型训练

将激活函数更换为Leaky ReLU进行模型训练，观察梯度情况。代码实现如下：

# 重新定义网络，使用Leaky ReLU激活函数
model =  Model_MLP_L5(input_size=2, output_size=1, act='lrelu', b_init=torch.tensor(-8.0))
# 实例化Runner类
runner = RunnerV2_2(model, optimizer, metric, loss_fn)
# 启动训练
runner.train([X_train, y_train], [X_dev, y_dev], 
            num_epochs=1, log_epochps=None, 
            save_path="best_model.pdparams", 
            custom_print_log=custom_print_log)

从输出结果可以看到，将激活函数更换为Leaky ReLU后，死亡ReLU问题得到了改善，梯度恢复正常，参数也可以正常更新。但是由于 Leaky ReLU 中，x<0 时的斜率默认只有0.01，所以反向传播时，随着网络层数的加深，梯度值越来越小。如果想要改善这一现象，将 Leaky ReLU 中，x<0 时的斜率调大即可

梯度消失和梯度爆炸产生的原因即解决办法

在反向传播过程中需要对激活函数进行求导，如果导数大于1，那么随着网络层数的增加梯度更新将会朝着指数爆炸的方式增加这就是梯度爆炸。

同样如果导数小于1，那么随着网络层数的增加梯度更新信息会朝着指数衰减的方式减少这就是梯度消失。

因此，梯度消失、爆炸，其根本原因在于反向传播训练法则，属于先天不足。
梯度消失解决方法：

为了解决梯度的问题，采取无监督逐层训练方法，其基本思想是每次训练一层隐节点，训练时将上一层隐节点的输出作为输入，而本层隐节点的输出作为下一层隐节点的输入，此过程是逐层“预训练”；在预训练完成后，再对整个网络进行“微调”。此思想相当于是先寻找局部最优，然后整合起来寻找全局最优
梯度爆炸解决方法：

采用权重正则化。通过对网络权重做正则限制过拟合，仔细看正则项在损失函数的形式：

其中，是指正则项系数，因此，如果发生梯度爆炸，权值的范数就会变的非常大，通过正则化项，可以部分限制梯度爆炸的发生。但在深度模型中，梯度消失更常见一些。