NNDL Experiment 7: Recurrent Neural Networks (2) Gradient Explosion Experiment

 Table of Contents

Preface

6.2.1 Gradient Printing Function

[Question] What is a norm, what is the L2 norm, and why do we print gradient norms here?

6.2.2 Reproducing the Gradient Explosion Phenomenon

6.2.3 Using Gradient Clipping to Solve Gradient Explosion


[Question] What is the principle behind using gradient clipping to solve gradient explosion?

Summary


Preface

       This one almost didn't get finished. I describe at the end what these past few days were like: in short, I heard Lianchi District was about to be locked down, so I went back to my hometown, only to learn after arriving that all of Baoding was under lockdown. This post took a long time, and I stayed up late to finish it.

       I hope the epidemic passes soon. It is not written all that well, so I ask the teacher and the experts reading this to point out any problems.

         


 

1. 6.2 Gradient Explosion Experiment

There are two reasons why a simple recurrent network has trouble modeling long-range dependencies: gradient explosion and gradient vanishing.

Gradient explosion: comparatively easy to deal with; it can usually be avoided well enough with weight decay or gradient clipping.

Gradient vanishing: the more effective approach is to change the model, for example using a Long Short-Term Memory network (LSTM) to alleviate it.


This section first reproduces the gradient explosion problem in the simple recurrent network, and then tries to solve it with gradient clipping.

The experiment uses the dataset with sequence length 20.

During training, the L2 norms of the gradient vectors of W, U, and b are printed to track how the gradients change.

6.2.1 Gradient Printing Function

We use custom_print_log to print the gradients during training. It takes the runner instance and obtains the parameter names and values of the model through model.named_parameters(). Here we define W_list, U_list, and b_list to store the gradient norms of the parameters W, U, and b over the course of training.

# coding=gbk
import os
import random
import time

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torch.nn.init import xavier_uniform
import matplotlib.pyplot as plt

from nndl import Accuracy, RunnerV3

       These are the libraries we need. Many of the classes and functions come from the previous experiments; if you are interested, go back and have a look. As before, pay attention to the file encoding.

W_list = []
U_list = []
b_list = []


# Compute the gradient norms
def custom_print_log(runner):
    model = runner.model
    W_grad_l2, U_grad_l2, b_grad_l2 = 0, 0, 0
    for name, param in model.named_parameters():
        if name == "rnn_model.W":

            W_grad_l2 = torch.norm(param.grad, p=2).numpy()
        if name == "rnn_model.U":
            U_grad_l2 = torch.norm(param.grad, p=2).numpy()
        if name == "rnn_model.b":
            b_grad_l2 = torch.norm(param.grad, p=2).numpy()
    print(f"[Training] W_grad_l2: {W_grad_l2:.5f}, U_grad_l2: {U_grad_l2:.5f}, b_grad_l2: {b_grad_l2:.5f} ")
    W_list.append(W_grad_l2)
    U_list.append(U_grad_l2)
    b_list.append(b_grad_l2)

[Question] What is a norm, what is the L2 norm, and why do we print gradient norms here?

      First, a quick note on the concept of a norm.

A norm is a function that carries the notion of "distance". Distance itself is a loose concept: any function that is non-negative, reflexive, and satisfies the triangle inequality can be called a distance. A norm strengthens this notion: compared with a distance, its definition additionally requires compatibility with scalar multiplication. For intuition it is often fine to think of a norm as a distance.

In mathematics there are vector norms and matrix norms. A vector norm measures the size of a vector in a vector space; a matrix norm measures the size of the change a matrix causes. Informally: every vector in a vector space has a size, and a norm is the rule used to measure that size; different norms measure it differently, just as meters and feet both measure length. For matrix norms, recall from linear algebra that the operation AX = B transforms the vector X into B; a matrix norm measures the size of that transformation.
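As a small, self-contained check (my own example, not part of the experiment code): the L2 norm is the square root of the sum of squared entries, and torch.norm(..., p=2), the call used in custom_print_log above, computes exactly that, treating a matrix as one flattened vector of entries:

import torch

v = torch.tensor([3.0, 4.0])
print(torch.norm(v, p=2))    # tensor(5.)  because sqrt(3^2 + 4^2) = 5

M = torch.tensor([[1.0, 2.0], [2.0, 4.0]])
print(torch.norm(M, p=2))    # tensor(5.)  because sqrt(1 + 4 + 4 + 16) = 5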

       As for its practical role, take a look at the "fish book", the classic machine learning book.

      Honestly, when something is unclear, going back to the fish book really helps; it is an excellent reference.

       In my view, printing the gradient norms is simply a way to watch how the gradients change: much like a norm, it reduces each gradient to a single absolute magnitude rather than a signed sum, so sudden growth or shrinkage is easy to spot.

        Below is an excerpt from a blog post I read:

        什么是梯度_guanhuazhan的博客-CSDN博客_梯度

In machine learning and deep learning, we use training data to minimize a loss function and thereby determine the parameter values. Minimizing the loss function means finding its extrema.
Finding extrema involves derivatives. For a continuous function $f(x)$, setting the first derivative $f'(x) = 0$ and solving that equation gives the extreme points directly. But when there are many variables or the function is complicated, an explicit solution of $f'(x) = 0$ is hard to obtain, and computers are not good at solving such equations symbolically.
What computers are good at is using raw computing power to "try out" the extremum step by step with iterative numerical methods (Newton-type methods, the secant method, and so on).
This massive trial-and-error needs a sense of direction; to explain that direction, we first introduce the directional derivative.

  • Directional derivative
    The directional derivative generalizes the partial derivative: a partial derivative studies the rate of change along a coordinate axis, while a directional derivative can study any direction.
    At a given point, the directional derivative reaches its maximum along the gradient direction, and that maximum equals the norm of the gradient.
  • Gradient
    In a scalar field, the directional derivatives of a function at a given point generally differ from direction to direction. Along which direction is the directional derivative largest, and what is that maximum? To answer this we introduce an important concept: the gradient.
    The directional derivative at a point reaches its maximum along the gradient direction.
    In other words, the function value increases fastest along the gradient. Likewise, the directional derivative takes its minimum in the direction opposite to the gradient, that minimum being the negative of the maximum, so the function value decreases fastest in the opposite direction.
  • Gradient value
    For a single-variable real-valued function, the gradient is simply the derivative; for a linear function, it is the slope of the curve at a point.
    For a function of several variables, for example in a three-dimensional Cartesian coordinate system, the gradient is the vector of partial derivatives $\nabla f = (\partial f/\partial x,\ \partial f/\partial y,\ \partial f/\partial z)$.
    Computing it requires partial derivatives.
    $\gamma$ is what machine learning usually calls the learning rate, i.e. the step size in the gradient descent method above.
    By computing the gradient of the objective (the partial derivatives with respect to all parameters) and updating the parameters $\Theta$ in the opposite direction, the function value decreases as fast as possible, so after some iterations the objective quickly reaches a local minimum. If the objective is convex, that local minimum is also the global minimum, and gradient descent is guaranteed to converge to the global optimum.

Gradient descent is determined by the gradient direction and the step size, moving a little at each step. Every step is taken, from the current point, toward the extremum, which is why it converges.
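To make the update rule concrete, here is a minimal gradient-descent sketch (my own illustration, not code from the experiment; the toy objective and the names are made up), using PyTorch autograd on f(theta) = sum(theta_i^2):

# Minimal sketch: compute the gradient, then step against it with learning rate gamma
import torch

theta = torch.tensor([4.0, 3.0], requires_grad=True)   # parameters to optimize
gamma = 0.1                                             # learning rate (step size)

for step in range(100):
    loss = (theta ** 2).sum()        # toy objective f(theta)
    loss.backward()                  # autograd computes the gradient df/dtheta = 2*theta
    with torch.no_grad():
        theta -= gamma * theta.grad  # move against the gradient
        theta.grad.zero_()           # reset the gradient for the next step

print(theta)                         # both entries approach 0, the minimum of f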

 

6.2.2 Reproducing the Gradient Explosion Phenomenon


 

To better reproduce the gradient explosion problem, we use the SGD optimizer with a larger batch size and learning rate (learning rate 0.2), and set reduction to "sum" when computing the cross-entropy loss so that the per-sample losses are summed instead of averaged. The code is as follows:

np.random.seed(0)
random.seed(0)
torch.manual_seed(0)

# Number of training epochs
num_epochs = 50
# Learning rate
lr = 0.2
# Number of input digit classes
num_digits = 10
# Dimension of the vector each digit is mapped to
input_size = 32
# Dimension of the hidden state vector
hidden_size = 32
# Number of classes for the predicted sum
num_classes = 19
# Batch size
batch_size = 64
# Directory for saving the model
save_dir = "./checkpoints"


# Different values of length can be set to run the prediction experiment on different sequence lengths
length = 20
print(f"\n====> Training SRN with data of length {length}.")

# Load the data of length `length`
data_path = f"./datasets/{length}"
train_examples, dev_examples, test_examples = load_data(data_path)
train_set, dev_set, test_set = DigitSumDataset(train_examples), DigitSumDataset(dev_examples),DigitSumDataset(test_examples)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size)
dev_loader = torch.utils.data.DataLoader(dev_set, batch_size=batch_size)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=batch_size)
# Instantiate the model
base_model = SRN(input_size, hidden_size)
model = Model_RNN4SeqClass(base_model, num_digits, input_size, hidden_size, num_classes)
# Specify the optimizer
optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Define the evaluation metric
metric = Accuracy()
# Define the loss function
loss_fn = nn.CrossEntropyLoss(reduction="sum")

# Instantiate the Runner with the components above
runner = RunnerV3(model, optimizer, loss_fn, metric)

# Train the model
model_save_path = os.path.join(save_dir, f"srn_explosion_model_{length}.pdparams")
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=100, log_steps=1,
             save_path=model_save_path, custom_print_log=custom_print_log)

This relies on several classes and functions from the earlier experiments; they are included below.

class DigitSumDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, idx):
        example = self.data[idx]
        seq = torch.tensor(example[0], dtype=torch.int64)
        label = torch.tensor(example[1], dtype=torch.int64)
        return seq, label

    def __len__(self):
        return len(self.data)
class Embedding(nn.Module):
    def __init__(self, num_embeddings, embedding_dim,
                 para_attr=xavier_uniform):
        super(Embedding, self).__init__()
        # Define the embedding matrix
        W=torch.zeros(size=[num_embeddings, embedding_dim], dtype=torch.float32)
        self.W = torch.nn.Parameter(W)
        xavier_uniform(W)
    def forward(self, inputs):
        # Look up the word vectors for the given indices
        embs = self.W[inputs]
        return embs
# Load the data
def load_data(data_path):
    # Load the training set
    train_examples = []
    train_path = os.path.join(data_path, "train.txt")
    with open(train_path, "r", encoding="utf-8") as f:
        for line in f.readlines():
            # Parse one line into a digit sequence seq and a label
            items = line.strip().split("\t")
            seq = [int(i) for i in items[0].split(" ")]
            label = int(items[1])
            train_examples.append((seq, label))

    # Load the validation set
    dev_examples = []
    dev_path = os.path.join(data_path, "dev.txt")
    with open(dev_path, "r", encoding="utf-8") as f:
        for line in f.readlines():
            # Parse one line into a digit sequence seq and a label
            items = line.strip().split("\t")
            seq = [int(i) for i in items[0].split(" ")]
            label = int(items[1])
            dev_examples.append((seq, label))

    # Load the test set
    test_examples = []
    test_path = os.path.join(data_path, "test.txt")
    with open(test_path, "r", encoding="utf-8") as f:
        for line in f.readlines():
            # Parse one line into a digit sequence seq and a label
            items = line.strip().split("\t")
            seq = [int(i) for i in items[0].split(" ")]
            label = int(items[1])
            test_examples.append((seq, label))

    return train_examples, dev_examples, test_examples
# SRN model
class SRN(nn.Module):
    def __init__(self, input_size,  hidden_size, W_attr=None, U_attr=None, b_attr=None):
        super(SRN, self).__init__()
        # Dimension of the embedding vectors
        self.input_size = input_size
        # Dimension of the hidden state
        self.hidden_size = hidden_size
        # Define parameter W with shape input_size x hidden_size
        if W_attr==None:
            W=torch.zeros(size=[input_size, hidden_size], dtype=torch.float32)
        else:
            W=torch.tensor(W_attr,dtype=torch.float32)
        self.W = torch.nn.Parameter(W)
        # Define parameter U with shape hidden_size x hidden_size
        if U_attr==None:
            U=torch.zeros(size=[hidden_size, hidden_size], dtype=torch.float32)
        else:
            U=torch.tensor(U_attr,dtype=torch.float32)
        self.U = torch.nn.Parameter(U)
        # Define parameter b with shape 1 x hidden_size
        if b_attr==None:
            b=torch.zeros(size=[1, hidden_size], dtype=torch.float32)
        else:
            b=torch.tensor(b_attr,dtype=torch.float32)
        self.b = torch.nn.Parameter(b)

    # Initialize the hidden state
    def init_state(self, batch_size):
        hidden_state = torch.zeros(size=[batch_size, self.hidden_size], dtype=torch.float32)
        return hidden_state

    # Forward computation
    def forward(self, inputs, hidden_state=None):
        # inputs: input data with shape batch_size x seq_len x input_size
        batch_size, seq_len, input_size = inputs.shape

        # Initialize the hidden state at the first step, shape batch_size x hidden_size
        if hidden_state is None:
            hidden_state = self.init_state(batch_size)

        # Run the RNN computation over the sequence
        for step in range(seq_len):
            # Input at the current time step, shape batch_size x input_size
            step_input = inputs[:, step, :]
            # Hidden state at the current time step, shape batch_size x hidden_size
            hidden_state = hidden_state+F.tanh(torch.matmul(step_input, self.W) + torch.matmul(hidden_state, self.U) + self.b)
        return hidden_state
# Model for digit-sum prediction based on an RNN
class Model_RNN4SeqClass(nn.Module):
    def __init__(self, model, num_digits, input_size, hidden_size, num_classes):
        super(Model_RNN4SeqClass, self).__init__()
        # The instantiated RNN layer, e.g. SRN
        self.rnn_model = model
        # Vocabulary size
        self.num_digits = num_digits
        # Dimension of the embedding vectors
        self.input_size = input_size
        # Define the Embedding layer
        self.embedding = Embedding(num_digits, input_size)
        # Define the linear layer
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, inputs):
        # Map the digit sequence to embedding vectors
        inputs_emb = self.embedding(inputs)
        # Run the RNN model
        hidden_state = self.rnn_model(inputs_emb)
        # Use the hidden state at the last time step for prediction
        logits = self.linear(hidden_state)
        return logits

 The output is:

====> Training SRN with data of length 20.
[Train] epoch: 0/50, step: 0/250, loss: 186.21339
C:/Users/LENOVO/PycharmProjects/pythonProject/深度学习/第三十三个  梯度爆炸.py:76: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
  xavier_uniform(W)
D:\ProgramData\Anaconda3\envs\pytorch\lib\site-packages\torch\nn\functional.py:1795: UserWarning: nn.functional.tanh is deprecated. Use torch.tanh instead.
  warnings.warn("nn.functional.tanh is deprecated. Use torch.tanh instead.")
[Training] W_grad_l2: 153.80124, U_grad_l2: 0.00000, b_grad_l2: 145.78906 
[Train] epoch: 0/50, step: 1/250, loss: 708.52667
[Training] W_grad_l2: 568.55756, U_grad_l2: 17895.19531, b_grad_l2: 571.97260 
[Train] epoch: 0/50, step: 2/250, loss: 1597.21106
[Training] W_grad_l2: 194.40019, U_grad_l2: 7365.36719, b_grad_l2: 213.46704 
[Train] epoch: 0/50, step: 3/250, loss: 727.86835
[Training] W_grad_l2: 298.26129, U_grad_l2: 16652.06445, b_grad_l2: 285.03024 
[Train] epoch: 0/50, step: 4/250, loss: 533.27521
[Training] W_grad_l2: 136.88324, U_grad_l2: 3190.75854, b_grad_l2: 141.68430 
[Train] epoch: 1/50, step: 5/250, loss: 4452.34082
[Training] W_grad_l2: 236.43051, U_grad_l2: 3292.67847, b_grad_l2: 259.82944 
[Train] epoch: 1/50, step: 6/250, loss: 813.83344
[Training] W_grad_l2: 392.05902, U_grad_l2: 628.60596, b_grad_l2: 331.43436 
[Train] epoch: 1/50, step: 7/250, loss: 1634.58240
[Training] W_grad_l2: 1076.98792, U_grad_l2: 2315.21704, b_grad_l2: 1321.85315 
[Train] epoch: 1/50, step: 8/250, loss: 3512.06909
[Training] W_grad_l2: 988.60632, U_grad_l2: 4505.04346, b_grad_l2: 1298.94727 
[Train] epoch: 1/50, step: 9/250, loss: 1489.40576
[Training] W_grad_l2: 632.75385, U_grad_l2: 1525.12915, b_grad_l2: 822.00928 
[Train] epoch: 2/50, step: 10/250, loss: 2568.98096
[Training] W_grad_l2: 242.73698, U_grad_l2: 848.93695, b_grad_l2: 216.90353 
[Train] epoch: 2/50, step: 11/250, loss: 747.40607
[Training] W_grad_l2: 135.77904, U_grad_l2: 927.38440, b_grad_l2: 107.91950 
[Train] epoch: 2/50, step: 12/250, loss: 2748.28149
[Training] W_grad_l2: 426.71057, U_grad_l2: 884.18860, b_grad_l2: 392.46512 
[Train] epoch: 2/50, step: 13/250, loss: 2204.92407
[Training] W_grad_l2: 616.31995, U_grad_l2: 1591.48193, b_grad_l2: 639.45947 
[Train] epoch: 2/50, step: 14/250, loss: 2337.81934
[Training] W_grad_l2: 147.38071, U_grad_l2: 685.91541, b_grad_l2: 110.99716 
[Train] epoch: 3/50, step: 15/250, loss: 4630.19824
[Training] W_grad_l2: 401.85770, U_grad_l2: 596.25848, b_grad_l2: 423.88623 
[Train] epoch: 3/50, step: 16/250, loss: 4108.71094
[Training] W_grad_l2: 1063.28748, U_grad_l2: 2066.43091, b_grad_l2: 933.23157 
[Train] epoch: 3/50, step: 17/250, loss: 8016.23584
[Training] W_grad_l2: 506.90164, U_grad_l2: 4599.65820, b_grad_l2: 517.09277 
[Train] epoch: 3/50, step: 18/250, loss: 2424.85229
[Training] W_grad_l2: 129.12135, U_grad_l2: 675.90424, b_grad_l2: 105.74139 
[Train] epoch: 3/50, step: 19/250, loss: 2195.64673
[Training] W_grad_l2: 288.23907, U_grad_l2: 1078.69067, b_grad_l2: 142.31001 
[Train] epoch: 4/50, step: 20/250, loss: 2101.90381
[Training] W_grad_l2: 1381.09705, U_grad_l2: 2698.74951, b_grad_l2: 606.16608 
[Train] epoch: 4/50, step: 21/250, loss: 1335.80493
[Training] W_grad_l2: 303.76605, U_grad_l2: 1882.76392, b_grad_l2: 250.37833 
[Train] epoch: 4/50, step: 22/250, loss: 3370.64502
[Training] W_grad_l2: 2490.67212, U_grad_l2: 4149.84424, b_grad_l2: 1275.26392 
[Train] epoch: 4/50, step: 23/250, loss: 2641.53345
[Training] W_grad_l2: 363.26627, U_grad_l2: 708.46198, b_grad_l2: 163.15649 
[Train] epoch: 4/50, step: 24/250, loss: 2223.49756
[Training] W_grad_l2: 1690.52734, U_grad_l2: 1648.62036, b_grad_l2: 853.01355 
[Train] epoch: 5/50, step: 25/250, loss: 2403.47217
[Training] W_grad_l2: 393.19580, U_grad_l2: 1518.15906, b_grad_l2: 381.73193 
[Train] epoch: 5/50, step: 26/250, loss: 1602.93469
[Training] W_grad_l2: 402.13727, U_grad_l2: 983.40454, b_grad_l2: 205.58714 
[Train] epoch: 5/50, step: 27/250, loss: 1560.14380
[Training] W_grad_l2: 469.22964, U_grad_l2: 549.50610, b_grad_l2: 220.65347 
[Train] epoch: 5/50, step: 28/250, loss: 1169.99744
[Training] W_grad_l2: 348.95605, U_grad_l2: 1318.54285, b_grad_l2: 229.25320 
[Train] epoch: 5/50, step: 29/250, loss: 1711.58557
[Training] W_grad_l2: 60.36230, U_grad_l2: 458.70596, b_grad_l2: 33.40940 
[Train] epoch: 6/50, step: 30/250, loss: 2371.88477
[Training] W_grad_l2: 158.26450, U_grad_l2: 703.25842, b_grad_l2: 108.12794 
[Train] epoch: 6/50, step: 31/250, loss: 1519.99841
[Training] W_grad_l2: 499.43515, U_grad_l2: 1407.49341, b_grad_l2: 202.56265 
[Train] epoch: 6/50, step: 32/250, loss: 2021.68335
[Training] W_grad_l2: 823.89661, U_grad_l2: 3710.75146, b_grad_l2: 667.87573 
[Train] epoch: 6/50, step: 33/250, loss: 879.29437
[Training] W_grad_l2: 641.19958, U_grad_l2: 1647.09021, b_grad_l2: 452.08405 
[Train] epoch: 6/50, step: 34/250, loss: 1207.27551
[Training] W_grad_l2: 499.42621, U_grad_l2: 3324.51978, b_grad_l2: 290.58469 
[Train] epoch: 7/50, step: 35/250, loss: 614.93738
[Training] W_grad_l2: 86.36131, U_grad_l2: 638.95251, b_grad_l2: 66.54583 
[Train] epoch: 7/50, step: 36/250, loss: 1373.09424
[Training] W_grad_l2: 711.57129, U_grad_l2: 3837.32178, b_grad_l2: 476.38354 
[Train] epoch: 7/50, step: 37/250, loss: 610.71814
[Training] W_grad_l2: 1087.88550, U_grad_l2: 1673.08826, b_grad_l2: 601.19678 
[Train] epoch: 7/50, step: 38/250, loss: 475.19589
[Training] W_grad_l2: 257.90613, U_grad_l2: 843.23022, b_grad_l2: 102.96564 
[Train] epoch: 7/50, step: 39/250, loss: 710.82050
[Training] W_grad_l2: 140.07088, U_grad_l2: 644.95160, b_grad_l2: 65.04790 
[Train] epoch: 8/50, step: 40/250, loss: 976.08154
[Training] W_grad_l2: 89.42922, U_grad_l2: 445.68610, b_grad_l2: 83.51238 
[Train] epoch: 8/50, step: 41/250, loss: 707.80542
[Training] W_grad_l2: 616.51648, U_grad_l2: 953.24042, b_grad_l2: 408.97501 
[Train] epoch: 8/50, step: 42/250, loss: 405.03241
[Training] W_grad_l2: 421.07596, U_grad_l2: 687.18024, b_grad_l2: 276.64359 
[Train] epoch: 8/50, step: 43/250, loss: 696.20312
[Training] W_grad_l2: 250.99290, U_grad_l2: 473.85352, b_grad_l2: 122.85160 
[Train] epoch: 8/50, step: 44/250, loss: 448.20691
[Training] W_grad_l2: 203.75464, U_grad_l2: 264.47464, b_grad_l2: 98.90932 
[Train] epoch: 9/50, step: 45/250, loss: 908.73187
[Training] W_grad_l2: 2800.12500, U_grad_l2: 4175.88232, b_grad_l2: 1164.45715 
[Train] epoch: 9/50, step: 46/250, loss: 598.41876
[Training] W_grad_l2: 872.08673, U_grad_l2: 1742.76514, b_grad_l2: 362.93369 
[Train] epoch: 9/50, step: 47/250, loss: 627.91107
[Training] W_grad_l2: 521.35468, U_grad_l2: 674.21942, b_grad_l2: 145.03313 
[Train] epoch: 9/50, step: 48/250, loss: 346.36353
[Training] W_grad_l2: 617.08368, U_grad_l2: 929.50867, b_grad_l2: 398.16833 
[Train] epoch: 9/50, step: 49/250, loss: 245.62643
[Training] W_grad_l2: 919.39624, U_grad_l2: 1712.06567, b_grad_l2: 273.45828 
[Train] epoch: 10/50, step: 50/250, loss: 536.15771
[Training] W_grad_l2: 3773.06665, U_grad_l2: 5999.01660, b_grad_l2: 2233.00854 
[Train] epoch: 10/50, step: 51/250, loss: 376.56516
[Training] W_grad_l2: 659.40826, U_grad_l2: 1951.77600, b_grad_l2: 237.97621 
[Train] epoch: 10/50, step: 52/250, loss: 414.12241
[Training] W_grad_l2: 1883.84448, U_grad_l2: 1496.09033, b_grad_l2: 678.35693 
[Train] epoch: 10/50, step: 53/250, loss: 351.15121
[Training] W_grad_l2: 780.89209, U_grad_l2: 1047.18994, b_grad_l2: 294.78894 
[Train] epoch: 10/50, step: 54/250, loss: 333.44766
[Training] W_grad_l2: 918.23212, U_grad_l2: 776.05096, b_grad_l2: 295.90305 
[Train] epoch: 11/50, step: 55/250, loss: 675.48627
[Training] W_grad_l2: 1771.51550, U_grad_l2: 2145.17773, b_grad_l2: 583.11560 
[Train] epoch: 11/50, step: 56/250, loss: 431.52206
[Training] W_grad_l2: 3371.81763, U_grad_l2: 3635.41089, b_grad_l2: 1113.23047 
[Train] epoch: 11/50, step: 57/250, loss: 318.71204
[Training] W_grad_l2: 1156.06030, U_grad_l2: 1768.55078, b_grad_l2: 508.63959 
[Train] epoch: 11/50, step: 58/250, loss: 346.00888
[Training] W_grad_l2: 1412.77905, U_grad_l2: 1698.72009, b_grad_l2: 362.82333 
[Train] epoch: 11/50, step: 59/250, loss: 296.12466
[Training] W_grad_l2: 852.94464, U_grad_l2: 729.31793, b_grad_l2: 270.99191 
[Train] epoch: 12/50, step: 60/250, loss: 415.69067
[Training] W_grad_l2: 2748.24658, U_grad_l2: 1324.11304, b_grad_l2: 1161.00952 
[Train] epoch: 12/50, step: 61/250, loss: 414.08554
[Training] W_grad_l2: 695.03668, U_grad_l2: 894.41168, b_grad_l2: 163.86638 
[Train] epoch: 12/50, step: 62/250, loss: 1325.71655
[Training] W_grad_l2: 602.97925, U_grad_l2: 1468.90588, b_grad_l2: 220.32057 
[Train] epoch: 12/50, step: 63/250, loss: 535.86975
[Training] W_grad_l2: 483.68716, U_grad_l2: 663.78918, b_grad_l2: 223.05298 
[Train] epoch: 12/50, step: 64/250, loss: 543.88580
[Training] W_grad_l2: 718.34784, U_grad_l2: 1272.91931, b_grad_l2: 246.71823 
[Train] epoch: 13/50, step: 65/250, loss: 812.57806
[Training] W_grad_l2: 1436.21777, U_grad_l2: 3087.88623, b_grad_l2: 480.53537 
[Train] epoch: 13/50, step: 66/250, loss: 650.51019
[Training] W_grad_l2: 691.19507, U_grad_l2: 1367.31055, b_grad_l2: 154.02106 
[Train] epoch: 13/50, step: 67/250, loss: 592.34412
[Training] W_grad_l2: 3211.27246, U_grad_l2: 6517.72607, b_grad_l2: 1146.67908 
[Train] epoch: 13/50, step: 68/250, loss: 761.48773
[Training] W_grad_l2: 488.15079, U_grad_l2: 1189.95410, b_grad_l2: 158.08804 
[Train] epoch: 13/50, step: 69/250, loss: 454.55777
[Training] W_grad_l2: 418.05728, U_grad_l2: 764.91974, b_grad_l2: 167.55203 
[Train] epoch: 14/50, step: 70/250, loss: 1706.06018
[Training] W_grad_l2: 993.02869, U_grad_l2: 1746.46912, b_grad_l2: 270.02426 
[Train] epoch: 14/50, step: 71/250, loss: 859.81451
[Training] W_grad_l2: 407.76157, U_grad_l2: 867.41522, b_grad_l2: 90.93882 
[Train] epoch: 14/50, step: 72/250, loss: 1018.08649
[Training] W_grad_l2: 961.73138, U_grad_l2: 1733.37085, b_grad_l2: 241.23929 
[Train] epoch: 14/50, step: 73/250, loss: 647.45349
[Training] W_grad_l2: 608.02423, U_grad_l2: 1054.87573, b_grad_l2: 205.44684 
[Train] epoch: 14/50, step: 74/250, loss: 345.71646
[Training] W_grad_l2: 348.14307, U_grad_l2: 695.56726, b_grad_l2: 89.60760 
[Train] epoch: 15/50, step: 75/250, loss: 786.05859
[Training] W_grad_l2: 950.33423, U_grad_l2: 2084.12964, b_grad_l2: 292.91693 
[Train] epoch: 15/50, step: 76/250, loss: 538.88605
[Training] W_grad_l2: 567.02869, U_grad_l2: 1689.03369, b_grad_l2: 176.20786 
[Train] epoch: 15/50, step: 77/250, loss: 593.69232
[Training] W_grad_l2: 1124.69434, U_grad_l2: 2476.19507, b_grad_l2: 234.76392 
[Train] epoch: 15/50, step: 78/250, loss: 639.73016
[Training] W_grad_l2: 448.68301, U_grad_l2: 1180.24219, b_grad_l2: 142.97983 
[Train] epoch: 15/50, step: 79/250, loss: 717.81311
[Training] W_grad_l2: 517.89886, U_grad_l2: 2771.45605, b_grad_l2: 205.49802 
[Train] epoch: 16/50, step: 80/250, loss: 1340.44897
[Training] W_grad_l2: 795.83032, U_grad_l2: 1147.58191, b_grad_l2: 395.39407 
[Train] epoch: 16/50, step: 81/250, loss: 1175.86328
[Training] W_grad_l2: 278.63785, U_grad_l2: 774.44995, b_grad_l2: 81.98058 
[Train] epoch: 16/50, step: 82/250, loss: 668.14661
[Training] W_grad_l2: 528.39996, U_grad_l2: 853.20856, b_grad_l2: 140.80666 
[Train] epoch: 16/50, step: 83/250, loss: 472.14325
[Training] W_grad_l2: 280.58066, U_grad_l2: 744.58728, b_grad_l2: 111.85708 
[Train] epoch: 16/50, step: 84/250, loss: 336.99432
[Training] W_grad_l2: 734.75586, U_grad_l2: 1587.87866, b_grad_l2: 232.71425 
[Train] epoch: 17/50, step: 85/250, loss: 843.78772
[Training] W_grad_l2: 542.88324, U_grad_l2: 876.07300, b_grad_l2: 273.15707 
[Train] epoch: 17/50, step: 86/250, loss: 811.20294
[Training] W_grad_l2: 705.07574, U_grad_l2: 925.60706, b_grad_l2: 212.59477 
[Train] epoch: 17/50, step: 87/250, loss: 381.89777
[Training] W_grad_l2: 639.86310, U_grad_l2: 802.36194, b_grad_l2: 168.74593 
[Train] epoch: 17/50, step: 88/250, loss: 346.09991
[Training] W_grad_l2: 952.06445, U_grad_l2: 1650.25354, b_grad_l2: 274.87582 
[Train] epoch: 17/50, step: 89/250, loss: 270.40936
[Training] W_grad_l2: 104.20509, U_grad_l2: 243.22391, b_grad_l2: 29.36130 
[Train] epoch: 18/50, step: 90/250, loss: 829.85870
[Training] W_grad_l2: 391.45886, U_grad_l2: 650.40240, b_grad_l2: 158.54816 
[Train] epoch: 18/50, step: 91/250, loss: 415.56027
[Training] W_grad_l2: 2146.91772, U_grad_l2: 3393.22778, b_grad_l2: 584.25195 
[Train] epoch: 18/50, step: 92/250, loss: 398.49045
[Training] W_grad_l2: 318.47272, U_grad_l2: 615.44531, b_grad_l2: 86.96262 
[Train] epoch: 18/50, step: 93/250, loss: 374.29639
[Training] W_grad_l2: 261.09482, U_grad_l2: 506.96725, b_grad_l2: 72.01350 
[Train] epoch: 18/50, step: 94/250, loss: 313.00717
[Training] W_grad_l2: 167.90930, U_grad_l2: 357.50668, b_grad_l2: 50.22629 
[Train] epoch: 19/50, step: 95/250, loss: 572.00531
[Training] W_grad_l2: 405.72546, U_grad_l2: 1206.76904, b_grad_l2: 95.40356 
[Train] epoch: 19/50, step: 96/250, loss: 515.00189
[Training] W_grad_l2: 319.63815, U_grad_l2: 650.64532, b_grad_l2: 79.77785 
[Train] epoch: 19/50, step: 97/250, loss: 621.61786
[Training] W_grad_l2: 457.69116, U_grad_l2: 666.76062, b_grad_l2: 106.88637 
[Train] epoch: 19/50, step: 98/250, loss: 733.97528
[Training] W_grad_l2: 394.18787, U_grad_l2: 1342.13770, b_grad_l2: 105.66525 
[Train] epoch: 19/50, step: 99/250, loss: 403.54840
[Training] W_grad_l2: 691.11353, U_grad_l2: 1204.78235, b_grad_l2: 141.16905 
[Train] epoch: 20/50, step: 100/250, loss: 1247.97180
[Training] W_grad_l2: 426.81659, U_grad_l2: 1105.08289, b_grad_l2: 105.00610 
C:\Users\LENOVO\PycharmProjects\pythonProject\深度学习\nndl.py:386: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  batch_correct = torch.sum(torch.tensor(preds == labels, dtype=torch.float32)).cpu().numpy()
[Evaluate]  dev score: 0.04000, dev loss: 505.39297
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.04000
[Train] epoch: 20/50, step: 101/250, loss: 547.77661
[Training] W_grad_l2: 346.18964, U_grad_l2: 524.33411, b_grad_l2: 98.35358 
[Train] epoch: 20/50, step: 102/250, loss: 846.12012
[Training] W_grad_l2: 194.31854, U_grad_l2: 273.80405, b_grad_l2: 47.88852 
[Train] epoch: 20/50, step: 103/250, loss: 892.87476
[Training] W_grad_l2: 475.20035, U_grad_l2: 1713.44568, b_grad_l2: 136.97630 
[Train] epoch: 20/50, step: 104/250, loss: 499.97031
[Training] W_grad_l2: 147.51025, U_grad_l2: 603.05157, b_grad_l2: 30.84999 
[Train] epoch: 21/50, step: 105/250, loss: 1552.86719
[Training] W_grad_l2: 235.35168, U_grad_l2: 791.64471, b_grad_l2: 65.59473 
[Train] epoch: 21/50, step: 106/250, loss: 963.29425
[Training] W_grad_l2: 259.70071, U_grad_l2: 916.89532, b_grad_l2: 69.03682 
[Train] epoch: 21/50, step: 107/250, loss: 1444.23035
[Training] W_grad_l2: 464.69995, U_grad_l2: 981.96106, b_grad_l2: 124.79527 
[Train] epoch: 21/50, step: 108/250, loss: 1530.28235
[Training] W_grad_l2: 429.58752, U_grad_l2: 980.34442, b_grad_l2: 118.97498 
[Train] epoch: 21/50, step: 109/250, loss: 1072.10242
[Training] W_grad_l2: 440.50739, U_grad_l2: 723.48706, b_grad_l2: 94.10834 
[Train] epoch: 22/50, step: 110/250, loss: 1689.78406
[Training] W_grad_l2: 166.98726, U_grad_l2: 518.81372, b_grad_l2: 34.64920 
[Train] epoch: 22/50, step: 111/250, loss: 1265.35181
[Training] W_grad_l2: 174.41350, U_grad_l2: 647.02039, b_grad_l2: 58.43666 
[Train] epoch: 22/50, step: 112/250, loss: 2153.06177
[Training] W_grad_l2: 324.42441, U_grad_l2: 638.97424, b_grad_l2: 97.14082 
[Train] epoch: 22/50, step: 113/250, loss: 2343.43359
[Training] W_grad_l2: 628.38593, U_grad_l2: 1030.96704, b_grad_l2: 137.09509 
[Train] epoch: 22/50, step: 114/250, loss: 1263.78577
[Training] W_grad_l2: 143.52322, U_grad_l2: 584.39697, b_grad_l2: 53.65800 
[Train] epoch: 23/50, step: 115/250, loss: 1230.48853
[Training] W_grad_l2: 648.93396, U_grad_l2: 1310.21033, b_grad_l2: 231.64182 
[Train] epoch: 23/50, step: 116/250, loss: 878.90552
[Training] W_grad_l2: 246.44838, U_grad_l2: 1001.15649, b_grad_l2: 120.50476 
[Train] epoch: 23/50, step: 117/250, loss: 809.23737
[Training] W_grad_l2: 219.87837, U_grad_l2: 476.15018, b_grad_l2: 64.08628 
[Train] epoch: 23/50, step: 118/250, loss: 1032.58118
[Training] W_grad_l2: 127.48110, U_grad_l2: 561.66168, b_grad_l2: 36.60998 
[Train] epoch: 23/50, step: 119/250, loss: 661.10901
[Training] W_grad_l2: 330.75183, U_grad_l2: 1217.51331, b_grad_l2: 97.91036 
[Train] epoch: 24/50, step: 120/250, loss: 863.09961
[Training] W_grad_l2: 369.42505, U_grad_l2: 713.46533, b_grad_l2: 82.28414 
[Train] epoch: 24/50, step: 121/250, loss: 1021.90747
[Training] W_grad_l2: 105.01623, U_grad_l2: 425.32050, b_grad_l2: 30.30401 
[Train] epoch: 24/50, step: 122/250, loss: 1239.09155
[Training] W_grad_l2: 445.28854, U_grad_l2: 773.64716, b_grad_l2: 107.62221 
[Train] epoch: 24/50, step: 123/250, loss: 1476.88464
[Training] W_grad_l2: 131.93954, U_grad_l2: 788.60736, b_grad_l2: 41.21605 
[Train] epoch: 24/50, step: 124/250, loss: 1006.27771
[Training] W_grad_l2: 171.77055, U_grad_l2: 743.10846, b_grad_l2: 54.98726 
[Train] epoch: 25/50, step: 125/250, loss: 701.93921
[Training] W_grad_l2: 966.24603, U_grad_l2: 2393.54321, b_grad_l2: 269.51151 
[Train] epoch: 25/50, step: 126/250, loss: 819.64240
[Training] W_grad_l2: 348.84335, U_grad_l2: 1067.88159, b_grad_l2: 140.21825 
[Train] epoch: 25/50, step: 127/250, loss: 470.05237
[Training] W_grad_l2: 64.67394, U_grad_l2: 214.36086, b_grad_l2: 24.85582 
[Train] epoch: 25/50, step: 128/250, loss: 1575.68433
[Training] W_grad_l2: 501.93814, U_grad_l2: 877.82660, b_grad_l2: 157.69894 
[Train] epoch: 25/50, step: 129/250, loss: 436.81424
[Training] W_grad_l2: 165.49118, U_grad_l2: 358.01071, b_grad_l2: 40.25830 
[Train] epoch: 26/50, step: 130/250, loss: 605.89368
[Training] W_grad_l2: 642.09161, U_grad_l2: 486.07925, b_grad_l2: 270.85852 
[Train] epoch: 26/50, step: 131/250, loss: 591.27081
[Training] W_grad_l2: 282.02335, U_grad_l2: 632.03088, b_grad_l2: 91.29311 
[Train] epoch: 26/50, step: 132/250, loss: 674.90289
[Training] W_grad_l2: 181.08003, U_grad_l2: 571.05566, b_grad_l2: 60.59184 
[Train] epoch: 26/50, step: 133/250, loss: 1001.81665
[Training] W_grad_l2: 1082.37488, U_grad_l2: 1251.69958, b_grad_l2: 193.65773 
[Train] epoch: 26/50, step: 134/250, loss: 442.69775
[Training] W_grad_l2: 468.00653, U_grad_l2: 861.28186, b_grad_l2: 88.10838 
[Train] epoch: 27/50, step: 135/250, loss: 616.21313
[Training] W_grad_l2: 330.53735, U_grad_l2: 375.41025, b_grad_l2: 138.19745 
[Train] epoch: 27/50, step: 136/250, loss: 1975.60278
[Training] W_grad_l2: 1340.67053, U_grad_l2: 2336.40308, b_grad_l2: 419.22269 
[Train] epoch: 27/50, step: 137/250, loss: 821.69598
[Training] W_grad_l2: 759.10223, U_grad_l2: 1922.67017, b_grad_l2: 199.33495 
[Train] epoch: 27/50, step: 138/250, loss: 461.02533
[Training] W_grad_l2: 95.15731, U_grad_l2: 224.83163, b_grad_l2: 26.42562 
[Train] epoch: 27/50, step: 139/250, loss: 867.87708
[Training] W_grad_l2: 98.64647, U_grad_l2: 328.50238, b_grad_l2: 32.05753 
[Train] epoch: 28/50, step: 140/250, loss: 748.53339
[Training] W_grad_l2: 590.19550, U_grad_l2: 795.29938, b_grad_l2: 233.56200 
[Train] epoch: 28/50, step: 141/250, loss: 630.06158
[Training] W_grad_l2: 1964.69250, U_grad_l2: 1345.07678, b_grad_l2: 654.41052 
[Train] epoch: 28/50, step: 142/250, loss: 773.83533
[Training] W_grad_l2: 164.93347, U_grad_l2: 481.55130, b_grad_l2: 45.46158 
[Train] epoch: 28/50, step: 143/250, loss: 1107.66272
[Training] W_grad_l2: 380.32953, U_grad_l2: 1234.77490, b_grad_l2: 95.56108 
[Train] epoch: 28/50, step: 144/250, loss: 442.72552
[Training] W_grad_l2: 90.87679, U_grad_l2: 387.05618, b_grad_l2: 23.27904 
[Train] epoch: 29/50, step: 145/250, loss: 765.16425
[Training] W_grad_l2: 366.85092, U_grad_l2: 1058.57983, b_grad_l2: 88.13979 
[Train] epoch: 29/50, step: 146/250, loss: 698.59540
[Training] W_grad_l2: 848.76727, U_grad_l2: 1530.36853, b_grad_l2: 163.03226 
[Train] epoch: 29/50, step: 147/250, loss: 372.57721
[Training] W_grad_l2: 1195.51794, U_grad_l2: 2699.18018, b_grad_l2: 327.45612 
[Train] epoch: 29/50, step: 148/250, loss: 373.48151
[Training] W_grad_l2: 283.80255, U_grad_l2: 621.65057, b_grad_l2: 111.53963 
[Train] epoch: 29/50, step: 149/250, loss: 284.23615
[Training] W_grad_l2: 264.47189, U_grad_l2: 579.17175, b_grad_l2: 60.17512 
[Train] epoch: 30/50, step: 150/250, loss: 806.09296
[Training] W_grad_l2: 126.99026, U_grad_l2: 626.58545, b_grad_l2: 35.33961 
[Train] epoch: 30/50, step: 151/250, loss: 761.93951
[Training] W_grad_l2: 152.30582, U_grad_l2: 740.19409, b_grad_l2: 58.29977 
[Train] epoch: 30/50, step: 152/250, loss: 778.65240
[Training] W_grad_l2: 379.97644, U_grad_l2: 1322.53809, b_grad_l2: 122.69746 
[Train] epoch: 30/50, step: 153/250, loss: 689.98779
[Training] W_grad_l2: 266.57611, U_grad_l2: 845.24890, b_grad_l2: 68.64983 
[Train] epoch: 30/50, step: 154/250, loss: 509.47617
[Training] W_grad_l2: 330.34406, U_grad_l2: 1021.60413, b_grad_l2: 83.96372 
[Train] epoch: 31/50, step: 155/250, loss: 553.55780
[Training] W_grad_l2: 456.48456, U_grad_l2: 1034.06543, b_grad_l2: 118.43233 
[Train] epoch: 31/50, step: 156/250, loss: 685.44446
[Training] W_grad_l2: 662.24908, U_grad_l2: 2324.23267, b_grad_l2: 203.40848 
[Train] epoch: 31/50, step: 157/250, loss: 476.89340
[Training] W_grad_l2: 148.15636, U_grad_l2: 459.74774, b_grad_l2: 24.91963 
[Train] epoch: 31/50, step: 158/250, loss: 1245.36243
[Training] W_grad_l2: 115.48822, U_grad_l2: 639.09906, b_grad_l2: 39.33096 
[Train] epoch: 31/50, step: 159/250, loss: 502.68091
[Training] W_grad_l2: 217.51610, U_grad_l2: 839.70593, b_grad_l2: 71.53602 
[Train] epoch: 32/50, step: 160/250, loss: 725.00269
[Training] W_grad_l2: 237.31125, U_grad_l2: 1241.15393, b_grad_l2: 61.71370 
[Train] epoch: 32/50, step: 161/250, loss: 822.74371
[Training] W_grad_l2: 349.25336, U_grad_l2: 907.40912, b_grad_l2: 83.73379 
[Train] epoch: 32/50, step: 162/250, loss: 939.57654
[Training] W_grad_l2: 184.70580, U_grad_l2: 744.31842, b_grad_l2: 42.08091 
[Train] epoch: 32/50, step: 163/250, loss: 488.65924
[Training] W_grad_l2: 146.95937, U_grad_l2: 412.70590, b_grad_l2: 32.29716 
[Train] epoch: 32/50, step: 164/250, loss: 465.53482
[Training] W_grad_l2: 83.39732, U_grad_l2: 408.37787, b_grad_l2: 22.78846 
[Train] epoch: 33/50, step: 165/250, loss: 807.61993
[Training] W_grad_l2: 346.29550, U_grad_l2: 861.38977, b_grad_l2: 114.37160 
[Train] epoch: 33/50, step: 166/250, loss: 1059.97424
[Training] W_grad_l2: 1138.62842, U_grad_l2: 3208.15112, b_grad_l2: 373.70050 
[Train] epoch: 33/50, step: 167/250, loss: 969.20715
[Training] W_grad_l2: 250.47102, U_grad_l2: 1088.16211, b_grad_l2: 85.36966 
[Train] epoch: 33/50, step: 168/250, loss: 940.06244
[Training] W_grad_l2: 148.84111, U_grad_l2: 491.93643, b_grad_l2: 43.16822 
[Train] epoch: 33/50, step: 169/250, loss: 542.85730
[Training] W_grad_l2: 136.07613, U_grad_l2: 347.37338, b_grad_l2: 29.35869 
[Train] epoch: 34/50, step: 170/250, loss: 1398.05359
[Training] W_grad_l2: 133.24699, U_grad_l2: 559.04364, b_grad_l2: 50.68082 
[Train] epoch: 34/50, step: 171/250, loss: 2362.44971
[Training] W_grad_l2: 593.96606, U_grad_l2: 2023.77905, b_grad_l2: 167.92070 
[Train] epoch: 34/50, step: 172/250, loss: 1075.27795
[Training] W_grad_l2: 167.86284, U_grad_l2: 864.18341, b_grad_l2: 81.40692 
[Train] epoch: 34/50, step: 173/250, loss: 1072.25098
[Training] W_grad_l2: 222.68735, U_grad_l2: 761.77722, b_grad_l2: 97.56877 
[Train] epoch: 34/50, step: 174/250, loss: 443.74640
[Training] W_grad_l2: 156.39125, U_grad_l2: 437.65521, b_grad_l2: 46.63094 
[Train] epoch: 35/50, step: 175/250, loss: 934.00366
[Training] W_grad_l2: 426.34015, U_grad_l2: 508.70105, b_grad_l2: 154.82057 
[Train] epoch: 35/50, step: 176/250, loss: 767.20544
[Training] W_grad_l2: 198.94003, U_grad_l2: 880.29956, b_grad_l2: 77.95716 
[Train] epoch: 35/50, step: 177/250, loss: 514.86639
[Training] W_grad_l2: 83.29305, U_grad_l2: 577.56824, b_grad_l2: 34.63567 
[Train] epoch: 35/50, step: 178/250, loss: 588.42908
[Training] W_grad_l2: 72.95541, U_grad_l2: 294.20673, b_grad_l2: 28.37526 
[Train] epoch: 35/50, step: 179/250, loss: 504.21854
[Training] W_grad_l2: 83.01884, U_grad_l2: 501.44318, b_grad_l2: 28.76788 
[Train] epoch: 36/50, step: 180/250, loss: 859.14618
[Training] W_grad_l2: 902.94116, U_grad_l2: 2357.78979, b_grad_l2: 355.04181 
[Train] epoch: 36/50, step: 181/250, loss: 1063.86267
[Training] W_grad_l2: 377.24210, U_grad_l2: 1297.74854, b_grad_l2: 100.71416 
[Train] epoch: 36/50, step: 182/250, loss: 1050.39087
[Training] W_grad_l2: 620.68335, U_grad_l2: 1751.09302, b_grad_l2: 114.54192 
[Train] epoch: 36/50, step: 183/250, loss: 917.96765
[Training] W_grad_l2: 1352.82581, U_grad_l2: 2230.63892, b_grad_l2: 307.94363 
[Train] epoch: 36/50, step: 184/250, loss: 331.55630
[Training] W_grad_l2: 3382.65674, U_grad_l2: 9267.42969, b_grad_l2: 1036.25208 
[Train] epoch: 37/50, step: 185/250, loss: 1486.62549
[Training] W_grad_l2: 12589.94922, U_grad_l2: 15315.43750, b_grad_l2: 3167.75195 
[Train] epoch: 37/50, step: 186/250, loss: 493.49661
[Training] W_grad_l2: 725.57233, U_grad_l2: 1581.73633, b_grad_l2: 182.29102 
[Train] epoch: 37/50, step: 187/250, loss: 473.93570
[Training] W_grad_l2: 643.09875, U_grad_l2: 1847.65881, b_grad_l2: 143.04935 
[Train] epoch: 37/50, step: 188/250, loss: 397.99960
[Training] W_grad_l2: 1004.22821, U_grad_l2: 1696.69275, b_grad_l2: 253.77754 
[Train] epoch: 37/50, step: 189/250, loss: 255.14909
[Training] W_grad_l2: 245.78725, U_grad_l2: 301.85455, b_grad_l2: 54.35164 
[Train] epoch: 38/50, step: 190/250, loss: 664.67700
[Training] W_grad_l2: 551.76349, U_grad_l2: 692.06317, b_grad_l2: 119.68429 
[Train] epoch: 38/50, step: 191/250, loss: 483.79172
[Training] W_grad_l2: 264.22537, U_grad_l2: 451.47888, b_grad_l2: 44.99414 
[Train] epoch: 38/50, step: 192/250, loss: 453.09241
[Training] W_grad_l2: 1599.77649, U_grad_l2: 4971.15039, b_grad_l2: 369.82593 
[Train] epoch: 38/50, step: 193/250, loss: 558.12817
[Training] W_grad_l2: 1138.55811, U_grad_l2: 1082.24731, b_grad_l2: 201.93129 
[Train] epoch: 38/50, step: 194/250, loss: 474.56961
[Training] W_grad_l2: 1029.87952, U_grad_l2: 2126.09912, b_grad_l2: 255.60767 
[Train] epoch: 39/50, step: 195/250, loss: 692.18201
[Training] W_grad_l2: 3407.49268, U_grad_l2: 3713.59131, b_grad_l2: 682.90039 
[Train] epoch: 39/50, step: 196/250, loss: 456.75909
[Training] W_grad_l2: 1061.97009, U_grad_l2: 958.72400, b_grad_l2: 245.48677 
[Train] epoch: 39/50, step: 197/250, loss: 356.80463
[Training] W_grad_l2: 277.93903, U_grad_l2: 625.47693, b_grad_l2: 58.49078 
[Train] epoch: 39/50, step: 198/250, loss: 655.48083
[Training] W_grad_l2: 145.61578, U_grad_l2: 454.72742, b_grad_l2: 31.42563 
[Train] epoch: 39/50, step: 199/250, loss: 1098.42236
[Training] W_grad_l2: 434.90698, U_grad_l2: 1626.81458, b_grad_l2: 102.41108 
[Train] epoch: 40/50, step: 200/250, loss: 517.69287
[Training] W_grad_l2: 674.11572, U_grad_l2: 922.70142, b_grad_l2: 163.90308 
[Evaluate]  dev score: 0.08000, dev loss: 664.59508
[Evaluate] best accuracy performence has been updated: 0.04000 --> 0.08000
[Train] epoch: 40/50, step: 201/250, loss: 916.76434
[Training] W_grad_l2: 500.09564, U_grad_l2: 1066.36438, b_grad_l2: 114.66583 
[Train] epoch: 40/50, step: 202/250, loss: 490.95898
[Training] W_grad_l2: 357.53867, U_grad_l2: 393.17645, b_grad_l2: 70.17840 
[Train] epoch: 40/50, step: 203/250, loss: 776.86938
[Training] W_grad_l2: 576.39240, U_grad_l2: 1664.72058, b_grad_l2: 122.90000 
[Train] epoch: 40/50, step: 204/250, loss: 494.17770
[Training] W_grad_l2: 658.90411, U_grad_l2: 1069.35730, b_grad_l2: 151.98103 
[Train] epoch: 41/50, step: 205/250, loss: 529.20032
[Training] W_grad_l2: 1539.95264, U_grad_l2: 1207.57300, b_grad_l2: 335.64594 
[Train] epoch: 41/50, step: 206/250, loss: 447.66733
[Training] W_grad_l2: 355.14087, U_grad_l2: 961.92853, b_grad_l2: 117.31064 
[Train] epoch: 41/50, step: 207/250, loss: 582.02844
[Training] W_grad_l2: 2163.38721, U_grad_l2: 1346.54028, b_grad_l2: 458.60947 
[Train] epoch: 41/50, step: 208/250, loss: 292.58459
[Training] W_grad_l2: 607.89136, U_grad_l2: 1636.53723, b_grad_l2: 147.76817 
[Train] epoch: 41/50, step: 209/250, loss: 315.05695
[Training] W_grad_l2: 123.52274, U_grad_l2: 332.91861, b_grad_l2: 28.69036 
[Train] epoch: 42/50, step: 210/250, loss: 1086.58203
[Training] W_grad_l2: 1313.23987, U_grad_l2: 1879.39819, b_grad_l2: 344.24359 
[Train] epoch: 42/50, step: 211/250, loss: 810.97430
[Training] W_grad_l2: 283.69452, U_grad_l2: 900.05188, b_grad_l2: 91.30367 
[Train] epoch: 42/50, step: 212/250, loss: 602.35590
[Training] W_grad_l2: 1166.33875, U_grad_l2: 1850.14136, b_grad_l2: 304.88800 
[Train] epoch: 42/50, step: 213/250, loss: 457.67737
[Training] W_grad_l2: 322.68567, U_grad_l2: 928.56299, b_grad_l2: 72.49948 
[Train] epoch: 42/50, step: 214/250, loss: 411.37503
[Training] W_grad_l2: 207.76605, U_grad_l2: 486.77728, b_grad_l2: 56.82355 
[Train] epoch: 43/50, step: 215/250, loss: 1502.86072
[Training] W_grad_l2: 902.42999, U_grad_l2: 966.63068, b_grad_l2: 290.39093 
[Train] epoch: 43/50, step: 216/250, loss: 963.19629
[Training] W_grad_l2: 363.04037, U_grad_l2: 532.01190, b_grad_l2: 98.25954 
[Train] epoch: 43/50, step: 217/250, loss: 728.39111
[Training] W_grad_l2: 794.58807, U_grad_l2: 2132.14551, b_grad_l2: 188.04114 
[Train] epoch: 43/50, step: 218/250, loss: 880.34436
[Training] W_grad_l2: 282.00415, U_grad_l2: 832.52838, b_grad_l2: 65.01365 
[Train] epoch: 43/50, step: 219/250, loss: 980.35223
[Training] W_grad_l2: 259.87369, U_grad_l2: 568.38208, b_grad_l2: 62.70670 
[Train] epoch: 44/50, step: 220/250, loss: 929.24286
[Training] W_grad_l2: 365.04095, U_grad_l2: 795.63385, b_grad_l2: 90.94580 
[Train] epoch: 44/50, step: 221/250, loss: 848.43701
[Training] W_grad_l2: 734.88483, U_grad_l2: 1718.30945, b_grad_l2: 274.49890 
[Train] epoch: 44/50, step: 222/250, loss: 699.81219
[Training] W_grad_l2: 775.85773, U_grad_l2: 822.70648, b_grad_l2: 132.96786 
[Train] epoch: 44/50, step: 223/250, loss: 648.74823
[Training] W_grad_l2: 327.53238, U_grad_l2: 1044.32629, b_grad_l2: 63.21377 
[Train] epoch: 44/50, step: 224/250, loss: 780.35846
[Training] W_grad_l2: 123.25651, U_grad_l2: 345.78784, b_grad_l2: 18.63264 
[Train] epoch: 45/50, step: 225/250, loss: 1588.99231
[Training] W_grad_l2: 2022.48999, U_grad_l2: 3378.66797, b_grad_l2: 475.10193 
[Train] epoch: 45/50, step: 226/250, loss: 989.68011
[Training] W_grad_l2: 994.48218, U_grad_l2: 1276.74829, b_grad_l2: 348.60364 
[Train] epoch: 45/50, step: 227/250, loss: 831.45886
[Training] W_grad_l2: 141.70856, U_grad_l2: 421.01620, b_grad_l2: 31.67917 
[Train] epoch: 45/50, step: 228/250, loss: 1178.81604
[Training] W_grad_l2: 341.65317, U_grad_l2: 1170.93591, b_grad_l2: 106.45114 
[Train] epoch: 45/50, step: 229/250, loss: 556.53815
[Training] W_grad_l2: 85.97270, U_grad_l2: 409.24551, b_grad_l2: 23.03140 
[Train] epoch: 46/50, step: 230/250, loss: 937.61121
[Training] W_grad_l2: 203.23364, U_grad_l2: 609.88489, b_grad_l2: 52.52168 
[Train] epoch: 46/50, step: 231/250, loss: 1178.28369
[Training] W_grad_l2: 835.79993, U_grad_l2: 1290.74634, b_grad_l2: 200.32069 
[Train] epoch: 46/50, step: 232/250, loss: 1007.69055
[Training] W_grad_l2: 1019.50134, U_grad_l2: 2834.08228, b_grad_l2: 262.34006 
[Train] epoch: 46/50, step: 233/250, loss: 1209.08569
[Training] W_grad_l2: 347.84598, U_grad_l2: 557.28381, b_grad_l2: 90.52836 
[Train] epoch: 46/50, step: 234/250, loss: 569.96545
[Training] W_grad_l2: 124.65165, U_grad_l2: 492.33954, b_grad_l2: 26.33843 
[Train] epoch: 47/50, step: 235/250, loss: 2601.68311
[Training] W_grad_l2: 636.66705, U_grad_l2: 3036.09888, b_grad_l2: 188.32919 
[Train] epoch: 47/50, step: 236/250, loss: 1102.72241
[Training] W_grad_l2: 455.17825, U_grad_l2: 804.82806, b_grad_l2: 103.20989 
[Train] epoch: 47/50, step: 237/250, loss: 997.77863
[Training] W_grad_l2: 313.78549, U_grad_l2: 1281.96167, b_grad_l2: 94.30825 
[Train] epoch: 47/50, step: 238/250, loss: 1197.67236
[Training] W_grad_l2: 330.73721, U_grad_l2: 535.15955, b_grad_l2: 63.83210 
[Train] epoch: 47/50, step: 239/250, loss: 1012.93890
[Training] W_grad_l2: 661.94958, U_grad_l2: 1089.16797, b_grad_l2: 138.40721 
[Train] epoch: 48/50, step: 240/250, loss: 634.36578
[Training] W_grad_l2: 206.10973, U_grad_l2: 632.88672, b_grad_l2: 53.38802 
[Train] epoch: 48/50, step: 241/250, loss: 859.29883
[Training] W_grad_l2: 2127.41016, U_grad_l2: 2900.84131, b_grad_l2: 459.68073 
[Train] epoch: 48/50, step: 242/250, loss: 592.71893
[Training] W_grad_l2: 270.27231, U_grad_l2: 1194.75854, b_grad_l2: 86.10850 
[Train] epoch: 48/50, step: 243/250, loss: 541.09607
[Training] W_grad_l2: 416.58301, U_grad_l2: 1402.15161, b_grad_l2: 143.72595 
[Train] epoch: 48/50, step: 244/250, loss: 365.77713
[Training] W_grad_l2: 62.12009, U_grad_l2: 249.83321, b_grad_l2: 20.34035 
[Train] epoch: 49/50, step: 245/250, loss: 902.62128
[Training] W_grad_l2: 170.96898, U_grad_l2: 750.60895, b_grad_l2: 50.27731 
[Train] epoch: 49/50, step: 246/250, loss: 1047.37573
[Training] W_grad_l2: 506.72690, U_grad_l2: 1303.75500, b_grad_l2: 136.10886 
[Train] epoch: 49/50, step: 247/250, loss: 576.65497
[Training] W_grad_l2: 4468.03076, U_grad_l2: 16250.94336, b_grad_l2: 1287.74719 
[Train] epoch: 49/50, step: 248/250, loss: 694.51215
[Training] W_grad_l2: 635.98553, U_grad_l2: 1135.26123, b_grad_l2: 128.67096 
[Train] epoch: 49/50, step: 249/250, loss: 394.72006
[Training] W_grad_l2: 806.89154, U_grad_l2: 3041.18774, b_grad_l2: 257.54990 
[Evaluate]  dev score: 0.02000, dev loss: 373.36994
[Train] Training done!

Next, we take the L2 norms of the gradients of the parameters W, U, and b recorded during training and plot them for inspection. The code is as follows:

def plot_grad(W_list, U_list, b_list, save_path, keep_steps=40):
    # Start plotting
    plt.figure()
    # By default keep only the first 40 steps
    steps = list(range(keep_steps))
    plt.plot(steps, W_list[:keep_steps], "r-", color="#e4007f", label="W_grad_l2")
    plt.plot(steps, U_list[:keep_steps], "-.", color="#f19ec2", label="U_grad_l2")
    plt.plot(steps, b_list[:keep_steps], "--", color="#000000", label="b_grad_l2")

    plt.xlabel("step")
    plt.ylabel("L2 Norm")
    plt.legend(loc="upper right")
    plt.savefig(save_path)
    print("image has been saved to: ", save_path)


save_path = f"./images/6.8.pdf"
plot_grad(W_list, U_list, b_list, save_path)

 The resulting plot (L2 norms of the gradients of W, U, and b over the first 40 training steps) is saved to ./images/6.8.pdf.

The figure shows the L2 norms of the gradients of W, U, and b during training. After the learning-rate and other adjustments above, the gradient norms spike sharply and then drop to almost 0. This is because Tanh is a Sigmoid-type function whose derivative is close to 0 in its saturated regions; the abrupt changes in the gradients push the parameter values to become very large or very small, the activations fall into the saturated regions, the gradients become 0, and the model can hardly continue training.
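A quick check of the saturation claim (my own snippet, not part of the experiment code): the derivative of tanh is $1-\tanh^2(x)$, so once the pre-activation values become large in magnitude, the gradient flowing through tanh is essentially zero:

import torch

x = torch.tensor([0.0, 2.0, 5.0, 20.0], requires_grad=True)
torch.tanh(x).sum().backward()
print(x.grad)   # roughly [1.0, 0.07, 0.0002, ~0]: the gradient vanishes as |x| grows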

Next, we test this model on the test set.

print(f"Evaluate SRN with data length {length}.")
# Load the best model saved during training
model_path = os.path.join(save_dir, f"srn_explosion_model_{length}.pdparams")
runner.load_model(model_path)

# Evaluate the model on the test set and get the prediction accuracy
score, _ = runner.evaluate(test_loader)
print(f"[SRN] length:{length}, Score: {score: .5f}")

The output is:

Evaluate SRN with data length 20.
[SRN] length:20, Score:  0.08000 

6.2.3 Using Gradient Clipping to Solve Gradient Explosion

Gradient clipping is a heuristic that can effectively deal with gradient explosion: when the magnitude of the gradient exceeds a threshold, it is truncated to a smaller value. There are generally two variants: clipping by value and clipping by norm. This experiment uses clipping by norm; a quick sketch contrasting the two follows.
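As a quick illustration of the two variants (my own sketch with a made-up gradient vector): clipping by value clamps each element independently, while clipping by norm rescales the whole vector so its direction is preserved:

import torch

g = torch.tensor([3.0, -4.0])                      # a made-up gradient, with L2 norm 5
b = 2.0                                            # clipping threshold

g_by_value = g.clamp(min=-b, max=b)                # clip by value -> tensor([ 2., -2.])

norm = torch.norm(g, p=2)
g_by_norm = g if norm <= b else g * (b / norm)     # clip by norm  -> tensor([ 1.2, -1.6]), norm = 2
print(g_by_value, g_by_norm)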

Clipping by norm truncates the gradient vector g according to its norm, ensuring that the norm of the gradient does not exceed a threshold b. The clipped gradient is

$$\hat{g} = \begin{cases} g, & \|g\|_2 \le b \\ \dfrac{b}{\|g\|_2}\, g, & \|g\|_2 > b \end{cases}$$

When the norm of g is not larger than the threshold b, g is left unchanged; otherwise g is rescaled.

In PyTorch, clipping by norm is done with torch.nn.utils.clip_grad_norm_. The original Paddle version of this experiment passes a ClipGradByNorm object to the optimizer, so that all gradients are clipped by default at every update; in this PyTorch implementation, clip_grad_norm_ is instead called explicitly on the model parameters between loss.backward() and optimizer.step() (see the modified Runner below).
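The usual PyTorch pattern looks like the following minimal sketch (a toy model and toy data of my own, not the experiment's code; max_norm=20 matches the value used in the Runner below):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2)
loss_fn = nn.CrossEntropyLoss(reduction="sum")

X = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

loss = loss_fn(model(X), y)
loss.backward()                                                          # 1. compute gradients
nn.utils.clip_grad_norm_(model.parameters(), max_norm=20, norm_type=2)   # 2. clip before stepping
optimizer.step()                                                         # 3. update with the clipped gradients
optimizer.zero_grad()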

I only worked this out after reading the blog post below.

梯度爆炸解决方案——梯度截断(gradient clip norm)_Mona-abc的博客-CSDN博客_梯度截断

After introducing gradient clipping, we observe the training behaviour of the model again. Here we re-instantiate the model and the optimizer, assemble the runner, and train. The code is as follows:

# Clear the gradient-norm lists
W_list.clear()
U_list.clear()
b_list.clear()
# Instantiate the model
base_model = SRN(input_size, hidden_size)
model = Model_RNN4SeqClass(base_model, num_digits, input_size, hidden_size, num_classes)

# Instantiate the optimizer (the clipping itself happens inside RunnerV3.train via clip_grad_norm_)

optimizer = torch.optim.SGD(lr=lr, params=model.parameters())
# Define the evaluation metric
metric = Accuracy()
# Define the loss function
loss_fn = nn.CrossEntropyLoss(reduction="sum")

# Instantiate the Runner
runner = RunnerV3(model, optimizer, loss_fn, metric)

# Train the model
model_save_path = os.path.join(save_dir, f"srn_fix_explosion_model_{length}.pdparams")
runner.train(train_loader, dev_loader, num_epochs=num_epochs, eval_steps=100, log_steps=1,
             save_path=model_save_path, custom_print_log=custom_print_log)

The main change is in the RunnerV3 class:

class RunnerV3(object):
    def __init__(self, model, optimizer, loss_fn, metric, **kwargs):
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.metric = metric  # only used to compute the evaluation metric

        # Record how the evaluation metric changes during training
        self.dev_scores = []

        # Record how the loss changes during training
        self.train_epoch_losses = []  # one loss recorded per epoch
        self.train_step_losses = []  # one loss recorded per step
        self.dev_losses = []

        # Record the best metric seen so far
        self.best_score = 0

    def train(self, train_loader, dev_loader=None, **kwargs):
        # Switch the model to training mode
        self.model.train()

        # Number of training epochs, defaulting to 0 if not given
        num_epochs = kwargs.get("num_epochs", 0)
        # Logging frequency, defaulting to 100 if not given
        log_steps = kwargs.get("log_steps", 100)
        # Evaluation frequency
        eval_steps = kwargs.get("eval_steps", 0)

        # Model save path, defaulting to "best_model.pdparams" if not given
        save_path = kwargs.get("save_path", "best_model.pdparams")

        custom_print_log = kwargs.get("custom_print_log", None)

        # Total number of training steps
        num_training_steps = num_epochs * len(train_loader)

        if eval_steps:
            if self.metric is None:
                raise RuntimeError('Error: Metric can not be None!')
            if dev_loader is None:
                raise RuntimeError('Error: dev_loader can not be None!')

        # Number of steps run so far
        global_step = 0

        # Train for num_epochs epochs
        for epoch in range(num_epochs):
            # Accumulate the training loss
            total_loss = 0
            for step, data in enumerate(train_loader):
                X, y = data
                # Get the model predictions
                logits = self.model(X)
                loss = self.loss_fn(logits, y.long())  # reduction="sum" here, so the batch losses are summed
                total_loss += loss

                # Save the loss of every step during training
                self.train_step_losses.append((global_step, loss.item()))

                if log_steps and global_step % log_steps == 0:
                    print(
                        f"[Train] epoch: {epoch}/{num_epochs}, step: {global_step}/{num_training_steps}, loss: {loss.item():.5f}")

                # Backpropagate to compute the gradient of every parameter
                loss.backward()

                if custom_print_log:
                    custom_print_log(self)
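                # Clip by norm: rescale all parameter gradients so that their global
                # L2 norm is at most max_norm; this is the key change in this Runner.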
                nn.utils.clip_grad_norm_(parameters=self.model.parameters(), max_norm=20, norm_type=2)
                # Mini-batch gradient descent parameter update
                self.optimizer.step()
                # Reset the gradients to zero
                self.optimizer.zero_grad()

                # Check whether evaluation is needed
                if eval_steps > 0 and global_step > 0 and \
                        (global_step % eval_steps == 0 or global_step == (num_training_steps - 1)):

                    dev_score, dev_loss = self.evaluate(dev_loader, global_step=global_step)
                    print(f"[Evaluate]  dev score: {dev_score:.5f}, dev loss: {dev_loss:.5f}")

                    # Switch the model back to training mode
                    self.model.train()

                    # If the current metric is the best so far, save the model
                    if dev_score > self.best_score:
                        self.save_model(save_path)
                        print(
                            f"[Evaluate] best accuracy performence has been updated: {self.best_score:.5f} --> {dev_score:.5f}")
                        self.best_score = dev_score

                global_step += 1

            # Accumulated training loss for the current epoch
            trn_loss = (total_loss / len(train_loader)).item()
            # Save the epoch-level training loss
            self.train_epoch_losses.append(trn_loss)

        print("[Train] Training done!")

    # Model evaluation; torch.no_grad() disables gradient computation and storage
    @torch.no_grad()
    def evaluate(self, dev_loader, **kwargs):
        assert self.metric is not None

        # Switch the model to evaluation mode
        self.model.eval()

        global_step = kwargs.get("global_step", -1)

        # Accumulate the loss on the validation set
        total_loss = 0

        # Reset the metric
        self.metric.reset()

        # Iterate over the batches of the validation set
        for batch_id, data in enumerate(dev_loader):
            X, y = data

            # Compute the model output
            logits = self.model(X)

            # Compute the loss
            loss = self.loss_fn(logits, y.long()).item()
            # Accumulate the loss
            total_loss += loss

            # Update the metric
            self.metric.update(logits, y)

        dev_loss = (total_loss / len(dev_loader))
        dev_score = self.metric.accumulate()

        # Record the validation loss
        if global_step != -1:
            self.dev_losses.append((global_step, dev_loss))
            self.dev_scores.append(dev_score)

        return dev_score, dev_loss

    # Prediction; torch.no_grad() disables gradient computation and storage
    @torch.no_grad()
    def predict(self, x, **kwargs):
        # Switch the model to evaluation mode
        self.model.eval()
        # Run the forward pass to get the predictions
        logits = self.model(x)
        return logits

    def save_model(self, save_path):
        torch.save(self.model.state_dict(), save_path)

    def load_model(self, model_path):
        state_dict = torch.load(model_path)
        self.model.load_state_dict(state_dict)

The output is:

[Train] epoch: 0/50, step: 0/250, loss: 189.36307
[Training] W_grad_l2: 188.90048, U_grad_l2: 0.00000, b_grad_l2: 164.87923 
[Train] epoch: 0/50, step: 1/250, loss: 1522.79529
[Training] W_grad_l2: 572.87689, U_grad_l2: 17477.95312, b_grad_l2: 528.49347 
[Train] epoch: 0/50, step: 2/250, loss: 980.12994
[Training] W_grad_l2: 347.11035, U_grad_l2: 12012.96875, b_grad_l2: 327.45187 
[Train] epoch: 0/50, step: 3/250, loss: 1272.40405
[Training] W_grad_l2: 84.58318, U_grad_l2: 3547.21411, b_grad_l2: 93.16024 
[Train] epoch: 0/50, step: 4/250, loss: 1490.01770
[Training] W_grad_l2: 251.71008, U_grad_l2: 9746.81055, b_grad_l2: 239.08662 
[Train] epoch: 1/50, step: 5/250, loss: 4520.10010
[Training] W_grad_l2: 1049.49316, U_grad_l2: 12682.69727, b_grad_l2: 1194.95288 
[Train] epoch: 1/50, step: 6/250, loss: 2660.87061
[Training] W_grad_l2: 252.72011, U_grad_l2: 3142.59619, b_grad_l2: 344.44214 
[Train] epoch: 1/50, step: 7/250, loss: 2928.97192
[Training] W_grad_l2: 284.66852, U_grad_l2: 1101.69592, b_grad_l2: 366.87042 
[Train] epoch: 1/50, step: 8/250, loss: 4419.72363
[Training] W_grad_l2: 321.09265, U_grad_l2: 3242.96558, b_grad_l2: 367.74899 
[Train] epoch: 1/50, step: 9/250, loss: 3858.53662
[Training] W_grad_l2: 1105.81604, U_grad_l2: 1614.71899, b_grad_l2: 1269.29199 
[Train] epoch: 2/50, step: 10/250, loss: 12266.65234
[Training] W_grad_l2: 1140.22498, U_grad_l2: 18303.07812, b_grad_l2: 1077.73987 
[Train] epoch: 2/50, step: 11/250, loss: 5516.11035
[Training] W_grad_l2: 390.40225, U_grad_l2: 9656.58301, b_grad_l2: 421.51816 
[Train] epoch: 2/50, step: 12/250, loss: 6241.68311
[Training] W_grad_l2: 442.70615, U_grad_l2: 2373.66479, b_grad_l2: 538.69794 
[Train] epoch: 2/50, step: 13/250, loss: 6784.60498
[Training] W_grad_l2: 829.30646, U_grad_l2: 3104.70166, b_grad_l2: 1086.21655 
[Train] epoch: 2/50, step: 14/250, loss: 4143.51123
[Training] W_grad_l2: 650.54749, U_grad_l2: 5596.41699, b_grad_l2: 613.97760 
[Train] epoch: 3/50, step: 15/250, loss: 7027.09521
[Training] W_grad_l2: 497.12949, U_grad_l2: 2973.22461, b_grad_l2: 569.08423 
[Train] epoch: 3/50, step: 16/250, loss: 1666.33496
[Training] W_grad_l2: 557.14868, U_grad_l2: 4522.83789, b_grad_l2: 617.33716 
[Train] epoch: 3/50, step: 17/250, loss: 1601.46021
[Training] W_grad_l2: 3048.60059, U_grad_l2: 13708.39258, b_grad_l2: 2391.40137 
[Train] epoch: 3/50, step: 18/250, loss: 992.77063
[Training] W_grad_l2: 624.31653, U_grad_l2: 973.57056, b_grad_l2: 218.27995 
[Train] epoch: 3/50, step: 19/250, loss: 944.33929
[Training] W_grad_l2: 896.68103, U_grad_l2: 771.59845, b_grad_l2: 456.20129 
[Train] epoch: 4/50, step: 20/250, loss: 837.69733
[Training] W_grad_l2: 275.13843, U_grad_l2: 1099.70874, b_grad_l2: 267.34207 
[Train] epoch: 4/50, step: 21/250, loss: 2979.97388
[Training] W_grad_l2: 515.46661, U_grad_l2: 1263.74084, b_grad_l2: 550.19543 
[Train] epoch: 4/50, step: 22/250, loss: 2346.96582
[Training] W_grad_l2: 510.74136, U_grad_l2: 1095.97375, b_grad_l2: 437.59647 
[Train] epoch: 4/50, step: 23/250, loss: 1223.13525
[Training] W_grad_l2: 1352.86731, U_grad_l2: 1809.89050, b_grad_l2: 772.61749 
[Train] epoch: 4/50, step: 24/250, loss: 892.03735
[Training] W_grad_l2: 442.12167, U_grad_l2: 1229.37170, b_grad_l2: 179.97942 
[Train] epoch: 5/50, step: 25/250, loss: 1261.60486
[Training] W_grad_l2: 766.11554, U_grad_l2: 1344.75708, b_grad_l2: 441.69427 
[Train] epoch: 5/50, step: 26/250, loss: 859.64691
[Training] W_grad_l2: 1248.27563, U_grad_l2: 849.87402, b_grad_l2: 776.39404 
[Train] epoch: 5/50, step: 27/250, loss: 1038.66919
[Training] W_grad_l2: 459.13583, U_grad_l2: 992.60913, b_grad_l2: 179.29964 
[Train] epoch: 5/50, step: 28/250, loss: 1343.06018
[Training] W_grad_l2: 694.43402, U_grad_l2: 1751.71545, b_grad_l2: 381.59579 
[Train] epoch: 5/50, step: 29/250, loss: 431.86508
[Training] W_grad_l2: 471.34811, U_grad_l2: 835.97241, b_grad_l2: 133.21748 
[Train] epoch: 6/50, step: 30/250, loss: 642.76025
[Training] W_grad_l2: 357.87015, U_grad_l2: 848.70959, b_grad_l2: 155.08069 
[Train] epoch: 6/50, step: 31/250, loss: 763.72748
[Training] W_grad_l2: 416.83087, U_grad_l2: 1479.77283, b_grad_l2: 225.63460 
[Train] epoch: 6/50, step: 32/250, loss: 610.18561
[Training] W_grad_l2: 271.73636, U_grad_l2: 491.29807, b_grad_l2: 93.34225 
[Train] epoch: 6/50, step: 33/250, loss: 750.96454
[Training] W_grad_l2: 1078.15417, U_grad_l2: 1429.87708, b_grad_l2: 374.56485 
[Train] epoch: 6/50, step: 34/250, loss: 576.46198
[Training] W_grad_l2: 816.87360, U_grad_l2: 992.95349, b_grad_l2: 401.42892 
[Train] epoch: 7/50, step: 35/250, loss: 408.86838
[Training] W_grad_l2: 840.02795, U_grad_l2: 847.73511, b_grad_l2: 590.32019 
[Train] epoch: 7/50, step: 36/250, loss: 318.39847
[Training] W_grad_l2: 175.33815, U_grad_l2: 345.37210, b_grad_l2: 60.79327 
[Train] epoch: 7/50, step: 37/250, loss: 623.74860
[Training] W_grad_l2: 381.74945, U_grad_l2: 701.66730, b_grad_l2: 148.51936 
[Train] epoch: 7/50, step: 38/250, loss: 959.64984
[Training] W_grad_l2: 354.04495, U_grad_l2: 494.58298, b_grad_l2: 106.81690 
[Train] epoch: 7/50, step: 39/250, loss: 1325.85327
[Training] W_grad_l2: 261.22116, U_grad_l2: 972.99634, b_grad_l2: 78.58305 
[Train] epoch: 8/50, step: 40/250, loss: 1321.12195
[Training] W_grad_l2: 897.64862, U_grad_l2: 1920.28027, b_grad_l2: 303.41150 
[Train] epoch: 8/50, step: 41/250, loss: 495.00229
[Training] W_grad_l2: 291.02243, U_grad_l2: 674.58759, b_grad_l2: 110.99571 
[Train] epoch: 8/50, step: 42/250, loss: 646.42053
[Training] W_grad_l2: 362.43802, U_grad_l2: 915.62982, b_grad_l2: 113.49941 
[Train] epoch: 8/50, step: 43/250, loss: 783.83344
[Training] W_grad_l2: 2410.09351, U_grad_l2: 2828.09546, b_grad_l2: 630.88885 
[Train] epoch: 8/50, step: 44/250, loss: 492.98846
[Training] W_grad_l2: 1835.62952, U_grad_l2: 2201.46021, b_grad_l2: 354.37305 
[Train] epoch: 9/50, step: 45/250, loss: 1185.73621
[Training] W_grad_l2: 1281.21948, U_grad_l2: 3972.60400, b_grad_l2: 397.88611 
[Train] epoch: 9/50, step: 46/250, loss: 931.98547
[Training] W_grad_l2: 2129.99219, U_grad_l2: 3940.50366, b_grad_l2: 824.40295 
[Train] epoch: 9/50, step: 47/250, loss: 357.81161
[Training] W_grad_l2: 1043.86035, U_grad_l2: 1399.87634, b_grad_l2: 390.10168 
[Train] epoch: 9/50, step: 48/250, loss: 561.49701
[Training] W_grad_l2: 630.01825, U_grad_l2: 1975.36084, b_grad_l2: 390.42245 
[Train] epoch: 9/50, step: 49/250, loss: 307.16202
[Training] W_grad_l2: 1371.97339, U_grad_l2: 1422.91626, b_grad_l2: 392.13168 
[Train] epoch: 10/50, step: 50/250, loss: 435.67426
[Training] W_grad_l2: 1120.65552, U_grad_l2: 2910.19507, b_grad_l2: 418.55988 
[Train] epoch: 10/50, step: 51/250, loss: 451.01416
[Training] W_grad_l2: 631.85095, U_grad_l2: 1046.77100, b_grad_l2: 210.77600 
[Train] epoch: 10/50, step: 52/250, loss: 431.12061
[Training] W_grad_l2: 430.12427, U_grad_l2: 617.39911, b_grad_l2: 112.01677 
[Train] epoch: 10/50, step: 53/250, loss: 333.56577
[Training] W_grad_l2: 444.70947, U_grad_l2: 970.75165, b_grad_l2: 125.75079 
[Train] epoch: 10/50, step: 54/250, loss: 261.05664
[Training] W_grad_l2: 884.73035, U_grad_l2: 1576.98901, b_grad_l2: 156.08833 
[Train] epoch: 11/50, step: 55/250, loss: 301.29266
[Training] W_grad_l2: 1163.94836, U_grad_l2: 1592.18274, b_grad_l2: 822.02435 
[Train] epoch: 11/50, step: 56/250, loss: 325.51154
[Training] W_grad_l2: 884.57166, U_grad_l2: 1385.15918, b_grad_l2: 273.91116 
[Train] epoch: 11/50, step: 57/250, loss: 309.99115
[Training] W_grad_l2: 1364.75574, U_grad_l2: 1223.24353, b_grad_l2: 297.79565 
[Train] epoch: 11/50, step: 58/250, loss: 320.26074
[Training] W_grad_l2: 383.14551, U_grad_l2: 605.36035, b_grad_l2: 105.01154 
[Train] epoch: 11/50, step: 59/250, loss: 321.23325
[Training] W_grad_l2: 1055.91772, U_grad_l2: 1353.23889, b_grad_l2: 351.45932 
[Train] epoch: 12/50, step: 60/250, loss: 347.12756
[Training] W_grad_l2: 1109.56494, U_grad_l2: 1151.83374, b_grad_l2: 411.72925 
[Train] epoch: 12/50, step: 61/250, loss: 522.52197
[Training] W_grad_l2: 2338.16943, U_grad_l2: 3877.00073, b_grad_l2: 663.73419 
[Train] epoch: 12/50, step: 62/250, loss: 625.02795
[Training] W_grad_l2: 647.82098, U_grad_l2: 1560.67920, b_grad_l2: 190.61250 
[Train] epoch: 12/50, step: 63/250, loss: 445.00262
[Training] W_grad_l2: 11735.87695, U_grad_l2: 11938.33789, b_grad_l2: 2402.81079 
[Train] epoch: 12/50, step: 64/250, loss: 329.32690
[Training] W_grad_l2: 490.54083, U_grad_l2: 857.64630, b_grad_l2: 170.38919 
[Train] epoch: 13/50, step: 65/250, loss: 530.46021
[Training] W_grad_l2: 499.90485, U_grad_l2: 1148.44629, b_grad_l2: 185.50540 
[Train] epoch: 13/50, step: 66/250, loss: 505.52908
[Training] W_grad_l2: 460.51996, U_grad_l2: 955.07385, b_grad_l2: 150.93222 
[Train] epoch: 13/50, step: 67/250, loss: 392.76013
[Training] W_grad_l2: 1191.40222, U_grad_l2: 1742.69836, b_grad_l2: 338.97672 
[Train] epoch: 13/50, step: 68/250, loss: 328.38351
[Training] W_grad_l2: 1172.47449, U_grad_l2: 2220.81934, b_grad_l2: 458.30286 
[Train] epoch: 13/50, step: 69/250, loss: 225.03595
[Training] W_grad_l2: 2012.85339, U_grad_l2: 1286.79382, b_grad_l2: 312.13992 
[Train] epoch: 14/50, step: 70/250, loss: 341.11618
[Training] W_grad_l2: 682.36084, U_grad_l2: 1259.90881, b_grad_l2: 251.71164 
[Train] epoch: 14/50, step: 71/250, loss: 437.78250
[Training] W_grad_l2: 9816.47559, U_grad_l2: 5576.03711, b_grad_l2: 3007.36011 
[Train] epoch: 14/50, step: 72/250, loss: 454.46466
[Training] W_grad_l2: 507.10818, U_grad_l2: 895.26215, b_grad_l2: 250.49866 
[Train] epoch: 14/50, step: 73/250, loss: 477.09833
[Training] W_grad_l2: 479.83920, U_grad_l2: 870.78534, b_grad_l2: 164.62653 
[Train] epoch: 14/50, step: 74/250, loss: 320.68100
[Training] W_grad_l2: 8889.80371, U_grad_l2: 3818.32422, b_grad_l2: 1443.49329 
[Train] epoch: 15/50, step: 75/250, loss: 439.47720
[Training] W_grad_l2: 388.44177, U_grad_l2: 816.78430, b_grad_l2: 166.33031 
[Train] epoch: 15/50, step: 76/250, loss: 416.03751
[Training] W_grad_l2: 3873.28882, U_grad_l2: 3330.40161, b_grad_l2: 1073.65759 
[Train] epoch: 15/50, step: 77/250, loss: 350.46719
[Training] W_grad_l2: 1250.50757, U_grad_l2: 2449.13428, b_grad_l2: 420.98492 
[Train] epoch: 15/50, step: 78/250, loss: 415.62152
[Training] W_grad_l2: 1880.75427, U_grad_l2: 3043.64233, b_grad_l2: 488.60657 
[Train] epoch: 15/50, step: 79/250, loss: 231.55853
[Training] W_grad_l2: 502.85492, U_grad_l2: 696.34729, b_grad_l2: 136.70578 
[Train] epoch: 16/50, step: 80/250, loss: 265.27301
[Training] W_grad_l2: 1928.75439, U_grad_l2: 2568.63525, b_grad_l2: 872.69519 
[Train] epoch: 16/50, step: 81/250, loss: 342.11905
[Training] W_grad_l2: 712.84625, U_grad_l2: 1544.50488, b_grad_l2: 190.38347 
[Train] epoch: 16/50, step: 82/250, loss: 340.48602
[Training] W_grad_l2: 770.04791, U_grad_l2: 2056.08350, b_grad_l2: 265.32230 
[Train] epoch: 16/50, step: 83/250, loss: 475.90088
[Training] W_grad_l2: 654.60889, U_grad_l2: 1605.66882, b_grad_l2: 247.21423 
[Train] epoch: 16/50, step: 84/250, loss: 328.02310
[Training] W_grad_l2: 592.99945, U_grad_l2: 1392.49207, b_grad_l2: 212.01782 
[Train] epoch: 17/50, step: 85/250, loss: 457.30399
[Training] W_grad_l2: 1015.43457, U_grad_l2: 2103.66919, b_grad_l2: 343.93472 
[Train] epoch: 17/50, step: 86/250, loss: 584.84686
[Training] W_grad_l2: 1632.69116, U_grad_l2: 2657.37671, b_grad_l2: 525.40015 
[Train] epoch: 17/50, step: 87/250, loss: 306.08670
[Training] W_grad_l2: 230.83795, U_grad_l2: 774.36285, b_grad_l2: 114.20924 
[Train] epoch: 17/50, step: 88/250, loss: 301.20483
[Training] W_grad_l2: 1514.28174, U_grad_l2: 2153.44971, b_grad_l2: 378.23361 
[Train] epoch: 17/50, step: 89/250, loss: 181.70825
[Training] W_grad_l2: 5305.16797, U_grad_l2: 7646.03076, b_grad_l2: 1358.83899 
[Train] epoch: 18/50, step: 90/250, loss: 424.09167
[Training] W_grad_l2: 2245.40601, U_grad_l2: 1772.69958, b_grad_l2: 1106.56238 
[Train] epoch: 18/50, step: 91/250, loss: 266.00021
[Training] W_grad_l2: 6058.23877, U_grad_l2: 4670.12842, b_grad_l2: 1402.85632 
[Train] epoch: 18/50, step: 92/250, loss: 228.72795
[Training] W_grad_l2: 1393.74280, U_grad_l2: 2171.37720, b_grad_l2: 378.52805 
[Train] epoch: 18/50, step: 93/250, loss: 254.28223
[Training] W_grad_l2: 724.79083, U_grad_l2: 1804.95642, b_grad_l2: 376.98587 
[Train] epoch: 18/50, step: 94/250, loss: 297.10388
[Training] W_grad_l2: 2193.02930, U_grad_l2: 4119.88623, b_grad_l2: 650.30157 
[Train] epoch: 19/50, step: 95/250, loss: 836.03900
[Training] W_grad_l2: 808.32141, U_grad_l2: 2274.81323, b_grad_l2: 171.86809 
[Train] epoch: 19/50, step: 96/250, loss: 934.97583
[Training] W_grad_l2: 266.54001, U_grad_l2: 928.86749, b_grad_l2: 70.20853 
[Train] epoch: 19/50, step: 97/250, loss: 3959.11523
[Training] W_grad_l2: 171.56589, U_grad_l2: 1631.96863, b_grad_l2: 46.19309 
[Train] epoch: 19/50, step: 98/250, loss: 3082.88965
[Training] W_grad_l2: 1045.38806, U_grad_l2: 4869.99609, b_grad_l2: 256.80939 
[Train] epoch: 19/50, step: 99/250, loss: 1118.75830
[Training] W_grad_l2: 841.34796, U_grad_l2: 2438.64185, b_grad_l2: 183.58231 
[Train] epoch: 20/50, step: 100/250, loss: 783.61786
[Training] W_grad_l2: 2086.98413, U_grad_l2: 5362.63379, b_grad_l2: 549.82593 
[Evaluate]  dev score: 0.06000, dev loss: 729.14609
[Evaluate] best accuracy performence has been updated: 0.00000 --> 0.06000
[Train] epoch: 20/50, step: 101/250, loss: 834.51752
[Training] W_grad_l2: 4052.14722, U_grad_l2: 6105.84863, b_grad_l2: 1238.04822 
[Train] epoch: 20/50, step: 102/250, loss: 1000.99615
[Training] W_grad_l2: 3680.18286, U_grad_l2: 4877.34521, b_grad_l2: 1250.05884 
[Train] epoch: 20/50, step: 103/250, loss: 848.61108
[Training] W_grad_l2: 868.32739, U_grad_l2: 2731.55347, b_grad_l2: 271.42313 
[Train] epoch: 20/50, step: 104/250, loss: 295.02939
[Training] W_grad_l2: 1773.42297, U_grad_l2: 5401.20557, b_grad_l2: 666.20490 
[Train] epoch: 21/50, step: 105/250, loss: 1081.06799
[Training] W_grad_l2: 4195.19043, U_grad_l2: 11158.38184, b_grad_l2: 1797.96143 
[Train] epoch: 21/50, step: 106/250, loss: 1035.65356
[Training] W_grad_l2: 3827.70654, U_grad_l2: 7535.17480, b_grad_l2: 1753.31409 
[Train] epoch: 21/50, step: 107/250, loss: 1421.60278
[Training] W_grad_l2: 3205.13647, U_grad_l2: 4613.91309, b_grad_l2: 777.04303 
[Train] epoch: 21/50, step: 108/250, loss: 733.08929
[Training] W_grad_l2: 2025.49878, U_grad_l2: 4495.05078, b_grad_l2: 675.78442 
[Train] epoch: 21/50, step: 109/250, loss: 384.24170
[Training] W_grad_l2: 750.76672, U_grad_l2: 1446.95361, b_grad_l2: 186.63704 
[Train] epoch: 22/50, step: 110/250, loss: 463.11938
[Training] W_grad_l2: 9213.70117, U_grad_l2: 30806.29492, b_grad_l2: 9807.72266 
[Train] epoch: 22/50, step: 111/250, loss: 556.48407
[Training] W_grad_l2: 15675.14648, U_grad_l2: 12304.25879, b_grad_l2: 2548.72437 
[Train] epoch: 22/50, step: 112/250, loss: 557.99725
[Training] W_grad_l2: 4947.75098, U_grad_l2: 6313.15820, b_grad_l2: 1021.21271 
[Train] epoch: 22/50, step: 113/250, loss: 930.20044
[Training] W_grad_l2: 2899.76953, U_grad_l2: 6014.67578, b_grad_l2: 734.77649 
[Train] epoch: 22/50, step: 114/250, loss: 412.54425
[Training] W_grad_l2: 953.08081, U_grad_l2: 1254.29968, b_grad_l2: 249.39722 
[Train] epoch: 23/50, step: 115/250, loss: 390.96289
[Training] W_grad_l2: 1457.48755, U_grad_l2: 1550.14111, b_grad_l2: 441.43640 
[Train] epoch: 23/50, step: 116/250, loss: 393.75330
[Training] W_grad_l2: 2907.96289, U_grad_l2: 3614.89722, b_grad_l2: 732.69128 
[Train] epoch: 23/50, step: 117/250, loss: 593.77112
[Training] W_grad_l2: 5132.90137, U_grad_l2: 7041.01025, b_grad_l2: 1144.95825 
[Train] epoch: 23/50, step: 118/250, loss: 511.28061
[Training] W_grad_l2: 6419.02246, U_grad_l2: 7725.00391, b_grad_l2: 1574.99353 
[Train] epoch: 23/50, step: 119/250, loss: 344.84210
[Training] W_grad_l2: 2356.19043, U_grad_l2: 3467.42773, b_grad_l2: 601.19165 
[Train] epoch: 24/50, step: 120/250, loss: 351.33322
[Training] W_grad_l2: 3456.03857, U_grad_l2: 3023.80127, b_grad_l2: 1136.33154 
[Train] epoch: 24/50, step: 121/250, loss: 883.39728
[Training] W_grad_l2: 8564.14941, U_grad_l2: 7824.82227, b_grad_l2: 2049.07275 
[Train] epoch: 24/50, step: 122/250, loss: 610.90997
[Training] W_grad_l2: 13495.65039, U_grad_l2: 12763.67871, b_grad_l2: 3395.93872 
[Train] epoch: 24/50, step: 123/250, loss: 588.45862
[Training] W_grad_l2: 2079.17065, U_grad_l2: 3754.22192, b_grad_l2: 439.06067 
[Train] epoch: 24/50, step: 124/250, loss: 547.74988
[Training] W_grad_l2: 2563.75195, U_grad_l2: 4413.61523, b_grad_l2: 583.89886 
[Train] epoch: 25/50, step: 125/250, loss: 892.92017
[Training] W_grad_l2: 2375.56299, U_grad_l2: 4169.04150, b_grad_l2: 593.07196 
[Train] epoch: 25/50, step: 126/250, loss: 433.68692
[Training] W_grad_l2: 1310.39587, U_grad_l2: 2038.42566, b_grad_l2: 338.32135 
[Train] epoch: 25/50, step: 127/250, loss: 343.06726
[Training] W_grad_l2: 833.14905, U_grad_l2: 1236.34790, b_grad_l2: 112.01233 
[Train] epoch: 25/50, step: 128/250, loss: 365.95389
[Training] W_grad_l2: 1232.13660, U_grad_l2: 1668.05701, b_grad_l2: 296.61862 
[Train] epoch: 25/50, step: 129/250, loss: 424.92767
[Training] W_grad_l2: 2988.01245, U_grad_l2: 4813.54443, b_grad_l2: 772.88452 
[Train] epoch: 26/50, step: 130/250, loss: 363.85394
[Training] W_grad_l2: 4720.67529, U_grad_l2: 8393.06445, b_grad_l2: 1185.61707 
[Train] epoch: 26/50, step: 131/250, loss: 373.37631
[Training] W_grad_l2: 11252.18848, U_grad_l2: 14392.63672, b_grad_l2: 2901.49341 
[Train] epoch: 26/50, step: 132/250, loss: 350.59912
[Training] W_grad_l2: 5037.97656, U_grad_l2: 4402.11035, b_grad_l2: 2820.11377 
[Train] epoch: 26/50, step: 133/250, loss: 342.09393
[Training] W_grad_l2: 2202.02759, U_grad_l2: 3207.37280, b_grad_l2: 540.12958 
[Train] epoch: 26/50, step: 134/250, loss: 243.01509
[Training] W_grad_l2: 2293.21826, U_grad_l2: 1972.47363, b_grad_l2: 440.56827 
[Train] epoch: 27/50, step: 135/250, loss: 308.92068
[Training] W_grad_l2: 5342.33154, U_grad_l2: 5181.13770, b_grad_l2: 3831.08545 
[Train] epoch: 27/50, step: 136/250, loss: 608.23785
[Training] W_grad_l2: 2959.97510, U_grad_l2: 5187.85645, b_grad_l2: 553.98987 
[Train] epoch: 27/50, step: 137/250, loss: 491.21863
[Training] W_grad_l2: 8146.84473, U_grad_l2: 10646.31055, b_grad_l2: 1721.16589 
[Train] epoch: 27/50, step: 138/250, loss: 501.31677
[Training] W_grad_l2: 16118.64453, U_grad_l2: 11699.57422, b_grad_l2: 3203.51831 
[Train] epoch: 27/50, step: 139/250, loss: 417.86887
[Training] W_grad_l2: 3738.27539, U_grad_l2: 4804.35156, b_grad_l2: 692.92957 
[Train] epoch: 28/50, step: 140/250, loss: 341.29315
[Training] W_grad_l2: 4686.30420, U_grad_l2: 4409.53516, b_grad_l2: 1367.39734 
[Train] epoch: 28/50, step: 141/250, loss: 459.89752
[Training] W_grad_l2: 4610.94092, U_grad_l2: 6874.78076, b_grad_l2: 1416.53406 
[Train] epoch: 28/50, step: 142/250, loss: 442.57227
[Training] W_grad_l2: 7915.91504, U_grad_l2: 14064.25293, b_grad_l2: 2248.35229 
[Train] epoch: 28/50, step: 143/250, loss: 1104.99646
[Training] W_grad_l2: 10902.37012, U_grad_l2: 6839.78271, b_grad_l2: 1829.09338 
[Train] epoch: 28/50, step: 144/250, loss: 271.93246
[Training] W_grad_l2: 7070.12305, U_grad_l2: 7749.60742, b_grad_l2: 1315.96045 
[Train] epoch: 29/50, step: 145/250, loss: 516.92145
[Training] W_grad_l2: 12919.32520, U_grad_l2: 11982.35059, b_grad_l2: 3066.83984 
[Train] epoch: 29/50, step: 146/250, loss: 503.67365
[Training] W_grad_l2: 12150.25879, U_grad_l2: 16830.65234, b_grad_l2: 2042.62439 
[Train] epoch: 29/50, step: 147/250, loss: 544.84601
[Training] W_grad_l2: 3174.21558, U_grad_l2: 2682.51709, b_grad_l2: 738.63794 
[Train] epoch: 29/50, step: 148/250, loss: 632.58386
[Training] W_grad_l2: 41626.60156, U_grad_l2: 58968.43359, b_grad_l2: 9514.36426 
[Train] epoch: 29/50, step: 149/250, loss: 628.44403
[Training] W_grad_l2: 25050.50000, U_grad_l2: 14683.47363, b_grad_l2: 3864.58496 
[Train] epoch: 30/50, step: 150/250, loss: 519.55743
[Training] W_grad_l2: 5653.96436, U_grad_l2: 6220.02588, b_grad_l2: 1237.43445 
[Train] epoch: 30/50, step: 151/250, loss: 781.32373
[Training] W_grad_l2: 7919.57861, U_grad_l2: 11023.53223, b_grad_l2: 1561.61475 
[Train] epoch: 30/50, step: 152/250, loss: 703.26733
[Training] W_grad_l2: 5268.80176, U_grad_l2: 4890.64355, b_grad_l2: 941.71307 
[Train] epoch: 30/50, step: 153/250, loss: 773.17853
[Training] W_grad_l2: 17962.17188, U_grad_l2: 27861.34375, b_grad_l2: 3725.93359 
[Train] epoch: 30/50, step: 154/250, loss: 720.95001
[Training] W_grad_l2: 40523.07031, U_grad_l2: 30598.54492, b_grad_l2: 5618.98682 
[Train] epoch: 31/50, step: 155/250, loss: 744.73340
[Training] W_grad_l2: 7360.79980, U_grad_l2: 13329.91016, b_grad_l2: 1625.85828 
[Train] epoch: 31/50, step: 156/250, loss: 671.90283
[Training] W_grad_l2: 7501.07812, U_grad_l2: 6373.19141, b_grad_l2: 1453.89746 
[Train] epoch: 31/50, step: 157/250, loss: 522.16193
[Training] W_grad_l2: 2393.68091, U_grad_l2: 4121.59961, b_grad_l2: 534.63226 
[Train] epoch: 31/50, step: 158/250, loss: 1089.83643
[Training] W_grad_l2: 6760.60889, U_grad_l2: 11532.13477, b_grad_l2: 1147.31726 
[Train] epoch: 31/50, step: 159/250, loss: 582.46875
[Training] W_grad_l2: 584.48553, U_grad_l2: 1235.99915, b_grad_l2: 130.80145 
[Train] epoch: 32/50, step: 160/250, loss: 2385.13086
[Training] W_grad_l2: 3158.61035, U_grad_l2: 3706.25562, b_grad_l2: 847.22638 
[Train] epoch: 32/50, step: 161/250, loss: 723.18823
[Training] W_grad_l2: 40818.01953, U_grad_l2: 69070.02344, b_grad_l2: 13005.38086 
[Train] epoch: 32/50, step: 162/250, loss: 1118.76184
[Training] W_grad_l2: 7696.22949, U_grad_l2: 12521.51660, b_grad_l2: 1558.97559 
[Train] epoch: 32/50, step: 163/250, loss: 1619.62341
[Training] W_grad_l2: 6482.91895, U_grad_l2: 10898.00879, b_grad_l2: 1599.57520 
[Train] epoch: 32/50, step: 164/250, loss: 572.91254
[Training] W_grad_l2: 5109.74414, U_grad_l2: 6788.13525, b_grad_l2: 1085.71497 
[Train] epoch: 33/50, step: 165/250, loss: 1065.59631
[Training] W_grad_l2: 11507.85742, U_grad_l2: 14727.33496, b_grad_l2: 2625.60303 
[Train] epoch: 33/50, step: 166/250, loss: 503.14658
[Training] W_grad_l2: 2704.51245, U_grad_l2: 5880.26025, b_grad_l2: 848.05267 
[Train] epoch: 33/50, step: 167/250, loss: 547.99225
[Training] W_grad_l2: 12594.24414, U_grad_l2: 16354.67871, b_grad_l2: 2971.57300 
[Train] epoch: 33/50, step: 168/250, loss: 633.45282
[Training] W_grad_l2: 7311.69189, U_grad_l2: 9864.72656, b_grad_l2: 1659.74951 
[Train] epoch: 33/50, step: 169/250, loss: 403.11621
[Training] W_grad_l2: 3589.00464, U_grad_l2: 5950.16992, b_grad_l2: 1297.85498 
[Train] epoch: 34/50, step: 170/250, loss: 679.95087
[Training] W_grad_l2: 5452.82715, U_grad_l2: 11287.89453, b_grad_l2: 1687.39990 
[Train] epoch: 34/50, step: 171/250, loss: 588.94080
[Training] W_grad_l2: 2944.71094, U_grad_l2: 5525.98291, b_grad_l2: 1023.78131 
[Train] epoch: 34/50, step: 172/250, loss: 602.05042
[Training] W_grad_l2: 4638.83105, U_grad_l2: 5996.47168, b_grad_l2: 1015.59875 
[Train] epoch: 34/50, step: 173/250, loss: 953.94470
[Training] W_grad_l2: 2141.84058, U_grad_l2: 6275.40088, b_grad_l2: 919.30182 
[Train] epoch: 34/50, step: 174/250, loss: 609.08698
[Training] W_grad_l2: 29527.77539, U_grad_l2: 30875.56055, b_grad_l2: 5274.52539 
[Train] epoch: 35/50, step: 175/250, loss: 392.02301
[Training] W_grad_l2: 20836.20898, U_grad_l2: 17904.30664, b_grad_l2: 11475.52734 
[Train] epoch: 35/50, step: 176/250, loss: 1205.22302
[Training] W_grad_l2: 7922.56641, U_grad_l2: 14515.55664, b_grad_l2: 1799.05579 
[Train] epoch: 35/50, step: 177/250, loss: 1006.28857
[Training] W_grad_l2: 18278.08594, U_grad_l2: 33067.44141, b_grad_l2: 3725.89355 
[Train] epoch: 35/50, step: 178/250, loss: 509.32004
[Training] W_grad_l2: 2667.51172, U_grad_l2: 5224.75439, b_grad_l2: 687.71985 
[Train] epoch: 35/50, step: 179/250, loss: 516.95282
[Training] W_grad_l2: 3845.79321, U_grad_l2: 7926.73633, b_grad_l2: 926.88586 
[Train] epoch: 36/50, step: 180/250, loss: 1658.47192
[Training] W_grad_l2: 6482.41797, U_grad_l2: 11765.33789, b_grad_l2: 1003.69824 
[Train] epoch: 36/50, step: 181/250, loss: 470.72495
[Training] W_grad_l2: 4380.33154, U_grad_l2: 5808.56836, b_grad_l2: 695.91699 
[Train] epoch: 36/50, step: 182/250, loss: 943.31793
[Training] W_grad_l2: 77481.97656, U_grad_l2: 38357.28906, b_grad_l2: 15919.61621 
[Train] epoch: 36/50, step: 183/250, loss: 558.41724
[Training] W_grad_l2: 564317.00000, U_grad_l2: 463355.12500, b_grad_l2: 113519.39062 
[Train] epoch: 36/50, step: 184/250, loss: 558.74213
[Training] W_grad_l2: 3189.26172, U_grad_l2: 5831.96924, b_grad_l2: 699.69177 
[Train] epoch: 37/50, step: 185/250, loss: 506.04510
[Training] W_grad_l2: 5760.73975, U_grad_l2: 7057.30908, b_grad_l2: 2655.63965 
[Train] epoch: 37/50, step: 186/250, loss: 665.33124
[Training] W_grad_l2: 4876.40234, U_grad_l2: 7919.33008, b_grad_l2: 1604.29004 
[Train] epoch: 37/50, step: 187/250, loss: 441.40875
[Training] W_grad_l2: 1977.44617, U_grad_l2: 7576.76465, b_grad_l2: 834.93201 
[Train] epoch: 37/50, step: 188/250, loss: 425.98987
[Training] W_grad_l2: 5831.16992, U_grad_l2: 7714.65088, b_grad_l2: 1895.45435 
[Train] epoch: 37/50, step: 189/250, loss: 507.93884
[Training] W_grad_l2: 1226.33594, U_grad_l2: 2202.07910, b_grad_l2: 228.02698 
[Train] epoch: 38/50, step: 190/250, loss: 877.18469
[Training] W_grad_l2: 104800.50781, U_grad_l2: 188663.00000, b_grad_l2: 19988.82422 
[Train] epoch: 38/50, step: 191/250, loss: 1000.13477
[Training] W_grad_l2: 5724.17578, U_grad_l2: 7430.44141, b_grad_l2: 1501.54956 
[Train] epoch: 38/50, step: 192/250, loss: 948.83270
[Training] W_grad_l2: 14015.39844, U_grad_l2: 12452.91895, b_grad_l2: 2542.29443 
[Train] epoch: 38/50, step: 193/250, loss: 780.42682
[Training] W_grad_l2: 2858.44067, U_grad_l2: 3461.75317, b_grad_l2: 626.30579 
[Train] epoch: 38/50, step: 194/250, loss: 559.42523
[Training] W_grad_l2: 3422.24634, U_grad_l2: 6392.28809, b_grad_l2: 586.97375 
[Train] epoch: 39/50, step: 195/250, loss: 554.57703
[Training] W_grad_l2: 20780.52539, U_grad_l2: 19360.14062, b_grad_l2: 4847.16016 
[Train] epoch: 39/50, step: 196/250, loss: 564.31030
[Training] W_grad_l2: 12187.54004, U_grad_l2: 19345.59180, b_grad_l2: 2058.75122 
[Train] epoch: 39/50, step: 197/250, loss: 598.37732
[Training] W_grad_l2: 3553.78687, U_grad_l2: 7119.09131, b_grad_l2: 804.49591 
[Train] epoch: 39/50, step: 198/250, loss: 836.90723
[Training] W_grad_l2: 69748.17969, U_grad_l2: 56384.34375, b_grad_l2: 9959.51074 
[Train] epoch: 39/50, step: 199/250, loss: 513.58929
[Training] W_grad_l2: 3529.59131, U_grad_l2: 6006.57715, b_grad_l2: 1069.64661 
[Train] epoch: 40/50, step: 200/250, loss: 462.49075
[Training] W_grad_l2: 11721.07129, U_grad_l2: 20407.21094, b_grad_l2: 4172.69922 
[Evaluate]  dev score: 0.07000, dev loss: 298.39183
[Evaluate] best accuracy performence has been updated: 0.06000 --> 0.07000
[Train] epoch: 40/50, step: 201/250, loss: 332.46274
[Training] W_grad_l2: 3682.87793, U_grad_l2: 5432.62939, b_grad_l2: 1233.18201 
[Train] epoch: 40/50, step: 202/250, loss: 340.15549
[Training] W_grad_l2: 8388.93652, U_grad_l2: 14940.29199, b_grad_l2: 3263.29980 
[Train] epoch: 40/50, step: 203/250, loss: 330.84290
[Training] W_grad_l2: 4989.25879, U_grad_l2: 10843.09863, b_grad_l2: 1208.34009 
[Train] epoch: 40/50, step: 204/250, loss: 383.41928
[Training] W_grad_l2: 3012.49805, U_grad_l2: 3031.39966, b_grad_l2: 577.86823 
[Train] epoch: 41/50, step: 205/250, loss: 384.80081
[Training] W_grad_l2: 7714.75391, U_grad_l2: 10433.58203, b_grad_l2: 2641.19385 
[Train] epoch: 41/50, step: 206/250, loss: 548.91864
[Training] W_grad_l2: 5384.80518, U_grad_l2: 8600.59668, b_grad_l2: 1314.79065 
[Train] epoch: 41/50, step: 207/250, loss: 340.43289
[Training] W_grad_l2: 2758.00171, U_grad_l2: 3762.81348, b_grad_l2: 854.52386 
[Train] epoch: 41/50, step: 208/250, loss: 683.41656
[Training] W_grad_l2: 17893.04492, U_grad_l2: 18986.11914, b_grad_l2: 5208.33691 
[Train] epoch: 41/50, step: 209/250, loss: 572.94153
[Training] W_grad_l2: 5427.98096, U_grad_l2: 5826.52637, b_grad_l2: 978.59326 
[Train] epoch: 42/50, step: 210/250, loss: 1168.02759
[Training] W_grad_l2: 15453.51367, U_grad_l2: 21465.24219, b_grad_l2: 2601.03467 
[Train] epoch: 42/50, step: 211/250, loss: 782.25922
[Training] W_grad_l2: 4581.48096, U_grad_l2: 6404.73779, b_grad_l2: 1003.96600 
[Train] epoch: 42/50, step: 212/250, loss: 494.39136
[Training] W_grad_l2: 4243.55273, U_grad_l2: 7675.44482, b_grad_l2: 1608.44397 
[Train] epoch: 42/50, step: 213/250, loss: 437.20792
[Training] W_grad_l2: 19171.79688, U_grad_l2: 19212.51758, b_grad_l2: 4405.38330 
[Train] epoch: 42/50, step: 214/250, loss: 322.18576
[Training] W_grad_l2: 10503.06250, U_grad_l2: 8793.82812, b_grad_l2: 2076.37573 
[Train] epoch: 43/50, step: 215/250, loss: 371.60056
[Training] W_grad_l2: 35007.83203, U_grad_l2: 39701.11328, b_grad_l2: 11609.23730 
[Train] epoch: 43/50, step: 216/250, loss: 530.66132
[Training] W_grad_l2: 17842.50391, U_grad_l2: 4545.31641, b_grad_l2: 2743.40283 
[Train] epoch: 43/50, step: 217/250, loss: 612.40881
[Training] W_grad_l2: 2912.42212, U_grad_l2: 5006.51074, b_grad_l2: 634.08856 
[Train] epoch: 43/50, step: 218/250, loss: 485.52069
[Training] W_grad_l2: 4380.20020, U_grad_l2: 6818.40234, b_grad_l2: 1105.80237 
[Train] epoch: 43/50, step: 219/250, loss: 292.64175
[Training] W_grad_l2: 5141.67432, U_grad_l2: 8972.63281, b_grad_l2: 1901.23022 
[Train] epoch: 44/50, step: 220/250, loss: 386.80756
[Training] W_grad_l2: 3033.26978, U_grad_l2: 3916.63965, b_grad_l2: 783.81146 
[Train] epoch: 44/50, step: 221/250, loss: 441.30475
[Training] W_grad_l2: 7908.08984, U_grad_l2: 15912.87500, b_grad_l2: 1995.98376 
[Train] epoch: 44/50, step: 222/250, loss: 307.90637
[Training] W_grad_l2: 3045.32056, U_grad_l2: 6484.02539, b_grad_l2: 959.33905 
[Train] epoch: 44/50, step: 223/250, loss: 310.64221
[Training] W_grad_l2: 10656.90527, U_grad_l2: 12787.59180, b_grad_l2: 2664.48120 
[Train] epoch: 44/50, step: 224/250, loss: 230.12305
[Training] W_grad_l2: 1664.80237, U_grad_l2: 2379.58423, b_grad_l2: 381.49350 
[Train] epoch: 45/50, step: 225/250, loss: 406.54977
[Training] W_grad_l2: 39594.92188, U_grad_l2: 39701.20312, b_grad_l2: 12149.80859 
[Train] epoch: 45/50, step: 226/250, loss: 517.69604
[Training] W_grad_l2: 167459.06250, U_grad_l2: 62252.57812, b_grad_l2: 23514.62695 
[Train] epoch: 45/50, step: 227/250, loss: 399.33813
[Training] W_grad_l2: 13503.87109, U_grad_l2: 24806.31055, b_grad_l2: 5269.89209 
[Train] epoch: 45/50, step: 228/250, loss: 638.09949
[Training] W_grad_l2: 122498.44531, U_grad_l2: 68554.35938, b_grad_l2: 22192.37109 
[Train] epoch: 45/50, step: 229/250, loss: 279.23441
[Training] W_grad_l2: 47062.62891, U_grad_l2: 73684.82031, b_grad_l2: 12529.73242 
[Train] epoch: 46/50, step: 230/250, loss: 1398.81641
[Training] W_grad_l2: 41778.82031, U_grad_l2: 39341.23047, b_grad_l2: 5967.53906 
[Train] epoch: 46/50, step: 231/250, loss: 958.36438
[Training] W_grad_l2: 10098.21582, U_grad_l2: 9259.90723, b_grad_l2: 1790.04565 
[Train] epoch: 46/50, step: 232/250, loss: 791.77686
[Training] W_grad_l2: 20390.14062, U_grad_l2: 20629.96875, b_grad_l2: 3936.32080 
[Train] epoch: 46/50, step: 233/250, loss: 373.23636
[Training] W_grad_l2: 3804.12476, U_grad_l2: 5016.23828, b_grad_l2: 866.16333 
[Train] epoch: 46/50, step: 234/250, loss: 327.32227
[Training] W_grad_l2: 31316.20508, U_grad_l2: 33332.59375, b_grad_l2: 6149.84326 
[Train] epoch: 47/50, step: 235/250, loss: 511.98196
[Training] W_grad_l2: 19010.74805, U_grad_l2: 23630.08008, b_grad_l2: 2239.82983 
[Train] epoch: 47/50, step: 236/250, loss: 469.41608
[Training] W_grad_l2: 6165.38867, U_grad_l2: 7780.47119, b_grad_l2: 1128.80994 
[Train] epoch: 47/50, step: 237/250, loss: 459.93356
[Training] W_grad_l2: 9566.00195, U_grad_l2: 9909.89160, b_grad_l2: 1411.26343 
[Train] epoch: 47/50, step: 238/250, loss: 799.38916
[Training] W_grad_l2: 4974.57568, U_grad_l2: 7284.52441, b_grad_l2: 818.27875 
[Train] epoch: 47/50, step: 239/250, loss: 760.69708
[Training] W_grad_l2: 4832.53955, U_grad_l2: 8375.73242, b_grad_l2: 828.72186 
[Train] epoch: 48/50, step: 240/250, loss: 1255.78345
[Training] W_grad_l2: 37905.00781, U_grad_l2: 48921.76562, b_grad_l2: 6621.94531 
[Train] epoch: 48/50, step: 241/250, loss: 959.00653
[Training] W_grad_l2: 74639.82812, U_grad_l2: 70127.48438, b_grad_l2: 14713.14844 
[Train] epoch: 48/50, step: 242/250, loss: 1081.98218
[Training] W_grad_l2: 20412.42188, U_grad_l2: 36168.32812, b_grad_l2: 4845.06445 
[Train] epoch: 48/50, step: 243/250, loss: 934.48956
[Training] W_grad_l2: 6994.15674, U_grad_l2: 13397.30762, b_grad_l2: 1421.17847 
[Train] epoch: 48/50, step: 244/250, loss: 839.07220
[Training] W_grad_l2: 1700.79321, U_grad_l2: 3333.19580, b_grad_l2: 263.33188 
[Train] epoch: 49/50, step: 245/250, loss: 2398.78174
[Training] W_grad_l2: 6475.91162, U_grad_l2: 8577.03418, b_grad_l2: 1008.95697 
[Train] epoch: 49/50, step: 246/250, loss: 2025.71631
[Training] W_grad_l2: 1226.13123, U_grad_l2: 1688.28711, b_grad_l2: 206.97163 
[Train] epoch: 49/50, step: 247/250, loss: 5071.26855
[Training] W_grad_l2: 1288.56567, U_grad_l2: 5692.72803, b_grad_l2: 216.08934 
[Train] epoch: 49/50, step: 248/250, loss: 3589.15674
[Training] W_grad_l2: 472.51508, U_grad_l2: 788.15002, b_grad_l2: 80.21649 
[Train] epoch: 49/50, step: 249/250, loss: 3895.26611
[Training] W_grad_l2: 315.10236, U_grad_l2: 817.18292, b_grad_l2: 44.13593 
[Evaluate]  dev score: 0.08000, dev loss: 4843.72778
[Evaluate] best accuracy performence has been updated: 0.07000 --> 0.08000
[Train] Training done! 

After introducing gradient clipping, we collect the L2 norms of the gradients of the parameters W, U and b during training and plot them for display; the corresponding code is as follows:

# Path for saving the plot of the gradient norms
save_path =  f"./images/6.9.pdf"
# Plot the L2 norms of the W, U and b gradients, keeping only the first 100 training steps
plot_grad(W_list, U_list, b_list, save_path, keep_steps=100)

The output is:

image has been saved to:  ./images/6.9.pdf

The saved figure shows how the parameter gradients change during training after the clipping-by-norm strategy is introduced. As the iterations proceed, the gradient norms consistently stay at finite, reasonable values, which indicates that clipping by norm handles the gradient explosion problem well.
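For reference, below is a minimal sketch of what a plot_grad-style helper could look like; the actual plot_grad called above is defined earlier in this series, so the function name plot_grad_sketch, its signature and its styling here are illustrative assumptions rather than the original implementation.

import matplotlib.pyplot as plt


def plot_grad_sketch(W_list, U_list, b_list, save_path, keep_steps=None):
    # Keep only the first `keep_steps` records if requested
    n = len(W_list) if keep_steps is None else min(keep_steps, len(W_list))
    steps = range(n)

    plt.figure(figsize=(8, 4))
    plt.plot(steps, W_list[:n], label="W_grad_l2")
    plt.plot(steps, U_list[:n], label="U_grad_l2")
    plt.plot(steps, b_list[:n], label="b_grad_l2")
    plt.xlabel("step")
    plt.ylabel("L2 norm of gradient")
    plt.legend()
    plt.savefig(save_path)
    print("image has been saved to: ", save_path)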

Next, the model trained with the gradient clipping strategy is evaluated on the test set.

print(f"Evaluate SRN with data length {length}.")

# 加载训练过程中效果最好的模型
model_path = os.path.join(save_dir, f"srn_fix_explosion_model_{length}.pdparams")
runner.load_model(model_path)

# 使用测试集评价模型,获取测试集上的预测准确率
score, _ = runner.evaluate(test_loader)
print(f"[SRN] length:{length}, Score: {score: .5f}")

The output is:

Evaluate SRN with data length 20.
[SRN] length:20, Score:  0.07000 

[Thinking question] What is the principle by which gradient clipping solves the gradient explosion problem?

Let me first give my own understanding, and then explain it in combination with the reference material.

My understanding is that this is essentially a brute-force truncation, somewhat like the thresholding idea from when I first studied machine learning: once the gradient grows past a certain point, it is forcibly clipped and set to a prescribed value, and that clipped value is then used to update the parameters. The choice of this value therefore affects training, so different clipping thresholds lead to different model accuracy, as was also shown above.
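To make this concrete, here is a minimal sketch of clipping by norm using PyTorch's torch.nn.utils.clip_grad_norm_; the toy model, optimizer and max_norm value are illustrative assumptions and are not taken from the experiment code above.

import torch
import torch.nn as nn

# A toy model and optimizer, purely for illustration (not the SRN used in this experiment)
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 10)
y = torch.randn(4, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip by norm: if the total L2 norm of all gradients exceeds max_norm,
# every gradient is rescaled by max_norm / total_norm, so the update
# direction is preserved while its magnitude is bounded
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)

optimizer.step()
optimizer.zero_grad()

Note that clipping by norm rescales all gradient components proportionally, whereas clipping by value (for example, torch.nn.utils.clip_grad_value_) clamps each component independently to the range [-clip_value, clip_value].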

The textbook's explanation is given below (the teacher also mentioned that the textbook covers this later on; what I wrote above is my own understanding after the teacher's slides).


Summary

First, these past few days have honestly been tiring. At first I did not plan to go home, but the outbreak at school got serious, so I went home; then I heard that Lianchi District was about to go into lockdown, so I went back to my hometown, and a day later all of Baoding was locked down. Because of this I had almost no time to write this assignment, and it is lucky I had already written a little on Monday, otherwise I might not have finished at all.

Second, I hope the epidemic passes soon. I really almost did not finish, but I believe the effort was worth it.

Third, I learned the concept of a norm and why the norm of the gradient is printed. I had never thought about this before; this time I realized that it measures how close the gradient values are to zero as a whole, removing the influence of their signs.

Fourth, I now understand what gradient clipping is: once the gradient exceeds a certain threshold it is forcibly set to a fixed value. I feel a proportional rescaling would be better than forcing everything above the threshold to the same value.

Finally, of course, thanks to the teacher for the care shown both in my studies and in daily life (haha).
