Why the model's parameters would not update: print calls in training and prediction


While writing code this afternoon, I found that the model produced almost identical outputs for different inputs.

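# Set up the device and wrap the model once, outside the epoch loop,
# so the model is not wrapped in DataParallel again on every epoch.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model = nn.DataParallel(model)

for epoch in range(1):
    print('epoch {}'.format(epoch+1))
    train_loss = 0
    train_acc = 0
    model.train()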
    
    for batch_token_ids,batch_segment_ids,batch_mask_ids,batch_labels in tqdm(train_loader):
        batch_token_ids = batch_token_ids.to(device)
        batch_segment_ids = batch_segment_ids.to(device)
        batch_mask_ids = batch_mask_ids.to(device)
        batch_labels = batch_labels.to(device)
        output = model(batch_token_ids,batch_segment_ids,batch_mask_ids)
        #print('###output = ###')
        #print(output)
        loss = loss_func(output,batch_labels)
        train_loss += loss.item()  # accumulate a float so the graph is not kept alive
        pred = torch.max(output, 1)[1]
        train_correct = (pred == batch_labels).sum().item()
        train_acc += train_correct
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    #print('Train Loss: {:.6f}, Acc: {:.6f}'.format(train_loss / (len(
    #    batch_token_ids)), train_acc / (len(train_data))))
    print('Train Loss: {:.6f}, Acc: {:.6f}'.format(train_loss, train_acc/len(train_dataset)))
    
    torch.cuda.empty_cache() 
    model.eval()
    eval_loss = 0.
    eval_acc = 0.
    for batch_token_ids,batch_segment_ids,batch_mask_ids,batch_labels in test_loader:
        
        batch_token_ids = batch_token_ids.to(device)
        batch_segment_ids = batch_segment_ids.to(device)
        batch_mask_ids = batch_mask_ids.to(device)
        batch_labels = batch_labels.to(device)
        with torch.no_grad():
            output = model(batch_token_ids,batch_segment_ids,batch_mask_ids)
            print('###output = ###')
            print(output)
        loss = loss_func(output, batch_labels)
        eval_loss += loss.item()
        pred = torch.max(output, 1)[1]
        num_correct = (pred == batch_labels).sum().item()
        eval_acc += num_correct
    torch.cuda.empty_cache() 
    print('Test Loss: {:.6f}, Acc: {:.6f}'.format(eval_loss, eval_acc/len(test_dataset)))

The output was:

###output = ###
tensor([[-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504],
        [-0.2677,  0.4505],
        [-0.2677,  0.4504],
        [-0.2677,  0.4504]], device='cuda:0')

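Different inputs produced the same output. After repeated investigation, my first explanation was that the word_embeddings values in BERT were too small in scale. Since segment_embeddings and position_embeddings are the same for every sample here, a very small word_embeddings contribution has little influence on the final result, so the outputs look identical (they actually differ slightly, but the differences are hidden because only four decimal places are printed).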
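To see whether rows that print the same really are the same, the spread across the batch can be measured and the print precision raised. This is a minimal sketch with hypothetical logits chosen to mimic the output above; the values are not taken from the real model.

```python
import torch

# Hypothetical logits: rows differ only beyond the 4th decimal place.
logits = torch.tensor([[-0.26771, 0.45041],
                       [-0.26773, 0.45049],
                       [-0.26769, 0.45044]])

# Default printing shows 4 decimals, so the rows look nearly identical.
print(logits)

# Per-column spread across the batch: a value near zero means the model
# produces (almost) the same logits for every input.
spread = logits.max(dim=0).values - logits.min(dim=0).values
print(spread)

# Raise the print precision to see the actual differences.
torch.set_printoptions(precision=8)
print(logits)
torch.set_printoptions(precision=4)  # restore the default precision
```

A spread this small means the collapse is real and not just a display effect, even though the rows are not bit-for-bit equal.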

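So it is best to load pretrained BERT weights, which keeps the values on a comparable scale and makes further work easier. Even if you initialize the weights yourself, check the values that other implementations use for initialization.
If I only trained without predicting, training produced different logits, but as soon as the model switched to prediction the outputs became identical. If I only predicted without training, prediction produced different logits.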

By printing what each network layer produces, I found that the problem came from taking the value at the first position.

class ClassificationModel(nn.Module):
    def __init__(self,model,config,n_labels):
        super(ClassificationModel,self).__init__()
        self.model = model  # use the constructor argument, not a global named bert
        self.fc1 = nn.Linear(config.embedding_size,config.embedding_size)
        self.activation = nn.Tanh()
        self.dropout = nn.Dropout(0.2)
        self.fc2 = nn.Linear(config.embedding_size,n_labels)
    
    def forward(self,input_ids,segment_ids,input_mask):
        # The three arguments of forward correspond to the items produced by
        # return tuple(tensor[index] for tensor in self.tensors)
        outputs = self.model(input_ids,segment_ids,input_mask)
        # [64, 128, 768]
        outputs = outputs[:,0]
        # [64, 768]
        outputs = self.fc1(outputs)
        outputs = self.activation(outputs)
        outputs = self.dropout(outputs)
        outputs = self.fc2(outputs)
        return outputs

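Here, the line in the network

outputs = outputs[:,0]

picks values that have not interacted enough with the other positions, so the results are nearly identical: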

tensor([[[-0.2047, -0.1559,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1559,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1559,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         ...,
         [-0.2047, -0.1560,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1560,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1560,  0.1625,  ...,  0.9564,  0.2497, -0.8757]],

        [[-0.2046, -0.1559,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2046, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2046, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         ...,
         [-0.2046, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2045, -0.1560,  0.1626,  ...,  0.9564,  0.2496, -0.8757],
         [-0.2046, -0.1559,  0.1626,  ...,  0.9565,  0.2496, -0.8757]],

        [[-0.2047, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2047, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2046, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         ...,
         [-0.2047, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757],
         [-0.2047, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8758],
         [-0.2047, -0.1560,  0.1626,  ...,  0.9565,  0.2496, -0.8757]],

        ...,

        [[-0.2046, -0.1561,  0.1626,  ...,  0.9565,  0.2496, -0.8758],
         [-0.2046, -0.1561,  0.1626,  ...,  0.9565,  0.2496, -0.8758],
         [-0.2046, -0.1561,  0.1626,  ...,  0.9565,  0.2496, -0.8758],
         ...,
         [-0.2046, -0.1561,  0.1626,  ...,  0.9565,  0.2496, -0.8758],
         [-0.2046, -0.1561,  0.1625,  ...,  0.9565,  0.2496, -0.8758],
         [-0.2046, -0.1561,  0.1626,  ...,  0.9565,  0.2496, -0.8758]],

        [[-0.2047, -0.1560,  0.1627,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1560,  0.1626,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1560,  0.1627,  ...,  0.9564,  0.2497, -0.8757],
         ...,
         [-0.2046, -0.1560,  0.1626,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2047, -0.1560,  0.1627,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1560,  0.1627,  ...,  0.9564,  0.2497, -0.8757]],

        [[-0.2046, -0.1560,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1559,  0.1626,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1559,  0.1626,  ...,  0.9564,  0.2497, -0.8758],
         ...,
         [-0.2046, -0.1559,  0.1626,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1560,  0.1625,  ...,  0.9564,  0.2497, -0.8757],
         [-0.2046, -0.1559,  0.1625,  ...,  0.9564,  0.2497, -0.8757]]],
       device='cuda:0')

** zzz**

$$$outputs = $$$
tensor([[[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [-0.1348,  0.0131, -0.1860,  ...,  0.5096,  0.7875,  0.2537],
         [-0.5072,  0.0169, -0.0083,  ...,  0.3212,  0.4602, -0.9336],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]],

        [[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [ 0.9524,  0.5670, -0.2149,  ...,  0.2403,  0.3633,  0.4568],
         [-0.9436, -0.0290, -0.2395,  ..., -0.7782, -0.3521, -1.0194],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]],

        [[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [ 0.8063,  0.4222, -1.1415,  ...,  0.6901,  0.6352, -0.3319],
         [-0.5839,  0.3276, -0.4839,  ...,  0.4400,  0.5085,  0.5786],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]],

        ...,

        [[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [ 0.8063,  0.4222, -1.1415,  ...,  0.6901,  0.6352, -0.3319],
         [-0.5706,  0.3244, -0.9336,  ...,  0.4432,  0.7614, -0.0888],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]],

        [[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [-0.0036,  0.5475, -0.1394,  ..., -0.2208,  0.4325, -0.4380],
         [ 0.7087, -0.2818, -0.8927,  ...,  0.6469,  1.0082, -0.7571],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]],

        [[ 0.2848, -0.4215, -0.2901,  ..., -0.1036,  0.0794,  0.2820],
         [-0.0036,  0.5475, -0.1394,  ..., -0.2208,  0.4325, -0.4380],
         [-0.1104, -0.3923, -1.2508,  ..., -0.5276, -0.5741, -0.5231],
         ...,
         [ 0.4556, -0.5518,  0.3407,  ..., -0.7521,  0.0717,  0.7361],
         [ 0.3190, -0.5898,  0.1338,  ..., -0.6915,  0.0208,  0.5439],
         [ 0.5231, -0.4503,  0.3048,  ..., -0.2850, -0.3230,  0.4492]]],
       device='cuda:0')

!!!attention different!!!

777outputs = 777

tensor([[[-0.5465, -0.3685, -0.0388,  ...,  0.1478,  0.4534, -0.1577],
         [-0.5740, -0.3679, -0.0508,  ...,  0.1276,  0.4721, -0.1519],
         [-0.5719, -0.3697, -0.0515,  ...,  0.1254,  0.4745, -0.1484],
         ...,
         [-0.5450, -0.3638, -0.0377,  ...,  0.1465,  0.4493, -0.1580],
         [-0.5472, -0.3668, -0.0363,  ...,  0.1462,  0.4492, -0.1589],
         [-0.5475, -0.3658, -0.0363,  ...,  0.1470,  0.4485, -0.1586]]],
       device='cuda:0', grad_fn=<NativeLayerNormBackward>)
7777777777777777
transformer1111111111111
tensor([[[-1.1861, -0.5837, -1.6331,  ...,  1.1560,  1.4577,  1.5189],
         [-1.1860, -0.5837, -1.6329,  ...,  1.1559,  1.4574,  1.5186],
         [-1.1860, -0.5837, -1.6329,  ...,  1.1559,  1.4574,  1.5186],
         ...,
         [-1.1861, -0.5837, -1.6331,  ...,  1.1560,  1.4577,  1.5189],
         [-1.1861, -0.5837, -1.6331,  ...,  1.1560,  1.4577,  1.5189],
         [-1.1861, -0.5837, -1.6331,  ...,  1.1560,  1.4578,  1.5189]]],
       device='cuda:0', grad_fn=<ViewBackward>)
111111111111111111111111

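Commenting out parts of the structure

I commented out every transformer block. Even with only BertEmbeddings left, the outputs after training were still identical, which suggests a problem in the training process.
In the end I found it was still related to taking position 0: the positions did not interact properly during training, and the value at the first position never changed.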

model.eval()
output = model(torch.tensor([[ 101, 5672, 2033, 2011, 2151, 3793, 2017, 1005, 1040, 2066, 1012,  102],
                             [ 101, 102 , 103,  104 , 105 , 106 , 107 , 108 , 109 , 110,  111,   112]]),
          torch.tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
          torch.tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]))
print(output)

The output was:

$$$outputs = $$$
tensor([[[ 0.0009, -0.0245, -0.0229,  ...,  0.0035,  0.0037,  0.0127],
         [-0.0436, -0.0116,  0.0493,  ..., -0.0380,  0.0292,  0.0141],
         [-0.0011,  0.0014, -0.0595,  ...,  0.0207,  0.0283,  0.0081],
         ...,
         [ 0.0093,  0.0217,  0.0145,  ...,  0.0025,  0.0356,  0.0086],
         [-0.0207, -0.0020, -0.0118,  ...,  0.0128,  0.0200,  0.0259],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015]],

        [[ 0.0009, -0.0245, -0.0229,  ...,  0.0035,  0.0037,  0.0127],
         [-0.0145, -0.0100,  0.0060,  ..., -0.0250,  0.0046, -0.0015],
         [ 0.0037, -0.0069,  0.0087,  ...,  0.0054, -0.0043, -0.0004],
         ...,
         [-0.0125, -0.0543, -0.0213,  ..., -0.0135, -0.0328, -0.0145],
         [-0.0163, -0.0521, -0.0353,  ..., -0.0093, -0.0322, -0.0109],
         [-0.0117, -0.0531, -0.0262,  ..., -0.0174, -0.0372, -0.0176]]],
       device='cuda:0', grad_fn=<EmbeddingBackward>)

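The vectors at position 0 are essentially the same, while the later positions differ. Also, when the input_ids are the same, the outputs are the same!!! This shows that the positions do not interact properly inside the model.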

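Tracking down the error

After several rounds of checking, I felt the identical outputs still came from the network structure; the training and prediction code should not differ much.
The method that finally worked was to keep all 12 layers, comment out individual sub-structures one at a time, and compare the predicted outputs. This pointed to the attention layer.
The reason: in the reference implementation, commenting out everything after attention and predicting gave nearly identical outputs, whereas in my implementation the same change gave clearly different outputs (both with 12 layers). That is why I suspected my attention layer.
However, after replacing that part of my code with the reference code, the problem remained.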
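Commenting layers out one by one is slow. As an alternative, forward hooks can record how much each layer's output varies across the batch in a single pass. The sketch below uses a small hypothetical nn.Sequential as a stand-in for the real model; the layer names and sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical stand-in for a stack of layers; replace with the real model.
model = nn.Sequential(nn.Linear(16, 16), nn.Tanh(), nn.Linear(16, 2))
model.eval()

batch_spread = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Standard deviation across the batch dimension, averaged over features.
        batch_spread[name] = output.std(dim=0).mean().item()
    return hook

handles = [
    layer.register_forward_hook(make_hook(f"{i}:{type(layer).__name__}"))
    for i, layer in enumerate(model)
]

with torch.no_grad():
    model(torch.randn(8, 16))

for name, value in batch_spread.items():
    print(name, value)

# Remove the hooks so later forward passes are not affected.
for handle in handles:
    handle.remove()
```

The first layer whose spread drops to nearly zero is where different samples stop being distinguishable.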

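Final check

After looking closely at the training output, I had a new guess. In the first two epochs the values were properly distinct, but they became more and more fixed afterwards. Could the optimizer's learning rate be too large???
Originally I did not specify a learning rate; the code was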

optimizer = torch.optim.Adam(model.parameters())

Adam's default learning rate (lr=1e-3) is fairly large for this model.
So I reduced the learning rate:

optimizer = torch.optim.Adam(model.parameters(),lr=0.00001)

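The program produced sensible values again!!!
So the optimizer's default learning rate was too large, and choosing a small learning rate matters.
With lr=0.001 the problem reappears, which confirms that choosing a suitable learning rate is important!!!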
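One way to see why lr=0.001 can be too large here: on its first step, Adam moves each parameter by roughly lr, almost independently of the gradient's size. The sketch below reimplements a single Adam step for one scalar parameter in plain Python, using Adam's usual default beta1, beta2 and eps. It is an illustration of the update rule, not the PyTorch implementation itself.

```python
import math

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update for a single scalar parameter."""
    m = (1 - beta1) * grad          # first-moment estimate after one step
    v = (1 - beta2) * grad * grad   # second-moment estimate after one step
    m_hat = m / (1 - beta1)         # bias correction for step t = 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for lr in (1e-3, 1e-5):
    for grad in (1e-4, 1e-1):
        print(f"lr={lr:g} grad={grad:g} step={adam_first_step(grad, lr):.3e}")
```

With embedding values of the order of 0.01, as in the embedding output printed earlier, a step of about 0.001 per update is a large relative change, while a step of about 0.00001 is not.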

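Further debugging

Further debugging showed that, although the predictions now varied after the learning-rate change, all of them shifted in the same direction, so the predicted label was essentially always the same.
Output of the standard BERT (which differs from mine only in the attention layer) after 1 epoch: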

###output = ###
tensor([[ 0.2700,  0.2923],
        [ 0.1380,  0.4510],
        [-0.4351,  1.1153],
        [-0.5227,  1.1540],
        [-0.3554,  1.0272],
        [ 0.1704,  0.2805],
        [-0.1508,  0.7339],
        [-0.3214,  0.9973],
        [-0.3231,  0.9882],
        [-0.2027,  0.9249],
        [ 0.2438,  0.1267],
        [-0.4552,  1.0580],
        [ 0.3534, -0.1432],
        [ 0.1472,  0.3750],
        [-0.6435,  1.1881],
        [-0.6448,  1.2098],
        [-0.5907,  1.1442],
        [-0.6157,  1.1829],
        [-0.5695,  1.1348],
        [-0.0073,  0.5067],
        [-0.5783,  1.1514],
        [ 0.1465,  0.1472],
        [ 0.4039, -0.6722],
        [-0.5679,  1.1422],
        [-0.5323,  1.1588],
        [-0.6333,  1.1807],
        [-0.6263,  1.1737],
        [-0.6299,  1.1925],
        [ 0.1615,  0.1347]], device='cuda:0')

Output of my BERT after 1 epoch:

###output = ###
tensor(
       [[-0.2258,  0.3044],
        [-0.3485,  0.2230],
        [-0.2349,  0.2664],
        [-0.0914,  0.3775],
        [-0.3134,  0.3017],
        [-0.2691,  0.2938],
        [-0.2980,  0.6075],
        [-0.2243,  0.4173],
        [-0.2265,  0.4223],
        [-0.2487,  0.2989],
        [-0.1592,  0.4314],
        [-0.3178,  0.3550],
        [-0.2610,  0.3582],
        [-0.3295,  0.2310],
        [-0.3012,  0.5738],
        [-0.1087,  0.4670],
        [-0.3594,  0.6535],
        [-0.3408,  0.2189],
        [-0.0856,  0.4437],
        [-0.3300,  0.2931],
        [-0.3805,  0.1971],
        [-0.2661,  0.5115],
        [-0.2539,  0.4829],
        [-0.2352,  0.3633],
        [-0.3658,  0.7181],
        [ 0.0233,  0.4239],
        [-0.2298,  0.3939],
        [-0.3269,  0.2274],
        [-0.2886,  0.3332]], device='cuda:0')

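The standard BERT's values are spread out, while mine are concentrated. My BERT predicts label 1 for every sample, with the right-hand logit clearly larger than the left, whereas the standard BERT predicts both 0 and 1, which matches the real data. This made me look again at the attention layer in my BERT.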
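Rather than reading the logits by eye, the predicted labels can be counted. The sketch below takes the argmax of a few rows copied from the two outputs above; predicted_labels is a helper name introduced here for illustration.

```python
from collections import Counter

def predicted_labels(logits):
    """Argmax over each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

# A few rows copied from the two outputs above.
standard_bert = [[0.2700, 0.2923], [0.2438, 0.1267], [0.3534, -0.1432], [-0.4351, 1.1153]]
my_bert = [[-0.2258, 0.3044], [-0.3485, 0.2230], [0.0233, 0.4239], [-0.0914, 0.3775]]

print(Counter(predicted_labels(standard_bert)))
print(Counter(predicted_labels(my_bert)))
```

A counter with a single key is a quick signal that the model has collapsed to one label.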

The loss barely changes???

Looking further, the loss after each training step stays almost the same. Here are the losses printed after each step:

loss = 
tensor(5.9077, device='cuda:0', grad_fn=<MeanBackward0>)
loss = 
tensor(5.8792, device='cuda:0', grad_fn=<MeanBackward0>)
loss = 
tensor(5.8079, device='cuda:0', grad_fn=<MeanBackward0>)
loss = 
tensor(5.6600, device='cuda:0', grad_fn=<MeanBackward0>)

This raises an important question: as training goes on, the loss hardly decreases!!!
I tried removing the transformer and got the following result:

logits = 
tensor([[-0.5117,  0.1378],
        [-0.5493,  0.1922],
        [-0.6277,  0.1356],
        [-0.5761,  0.2244],
        [-0.6132,  0.2341],
        [-0.5065,  0.2059],
        [-0.5965,  0.1784],
        [-0.6142,  0.2813],
        [-0.4949,  0.1311],
        [-0.5907,  0.2098],
        [-0.5602,  0.2436],
        [-0.5699,  0.0812],
        [-0.5364,  0.2125],
        [-0.4902,  0.1389],
        [-0.5639,  0.2263],
        [-0.6449,  0.1833],
        [-0.6300,  0.2147],
        [-0.4811,  0.1607],
        [-0.5663,  0.2478],
        [-0.5297,  0.1633]]

After adding one transformer layer, the output was:

logits = 
tensor([[-0.6423,  0.5467],
        [-0.1368,  0.1510],
        [-0.7693,  0.7139],
        [-0.5271,  0.5275],
        [-0.7232,  0.7063],
        [-0.5627,  0.6834],
        [ 0.2435, -0.1741],
        [-1.1279,  1.1999],
        [-0.5683,  0.4834],
        [-0.7618,  0.6984],
        [-0.5251,  0.4246],
        [ 0.7410, -0.7918],
        [ 0.3518, -0.3879],
        [-0.5383,  0.5312],
        [-0.8535,  0.8815],
        [ 0.0776, -0.0912],
        [-0.2774,  0.3317],
        [-0.5682,  0.4882],
        [-0.4891,  0.4092],
        [-0.4869,  0.4494]], device='cuda:0', grad_fn=<AddmmBackward>)

So there still seems to be a serious problem inside the model structure.

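Further search for the source: the label handling in preprocessing was wrong!!!

However, even after fixing the labels in preprocessing, training still did not work.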

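Idea: different outputs under model.train() and model.eval()

In experiments, model.train() sometimes predicted different results, while model.eval() always predicted the same result. I also found that replacing BERT with a plain embedding layer allowed normal training and prediction. This suggests that some parts of BERT, such as layer normalization or the activation function, may need to be implemented by hand, and the problem is still located inside the BERT model.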
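One known source of train/eval differences is dropout, which is random in train mode and a no-op in eval mode. The sketch below shows this on a bare nn.Dropout layer; it is a minimal illustration and does not claim that dropout is what caused the behaviour in the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(0.2)
x = torch.ones(1, 10)

# Train mode: entries are randomly zeroed and the survivors are rescaled,
# so two passes over the same input usually differ.
dropout.train()
train_a = dropout(x)
train_b = dropout(x)
print(train_a)
print(train_b)

# Eval mode: dropout returns its input unchanged, so outputs are repeatable.
dropout.eval()
eval_a = dropout(x)
eval_b = dropout(x)
print(torch.equal(eval_a, eval_b), torch.equal(eval_a, x))
```

So varying outputs under model.train() for the same input can come from dropout noise alone, while model.eval() shows what the weights themselves compute.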

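Going back to the BERT model

Following a reference BERT implementation, I first replaced layer_normalization with my own implementation, and the outputs were still identical. I then changed the gelu activation function, and the model could output different results!!!
After moving this into my new training setup, I found that once a later layer used PyTorch's tanh activation, it stopped working again. As for the tanh activation in PyTorch ...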

Activation functions that train well in PyTorch

The activation functions that work well for training in PyTorch are already collected in the transformers code:

import math

import torch
import torch.nn.functional as F

def mish(x):
    return x * torch.tanh(torch.nn.functional.softplus(x))

def linear_act(x):
    return x

def gelu_new(x):
    """
    Implementation of the GELU activation function currently in Google BERT repo (identical to OpenAI GPT). Also see
    the Gaussian Error Linear Units paper: https://arxiv.org/abs/1606.08415
    """
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))

def gelu_fast(x):
    return 0.5 * x * (1.0 + torch.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))


ACT2FN = {
    "relu": F.relu,
    "silu": F.silu,
    "swish": F.silu,
    "gelu": F.gelu,  # the bare name gelu is not defined in this snippet
    "tanh": torch.tanh,
    "gelu_new": gelu_new,
    "gelu_fast": gelu_fast,
    "mish": mish,
    "linear": linear_act,
    "sigmoid": torch.sigmoid,
}
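To compare these GELU variants numerically, the sketch below evaluates the exact erf-based definition next to the two tanh forms on plain Python floats. The function names gelu_exact, gelu_tanh and gelu_fast_scalar are introduced here for illustration; the last two use the same formulas as gelu_new and gelu_fast above.

```python
import math

def gelu_exact(x):
    """GELU defined through the Gaussian CDF: x * Phi(x)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """tanh approximation, same formula as gelu_new above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gelu_fast_scalar(x):
    """Same formula as gelu_fast above, with sqrt(2/pi) written as a constant."""
    return 0.5 * x * (1.0 + math.tanh(x * 0.7978845608 * (1.0 + 0.044715 * x * x)))

points = (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0)
for x in points:
    print(x, gelu_exact(x), gelu_tanh(x), gelu_fast_scalar(x))
```

The three versions agree closely at these points, so swapping one for another changes the values only slightly.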

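That attempt failed, so back to the original model

This model did run on another dataset. On inspection, though, the ids in that dataset were wrong: the first id of each sequence was not the [CLS] id but the id of the first word. That is why training produced different labels there, so I still had to go back to the original model to find and fix the problem.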
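A quick check on the token ids would catch this kind of data problem early. The sketch below assumes the [CLS] id is 101, as in the input_ids used earlier in this post; rows_missing_cls is a helper name introduced here for illustration.

```python
CLS_ID = 101  # assumed [CLS] id, matching the input_ids used earlier in this post

def rows_missing_cls(batch_token_ids, cls_id=CLS_ID):
    """Return the indices of sequences whose first token is not [CLS]."""
    return [i for i, ids in enumerate(batch_token_ids) if not ids or ids[0] != cls_id]

batch = [
    [101, 5672, 2033, 102],   # starts with [CLS]
    [5672, 2033, 2011, 102],  # starts with the first word instead
]
print(rows_missing_cls(batch))
```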

Why embedding + linear alone fails; the focus stays on changing the model

To locate the step where training goes wrong, I first changed the model structure as follows:

class ClassificationModel(nn.Module):
    def __init__(self,model,config,n_labels):
        super(ClassificationModel,self).__init__()
        self.embedding = nn.Embedding(30522,768)
        #self.model = model
        self.dropout1 = nn.Dropout(0.2)
        self.fc2 = nn.Linear(config.embedding_size,n_labels)
        
    def forward(self,input_ids,segment_ids,input_mask):
        outputs = self.embedding(input_ids)
        #outputs = self.model(input_ids)
        #[64,128,768]
        outputs = outputs[:,0]
        outputs = self.dropout1(outputs)
        outputs = self.fc2(outputs)
        return outputs

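On reflection, the problem here is outputs = outputs[:,0]. Whatever data goes into the network, after the step

outputs = outputs[:,0]

the result is always the same, because every sequence starts with [CLS]. No matter how the model is trained, the input to the following layer is identical for every sample, so this network cannot be trained properly.

Since changing the network structure changed the training results, I concluded that the identical predicted labels were still caused by the network structure.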
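The embedding-only case can be reproduced in a few lines. The sketch below uses the same vocabulary size as the model above but a small hidden size of 8 to keep the demo light (an assumption for illustration), and two different sentences that both start with id 101.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
CLS_ID = 101  # [CLS] id, as in the input_ids used earlier in this post

# Hidden size 8 instead of 768, only to keep the demo small.
embedding = nn.Embedding(30522, 8)

# Two different sentences that both start with [CLS].
input_ids = torch.tensor([[CLS_ID, 5672, 2033, 2011, 102],
                          [CLS_ID, 3793, 2017, 1005, 102]])

outputs = embedding(input_ids)   # [2, 5, 8]
cls_vectors = outputs[:, 0]      # [2, 8]

# Without attention, position 0 depends only on the token id at position 0.
print(torch.equal(cls_vectors[0], cls_vectors[1]))
# Other positions still differ because their token ids differ.
print(torch.equal(outputs[0, 1], outputs[1, 1]))
```

With an attention layer in between, position 0 would mix in information from the other tokens, which is what makes outputs[:,0] usable as a sentence representation.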

Changing the attention computation made training work

After changing the attention computation, the BERT model trained normally.

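An unexpected error: checking the print calls

The idea came from an earlier experiment with someone else's BERT. It sometimes predicted different labels, but when I compared the results under train() and eval(), the intermediate results were the same.
After comparing the fine details of that BERT and mine, I finally removed all print calls and kept a print only in prediction. That did it: the program ran normally!!!
During PyTorch training and prediction, try not to print anything from the network layers, including the final result; otherwise the model may end up predicting a single label for everything!!!
Guess: the model runs on the GPU, while printing happens on the CPU. The CPU and GPU may conflict, so printing values midway could corrupt the final predictions and leave them all with the same label.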
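This guess can be checked directly. The sketch below uses a hypothetical TinyHead module, not the BERT model from this post. It runs the same input through the same weights in eval mode, once without and once with a print inside forward, and compares the two results.

```python
import torch
import torch.nn as nn

class TinyHead(nn.Module):
    """Hypothetical stand-in for the classifier; not the BERT model from this post."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 2)
        self.verbose = False

    def forward(self, x):
        hidden = torch.tanh(self.fc1(x))
        if self.verbose:
            print(hidden)  # printing reads the values; it does not write to the tensor
        return self.fc2(hidden)

torch.manual_seed(0)
model = TinyHead().eval()
x = torch.randn(4, 8)

with torch.no_grad():
    model.verbose = False
    out_silent = model(x)
    model.verbose = True
    out_printed = model(x)

print(torch.allclose(out_silent, out_printed))
```

If the two outputs also match on the real model, any difference between runs would have to come from something else that changed between them, for example dropout in train mode, the random seed, or the data order.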

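Labels may not change in the first round of prediction

Also, in the first round of prediction the labels may not have changed yet, so watch a few more epochs before drawing conclusions.
The model below has the problem: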

class ClassificationModel(nn.Module):
    def __init__(self,model,config,n_labels):
        super(ClassificationModel,self).__init__()
        #self.embedding = nn.Embedding(30522,768)
        self.model = model
        self.fc1 = nn.Linear(config.embedding_size,config.embedding_size)
        self.dropout1 = nn.Dropout(0.2)
        self.activation = torch.tanh
        self.fc2 = nn.Linear(config.embedding_size,n_labels)
        
    def forward(self,input_ids,segment_ids,input_mask):
        #outputs = self.embedding(input_ids)
        outputs = self.model(input_ids)
        #[64,128,768]
        #print('...outputs = ...')
        #print(outputs)
        #print('................')
        outputs = outputs[:,0]
        outputs = self.fc1(outputs)
        outputs = self.activation(outputs)
        outputs = self.dropout1(outputs)
        outputs = self.fc2(outputs)
        #outputs = F.softmax(outputs)
        return outputs

Adding the tanh activation causes the problem, and the predicted label stays constant.
The earlier embedding + linear model has the same problem.

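Summary: 1. The problem is related to the model; no link to other factors has been found so far. 2. I had only run one epoch before; sometimes it takes several epochs to see an effect.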

A curious use of print

    for batch_token_ids,batch_labels in tqdm(test_loader,bar_format='{l_bar}%s{bar}%s{r_bar}' % (Fore.BLUE, Fore.RESET)):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)
        #eval_true_label.extend(batch_labels)
        with torch.no_grad():
            output = model(batch_token_ids,None,None)
        #pred = torch.max(output,axis=-1)[1]
        pred = torch.max(output, 1)[1]
        print('pred = ')
        print(pred)
        pred = pred.cpu()
        pred = pred.tolist()
        print('---pred = ---')
        print(pred)
        print('-------------')
        eval_predict_label.extend(pred)
        eval_true_label.extend(batch_labels.cpu())

Here, if the first pred is not printed,

print('pred = ')
print(pred)

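the model performs poorly. This is probably related to moving data between the GPU and cpu(). Given Python's shallow and deep copies, a deep copy may be worth trying.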
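To make the copy question concrete, the sketch below shows what ends up in the list with and without .tolist(), and how clone() gives an independent copy. It runs on the CPU with a hypothetical pred tensor.

```python
import torch

pred = torch.tensor([1, 0, 1])  # hypothetical predictions

as_tensors = []
as_tensors.extend(pred.cpu())            # list of 0-dim tensors
as_ints = []
as_ints.extend(pred.cpu().tolist())      # list of plain Python ints

print(type(as_tensors[0]), type(as_ints[0]))
print([t.item() for t in as_tensors] == as_ints)

# clone() makes an independent copy: later in-place edits to pred do not affect it.
snapshot = pred.clone()
pred[0] = 0
print(snapshot.tolist(), pred.tolist())
```

Both lists carry the same label values; they differ only in element type, which matters mainly for whatever metric code consumes them afterwards.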

Summary: what is currently known to give good experimental results

After training:

        with torch.no_grad():
            output = model(batch_token_ids,None,None)
        #pred = torch.max(output,axis=-1)[1]
        pred = torch.max(output, 1)[1]
        print('pred = ')
        print(pred)
        eval_predict_label.extend(pred.cpu())
        eval_true_label.extend(batch_labels.cpu())

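First, pred must be printed; second, it must be outside the with torch.no_grad() block; third, pred.cpu() in eval_predict_label.extend(pred.cpu()) must not be followed by .tolist(); fourth, the model itself must be correct.
1. Try removing the print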

    for batch_token_ids,batch_labels in tqdm(test_loader,bar_format='{l_bar}%s{bar}%s{r_bar}' % (Fore.BLUE, Fore.RESET)):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)
        #eval_true_label.extend(batch_labels)
        with torch.no_grad():
            output = model(batch_token_ids,None,None)
        #pred = torch.max(output,axis=-1)[1]
        pred = torch.max(output, 1)[1]
        #print('pred = ')
        #print(pred)
            #predlabel = torch.clone(pred)
            #predlabel = predlabel.cpu()
            #predlabel = predlabel.tolist()
        eval_predict_label.extend(pred.cpu())
        eval_true_label.extend(batch_labels.cpu())

This case sometimes works and sometimes does not.
2. Try adding .tolist()

    for batch_token_ids,batch_labels in tqdm(test_loader,bar_format='{l_bar}%s{bar}%s{r_bar}' % (Fore.BLUE, Fore.RESET)):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)
        with torch.no_grad():
            output = model(batch_token_ids,None,None)
        pred = torch.max(output, 1)[1]
        eval_predict_label.extend(pred.cpu().tolist())
        eval_true_label.extend(batch_labels.cpu())

The results were not good.
3. Try moving it inside the with torch.no_grad() block

    for batch_token_ids,batch_labels in tqdm(test_loader,bar_format='{l_bar}%s{bar}%s{r_bar}' % (Fore.BLUE, Fore.RESET)):
        batch_token_ids = batch_token_ids.to(device)
        batch_labels = batch_labels.to(device)
        with torch.no_grad():
            output = model(batch_token_ids,None,None)
            pred = torch.max(output, 1)[1]
            eval_predict_label.extend(pred.cpu())
            eval_true_label.extend(batch_labels.cpu())

In addition, if the previous round of training went badly, it may affect the next round of data.
If the loop header

for batch_token_ids,batch_labels in test_loader:

is replaced with

for batch_token_ids,batch_labels in tqdm(test_loader,bar_format='{l_bar}%s{bar}%s{r_bar}' % (Fore.BLUE, Fore.RESET)):

training may take a turn for the better.

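The optimizer's update direction

I suspect that the same model gives different training results in different states because of the optimizer's update direction, so I switched to the AdamW optimizer for training.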
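Switching to AdamW is a one-line change. In the sketch below, the nn.Linear model is a hypothetical stand-in, lr=1e-5 reuses the learning rate that worked earlier in this post, and weight_decay=0.01 is an assumed value, not a tuned setting.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 2)  # hypothetical stand-in for the classification model

# lr taken from the earlier experiment; weight_decay is an assumed value.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
print(optimizer.defaults["lr"], optimizer.defaults["weight_decay"])
```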
