Comparing single-task and multi-task BERT training on a multi-class classification problem

While learning how to implement multi-class classification with Hugging Face, I used the dataset from the Kaggle competition Feedback Prize - Predicting Effective Arguments.

Feedback Prize - Predicting Effective Arguments/Dataset

The goal of the competition is to classify argumentative elements in student writing as "Effective", "Adequate", or "Ineffective".

The provided dataset contains argumentative essays written by U.S. students in grades 6-12. The essays were annotated by expert raters for discourse elements commonly found in argumentative writing:

Lead - an introduction that begins with a statistic, a quotation, a description, or some other device

Position - an opinion or conclusion on the main question

Claim - a claim that supports the position

Counterclaim - a claim that refutes another claim or gives an opposing reason to the position

Rebuttal - a claim that refutes a counterclaim

Evidence - ideas or examples that support claims, counterclaims, or rebuttals

Concluding Statement - a concluding statement that restates the claims

Participants are asked to predict the quality rating of each discourse element. Human readers rated each rhetorical or argumentative element, in order of increasing quality, as one of: Ineffective, Adequate, Effective.

Here we mainly use the dataset's train.csv and test.csv, whose contents look like this:

train.csv ...
   discourse_id      essay_id                                     discourse_text discourse_type discourse_effectiveness
0  0013cc385424  007ACE74B050  Hi, i'm Isaac, i'm going to be writing about h...           Lead                Adequate
1  9704a709b505  007ACE74B050  On my perspective, I think that the face is a ...       Position                Adequate
2  c22adee811b6  007ACE74B050  I think that the face is a natural landform be...          Claim                Adequate
3  a10d361e54e4  007ACE74B050  If life was on Mars, we would know by now. The...       Evidence                Adequate
4  db3e453ec4e2  007ACE74B050  People thought that the face was formed by ali...   Counterclaim                Adequate
test.csv ...
   discourse_id      essay_id                                     discourse_text discourse_type
0  a261b6e14276  D72CB1C11673  Making choices in life can be very difficult. ...           Lead
1  5a88900e7dc1  D72CB1C11673  Seeking multiple opinions can help a person ma...       Position
2  9790d835736b  D72CB1C11673                     it can decrease stress levels           Claim
3  75ce6d68b67b  D72CB1C11673             a great chance to learn something new           Claim
4  93578d946723  D72CB1C11673               can be very helpful and beneficial.           Claim
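
As a quick sanity check before training, here is a small sketch (using the same file paths as the scripts later in this post; adjust them to your own setup) that loads the two CSVs and looks at the label distributions:

import pandas as pd

# same paths as used in the training scripts below; change them for your environment
train_df = pd.read_csv('/home/cjw/kaggle/feedback/train.csv')
test_df = pd.read_csv('/home/cjw/kaggle/feedback/test.csv')

print(train_df.head())
print(test_df.head())
# class balance of the 3-way effectiveness label and the 7-way discourse type
print(train_df['discourse_effectiveness'].value_counts())
print(train_df['discourse_type'].value_counts())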

Of course, this competition cannot be solved with just a single pretrained model; here we simply borrow the dataset for a quick experiment with BERT multi-class classification.

One approach is to fine-tune BERT: take the [CLS] token from BERT's output and feed it into an MLP layer. In this task, for example, we want three-way classification, so that each input sentence ends up classified as Ineffective, Adequate, or Effective.

Let's go straight to the code. The pretrained model I use here is microsoft/deberta-base; from a usage standpoint we are not comparing DeBERTa with BERT, so just treat it as BERT.

from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch
import time
from transformers import get_cosine_schedule_with_warmup
from d2l import torch as d2l
import pandas as pd

# Define the downstream task model
class Model(torch.nn.Module):
    def __init__(self, checkpoint, config):
        super().__init__()
        self.pretrained = AutoModel.from_pretrained(checkpoint, config=config)
        # classification head: 768-d [CLS] vector -> 3 effectiveness classes
        self.fc = torch.nn.Sequential(torch.nn.Linear(768, 3))

    def forward(self, input_ids, attention_mask, token_type_ids):
        logits = self.pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # take the hidden state of the first token ([CLS]) as the sentence representation
        logits = logits.last_hidden_state[:, 0]
        logits = self.fc(logits)
        # softmax so the model outputs class probabilities (written directly into the submission);
        # CrossEntropyLoss applies log-softmax internally, so raw logits would also work for training
        logits = logits.softmax(dim=1)
        
        return logits

# Define the training dataset
class myDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids,label ):
        super(myDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
        self.label = torch.tensor(label)

    def __len__(self):
        return self.sentences.shape[0]
    
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx], self.label[idx]

# Define the test dataset (no labels)
class testDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids):
        super(testDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)

    def __len__(self):
        return self.sentences.shape[0]
    
    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx]

# Read the data file: tokenize discourse_text and map the effectiveness labels to ids
def load_data(file_path, tokenizer):
    df = pd.read_csv(file_path)
    sentences = df['discourse_text'].tolist()
    label_effectiveness = df['discourse_effectiveness'].replace({'Adequate':0, 'Effective':1, 'Ineffective':2}).tolist()

    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    return input_ids, label_effectiveness, token_type_ids, attention_mask

# Training function
def train(net, train_iter, lr, weight_decay, num_epochs, devices):
    total_time = 0
    train_len = len(Inputid_train)
    train_loss, train_acc = [], []
    net = torch.nn.DataParallel(net.to(devices[0]))
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
    schedule = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=len(train_iter), num_training_steps=num_epochs*len(train_iter)
    )
    for epoch in range(num_epochs):
        start_of_epoch = time.time()
        cor = 0
        loss_sum = 0
        net.train()
        for idx,(ids,att_mask,type,y) in enumerate(train_iter):
            optimizer.zero_grad()
            ids, att_mask,type, y = ids.to(devices[0]), att_mask.to(devices[0]),type.to(devices[0]),y.to(devices[0])
            out_train = net(ids,att_mask,type)
            l = loss(out_train, y)
            l.backward()
            optimizer.step()
            schedule.step()
            loss_sum += l.item()
            if (idx + 1) % 20 == 0:
                print("Epoch {:04d} | Step {:06d}/{:06d} | Loss {:.4f} | Time {:.0f}".format(epoch + 1, idx + 1, len(train_iter), loss_sum / (idx + 1), time.time() - start_of_epoch))
            # accumulate correct predictions for the epoch-level training accuracy
            out_train = out_train.argmax(dim=1)
            cor += (out_train == y).sum().item()

        acc = cor / train_len
        print(acc)

        if epoch % 1 == 0:
            print(f'epoch {epoch + 1}, train_loss {loss_sum / (len(train_iter))},  train_acc {acc}')
            train_loss.append(loss_sum / len(train_iter))
            train_acc.append(acc)
    
        end_of_epoch = time.time()
        print("epoch {} duration:".format(epoch + 1), end_of_epoch - start_of_epoch)
        total_time += end_of_epoch - start_of_epoch
 
    print("total training time: ",total_time)

# Inference on the test set
def eval(test_path, net, devices, test_batch_size):
    df = pd.read_csv(test_path)
    sentences = df['discourse_text'].tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    # shuffle=False keeps predictions aligned with the row order of test.csv (and the submission file)
    test_iter = torch.utils.data.DataLoader(testDataset(input_ids, attention_mask, token_type_ids), test_batch_size, shuffle=False)
    net.eval()
    outputs = []
    with torch.no_grad():
        for ids, att, tpe in test_iter:
            ids, att, tpe = ids.to(devices[0]), att.to(devices[0]), tpe.to(devices[0])
            outputs.append(net(ids, att, tpe))

    return torch.cat(outputs, dim=0)

checkpoint = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
train_path = '/home/cjw/kaggle/feedback/train.csv'
test_path = '/home/cjw/kaggle/feedback/test.csv'
Inputid_train, Labelid_train, typeids_train, inputmask_train = load_data(train_path, tokenizer)
batch_size = 8
dataset = myDataset(Inputid_train, inputmask_train, typeids_train, Labelid_train)
train_iter = torch.utils.data.DataLoader(dataset, batch_size, True)
net = Model(checkpoint, config)
num_epochs, lr, weight_decay, devices = 10, 2e-5, 1e-4, d2l.try_all_gpus()
print("baseline:",checkpoint)
print("training...")
train(net, train_iter, lr, weight_decay, num_epochs, devices)
print("evaling...")
predictions = eval(test_path, net, devices, 10).cpu()
submission = pd.read_csv('/home/cjw/kaggle/feedback/sample_submission.csv')
submission['Adequate'] = predictions[:, 0]
submission['Effective'] = predictions[:, 1]
submission['Ineffective'] = predictions[:, 2]
print(submission)
submission.to_csv('submission.csv', index=False)

I only ran 10 epochs here, since the texts in this dataset are long and training takes time; from the results you can see that the loss and accuracy were still improving.

The results are as follows:

# Single-task
baseline: microsoft/deberta-base
training...
...
epoch 6, train_loss 0.9447965523255608,  train_acc 0.6066639466884265
epoch 7, train_loss 0.934909532415649,  train_acc 0.6165374677002584
epoch 8, train_loss 0.9426709712978235,  train_acc 0.6087583299333605
epoch 9, train_loss 0.922136064427536,  train_acc 0.6262749898000816
epoch 10, train_loss 0.9017313893615317,  train_acc 0.6450700394396844
evaling...
   discourse_id   Ineffective  Adequate  Effective
0  a261b6e14276  7.290736e-07  0.997184   0.002815
1  5a88900e7dc1  9.837714e-07  0.013967   0.986032
2  9790d835736b  3.147095e-07  0.999499   0.000500
3  75ce6d68b67b  3.179148e-07  0.999486   0.000514
4  93578d946723  3.764216e-07  0.999225   0.000775
5  2e214524dbe3  3.639196e-07  0.999282   0.000717
6  84812fc2ab9f  3.567264e-06  0.959672   0.040324
7  c668ff840720  3.426721e-06  0.143535   0.856462
8  739a6d00f44a  2.438440e-06  0.978529   0.021468
9  bcfae2c9a244  2.324955e-06  0.070954   0.929044

Next, consider this: whether a sentence is "Effective", "Adequate", or "Ineffective" is surely also related to which discourse element the sentence belongs to. So can we, during training, classify each sentence both into the three effectiveness classes and into the seven discourse-element classes? The answer is yes. We know that BERT itself is pretrained in a multi-task fashion, with a masked language modeling task and a next sentence prediction task.

Let's try something simple. Both tasks are classification, so in general their loss values should be on a similar scale and they should converge at roughly similar rates, so we simply add the two classification losses together and optimize the sum. On the model side, we add two MLP heads, one for each classification task.

Now let's verify this idea:

from transformers import AutoConfig, AutoModel, AutoTokenizer
import torch
import time
from transformers import get_cosine_schedule_with_warmup
from d2l import torch as d2l
import pandas as pd

# Define the downstream task model (now with two classification heads)
class Model(torch.nn.Module):
    def __init__(self, checkpoint, config):
        super().__init__()
        self.pretrained = AutoModel.from_pretrained(checkpoint, config=config)
        self.fc_a = torch.nn.Sequential(torch.nn.Linear(768, 3))   # head A: 3-way effectiveness
        self.fc_b = torch.nn.Sequential(torch.nn.Linear(768, 7))   # head B: 7-way discourse type

    def forward(self, input_ids, attention_mask, token_type_ids, class_num=3):
        logits = self.pretrained(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # [CLS] representation, routed to the head selected by class_num (3 -> effectiveness, 7 -> discourse type)
        logits = logits.last_hidden_state[:, 0]
        if class_num == 3:
            logits = self.fc_a(logits)
        elif class_num == 7:
            logits = self.fc_b(logits)
        logits = logits.softmax(dim=1)
        return logits

# Define the training dataset (two labels per example)
class myDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids, label_effectiveness, label_type):
        super(myDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)
        self.label_effectiveness = torch.tensor(label_effectiveness)
        self.label_type = torch.tensor(label_type)

    def __len__(self):
        return self.sentences.shape[0]

    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx], self.label_effectiveness[idx], self.label_type[idx]

# Define the test dataset (no labels)
class testDataset(torch.utils.data.Dataset):
    def __init__(self, sentences, attention_mask, token_type_ids):
        super(testDataset, self).__init__()
        self.sentences = torch.tensor(sentences)
        self.attention_mask = torch.tensor(attention_mask)
        self.token_type_ids = torch.tensor(token_type_ids)

    def __len__(self):
        return self.sentences.shape[0]

    def __getitem__(self, idx):
        return self.sentences[idx], self.attention_mask[idx], self.token_type_ids[idx]

# Read the data file: tokenize discourse_text and map both label columns to ids
def load_data(file_path, tokenizer):
    df = pd.read_csv(file_path)
    sentences = df['discourse_text'].tolist()
    label_effectiveness = df['discourse_effectiveness'].replace({'Adequate':0, 'Effective':1, 'Ineffective':2}).tolist()
    label_type = df['discourse_type'].replace({'Lead':0, 'Position':1, 'Claim':2, 'Counterclaim':3, 'Rebuttal':4, 'Evidence':5, 'Concluding Statement':6}).tolist()

    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    return input_ids, label_effectiveness, token_type_ids, attention_mask, label_type

# Training function
def train(net, train_iter, lr, weight_decay, num_epochs, devices):
    total_time = 0
    train_len = len(Inputid_train)
    net = torch.nn.DataParallel(net.to(devices[0]))
    loss = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr, weight_decay=weight_decay)
    schedule = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=len(train_iter), num_training_steps=num_epochs*len(train_iter)
    )
    for epoch in range(num_epochs):
        start_of_epoch = time.time()
        loss_sum = 0
        cor_a = 0
        net.train()
        for idx,(ids,att_mask,type,y_a, y_b) in enumerate(train_iter):
            optimizer.zero_grad()
            ids, att_mask,type, y_a, y_b = ids.to(devices[0]), att_mask.to(devices[0]), type.to(devices[0]), y_a.to(devices[0]), y_b.to(devices[0])
            # two forward passes through the shared encoder, one per task head
            output_a = net(ids, att_mask, type)                # 3-way effectiveness
            output_b = net(ids, att_mask, type, class_num=7)   # 7-way discourse type
            l_a = loss(output_a, y_a)
            l_b = loss(output_b, y_b)
            l = l_a + l_b   # simply sum the two losses
            l.backward()
            optimizer.step()
            schedule.step()
            loss_sum += l.item()
            if (idx + 1) % 20 == 0:
                print("Epoch {:04d} | Step {:06d}/{:06d} | Loss {:.4f} | Time {:.0f}".format(epoch + 1, idx + 1, len(train_iter), loss_sum / (idx + 1), time.time() - start_of_epoch))
            # track accuracy only for task A (effectiveness), so it can be compared with the single-task run
            output_a = output_a.argmax(dim=1)
            cor_a += (output_a == y_a).sum().item()

        acc_a = cor_a / train_len

        if epoch % 1 == 0:
            print(f'epoch {epoch + 1}, train_loss {loss_sum / (len(train_iter))}, train_acc_a {acc_a}')

        end_of_epoch = time.time()
        print("epoch {} duration:".format(epoch + 1), end_of_epoch - start_of_epoch)
        total_time += end_of_epoch - start_of_epoch
 
    print("total training time: ",total_time)

# Inference on the test set
def eval(test_path, net, devices, test_batch_size):
    df = pd.read_csv(test_path)
    sentences = df['discourse_text'].tolist()
    token_type_ids, attention_mask, input_ids = [], [], []
    for sentence in sentences:
        encode_dict = tokenizer.encode_plus(sentence, max_length=512, padding="max_length", truncation=True)
        input_ids.append(encode_dict["input_ids"])
        token_type_ids.append(encode_dict["token_type_ids"])
        attention_mask.append(encode_dict["attention_mask"])
    # shuffle=False keeps predictions aligned with the row order of test.csv (and the submission file)
    test_iter = torch.utils.data.DataLoader(testDataset(input_ids, attention_mask, token_type_ids), test_batch_size, shuffle=False)
    net.eval()
    outputs = []
    with torch.no_grad():
        for ids, att, tpe in test_iter:
            ids, att, tpe = ids.to(devices[0]), att.to(devices[0]), tpe.to(devices[0])
            outputs.append(net(ids, att, tpe))

    return torch.cat(outputs, dim=0)

checkpoint = 'microsoft/deberta-base'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
config = AutoConfig.from_pretrained(checkpoint)
train_path = '/home/cjw/kaggle/feedback/train.csv'
test_path = '/home/cjw/kaggle/feedback/test.csv'
Inputid_train, Labelid_train, typeids_train, inputmask_train, label_type = load_data(train_path, tokenizer)
batch_size = 8
train_iter = torch.utils.data.DataLoader(myDataset(Inputid_train, inputmask_train, typeids_train, Labelid_train, label_type), batch_size, drop_last = True)
net = Model(checkpoint, config)
num_epochs, lr, weight_decay, devices = 10, 2e-5, 1e-4, d2l.try_all_gpus()
print("baseline:",checkpoint)
print("training...")
train(net, train_iter, lr, weight_decay, num_epochs, devices)
print("evaling...")
predictions = eval(test_path, net, devices, 10).cpu()
submission = pd.read_csv('/home/cjw/kaggle/feedback/sample_submission.csv')
submission['Adequate'] = predictions[:, 0]
submission['Effective'] = predictions[:, 1]
submission['Ineffective'] = predictions[:, 2]
print(submission)
submission.to_csv('submission.csv', index=False)

Again I only ran 10 epochs. To keep the comparison controlled, the two scripts differ only in the single-task vs. multi-task setup. The results again show that the loss and accuracy were still improving, but since the goal here is only to compare single-task and multi-task training, we don't chase the best possible score.

# Multi-task
baseline: microsoft/deberta-base
training...
...
epoch 6, train_loss 2.396573264969835, train_acc_a 0.7231334447860718
epoch 7, train_loss 2.3707293589823393, train_acc_a 0.7420101165771484
epoch 8, train_loss 2.350535287120267, train_acc_a 0.7575955390930176
epoch 9, train_loss 2.3366618679968134, train_acc_a 0.768774688243866
epoch 10, train_loss 2.3300860160063865, train_acc_a 0.7741058468818665
evaling...
   discourse_id  Ineffective  Adequate  Effective
0  a261b6e14276     0.000032  0.749933   0.250035
1  5a88900e7dc1     0.000004  0.000244   0.999752
2  9790d835736b     0.000010  0.999960   0.000030
3  75ce6d68b67b     0.000047  0.993598   0.006354
4  93578d946723     0.000030  0.726694   0.273276
5  2e214524dbe3     0.000015  0.000096   0.999889
6  84812fc2ab9f     0.000065  0.988569   0.011366
7  c668ff840720     0.000003  0.000084   0.999913
8  739a6d00f44a     0.000016  0.999982   0.000002
9  bcfae2c9a244     0.000053  0.888959   0.110988

Comparing the training accuracy of the two runs, the multi-task result is about 13 percentage points higher than the single-task result, which suggests that in this setting, adding the extra supervision signal (the discourse-type labels) improves the main task.

To reiterate: the goal here is to learn how to do multi-task learning with a pretrained model. For an actual competition or a production system this setup is too expensive; it is better to do some feature engineering before feeding the data into the pretrained model, otherwise you will run out of GPU memory or training will take a very long time.
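
As one example of such preprocessing (a rough sketch of my own, not something done in the experiments above), you can check the token-length distribution of discourse_text and pick a tighter max_length than the 512 used here, which directly reduces GPU memory use and per-step compute:

# a rough sketch, reusing the tokenizer and train_path defined earlier in this post
import pandas as pd

texts = pd.read_csv(train_path)['discourse_text'].tolist()
lengths = pd.Series([len(tokenizer.encode(t, truncation=True, max_length=512)) for t in texts])
print(lengths.describe())         # distribution of tokenized lengths
print((lengths > 256).mean())     # fraction of texts longer than 256 tokens
# if that fraction is small, max_length=256 roughly halves the sequence length,
# GPU memory use, and per-step compute compared with max_length=512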

Also, if the two tasks differ a lot, hyperparameters such as the learning rate do not have to be set the same for both.
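
If the tasks do call for different hyperparameters, one common pattern is to give the shared encoder and each task head their own learning rate via AdamW parameter groups, and to weight the two losses instead of simply summing them. The sketch below builds on the multi-task Model above; the learning rates and loss weights are illustrative values, not tuned ones:

# a sketch on top of the multi-task model above (net), before DataParallel wrapping;
# the per-group learning rates and the task weights are illustrative, not tuned
optimizer = torch.optim.AdamW(
    [
        {'params': net.pretrained.parameters(), 'lr': 2e-5},  # shared encoder
        {'params': net.fc_a.parameters(), 'lr': 1e-4},        # 3-way effectiveness head
        {'params': net.fc_b.parameters(), 'lr': 1e-4},        # 7-way discourse-type head
    ],
    weight_decay=1e-4,
)

lambda_a, lambda_b = 1.0, 0.5   # hypothetical task weights
# inside the training loop the combined loss would then be:
# l = lambda_a * l_a + lambda_b * l_b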
