A Text Regression Solution Based on Pretrained Models

Competition Background

Reading is a fundamental skill for academic success. When students practice reading challenging passages, they naturally develop their reading ability.

Current educational texts are matched to readers using traditional readability formulas, but these lack construct and theoretical validity. CommonLit is a nonprofit education technology organization that provides free digital reading and writing lessons for grades 3-12 to more than 20 million teachers and students.

Competition Task

In this competition, you will build an algorithm to rate the complexity of reading passages used in grade 3-12 classrooms. The dataset includes readers from a wide range of age groups and a large collection of texts from various domains. Winning models are expected to combine text cohesion and semantics.

The algorithms developed in this competition will help teachers and students quickly and accurately assess classroom reading assignments, making it easier for students to build essential reading skills.

Data Description

The data consists of reading passages for grades 3-12, each to be rated for complexity.

Each field is described as follows:

  • id: unique ID for each excerpt
  • url_legal: source URL of the data; blank in the test set so competitors cannot identify the source
  • license: license of the source material; blank in the test set
  • excerpt: the text passage whose readability is to be predicted
  • target: readability score; lower values indicate text that is harder to read
  • standard_error: measure of the spread of scores among the multiple raters of each excerpt; not included in the test data


The fields mainly used are the text excerpt and the target score: competitors must build a regression model that infers the score from the text alone. It is a bit like a classroom full of students writing essays, each essay being passed to other readers who judge how readable it is, i.e. how easily it can be understood.

Evaluation Metric

Submissions are scored on the root mean squared error (RMSE) between the predicted and actual scores. RMSE is defined as:

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
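For reference, the metric can be computed directly with NumPy (a minimal sketch; the example arrays are placeholder values, not competition data):

import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error between true and predicted readability scores
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

print(rmse([-1.0, 0.5, -2.3], [-0.8, 0.2, -2.0]))  # ≈ 0.271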

Solution: Text Regression with Pretrained Models

Data Analysis

import pandas as pd

train_df = pd.read_csv("../input/commonlitreadabilityprize/train.csv")
test_df = pd.read_csv("../input/commonlitreadabilityprize/test.csv")


First, let's look at the distribution of the target. Most values are concentrated around -1, with a minimum of roughly -4 and a maximum of roughly 2.
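A minimal sketch of how this distribution can be plotted (assuming the usual matplotlib/seaborn stack):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.histplot(train_df['target'], bins=50, kde=True)
plt.title("Distribution of target (readability score)")
plt.xlabel("target")
plt.show()

print(train_df['target'].describe())  # summary statistics of the score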

We can also look at which words and phrases occur most frequently across the corpus:

import matplotlib.pyplot as plt
import seaborn as sns
from pandas import DataFrame

# `excerpt_preprocessed` is a lightly cleaned copy of `excerpt`;
# `get_top_n_words` counts the most frequent unigrams (see the sketch below)
common_words = get_top_n_words(train_df['excerpt_preprocessed'], 20)
common_words_df1 = DataFrame(common_words, columns=['word', 'freq'])

plt.figure(figsize=(16, 8))
ax = sns.barplot(x='freq', y='word', data=common_words_df1, facecolor=(0, 0, 0, 0),
                 linewidth=3, edgecolor=sns.color_palette("ch:start=3, rot=.1", 20))
plt.title("Top 20 unigrams", fontfamily='serif')
plt.xlabel("Frequency", fontsize=14)
plt.yticks(fontsize=13)
plt.xticks(rotation=45, fontsize=13)
plt.ylabel("");

# `plot_bt` wraps the same bar-plot logic for bigrams and trigrams
common_words_df2 = plot_bt(get_top_n_bigram, "bigrams", "ch:rot=-.5")
common_words_df3 = plot_bt(get_top_n_trigram, "trigrams", "ch:start=-1, rot=-.6")
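The helpers get_top_n_words, get_top_n_bigram and get_top_n_trigram are not shown in the original snippet; a minimal sketch of how they might be implemented with scikit-learn's CountVectorizer follows (these are assumed implementations, and plot_bt is assumed to be an analogous bar-plot wrapper):

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_ngrams(corpus, n=20, ngram_range=(1, 1)):
    # return the n most frequent n-grams in the corpus as (ngram, count) pairs
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    return sorted(words_freq, key=lambda x: x[1], reverse=True)[:n]

def get_top_n_words(corpus, n=20):
    return get_top_n_ngrams(corpus, n, ngram_range=(1, 1))

def get_top_n_bigram(corpus, n=20):
    return get_top_n_ngrams(corpus, n, ngram_range=(2, 2))

def get_top_n_trigram(corpus, n=20):
    return get_top_n_ngrams(corpus, n, ngram_range=(3, 3))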




Since the corpus consists of school-grade reading passages, most of the vocabulary is, as expected, common entry-level words.

Pretrained Model: Continued Pretraining

First, import the required packages:

import argparse
import logging
import math
import os
import random

import datasets
from datasets import load_dataset
from tqdm.auto import tqdm
from accelerate import Accelerator

import torch
from torch.utils.data import DataLoader

import transformers
from transformers import (
    CONFIG_MAPPING, 
    MODEL_MAPPING, 
    AdamW, 
    AutoConfig, 
    AutoModelForMaskedLM, 
    AutoTokenizer, 
    DataCollatorForLanguageModeling, 
    SchedulerType, 
    get_scheduler, 
    set_seed
)

logger = logging.getLogger(__name__)
MODEL_CONFIG_CLASSES = list(MODEL_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)


Main pretraining parameter settings:

class TrainConfig:
    train_file= 'mlm_data.csv'
    validation_file = 'mlm_data.csv'
    validation_split_percentage= 5
    pad_to_max_length= True
    model_name_or_path= 'roberta-base'
    config_name= 'roberta-base'
    tokenizer_name= 'roberta-base'
    use_slow_tokenizer= True
    per_device_train_batch_size= 8
    per_device_eval_batch_size= 8
    learning_rate= 5e-5
    weight_decay= 0.0
    num_train_epochs= 1 # change to 5
    max_train_steps= None
    gradient_accumulation_steps= 1
    lr_scheduler_type= 'constant_with_warmup'
    num_warmup_steps= 0
    output_dir= 'output'
    seed= 2021
    model_type= 'roberta'
    max_seq_length= None
    line_by_line= False
    preprocessing_num_workers= 4
    overwrite_cache= True
    mlm_probability= 0.15

config = TrainConfig()
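The full pretraining script also builds the tokenized dataset, data collator, dataloaders, optimizer and scheduler before entering the training loop. Below is a condensed sketch of that setup, following the standard Hugging Face run_mlm_no_trainer recipe; `tokenized_datasets` (the tokenized and grouped excerpts) is assumed to have been prepared already:

accelerator = Accelerator()
set_seed(config.seed)

tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name, use_fast=not config.use_slow_tokenizer)
model = AutoModelForMaskedLM.from_pretrained(config.model_name_or_path)
model.resize_token_embeddings(len(tokenizer))

# `tokenized_datasets` is assumed: the train/validation splits after tokenization and grouping
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=config.mlm_probability)
train_dataloader = DataLoader(train_dataset, shuffle=True, collate_fn=data_collator,
                              batch_size=config.per_device_train_batch_size)
eval_dataloader = DataLoader(eval_dataset, collate_fn=data_collator,
                             batch_size=config.per_device_eval_batch_size)

optimizer = AdamW(model.parameters(), lr=config.learning_rate, weight_decay=config.weight_decay)
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

# derive the total number of optimizer updates if it was not set explicitly
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / config.gradient_accumulation_steps)
if config.max_train_steps is None:
    config.max_train_steps = config.num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(name=config.lr_scheduler_type, optimizer=optimizer,
                             num_warmup_steps=config.num_warmup_steps,
                             num_training_steps=config.max_train_steps)

progress_bar = tqdm(range(config.max_train_steps), disable=not accelerator.is_local_main_process)
completed_steps = 0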

The masked language modeling (MLM) training loop is as follows:

for epoch in range(config.num_train_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        loss = loss / config.gradient_accumulation_steps
        accelerator.backward(loss)
        # only update the weights every `gradient_accumulation_steps` mini-batches
        if step % config.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
            completed_steps += 1

        if completed_steps >= config.max_train_steps:
            break

    # evaluate perplexity on the validation split after each epoch
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(config.per_device_eval_batch_size)))

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]
    perplexity = math.exp(torch.mean(losses))

    logger.info(f"epoch {epoch}: perplexity: {perplexity}")

# save the further-pretrained checkpoint for the fine-tuning stage
if config.output_dir is not None:
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(config.output_dir, save_function=accelerator.save)

Pretrained Model: Fine-Tuning for the Regression Task

Import the required packages:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim.optimizer import Optimizer
import torch.optim.lr_scheduler as lr_scheduler
from torch.utils.data import (
    Dataset, DataLoader, 
    SequentialSampler, RandomSampler
)
from transformers import AutoConfig
from transformers import (
    get_cosine_schedule_with_warmup, 
    get_cosine_with_hard_restarts_schedule_with_warmup,
    get_linear_schedule_with_warmup
)
from transformers import AdamW
from transformers import AutoTokenizer
from transformers import AutoModel
from transformers import MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING
from IPython.display import clear_output
from tqdm import tqdm, trange

Custom dataset loader:

class DatasetRetriever(Dataset):
    def __init__(self, data, tokenizer, max_len, is_test=False):
        self.data = data
        if 'excerpt' in self.data.columns:
            self.excerpts = self.data.excerpt.values.tolist()
        else:
            self.excerpts = self.data.text.values.tolist()
        if 'target' in self.data.columns:
            self.targets = self.data.target.values.tolist()
        else:
            # the test set has no target column, so fall back to dummy labels
            self.targets = [0.0] * len(self.data)
        self.tokenizer = tokenizer
        self.is_test = is_test
        self.max_len = max_len
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, item):
        excerpt, label = self.excerpts[item], self.targets[item]
        features = convert_examples_to_features(
            excerpt, self.tokenizer, 
            self.max_len, self.is_test
        )
        return {
            'input_ids':torch.tensor(features['input_ids'], dtype=torch.long),
            'token_type_ids':torch.tensor(features['token_type_ids'], dtype=torch.long),
            'attention_mask':torch.tensor(features['attention_mask'], dtype=torch.long),
            'label':torch.tensor(label, dtype=torch.double),
        }
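The convert_examples_to_features helper is not shown in the original snippet; a minimal sketch of what it is expected to return, using the standard Hugging Face tokenizer call with fixed-length padding and truncation (an assumed implementation, not the author's exact one):

def convert_examples_to_features(text, tokenizer, max_len, is_test=False):
    # tokenize one excerpt into fixed-length input_ids / attention_mask / token_type_ids
    encoded = tokenizer(
        text,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_token_type_ids=True,
    )
    return {
        'input_ids': encoded['input_ids'],
        'token_type_ids': encoded['token_type_ids'],
        'attention_mask': encoded['attention_mask'],
    }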

Building the regression model:

class CommonLitModel(nn.Module):
    def __init__(
        self, 
        model_name, 
        config,  
        multisample_dropout=False,
        output_hidden_states=False
    ):
        super(CommonLitModel, self).__init__()
        self.config = config
        self.roberta = AutoModel.from_pretrained(
            model_name, 
            output_hidden_states=output_hidden_states
        )
        self.layer_norm = nn.LayerNorm(config.hidden_size)
        if multisample_dropout:
            self.dropouts = nn.ModuleList([
                nn.Dropout(0.5) for _ in range(5)
            ])
        else:
            self.dropouts = nn.ModuleList([nn.Dropout(0.3)])
        #self.regressor = nn.Linear(config.hidden_size*2, 1)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self._init_weights(self.layer_norm)
        self._init_weights(self.regressor)
 
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
 
    def forward(
        self, 
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        labels=None
    ):
        outputs = self.roberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        # outputs[1] is the pooled output of the first token, shape (batch_size, hidden_size)
        sequence_output = outputs[1]
        sequence_output = self.layer_norm(sequence_output)
 
        # max-avg head
        # average_pool = torch.mean(sequence_output, 1)
        # max_pool, _ = torch.max(sequence_output, 1)
        # concat_sequence_output = torch.cat((average_pool, max_pool), 1)
 
        # multi-sample dropout
        for i, dropout in enumerate(self.dropouts):
            if i == 0:
                logits = self.regressor(dropout(sequence_output))
            else:
                logits += self.regressor(dropout(sequence_output))
        
        logits /= len(self.dropouts)
 
        # calculate loss
        loss = None
        if labels is not None:
            # regression task
            loss_fn = torch.nn.MSELoss()
            logits = logits.view(-1).to(labels.dtype)
            loss = torch.sqrt(loss_fn(logits, labels.view(-1)))
        
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output
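A minimal sketch of how the model might be instantiated and fine-tuned on the training excerpts; the checkpoint path, max_len, batch size, learning rate and epoch count below are illustrative assumptions rather than the author's exact settings:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 'roberta-base' could instead point at the output_dir of the continued-pretraining stage
tokenizer = AutoTokenizer.from_pretrained('roberta-base')
model_config = AutoConfig.from_pretrained('roberta-base')
model = CommonLitModel('roberta-base', model_config, multisample_dropout=True).to(device)

train_dataset = DatasetRetriever(train_df, tokenizer, max_len=256)
train_loader = DataLoader(train_dataset, batch_size=8, sampler=RandomSampler(train_dataset))

optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=len(train_loader) * 3)

model.train()
for epoch in range(3):
    for batch in tqdm(train_loader):
        batch = {k: v.to(device) for k, v in batch.items()}
        # the forward pass returns (loss, logits) when labels are provided
        loss, _ = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch['token_type_ids'],
            labels=batch['label'],
        )
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()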

Summary

Building on the baseline, the more diverse the models in the ensemble, the better the accuracy (a blending sketch follows this list):

  • Model diversity: BERT, DistilRoBERTa, RoBERTa
  • Parameter diversity: random re-initialization of the top layers, differences in hyperparameters
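A minimal sketch of how predictions from several such fine-tuned models might be blended (the model names and values are illustrative placeholders):

import numpy as np

# predictions from independently fine-tuned models (e.g. RoBERTa, DistilRoBERTa, BERT)
fold_predictions = {
    'roberta': np.array([-0.9, 0.3, -2.1]),
    'distilroberta': np.array([-1.0, 0.4, -2.0]),
    'bert': np.array([-0.8, 0.2, -2.2]),
}

# simple unweighted average; weights could also be tuned against out-of-fold RMSE
ensemble = np.mean(list(fold_predictions.values()), axis=0)
print(ensemble)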

The complete code is available by private message to the author.
Author: 小李飞刀
