【NLP】(task9)Transformers解决机器翻译任务

山顶夕景

已于 2023-03-28 20:37:18 修改

阅读量5.5k

点赞数 7

分类专栏：自然语言处理文章标签：机器翻译自然语言处理深度学习

于 2021-08-31 18:33:20 首次发布

本文链接：https://blog.csdn.net/qq_35812205/article/details/120002522

版权

自然语言处理专栏收录该内容

88 篇文章 57 订阅

订阅专栏

学习总结

（1）之前的任务都是语言理解类任务（如文本分类、抽取式问答、序列标注等），而对于文本生成任务（如机器翻译、文本摘要等），这类任务不仅需要对作为条件的输入有较好的表示能力（编码器），同时需要较强大的（序列）解码器生成目标文本。沿着解码器利用自监督学习的方式进行预训练，大佬们提出了一些预训练生成模型：BART、UniLM、T5和GPT-3等，更多机器翻译论文可参考清华大学NLP组整理的论文——https://github.com/THUNLP-MT/MT-Reading-List。

（2）本次学习用Helsinki-NLP/opus-mt-en-ro预训练模型（含mBart模型等）+微调解决机器翻译任务的方法及步骤，步骤主要分为加载数据、数据预处理、微调预训练模型。

在加载数据阶段中，使用WMT数据集；
在数据预处理阶段中，对tokenizer分词器的建模，使用as_target_tokenizer控制target对应的特殊token，并完成数据集中所有样本的预处理；
在微调预训练模型阶段，使用Seq2SeqTrainingArguments对模型参数进行设置，并构建Seq2SeqTrainer训练器，进行模型训练和评估。

（3）问：同样的分类任务，有的是用BertTokenizer, BertForSequenceClassification 有的是用AutoTokenizer，AutoModelForSequenceClassification 这俩有什么区别吗？
答：Auto的字面含义就是：自动帮你寻找合适的模型，根据你传入的模型名字。比如你传了了一个bert-base-uncased的，它就会帮你找到对应的BertForSequenceClassification

（4）NLP生成任务用什么预训练模型效果更好，BERT、Roberta，GPT，T5？

1）RoBERTa（Robustly Optimized BERT Pre-training Approch）与BERT有何不同？
Facebook AI研究团队改进了BERT的训练，以进一步优化它：
1、他们使用160GB的文本，而不是最初用于训练BERT的16GB数据集。
2、将迭代次数从100K增加到300K，然后进一步增加到500K。
3、动态更改应用于训练数据的mask 模式。
4、从训练过程中删除下一个序列预测目标。

2）T5（Text-to-Text Transfer Transformer）使用同一套模型参数完成多项不同的条件式生成任务：
1、需要给模型注入任务信息，使其能够按照特定任务生成目标文本；
2、使用自然语言描述或简短提示（Prompt）作为输入文本的前缀表示目标任务。

3）GPT-3同样是将不同形式的NLP任务重定义为文本生成实现模型的通用化：
GPT-3主要展现的是超大规模语言模型的小样本学习（Few-shot learning）能力。其模型的输入不仅以自然语言或指令作为前缀表征目标任务，还使用少量的目标任务标注样本为条件上下文。

文章目录

建议直接使用google colab notebook打开本教程，可以快速下载相关数据集和模型。
如果正在google的colab中打开这个notebook，可能需要安装Transformers和🤗Datasets库。将以下命令取消注释即可安装。

! pip install datasets transformers sacrebleu sentencepiece

确保安装了transformer-quick-start-zh的readme文件中的所有依赖库。也可以在这里找到本notebook的多GPU分布式训练版本。

【注意】sacrebleu需要安装1.5.1版本，不然会报scr.DEFAULT_TOKENIZER找不到的错误：

!pip install sacrebleu==1.5.1

任务：微调transformer模型解决翻译任务

在这个notebook中，使用🤗 Transformers代码库中的模型来解决自然语言处理中的翻译任务。我们将会使用WMT dataset数据集。这是翻译任务最常用的数据集之一。

下面展示了一个例子：
在这里插入图片描述

对于翻译任务，我们将展示如何使用简单的加载数据集，同时针对相应的仍无使用transformer中的Trainer接口对模型进行微调。

model_checkpoint = "Helsinki-NLP/opus-mt-en-ro" 
# 选择一个模型checkpoint

只要预训练的transformer模型包含seq2seq结构的head层，那么本notebook理论上可以使用各种各样的transformer模型模型面板，解决任何翻译任务。

下面使用已经训练好的Helsinki-NLP/opus-mt-en-ro checkpoint来做翻译任务。

一、加载数据

我们将会使用🤗 Datasets库来加载数据和对应的评测方式。数据加载和评测方式加载只需要简单使用load_dataset和load_metric即可。我们使用WMT数据集中的English/Romanian双语翻译。

from datasets import load_dataset, load_metric

raw_datasets = load_dataset("wmt16", "ro-en")
metric = load_metric("sacrebleu")

这个datasets对象本身是一种DatasetDict数据结构. 对于训练集、验证集和测试集，只需要使用对应的key（train，validation，test）即可得到相应的数据。

raw_datasets

    DatasetDict({
        train: Dataset({
            features: ['translation'],
            num_rows: 610320
        })
        validation: Dataset({
            features: ['translation'],
            num_rows: 1999
        })
        test: Dataset({
            features: ['translation'],
            num_rows: 1999
        })
    })

给定一个数据切分的key（train、validation或者test）和下标即可查看数据。

raw_datasets["train"][0]
# 我们可以看到一句英语en对应一句罗马尼亚语言ro

    {'translation': {'en': 'Membership of Parliament: see Minutes',
      'ro': 'Componenţa Parlamentului: a se vedea procesul-verbal'}}

为了能够进一步理解数据长什么样子，下面的函数将从数据集里随机选择几个例子进行展示。

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

show_random_elements(raw_datasets["train"])

	translation
0	{'en': 'The Bulgarian gymnastics team won the gold medal at the traditional Grand Prix series competition in Thiais, France, which wrapped up on Sunday (March 30th).', 'ro': 'Echipa bulgară de gimnastică a câştigat medalia de aur la tradiţionala competiţie Grand Prix din Thiais, Franţa, care s-a încheiat duminică (30 martie).'}
1	{'en': 'Being on that committee, however, you will know that this was a very hot topic in negotiations between Norway and some Member States.', 'ro': 'Totuşi, făcând parte din această comisie, ştiţi că acesta a fost un subiect foarte aprins în negocierile dintre Norvegia şi unele state membre.'}
2	{'en': 'The overwhelming vote shows just this.', 'ro': 'Ceea ce demonstrează şi votul favorabil.'}
3	{'en': '[Photo illustration by Catherine Gurgenidze for Southeast European Times]', 'ro': '[Ilustraţii foto de Catherine Gurgenidze pentru Southeast European Times]'}
4	{'en': '(HU) Mr President, today the specific text of the agreement between the Hungarian Government and the European Commission has been formulated.', 'ro': '(HU) Domnule președinte, textul concret al acordului dintre guvernul ungar și Comisia Europeană a fost formulat astăzi.'}

metric是datasets.Metric类的一个实例，查看metric和使用的例子:

metric

    Metric(name: "sacrebleu", features: {'predictions': Value(dtype='string', id='sequence'), 'references': Sequence(feature=Value(dtype='string', id='sequence'), length=-1, id='references')}, usage: """
    Produces BLEU scores along with its sufficient statistics
    from a source against one or more references.
    
    Args:
        predictions: The system stream (a sequence of segments)
        references: A list of one or more reference streams (each a sequence of segments)
        smooth: The smoothing method to use
        smooth_value: For 'floor' smoothing, the floor to use
        force: Ignore data that looks already tokenized
        lowercase: Lowercase the data
        tokenize: The tokenizer to use
    Returns:
        'score': BLEU score,
        'counts': Counts,
        'totals': Totals,
        'precisions': Precisions,
        'bp': Brevity penalty,
        'sys_len': predictions length,
        'ref_len': reference length,
    Examples:
    
        >>> predictions = ["hello there general kenobi", "foo bar foobar"]
        >>> references = [["hello there general kenobi", "hello there !"], ["foo bar foobar", "foo bar foobar"]]
        >>> sacrebleu = datasets.load_metric("sacrebleu")
        >>> results = sacrebleu.compute(predictions=predictions, references=references)
        >>> print(list(results.keys()))
        ['score', 'counts', 'totals', 'precisions', 'bp', 'sys_len', 'ref_len']
        >>> print(round(results["score"], 1))
        100.0
    """, stored examples: 0)

我们使用compute方法来对比predictions和labels，从而计算得分。predictions和labels都需要是一个list。具体格式见下面的例子：

fake_preds = ["hello there", "general kenobi"]
fake_labels = [["hello there"], ["general kenobi"]]
metric.compute(predictions=fake_preds, references=fake_labels)

    {'bp': 1.0,
     'counts': [4, 2, 0, 0],
     'precisions': [100.0, 100.0, 0.0, 0.0],
     'ref_len': 4,
     'score': 0.0,
     'sys_len': 4,
     'totals': [4, 2, 0, 0]}

二、数据预处理

2.1 构建模型对应的tokenizer

在将数据喂入模型之前，我们需要对数据进行预处理。预处理的工具叫Tokenizer。Tokenizer首先对输入进行tokenize，然后将tokens转化为预模型中需要对应的token ID，再转化为模型需要的输入格式。

为了达到数据预处理的目的，我们使用AutoTokenizer.from_pretrained方法实例化我们的tokenizer，这样可以确保：

我们得到一个与预训练模型一一对应的tokenizer。
使用指定的模型checkpoint对应的tokenizer的时候，我们也下载了模型需要的词表库vocabulary，准确来说是tokens vocabulary。

这个被下载的tokens vocabulary会被缓存起来，从而再次使用的时候不会重新下载。

from transformers import AutoTokenizer
# 需要安装`sentencepiece`： pip install sentencepiece
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

以我们使用的mBART模型为例，我们需要正确设置source语言和target语言。如果您要翻译的是其他双语语料，请查看这里。我们可以检查source和target语言的设置：

if "mbart" in model_checkpoint:
    tokenizer.src_lang = "en-XX"
    tokenizer.tgt_lang = "ro-RO"

tokenizer既可以对单个文本进行预处理，也可以对一对文本进行预处理，tokenizer预处理后得到的数据满足预训练模型输入格式

tokenizer("Hello, this one sentence!")

    {'input_ids': [125, 778, 3, 63, 141, 9191, 23, 0], 
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

上面看到的token IDs也就是input_ids一般来说随着预训练模型名字的不同而有所不同。原因是不同的预训练模型在预训练的时候设定了不同的规则。但只要tokenizer和model的名字一致，那么tokenizer预处理的输入格式就会满足model需求的。关于预处理更多内容参考这个教程

除了可以tokenize一句话，我们也可以tokenize一个list的句子。

tokenizer(["Hello, this one sentence!", "This is another sentence."])

    {'input_ids': [[125, 778, 3, 63, 141, 9191, 23, 0], 
    [187, 32, 716, 9191, 2, 0]], 
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
    [1, 1, 1, 1, 1, 1]]}

注意：为了给模型准备好翻译的targets，我们使用as_target_tokenizer来控制targets所对应的特殊token：

with tokenizer.as_target_tokenizer():
    print(tokenizer("Hello, this one sentence!"))
    model_input = tokenizer("Hello, this one sentence!")
    tokens = tokenizer.convert_ids_to_tokens(model_input['input_ids'])
    # 打印看一下special toke
    print('tokens: {}'.format(tokens))

    {'input_ids': [10334, 1204, 3, 15, 8915, 27, 452, 59, 29579, 581, 23, 0], 
    'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
    tokens: ['▁Hel', 'lo', ',', '▁', 'this', '▁o', 'ne', '▁se', 'nten', 'ce', '!', '</s>']

如果使用的是T5预训练模型的checkpoints，需要对特殊的前缀进行检查。T5使用特殊的前缀来告诉模型具体要做的任务，具体前缀例子如下：

if model_checkpoint in ["t5-small", "t5-base", "t5-larg", "t5-3b", "t5-11b"]:
    prefix = "translate English to Romanian: "
else:
    prefix = ""

2.2 整合预处理函数

现在我们可以把所有内容放在一起组成我们的预处理函数了。我们对样本进行预处理的时候，我们还会truncation=True这个参数来确保我们超长的句子被截断。默认情况下，对与比较短的句子我们会自动padding。

max_input_length = 128
max_target_length = 128
source_lang = "en"
target_lang = "ro"

def preprocess_function(examples):
    inputs = [prefix + ex[source_lang] for ex in examples["translation"]]
    targets = [ex[target_lang] for ex in examples["translation"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

以上的预处理函数可以处理一个样本，也可以处理多个样本exapmles。如果是处理多个样本，则返回的是多个样本被预处理之后的结果list。

preprocess_function(raw_datasets['train'][:2])

    {'input_ids': [[393, 4462, 14, 1137, 53, 216, 28636, 0], 
    [24385, 14, 28636, 14, 4646, 4622, 53, 216, 28636, 0]], 
    'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], 
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 
    'labels': [[42140, 494, 1750, 53, 8, 59, 903, 3543, 9, 15202, 0], 
    [36199, 6612, 9, 15202, 122, 568, 35788, 21549, 53, 8, 59, 903, 3543, 9, 15202, 0]]}

2.3 对数据集datasets所有样本进行预处理

接下来对数据集datasets里面的所有样本进行预处理，处理的方式是使用map函数，将预处理函数prepare_train_features应用到（map)所有样本上。

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

更好的是，返回的结果会自动被缓存，避免下次处理的时候重新计算（但是也要注意，如果输入有改动，可能会被缓存影响！）。

datasets库函数会对输入的参数进行检测，判断是否有变化，如果没有变化就使用缓存数据，如果有变化就重新处理。但如果输入参数不变，想改变输入的时候，最好清理调这个缓存。清理的方式是使用load_from_cache_file=False参数。
上面使用到的batched=True这个参数是tokenizer的特点，以为这会使用多线程同时并行对输入进行处理。

三、微调transformer模型

3.1 加载seq2seq模型

既然数据已经准备好了，现在我们需要下载并加载我们的预训练模型，然后微调预训练模型。既然我们是做seq2seq任务，那么我们需要一个能解决这个任务的模型类。我们使用AutoModelForSeq2SeqLM这个类。和tokenizer相似，from_pretrained方法同样可以帮助我们下载并加载模型，同时也会对模型进行缓存，就不会重复下载模型啦。

Auto的字面含义就是：自动帮你寻找合适的模型，根据你传入的模型名字。比如你传了了一个bert-base-uncased的，它就会帮你找到对应的BertForSequenceClassification

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model

可以看到上面的model的结构如下：

MarianMTModel(
  (model): MarianModel(
    (shared): Embedding(59543, 512, padding_idx=59542)
    (encoder): MarianEncoder(
      (embed_tokens): Embedding(59543, 512, padding_idx=59542)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (2): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (3): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (4): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (5): MarianEncoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
    (decoder): MarianDecoder(
      (embed_tokens): Embedding(59543, 512, padding_idx=59542)
      (embed_positions): MarianSinusoidalPositionalEmbedding(512, 512)
      (layers): ModuleList(
        (0): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (1): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (2): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (3): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (4): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
        (5): MarianDecoderLayer(
          (self_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (encoder_attn): MarianAttention(
            (k_proj): Linear(in_features=512, out_features=512, bias=True)
            (v_proj): Linear(in_features=512, out_features=512, bias=True)
            (q_proj): Linear(in_features=512, out_features=512, bias=True)
            (out_proj): Linear(in_features=512, out_features=512, bias=True)
          )
          (encoder_attn_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (fc1): Linear(in_features=512, out_features=2048, bias=True)
          (fc2): Linear(in_features=2048, out_features=512, bias=True)
          (final_layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        )
      )
    )
  )
  (lm_head): Linear(in_features=512, out_features=59543, bias=False)
)

由于我们微调的任务是机器翻译，而我们加载的是预训练的seq2seq模型，所以不会提示我们加载模型的时候扔掉了一些不匹配的神经网络参数（比如：预训练语言模型的神经网络head被扔掉了，同时随机初始化了机器翻译的神经网络head）。

3.2 设定训练参数

为了能够得到一个Seq2SeqTrainer训练工具，我们还需要3个要素，其中最重要的是训练的设定/参数Seq2SeqTrainingArguments。这个训练设定包含了能够定义训练过程的所有属性

batch_size = 16
args = Seq2SeqTrainingArguments(
    "test-translation",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,
)

上面evaluation_strategy = "epoch"参数告诉训练代码：我们每个epcoh会做一次验证评估。

上面batch_size在这个notebook之前定义好了。

由于我们的数据集比较大，同时Seq2SeqTrainer会不断保存模型，所以我们需要告诉它至多保存save_total_limit=3个模型。

最后我们需要一个数据收集器data collator，将我们处理好的输入喂给模型。

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

3.3 数据后处理

设置好Seq2SeqTrainer还剩最后一件事情，那就是我们需要定义好评估方法。我们使用metric来完成评估。将模型预测送入评估之前，我们也会做一些数据后处理：

import numpy as np

def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [[label.strip()] for label in labels]

    return preds, labels

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result

3.4 训练模型

最后将所有的参数/数据/模型传给Seq2SeqTrainer即可

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

调用train方法进行微调训练。

trainer.train()

最后的训练结果应该是这个亚子：
在这里插入图片描述

最后别忘了，查看如何上传模型，上传模型到到🤗 Model Hub。随后您就可以像这个notebook一开始一样，直接用模型名字就能使用您的模型啦。

Reference

（1）BERT应用在四大任务上的栗子：https://github.com/huggingface/transformers/tree/master/examples/pytorch
（2）datawhale course
（3）数据预处理：https://huggingface.co/transformers/preprocessing.html
（4）https://relph1119.github.io/my-team-learning/#/transformers_nlp28/task09
（5）小桐博客：https://blog.csdn.net/qq_40990057/article/details/120040161
（6）Hugging Face教程 - 7.3、使用huggingface做主流NLP训练任务（文本翻译模型，也可以是序列到序列模型）

山顶夕景

关注

7
点赞
踩
28

收藏

觉得还不错? 一键收藏
打赏
2
评论
【NLP】(task9)Transformers解决机器翻译任务

学习总结（1）之前的任务都是语言理解类任务（如文本分类、抽取式问答、序列标注等），而对于文本生成任务（如机器翻译、文本摘要等），这类任务不仅需要对作为条件的输入有较好的表示能力（编码器），同时需要较强大的（序列）解码器生成目标文本。沿着解码器利用自监督学习的方式进行预训练，大佬们提出了一些预训练生成模型：BART、UniLM、T5和GPT-3等，更多机器翻译论文可参考清华大学NLP组整理的论文——https://github.com/THUNLP-MT/MT-Reading-List。（2）本次学习用H
复制链接

扫一扫