[Paper Reading | T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

The T5 paper casts every text processing problem as a text-to-text problem and handles them all with a single unified Transformer model. The study compares different pre-training approaches, introduces the large unlabeled C4 dataset, and pre-trains with a denoising objective. Experiments show that this pre-training strategy performs well across a range of downstream tasks, especially when the model is an encoder-decoder, and that sharing parameters between encoder and decoder performs nearly as well.

Foreword

Intro

Basic idea: treat every text processing problem as a text-to-text problem

  • taking text as input and producing new text as output
  • The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider

Goal: not to propose new methods but instead to provide a comprehensive perspective on where the field stands

Data: In order to perform experiments at this scale (up to 11B parameters), we introduce the “Colossal Clean Crawled Corpus” (C4).

An interesting and straightforward conclusion:

  • In transfer learning for computer vision, pre-training is typically done via supervised learning on a large labeled data set
    like ImageNet
  • In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data

Setting

Model

Transformer-based model

The Colossal Clean Crawled Corpus

leverage Common Crawl as a source of text scraped from the web

  • The majority of the resulting text is not natural language
  • To address these issues, we clean up Common Crawl's web extracted text with a set of heuristic filters (sketched at the end of this section)

Downloaded the web extracted text from April 2019 and applied the aforementioned filtering

  • produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text

Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files

  • it produces about 20 TB of scraped text data each month
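Below is a minimal sketch of the kind of heuristic filters the paper describes for turning raw web extracted text into C4. The thresholds are paraphrased from the paper, and `bad_word_list` / `is_english` are placeholder hooks rather than the paper's actual implementation:

```python
import re

TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text, bad_word_list, is_english):
    """Return cleaned page text, or None if the whole page should be dropped."""
    lowered = text.lower()
    # Drop pages with code/boilerplate markers or any word from a blocklist.
    if '{' in text or 'lorem ipsum' in lowered:
        return None
    if any(w in lowered for w in bad_word_list):
        return None
    # Keep only lines that end in terminal punctuation, have enough words,
    # and do not mention "javascript".
    kept = [ln.strip() for ln in text.splitlines()
            if ln.strip().endswith(TERMINAL_PUNCT)
            and len(ln.split()) >= 3
            and 'javascript' not in ln.lower()]
    cleaned = '\n'.join(kept)
    # Drop very short pages and non-English pages.
    if len(re.split(r'[.!?]+\s*', cleaned)) < 5:
        return None
    if not is_english(cleaned):  # the paper uses langdetect with a high confidence threshold
        return None
    return cleaned
```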

Downstream Tasks

Input and Output Format


The model is trained with a maximum likelihood objective regardless of the task

To specify which task the model should perform, we add a task-specific (text) prefix to the original input sequence before feeding it to the model.

  • we convert all of the tasks we consider into this text-to-text format

Note: The choice of text prefix used for a given task is essentially a hyperparameter
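For illustration, a rough sketch of how a few tasks are serialized into (input text, target text) pairs. The prefixes follow the paper's Figure 1, but the exact strings, field names, and label words here are illustrative rather than the official preprocessing code:

```python
def to_text_to_text(task, example):
    """Map a raw example (a dict) to an (input_text, target_text) pair."""
    if task == "translate_en_de":
        return ("translate English to German: " + example["en"], example["de"])
    if task == "cola":  # classification -> output the label as a word
        return ("cola sentence: " + example["sentence"],
                "acceptable" if example["label"] == 1 else "unacceptable")
    if task == "stsb":  # regression -> output the score as a string, rounded to 0.2 increments
        score = round(example["score"] * 5) / 5
        return (f"stsb sentence1: {example['sentence1']} sentence2: {example['sentence2']}",
                f"{score:.1f}")
    if task == "summarize":
        return ("summarize: " + example["document"], example["summary"])
    raise ValueError(f"unknown task: {task}")
```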

Experiments

Baselines

Pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks

Model

A standard encoder-decoder Transformer

  • the encoder and decoder are each similar in size and configuration to a “$BERT_{BASE}$” stack
  • this results in a model with about 220 million parameters (twice the number of parameters of $BERT_{BASE}$)
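As a rough sanity check on the ~220M figure, here is a back-of-the-envelope count assuming standard $BERT_{BASE}$ dimensions (d_model = 768, d_ff = 3072, 12 layers per stack) and a ~32k-token vocabulary with a single shared embedding matrix (these assumptions are mine, not spelled out in this note):

```python
# Back-of-the-envelope parameter count for a BERT_BASE-sized encoder + decoder,
# ignoring layer norms, biases, and position embeddings.
d_model, d_ff, n_layers, vocab = 768, 3072, 12, 32_000

attn = 4 * d_model * d_model        # Q, K, V and output projections
ffn  = 2 * d_model * d_ff           # two dense layers
enc_layer = attn + ffn              # self-attention + feed-forward
dec_layer = 2 * attn + ffn          # self-attention + cross-attention + feed-forward

encoder = n_layers * enc_layer
decoder = n_layers * dec_layer
embeddings = vocab * d_model        # assumed shared input/output embedding matrix

total = encoder + decoder + embeddings
print(f"{total / 1e6:.0f}M parameters")  # ~223M, i.e. roughly 220M and about twice BERT_BASE
```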
Training

Train: Use standard maximum likelihood (teacher forcing + cross-entropy loss)

Test: greedy decoding (choosing the highest-probability logit at every timestep)
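A minimal sketch of greedy decoding; `model` (returning next-token logits) and `eos_id` are placeholders:

```python
import numpy as np

def greedy_decode(model, input_ids, eos_id, max_len=128):
    """Generate output ids by always taking the argmax of the next-token logits."""
    output_ids = []
    for _ in range(max_len):
        logits = model(input_ids, output_ids)   # next-token logits over the vocabulary
        next_id = int(np.argmax(logits))        # highest-probability token at this timestep
        if next_id == eos_id:
            break
        output_ids.append(next_id)
    return output_ids
```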

Pre-train data: C4

  • steps: $2^{19}$ steps
  • maximum seq length: 512
  • batch size: 128 seqs
  • total: $2^{35}$ tokens (~34B), just a fraction of the entire C4 (for comparison: BERT ~137B tokens, RoBERTa ~2.2T tokens)
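The $2^{35}$ figure follows directly from the schedule above:

```python
# Pre-training token budget: steps x batch size x max sequence length.
steps, batch_size, seq_len = 2**19, 128, 512
total_tokens = steps * batch_size * seq_len
assert total_tokens == 2**35                 # 2^19 * 2^7 * 2^9
print(f"{total_tokens / 1e9:.1f}B tokens")   # ~34.4B, a small fraction of C4
```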

Fine-tune:

  • steps: $2^{18}$ steps
  • seq len & batch size: as before
Unsupervised Objective

Use a denoising objective (which the paper finds produces better performance): train the model to predict missing or otherwise corrupted tokens in the input

  • Randomly sample and drop out 15% of the tokens in the input sequence; each consecutive span of dropped-out tokens is replaced by a single sentinel token, and the target is the sequence of dropped-out spans delimited by those sentinels (a minimal sketch follows)
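A minimal sketch of this corruption scheme; the `<extra_id_N>` sentinel names follow the released T5 vocabulary but should be read as illustrative here:

```python
import random

def corrupt(tokens, corruption_rate=0.15):
    """Drop ~15% of tokens, replacing each consecutive dropped span with one sentinel."""
    drop = [random.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    for i, (tok, dropped) in enumerate(zip(tokens, drop)):
        if dropped:
            if i == 0 or not drop[i - 1]:            # start of a new corrupted span
                inputs.append(f"<extra_id_{sentinel}>")
                targets.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            targets.append(tok)
        else:
            inputs.append(tok)
    targets.append(f"<extra_id_{sentinel}>")         # final sentinel closes the target
    return inputs, targets

# One possible outcome for "Thank you for inviting me to your party last week"
# (which tokens get dropped is random):
#   inputs : Thank you <extra_id_0> me to your party <extra_id_1> week
#   targets: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```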


Baseline Performance


  • train our baseline model 10 times from scratch (i.e. with different random initializations and data set shuffling)

Architecture

Model Structure


Comparing Different Model Structures

An encoder-decoder model with L layers in the encoder and L layers in the decoder has approximately the same number of parameters as a language model with 2L layers

  • However, the L + L encoder-decoder has approximately the same computational cost as a language model with only L layers (roughly half that of the 2L-layer language model)

  • This is because the encoder is applied only to the input sequence and the decoder only to the output sequence, while the language model must be applied to both the input and the output sequence (a rough cost comparison follows below)
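A quick way to see this, counting layer applications per token and ignoring attention's quadratic term (n = input length, m = output length):

```python
# Layer applications per token for input length n, output length m, and L layers per stack.
n, m, L = 512, 512, 12
enc_dec = L * n + L * m          # encoder sees only the input, decoder only the output
lm_2L   = 2 * L * (n + m)        # a 2L-layer LM runs every layer over input + output
lm_L    = L * (n + m)            # an L-layer LM
print(enc_dec == lm_L, lm_2L / enc_dec)   # True 2.0
```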

Objectives

Consider both a basic language modeling objective and our baseline denoising objective

  • we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions
    • For the standard language model, we train the model to predict the entire span from beginning to end
    • For the denoising objective (adapted to a language model): concatenate the inputs and targets and train the model to predict the concatenation
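A minimal sketch of how such examples are formed (assumes a span of at least two tokens; the corruption step refers to the sketch in the Unsupervised Objective section above):

```python
import random

def split_prefix_target(tokens):
    """Split a sampled text span at a random point into a prefix and a target."""
    split = random.randint(1, len(tokens) - 1)
    return tokens[:split], tokens[split:]

# Standard LM: predict the whole span from beginning to end.
# Prefix LM / encoder-decoder: condition on the prefix, predict the target.
# Denoising adapted to an LM: corrupt the span as in the baseline objective, then
# concatenate (corrupted inputs, original targets) and predict left to right.
```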

Result


  • For all tasks, the encoder-decoder architecture with the denoising objective performed best.
  • sharing parameters across the encoder and decoder performed nearly as well
    • We also note that the shared parameter encoder-decoder outperforms the decoder-only prefix LM
  • using a denoising objective always results in better downstream task performance compared to a language modeling objective