[Paper Reading | T5] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

The T5 paper casts every text processing problem as a text-to-text problem and handles them all with a single unified Transformer model. The study compares different pre-training approaches, introduces the large unlabeled C4 dataset, and pre-trains with a denoising objective. Experiments show that this pre-training strategy performs well across a range of downstream tasks, especially when the model is an encoder-decoder, and that sharing parameters between encoder and decoder performs nearly as well.

Foreword

Intro

Basic idea: treat every text processing problem as a text-to-text problem

  • taking text as input and producing new text as output
  • The text-to-text framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task we consider

Goal: not to propose new methods but instead to provide a comprehensive perspective on where the field stands

Data: In order to perform experiments at this scale (up to 11B parameters), we introduce the “Colossal Clean Crawled Corpus” (C4).

An interesting and straightforward conclusion:

  • In transfer learning for computer vision, pre-training is typically done via supervised learning on a large labeled data set
    like ImageNet
  • In contrast, modern techniques for transfer learning in NLP often pre-train using unsupervised learning on unlabeled data

Setting

Model

Transformer-based model

The Colossal Clean Crawled Corpus

leverage Common Crawl as a source of text scraped from the web

  • The majority of the resulting text is not natural language
  • To address these issues, we clean up Common Crawl's web extracted text with a set of heuristic filters (sketched at the end of this section)

Downloaded the web extracted text from April 2019 and applied the aforementioned filtering

  • produces a collection of text that is not only orders of magnitude larger than most data sets used for pre-training (about 750 GB) but also comprises reasonably clean and natural English text

Common Crawl is a publicly-available web archive that provides “web extracted text” by removing markup and other non-text content from the scraped HTML files

  • it produces about 20 TB of scraped text data each month
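Below is a minimal sketch of the kind of heuristic filters the paper describes for turning raw web extracted text into C4. The thresholds are paraphrased from the paper, and `bad_word_list` / `is_english` are placeholder hooks rather than the paper's actual implementation:

```python
import re

TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text, bad_word_list, is_english):
    """Return cleaned page text, or None if the whole page should be dropped."""
    lowered = text.lower()
    # Drop pages with code/boilerplate markers or any word from a blocklist.
    if '{' in text or 'lorem ipsum' in lowered:
        return None
    if any(w in lowered for w in bad_word_list):
        return None
    # Keep only lines that end in terminal punctuation, have enough words,
    # and do not mention "javascript".
    kept = [ln.strip() for ln in text.splitlines()
            if ln.strip().endswith(TERMINAL_PUNCT)
            and len(ln.split()) >= 3
            and 'javascript' not in ln.lower()]
    cleaned = '\n'.join(kept)
    # Drop very short pages and non-English pages.
    if len(re.split(r'[.!?]+\s*', cleaned)) < 5:
        return None
    if not is_english(cleaned):  # the paper uses langdetect with a high confidence threshold
        return None
    return cleaned
```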

Downstream Tasks

Input and Output Format


The model is trained with a maximum likelihood objective regardless of the task

To specify which task the model should perform, we add a task-specific (text) prefix to the original input sequence before feeding it to the model.

  • we convert all of the tasks we consider into this text-to-text format

Note: The choice of text prefix used for a given task is essentially a hyperparameter
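For illustration, a rough sketch of how a few tasks are serialized into (input text, target text) pairs. The prefixes follow the paper's Figure 1, but the exact strings, field names, and label words here are illustrative rather than the official preprocessing code:

```python
def to_text_to_text(task, example):
    """Map a raw example (a dict) to an (input_text, target_text) pair."""
    if task == "translate_en_de":
        return ("translate English to German: " + example["en"], example["de"])
    if task == "cola":  # classification -> output the label as a word
        return ("cola sentence: " + example["sentence"],
                "acceptable" if example["label"] == 1 else "unacceptable")
    if task == "stsb":  # regression -> output the score as a string, rounded to 0.2 increments
        score = round(example["score"] * 5) / 5
        return (f"stsb sentence1: {example['sentence1']} sentence2: {example['sentence2']}",
                f"{score:.1f}")
    if task == "summarize":
        return ("summarize: " + example["document"], example["summary"])
    raise ValueError(f"unknown task: {task}")
```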

Experiments

Baselines

Pre-train a standard Transformer (described in Section 2.1) using a simple denoising objective and then separately fine-tune on each of our downstream tasks

Model

A standard encoder-decoder Transformer

  • the encoder and decoder are each similar in size and configuration to a “$BERT_{BASE}$” stack
  • this results in a model with about 220 million parameters (twice the number of parameters of $BERT_{BASE}$)
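As a rough sanity check on the ~220M figure, here is a back-of-the-envelope count assuming standard $BERT_{BASE}$ dimensions (d_model = 768, d_ff = 3072, 12 layers per stack) and a ~32k-token vocabulary with a single shared embedding matrix (these assumptions are mine, not spelled out in this note):

```python
# Back-of-the-envelope parameter count for a BERT_BASE-sized encoder + decoder,
# ignoring layer norms, biases, and position embeddings.
d_model, d_ff, n_layers, vocab = 768, 3072, 12, 32_000

attn = 4 * d_model * d_model        # Q, K, V and output projections
ffn  = 2 * d_model * d_ff           # two dense layers
enc_layer = attn + ffn              # self-attention + feed-forward
dec_layer = 2 * attn + ffn          # self-attention + cross-attention + feed-forward

encoder = n_layers * enc_layer
decoder = n_layers * dec_layer
embeddings = vocab * d_model        # assumed shared input/output embedding matrix

total = encoder + decoder + embeddings
print(f"{total / 1e6:.0f}M parameters")  # ~223M, i.e. roughly 220M and about twice BERT_BASE
```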
Training

Train: Use standard maximum likelihood (teacher forcing + cross-entropy loss)

Test: greedy decoding (choosing the highest-probability logit at every timestep)
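A minimal sketch of greedy decoding; `model` (returning next-token logits) and `eos_id` are placeholders:

```python
import numpy as np

def greedy_decode(model, input_ids, eos_id, max_len=128):
    """Generate output ids by always taking the argmax of the next-token logits."""
    output_ids = []
    for _ in range(max_len):
        logits = model(input_ids, output_ids)   # next-token logits over the vocabulary
        next_id = int(np.argmax(logits))        # highest-probability token at this timestep
        if next_id == eos_id:
            break
        output_ids.append(next_id)
    return output_ids
```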

Pre-train data: C4

  • steps: $2^{19}$ steps
  • maximum seq length: 512
  • batch size: 128 seqs
  • total: $2^{35}$ tokens (~34B), just a fraction of the entire C4 (for comparison: BERT ~137B tokens, RoBERTa ~2.2T tokens)
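The $2^{35}$ figure follows directly from the schedule above:

```python
# Pre-training token budget: steps x batch size x max sequence length.
steps, batch_size, seq_len = 2**19, 128, 512
total_tokens = steps * batch_size * seq_len
assert total_tokens == 2**35                 # 2^19 * 2^7 * 2^9
print(f"{total_tokens / 1e9:.1f}B tokens")   # ~34.4B, a small fraction of C4
```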

Fine-tune:

  • steps: $2^{18}$ steps
  • seq len & batch size: as before
Unsupervised Objective

Use a denoising objective (which the paper finds produces better performance): train the model to predict missing or otherwise corrupted tokens in the input

  • Randomly sample and drop out 15% of the tokens in the input sequence; each consecutive span of dropped-out tokens is replaced by a single sentinel token, and the target is the sequence of dropped-out spans delimited by those sentinels (a minimal sketch follows)
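A minimal sketch of this corruption scheme; the `<extra_id_N>` sentinel names follow the released T5 vocabulary but should be read as illustrative here:

```python
import random

def corrupt(tokens, corruption_rate=0.15):
    """Drop ~15% of tokens, replacing each consecutive dropped span with one sentinel."""
    drop = [random.random() < corruption_rate for _ in tokens]
    inputs, targets, sentinel = [], [], 0
    for i, (tok, dropped) in enumerate(zip(tokens, drop)):
        if dropped:
            if i == 0 or not drop[i - 1]:            # start of a new corrupted span
                inputs.append(f"<extra_id_{sentinel}>")
                targets.append(f"<extra_id_{sentinel}>")
                sentinel += 1
            targets.append(tok)
        else:
            inputs.append(tok)
    targets.append(f"<extra_id_{sentinel}>")         # final sentinel closes the target
    return inputs, targets

# One possible outcome for "Thank you for inviting me to your party last week"
# (which tokens get dropped is random):
#   inputs : Thank you <extra_id_0> me to your party <extra_id_1> week
#   targets: <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```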


Baseline Performance


  • train our baseline model 10 times from scratch (i.e. with different random initializations and data set shuffling)

Architecture

Model Structure


Comparing Different Model Structures

An encoder-decoder model with L layers in the encoder and L layers in the decoder has approximately the same number of parameters as a language model with 2L layers

  • However, the L + L encoder-decoder has approximately the same computational cost as a language model with only L layers (roughly half that of the 2L-layer language model)

  • This is because the encoder is applied only to the input sequence and the decoder only to the output sequence, while the language model must be applied to both the input and the output sequence (a rough cost comparison follows below)
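A quick way to see this, counting layer applications per token and ignoring attention's quadratic term (n = input length, m = output length):

```python
# Layer applications per token for input length n, output length m, and L layers per stack.
n, m, L = 512, 512, 12
enc_dec = L * n + L * m          # encoder sees only the input, decoder only the output
lm_2L   = 2 * L * (n + m)        # a 2L-layer LM runs every layer over input + output
lm_L    = L * (n + m)            # an L-layer LM
print(enc_dec == lm_L, lm_2L / enc_dec)   # True 2.0
```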

Objectives

Consider both a basic language modeling objective and our baseline denoising objective

  • we sample a span of text from our unlabeled data set and choose a random point to split it into prefix and target portions
    • For the standard language model, we train the model to predict the entire span from beginning to end
    • For the denoising objective (adapted to a language model): concatenate the inputs and targets and train the model to predict the concatenation
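A minimal sketch of how such examples are formed (assumes a span of at least two tokens; the corruption step refers to the sketch in the Unsupervised Objective section above):

```python
import random

def split_prefix_target(tokens):
    """Split a sampled text span at a random point into a prefix and a target."""
    split = random.randint(1, len(tokens) - 1)
    return tokens[:split], tokens[split:]

# Standard LM: predict the whole span from beginning to end.
# Prefix LM / encoder-decoder: condition on the prefix, predict the target.
# Denoising adapted to an LM: corrupt the span as in the baseline objective, then
# concatenate (corrupted inputs, original targets) and predict left to right.
```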

Result


  • For all tasks, the encoder-decoder architecture with the denoising objective performed best.
  • sharing parameters across the encoder and decoder performed nearly as well
    • We also note that the shared parameter encoder-decoder outperforms the decoder-only prefix LM
  • using a denoising objective always results in better downstream task performance compared to a language modeling objective