BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Abstract

  • A new language representation model called BERT.
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  • Applications: question answering, natural language inference.
  • Achieves excellent results on eleven natural language processing tasks.

Introduction

  • sentence-level tasks:
    • natural language inference
    • paraphrasing
  • token-level tasks:
    • named entity recognition
    • question answering

Two strategies for applying pre-trained language representations to downstream tasks:

  • feature-based
    • ELMo
  • fine-tuning
    • the Generative Pre-trained Transformer (OpenAI GPT)

Both approaches use unidirectional language models to learn general language representations.

Limitation

  • standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training

Related Work

Unsupervised Feature-based Approaches

  • Learning widely applicable representations of words has been an active area of research, including non-neural and neural methods.

  • Pre-trained word embeddings are an integral part of modern NLP systems.

  • Coarser granularities: sentence embeddings and paragraph embeddings.

Unsupervised Fine-tuning Approaches

  • OpenAI GPT
  • the GLUE benchmark
  • Left-to-right language modeling
  • auto-encoder objectives

Transfer Learning from Supervised Data

  • natural language inference
  • machine translation
  • Computer vision research

BERT

  • pre-training
    • the model is trained on unlabeled data over different pre-training tasks
  • fine-tuning
    • the BERT model is first initialized with the pre-trained parameters
    • all of the parameters are fine-tuned using labeled data from the downstream tasks.
    • Each downstream task has separate fine-tuned models.
    • A distinctive feature of BERT is its unified architecture across different tasks.

Model Architecture

  • BERT is a multi-layer bidirectional Transformer encoder based on the original implementation.
  • the number of layers (Transformer blocks) as $L$
  • the hidden size as $H$
  • the number of self-attention heads as $A$ (see the size sketch after this list)
  • the BERT Transformer
    • uses bidirectional self-attention
  • the GPT Transformer
    • constrained self-attention where every
      token can only attend to context to its left
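
For concreteness, the two model sizes reported in the paper can be written as a small config sketch (the dataclass itself is only an illustration; the numbers are the paper's BERT-BASE and BERT-LARGE settings):

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int   # L: number of Transformer encoder layers
    hidden_size: int  # H: hidden dimension
    num_heads: int    # A: number of self-attention heads

# Sizes reported in the paper
BERT_BASE  = BertSize(num_layers=12, hidden_size=768,  num_heads=12)   # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_heads=16)   # ~340M parameters
```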

Input/Output Representations

  • WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary
  • input embedding as $E$
  • the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^{H}$
  • the final hidden vector for the $i^{th}$ input token as $T_i \in \mathbb{R}^{H}$ (an input-embedding sketch follows this list)
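
To make this notation concrete, here is a minimal sketch of how BERT forms the input embedding $E$ as the sum of token, segment, and position embeddings; all sizes and token ids below are toy values, not the real pre-trained tables:

```python
import numpy as np

# Tiny toy sizes so the sketch runs instantly; real BERT-base uses a
# ~30,000-token WordPiece vocabulary and H = 768.
V, H, MAX_LEN = 100, 8, 16
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(V, H))        # WordPiece token embeddings
segment_emb  = rng.normal(size=(2, H))        # sentence A vs. sentence B
position_emb = rng.normal(size=(MAX_LEN, H))  # learned position embeddings

# Toy ids for "[CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]"
token_ids   = np.array([1, 11, 12, 13, 2, 21, 22, 23, 2])   # 1 = [CLS], 2 = [SEP] (toy ids)
segment_ids = np.array([0,  0,  0,  0, 0,  1,  1,  1, 1])

# Input embedding E at each position = token + segment + position embedding
E = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[np.arange(len(token_ids))]
print(E.shape)  # (9, 8); with real sizes this would be (sequence length, 768)
```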

Pre-training BERT

  • two unsupervised tasks

Masked LM (Task 1)

  • mask some percentage of the input tokens at random, then predict those masked tokens
  • the chosen positions are usually replaced with the special [MASK] token (see the masking sketch after this list)
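
A minimal sketch of the masking procedure described in the paper: 15% of the tokens are chosen for prediction, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged (the function and variable names here are illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of masked-LM corruption: 15% of tokens are chosen; of those,
    80% become [MASK], 10% become a random token, and 10% stay unchanged.
    The model is trained to predict the original token at each chosen position."""
    masked = list(tokens)
    targets = {}  # position -> original token to predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token unchanged
    return masked, targets

tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(tokens, vocab=["the", "cat", "ran"]))
```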

Next Sentence Prediction (NSP) (Task 2)

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI):

  • based on understanding the relationship between two sentences.
  • a binarized next sentence prediction task (a pair-construction sketch follows this list)
  • can be trivially generated from any monolingual corpus
  • in prior work, only sentence embeddings are transferred to down-stream tasks
  • BERT transfers all parameters to initialize end-task model parameters
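
A sketch of how the binarized NSP training pairs are built: half the time sentence B is the actual next sentence (label IsNext), half the time it is a random sentence from the corpus (label NotNext). The helper below is only an illustration:

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one (sentence A, sentence B, label) pair for NSP:
    50% IsNext (the actual next sentence), 50% NotNext (random sentence)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"
    return sent_a, sent_b, label

doc = ["the man went to the store", "he bought a gallon of milk", "then he went home"]
corpus = ["penguins are flightless birds", "the sky is blue"]
print(make_nsp_example(doc, corpus))
```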

Pre-training data

  • the BooksCorpus (800M words)
  • English Wikipedia (2500M words)
  • it is critical to use a document-level corpus in order to extract long contiguous sequences

Fine-tuning BERT

Experiments

GLUE

  • the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ([CLS])
  • classification layer weights $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels
  • the standard classification loss uses $\log(\mathrm{softmax}(CW^{T}))$ (see the sketch after this list)
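
A minimal numeric sketch of this classification head: the only new parameters are $W$, and the score for each label is the log-softmax of $CW^{T}$ (toy random values, not trained weights):

```python
import numpy as np

def glue_classification_log_probs(C, W):
    """C: final hidden vector of [CLS], shape (H,).
    W: classification weights, shape (K, H) with K = number of labels.
    Returns log(softmax(C W^T)) over the K labels."""
    logits = C @ W.T                               # shape (K,)
    logits -= logits.max()                         # numerical stability
    return logits - np.log(np.exp(logits).sum())   # log-softmax

H, K = 768, 3                                      # e.g. K = 3 labels for MNLI
rng = np.random.default_rng(0)
C = rng.normal(size=H)
W = rng.normal(size=(K, H)) * 0.02
print(glue_classification_log_probs(C, W))
```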

GLUE tasks

  • a batch size of 32
  • fine-tune for 3 epochs over the data for all GLUE tasks

Classic datasets

  • The Stanford Question Answering Dataset (SQuAD)
  • the [CLS] token (SQuAD 2.0 treats questions with no answer as spans that start and end at [CLS]); a span-prediction sketch follows this list
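
For SQuAD, the paper introduces a start vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$ and scores a candidate answer span $(i, j)$ as $S \cdot T_i + E \cdot T_j$. A brute-force sketch with toy random values (not trained weights):

```python
import numpy as np

def best_answer_span(T, S, E):
    """T: final hidden vectors of the paragraph tokens, shape (N, H).
    S, E: learned start and end vectors, shape (H,).
    Returns the span (i, j) with j >= i maximizing S.T_i + E.T_j."""
    start_scores = T @ S          # (N,)
    end_scores   = T @ E          # (N,)
    best, span = -np.inf, (0, 0)
    for i in range(len(T)):
        for j in range(i, len(T)):
            score = start_scores[i] + end_scores[j]
            if score > best:
                best, span = score, (i, j)
    return span

rng = np.random.default_rng(0)
T = rng.normal(size=(20, 768))    # 20 paragraph tokens
S = rng.normal(size=768)
E = rng.normal(size=768)
print(best_answer_span(T, S, E))
```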

Ablation Studies

Effect of Model Size


Conclusion

Recent empirical improvements due to transfer
learning with language models have demonstrated
that rich, unsupervised pre-training is an integral
part of many language understanding systems.


Summary

Plan: slowly get the code running and work through BERT piece by piece until it is thoroughly understood.

  • Study the BERT code in depth, and be clear about the difference between fine-grained (token-level) tasks and coarse-grained (sentence-level) tasks.

In short

In essence, the BERT model is:

  • the BERT Transformer uses bidirectional self-attention
  • a multi-layer bidirectional Transformer encoder based on the original implementation

A multi-layer Transformer encoder.
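
A minimal sketch of such an encoder stack using PyTorch's stock layers, with BERT-BASE sizes; this is an illustration, not the original implementation (which also adds the embedding layers, pooler, and task-specific heads):

```python
import torch
import torch.nn as nn

# Stack of L = 12 bidirectional Transformer encoder layers with H = 768,
# A = 12 attention heads, and a 4H = 3072 feed-forward size (BERT-BASE sizes).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, activation="gelu", batch_first=True
)
bert_base_body = nn.TransformerEncoder(encoder_layer, num_layers=12)

x = torch.randn(2, 11, 768)   # (batch, sequence length, hidden size)
out = bert_base_body(x)       # every position attends to both left and right context
print(out.shape)              # torch.Size([2, 11, 768])
```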

