BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Abstract

  • A new language representation model called BERT.
  • BERT stands for Bidirectional Encoder Representations from Transformers.
  • BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
  • Applications: question answering, natural language inference.
  • Achieves excellent results on eleven natural language processing tasks.

Introduction

  • sentence-level tasks:
    • natural language inference
    • paraphrasing
  • token-level tasks:
    • named entity recognition
    • question answering

Two strategies for applying pre-trained language representations to downstream tasks:

  • feature-based
    • ELMo
  • fine-tuning
    • the Generative Pre-trained Transformer (OpenAI GPT)

Both approaches use unidirectional language models to learn general language representations.

Limitation

  • standard language models are unidirectional, which limits the choice of architectures that can be used during pre-training

Related Work

Unsupervised Feature-based Approaches

  • Learning widely applicable representations of words has been an active area of research, including non-neural and neural methods.

  • Pre-trained word embeddings are an integral part of modern NLP systems.

  • Coarser granularities: sentence embeddings and paragraph embeddings.

Unsupervised Fine-tuning Approaches

  • OpenAI GPT
  • the GLUE benchmark
  • Left-to-right language modeling
  • auto-encoder objectives

Transfer Learning from Supervised Data

  • natural language inference
  • machine translation
  • Computer vision research

BERT

  • pre-training
    • the model is trained on unlabeled data over different pre-training tasks
  • fine-tuning
    • the BERT model is first initialized with the pre-trained parameters
    • all of the parameters are fine-tuned using labeled data from the downstream tasks.
    • Each downstream task has separate fine-tuned models.
    • A distinctive feature of BERT is its unified architecture across different tasks.

Model Architecture

  • BERT is a multi-layer bidirectional Transformer encoder based on the original implementation.
  • the number of layers (Transformer blocks) as $L$
  • the hidden size as $H$
  • the number of self-attention heads as $A$ (see the size sketch after this list)
  • the BERT Transformer
    • uses bidirectional self-attention
  • the GPT Transformer
    • constrained self-attention where every
      token can only attend to context to its left
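
For concreteness, the two model sizes reported in the paper can be written as a small config sketch (the dataclass itself is only an illustration; the numbers are the paper's BERT-BASE and BERT-LARGE settings):

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int   # L: number of Transformer encoder layers
    hidden_size: int  # H: hidden dimension
    num_heads: int    # A: number of self-attention heads

# Sizes reported in the paper
BERT_BASE  = BertSize(num_layers=12, hidden_size=768,  num_heads=12)   # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_heads=16)   # ~340M parameters
```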

Input/Output Representations

  • WordPiece embeddings (Wu et al., 2016) with a 30,000 token vocabulary
  • input embedding as $E$
  • the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^{H}$
  • the final hidden vector for the $i^{th}$ input token as $T_i \in \mathbb{R}^{H}$ (an input-embedding sketch follows this list)
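
To make this notation concrete, here is a minimal sketch of how BERT forms the input embedding $E$ as the sum of token, segment, and position embeddings; all sizes and token ids below are toy values, not the real pre-trained tables:

```python
import numpy as np

# Tiny toy sizes so the sketch runs instantly; real BERT-base uses a
# ~30,000-token WordPiece vocabulary and H = 768.
V, H, MAX_LEN = 100, 8, 16
rng = np.random.default_rng(0)
token_emb    = rng.normal(size=(V, H))        # WordPiece token embeddings
segment_emb  = rng.normal(size=(2, H))        # sentence A vs. sentence B
position_emb = rng.normal(size=(MAX_LEN, H))  # learned position embeddings

# Toy ids for "[CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]"
token_ids   = np.array([1, 11, 12, 13, 2, 21, 22, 23, 2])   # 1 = [CLS], 2 = [SEP] (toy ids)
segment_ids = np.array([0,  0,  0,  0, 0,  1,  1,  1, 1])

# Input embedding E at each position = token + segment + position embedding
E = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[np.arange(len(token_ids))]
print(E.shape)  # (9, 8); with real sizes this would be (sequence length, 768)
```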

Pre-training BERT

  • two unsupervised tasks

Masked LM (Task 1)

  • mask some percentage of the input tokens at random, then predict those masked tokens
  • the chosen positions are usually replaced with the special [MASK] token (see the masking sketch after this list)
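
A minimal sketch of the masking procedure described in the paper: 15% of the tokens are chosen for prediction, and of those 80% become [MASK], 10% become a random token, and 10% are left unchanged (the function and variable names here are illustrative):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Sketch of masked-LM corruption: 15% of tokens are chosen; of those,
    80% become [MASK], 10% become a random token, and 10% stay unchanged.
    The model is trained to predict the original token at each chosen position."""
    masked = list(tokens)
    targets = {}  # position -> original token to predict
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        targets[i] = tok
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token unchanged
    return masked, targets

tokens = ["[CLS]", "my", "dog", "is", "hairy", "[SEP]"]
print(mask_tokens(tokens, vocab=["the", "cat", "ran"]))
```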

Next Sentence Prediction (NSP) (Task 2)

Many important downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI):

  • based on understanding the relationship between two sentences.
  • a binarized next sentence prediction task (a pair-construction sketch follows this list)
  • can be trivially generated from any monolingual corpus
  • in prior work, only sentence embeddings are transferred to down-stream tasks
  • BERT transfers all parameters to initialize end-task model parameters
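
A sketch of how the binarized NSP training pairs are built: half the time sentence B is the actual next sentence (label IsNext), half the time it is a random sentence from the corpus (label NotNext). The helper below is only an illustration:

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences):
    """Build one (sentence A, sentence B, label) pair for NSP:
    50% IsNext (the actual next sentence), 50% NotNext (random sentence)."""
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(corpus_sentences), "NotNext"
    return sent_a, sent_b, label

doc = ["the man went to the store", "he bought a gallon of milk", "then he went home"]
corpus = ["penguins are flightless birds", "the sky is blue"]
print(make_nsp_example(doc, corpus))
```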

Pre-training data

  • the BooksCorpus (800M words)
  • English Wikipedia (2500M words)
  • it is critical to use a document-level corpus in order to extract long contiguous sequences

Fine-tuning BERT

Experiments

GLUE

  • the final hidden vector $C \in \mathbb{R}^{H}$ corresponding to the first input token ([CLS])
  • classification layer weights $W \in \mathbb{R}^{K \times H}$, where $K$ is the number of labels
  • the standard classification loss uses $\log(\mathrm{softmax}(CW^{T}))$ (see the sketch after this list)
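
A minimal numeric sketch of this classification head: the only new parameters are $W$, and the score for each label is the log-softmax of $CW^{T}$ (toy random values, not trained weights):

```python
import numpy as np

def glue_classification_log_probs(C, W):
    """C: final hidden vector of [CLS], shape (H,).
    W: classification weights, shape (K, H) with K = number of labels.
    Returns log(softmax(C W^T)) over the K labels."""
    logits = C @ W.T                               # shape (K,)
    logits -= logits.max()                         # numerical stability
    return logits - np.log(np.exp(logits).sum())   # log-softmax

H, K = 768, 3                                      # e.g. K = 3 labels for MNLI
rng = np.random.default_rng(0)
C = rng.normal(size=H)
W = rng.normal(size=(K, H)) * 0.02
print(glue_classification_log_probs(C, W))
```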

GLUE tasks

  • a batch size of 32
  • fine-tune for 3 epochs over the data for all GLUE tasks

Classic datasets

  • The Stanford Question Answering Dataset (SQuAD)
  • the [CLS] token (SQuAD 2.0 treats questions with no answer as spans that start and end at [CLS]); a span-prediction sketch follows this list
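
For SQuAD, the paper introduces a start vector $S \in \mathbb{R}^{H}$ and an end vector $E \in \mathbb{R}^{H}$ and scores a candidate answer span $(i, j)$ as $S \cdot T_i + E \cdot T_j$. A brute-force sketch with toy random values (not trained weights):

```python
import numpy as np

def best_answer_span(T, S, E):
    """T: final hidden vectors of the paragraph tokens, shape (N, H).
    S, E: learned start and end vectors, shape (H,).
    Returns the span (i, j) with j >= i maximizing S.T_i + E.T_j."""
    start_scores = T @ S          # (N,)
    end_scores   = T @ E          # (N,)
    best, span = -np.inf, (0, 0)
    for i in range(len(T)):
        for j in range(i, len(T)):
            score = start_scores[i] + end_scores[j]
            if score > best:
                best, span = score, (i, j)
    return span

rng = np.random.default_rng(0)
T = rng.normal(size=(20, 768))    # 20 paragraph tokens
S = rng.normal(size=768)
E = rng.normal(size=768)
print(best_answer_span(T, S, E))
```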

Ablation Studies

Effect of Model Size


Conclusion

Recent empirical improvements due to transfer
learning with language models have demonstrated
that rich, unsupervised pre-training is an integral
part of many language understanding systems.


Summary

Plan: slowly get the code running and work through BERT piece by piece until it is thoroughly understood.

  • Study the BERT code in depth, and be clear about the difference between fine-grained (token-level) tasks and coarse-grained (sentence-level) tasks.

In short

In essence, the BERT model is:

  • the BERT Transformer uses bidirectional self-attention
  • a multi-layer bidirectional Transformer encoder based on the original implementation

A multi-layer Transformer encoder.
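
A minimal sketch of such an encoder stack using PyTorch's stock layers, with BERT-BASE sizes; this is an illustration, not the original implementation (which also adds the embedding layers, pooler, and task-specific heads):

```python
import torch
import torch.nn as nn

# Stack of L = 12 bidirectional Transformer encoder layers with H = 768,
# A = 12 attention heads, and a 4H = 3072 feed-forward size (BERT-BASE sizes).
encoder_layer = nn.TransformerEncoderLayer(
    d_model=768, nhead=12, dim_feedforward=3072, activation="gelu", batch_first=True
)
bert_base_body = nn.TransformerEncoder(encoder_layer, num_layers=12)

x = torch.randn(2, 11, 768)   # (batch, sequence length, hidden size)
out = bert_base_body(x)       # every position attends to both left and right context
print(out.shape)              # torch.Size([2, 11, 768])
```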

