XLNet

红酒暖心也暖胃

Article link: https://blog.csdn.net/zpp13hao1/article/details/112836889

XLNet: Generalized Autoregressive Pretraining for Language Understanding

https://github.com/zihangdai/xlnet

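Abstract

Pretraining based on denoising autoencoding (DAE), such as BERT, models bidirectional context and achieves better results than pretraining based on autoregressive (AR) language modeling. However, BERT ignores the dependencies between the masked positions, and it suffers from a pretrain-finetune discrepancy because the mask symbol never appears during finetuning. Weighing these strengths (access to bidirectional context) and weaknesses, the authors propose XLNet, which learns bidirectional context by maximizing the expected likelihood over all permutations of the factorization order and, being autoregressive, avoids BERT's limitations. XLNet also integrates ideas from Transformer-XL into pretraining. It outperforms BERT on 20 tasks, including question answering, natural language inference, and sentiment analysis.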

Introduction

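Pretraining on unlabeled text and then finetuning on downstream tasks is a two-stage recipe that has been highly successful in NLP. AR language modeling and autoencoding (AE) are the two most successful pretraining objectives.
An AR language model estimates the probability of a text sequence. Given a sequence $x=(x_1, x_2, ...,x_T)$, it factorizes the likelihood either forward, $p(x)=\prod_{t=1}^Tp(x_t|x_{<t})$, or backward, $p(x)=\prod_{t=T}^1p(x_t|x_{>t})$. An AR model therefore encodes only one direction of context (forward or backward), whereas downstream language understanding tasks often need bidirectional context.
An AE model reconstructs the original data from a corrupted input. In BERT, for example, a certain fraction of the input tokens is replaced by a mask symbol and the training objective is to recover the original tokens, which lets BERT use context from both sides. However, the mask symbol appears only during pretraining, which creates a pretrain-finetune discrepancy. In addition, BERT cannot model the joint probability with the chain rule as an AR model does; in other words, BERT assumes the masked tokens are independent of one another given the unmasked tokens.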

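XLNet combines the strengths of AR and AE while avoiding their weaknesses:

Instead of a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood over all permutations of the factorization order. Thanks to the permutations, the context of each position can contain tokens from both its left and its right.
As an AR model, XLNet does not corrupt the input, so it has no pretrain-finetune discrepancy. The AR objective also applies the chain rule to the joint probability of the predicted tokens, which removes BERT's independence assumption.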

proposed method

background

AR language model
$\max_{\theta} \log p_{\theta}(x)=\sum_{t=1}^T \log p_{\theta}(x_t|x_{<t})=\sum_{t=1}^T \log \frac {\exp (h_{\theta}(x_{1:t-1})^Te(x_t))}{\sum_{x^{'}}\exp (h_{\theta}(x_{1:t-1})^Te(x^{'}))}$

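where $h_{\theta}(x_{1:t-1})$ is the context representation produced by the language model (an RNN or a Transformer) and $e (x)$ is the embedding of $x$.
BERT is based on DAE. A text sequence $x$ is corrupted into $\hat x$ by randomly replacing some tokens with a mask symbol; writing the masked tokens as $x^{*}$, the training objective is
$\max_{\theta} \log p_{\theta}(x^{*}|\hat x) \approx \sum_{t=1}^T m_t \log p_{\theta}(x_t|\hat x)=\sum_{t=1}^T m_t \log \frac {\exp (H_{\theta}(\hat x)_t^Te(x_t))}{\sum_{x^{'}}\exp (H_{\theta}(\hat x)_t^Te(x^{'}))}$

where $m_t=1$ indicates that $x_t$ is masked, and $H_{\theta}$ is a Transformer that maps a length-$T$ sequence $x$ to hidden vectors $H_{\theta}(x)=[H_{\theta}(x)_1, ...,H_{\theta}(x)_T]$.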

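Pros and cons

Independence assumption: BERT assumes that all masked tokens are reconstructed independently of one another (hence the approximate equality in its objective), whereas the AR objective factorizes the likelihood with the chain rule and needs no such assumption.
Input noise: the mask symbol never appears in downstream tasks.
Context dependency: the AR representation $h_{\theta}(x_{1:t-1})$ sees only the tokens to the left, while BERT's $H_{\theta}(x)_t$ sees tokens on both sides.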

objective：permutation language modeling

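Given a sequence $x$ of length $T$, there are $T!$ possible factorization orders. Let $Z_T$ be the set of all permutations of the index sequence $[1, 2, ..., T]$, and for a permutation $z \in Z_T$ let $z_t$ and $z_{<t}$ denote its $t$-th element and its first $t-1$ elements. The objective can then be written as an expectation:
$\max_{\theta} E_{z \sim Z_T}[\sum _{t=1}^T\log p_{\theta}(x_{z_t}|x_{z_{<t}})]$
For a sequence $x$, one factorization order $z$ is sampled at a time and the log-likelihood is decomposed according to that order. Because the parameters are shared across all orders, in expectation each $x_t$ sees every other token in the sequence, so the model captures context from both sides. Because the objective is still autoregressive, it avoids BERT's weaknesses.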
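The expectation above can be made concrete with a small sketch. The function name `permutation_lm_objective`, the callback `log_prob_fn`, and the uniform toy model are assumptions introduced here for illustration; they are not part of the XLNet code base. The sketch enumerates every order exactly, which is only feasible for very short sequences, whereas training samples one order at a time.

```python
import itertools
import math


def permutation_lm_objective(tokens, log_prob_fn):
    """Average over all T! factorization orders z of
    sum_t log p(x_{z_t} | x_{z_<t})."""
    T = len(tokens)
    total = 0.0
    count = 0
    for z in itertools.permutations(range(T)):
        log_lik = 0.0
        for t in range(T):
            target_pos = z[t]
            # Context: tokens at the positions that precede z_t in the order z.
            context = {pos: tokens[pos] for pos in z[:t]}
            log_lik += log_prob_fn(target_pos, tokens[target_pos], context)
        total += log_lik
        count += 1
    return total / count


VOCAB_SIZE = 5  # hypothetical toy vocabulary size


def uniform_log_prob(target_pos, target_token, context):
    # Toy stand-in for p_theta: ignores the context entirely.
    return -math.log(VOCAB_SIZE)


tokens = ["new", "york", "is"]
objective = permutation_lm_objective(tokens, uniform_log_prob)
print(objective)
```

With a real model, `log_prob_fn` would depend on both the context and the target position, which is the point the later sections develop.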

remark on permutation

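The proposed objective permutes only the factorization order, not the order of the tokens in the sequence. In other words, the original sequence order is kept, the positional encodings correspond to the original sequence, and the permutation of the factorization order is realized through a proper attention mask in the Transformer. This is necessary because the model only ever sees text in its natural order during finetuning.
As an example, consider predicting $x_3$ for the same input sequence $x$ under different factorization orders.
Under each order, $x_3$ can attend only to the tokens that precede it in that order.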
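The $x_3$ example can be sketched in a few lines. The helper `visible_positions` and the four orders below are chosen here purely for illustration; positions are 1-based and always refer to the natural sequence order, since only the factorization order is permuted.

```python
def visible_positions(order, target):
    """Positions (1-based, natural order) that `target` may condition on
    under the factorization order `order`."""
    idx = order.index(target)
    return sorted(order[:idx])


# Four illustrative factorization orders for a sequence of length 4.
orders = [
    [3, 2, 4, 1],
    [2, 4, 3, 1],
    [1, 4, 2, 3],
    [4, 3, 1, 2],
]
contexts = {tuple(o): visible_positions(o, target=3) for o in orders}
for o, ctx in contexts.items():
    print(o, "-> x_3 sees positions", ctx)
```

Across the orders, $x_3$ sees nothing, then tokens on both sides, then all other tokens, then only a token to its right, which is how bidirectional context arises from an autoregressive objective.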

architecture：two-stream self-attention for target-aware representations

a concrete example of how standard LM parameterization fails

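Under the permutation objective, the standard Transformer parameterization does not work. Why?
Suppose $z_{<t}^{(1)}=z_{<t}^{(2)}=z_{<t}$ (the preceding tokens and their order are identical) but $z_{t}^{(1)}=i \neq j=z_{t}^{(2)}$ (the target positions differ). Then
$\underbrace {p_{\theta}(X_i=x|x_{z_{<t}})}_{z_t^{(1)}=i,z_{<t}^{(1)}=z_{<t}}=\underbrace {p_{\theta}(X_j=x|x_{z_{<t}})}_{z_t^{(2)}=j,z_{<t}^{(2)}=z_{<t}}=\frac {\exp (h(x_{z_{<t}})^Te(x))}{\sum_{x^{'}}\exp (h(x_{z_{<t}})^Te(x^{'}))}$
The two different target positions get exactly the same predicted distribution, because $h(x_{z_{<t}})$ does not depend on the position being predicted, even though the true distributions should differ. This is why the standard Transformer parameterization fails.
To fix this, the next-token distribution is re-parameterized with target-aware representations:

$p_{\theta}(X_{z_t}=x|x_{z_{<t}})=\frac {\exp (g_{\theta}(x_{z_{<t}},z_t)^Te(x))}{\sum_{x^{'}}\exp (g_{\theta}(x_{z_{<t}},z_t)^Te(x^{'}))}$
where $g_{\theta}(x_{z_{<t}},z_t)$ additionally takes the target position $z_t$ as input.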
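A toy numerical sketch of this failure mode follows. All embeddings, position vectors, and the additive form of `target_aware_predict` are made-up assumptions for illustration; XLNet itself obtains $g_{\theta}$ through two-stream attention, not by adding a position vector.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy embeddings e(x) for a 3-word vocabulary.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical toy vectors for the two candidate target positions i=0, j=1.
position_vectors = {0: [0.5, -0.5], 1: [-0.5, 0.5]}


def standard_predict(context_repr, target_pos):
    # h depends only on the context; target_pos is ignored.
    return softmax([dot(context_repr, e) for e in embeddings])


def target_aware_predict(context_repr, target_pos):
    # g additionally receives the target position z_t.
    g = [c + p for c, p in zip(context_repr, position_vectors[target_pos])]
    return softmax([dot(g, e) for e in embeddings])


context_repr = [0.2, 0.1]  # stands in for the encoding of x_{z_<t}
same = standard_predict(context_repr, 0) == standard_predict(context_repr, 1)
differ = target_aware_predict(context_repr, 0) != target_aware_predict(context_repr, 1)
print(same, differ)
```

The standard parameterization returns one distribution for both target positions, while any representation that takes $z_t$ as input can separate them.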

two-stream self-attention

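The remaining question is how to define $g_{\theta}(x_{z_{<t}},z_t)$. There are two requirements:

To predict the token $x_{z_t}$, the representation should use only the position $z_t$ and not the content $x_{z_t}$ (query representation).
To predict a later token $x_{z_j}$ with $j>t$, the representation should encode the content $x_{z_t}$ as well as its position (content representation).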

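In the first layer, the query stream is initialized with a trainable vector, $g_i^{(0)}=w$, and the content stream with the word embedding, $h_i^{(0)}=e(x_i)$.
Each subsequent layer $m$ updates the two streams as
$g_{z_t}^{(m)} \leftarrow Attention(Q=g_{z_t}^{(m-1)},KV=h_{z_{<t}}^{(m-1)};\theta) \quad \text{(query stream: uses } z_t \text{ but cannot see } x_{z_t}\text{)}\\ h_{z_t}^{(m)} \leftarrow Attention(Q=h_{z_t}^{(m-1)},KV=h_{z_{\leq t}}^{(m-1)};\theta) \quad \text{(content stream: uses both } z_t \text{ and } x_{z_t}\text{)}$

The two streams share the same set of parameters, and the update rule is the same as in a standard Transformer.
During finetuning, the query stream is dropped and only the content stream is kept.
During pretraining, the last-layer query representation $g_{z_t}^{(M)}$ is used to compute the probability.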
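The only difference between the two streams is whether a position may attend to itself, which can be sketched as a pair of attention masks derived from one factorization order. The helper `two_stream_masks` is a hypothetical name; the convention that 1 means "may not attend" is an assumption chosen here.

```python
def two_stream_masks(order):
    """Build content-stream and query-stream masks for a factorization order.

    `order` lists 0-based positions in the order they are predicted.
    mask[i][j] == 1 means position i may NOT attend to position j.
    """
    T = len(order)
    rank = [0] * T
    for step, pos in enumerate(order):
        rank[pos] = step
    # Content stream: keys are h_{z_<=t}, so a position also sees itself.
    content = [[0 if rank[j] <= rank[i] else 1 for j in range(T)]
               for i in range(T)]
    # Query stream: keys are h_{z_<t}, so a position never sees its own content.
    query = [[0 if rank[j] < rank[i] else 1 for j in range(T)]
             for i in range(T)]
    return content, query


content_mask, query_mask = two_stream_masks([2, 1, 3, 0])
for row_c, row_q in zip(content_mask, query_mask):
    print(row_c, row_q)
```

In this sketch the first position in the order has an all-ones query row, meaning it has nothing to attend to within the segment; a real implementation has to handle that case, for example through cached memory.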

partial prediction

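The permutation language modeling objective has clear benefits, but it is hard to optimize. To reduce the optimization difficulty, only the last few tokens of each factorization order are predicted.
A factorization order $z$ is split into a non-target subsequence $z_{\leq c}$ and a target subsequence $z_{>c}$, where $c$ is the cutting point. The objective becomes
$\max _{\theta}E_{z \sim Z_T}[\log p_{\theta}(x_{z_{>c}}|x_{z_{\leq c}})]=E_{z \sim Z_T}[\sum _{t=c+1}^{|z|}\log p_{\theta}(x_{z_t}|x_{z_{<t}})]$
$z_{>c}$ is chosen as the target because, given the factorization order $z$, it has the longest context in the sequence.
A hyperparameter $K$ is chosen so that $|z| / (|z| - c) \approx K$, i.e. about $1/K$ of the tokens are selected for prediction. For example, with a sequence of length 100 and $K=7$, only the last 14 or so tokens of the factorization order are predicted.
Query representations need not be computed for the unselected tokens, which saves time and memory.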
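The cut can be sketched as follows. The function name `split_for_partial_prediction` and the rounding rule (round down, at least one target) are assumptions made here; the text only fixes the ratio $|z| / (|z| - c) \approx K$.

```python
import random


def split_for_partial_prediction(order, K):
    """Split a factorization order into (non-target, target) parts so that
    roughly 1/K of the tokens are predicted."""
    T = len(order)
    num_predict = max(1, int(T / K))  # assumption: round down, at least one
    c = T - num_predict               # cutting point
    return order[:c], order[c:]


rng = random.Random(0)
order = list(range(100))
rng.shuffle(order)  # one sampled factorization order over 100 positions

non_targets, targets = split_for_partial_prediction(order, K=7)
print(len(non_targets), len(targets))
```

Only the positions in `targets` need a query representation; the rest only serve as context.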

incorporating ideas from transformer-xl

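Ideas from Transformer-XL are integrated into the pretraining framework, and the model is named XLNet after it (the two papers share authors, hence the shared naming).
Two important techniques are incorporated:

The relative positional encoding scheme
The segment recurrence mechanism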

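Suppose there are two segments taken from a long sequence $s$, $\hat x=s_{1:T}$ and $x=s_{T+1:2T}$, and let $\hat z$ and $z$ be permutations of $[1, ..., T]$ and $[T+1, ..., 2T]$ respectively. The first segment is processed according to $\hat z$, and the content representations $\hat h^{(m)}$ of every layer $m$ are cached.
For the next segment $x$, the attention update with memory can be written as:
$h_{z_t}^{(m)} \leftarrow Attention(Q=h_{z_t}^{(m-1)}, KV=[\hat h^{(m-1)}, h_{z_{\leq t}}^{(m-1)}]; \theta)$
Note that positional encodings depend only on the actual positions in the original sequence. Therefore, once the representations $\hat h^{(m)}$ have been computed, the attention update above is independent of $\hat z$. This allows the memory to be cached and reused without knowing the factorization order of the previous segment. The query stream is computed in the same way:
$g_{z_t}^{(m)} \leftarrow Attention(Q=g_{z_t}^{(m-1)}, KV=[\hat h^{(m-1)}, h_{z_{<t}}^{(m-1)}]; \theta)$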
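A minimal sketch of how the cached segment enters the attention computation is given below. It assumes, following $KV=[\hat h^{(m-1)}, \cdot]$ where $\hat h$ carries no permutation restriction, that every cached position is visible to every current position. The helper names and string placeholders are hypothetical.

```python
def extend_mask_with_memory(mask, mem_len):
    """Prepend mem_len columns of 0 (visible) to every row of a [T, T] mask.

    Convention: mask[i][j] == 1 means position i may NOT attend to key j.
    """
    return [[0] * mem_len + list(row) for row in mask]


def concat_keys(memory_states, current_states):
    """KV = [h_hat, h]: cached states of the previous segment come first."""
    return list(memory_states) + list(current_states)


# Content-stream mask of the current segment for some factorization order
# (toy values, T = 3).
current_mask = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]
memory_states = ["h_hat_1", "h_hat_2"]  # placeholders for cached vectors
current_states = ["h_1", "h_2", "h_3"]  # placeholders for current vectors

full_mask = extend_mask_with_memory(current_mask, mem_len=len(memory_states))
keys = concat_keys(memory_states, current_states)
print(full_mask)
print(keys)
```

Nothing in this sketch needs $\hat z$: the mask columns for the memory are all zeros regardless of how the previous segment was factorized.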
modeling multiple segments

[CLS, A, SEP, B, SEP]
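When the input consists of two segments, it is arranged in the same way as in BERT.
XLNet-Large, however, does not use the next sentence prediction (NSP) objective.

relative segment encodings

BERT adds an absolute segment embedding to the word embedding at each position. XLNet instead uses relative segment encodings, extending the relative encoding idea from Transformer-XL to segments.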
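What "relative" means for segments can be illustrated with a short sketch: the encoding for a pair of positions depends only on whether the two positions lie in the same segment, not on which segment each one belongs to. The helper `same_segment_matrix` and the example segment ids are assumptions for illustration; in the model, each entry would select one of two learned vectors.

```python
def same_segment_matrix(seg_ids):
    """matrix[i][j] is True when positions i and j are in the same segment."""
    return [[si == sj for sj in seg_ids] for si in seg_ids]


seg_ids = [0, 0, 1, 1]      # hypothetical: two tokens of A, two tokens of B
swapped_ids = [1, 1, 0, 0]  # same split, segment labels exchanged

relative = same_segment_matrix(seg_ids)
relative_swapped = same_segment_matrix(swapped_ids)
print(relative == relative_swapped)
```

An absolute segment embedding would change when the labels are exchanged, whereas the pairwise same-or-different matrix stays the same.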

discussion

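Consider the sentence [new, york, is, a, city]. Suppose BERT and XLNet both select the two tokens [new, york] as prediction targets, and XLNet samples the factorization order [is, a, city, new, york]. The two objectives then reduce to
$T_{\text{BERT}} = \log p(\text{new}|\text{is, a, city})+\log p(\text{york}|\text{is, a, city})\\ T_{\text{XLNet}} = \log p(\text{new}|\text{is, a, city})+\log p(\text{york}|\text{new, is, a, city})$
XLNet thus captures the dependency between the pair (new, york), which BERT's objective omits.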
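The difference between the two objectives comes down to what each target is conditioned on, which the following sketch spells out. The helpers `bert_targets` and `xlnet_targets` are hypothetical names introduced here; they only list (target, context) pairs and do not compute any probabilities.

```python
def bert_targets(tokens, masked):
    """Each masked token is predicted from the unmasked tokens only."""
    context = [t for t in tokens if t not in masked]
    return [(m, context) for m in masked]


def xlnet_targets(order, num_targets):
    """The last num_targets tokens of the factorization order are predicted,
    each conditioned on everything that precedes it in that order."""
    c = len(order) - num_targets
    return [(order[t], order[:t]) for t in range(c, len(order))]


tokens = ["new", "york", "is", "a", "city"]
bert = bert_targets(tokens, masked=["new", "york"])
xlnet = xlnet_targets(["is", "a", "city", "new", "york"], num_targets=2)

for target, context in bert:
    print("BERT :", target, "|", context)
for target, context in xlnet:
    print("XLNet:", target, "|", context)
```

Only the XLNet context for "york" contains "new", which is the extra dependency the autoregressive factorization provides.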

Code

run_classifier.py: the training and evaluation script; the pretrained checkpoint, the data, and so on must be specified.

import function_builder  # pulls in the model-building code (modeling)
class InputExample(object)  # a single example for text classification
class DataProcessor(object)  # base class for data processors
class GLUEProcessor(DataProcessor)  # GLUE
class Yelp5Processor(DataProcessor)  # Yelp5
class ImdbProcessor(DataProcessor)  # Imdb
class MnliMatchedProcessor(GLUEProcessor)  # MNLI (matched)
class MnliMismatchedProcessor(MnliMatchedProcessor)  # MNLI (mismatched)
class StsbProcessor(GLUEProcessor)  # STS-B; all of these resemble BERT's processors, with a few more task types than BERT
def file_based_convert_examples_to_features  # converts InputExamples to a TFRecord file
def file_based_input_fn_builder  # builds the input_fn
def get_model_fn  # builds the model_fn
def main(_):

xlnet.py: the model wrapper, the run configuration, and the XLNet configuration

class XLNetConfig(object):
 """XLNetConfig contains hyperparameters that are specific to a model checkpoint;
 i.e., these hyperparameters should be the same between
 pretraining and finetuning.

 The following hyperparameters are defined:
   n_layer: int, the number of layers.
   d_model: int, the hidden size.
   n_head: int, the number of attention heads.
   d_head: int, the dimension size of each attention head.
   d_inner: int, the hidden size in feed-forward layers.
   ff_activation: str, "relu" or "gelu".
   untie_r: bool, whether to untie the biases in attention.
   n_token: int, the vocab size.
 """

class RunConfig(object):
 """RunConfig contains hyperparameters that could be different
 between pretraining and finetuning.
 These hyperparameters can also be changed from run to run.
 We store them separately from XLNetConfig for flexibility.
 Args:
     is_training: bool, whether in training mode.
     use_tpu: bool, whether TPUs are used.
     use_bfloat16: bool, use bfloat16 instead of float32.
     dropout: float, dropout rate.
     dropatt: float, dropout rate on attention probabilities.
     init: str, the initialization scheme, either "normal" or "uniform".
     init_range: float, initialize the parameters with a uniform distribution
       in [-init_range, init_range]. Only effective when init="uniform".
     init_std: float, initialize the parameters with a normal distribution
       with mean 0 and stddev init_std. Only effective when init="normal".
     mem_len: int, the number of tokens to cache.
     reuse_len: int, the number of tokens in the current batch to be cached
       and reused in the future.
     bi_data: bool, whether to use bidirectional input pipeline.
       Usually set to True during pretraining and False during finetuning.
     clamp_len: int, clamp all relative distances larger than clamp_len.
       -1 means no clamping.
     same_length: bool, whether to use the same attention length for each token.
 """

class XLNetModel(object):
 """A wrapper of the XLNet model used during both pretraining and finetuning."""

 def __init__(self, xlnet_config, run_config, input_ids, seg_ids, input_mask,
              mems=None, perm_mask=None, target_mapping=None, inp_q=None,
              **kwargs):
   """
   Args:
     xlnet_config: XLNetConfig,
     run_config: RunConfig,
     input_ids: int32 Tensor in shape [len, bsz], the input token IDs.
     seg_ids: int32 Tensor in shape [len, bsz], the input segment IDs.
     input_mask: float32 Tensor in shape [len, bsz], the input mask.
       0 for real tokens and 1 for padding.
     mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
       from previous batches. The length of the list equals n_layer.
       If None, no memory is used.
     perm_mask: float32 Tensor in shape [len, len, bsz].
       If perm_mask[i, j, k] = 0, i attend to j in batch k;
       if perm_mask[i, j, k] = 1, i does not attend to j in batch k.
       If None, each position attends to all the others.
     target_mapping: float32 Tensor in shape [num_predict, len, bsz].
       If target_mapping[i, j, k] = 1, the i-th predict in batch k is
       on the j-th token.
       Only used during pretraining for partial prediction.
       Set to None during finetuning.
     inp_q: float32 Tensor in shape [len, bsz].
       1 for tokens with losses and 0 for tokens without losses.
       Only used during pretraining for two-stream attention.
       Set to None during finetuning.
   """
 def get_pooled_out(self, summary_type, use_summ_proj=True):
   """
   Args:
     summary_type: str, "last", "first", "mean", or "attn". The method
       to pool the input to get a vector representation.
     use_summ_proj: bool, whether to use a linear projection during pooling.

   Returns:
     float32 Tensor in shape [bsz, d_model], the pooled representation.
   """

 def get_sequence_output(self):
   """
   Returns:
     float32 Tensor in shape [len, bsz, d_model]. The last layer hidden
     representation of XLNet.
   """

 def get_new_memory(self):
   """
   Returns:
     list of float32 Tensors in shape [mem_len, bsz, d_model], the new
     memory that concatenates the previous memory with the current input
     representations.
     The length of the list equals n_layer.
   """

 def get_embedding_table(self):
   """
   Returns:
     float32 Tensor in shape [n_token, d_model]. The embedding lookup table.
     Used for tying embeddings between input and output layers.
   """
 def get_initializer(self):
   """
   Returns:
     A tf initializer. Used to initialize variables in layers on top of XLNet.
   """

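The pretraining-only inputs described in the `XLNetModel` docstring can be illustrated for a single example with plain nested lists. This is a sketch under assumptions: the helper names are hypothetical, the batch dimension is omitted, and the real inputs are float32 tensors of shape [num_predict, len, bsz] and [len, bsz].

```python
def build_target_mapping(target_positions, seq_len):
    """One-hot rows: row i has a 1.0 at the position of the i-th prediction."""
    return [[1.0 if j == pos else 0.0 for j in range(seq_len)]
            for pos in target_positions]


def build_inp_q(target_positions, seq_len):
    """1.0 for tokens that carry a loss, 0.0 for all other tokens."""
    targets = set(target_positions)
    return [1.0 if j in targets else 0.0 for j in range(seq_len)]


seq_len = 5
target_positions = [3, 1]  # hypothetical: predict token 3 first, then token 1

target_mapping = build_target_mapping(target_positions, seq_len)
inp_q = build_inp_q(target_positions, seq_len)
print(target_mapping)
print(inp_q)
```

During finetuning both inputs are set to None, as the docstring states, because only the content stream is used.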
function_builder.py
