XLNet

XLNet: Generalized Autoregressive Pretraining for Language Understanding

https://github.com/zihangdai/xlnet

Abstract

Pretraining based on denoising autoencoding (DAE), such as BERT, models bidirectional context and achieves better results than AR (autoregressive) language modeling. However, BERT ignores the dependency between masked positions and suffers from a pretrain-finetune discrepancy (no [MASK] tokens appear during finetuning). Weighing these pros and cons (bidirectional context vs. the drawbacks above), we propose XLNet: it learns bidirectional context by maximizing the expected log-likelihood over all permutations of the factorization order, and, being autoregressive, it avoids BERT's drawbacks. It also incorporates ideas from Transformer-XL into pretraining. XLNet outperforms BERT on 20 tasks, including question answering, natural language inference, and sentiment analysis.

Introduction

Pretraining on unlabeled text and then finetuning on downstream tasks has been hugely successful in NLP. AR (autoregressive) language modeling and AE (autoencoding) are the two most successful pretraining objectives.
An AR language model estimates the probability of a text sequence. Given a sequence $x=(x_1, x_2, \ldots, x_T)$, it factorizes the likelihood either forward, $p(x)=\prod_{t=1}^T p(x_t \mid x_{<t})$, or backward, $p(x)=\prod_{t=T}^{1} p(x_t \mid x_{>t})$. An AR model therefore only encodes a unidirectional (forward or backward) context, whereas downstream language understanding tasks often need bidirectional context.
An AE language model reconstructs the original data from a corrupted input. In BERT, for example, a fraction of the input tokens is replaced with [MASK] and the training objective is to recover the original tokens, which lets the model use context from both directions. However, the [MASK] symbol only appears during pretraining, creating a pretrain-finetune discrepancy, and BERT cannot use the chain rule to model the joint probability the way an AR model does. In other words, BERT predicts each masked token independently given the unmasked tokens.

XLNet combines the advantages of AR and AE while avoiding their drawbacks:

  • XLNet replaces the fixed forward or backward factorization of AR models with all permutations of the factorization order; under the permutations, each position can (in expectation) see tokens from both its left and its right.
  • As an AR model, XLNet does not corrupt the input, so there is no pretrain-finetune discrepancy; staying autoregressive also avoids BERT's independence assumption.

proposed method

background

The AR language modeling objective:
$$\max_{\theta}\ \log p_{\theta}(x)=\sum_{t=1}^T \log p_{\theta}(x_t \mid x_{<t})=\sum_{t=1}^T \log \frac{\exp\!\big(h_{\theta}(x_{1:t-1})^{\top} e(x_t)\big)}{\sum_{x'} \exp\!\big(h_{\theta}(x_{1:t-1})^{\top} e(x')\big)}$$

Here $h_{\theta}(x_{1:t-1})$ is the context representation produced by the model (an RNN or a Transformer), and $e(x)$ is the embedding of x.
BERT is based on DAE: a sequence $x$ is corrupted into $\hat x$ by randomly masking tokens; the masked tokens are denoted $x^{*}$, and the training objective is
$$\max_{\theta}\ \log p_{\theta}(x^{*} \mid \hat x) \approx \sum_{t=1}^T m_t \log p_{\theta}(x_t \mid \hat x)=\sum_{t=1}^T m_t \log \frac{\exp\!\big(H_{\theta}(\hat x)_t^{\top} e(x_t)\big)}{\sum_{x'} \exp\!\big(H_{\theta}(\hat x)_t^{\top} e(x')\big)}$$

Here $m_t = 1$ when $x_t$ is masked, and $H_{\theta}$ denotes the sequence of hidden vectors produced by a Transformer from $x$: $H_{\theta}(x)=[H_{\theta}(x)_1, \ldots, H_{\theta}(x)_T]$.

Pros and cons

  • Independence assumption: BERT assumes the masked tokens are reconstructed independently of each other.
  • Input noise: the artificial [MASK] symbol never appears in downstream tasks.
  • Context dependency: an AR model only conditions on the tokens to its left, whereas BERT can use context from both sides.

objective: permutation language modeling

Given a sequence x of length T, there are $T!$ possible factorization orders. Let $Z_T$ denote the set of all permutations of length T, and for $z \in Z_T$ let $z_t$ and $z_{<t}$ denote its t-th element and its first t-1 elements. The training objective can then be written as an expectation:
$$\max_{\theta}\ \mathbb{E}_{z \sim Z_T}\Big[\sum_{t=1}^T \log p_{\theta}(x_{z_t} \mid x_{z_{<t}})\Big]$$
For each sequence x, one factorization order z is sampled at a time, and the log-likelihood is computed according to that order; in expectation, every position gets to see context on both sides. Because the objective remains autoregressive, BERT's drawbacks are avoided.
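As a concrete, purely illustrative instance (not taken from the paper): for a length-4 sequence and one sampled order $z=(3,2,4,1)$, the inner sum above becomes

$$\log p_{\theta}(x_3)+\log p_{\theta}(x_2\mid x_3)+\log p_{\theta}(x_4\mid x_3,x_2)+\log p_{\theta}(x_1\mid x_3,x_2,x_4)$$

Each term is still autoregressive, but across different sampled orders every position ends up conditioning on tokens from both its left and its right.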

remark on permutation

The proposed objective only permutes the factorization order, not the order of the sequence itself. In other words, the original token order is kept, the positional encodings correspond to the original positions, and the permuted factorization order is realized through a proper attention mask inside the Transformer. This is necessary because during finetuning the model only encounters text in its natural order.
As an example, consider the token $x_3$ in the same input sentence x under different factorization orders: depending on the sampled order, $x_3$ attends to a different set of "preceding" tokens (those that come before it in the order), which may lie on either side of it in the sentence.
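A minimal sketch (my own illustration, not code from the repo; `permutation_masks` is a hypothetical helper) of how a sampled factorization order is turned into attention masks while the tokens stay in their natural positions:

```python
import numpy as np

def permutation_masks(seq_len, rng):
    """Sample one factorization order and build the two attention masks.

    Tokens keep their natural positions; only the masks encode the order.
    mask[i, j] == True means that position i may attend to position j.
    """
    z = rng.permutation(seq_len)            # a factorization order, e.g. [2, 0, 3, 1]
    step = np.empty(seq_len, dtype=int)     # step[i] = at which step position i is predicted
    step[z] = np.arange(seq_len)
    # content stream: i may see every j predicted no later than itself (z_{<=t})
    content_mask = step[None, :] <= step[:, None]
    # query stream: i may see only strictly earlier positions (z_{<t})
    query_mask = step[None, :] < step[:, None]
    return z, content_mask, query_mask

z, content_mask, query_mask = permutation_masks(4, np.random.default_rng(0))
print(z)
print(content_mask.astype(int))
print(query_mask.astype(int))
```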

architecture: two-stream self-attention for target-aware representations

a concrete example of how standard LM parameterization fails

Under the permutation objective, the standard Transformer parameterization fails. Why?
Suppose $z_{<t}^{(1)}=z_{<t}^{(2)}=z_{<t}$ (the preceding tokens and their order are identical), but $z_t^{(1)}=i \neq j=z_t^{(2)}$ (the positions to be predicted differ). In this case we have
$$\underbrace{p_{\theta}(X_i=x \mid x_{z_{<t}})}_{z_t^{(1)}=i,\ z_{<t}^{(1)}=z_{<t}} \;=\; \underbrace{p_{\theta}(X_j=x \mid x_{z_{<t}})}_{z_t^{(2)}=j,\ z_{<t}^{(2)}=z_{<t}} \;=\; \frac{\exp\!\big(h_{\theta}(x_{z_{<t}})^{\top} e(x)\big)}{\sum_{x'} \exp\!\big(h_{\theta}(x_{z_{<t}})^{\top} e(x')\big)}$$
The two different target positions receive exactly the same predictive distribution, even though they should clearly differ, so the standard Transformer parameterization fails.
A new target-aware parameterization is proposed to fix this:

$$p_{\theta}(X_{z_t}=x \mid x_{z_{<t}})=\frac{\exp\!\big(g_{\theta}(x_{z_{<t}}, z_t)^{\top} e(x)\big)}{\sum_{x'} \exp\!\big(g_{\theta}(x_{z_{<t}}, z_t)^{\top} e(x')\big)}$$
The input of $g_{\theta}(x_{z_{<t}}, z_t)$ additionally contains the target position $z_t$.

two-stream self-attention

The remaining question is how to define $g_{\theta}(x_{z_{<t}}, z_t)$:

  • When predicting $x_{z_t}$ itself, use only the position $z_t$ and not the content $x_{z_t}$ (the query representation).
  • When predicting a later token $x_{z_j}$ with $j>t$, encode both the position and the content of $x_{z_t}$ (the content representation).

(figure: the two-stream self-attention mechanism)
At the first layer, the query stream is initialized with a trainable vector, $g_i^{(0)}=w$, while the content stream is initialized with the word embedding, $h_i^{(0)}=e(x_i)$.
Each subsequent layer $m$ is updated as:
$$g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q=g_{z_t}^{(m-1)},\ KV=h_{z_{<t}}^{(m-1)};\ \theta\big) \qquad \text{(query stream: uses the position } z_t \text{ but not the content } x_{z_t}\text{)}$$
$$h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q=h_{z_t}^{(m-1)},\ KV=h_{z_{\le t}}^{(m-1)};\ \theta\big) \qquad \text{(content stream: uses both } z_t \text{ and } x_{z_t}\text{)}$$

The two streams share the same set of parameters, and the update rule is otherwise the same as in a standard Transformer.
During finetuning, the query stream is dropped and only the content stream is kept.
Finally, the last-layer query representation $g_{z_t}^{(M)}$ is used to compute the predictive distribution.
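A toy numpy sketch of a single two-stream layer (illustrative only: one head, no residuals, layer norm, or relative positional encodings, unlike the real modeling.py):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, mask, W_q, W_k, W_v):
    """Single-head attention; mask[i, j] = True means query i may attend to key j."""
    Q, K, V = q_in @ W_q, kv_in @ W_k, kv_in @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e30)  # block disallowed positions
    return softmax(scores) @ V

def two_stream_layer(h, g, content_mask, query_mask, W_q, W_k, W_v):
    """One layer of two-stream self-attention; both streams share the same parameters."""
    # content stream: queries from h, keys/values from h, allowed set is z_{<=t}
    h_new = attention(h, h, content_mask, W_q, W_k, W_v)
    # query stream: queries from g (position info only), keys/values still from h, allowed set is z_{<t}
    g_new = attention(g, h, query_mask, W_q, W_k, W_v)
    return h_new, g_new

T, d = 4, 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))

z = rng.permutation(T)                               # a sampled factorization order
step = np.empty(T, dtype=int); step[z] = np.arange(T)
content_mask = step[None, :] <= step[:, None]        # z_{<=t}
query_mask = step[None, :] < step[:, None]           # z_{<t}

h = rng.standard_normal((T, d))                      # h_i^(0) = e(x_i), the word embeddings
g = np.tile(rng.standard_normal((1, d)), (T, 1))     # g_i^(0) = w, one shared trainable vector
h, g = two_stream_layer(h, g, content_mask, query_mask, W_q, W_k, W_v)
```

In this toy version the query row of the token predicted first is fully blocked; the real model additionally has cached memory and relative-position terms to attend to.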

partial prediction

The permutation LM objective is appealing, but how do we optimize it? To reduce the optimization difficulty, for each factorization order only the last few tokens are selected as prediction targets.
Split z into $z_{>c}$ (the target subsequence) and $z_{\le c}$ (the non-target subsequence), where c is the cutting point. The objective becomes
$$\max_{\theta}\ \mathbb{E}_{z \sim Z_T}\Big[\log p_{\theta}(x_{z_{>c}} \mid x_{z_{\le c}})\Big] = \mathbb{E}_{z \sim Z_T}\Big[\sum_{t=c+1}^{|z|} \log p_{\theta}(x_{z_t} \mid x_{z_{<t}})\Big]$$
Because $z_{>c}$ is the tail of the permuted order, the target tokens condition on the longest available context and therefore absorb the most information.
A hyperparameter K is introduced such that $|z|/(|z|-c) \approx K$. For example, with a sequence length of 100 and K=7, only the last 14 or so tokens are predicted.
Tokens that are not selected as targets do not need query representations, which saves time and memory.
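A tiny sketch of how the cut point c could be derived from K, following the description above (the actual preprocessing lives elsewhere in the repo; `partial_prediction_cut` is a hypothetical helper name):

```python
def partial_prediction_cut(seq_len, K=7):
    """Pick c so that |z| / (|z| - c) is roughly K; only z_{>c} are prediction targets."""
    num_targets = max(1, seq_len // K)   # e.g. 100 // 7 = 14 target tokens
    c = seq_len - num_targets            # the first c tokens of the permuted order are context only
    return c, num_targets

print(partial_prediction_cut(100, K=7))  # (86, 14)
```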

incorporating ideas from transformer-xl

Ideas from Transformer-XL are integrated into the pretraining framework, which is also where the model name XLNet comes from (same authors, consistent naming).
Two key techniques are incorporated:

  • the relative positional encoding scheme
  • the segment recurrence mechanism

Suppose we have two segments from a long sequence, $\hat x=s_{1:T}$ and $x=s_{T+1:2T}$, and let $\hat z$ and $z$ be permutations of the two segments respectively. Based on $\hat z$, the first segment is processed and its content representations $\hat h^{(m)}$ are cached for every layer m.
For the next segment x, the attention update with memory can then be written as:
$$h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q=h_{z_t}^{(m-1)},\ KV=[\hat h^{(m-1)},\ h_{z_{\le t}}^{(m-1)}];\ \theta\big)$$
Note that positional encodings depend only on the actual positions in the original sequence. Therefore, once the representations $\hat h^{(m)}$ have been computed, the update above is independent of $\hat z$: the memory can be cached and reused without knowing the factorization order of the previous segment. The query stream is computed in the same way:
$$g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\big(Q=g_{z_t}^{(m-1)},\ KV=[\hat h^{(m-1)},\ h_{z_{<t}}^{(m-1)}];\ \theta\big)$$
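A rough numpy sketch of the segment-recurrence bookkeeping (illustrative; in the repo this corresponds to the `mems` handling, and the cached states are treated as constants with no gradients flowing back into the previous segment):

```python
import numpy as np

def concat_memory(mem_prev_layer, h_prev_layer):
    """Keys/values for the current segment = [cached memory ; current hidden states]."""
    return np.concatenate([mem_prev_layer, h_prev_layer], axis=0)   # [mem_len + cur_len, d]

def extend_mask(perm_mask, mem_len):
    """Memory positions are visible to every query (the update is independent of z-hat);
    within the current segment the permutation mask still applies."""
    cur_len = perm_mask.shape[0]
    return np.concatenate([np.ones((cur_len, mem_len), dtype=bool), perm_mask], axis=1)

def new_memory(mem_old, h_current, mem_len):
    """Cache the last mem_len states of [old memory ; current states] for the next segment."""
    return np.concatenate([mem_old, h_current], axis=0)[-mem_len:]
```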

modeling multiple segments

[A, SEP, B, SEP, CLS]
For a two-segment input, the format is similar to BERT's, except that the CLS token is placed at the end.
Note, however, that XLNet-Large drops the NSP (next sentence prediction) objective.

  • Relative segment encodings: BERT adds an absolute segment embedding to each token embedding, whereas XLNet uses relative segment encodings, following the same relative-encoding idea as Transformer-XL (see the sketch below).
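A minimal sketch of what "relative" means here, based on the paper's description (the encoding only records whether two positions lie in the same segment, not which segment each one is in); `relative_segment_mat` is a hypothetical name:

```python
import numpy as np

def relative_segment_mat(seg_ids):
    """seg_mat[i, j] = 0 if tokens i and j belong to the same segment, else 1.

    The model then learns encodings for just these two cases ("same" vs. "different")
    instead of one absolute embedding per segment id as in BERT.
    """
    seg_ids = np.asarray(seg_ids)
    return (seg_ids[:, None] != seg_ids[None, :]).astype(np.int32)

print(relative_segment_mat([0, 0, 0, 1, 1]))
```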

discussion

Consider predicting [New, York] in the sentence [New, York, is, a, city]. Suppose both BERT and XLNet select the two tokens New and York as prediction targets, and XLNet samples the factorization order [is, a, city, New, York]. The two objectives are then

$$T_{\mathrm{BERT}} = \log p(\mathrm{New} \mid \mathrm{is,\ a,\ city}) + \log p(\mathrm{York} \mid \mathrm{is,\ a,\ city})$$
$$T_{\mathrm{XLNet}} = \log p(\mathrm{New} \mid \mathrm{is,\ a,\ city}) + \log p(\mathrm{York} \mid \mathrm{New,\ is,\ a,\ city})$$

XLNet captures the dependency between New and York that BERT's independence assumption discards.

Code

run_classifier.py - training and evaluation script; you need to specify the pretrained checkpoint, the task data, and so on.

import function_builder # imports the modeling code
class InputExample(object) # a single example for text classification
class DataProcessor(object) # base class for data processors
class GLUEProcessor(DataProcessor) # GLUE tasks
class Yelp5Processor(DataProcessor) # Yelp-5
class ImdbProcessor(DataProcessor) # IMDB
class MnliMatchedProcessor(GLUEProcessor) # MNLI matched
class MnliMismatchedProcessor(MnliMatchedProcessor) # MNLI mismatched
class StsbProcessor(GLUEProcessor) # STS-B -- all similar to BERT's processors, with a few more task types
def file_based_convert_examples_to_features # converts InputExamples into a TFRecord file
def file_based_input_fn_builder # builds the input_fn
def get_model_fn # builds the model_fn
def main(_):

xlnet.py - the model wrapper plus the run configuration and XLNet configuration classes

class XLNetConfig(object):
 """XLNetConfig contains hyperparameters that are specific to a model checkpoint;
 i.e., these hyperparameters should be the same between
 pretraining and finetuning.

 The following hyperparameters are defined:
   n_layer: int, the number of layers.
   d_model: int, the hidden size.
   n_head: int, the number of attention heads.
   d_head: int, the dimension size of each attention head.
   d_inner: int, the hidden size in feed-forward layers.
   ff_activation: str, "relu" or "gelu".
   untie_r: bool, whether to untie the biases in attention.
   n_token: int, the vocab size.
 """

class RunConfig(object):
 """RunConfig contains hyperparameters that could be different
 between pretraining and finetuning.
 These hyperparameters can also be changed from run to run.
 We store them separately from XLNetConfig for flexibility.
 Args:
     is_training: bool, whether in training mode.
     use_tpu: bool, whether TPUs are used.
     use_bfloat16: bool, use bfloat16 instead of float32.
     dropout: float, dropout rate.
     dropatt: float, dropout rate on attention probabilities.
     init: str, the initialization scheme, either "normal" or "uniform".
     init_range: float, initialize the parameters with a uniform distribution
       in [-init_range, init_range]. Only effective when init="uniform".
     init_std: float, initialize the parameters with a normal distribution
       with mean 0 and stddev init_std. Only effective when init="normal".
     mem_len: int, the number of tokens to cache.
      reuse_len: int, the number of tokens in the current batch to be cached
       and reused in the future.
     bi_data: bool, whether to use bidirectional input pipeline.
       Usually set to True during pretraining and False during finetuning.
     clamp_len: int, clamp all relative distances larger than clamp_len.
       -1 means no clamping.
     same_length: bool, whether to use the same attention length for each token.
"""

class XLNetModel(object):
 """A wrapper of the XLNet model used during both pretraining and finetuning."""

 def __init__(self, xlnet_config, run_config, input_ids, seg_ids, input_mask,
              mems=None, perm_mask=None, target_mapping=None, inp_q=None,
              **kwargs):
   """
   Args:
     xlnet_config: XLNetConfig,
     run_config: RunConfig,
     input_ids: int32 Tensor in shape [len, bsz], the input token IDs.
     seg_ids: int32 Tensor in shape [len, bsz], the input segment IDs.
     input_mask: float32 Tensor in shape [len, bsz], the input mask.
       0 for real tokens and 1 for padding.
     mems: a list of float32 Tensors in shape [mem_len, bsz, d_model], memory
       from previous batches. The length of the list equals n_layer.
       If None, no memory is used.
     perm_mask: float32 Tensor in shape [len, len, bsz].
        If perm_mask[i, j, k] = 0, i attends to j in batch k;
       if perm_mask[i, j, k] = 1, i does not attend to j in batch k.
       If None, each position attends to all the others.
     target_mapping: float32 Tensor in shape [num_predict, len, bsz].
        If target_mapping[i, j, k] = 1, the i-th prediction in batch k is
       on the j-th token.
       Only used during pretraining for partial prediction.
       Set to None during finetuning.
     inp_q: float32 Tensor in shape [len, bsz].
       1 for tokens with losses and 0 for tokens without losses.
       Only used during pretraining for two-stream attention.
       Set to None during finetuning.
   """
 def get_pooled_out(self, summary_type, use_summ_proj=True):
   """
   Args:
     summary_type: str, "last", "first", "mean", or "attn". The method
       to pool the input to get a vector representation.
     use_summ_proj: bool, whether to use a linear projection during pooling.

   Returns:
     float32 Tensor in shape [bsz, d_model], the pooled representation.
   """

 def get_sequence_output(self):
   """
   Returns:
     float32 Tensor in shape [len, bsz, d_model]. The last layer hidden
     representation of XLNet.
   """

 def get_new_memory(self):
   """
   Returns:
     list of float32 Tensors in shape [mem_len, bsz, d_model], the new
     memory that concatenates the previous memory with the current input
     representations.
     The length of the list equals n_layer.
   """

 def get_embedding_table(self):
   """
   Returns:
     float32 Tensor in shape [n_token, d_model]. The embedding lookup table.
     Used for tying embeddings between input and output layers.
   """
 def get_initializer(self):
   """
   Returns:
     A tf initializer. Used to initialize variables in layers on top of XLNet.
   """

function_builder.py

