Top 2 Language Models

ULMFiT/AWD-LSTM

In 2018, the new NLP paradigm of pretraining a language model and then transferring it to specific tasks began to become mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) was one of its pioneers. Its pipeline has three stages: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The language model is AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier adds pooling and linear layers on top of the language model.
AWD-LSTM (Merity et al. 2017) is the language model underlying ULMFiT. Its essence is dropout everywhere: it takes a plain LSTM language model and adds many regularization and optimization tricks, which achieved SOTA at the time. The main tricks are:

  • Apply DropConnect regularization to the LSTM's hidden-to-hidden weight matrices (standard Dropout on hidden states disrupts long-range dependencies; DropConnect switches from randomly zeroing activations to randomly zeroing network weights, so it leaves the standard cuDNN LSTM implementation intact and is more efficient)
  • Add Dropout after the embedding layer and between LSTM layers to prevent overfitting (standard Dropout resamples its mask every time it is applied; LockedDropout samples the mask once, locks it, and reuses it across time steps, which is more efficient)
  • Tie the parameters of the embedding and softmax layers to reduce the total parameter count and to keep the model from simply learning a one-to-one mapping between inputs and outputs, which helps language model training considerably
  • Activation Regularization (AR) / Temporal AR (TAR): on the output of the LSTM's last layer, apply an L2 penalty that pushes activations toward zero, and another L2 penalty on the difference between outputs at consecutive time steps (see the sketch after this list)
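
To make two of these tricks concrete, here is a minimal PyTorch-style sketch of LockedDropout and the AR/TAR penalties. This is not the original Salesforce/fastai code; the class and function names are mine, and activations are assumed to have shape (seq_len, batch, dim).

```python
import torch
import torch.nn as nn


class LockedDropout(nn.Module):
    """Variational dropout: sample one mask per sequence and reuse it at every time step."""

    def forward(self, x, p=0.5):
        # x: (seq_len, batch, dim)
        if not self.training or p == 0:
            return x
        # One mask shared across the time dimension (broadcast over dim 0).
        mask = x.new_empty(1, x.size(1), x.size(2)).bernoulli_(1 - p) / (1 - p)
        return x * mask


def ar_tar_penalty(output, alpha=2.0, beta=1.0):
    """AR: L2 pulling the last layer's activations toward zero.
    TAR: L2 on the difference between outputs at consecutive time steps."""
    ar = alpha * output.pow(2).mean()
    tar = beta * (output[1:] - output[:-1]).pow(2).mean()
    return ar + tar
```

In training, LockedDropout would be applied to the embedding output and between LSTM layers, and `ar_tar_penalty(last_layer_output)` would be added to the language model's cross-entropy loss (the defaults alpha=2, beta=1 are the values used in the AWD-LSTM paper).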

ULMFiT adds a number of training tricks on top of AWD-LSTM:

  • Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly and then decreases linearly, with a short increase phase and a long decrease phase, so the model first converges quickly and then fine-tunes.
  • Discriminative Fine-Tuning: different layers of the network capture different kinds of information, from general to task-specific. After fixing the learning rate of the last layer, each lower layer's rate is reduced by a constant factor: $\eta^{l-1} = \eta^{l} / 2.6$ (see the sketch after this list).
  • Gradual Unfreezing: unfreeze layers one at a time from top to bottom during fine-tuning, to prevent catastrophic forgetting
  • Variable Length BPTT: add randomness to the default BPTT length of 70 to get an effect similar to shuffling
  • Concat Pooling: concatenate the last hidden state with a max pooling and a mean pooling over all hidden states, which alleviates the long-range dependency problem to some extent
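
To make the first two tricks concrete, here is a minimal sketch of STLR and the discriminative learning rates, following the formulas in Howard & Ruder (2018); the function names and the notion of "layer groups" are illustrative, not fastai's actual API.

```python
def stlr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular LR at step t of T: short linear warm-up, long linear decay."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return eta_max * (1 + p * (ratio - 1)) / ratio


def discriminative_lrs(eta_last, n_layers, factor=2.6):
    """eta^{l-1} = eta^l / 2.6: each lower layer group gets a geometrically smaller rate."""
    return [eta_last / factor ** (n_layers - 1 - l) for l in range(n_layers)]
```

For example, `discriminative_lrs(0.01, 4)` returns roughly [0.00057, 0.0015, 0.0038, 0.01], from the bottom layer group to the top; the defaults eta_max=0.01, cut_frac=0.1, ratio=32 are the values reported in the ULMFiT paper.
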
ELMo

Basic word vectors (Word2vec, GloVe, fastText) give every word a single, fixed representation, but in reality a word's meaning changes with its context. This is the most fundamental difficulty in NLP: ambiguity. Contextualized word embeddings therefore try to produce different representations for a word in different contexts, and ELMo (Embeddings from Language Models) is the most successful of these models. ELMo is built on a 2-layer bidirectional LSTM language model (biLM), so compared with Word2vec it can use much longer context (from a context window to the whole sentence); it uses the internal states of each LSTM layer as word representations, and every word vector is computed dynamically from the sentence it appears in. ELMo representations are a function of all of the internal layers of the biLM: a linear combination of the vectors stacked above each input word, learned per end task (a minimal sketch of this scalar mix follows the architecture notes below). The architecture details:

  • Use a character CNN to build the initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512 dim projection
  • Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input
  • Use a residual connection
  • Tie the parameters of the token input and output (softmax) layers, and tie these between the forward and backward LMs
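
To make the layer-mixing idea concrete, here is a minimal PyTorch-style sketch of the task-specific scalar mix, ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}; the class name and tensor layout are assumptions of mine, not the AllenNLP implementation.

```python
import torch
import torch.nn as nn


class ScalarMix(nn.Module):
    """Learned, task-specific linear combination of the biLM layer outputs."""

    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # s_j, softmax-normalized in forward
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scaling factor

    def forward(self, layer_states):
        # layer_states: (num_layers, seq_len, dim), the stacked biLM layer outputs for one sentence
        weights = torch.softmax(self.scalars, dim=0)
        mixed = (weights.view(-1, 1, 1) * layer_states).sum(dim=0)
        return self.gamma * mixed
```

Each downstream task trains its own mixing weights and gamma, while the biLM parameters themselves stay frozen.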

ELMo effectively demonstrated that exposing the deep internals of the pre-trained network is highly beneficial, giving downstream tasks semi-supervision signals at different levels:

  • Lower-level states are better for lower-level syntax: part-of-speech tagging, syntactic dependencies, NER
  • Higher-level states are better for higher-level semantics: sentiment, semantic role labeling, question answering, SNLI

Transformer-based Neural LMs

Transformer Done!

GPT

Bert

Bert marked the beginning of a new era in NLP.

GPT-2

Reference

  • Books and Tutorials
    • Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
    • CS224n: Natural Language Processing with Deep Learning, Winter 2019
    • CMU11-747: Neural Nets for NLP, Spring 2019
    • Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
    • Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
  • Papers
    • Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
    • Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
    • Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
    • Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
    • Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
    • Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
    • Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)