ULMFiT/AWD-LSTM
In 2018 a new NLP paradigm — pretrain a language model, then transfer it to specific tasks — began to go mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) was one of its pioneers. The pipeline has three stages: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The language model is AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier is built by stacking pooling and linear layers on top of the language model:
The AWD-LSTM (Merity et al. 2017) language model works as follows.
The essence of the model is dropout everywhere: a basic LSTM augmented with many regularization and optimization tricks reached the SOTA of its time:
- DropConnect regularization on the LSTM's hidden-to-hidden weight matrices (standard dropout on recurrent connections breaks long-range dependencies; DropConnect instead randomly zeroes a subset of the network weights rather than activations, which leaves the standard cuDNN LSTM implementation intact and is therefore more efficient)
- Dropout after the embedding layer and the LSTM layers to prevent overfitting (standard dropout resamples its mask at every time step; LockedDropout samples the mask once, then locks and reuses it across the sequence, which is more efficient)
- Weight tying between the embedding layer and the softmax layer, which reduces the total parameter count and prevents the model from having to learn a one-to-one input-output correspondence; this helps language model learning considerably
- Activation Regularization (AR) / Temporal AR (TAR): on the output of the last LSTM layer, an L2 penalty pushes activations toward 0, and an L2 penalty is also applied to the difference between outputs at consecutive time steps
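The AR/TAR penalties and the DropConnect mask above can be sketched numerically — a minimal NumPy sketch, not the authors' implementation; the function names and the fixed RNG seed are my own:

```python
import numpy as np

def ar_tar_loss(outputs, alpha=2.0, beta=1.0):
    """AR: L2 penalty pulling the last layer's activations toward 0.
    TAR: L2 penalty on the difference between consecutive time steps.
    outputs shape: (seq_len, batch, hidden)."""
    ar = alpha * np.mean(outputs ** 2)
    tar = beta * np.mean((outputs[1:] - outputs[:-1]) ** 2)
    return ar + tar

def dropconnect_mask(weight_shape, p=0.5, rng=None):
    """DropConnect: sample ONE mask over the hidden-to-hidden weight
    matrix and reuse the masked weights for the whole sequence,
    instead of zeroing activations at every step."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.binomial(1, 1 - p, size=weight_shape) / (1 - p)

# On constant activations TAR vanishes and AR reduces to alpha * value^2:
h = np.ones((70, 32, 400))
loss = ar_tar_loss(h)  # 2.0 * 1 + 1.0 * 0 = 2.0
```

Note that the DropConnect mask is sampled once per forward pass over the whole weight matrix, so the same masked weights are applied at every time step — this is what keeps the cuDNN LSTM kernel usable.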
ULMFiT adds a number of training tricks on top of AWD-LSTM:
- Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly, then decreases linearly, with a short increase phase and a long decrease phase — converge quickly first, then fine-tune.
- Discriminative Fine-Tuning: different layers of the network capture different information, from general to task-specific. After fixing the last layer's learning rate, each lower layer's rate decays geometrically: $\eta^{l-1} = \eta^{l} / 2.6$
- Gradual Unfreezing: unfreeze layers for training one at a time from top to bottom, to prevent catastrophic forgetting
- Variable-Length BPTT: add randomness around the default BPTT length of 70, achieving an effect similar to shuffling
- Concat Pooling: concatenate the last hidden state with max pooling and mean pooling over all hidden states, which alleviates the long-range dependency problem to some extent
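The tricks above can be sketched with the formulas from the ULMFiT paper — a NumPy sketch; the helper names are mine, not from the fastai codebase:

```python
import numpy as np

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular LR: linear warm-up over the first cut_frac of
    the T training steps, then a longer linear decay toward lr_max/ratio."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_top, n_layers, decay=2.6):
    """eta^{l-1} = eta^l / 2.6: fix the top layer's LR, decay downward."""
    return [lr_top / decay ** (n_layers - 1 - l) for l in range(n_layers)]

def concat_pool(hidden_states):
    """[h_T; maxpool(H); meanpool(H)] over a (seq_len, hidden) matrix."""
    return np.concatenate([hidden_states[-1],
                           hidden_states.max(axis=0),
                           hidden_states.mean(axis=0)])

# The LR peaks at lr_max exactly at the end of the warm-up phase:
assert stlr(100, 1000) == 0.01
```

With `cut_frac=0.1` the warm-up occupies only 10% of training, giving the short-increase/long-decrease slanted triangle the paper describes.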
ELMo
Basic word vectors (Word2vec, GloVe, fastText) give each word a single, fixed representation, but in reality a word's meaning varies with its context. This is the most fundamental difficulty in NLP — ambiguity — and contextualized word embeddings therefore try to produce a different representation for a word in each context; ELMo (Embeddings from Language Models) is the most successful such model. ELMo is based on a 2-layer bidirectional LSTM language model (biLM); compared with Word2vec it can use much longer context (the whole sentence instead of a fixed context window), it uses the internal states of each LSTM layer as word representations, and every word vector is computed dynamically from the sentence it appears in. ELMo representations are a function of all of the internal layers of the biLM: a task-specific linear combination of the vectors stacked above each input word.
- Use a character CNN to build the initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512-dim projection
- Use 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
- Use a residual connection
- Tie parameters of the token input and output (softmax) layers, and tie these between the forward and backward LMs
ELMo effectively demonstrated that exposing the deep internals of the pre-trained network is very useful, giving downstream tasks semi-supervision signals at different levels:
- Lower-level states are better for lower-level syntax: part-of-speech tagging, syntactic dependencies, NER
- Higher-level states are better for higher-level semantics: sentiment, semantic role labeling, question answering, SNLI
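The task-specific linear combination mentioned above can be sketched as follows — a NumPy sketch of the paper's formula $\text{ELMo}_k = \gamma \sum_j s_j\, h_{k,j}$ with softmax-normalized weights $s$; `scalar_mix` is my own name for it:

```python
import numpy as np

def scalar_mix(layer_states, s_logits, gamma=1.0):
    """Combine the biLM layer representations for one token: a learned,
    softmax-normalized weighting over the layers (token embedding layer
    plus the two LSTM layers), scaled by a task-specific gamma."""
    s = np.exp(s_logits - np.max(s_logits))  # stable softmax
    s = s / s.sum()
    return gamma * sum(w * h for w, h in zip(s, layer_states))

# With equal logits the mix is just the average of the layer vectors:
layers = [np.array([1.0, 1.0]), np.array([3.0, 3.0]), np.array([5.0, 5.0])]
mixed = scalar_mix(layers, np.zeros(3))  # -> array([3., 3.])
```

A syntax-heavy task can learn weights that favor the lower layers, while a semantics-heavy task can weight the upper layers — which is exactly the lower/higher-level split listed above.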
Transformer-based Neural LMs
Transformer Done!
GPT
BERT
beginning of a new era in NLP
GPT-2
Reference
- Books and Tutorials
- Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
- CS224n: Natural Language Processing with Deep Learning, Winter 2019
- CMU11-747: Neural Nets for NLP, Spring 2019
- Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
- Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
- Papers
- Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
- Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
- Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
- Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
- Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
- Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)