ULMFiT/AWD-LSTM
In 2018 a new NLP paradigm — pretrain a language model, then transfer it to specific tasks — began to go mainstream, and ULMFiT (Universal Language Model Fine-tuning for Text Classification) was one of its pioneers. The pipeline has three stages: general-domain language model pretraining, target-task language model fine-tuning, and target-task classifier fine-tuning. The language model is AWD-LSTM (Regularizing and Optimizing LSTM Language Models), and the classifier is built by stacking pooling and linear layers on top of the language model:
The AWD-LSTM (Merity et al. 2017) language model works as follows.
The essence of the model is dropout everywhere: a basic LSTM augmented with many regularization and optimization tricks reached the SOTA of its time:
- DropConnect regularization on the LSTM's hidden-to-hidden weight matrices (standard dropout on recurrent connections breaks long-range dependencies; DropConnect instead randomly zeroes a subset of the network weights rather than activations, which leaves the standard cuDNN LSTM implementation intact and is therefore more efficient)
- Dropout after the embedding layer and the LSTM layers to prevent overfitting (standard dropout resamples its mask at every time step; LockedDropout samples the mask once, then locks and reuses it across the sequence, which is more efficient)
- Weight tying between the embedding layer and the softmax layer, which reduces the total parameter count and prevents the model from having to learn a one-to-one input-output correspondence; this helps language model learning considerably
- Activation Regularization (AR) / Temporal AR (TAR): on the output of the last LSTM layer, an L2 penalty pushes activations toward 0, and an L2 penalty is also applied to the difference between outputs at consecutive time steps
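The AR/TAR penalties and the DropConnect mask above can be sketched numerically — a minimal NumPy sketch, not the authors' implementation; the function names and the fixed RNG seed are my own:

```python
import numpy as np

def ar_tar_loss(outputs, alpha=2.0, beta=1.0):
    """AR: L2 penalty pulling the last layer's activations toward 0.
    TAR: L2 penalty on the difference between consecutive time steps.
    outputs shape: (seq_len, batch, hidden)."""
    ar = alpha * np.mean(outputs ** 2)
    tar = beta * np.mean((outputs[1:] - outputs[:-1]) ** 2)
    return ar + tar

def dropconnect_mask(weight_shape, p=0.5, rng=None):
    """DropConnect: sample ONE mask over the hidden-to-hidden weight
    matrix and reuse the masked weights for the whole sequence,
    instead of zeroing activations at every step."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.binomial(1, 1 - p, size=weight_shape) / (1 - p)

# On constant activations TAR vanishes and AR reduces to alpha * value^2:
h = np.ones((70, 32, 400))
loss = ar_tar_loss(h)  # 2.0 * 1 + 1.0 * 0 = 2.0
```

Note that the DropConnect mask is sampled once per forward pass over the whole weight matrix, so the same masked weights are applied at every time step — this is what keeps the cuDNN LSTM kernel usable.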
ULMFiT adds a number of training tricks on top of AWD-LSTM:
- Slanted Triangular Learning Rates (STLR): the learning rate first increases linearly, then decreases linearly, with a short increase phase and a long decrease phase — converge quickly first, then fine-tune.
- Discriminative Fine-Tuning: different layers of the network capture different information, from general to task-specific. After fixing the last layer's learning rate, each lower layer's rate decays geometrically: $\eta^{l-1} = \eta^{l} / 2.6$
- Gradual Unfreezing: unfreeze layers for training one at a time from top to bottom, to prevent catastrophic forgetting
- Variable-Length BPTT: add randomness around the default BPTT length of 70, achieving an effect similar to shuffling
- Concat Pooling: concatenate the last hidden state with max pooling and mean pooling over all hidden states, which alleviates the long-range dependency problem to some extent
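The tricks above can be sketched with the formulas from the ULMFiT paper — a NumPy sketch; the helper names are mine, not from the fastai codebase:

```python
import numpy as np

def stlr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Slanted triangular LR: linear warm-up over the first cut_frac of
    the T training steps, then a longer linear decay toward lr_max/ratio."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_lrs(lr_top, n_layers, decay=2.6):
    """eta^{l-1} = eta^l / 2.6: fix the top layer's LR, decay downward."""
    return [lr_top / decay ** (n_layers - 1 - l) for l in range(n_layers)]

def concat_pool(hidden_states):
    """[h_T; maxpool(H); meanpool(H)] over a (seq_len, hidden) matrix."""
    return np.concatenate([hidden_states[-1],
                           hidden_states.max(axis=0),
                           hidden_states.mean(axis=0)])

# The LR peaks at lr_max exactly at the end of the warm-up phase:
assert stlr(100, 1000) == 0.01
```

With `cut_frac=0.1` the warm-up occupies only 10% of training, giving the short-increase/long-decrease slanted triangle the paper describes.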
ELMo
Basic word vectors (Word2vec, GloVe, fastText) give each word a single, fixed representation, but in reality a word's meaning varies with its context. This is the most fundamental difficulty in NLP — ambiguity — and contextualized word embeddings therefore try to produce a different representation for a word in each context; ELMo (Embeddings from Language Models) is the most successful such model. ELMo is based on a 2-layer bidirectional LSTM language model (biLM); compared with Word2vec it can use much longer context (the whole sentence instead of a fixed context window), it uses the internal states of each LSTM layer as word representations, and every word vector is computed dynamically from the sentence it appears in. ELMo representations are a function of all of the internal layers of the biLM: a task-specific linear combination of the vectors stacked above each input word.
- Use a character CNN to build the initial word representation (only): 2048 char n-gram filters and 2 highway layers, 512-dim projection
- Use 4096-dim hidden/cell LSTM states with 512-dim projections to the next input
- Use a residual connection
- Tie parameters of the token input and output (softmax) layers, and tie these between the forward and backward LMs
ELMo effectively demonstrated that exposing the deep internals of the pre-trained network is very useful, giving downstream tasks semi-supervision signals at different levels:
- Lower-level states are better for lower-level syntax: part-of-speech tagging, syntactic dependencies, NER
- Higher-level states are better for higher-level semantics: sentiment, semantic role labeling, question answering, SNLI
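The task-specific linear combination mentioned above can be sketched as follows — a NumPy sketch of the paper's formula $\text{ELMo}_k = \gamma \sum_j s_j\, h_{k,j}$ with softmax-normalized weights $s$; `scalar_mix` is my own name for it:

```python
import numpy as np

def scalar_mix(layer_states, s_logits, gamma=1.0):
    """Combine the biLM layer representations for one token: a learned,
    softmax-normalized weighting over the layers (token embedding layer
    plus the two LSTM layers), scaled by a task-specific gamma."""
    s = np.exp(s_logits - np.max(s_logits))  # stable softmax
    s = s / s.sum()
    return gamma * sum(w * h for w, h in zip(s, layer_states))

# With equal logits the mix is just the average of the layer vectors:
layers = [np.array([1.0, 1.0]), np.array([3.0, 3.0]), np.array([5.0, 5.0])]
mixed = scalar_mix(layers, np.zeros(3))  # -> array([3., 3.])
```

A syntax-heavy task can learn weights that favor the lower layers, while a semantics-heavy task can weight the upper layers — which is exactly the lower/higher-level split listed above.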
Transformer-based Neural LMs
Transformer Done!
GPT
BERT
beginning of a new era in NLP
GPT-2
Reference
- Books and Tutorials
- Jurafsky, Dan. Speech and Language Processing (3rd ed. 2019)
- CS224n: Natural Language Processing with Deep Learning, Winter 2019
- CMU11-747: Neural Nets for NLP, Spring 2019
- Goldberg, Yoav. Neural Network Methods for Natural Language Processing. (2017)
- Neubig, Graham. “Neural machine translation and sequence-to-sequence models: A tutorial.” arXiv preprint arXiv:1703.01619 (2017).
- Papers
- Chen, Stanley F., and Joshua Goodman. “An empirical study of smoothing techniques for language modeling.” Computer Speech & Language 13.4 (1999): 359-394.
- Bengio, Yoshua, et al. “A neural probabilistic language model.” Journal of machine learning research 3.Feb (2003): 1137-1155. (original NNLM paper)
- Mikolov, Tomáš, et al. “Recurrent neural network based language model.” Eleventh annual conference of the international speech communication association. 2010. (original RNNLM paper)
- Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. “Regularizing and optimizing LSTM language models.” arXiv preprint arXiv:1708.02182 (2017). (original AWD-LSTM paper)
- Howard, Jeremy, and Sebastian Ruder. “Universal language model fine-tuning for text classification.” arXiv preprint arXiv:1801.06146 (2018). (original ULMFiT paper)
- Peters, Matthew E., et al. “Deep contextualized word representations.” arXiv preprint arXiv:1802.05365 (2018). (original ELMo paper)
- Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018). (original Bert paper)