This article is a translation (see the original author's homepage). As a beginner in natural language processing, the translator would like to thank the author, Sebastian Ruder!
-----------------------------------------------------------------------------------------
1. Introduction
This post is a collection of best practices for working with neural networks in natural language processing. It will be updated periodically as new insights become available, in order to keep track of our progress in applying deep learning to NLP.
There has long been a running joke in the NLP community that a well-tuned LSTM can handle any task in any setting. While this has indeed been true over the past two years, the NLP community has recently been moving towards more interesting models.
However, as a community, we do not want to spend another two years rediscovering the next "LSTM"; we do not want to reinvent techniques and methods that already exist. While many existing deep learning libraries already encode general best practices for working with neural networks, other details, such as the choice of initialization scheme, as well as task- or domain-specific considerations, are still left to the practitioner.
This post is not meant to track the state of the art; rather, its main goal is to collect best practices that are relevant for a wide range of tasks. In other words, instead of describing one particular architecture, this post aims to collect the features of successful architectures. While some of these features will be most useful for pushing the state of the art, I hope that a broader knowledge of all of them will lead to better evaluations, more meaningful comparisons against baselines, and inspiration by reminding us of what our work is really about.
I assume you are already familiar with neural networks as applied to NLP (if not, I recommend Yoav Goldberg's excellent primer [43]) and are interested in NLP in general or in a particular task. The main goal of this article is to get you up to speed on the relevant best practices so that you can make meaningful contributions as soon as possible.
I will first give an overview of best practices that are relevant for most tasks. I will then outline practices that are relevant to the most common tasks, in particular classification, sequence labeling, natural language generation, and neural machine translation.
Disclaimer: Calling something a best practice is difficult: best according to what? What if better methods exist? This post is based on my own (necessarily incomplete) understanding and experience. In the following, I will only discuss practices that have been reported to be useful independently by at least two different groups, and for each best practice I will try to give at least two references.
2. Best Practices
2.1 Word Embeddings
2.2 Depth
2.3 Layer Connections
2.4 Dropout
2.5 Multi-Task Learning
2.6 Attention
2.7 Optimization
2.8 Ensembling
2.9 Hyperparameter Optimization
2.10 LSTM Tricks
3. Task-Specific Best Practices
3.1 Classification
Because the convolution operation can be computed very efficiently, convolutional neural networks (CNNs) have become very popular for classification tasks in NLP, and their use is not limited to sequence problems. The following are best practices for CNNs, along with choices for some of their most important hyperparameters.
CNN filters: Combining several filter sizes near the optimal filter size performs best [Kim, 12], [Kim, 16]. The optimal number of feature maps is in the range of 50-600 [Zhang & Wallace, 59].
Aggregation function: 1-max pooling outperforms average pooling and k-max pooling [Zhang & Wallace, 59]. (Both practices are combined in the sketch below.)
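To make these recommendations concrete, here is a minimal PyTorch sketch (my own illustration, not code from the original post) of a Kim (2014)-style text CNN that combines filter sizes (3, 4, 5) with 100 feature maps each and aggregates with 1-max pooling; all names and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_filters=100,
                 filter_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per filter size, each producing `num_filters` feature maps.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in filter_sizes]
        )
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # 1-max pooling: keep only the strongest activation per feature map.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

model = TextCNN(vocab_size=20000)
logits = model(torch.randint(0, 20000, (8, 40)))  # a batch of 8 sentences, 40 tokens each
```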
3.2 Sequence Labeling
Sequence labeling is ubiquitous in NLP. While many of the existing best practices concern particular parts of the model architecture, the following discusses practices for the model's output and prediction stages.
Tagging scheme: For many tasks, it matters which tagging scheme is used to label different segments of the text. Common schemes include: BIO, which marks the first token of a segment with a B- tag, all remaining tokens of the segment with an I- tag, and tokens outside any segment with an O tag; IOB, which is similar to BIO, but only uses B- when the preceding token has the same class but does not belong to the same segment; and IOBES, which additionally distinguishes single-token segments (S-) from the last token of a segment (E-). Using IOBES and BIO yields similar performance (see the conversion sketch after this list).
CRF output layer: If there are dependencies between outputs, as in named entity recognition, the final softmax layer can be replaced with a linear-chain conditional random field (CRF). This has been shown to yield consistent improvements for tasks that require modelling such constraints [Huang, 60], [Ma & Hovy, 61], [Lample, 62].
Decoding constraints: Instead of a CRF output layer, constrained decoding offers another way to reject erroneous sequences, for example sequences that do not contain valid BIO transitions. One benefit of constraining the decoding is that arbitrary constraints, whether task-specific or syntactic, can be enforced this way.
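As a concrete illustration of the tagging schemes above (a hypothetical helper, not part of the original post), the sketch below converts a BIO-tagged sequence into IOBES; the PER and LOC labels are only examples.

```python
def bio_to_iobes(tags):
    """Convert BIO tags to IOBES: single-token spans become S-, span-final tokens E-."""
    iobes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            iobes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        span_continues = nxt.startswith("I-") and nxt[2:] == label
        if prefix == "B":
            iobes.append(("B-" if span_continues else "S-") + label)
        else:  # prefix == "I"
            iobes.append(("I-" if span_continues else "E-") + label)
    return iobes

print(bio_to_iobes(["B-PER", "I-PER", "O", "B-LOC"]))
# -> ['B-PER', 'E-PER', 'O', 'S-LOC']
```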
3.3 Natural Language Generation
This coverage vector, i.e. the attention accumulated over previous output time steps, captures how much attention we have paid to each word in the source. We can now condition additional attention on this coverage vector in order to encourage the model not to attend to the same words repeatedly:
$f_{\text{att}}(h_i, s_j, c_i) = v_a^\top \tanh(W_1 h_i + W_2 s_j + W_3 c_i)$
In addition, we can add an auxiliary loss that captures the task-specific attention behaviour we would like to obtain: for NMT, we would like a roughly one-to-one alignment, so the model is penalised if the final coverage vector is more or less than one at any index [Tu, 64]. For summarization, in contrast, we only want to penalise the model if it repeatedly attends to the same location [See, 65].
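For readers who prefer code, below is a minimal NumPy sketch of the coverage-aware scoring function above; the weights W1, W2, W3 and v_a are random stand-ins for learned parameters, and, following See et al. [65], the coverage value c_i is treated here as the scalar attention mass accumulated on source position i.

```python
import numpy as np

d_h, d_s, d_a = 64, 64, 32            # illustrative dimensions
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_a, d_h))      # projects the encoder state h_i
W2 = rng.normal(size=(d_a, d_s))      # projects the decoder state s_j
W3 = rng.normal(size=(d_a,))          # scales the scalar coverage value c_i
v_a = rng.normal(size=(d_a,))

def f_att(h_i, s_j, c_i):
    """Coverage-aware additive attention score: v_a^T tanh(W1 h_i + W2 s_j + W3 c_i)."""
    return v_a @ np.tanh(W1 @ h_i + W2 @ s_j + W3 * c_i)

score = f_att(rng.normal(size=d_h), rng.normal(size=d_s), c_i=0.3)
```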
3.4 Neural Machine Translation
Encoder and decoder depth: The encoder does not need to be deeper than 2-4 layers. Although deeper decoders perform better, the decoder does not need more than 4 layers to reach good performance either [Britz, 27].
Directionality: Bidirectional encoders outperform unidirectional ones. Sutskever et al. [67] proposed reversing the source sequence in order to reduce the number of long-term dependencies; with unidirectional encoders, reversed source text outperforms its non-reversed counterpart.
Beam search strategy: A medium beam width of around 10 with a length normalization penalty of 1.0 works best [Britz, 27]. Sennrich et al. [66] proposed splitting words into sub-words using byte-pair encoding (BPE). BPE iteratively merges frequently occurring symbol pairs, so that frequent character sequences are eventually merged into single symbols, thereby effectively eliminating out-of-vocabulary words. Although the method was originally proposed to handle rare words, models using sub-words outperform full-word systems, with 32,000 being an effective vocabulary size for sub-word units (a toy sketch of the merge procedure follows).
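As a rough illustration of the BPE idea (a toy sketch, not the reference implementation of Sennrich et al. [66]), the following learns merge operations by repeatedly merging the most frequent adjacent symbol pair; real implementations additionally use end-of-word markers and operate on corpus-scale frequency counts.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols (initially single characters).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with a single merged symbol.
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(learn_bpe(["lower", "lowest", "newer", "wider"], num_merges=3))
```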
Conclusion
Translator's Note
References
2. Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615
3. Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. arXiv Preprint arXiv:1602.02410.
4. Zilly, J. G., Srivastava, R. K., Koutnik, J., & Schmidhuber, J. (2017). Recurrent Highway Networks. In International Conference on Machine Learning (ICML 2017).
5. Zhang, Y., Chen, G., Yu, D., Yao, K., Khudanpur, S., & Glass, J. (2016). Highway Long Short-Term Memory RNNs for Distant Speech Recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
6. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In CVPR.
7. Huang, G., Weinberger, K. Q., & Maaten, L. Van Der. (2016). Densely Connected Convolutional Networks. CVPR 2017.
8. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15, 1929–1958.
9. Ba, J., & Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems.
10. Li, Z., Gong, B., & Yang, T. (2016). Improved Dropout for Shallow and Deep Learning. In Advances in Neural Information Processing Systems 29 (NIPS 2016).
11. Gal, Y., & Ghahramani, Z. (2016). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems.
12. Kim, Y. (2014). Convolutional Neural Networks for Sentence Classification. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1746–1751.
13. Ruder, S. (2017). An Overview of Multi-Task Learning in Deep Neural Networks.
14. Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL 2017.
15. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.
16. Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015.
17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention Is All You Need. arXiv Preprint arXiv:1706.03762.
18. Lin, Z., Feng, M., Santos, C. N. dos, Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A Structured Self-Attentive Sentence Embedding. In ICLR 2017.
19. Daniluk, M., Rockt, T., Welbl, J., & Riedel, S. (2017). Frustratingly Short Attention Spans in Neural Language Modeling. In ICLR 2017.
20. Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
21. Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations.
22. Ruder, S. (2016). An overview of gradient descent optimization. arXiv Preprint arXiv:1609.04747.
23. Denkowski, M., & Neubig, G. (2017). Stronger Baselines for Trustable Results in Neural Machine Translation.
24. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv Preprint arXiv:1503.02531. https://doi.org/10.1063/1.4931082
25. Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C., & Smith, N. A. (2016). Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser. Empirical Methods in Natural Language Processing.
26. Kim, Y., & Rush, A. M. (2016). Sequence-Level Knowledge Distillation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16).
27. Britz, D., Goldie, A., Luong, T., & Le, Q. (2017). Massive Exploration of Neural Machine Translation Architectures. In arXiv preprint arXiv:1703.03906.
28. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems, 649–657.
29. Conneau, A., Schwenk, H., Barrault, L., & Lecun, Y. (2016). Very Deep Convolutional Networks for Natural Language Processing.
30. Le, H. T., Cerisara, C., & Denis, A. (2017). Do Convolutional Networks need to be Deep for Text Classification ? In arXiv preprint arXiv:1707.04108.
31. Wu, Y., Schuster, M., Chen, Z., Le, Q. V, Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
32. Plank, B., Søgaard, A., & Goldberg, Y. (2016). Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
33. He, L., Lee, K., Lewis, M., & Zettlemoyer, L. (2017). Deep Semantic Role Labeling: What Works and What’s Next. ACL.
34. Melis, G., Dyer, C., & Blunsom, P. (2017). On the State of the Art of Evaluation in Neural Language Models.
35. Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling. In Proceedings of ACL 2017.
36. Ramachandran, P., Liu, P. J., & Le, Q. V. (2016). Unsupervised Pretraining for Sequence to Sequence Learning. arXiv Preprint arXiv:1611.02683.
37. Kadlec, R., Schmid, M., Bajgar, O., & Kleindienst, J. (2016). Text Understanding with the Attention Sum Reader Network. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
38. Cheng, J., Dong, L., & Lapata, M. (2016). Long Short-Term Memory-Networks for Machine Reading. arXiv Preprint arXiv:1601.06733.
39. Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
40. Paulus, R., Xiong, C., & Socher, R. (2017). A Deep Reinforced Model for Abstractive Summarization. In arXiv preprint arXiv:1705.04304.
41. Liu, Y., & Lapata, M. (2017). Learning Structured Text Representations. In arXiv preprint arXiv:1705.09207. Retrieved from http://arxiv.org/abs/1705.09207
42. Zhang, J., Mitliagkas, I., & Ré, C. (2017). YellowFin and the Art of Momentum Tuning. arXiv preprint arXiv:1706.03471.
43. Goldberg, Y. (2016). A Primer on Neural Network Models for Natural Language Processing. Journal of Artificial Intelligence Research, 57, 345–420.
44. Melamud, O., McClosky, D., Patwardhan, S., & Bansal, M. (2016). The Role of Context Types and Dimensionality in Learning Word Embeddings. In Proceedings of NAACL-HLT 2016 (pp. 1030–1040).
45. Ruder, S., Ghaffari, P., & Breslin, J. G. (2016). A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP-16), 999–1005.
46. Reimers, N., & Gurevych, I. (2017). Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. In arXiv preprint arXiv:1707.06799.
47. Søgaard, A., & Goldberg, Y. (2016). Deep multi-task learning with low level tasks supervised at lower layers. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 231–235.
48. Liu, P., Qiu, X., & Huang, X. (2017). Adversarial Multi-task Learning for Text Classification. In ACL 2017.
49. Ruder, S., Bingel, J., Augenstein, I., & Søgaard, A. (2017). Sluice networks: Learning what to share between loosely related tasks. arXiv Preprint arXiv:1705.08142.
50. Dozat, T., & Manning, C. D. (2017). Deep Biaffine Attention for Neural Dependency Parsing. In ICLR 2017.
51. Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1–10.
52. Sennrich, R., Haddow, B., & Birch, A. (2016). Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation (WMT 2016).
53. Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot Ensembles: Train 1, get M for free. In ICLR 2017.
54. Inan, H., Khosravi, K., & Socher, R. (2016). Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. arXiv Preprint arXiv:1611.01462.
55. Press, O., & Wolf, L. (2017). Using the Output Embedding to Improve Language Models. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2, 157--163.
56. Snoek, J., Larochelle, H., & Adams, R. P. (2012). Practical Bayesian Optimization of Machine Learning Algorithms. Neural Information Processing Systems Conference (NIPS 2012).
57. Mikolov, T. (2012). Statistical language models based on neural networks (Doctoral dissertation, PhD thesis, Brno University of Technology).
58. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. International Conference on Machine Learning, (2), 1310–1318.
59. Zhang, Y., & Wallace, B. (2015). A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification. arXiv Preprint arXiv:1510.03820, (1).
60. Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991.
61. Ma, X., & Hovy, E. (2016). End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv Preprint arXiv:1603.01354.
62. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural Architectures for Named Entity Recognition. NAACL-HLT 2016.
63. Kiddon, C., Zettlemoyer, L., & Choi, Y. (2016). Globally Coherent Text Generation with Neural Checklist Models. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP2016), 329–339.
64. Tu, Z., Lu, Z., Liu, Y., Liu, X., & Li, H. (2016). Modeling Coverage for Neural Machine Translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.
65. See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. In ACL 2017.
66. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016).
67. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 9.