An overview of domain adaptation in neural machine translation

This article discusses the problem of domain adaptation in neural machine translation (NMT). Because NMT models are sensitive to the domains they are trained on, improving translation performance from low-resource in-domain data and large amounts of out-of-domain data is challenging. The article introduces two settings, single-domain and multi-domain adaptation, and methods such as mixed training, fine-tuning, and domain-specific embeddings, which aim to balance the model's generalization ability and domain adaptability through additional features.

Neural machine translation (NMT) has achieved state-of-the-art performance on most language pairs. Arguably, the success of NMT is mainly attributed to large-scale, high-quality bilingual corpora. Although new corpora are becoming increasingly available, only those that belong to the same or similar domains are helpful for improving translation performance. NMT models are very sensitive to the domains they are trained on, because each domain has its own style, sentence structure and terminology. However, in-domain corpora are usually relatively scarce while out-of-domain corpora are abundant. Naturally, improving the translation performance of an NMT model with low-resource in-domain data and large-scale out-of-domain data is both challenging and promising. In this essay, we give an overview of domain adaptation in NMT.

Domain adaptation, which means adapting a general model to a specific domain, has attracted wide interest in recent years. Many approaches have been proposed: transfer learning, batch normalization, data selection, fine-tuning and so on. This essay only focuses on approaches that adapt a neural machine translation model to a specific domain. We collect, read and analyze papers published in recent years, and give an overview here. We consider two settings, \emph{single-domain adaptation} and \emph{multi-domain adaptation}. \emph{Single-domain adaptation} means that there is only a single domain the NMT model needs to be adapted to; that is, the training corpus only contains in-domain data and out-of-domain data (also called general data). \emph{Multi-domain adaptation} means that the NMT model shall be adapted to multiple domains; that is, the training corpus has been (or can be) classified into multiple domains and the NMT model should perform well on each of them.

Single-domain adaptation mainly focuses on this problem: given low-resource in-domain data and large-scale general data, how can we train an NMT model that adapts well to this specific domain? More strictly, we may also expect that the NMT model does not suffer performance degradation on the general test sets. The approaches can be classified into the following categories:

A naive idea is that a model trained on the mix of in-domain and general data can perform well both on in-domain test sets and on general test sets. Hence, a simple method is to train the NMT model directly on the mix of in-domain and general data, which we refer to as mixed training (a sketch follows). This method is very simple and effective when the in-domain data has a size comparable to the general data. However, when the amount of in-domain data is relatively small, the predictions of the NMT model will be dominated by the general data and translation performance on in-domain test sets will be extremely poor.
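The following is a minimal sketch of how mixed training data might be prepared, assuming parallel corpora stored as lists of (source, target) pairs; the variable names are purely illustrative.

```python
import random

def build_mixed_corpus(in_domain_pairs, general_pairs, seed=42):
    """Mixed training: pool in-domain and general sentence pairs and
    shuffle them, so every mini-batch can contain examples from both."""
    mixed = list(in_domain_pairs) + list(general_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed

# Illustrative usage: the in-domain corpus is typically much smaller,
# so its examples are easily outnumbered in the mixed corpus.
in_domain_pairs = [("in-domain source sentence", "in-domain target sentence")]
general_pairs = [("general source %d" % i, "general target %d" % i) for i in range(100)]
mixed_corpus = build_mixed_corpus(in_domain_pairs, general_pairs)
print(len(mixed_corpus))  # 101
```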

To alleviate the domination of the large-scale general data, some researchers propose a two-stage training method: first, train the NMT model on the large-scale general data until convergence; then, continue training the NMT model on the in-domain data (see the sketch below). We call this method fine-tuning. A glaring shortcoming of this method is that the model can easily overfit the in-domain data. Specifically, the translation performance of the NMT model on general test sets degrades severely. Some researchers have conducted experiments to compare mixed training and fine-tuning on in-domain test sets. They show that mixed training gets better translation performance when the amount of in-domain data is not too small, and fine-tuning is better otherwise.
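Below is a minimal sketch of the two-stage fine-tuning schedule, assuming a PyTorch model and data loaders; `model`, `general_loader`, `in_domain_loader`, the loss computation, and the reduced learning rate in the second stage are all assumptions for illustration, not the setup of any specific paper.

```python
import torch

def train_epochs(model, loader, optimizer, num_epochs, loss_fn):
    """Generic training loop shared by both stages."""
    model.train()
    for _ in range(num_epochs):
        for src_batch, tgt_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model, src_batch, tgt_batch)
            loss.backward()
            optimizer.step()

def fine_tune(model, general_loader, in_domain_loader, loss_fn):
    # Stage 1: train on the large-scale general data until (approximate) convergence.
    general_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    train_epochs(model, general_loader, general_opt, num_epochs=20, loss_fn=loss_fn)

    # Stage 2: continue training the same parameters on the small in-domain data.
    # A smaller learning rate and fewer epochs help limit overfitting, but
    # forgetting of the general domain remains a risk.
    in_domain_opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    train_epochs(model, in_domain_loader, in_domain_opt, num_epochs=3, loss_fn=loss_fn)
```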

From the two approaches above, we know that mixed training has better generalization ability while fine-tuning fits the in-domain data better. Is there a method that combines the advantages of mixed training and fine-tuning? To this end, researchers propose to supply additional features to the NMT model. With the additional features, the NMT model can automatically distinguish in-domain data from out-of-domain data, so its translation predictions are not dominated by the out-of-domain data. Hence, we can directly feed the in-domain and out-of-domain data into the NMT model at the same time. Specifically, there are mainly two kinds of additional features: artificial tokens and domain-specific embeddings.

An artificial token, such as "@IN-DOMAIN@" or "@OUT-DOMAIN@", is appended to the end of each input source sentence to indicate whether the input is an in-domain or out-of-domain sentence pair (see the sketch below). The artificial tokens are chosen carefully to avoid overlap with words present in the source vocabulary.
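A minimal sketch of source-side domain tagging, assuming pre-tokenized sentences; the token strings follow the examples given above.

```python
def tag_source(sentence, in_domain):
    """Append an artificial domain token to a tokenized source sentence.
    The tokens @IN-DOMAIN@ / @OUT-DOMAIN@ are chosen so they cannot
    collide with ordinary vocabulary items."""
    tag = "@IN-DOMAIN@" if in_domain else "@OUT-DOMAIN@"
    return sentence + [tag]

print(tag_source(["das", "ist", "ein", "Test"], in_domain=True))
# ['das', 'ist', 'ein', 'Test', '@IN-DOMAIN@']
```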

Each word in NMT is represented as a word embedding, which is a high-dimensional vector. The word embedding can easily be extended by an arbitrary number of additional cells designed to encode domain information. Under this feature framework, the sentence-level domain information is added on a word-by-word basis to all words in a sentence (a sketch follows).
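A minimal PyTorch sketch of such a domain-specific embedding, assuming the domain vector is simply appended to every word embedding in the sentence; the class name and dimensions are illustrative choices, not a particular paper's configuration.

```python
import torch
import torch.nn as nn

class DomainAwareEmbedding(nn.Module):
    """Word embeddings extended with a small domain embedding that is
    appended to every word vector, so the sentence-level domain
    information is injected on a word-by-word basis."""
    def __init__(self, vocab_size, word_dim=512, num_domains=2, domain_dim=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.domain_emb = nn.Embedding(num_domains, domain_dim)

    def forward(self, token_ids, domain_id):
        # token_ids: (batch, seq_len); domain_id: (batch,)
        words = self.word_emb(token_ids)                      # (batch, seq_len, word_dim)
        domains = self.domain_emb(domain_id)                  # (batch, domain_dim)
        domains = domains.unsqueeze(1).expand(-1, token_ids.size(1), -1)
        return torch.cat([words, domains], dim=-1)            # (batch, seq_len, word_dim + domain_dim)

# Illustrative usage with a toy vocabulary.
emb = DomainAwareEmbedding(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 7))      # two sentences of length 7
domains = torch.tensor([0, 1])               # 0 = in-domain, 1 = out-of-domain
print(emb(tokens, domains).shape)            # torch.Size([2, 7, 516])
```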

Experiments show that the domain-specific embedding achieves consistently better performance than the artificial token. Moreover, this method can easily be extended to the multi-domain adaptation problem.

Data selection can be classified into static data selection and dynamic data selection. Static data selection ranks sentence pairs in a large training corpus according to their difference with respect to an in-domain corpus and a general corpus, and the top n sentence pairs with the highest rank are selected and used for training the NMT model. Hence, the selected top n sentences are static and never change. Since static data selection discards the irrelevant data, it can also exacerbate the problems of low vocabulary coverage and unreliable statistics for rarer words, which are major issues in NMT. In addition, it has been shown that NMT performance drops tremendously in low-resource scenarios. To overcome this problem, researchers propose dynamic data selection. Contrary to static data selection, dynamic data selection samples n sentence pairs for each iteration using a distribution computed from the ranks (a sampling sketch is shown below). In practice, the top-ranked sentence pairs are selected in nearly every epoch while bottom-ranked sentence pairs are selected only about once.
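A minimal sketch of rank-based sampling for dynamic data selection, assuming each sentence pair already carries a relevance rank (rank 0 = most in-domain); the quadratic weighting is an illustrative choice rather than the exact distribution used in any particular paper.

```python
import random

def sample_epoch(ranked_pairs, sample_size, seed=None):
    """Dynamic data selection: instead of fixing a top-n subset once,
    draw a fresh subset every epoch from a distribution over the ranks,
    so top-ranked pairs are chosen in nearly every epoch while
    bottom-ranked pairs appear only occasionally."""
    rng = random.Random(seed)
    n = len(ranked_pairs)
    # Illustrative weighting: weight decays with rank.
    weights = [(n - rank) ** 2 for rank in range(n)]
    # Weighted sampling without replacement (Efraimidis-Spirakis keys).
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(n), key=lambda i: keys[i], reverse=True)
    return [ranked_pairs[i] for i in order[:sample_size]]
```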


There are several ways to compute this difference for a sentence with respect to an in-domain corpus and a general corpus. The traditional method uses language models: two language models are trained, one on the in-domain corpus and one on the general corpus, and the cross-entropy of each sentence is computed under both. The difference of the two cross-entropies can be viewed as a measure of how close the sentence is to the in-domain corpus versus the general corpus (a sketch follows). Researchers have also proposed a newer method that uses sentence embeddings to compute the difference, where the sentence embedding is usually computed as the average of the hidden states of the encoder.
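A minimal sketch of language-model-based scoring in the spirit of cross-entropy difference, assuming simple add-one-smoothed unigram language models estimated from tokenized corpora; real systems typically use much stronger n-gram or neural language models.

```python
import math
from collections import Counter

def unigram_cross_entropy(sentence, counts, total, vocab_size):
    """Per-word cross-entropy of a tokenized sentence under an
    add-one-smoothed unigram language model."""
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in sentence)
    return -log_prob / max(len(sentence), 1)

def ce_difference_scores(candidates, in_domain_corpus, general_corpus):
    """Score each candidate sentence by H_in(s) - H_gen(s):
    lower scores mean the sentence looks more like the in-domain corpus."""
    in_counts = Counter(w for s in in_domain_corpus for w in s)
    gen_counts = Counter(w for s in general_corpus for w in s)
    vocab = len(set(in_counts) | set(gen_counts))
    in_total, gen_total = sum(in_counts.values()), sum(gen_counts.values())
    return [
        unigram_cross_entropy(s, in_counts, in_total, vocab)
        - unigram_cross_entropy(s, gen_counts, gen_total, vocab)
        for s in candidates
    ]
```

Sorting candidate sentence pairs by these scores gives the ranking used by static selection, or the distribution sampled from in dynamic selection.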

In the multi-domain setting, we consider the following problem: given k fully-trained domain experts, how can we train the (k+1)-th domain model rapidly? Typically, we can address this problem in a multi-task framework, where each domain is regarded as an individual task. A possible downside of this approach is that we need to re-train the model each time a new domain arrives. To overcome this shortcoming, researchers propose a solution based on attending to an ensemble of domain experts. Assuming k domain-specific intent and slot models trained on their respective domains, given domain k+1, the model uses a weighted combination of the k domain experts' feedback along with its own opinion to make predictions on the new domain (a sketch is given below). Experiments show that this model significantly outperformed baselines that do not use domain adaptation and also performed better than the full re-training approach.
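A minimal PyTorch sketch of attending over an ensemble of domain experts, loosely following the idea in [6]; it assumes each frozen expert produces a feature vector for the input, and the attention parameterization and dimensions are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DomainExpertAttention(nn.Module):
    """Combine the feedback of k frozen domain experts with the new
    domain's own representation via attention, so the (k+1)-th domain
    model can reuse the experts without retraining them."""
    def __init__(self, feat_dim):
        super().__init__()
        self.query_proj = nn.Linear(feat_dim, feat_dim)
        self.key_proj = nn.Linear(feat_dim, feat_dim)
        self.output_proj = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, own_feat, expert_feats):
        # own_feat: (batch, feat_dim) from the new domain's own model
        # expert_feats: (batch, k, feat_dim) from the k frozen experts
        query = self.query_proj(own_feat).unsqueeze(1)          # (batch, 1, feat_dim)
        keys = self.key_proj(expert_feats)                      # (batch, k, feat_dim)
        scores = (query * keys).sum(-1)                         # (batch, k)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # (batch, k, 1)
        expert_summary = (weights * expert_feats).sum(dim=1)    # (batch, feat_dim)
        # Combine the new domain's own opinion with the experts' feedback.
        return self.output_proj(torch.cat([own_feat, expert_summary], dim=-1))

# Illustrative usage with random features from three experts.
attn = DomainExpertAttention(feat_dim=64)
own = torch.randn(2, 64)
experts = torch.randn(2, 3, 64)
print(attn(own, experts).shape)  # torch.Size([2, 64])
```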

This informal essay summarizes recent approaches to domain adaptation in neural machine translation. We hope this essay is helpful to researchers who are interested in this area.

We list some important reference papers here.


[1] Chu, Chenhui; Dabre, Raj; Kurohashi, Sadao. An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation. ACL 2017.

[2] Wang, Rui; Finch, Andrew; Utiyama, Masao; Sumita, Eiichiro. Sentence Embedding for Neural Machine Translation Domain Adaptation. ACL 2017.

[3] Freitag, M.; Al-Onaizan, Y. Fast Domain Adaptation for Neural Machine Translation. 2016.

[4] Servan, C.; Crego, J.; Senellart, J. Domain Specialization: a Post-training Domain Adaptation for Neural Machine Translation. 2016.

[5] Kobus, C.; Crego, J.; Senellart, J. Domain Control for Neural Machine Translation. 2016.

[6] Kim, Young-Bum; Stratos, Karl; Kim, Dongchan. Domain Attention with an Ensemble of Experts. ACL 2017.






