2018.2-Mikel Artetxe, Kyunghyun Cho-Unsupervised Neural Machine Translation
UPV/EHU, New York University
ICLR 2018
Abstract
- This paper
- builds upon recent work on unsupervised cross-lingual embedding mappings
- uses a slightly modified attentional encoder-decoder model
- trained with a combination of denoising and back-translation
- novelty
- Dual structure => handle both directions together
- Shared encoder
- fixed cross-lingual embeddings in the encoder during training
- Result
- no parallel resource
- WMT 2014 French -> English: 15.56 BLEU
- WMT 2014 German -> English: 10.21 BLEU
- combined with 100,000 parallel sentences
- WMT 2014 French -> English: 21.81 BLEU
- WMT 2014 German -> English: 15.24 BLEU
- Related Work
- unsupervised cross-lingual embeddings
- statistical decipherment for machine translation
- low-resource neural machine translation
Method
System Architecture
- encoder: a two-layer bidirectional RNN (GRU cells with 600 hidden units)
- decoder: a two-layer RNN (GRU cells with 600 hidden units)
- embedding dimension: 300
- attention mechanism: global attention with the general alignment function (following Luong et al., 2015); see the sketch after this list
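A minimal PyTorch sketch of this architecture. The class and variable names, the `cross_lingual_emb` matrix, and the single-step decoder interface are my own assumptions for illustration, not the authors' code:

```python
import torch
import torch.nn as nn

EMB_DIM, HID_DIM, LAYERS = 300, 600, 2   # dimensions reported in the paper

class SharedEncoder(nn.Module):
    """Two-layer bidirectional GRU encoder shared across both languages;
    its embeddings are the fixed (frozen) cross-lingual embeddings."""
    def __init__(self, cross_lingual_emb):                 # (vocab, EMB_DIM) tensor
        super().__init__()
        self.emb = nn.Embedding.from_pretrained(cross_lingual_emb, freeze=True)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, num_layers=LAYERS,
                          bidirectional=True, batch_first=True)

    def forward(self, src):                                # src: (batch, src_len)
        out, _ = self.rnn(self.emb(src))                   # (batch, src_len, 2*HID_DIM)
        return out

class GeneralAttention(nn.Module):
    """Global (Luong-style) attention with the 'general' score h_t^T W h_s."""
    def __init__(self, dec_dim, enc_dim):
        super().__init__()
        self.W = nn.Linear(enc_dim, dec_dim, bias=False)

    def forward(self, dec_h, enc_out):                     # dec_h: (batch, dec_dim)
        scores = torch.bmm(self.W(enc_out), dec_h.unsqueeze(2)).squeeze(2)
        weights = torch.softmax(scores, dim=1)             # over source positions
        return torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)   # context vector

class Decoder(nn.Module):
    """Language-specific two-layer GRU decoder with attention (one per language)."""
    def __init__(self, vocab_size):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, EMB_DIM)
        self.rnn = nn.GRU(EMB_DIM, HID_DIM, num_layers=LAYERS, batch_first=True)
        self.attn = GeneralAttention(HID_DIM, 2 * HID_DIM)
        self.out = nn.Linear(HID_DIM + 2 * HID_DIM, vocab_size)

    def forward(self, prev_tok, hidden, enc_out):          # one decoding step
        dec_out, hidden = self.rnn(self.emb(prev_tok).unsqueeze(1), hidden)
        context = self.attn(dec_out.squeeze(1), enc_out)
        logits = self.out(torch.cat([dec_out.squeeze(1), context], dim=1))
        return logits, hidden
```

The key design point from the paper is that the encoder (with frozen cross-lingual embeddings) is shared by both languages, while each language has its own decoder.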
Unsupervised Training
- for each sentence in language L1, training alternates between two steps (see the sketch after this list):
- STEP 1: Denoising: shared encoder + L1 decoder
- noise: random swaps of contiguous words
- STEP 2: On-the-fly Backtranslation, including 2 parts
- PART 1: translate the sentence in inference mode: shared encoder + L2 decoder
- PART 2: train to recover the original sentence from that translation: shared encoder + L1 decoder
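A sketch of one training iteration for an L1 batch under these two steps (the symmetric pass for L2 is analogous). `translate` and `train_step` are hypothetical helpers standing in for inference-mode decoding and a cross-entropy update:

```python
import random

def add_noise(tokens):
    """Denoising corruption: for a sentence of N tokens, make N/2 random
    swaps of contiguous tokens (following the paper's description)."""
    tokens = list(tokens)
    for _ in range(len(tokens) // 2):          # loop is empty for sentences of < 2 tokens
        i = random.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

def training_iteration(batch_l1, encoder, dec_l1, dec_l2, translate, train_step):
    # STEP 1: denoising - reconstruct each clean sentence from its noisy version
    noisy = [add_noise(s) for s in batch_l1]
    train_step(encoder, dec_l1, src=noisy, tgt=batch_l1)

    # STEP 2: on-the-fly back-translation
    # PART 1: translate L1 -> L2 with the current model, in inference mode (greedy)
    pseudo_l2 = translate(encoder, dec_l2, batch_l1)
    # PART 2: train to recover the original L1 sentences from the pseudo L2 source
    train_step(encoder, dec_l1, src=pseudo_l2, tgt=batch_l1)
```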
Experiments
- Datasets:
- Train: WMT 2014 French-English & German-English
- Test: newstest2014; tokenized BLEU (multi-bleu.perl script)
- Corpus Preprocessing:
- tokenization and truecasing
- byte pair encoding (BPE): helps translate rare words correctly
- learned on each monolingual corpus: 50,000 merge operations
- Limit the vocabulary to the 50,000 most frequent tokens.
- Replace the rest with a special unknown token.
- Accelerate training: discard sentences with more than 50 elements (vocabulary cap and length filter sketched below).
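A plain-Python sketch of the vocabulary cap and length filter described above, applied after BPE. The `<unk>` symbol and helper names are assumptions, not taken from the paper:

```python
from collections import Counter

VOCAB_SIZE, MAX_LEN, UNK = 50_000, 50, "<unk>"     # "<unk>" symbol is an assumption

def build_vocab(corpus):
    """Keep only the 50,000 most frequent (post-BPE) tokens."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return {tok for tok, _ in counts.most_common(VOCAB_SIZE)}

def preprocess(corpus, vocab):
    """Drop over-long sentences and map out-of-vocabulary tokens to UNK."""
    for sent in corpus:
        if len(sent) > MAX_LEN:                    # discard sentences with > 50 elements
            continue
        yield [tok if tok in vocab else UNK for tok in sent]
```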
- Cross-lingual Embeddings:
- Training:
- Cross-entropy loss function (one update step sketched below)
- Training each system took about 4-5 days on a single Titan X GPU for the full unsupervised variant.
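A sketch of a single cross-entropy update, one possible concretization of the `train_step` placeholder in the back-translation sketch above. The explicit `optimizer` argument, the `pad_id` default, and the use of teacher forcing are assumptions about the training loop:

```python
import torch.nn as nn

def train_step(encoder, decoder, src, tgt, optimizer, pad_id=0):
    """One cross-entropy update of the shared encoder and one decoder.
    src/tgt are padded LongTensors of shape (batch, len); pad_id is assumed."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
    enc_out = encoder(src)
    hidden, loss = None, 0.0
    # teacher forcing: predict tgt[:, t+1] from the gold token tgt[:, t]
    for t in range(tgt.size(1) - 1):
        logits, hidden = decoder(tgt[:, t], hidden, enc_out)
        loss = loss + criterion(logits, tgt[:, t + 1])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```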
- Decoding:
- Training time (for on-the-fly back-translation): greedy decoding (sketched below)
- Test time: beam search
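A sketch of greedy decoding as it would be used for on-the-fly back-translation at training time (batch size 1 for simplicity; `bos_id`/`eos_id` and the shapes follow the architecture sketch above and are assumptions). At test time the paper uses beam search instead, which is not shown here:

```python
import torch

def greedy_decode(encoder, decoder, src, bos_id, eos_id, max_len=50):
    """Pick the most probable token at every step until <eos> or max_len."""
    with torch.no_grad():
        enc_out = encoder(src)                    # src: (1, src_len)
        hidden = None                             # GRU starts from a zero state
        tok = torch.tensor([bos_id])
        output = []
        for _ in range(max_len):
            logits, hidden = decoder(tok, hidden, enc_out)
            tok = logits.argmax(dim=-1)           # greedy choice
            if tok.item() == eos_id:
                break
            output.append(tok.item())
    return output
```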
- Result: