1. Datasets for multimodal NMT:
Mainly from the WMT16, WMT17, and WMT18 multimodal shared tasks (Multi30k EN-DE, EN-FR, EN-CS):
http://www.statmt.org/wmt16/multimodal-task.html
http://www.statmt.org/wmt17/multimodal-task.html
http://www.statmt.org/wmt18/multimodal-task.html
2. IWSLT (International Workshop on Spoken Language Translation) datasets:
IWSLT2011 to IWSLT2020: https://wit3.fbk.eu/home
For example, IWSLT2015:
train and dev: https://wit3.fbk.eu/2015-01
test.en: https://wit3.fbk.eu/2015-01-b
test.de: https://wit3.fbk.eu/2015-01-c
Full dataset download: https://github.com/pengr/iwslt15/blob/master/en-de.tgz
3. WMT (Workshop on Machine Translation) datasets: https://www.tensorflow.org/datasets/catalog/wmt15_translate#wmt15_translateru-ensubwords8k
4. OPUS: https://opus.nlpl.eu/
5. Chinese machine translation datasets: https://www.jianshu.com/p/df85ddf56eef
6. Large-scale Chinese NLP corpus: https://github.com/brightmart/nlp_chinese_corpus
7. Chinese NLP machine translation corpora: https://github.com/didi/ChineseNLP/blob/master/docs/machine_translation.md
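
Parallel corpora such as the IWSLT test.en / test.de pair listed above are commonly consumed as two line-aligned files. The following is a minimal sketch of pairing them up in Python. It assumes the files have already been converted to plain UTF-8 text with one sentence per line, which may not match the format of the raw downloads. The function name `read_parallel` is hypothetical and does not come from any of the linked resources.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from two line-aligned UTF-8 text files.

    Assumption: line i of src_path is the translation of line i of tgt_path.
    """
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which signals misalignment.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            # Skip pairs where either side is empty after stripping whitespace.
            if s and t:
                yield s, t
```

Calling `list(read_parallel("test.en", "test.de"))` would return the sentence pairs in file order. Raising an error on a length mismatch is a deliberate choice here, since silently truncating to the shorter file can hide a misaligned corpus.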