Chinese Machine Translation Datasets

CopperDong

Category: Machine Translation

Original link: https://www.jianshu.com/p/df85ddf56eef

Dataset

AI Challenger (English-Chinese translation; billed as the largest English-Chinese parallel dataset for the spoken-language domain)

UM-Corpus: A Large English-Chinese Parallel Corpus

OpenSubtitles2016

Methods

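An AI Challenger 2017 Adventure (AI Challenger 2017 奇遇记)

How can machine translation cope with small amounts of data?

Datasets for Natural Language Processing Tasks

keywords: NLP, dataset, corpus processing

General corpus-processing steps

The processing steps below are taken from [Mikolov T, et al. Exploiting Similarities among Languages for Machine Translation[J]. Computer Science, 2013.]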

Tokenization of text using scripts (from www.statmt.org)
Duplicate sentences were removed
Numeric values were rewritten as a single token
Special characters were removed (such as !?,:)
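
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the paper. A simple regular expression stands in for the statmt.org tokenizer scripts, `<num>` is an assumed placeholder (the paper does not name the token), and the function names are hypothetical. Duplicates are detected on the raw, whitespace-stripped sentence, which is also an assumption.

```python
import re

NUM_TOKEN = "<num>"  # assumed placeholder; the source does not name the token

# Stand-in tokenizer: a number, a run of word characters, or one symbol.
# Note: a run of Chinese characters stays one token; real Chinese text
# would need a word segmenter before or instead of this step.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+|[^\w\s]")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
SYMBOL_RE = re.compile(r"[^\w\s]")


def preprocess_sentence(sentence):
    """Tokenize, rewrite numbers as one token, and drop special characters."""
    cleaned = []
    for tok in TOKEN_RE.findall(sentence):
        if NUMBER_RE.fullmatch(tok):
            cleaned.append(NUM_TOKEN)
        elif SYMBOL_RE.fullmatch(tok):
            continue  # punctuation such as ! ? , :
        else:
            cleaned.append(tok)
    return cleaned


def preprocess_corpus(sentences):
    """Apply the steps to a corpus, keeping only the first copy of duplicates."""
    seen = set()
    result = []
    for sentence in sentences:
        key = sentence.strip()
        if not key or key in seen:
            continue
        seen.add(key)
        result.append(preprocess_sentence(key))
    return result


corpus = [
    "Prices rose 3.5% in 2013!",
    "Prices rose 3.5% in 2013!",  # exact duplicate, dropped
    "Hello, world: ok?",
]
print(preprocess_corpus(corpus))
# [['Prices', 'rose', '<num>', 'in', '<num>'], ['Hello', 'world', 'ok']]
```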

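AI Challenger - English-Chinese Translation Evaluation

Applicable area: machine translation

Billed as the largest English-Chinese parallel dataset for the spoken-language domain. It provides more than 10 million English-Chinese sentence pairs. All sentence pairs were manually checked, so the dataset offers assurances on scale, relevance, and quality.

Training set: 10,000,000 sentences
Validation set (simultaneous interpretation): 934 sentences
Validation set (text translation): 8,000 sentences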

https://challenger.ai/datasets/translation

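WMT (Workshop on Machine Translation)

Applicable area: machine translation

The WMT shared-task data is the most important public dataset in the machine translation field. Its size depends on the language pair and typically ranges from millions to tens of millions of sentences.

WMT 2017 website: http://www.statmt.org/wmt17/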

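UN Parallel Corpus (United Nations Parallel Corpus)

Applicable area: machine translation

The United Nations Parallel Corpus is composed of official UN records and other parliamentary documents that are in the public domain. It contains text written and manually translated between 1990 and 2014, including sentence-aligned text.

The corpus is intended to provide multilingual language resources and to support research and progress in machine translation and other areas of natural language processing. For convenience, it also offers ready-made language-specific bitexts and a six-language parallel subcorpus.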
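
Sentence-aligned bitexts are commonly distributed as two plain-text files in which line i of one file is the translation of line i of the other. Assuming that layout (the actual UN Parallel Corpus package may be organized differently, and `read_parallel` is a hypothetical helper), reading the pairs could look like this:

```python
from itertools import zip_longest


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding=encoding) as src, \
            open(tgt_path, encoding=encoding) as tgt:
        pairs = zip_longest(src, tgt)
        for line_no, (s, t) in enumerate(pairs, start=1):
            if s is None or t is None:
                # One file ended early, so the alignment cannot be trusted.
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            if s and t:  # skip pairs where either side is empty
                yield s, t
```

It would be called with the paths of the two halves of a bitext, for example `read_parallel("corpus.en", "corpus.zh")`, where both file names are placeholders.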

Introduction: https://conferences.unite.un.org/UNCorpus/zh#introduction

Download: https://conferences.unite.un.org/UNCorpus/zh/DownloadOverview

(At the time of writing, the download consistently fails.)

2nd International Chinese Word Segmentation Bakeoff

Applicable area: Chinese word segmentation

This directory contains the training, test, and gold-standard data
used in the 2nd International Chinese Word Segmentation Bakeoff.

http://sighan.cs.uchicago.edu/bakeoff2005/

20 Newsgroups

Applicable area: text classification

The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

http://qwone.com/~jason/20Newsgroups/

NLPCC 2017 News Headline Classification

Applicable area: text classification

http://tcci.ccf.org.cn/conference/2017/taskdata.php

Reuters-21578 Text Categorization Collection

Applicable area: text classification

This is a collection of documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories.

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

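Whole-Web News Data (SogouCA)

Applicable areas: text classification, event detection and tracking, new word discovery, named entity recognition, automatic summarization

News data collected from several news sites between June and July 2012, covering 18 channels including domestic, international, sports, society, and entertainment. Each item provides the URL and the body text.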

http://www.sogou.com/labs/resource/ca.php

CMU World Wide Knowledge Base (Web->KB) project

Applicable area: knowledge extraction

To develop a probabilistic, symbolic knowledge base that mirrors the content of the world wide web. If successful, this will make text information on the web available in computer-understandable form, enabling much more sophisticated information retrieval and problem solving.

http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-11/www/wwkb/

Wikidump

Applicable area: word embedding

Chinese: https://dumps.wikimedia.org/zhwiki/latest/
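
Before a dump can be used for word-embedding training, it has to be turned into plain text. Below is a minimal standard-library sketch, assuming the downloaded file is a bz2-compressed MediaWiki XML export such as `zhwiki-latest-pages-articles.xml.bz2`. The function name is illustrative. It yields raw wikitext, so markup stripping and Chinese word segmentation would still be needed afterwards.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_wiki_pages(dump_path):
    """Yield (title, wikitext) for each <page> in a bz2-compressed XML dump."""
    with bz2.open(dump_path, "rb") as f:
        title, text = None, ""
        # Stream the file so the whole dump never has to fit in memory.
        for _event, elem in ET.iterparse(f, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, ""
                elem.clear()  # release the children of the finished page
```

Stripping the namespace from each tag, rather than hard-coding it, is a deliberate choice here because the export schema version in the namespace URI differs between dumps.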

GitHub projects

Large Scale Chinese Corpus for NLP (大规模中文自然语言处理语料)

https://github.com/brightmart/nlp_chinese_corpus
