A Roundup of the Best Chinese and International Corpora

Author: Zero_to_zero1234

Last modified: 2025-03-12 19:56:11

Category: Natural Language Processing. Tags: roundup of the best Chinese and international corpora

First published: 2019-07-18 15:25:57

Article link: https://blog.csdn.net/suiyueruge1314/article/details/96431911

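Corpora are essential both to linguistics research and to deep learning. This post summarizes commonly used text and speech corpus resources. Some of the information comes from other blogs, and the post will be kept up to date.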

Human video generation datasets
Reference: https://github.com/taichuai/awesome-human-video-generation-corpus

Update 2025-01-08

Conversational digital-human data
https://project.mhzhou.com/vico/

Open Speech and Language Resources
http://www.openslr.org/resources.php

Update (2020-06-10):

Several open-source speech datasets: https://blog.ailemon.me/2018/11/21/free-open-source-chinese-speech-datasets/

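Update 2020/10/23

AISHELL-3, a high-fidelity Mandarin speech corpus. The AISHELL-3 Mandarin speech database from Beijing Shell Shell (希尔贝壳) contains 85 hours of speech in 88,035 utterances and can be used to build multi-speaker synthesis systems. Recordings were made in a quiet indoor environment with a high-fidelity microphone (44.1 kHz, 16-bit). 218 speakers from different accent regions of China took part. Professional annotators labeled pinyin and prosody, and after strict quality inspection the character-level transcription accuracy is above 98%. (Available for academic research; commercial use is prohibited without permission.)

DiDiSpeech: A Large Scale Mandarin Speech Corpus. It consists of about 800 hours of speech data at a 48 kHz sampling rate from 6000 speakers, together with the corresponding texts. All speech data in the corpus was recorded in a quiet environment and is suitable for various speech processing tasks, such as voice conversion, multi-speaker text-to-speech, and automatic speech recognition.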
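
As a rough sanity check on the AISHELL-3 figures quoted above, the sketch below derives the average utterance length and an estimate of the uncompressed audio size. The mono channel count is an assumption (the description does not state it), so treat the size as a ballpark rather than an official number.

```python
# Figures quoted in this post for AISHELL-3.
HOURS = 85
UTTERANCES = 88_035
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2   # 16-bit samples
CHANNELS = 1           # assumption: mono; not stated in the description

total_seconds = HOURS * 3600
avg_utterance_s = total_seconds / UTTERANCES
# Raw PCM size = seconds * samples per second * bytes per sample * channels.
pcm_bytes = total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS

print(f"average utterance length: {avg_utterance_s:.2f} s")
print(f"uncompressed PCM size: {pcm_bytes / 1e9:.1f} GB")
```

Short utterances of a few seconds each are typical for TTS training data, and the size estimate helps when planning disk space before downloading.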

NHSS: A Speech and Singing Parallel Database
We present a database of parallel recordings of speech and singing, collected and released by the Human Language Technology (HLT) laboratory at the National University of Singapore (NUS), called the NUS-HLT Speak-Sing (NHSS) database. This database consists of recordings of sung vocals of English pop songs, the spoken counterpart of the lyrics read by the singers in their natural reading manner, and manually prepared utterance-level and word-level annotations. The audio recordings in the NHSS database correspond to a total of 100 songs sung and spoken by 10 singers, resulting in a total of 7 hours of audio data. There are 5 male and 5 female singers, singing and reading the lyrics of 10 songs each. We release this database to the public for research activities.

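Update 2020/12/25
http://www.openslr.org/82/
CN-Celeb is a multi-scenario speaker recognition dataset containing speech segments from 3,000 Chinese celebrities in scenarios such as interviews, singing and dancing performances, music, and film and TV. CN-Celeb2 was collected with a pipeline similar to that of CN-Celeb1: all speech segments were extracted from the data sources by automated processing and then verified manually. The CN-Celeb series covers complexity in noise, channel, speaking style, and other aspects, which makes it particularly suitable for research on speaker recognition in complex conditions.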

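Update 2021/02/10
Dataset name: speechocean762

Download link: http://www.openslr.org/101/ . The corresponding Kaldi recipe is at egs/gop_speechocean762.

About the data: Xiaomi's speech team and Speechocean (海天瑞声) jointly open-sourced the industry's first fairly complete public dataset for English pronunciation assessment. The speakers are Chinese speakers of English, the samples are balanced, and the content is comprehensive. The dataset contains 5,000 English sentences covering many aspects of daily life, recorded by 250 non-native English speakers whose mother tongue is Mandarin. Gender and age are balanced: the male-to-female ratio is 1:1, and the child-to-adult ratio is 1:1. Speakers' English proficiency was carefully designed and screened, with a good : average : poor ratio of 2:1:1, so the data supports feedback testing for learners at different levels.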

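Data-Baker (标贝) open-source data:

https://www.data-baker.com/#/data/index/source
Effective duration: about 12 hours
Average length: 16 characters
Language: standard Mandarin
Speaker: female, aged 20-30, with a positive, intellectual voice
Recording environment: a professional recording studio. 1) The studio meets professional voice-bank recording standards; 2) the recording environment and equipment stayed the same throughout; 3) the signal-to-noise ratio of the recording environment is no lower than 35 dB.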

cmudict

http://www.speech.cs.cmu.edu/cgi-bin/cmudict
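
For readers who want to use cmudict as a pronunciation lexicon, here is a minimal sketch of parsing its plain-text layout (a word, whitespace, then ARPAbet phones, with a `(1)`-style suffix on alternate pronunciations). The sample lines are inlined so the code runs offline; `parse_cmudict` is a hypothetical helper name, not part of any CMU tooling.

```python
import re
from collections import defaultdict

# A few lines in the plain-text CMUdict layout; not the full dictionary.
SAMPLE = """\
;;; comment lines start with three semicolons
HELLO  HH AH0 L OW1
HELLO(1)  HH EH0 L OW1
WORLD  W ER1 L D
"""

# Matches the "(1)" suffix that marks an alternate pronunciation.
_VARIANT = re.compile(r"\(\d+\)$")

def parse_cmudict(text):
    """Return {lowercased word: [phone list, ...]} from CMUdict-style text."""
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blanks and comments
        word, *phones = line.split()
        entries[_VARIANT.sub("", word).lower()].append(phones)
    return dict(entries)

lexicon = parse_cmudict(SAMPLE)
print(lexicon["hello"])
```

Grouping variants under one key keeps every pronunciation available, which is usually what a TTS front end or a forced aligner needs.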

Cantonese NLP:
https://github.com/CanCLID/awesome-cantonese-nlp

IPA:
https://en.wikipedia.org/wiki/Pinyin
https://github.com/untunt/PhonoCollection/blob/master/Standard%20Chinese.md

Update 2021/06/22
An open-source Chinese-English bilingual, multi-speaker emotional voice conversion (VC) database: Emotional Voice Conversion: Theory, Databases and ESD (https://arxiv.org/abs/2105.14762)

Update 2021/09/01
RyanSpeech Corpus

RyanSpeech is a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or do not have quality male speech data. In order to meet the need for a high-quality

http://mohammadmahoor.com/ryanspeech/

EVC:
https://arxiv.org/pdf/2105.14762.pdf

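Update 2021/09/08
AISHELL-4
http://www.aishelltech.com/aishell_4
AISHELL-4 is an 8-channel Mandarin Chinese meeting-scenario speech dataset recorded with a microphone array. It contains 211 meeting sessions with 4 to 8 participants each, about 120 hours in total. The dataset is intended to advance research on multi-speaker processing in real-world scenarios. The AISHELL-4 data includes many important characteristics of real meetings, such as pauses, overlapping speech, speaker turn-taking, and noise. It also provides accurate transcripts and timestamps, so researchers can work on individual tasks such as front-end processing, speech recognition, and speaker diarization, and can also optimize them jointly.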

Update 2021/10/14
WenetSpeech

A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition
https://wenet-e2e.github.io/WenetSpeech/

Update 2022/02/15
Adding several English corpora
LibriTTS corpus
http://openslr.magicdatatech.com/60/
Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus

Common Voice
https://commonvoice.mozilla.org/zh-CN/datasets

Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS)
http://www.openslr.org/109/
About this resource:
Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) is a multi-speaker English dataset for training text-to-speech models. The dataset is based on public audiobooks from LibriVox and texts from Project Gutenberg.
The Hi-Fi TTS dataset contains about 291.6 hours of speech from 10 speakers with at least 17 hours per speaker sampled at 44.1 kHz.
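
To put the Hi-Fi TTS figures in context, the sketch below compares hours per speaker across three TTS-oriented corpora using only the numbers quoted in this post (several of them are approximate, so the output is indicative only). The tuple layout and the `hours_per_speaker` helper are illustrative choices, not part of any dataset tooling.

```python
# (name, total hours, speakers, sample rate in kHz), as quoted in this post.
CORPORA = [
    ("AISHELL-3", 85.0, 218, 44.1),
    ("DiDiSpeech", 800.0, 6000, 48.0),
    ("Hi-Fi TTS", 291.6, 10, 44.1),
]

def hours_per_speaker(corpus):
    """Average recorded hours per speaker for one corpus tuple."""
    _, hours, speakers, _ = corpus
    return hours / speakers

# Rank corpora from most to least audio per speaker.
ranked = sorted(CORPORA, key=hours_per_speaker, reverse=True)
for name, hours, speakers, khz in ranked:
    print(f"{name:<11} {hours / speakers:7.2f} h/speaker  {khz} kHz")
```

A high hours-per-speaker figure suits single-voice or few-voice TTS training, while corpora with many speakers and little audio each are a better fit for multi-speaker or voice-conversion work.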

Free ST American English Corpus
http://www.openslr.org/45/

VCTK
CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)
https://datashare.ed.ac.uk/handle/10283/3443

M-AILABS

Most of the data is based on LibriVox and Project Gutenberg. The training data consists of nearly a thousand hours of audio and the text files in a prepared format.

https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/

Several dataset directories:
http://openslr.magicdatatech.com/resources.php

https://github.com/coqui-ai/open-speech-corpora

Update 2023/08/06
StarRail Dataset: a multi-speaker game voice dataset sourced from miHoYo

https://github.com/AI-Hobbyist/StarRail_Datasets/tree/main/Label%20%26%20Voice

LibriSpeech clean:
http://www.openslr.org/141/

Update 2023-08-24
https://www.tedownload.com/ (downloads of The Economist)
https://github.com/hehonghui/awesome-english-ebooks
https://www.douban.com/group/topic/283251376/?_i=2814330H4g_IOf

Update 2024-04-29
StoryTTS dataset from Shanghai Jiao Tong University: StoryTTS: A Highly Expressive Text-to-Speech Dataset with Rich Textual Expressiveness Annotations

Expressive Anechoic Recordings of Speech (EARS) dataset, open-sourced by Meta

Chinese conversational dataset:
MAGICDATA Mandarin Chinese Conversational Speech Corpus

WenetSpeech4TTS: https://huggingface.co/datasets/Wenetspeech4TTS/WenetSpeech4TTS

Update 2024-06-23
Talking digital-human video datasets: https://github.com/taichuai/awesome-talking-head-corpus

Update 2024-06-27
DisfluencySpeech – Single-Speaker Conversational Speech Dataset with Paralanguage (an accented TTS corpus of about 10 hours):
https://huggingface.co/datasets/amaai-lab/DisfluencySpeech?row=2

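Update 2024-09-04
https://huggingface.co/datasets/MushanW/GLOBE
GLOBE is a high-quality English corpus with accents from around the world. It was designed specifically to address the poor generalization that current zero-shot speaker-adaptive text-to-speech (TTS) systems show when adapting to speakers with different accents. Compared with commonly used English corpora such as LibriTTS and VCTK, GLOBE is distinctive in that it contains utterances from 23,519 speakers covering 164 accents worldwide, and it provides detailed metadata for those speakers.

https://emilia-dataset.github.io/Emilia-Demo-Page/

The Emilia dataset was built by collecting large amounts of speech from online video platforms and podcasts. The data spans many content types, including talk shows, interviews, debates, sports commentary, and audiobooks. This diversity helps the dataset capture a wide range of real human speaking styles.

The initial release of Emilia contains 101,654 hours of multilingual speech in six languages: English, French, German, Chinese, Japanese, and Korean.

https://github.com/keonlee9420/DailyTalk
DailyTalk is a new text-to-speech (TTS) dataset designed for conversational TTS. It contains 2,541 dialogues that were sampled, modified, and recorded from the open-source dialogue dataset DailyDialog, and it inherits DailyDialog's annotation attributes. DailyTalk aims to address the lack of conversational content in current TTS datasets. On top of DailyTalk, the researchers extended prior work into a baseline in which a non-autoregressive TTS model is conditioned on the dialogue history.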

Image datasets

http://www.seeprettyface.com/mydataset_page2.html
http://www.gwylab.com/download.html

Other:

International corpora ❀❀❀

BNC (British National Corpus): http://www.natcorp.ox.ac.uk/

BOE (the Bank of English, Collins English corpus): http://www.collinslanguage.com/wordbanks/

United Nations document database (800,000 parallel documents in six languages): http://documents.un.org/simple.asp

ANC (American National Corpus): http://www.anc.org/

Lancaster Corpus of Mandarin Chinese (LCMC): http://ota.oucs.ox.ac.uk/s/download.php?otaid=2474

OLAC (Open Language Archives Community): http://search.language-archives.org/index.html

COCA (Corpus of Contemporary American English): http://www.americancorpus.org/

COHA (Corpus of Historical American English): http://corpus.byu.edu/coha/

Sketch Engine multilingual corpora: www.sketchengine.co.uk

BASE (British Academic Spoken English Corpus): http://www2.warwick.ac.uk/fac/soc/celte/research/base/

Leeds: http://corpus.leeds.ac.uk/internet.html

JustTheWord： http://193.133.140.102/JustTheWord/index.html

Lextutor: http://www.lextutor.ca/

Web Concordancer: www.edict.com.hk

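Chinese corpora ❀❀❀
BCC Corpus: http://bcc.blcu.edu.cn/

Corpus: http://yulk.org/

Corpus Online: http://www.cncorpus.org/

Center for Chinese Linguistics, Peking University: http://ccl.pku.edu.cn/corpus.asp

State Language Commission Modern Chinese Corpus: http://www.cncorpus.org/

BFSU Corpus Linguistics: http://www.bfsu-corpus.org/

Ancient Chinese corpus: http://www.cncorpus.org/login.aspx

Corpus Linguistics Online: http://ccl.pku.edu.cn/corpus.asp

People's Daily annotated corpus: http://www.icl.pku.edu.cn/icl_res/

Research and Development Center for Chinese International Education Technology, HSK Dynamic Composition Corpus: http://202.112.195.192:8060/hsk/login.asp

Institute of Language Studies, Beijing Spoken Language Corpus query system (BJKY): http://www.blcu.edu.cn/yys/6_beijing/6_beijing_chaxun.asp

Balanced Corpus of Modern Chinese: http://www.sinica.edu.tw/SinicaCorpus/

Classical Chinese corpus: http://www.sinica.edu.tw/ftms-bin/ftmsw

Tagged Corpus of Early Mandarin Chinese: http://www.sinica.edu.tw/Early_Mandarin/

Treebank database: http://treebank.sinica.edu.tw/

Chinese-English Bilingual Ontological Wordnet: http://bow.sinica.edu.tw/

Sou Wen Jie Zi (搜文解字): http://words.sinica.edu.tw/

Wen Guo Xun Bao Ji (文国寻宝记): http://www.sinica.edu.tw/wen/

Three Hundred Tang Poems: http://cls.admin.yzu.edu.tw/300/

Hanji electronic texts (Chinese classics database): http://www.sinica.edu.tw/~tdbproj/handy1/

Dream of the Red Chamber online teaching and research data center: http://cls.hs.yzu.edu.tw/HLM/home.htm

Communication University of China text corpus retrieval system: http://ling.cuc.edu.cn/RawPub/

Harbin Institute of Technology Information Retrieval Lab, shared corpus resources: http://ir.hit.edu.cn/demo/ltp/Sharing_Plan.htm

Research Centre on Linguistics and Language Information Sciences, Hong Kong Institute of Education, and its corpus laboratory: http://www.livac.org/index.php?lang=sc

Chinese Linguistic Data Consortium: http://www.chineseldc.org/

Brigham Young University corpora: http://view.byu.edu/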