Open-source Mandarin Chinese speech recognition datasets, as of 2024-01-02
Dataset | Duration (h) | Speakers | Annotation accuracy | Download link | License | Notes |
thchs30 | 30 | 40 | - | openslr.org | Apache License v.2.0 | - |
Primewords_set1 | 100 | 296 | >98% | openslr.org | CC BY-NC-ND 4.0 | - |
aishell1 | 178 | 400 | >95% | openslr.org | Apache License v.2.0 | - |
ST-CMDS | 122 | 855 | - | openslr.org | CC BY-NC-ND 4.0 | - |
aishell2 | 1000 | 1991 | >96% | AISHELL (希尔贝壳) website | - | Application required |
aidatatang_200zh | 200 | 600 | >98% | openslr.org | CC BY-NC-ND 4.0 | - |
aidatatang_1505zh | 1505 | 6408 | >98% | Datatang (数据堂) website | CC BY-NC-ND 4.0 | Application required |
Speechocean | 10.33 | 20 | >98% | openslr.org | CC BY-NC-ND 4.0 | - |
MAGICDATA | 755 | 1080 | >98% | openslr.org | CC BY-NC-ND 4.0 | - |
Common Voice | 70 | 3333 | - | Common Voice | CC-0 | MP3 format |
aishell3 | 85 | 218 | >98% | openslr.org | Apache License v.2.0 | - |
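TAL_ASR | 100 | 80+ | - | TAL (好未来) AI Open Platform datasets (100tal.com) | - | Available for download after registration |
WenetSpeech | 10000 | - | ≥95% | WenetSpeech (wenet-e2e.github.io) | CC BY 4.0 | Download after the application form is approved |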
MAGICDATA Conversational | 180 | 663 | - | openslr.org | CC BY-NC-ND 4.0 | - |
SHALCAS22A | 60 | - | - | openslr.org | CC BY-NC-ND 4.0 | - |
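
When choosing corpora from the table, it can help to see how many hours fall under each license. The following is a minimal Python sketch, not part of the original list: the `DATASETS` literal and the `hours_by_license` helper are hypothetical names introduced here, the hours and license strings are copied from the table, and rows with no license listed (aishell2, TAL_ASR) are entered as "-" by assumption.

```python
from collections import defaultdict

# (dataset, hours, license) copied from the table above.
# "-" means the table lists no license for that dataset (assumption for TAL_ASR).
DATASETS = [
    ("thchs30", 30, "Apache License v.2.0"),
    ("Primewords_set1", 100, "CC BY-NC-ND 4.0"),
    ("aishell1", 178, "Apache License v.2.0"),
    ("ST-CMDS", 122, "CC BY-NC-ND 4.0"),
    ("aishell2", 1000, "-"),
    ("aidatatang_200zh", 200, "CC BY-NC-ND 4.0"),
    ("aidatatang_1505zh", 1505, "CC BY-NC-ND 4.0"),
    ("Speechocean", 10.33, "CC BY-NC-ND 4.0"),
    ("MAGICDATA", 755, "CC BY-NC-ND 4.0"),
    ("Common Voice", 70, "CC-0"),
    ("aishell3", 85, "Apache License v.2.0"),
    ("TAL_ASR", 100, "-"),
    ("WenetSpeech", 10000, "CC BY 4.0"),
    ("MAGICDATA Conversational", 180, "CC BY-NC-ND 4.0"),
    ("SHALCAS22A", 60, "CC BY-NC-ND 4.0"),
]


def hours_by_license(rows):
    """Sum the listed hours for each license string."""
    totals = defaultdict(float)
    for _name, hours, license_name in rows:
        totals[license_name] += hours
    return dict(totals)


totals = hours_by_license(DATASETS)

# Print licenses from most to fewest hours.
for license_name, hours in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{license_name:22s} {hours:10.2f} h")
```

The totals only reflect the durations listed in the table; the license terms on each download page remain the authoritative source before any use.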