The Most Comprehensive Roundup of Open Speech/Audio Datasets

    This resource collects more than 40 open-source datasets for speech and audio processing, shared for anyone who needs them.

    Compiled from the web. Source: https://github.com/jim-schwoebel/voice_datasets

     

    There are two main types of audio datasets: speech datasets and audio event/music datasets.

Speech datasets

    2000 HUB5 English - The Hub5 evaluation series focused on conversational speech over the telephone with the particular task of transcribing conversational speech into text. Its goals were to explore promising new areas in the recognition of conversational speech, to develop advanced technology incorporating those ideas and to measure the performance of new technology.

    

    Arabic Speech Corpus - The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. The corpus contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech aligned with recorded speech on the phoneme level. The annotations include word stress marks on the individual phonemes.

    

    ASR datasets - A list of publicly available audio data that anyone can download for ASR or other speech activities.

    

    AudioMNIST - The dataset consists of 30,000 audio samples of spoken digits (0-9) from 60 different speakers.

    

    Common Voice - Common Voice is Mozilla's initiative to help teach machines how real people speak. 12 GB in size; spoken text based on text from a number of public domain sources like user-submitted blog posts, old books, movies, and other public speech corpora.

    

    CHIME - This is a noisy speech recognition challenge dataset (~4 GB in size). The dataset contains real, simulated, and clean voice recordings: real data are actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated data are generated by combining multiple environments with speech utterances, and clean data are non-noisy recordings.

    

    CMU Wilderness - (noncommercial) - not available, but a great speech dataset with many accents reciting passages from the Bible.

    

    Emotional Voices Database - Various emotions with 5 voice actors (amused, angry, disgusted, neutral, sleepy).

    

    Emotional Voice dataset - Nature - 2,519 speech samples produced by 100 actors from 5 cultures. With large-scale statistical inference methods, we find that prosody can communicate at least 12 distinct kinds of emotion that are preserved across the 2 cultures.

    

    Free Spoken Digit Dataset - 4 speakers, 2,000 recordings (50 of each digit per speaker), English pronunciations.

    

    Flickr Audio Caption - 40,000 spoken captions of 8,000 natural images, 4.2 GB in size.

 

    ISOLET Data Set - This 38.7 GB dataset helps predict which letter-name was spoken, a simple classification task.

    

    

    Librispeech - LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech derived from audiobooks from the LibriVox project.

 

    LJ Speech - This is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books. A transcription is provided for each clip. Clips vary in length from 1 to 10 seconds and have a total length of approximately 24 hours.

 

    Multimodal EmotionLines Dataset (MELD) - Multimodal EmotionLines Dataset (MELD) was created by enhancing and extending the EmotionLines dataset. MELD contains the same dialogue instances available in EmotionLines, but it also encompasses audio and visual modalities along with text. MELD has more than 1400 dialogues and 13000 utterances from the Friends TV series. Each utterance in a dialogue has been labeled with one of seven emotions: Anger, Disgust, Sadness, Joy, Neutral, Surprise and Fear.

 

    Noisy Dataset - Clean and noisy parallel speech database. The database was designed to train and test speech enhancement methods that operate at 48 kHz.

    

    Parkinson's speech dataset - The training data belong to 20 Parkinson’s Disease (PD) patients and 20 healthy subjects. Multiple types of sound recordings (26) are taken from each subject for this 20 MB set.

    

    Persian Consonant Vowel Combination (PCVC) Speech Dataset - The Persian Consonant Vowel Combination (PCVC) Speech Dataset is a Modern Persian speech corpus for speech recognition and also speaker recognition. This dataset contains 23 Persian consonants and 6 vowels. The sound samples are all possible combinations of vowels and consonants (138 samples for each speaker) with a length of 30000 data samples.

 

    Speech Accent Archive - For various accent detection tasks.

    

    Speech Commands Dataset - The dataset (1.4 GB) has 65,000 one-second long utterances of 30 short words, by thousands of different people, contributed by members of the public through the AIY website.

 

    Spoken Commands dataset - A large database of free audio samples (10M words), a test bed for voice activity detection algorithms and for recognition of syllables (single-word commands). 3 speakers, 1,500 recordings (50 of each digit per speaker), English pronunciations. This is a really small set, about 10 MB in size.

    

    Spoken Wikipedia Corpora - 38 GB in size, available both with and without audio.

 

    Tatoeba - Tatoeba is a large database of sentences, translations, and spoken audio for use in language learning. This download contains spoken English recorded by their community.

 

    Ted-LIUM - The TED-LIUM corpus was made from audio talks and their transcriptions available on the TED website (noncommercial).

 

    TIMIT dataset - TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. It includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance (paid access).

 

    VoxCeleb - VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males comprise 55%). The celebrities span a diverse range of accents, professions, and ages. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which superstar the voice belongs to.

 

    VoxForge - VoxForge was set up to collect transcribed speech for use with Free and Open Source Speech Recognition Engines.

 

    Zero Resource Speech Challenge - The ultimate goal of the Zero Resource Speech Challenge is to construct a system that learns an end-to-end Spoken Dialog (SD) system, in an unknown language, from scratch, using only information available to a language learning infant. “Zero resource” refers to zero linguistic expertise (e.g., orthographic/linguistic transcriptions), not zero information besides audio (visual, limited human feedback, etc.). The fact that 4-year-olds spontaneously learn a language without supervision from language experts shows that this goal is theoretically reachable.
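
    Several entries above quote an audio format (16 kHz for LibriSpeech, 16-bit 16 kHz for TIMIT, 48 kHz for the Noisy Dataset). After downloading a corpus it can be worth checking a file against the stated format. The following is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `write_test_tone` and the file name `tone.wav` are made up for illustration, and the synthetic tone stands in for a real dataset file so the example runs without any download. It assumes uncompressed PCM WAV input; corpora shipped as FLAC, MP3 or NIST SPHERE would need a different reader.

```python
import math
import os
import struct
import tempfile
import wave


def describe_wav(path):
    """Return basic format facts read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "bit_depth": wav.getsampwidth() * 8,  # bytes per sample -> bits
            "channels": wav.getnchannels(),
            "duration_s": frames / rate,
        }


def write_test_tone(path, rate=16000, seconds=1.0, freq=440.0):
    """Write a mono 16-bit sine tone (hypothetical stand-in for a dataset file)."""
    n = int(rate * seconds)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


# Create a synthetic file and inspect it the same way a real clip would be.
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_test_tone(demo_path)
info = describe_wav(demo_path)
print(info)
```

    Running the same `describe_wav` call over a handful of files from a downloaded corpus is a quick way to spot a mismatch with the advertised sample rate before training on it.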

    

Audio event/music datasets

    

    AudioSet - An expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

 

    Bird audio detection challenge - This challenge contained new datasets (5.4 GB) collected in real live bio-acoustics monitoring projects, and an objective, standardized evaluation framework.

 

    Environmental audio dataset - Audio data collection and manual data annotation are both tedious processes, and the lack of a proper development dataset limits fast development in environmental audio research.

 

    Free Music Archive - FMA is a dataset for music analysis. 1000 GB in size.

 

    Freesound dataset - Many different sound events (https://annotator.freesound.org/ and https://annotator.freesound.org/fsd/explore/). The AudioSet Ontology is a hierarchical collection of over 600 sound classes and we have filled them with 297,159 audio samples from Freesound. This process generated 678,511 candidate annotations that express the potential presence of sound.

 

    Karoldvl-ESC - The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings suitable for benchmarking methods of environmental sound classification.

 

    Million Song Dataset - The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. 280 GB in size.

 

    Urban Sound Dataset - Two datasets and a taxonomy for urban sound research.
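
    Clip counts are easier to compare across datasets once converted to hours of audio. As a rough sketch, the AudioSet figures quoted above (2,084,320 clips of 10 seconds each) can be turned into a total duration like this; the function name `total_hours` is made up for illustration, and the calculation assumes every clip is exactly 10 seconds long, which makes the result an upper-bound estimate rather than a measured value.

```python
def total_hours(num_clips, clip_seconds):
    """Total audio duration in hours, assuming fixed-length clips."""
    return num_clips * clip_seconds / 3600.0


# Figures as quoted in the AudioSet entry of this list.
audioset_hours = total_hours(2_084_320, 10)
print(f"AudioSet: about {audioset_hours:.1f} hours")
```

    Under that assumption the collection comes to roughly 5,790 hours of audio.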
