Classic datasets in artificial intelligence, reposted from medium.com. Since the original page cannot be opened from mainland China, it is transcribed here.
Original page: https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2
Some links may behave oddly on mobile; opening this page on a desktop browser is recommended for downloading the material.
Computer Vision
MNIST : the most commonly used sanity check. Dataset of 28x28, centered, B&W handwritten digits. It is an easy task - just because something works on MNIST does not mean it works in general.
CIFAR 10 & CIFAR 100 : 32x32 color images. Not commonly used anymore, though once again, they can be an interesting sanity check.
ImageNet : the de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000 category WordNet hierarchy from ImageNet.
LSUN : Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
PASCAL VOC : Generic image segmentation / classification - not terribly useful for building real-world image annotation, but great for baselines.
SVHN : House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
MS COCO : Generic image understanding / captioning, with an associated competition.
Visual Genome : Very detailed visual knowledge base with deep captioning of ~ 100K images.
Labeled Faces in the Wild : Cropped faces (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset - it's quite common for people to train face matching systems here.
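The raw MNIST files above ship in the simple IDX binary format: big-endian, two zero bytes, one type byte (0x08 for unsigned byte), one byte giving the number of dimensions, then one 4-byte size per dimension. A minimal sketch of parsing that header with only the standard library (the function name and the synthetic sample bytes are illustrative, not part of any official API):

```python
import struct

def parse_idx_header(raw: bytes):
    """Parse the header of an IDX file (the format MNIST ships in).

    Layout (big-endian): 0x00 0x00, one type byte (0x08 = unsigned byte),
    one byte giving the number of dimensions, then one uint32 per dimension.
    """
    zeros, dtype, ndim = struct.unpack_from(">HBB", raw, 0)
    if zeros != 0:
        raise ValueError("not an IDX file")
    dims = struct.unpack_from(">" + "I" * ndim, raw, 4)
    return dtype, dims

# Synthetic header mimicking train-images-idx3-ubyte:
# 60000 images of 28x28 unsigned bytes.
header = struct.pack(">HBB", 0, 0x08, 3) + struct.pack(">III", 60000, 28, 28)
dtype, dims = parse_idx_header(header)
print(dtype, dims)  # 8 (60000, 28, 28)
```

The pixel data simply follows the header as a flat run of bytes, so the header's dimension sizes tell you exactly how much to read and how to reshape it.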
Natural Language
Text Classification Datasets (Google Drive link) from Zhang et al., 2015 : An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification methods. Sample sizes of 120K to 3.6M, ranging from binary to 14 class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.
WikiText : large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
Question Pairs : first dataset release from Quora containing duplicate / semantic similarity labels.
SQuAD : The Stanford Question Answering Dataset - broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span, or segment of text.
CMU Q / A Dataset : Manually-generated factoid question / answer pairs with difficulty ratings from Wikipedia articles.
Maluuba Datasets : Sophisticated, human-generated datasets for stateful natural language understanding research.
Billion Words : large, general purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.
Common Crawl : Petabyte-scale crawl of the web - most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset for its crawl of the WWW.
bAbi : synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR).
The Children's Book Test ( download link ): Baseline of (Question + context, Answer) pairs extracted from children's books available through Project Gutenberg. Useful for question-answering, reading comprehension, and factoid look-up.
Stanford Sentiment Treebank : standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.
20 Newsgroups : one of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.
Reuters : older, purely classification based dataset with text from the newswire. Commonly used in tutorials.
IMDB : an older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature in lieu of larger datasets.
UCI's Spambase : Older, classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
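Most of the classification datasets above are benchmarked with pipelines along the lines of bag-of-words features plus a simple linear or naive Bayes classifier. A minimal sketch of that baseline (assuming scikit-learn is installed; the six-document toy corpus stands in for a real dataset like AG News or Spambase):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for a real labeled corpus.
texts = [
    "buy cheap pills now", "cheap pills win money", "win money now",
    "meeting schedule today", "project meeting today", "schedule project review",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Bag-of-words term counts + multinomial naive Bayes: a classic baseline.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

pred = clf.predict(vectorizer.transform(["buy cheap pills"]))[0]
print(pred)  # spam
```

On the real datasets the pipeline is identical; only the corpus loading changes, which is part of why these collections are such convenient baselines.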
Speech
Most speech recognition datasets are proprietary - the data holds a lot of value for the company that curates it. Most datasets available in the field are quite old.
2000 HUB5 English : English-only speech data used most recently in the Deep Speech paper from Baidu.
LibriSpeech : Audio books dataset of text and speech. Nearly 500 hours of clean speech from various audio books read by multiple speakers, organized by chapters of the book, containing both the text and the speech.
VoxForge : Clean speech dataset of accented English, useful for instances in which you expect to need robustness to different accents or intonations.
TIMIT : English-only speech recognition dataset.
CHIME : Noisy speech recognition challenge dataset. Contains real, simulated, and clean voice recordings: real being actual recordings of 4 speakers across nearly 9000 recordings in 4 noisy locations; simulated being generated by combining speech utterances with multiple environments; and clean being non-noisy recordings.
TED-LIUM : Audio transcription of TED talks. 1495 TED talk audio recordings along with full text transcriptions of those recordings.
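Systems trained on corpora like these are usually compared by word error rate (WER): the word-level edit distance between hypothesis and reference transcripts, divided by the reference length. A minimal sketch using only the standard library (the example sentences are invented):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fox"))  # 0.0
print(wer("the quick brown fox", "the quik brown dog"))   # 0.5
```

Because insertions and deletions count alongside substitutions, WER can exceed 1.0 when the hypothesis is much longer than the reference.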