- Text Classification
- Language Modeling
- Image Captioning
- Machine Translation
- Question Answering
- Speech Recognition
- Document Summarization
Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.
Below are some good beginner text classification datasets.
- Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987 indexed by categories. Also see RCV1, RCV2 and TRC2.
- [IMDB Movie Review Sentiment Classification](http://ai.stanford.edu/~amaas/data/sentiment/) (stanford). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
- News Group Movie Review Sentiment Classification (cornell). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
For more, see the post:
Datasets for single-label text categorization.
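To make the task concrete, below is a minimal sketch of a sentiment classifier using a bag-of-words Naive Bayes model in plain Python. The four training sentences are invented for illustration and are not drawn from any of the datasets above; the function names (`tokenize`, `train_naive_bayes`, `predict`) are likewise hypothetical. A real experiment would load one of the listed datasets and hold out a test set.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines would do more."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability under the model."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # Log prior for the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data invented for illustration, not taken from any dataset above.
train = [
    ("a wonderful and moving film", "pos"),
    ("great acting and a great story", "pos"),
    ("a dull and boring film", "neg"),
    ("terrible acting and a weak story", "neg"),
]
model = train_naive_bayes(train)
print(predict(model, "a great film"))     # pos
print(predict(model, "boring and weak"))  # neg
```

Swapping the toy list for (review, label) pairs read from a dataset such as the IMDB reviews should be enough to try the same idea at a realistic scale.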
Language modeling involves developing a statistical model for predicting the next word in a sentence or next letter in a word given whatever has come before. It is a precursor to tasks like speech recognition and machine translation.
Below are some good beginner language modeling datasets.
- Project Gutenberg. A large collection of free books that can be retrieved in plain text for a variety of languages.
There are more formal corpora that are well studied; for example:
- Brown University Standard Corpus of Present-Day American English. A large sample of English words.
- Google 1 Billion Word Corpus.
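As a rough illustration of what "predicting the next word" means, here is a minimal sketch of a bigram language model in plain Python. It is a deliberate simplification: it conditions only on the single preceding word rather than on everything that has come before. The three-sentence corpus and the function names are made up for the example; in practice the sentences would come from a corpus like those listed above.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts


def next_word_probabilities(follow_counts, prev):
    """Return P(next | prev) as a dict, estimated by relative frequency."""
    counts = follow_counts.get(prev, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the previous word was never seen in training
    return {word: count / total for word, count in counts.items()}


# Toy corpus invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
bigram_model = train_bigram_model(corpus)
probs = next_word_probabilities(bigram_model, "the")
print(max(probs, key=probs.get))  # cat
```

Here "cat" follows "the" in two of the six occurrences, so it is the most likely next word. Larger corpora and longer contexts (trigrams, or neural models) follow the same pattern of estimating the next token from what precedes it.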
Image captioning is the task of generating a textual description for a given image.
Below are some good beginner image captioning datasets.
- Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
- Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
- Flickr 30K. A collection of 30 thousand described images taken from flickr.com.
For more see the post:
Machine translation is the task of translating text from one language to another.
Below are some good beginner machine translation datasets.
- Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
- European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of European languages.
There are a ton of standard datasets used for the annual machine translation challenges; see:
Question answering is a task where a sentence or sample of text is provided from which questions are asked and must be answered.
Below are some good beginner question answering datasets.
- Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
- Deepmind Question Answering Corpus. Question answering about news articles from the Daily Mail.
- Amazon question/answer data. Question answering about Amazon products.
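Question answering datasets usually pair each question with the passage it refers to and one or more answers. The sketch below flattens a tiny hand-written sample into (question, context, answer) triples. It assumes the nested JSON layout used by SQuAD v1.1 (`data`, `paragraphs`, `context`, `qas`, `answers`, `answer_start`); check the files you download, as other datasets and later versions may differ. The sample text and the `flatten_squad` helper are hypothetical.

```python
import json

# A tiny hand-written sample in the assumed SQuAD v1.1 layout.
sample = """
{
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "The library opens at nine in the morning.",
          "qas": [
            {
              "id": "q1",
              "question": "When does the library open?",
              "answers": [{"text": "nine", "answer_start": 21}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def flatten_squad(raw):
    """Yield (question, context, answer_text) triples from SQuAD-style JSON."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    # The answer is a span of the context, so the offsets
                    # should recover the answer text exactly.
                    assert context[start:end] == answer["text"]
                    yield qa["question"], context, answer["text"]


triples = list(flatten_squad(json.loads(sample)))
print(triples[0][0], "->", triples[0][2])
```

Flattening to triples like this makes it easy to inspect a few examples by eye before feeding the data to a model.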
For more, see the post:
Speech recognition is the task of transforming audio of a spoken language into human-readable text.
Below are some good beginner speech recognition datasets.
- TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English and associated transcription.
- VoxForge. Project to build an open source database for speech recognition.
- LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.
Document summarization is the task of creating a short meaningful description of a larger document.
Below are some good beginner document summarization datasets.
- Legal Case Reports Data Set. A collection of 4 thousand legal cases and their summaries.
- TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
- The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.
For more see:
This section provides additional lists of datasets if you are looking to go deeper.
- Text Datasets Used in Research on Wikipedia
- Datasets: What are the major text corpora used by computational linguists and natural language processing researchers?
- Stanford Statistical Natural Language Processing Corpora
- Alphabetical list of NLP Datasets
- NLTK Corpora
- Open Data for Deep Learning on DL4J
- NLP datasets