The 20 Latest Interesting Deep Learning Datasets

Larger labeled datasets and more available computing power are the cornerstones of the AI revolution. In this article, I list some very fun deep learning datasets that we recently discovered for data scientists.

1. EMNIST: An Extension of MNIST to Handwritten Letters

MNIST is a very popular dataset for people getting started with deep learning in particular and machine learning on images in general. MNIST contains images of handwritten digits labeled with the digits they represent; EMNIST extends this to images of handwritten letters as well. The dataset can be downloaded here. There is also an alternative dataset we discovered on Reddit, called HASYv2, which can be downloaded here.
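For a quick look at the data, torchvision ships a loader for EMNIST. A minimal sketch (the `letters` split is one of several available splits, and `./data` is just a local cache directory):

```python
import torch
from torchvision import datasets, transforms

# Download the "letters" split of EMNIST (other splits: byclass, bymerge,
# balanced, digits, mnist) and convert the images to tensors.
train_set = datasets.EMNIST(
    root="./data", split="letters", train=True,
    download=True, transform=transforms.ToTensor(),
)

loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape)   # torch.Size([64, 1, 28, 28])
print(labels[:10])    # integer labels (1-26 for the letters split)
```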

2. HICO & HICO-DET

HICO has images containing multiple objects, and these objects have been tagged along with their relationships. The proposed problem is for algorithms to detect the objects in an image and the relationships between them after being trained on this dataset. I expect multiple papers to come out of this dataset in the future.
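The annotations ship as MATLAB `.mat` files, and the label space is a fixed set of 600 human-object interaction (HOI) categories, i.e. (verb, object) pairs. A hedged sketch; the file name follows the public release, and the internal schema should be inspected before relying on it:

```python
import numpy as np
from scipy.io import loadmat

# Load the HICO annotation file (adjust the path to your download) and
# inspect the top-level keys before assuming any particular schema.
anno = loadmat("hico/anno.mat")
print(anno.keys())

# HICO labels each image with HOI categories, so a natural training setup
# is multi-label classification: one binary target per HOI category.
num_hoi_categories = 600            # HICO defines 600 (verb, object) pairs
target = np.zeros(num_hoi_categories, dtype=np.float32)
target[[12, 407]] = 1.0             # hypothetical: image shows categories 12 and 407
```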

3. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

CLEVR comes from the group of Fei-Fei Li, the scientist who developed the revolutionary ImageNet dataset. It has objects and questions asked about those objects, along with answers specified by humans. The aim of the project is to develop machines with common sense about what they see; for example, the machine should be able to automatically find the "odd one out" in an image. You can download the dataset here.
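The questions ship as one large JSON file per split. A sketch of reading it (the path and field names follow the v1.0 release; verify against your download):

```python
import json

# CLEVR stores all questions for a split in a single JSON file.
with open("CLEVR_v1.0/questions/CLEVR_train_questions.json") as f:
    data = json.load(f)

for q in data["questions"][:3]:
    print(q["image_filename"])  # which rendered scene the question refers to
    print(q["question"])        # e.g. "How many cubes are behind the sphere?"
    print(q["answer"])          # ground-truth answer string
```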

4. HolStep: A Machine Learning Dataset for Higher-order Logic Theorem Proving

This dataset is tagged so that algorithms trained on it can be used for automatic theorem proving. The download link is here.
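Each conjecture comes as a plain-text file whose lines carry single-character prefixes. A hedged parsing sketch, assuming '+' marks proof steps that were used and '-' marks unused ones, per the paper (check the dataset README for the exact markers; the path is a placeholder):

```python
def parse_holstep_file(path):
    """Split a HolStep conjecture file into positive and negative proof steps.

    Assumes lines prefixed '+' were used in the proof and lines prefixed '-'
    were not; other prefixes (conjecture text, metadata) are skipped.
    """
    positives, negatives = [], []
    with open(path) as f:
        for line in f:
            if line.startswith("+"):
                positives.append(line[1:].strip())
            elif line.startswith("-"):
                negatives.append(line[1:].strip())
    return positives, negatives

# Usage (hypothetical path): steps become (text, label) training pairs.
pos, neg = parse_holstep_file("holstep/train/00001")
examples = [(s, 1) for s in pos] + [(s, 0) for s in neg]
```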

5. The Parallel Meaning Bank: Towards a Multilingual Corpus of Translations Annotated with Compositional Meaning Representations

The Parallel Meaning Bank (PMB), developed at the University of Groningen, comprises sentences and texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and formal meaning representations. The download link is here.

6. JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

The JFLEG dataset tags sentences with both minimal grammatical corrections and broader fluency rewrites. It aims to help build machines that can automatically correct the grammar of people making mistakes. The dataset can be downloaded here.

7. Introducing VQA v2.0: A More Balanced and Bigger VQA Dataset!

This dataset has images, questions asked about them, and their tagged answers. The aim is to train machines to answer questions about images (and, by extension, about the real world they are seeing). Visual QA is an older dataset, but its 2.0 version came out just this December.
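A hedged sketch of pairing questions with answers from the v2.0 release (the file names follow the official download; adjust paths as needed):

```python
import json

# VQA v2 splits questions and answers across two JSON files.
with open("v2_OpenEnded_mscoco_train2014_questions.json") as f:
    questions = {q["question_id"]: q for q in json.load(f)["questions"]}

with open("v2_mscoco_train2014_annotations.json") as f:
    annotations = json.load(f)["annotations"]

# Each annotation carries a consensus answer plus ten per-annotator answers.
for ann in annotations[:3]:
    q = questions[ann["question_id"]]
    print(q["image_id"], q["question"], ann["multiple_choice_answer"])
```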

8. Google Cloud & YouTube-8M Video Understanding Challenge

Probably the largest dataset available for training in the open: 8 million YouTube videos tagged with the objects that appear in them. There is also a running Kaggle competition on the dataset with a bounty of $100,000.
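The videos are distributed as TFRecord files of precomputed features rather than raw video. A hedged sketch of parsing the video-level records (the feature names follow the original starter code and may differ in later releases; the file name is a placeholder):

```python
import tensorflow as tf

def parse_example(serialized):
    # Video-level features: an id, a variable-length list of label ids,
    # and fixed-size mean RGB / audio embeddings.
    features = {
        "video_id": tf.io.FixedLenFeature([], tf.string),
        "labels": tf.io.VarLenFeature(tf.int64),
        "mean_rgb": tf.io.FixedLenFeature([1024], tf.float32),
        "mean_audio": tf.io.FixedLenFeature([128], tf.float32),
    }
    return tf.io.parse_single_example(serialized, features)

dataset = tf.data.TFRecordDataset("train-0000.tfrecord").map(parse_example)
for ex in dataset.take(1):
    print(ex["video_id"].numpy(), tf.sparse.to_dense(ex["labels"]).numpy())
```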

9. Data Science Bowl 2017

This turns out to be the largest bounty offered to crack a data science problem: there is $1 million in prizes for data scientists who can detect lung cancer using this dataset of tagged CT scans.
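The scans are DICOM files, one folder of slices per patient. A minimal sketch of assembling a patient's 3D volume with pydicom (the patient folder name is a placeholder):

```python
import os
import numpy as np
import pydicom  # pip install pydicom

def load_ct_volume(patient_dir):
    """Stack a patient's DICOM slices into one 3D volume in Hounsfield units.

    Assumes one folder of .dcm slices per patient, as in the DSB 2017 release.
    """
    slices = [pydicom.dcmread(os.path.join(patient_dir, f))
              for f in os.listdir(patient_dir)]
    # Order slices along the scan axis (z position).
    slices.sort(key=lambda s: float(s.ImagePositionPatient[2]))
    volume = np.stack([s.pixel_array for s in slices]).astype(np.float32)
    # Convert raw pixel values to Hounsfield units via the stored rescale params.
    hu = volume * float(slices[0].RescaleSlope) + float(slices[0].RescaleIntercept)
    return hu.astype(np.int16)

volume = load_ct_volume("stage1/patient_id")  # hypothetical patient folder
print(volume.shape)  # (num_slices, 512, 512) for typical CT scans
```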

10. Exoplanets Dataset

Today, a team that includes MIT and is led by the Carnegie Institution for Science has released the largest collection of observations made with a technique called radial velocity, to be used for hunting exoplanets. The dataset can be downloaded here.
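Radial-velocity hunting boils down to finding periodic signals in (time, velocity) measurements. A hedged sketch using astropy's Lomb-Scargle periodogram; the CSV file and its column names are assumptions standing in for the released tables:

```python
import pandas as pd
from astropy.timeseries import LombScargle

# Hypothetical columns: observation time, radial velocity, and its uncertainty.
rv = pd.read_csv("star_rv.csv")  # assumed columns: time, rv, rv_err

# A Lomb-Scargle periodogram is the standard first step for spotting the
# periodic velocity wobble an orbiting planet induces in its host star.
frequency, power = LombScargle(rv["time"], rv["rv"], rv["rv_err"]).autopower()
best_period = 1.0 / frequency[power.argmax()]  # in the time column's units
print(f"strongest periodic signal: {best_period:.2f}")
```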

11. End-to-End Interpretation of the French Street Name Signs Dataset

This is a huge dataset of French street name signs labeled with what they denote. The dataset is easily readable with everyone's favorite TensorFlow and can be downloaded here.

12. A Realistic Dataset for the Smart Home Device Scheduling Problem for DCOPs

An upcoming dataset at the interface of IoT and AI. You can download it here.

13. RepEval 2017 Shared Task

From Sam Bowman's team, the creators of the famous SNLI dataset, this dataset on understanding the meaning of text is going to be released as a competition. The dataset is expected by March 15th. You can find it here once it's live.
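Until the new data lands, SNLI itself shows the likely format: JSON lines, each with a premise, a hypothesis, and a gold label. A sketch that assumes the same fields:

```python
import json

# SNLI ships as JSON lines; the new dataset is assumed to follow suit.
examples = []
with open("snli_1.0_train.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        if ex["gold_label"] != "-":  # "-" marks items with no annotator consensus
            examples.append((ex["sentence1"], ex["sentence2"], ex["gold_label"]))

print(examples[0])  # (premise, hypothesis, 'entailment'|'neutral'|'contradiction')
```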

14. Driver Speed Dataset

A huge 200 GB dataset aimed at estimating the speed of moving vehicles. It can be downloaded here.

15. NWPU-RESISC45 Remote sensing images dataset

A huge dataset of remote sensing images covering a wide array of landscapes as seen from satellites. Potential applications include satellite surveys, monitoring, and surveillance. Unfortunately, we are still waiting for the download link here.
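Scene-classification datasets like this one are typically distributed as one folder per class, which torchvision's ImageFolder handles directly. A sketch under that assumption (the root path is a placeholder):

```python
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Assumes the unpacked dataset has one subfolder per scene class
# (airport, forest, harbor, ...), the usual layout for RESISC45.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("NWPU-RESISC45", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(len(dataset.classes))  # 45 scene classes
images, labels = next(iter(loader))
print(images.shape)          # torch.Size([32, 3, 224, 224])
```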

16. Recipe to create your own free datasets from the open web

This is probably the most interesting of the datasets: it has been tagged not by humans but by machines. The authors also explain how to create a similar dataset from the millions of images already available on the web.
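As a starting point for such a recipe, here is a minimal, hedged sketch of the very first step: pulling a list of image URLs down to disk (the URL list file is hypothetical):

```python
import os
import requests  # pip install requests

def download_images(url_list, out_dir):
    """Fetch a list of image URLs into a local folder - the first step in
    building a machine-tagged dataset from the open web."""
    os.makedirs(out_dir, exist_ok=True)
    for i, url in enumerate(url_list):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip dead links rather than failing the whole crawl
        with open(os.path.join(out_dir, f"{i:06d}.jpg"), "wb") as f:
            f.write(resp.content)

# Usage with a hypothetical file of image URLs, one per line:
with open("image_urls.txt") as f:
    download_images([line.strip() for line in f], "web_images")
```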

17. The LIP Dataset

This large-scale dataset (LIP, "Look into Person") focuses on the semantic understanding of people. The download link for the dataset is here.

18. WikiReading Data

WikiReading is a large-scale natural language understanding task with a publicly available dataset of 18 million instances. The download link is here.

19. MUSCIMA++

MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. Here is the download link.

20. DeScript (Describing Script Structure)

DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. Here is the download link.


Reference:

http://blog.paralleldots.com/data-scientist/new-deep-learning-datasets-data-scientists/



Deep learning has many applications in natural language processing, one of which is text generation. Generating Tang poetry is a fun task that can be tackled with a deep learning model. Below is a simple example showing how to generate Tang poems with deep learning, together with the relevant dataset and code.

### Dataset

First we need a Tang poetry dataset. Commonly used options include the Complete Tang Poems (《全唐诗》) and other public Tang poetry corpora. These datasets contain large amounts of Tang poetry text that can be used to train a generative model.

### Code Example

We can implement Tang poetry generation in Python with a deep learning framework such as TensorFlow or PyTorch. Below is a simple PyTorch example (the corpus file name `tang_poetry.txt` is a placeholder for whichever corpus you use):

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Dataset: each example is a window of characters paired with the same
# window shifted by one, so the model learns next-character prediction.
class TangPoetryDataset(Dataset):
    def __init__(self, data, vocab, seq_length):
        self.data = data
        self.vocab = vocab
        self.seq_length = seq_length

    def __len__(self):
        return len(self.data) - self.seq_length

    def __getitem__(self, idx):
        return (
            torch.tensor([self.vocab[c] for c in self.data[idx:idx + self.seq_length]]),
            torch.tensor([self.vocab[c] for c in self.data[idx + 1:idx + self.seq_length + 1]])
        )

# Model: embedding -> LSTM -> linear projection back onto the vocabulary.
class PoetryModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(PoetryModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, hidden):
        x = self.embedding(x)
        x, hidden = self.lstm(x, hidden)
        x = self.fc(x)
        return x, hidden

# Training loop: fresh zero hidden state per batch, cross-entropy over the
# flattened (batch * seq_length) predictions against the flattened targets.
def train(model, dataloader, criterion, optimizer, epochs):
    for epoch in range(epochs):
        for inputs, targets in dataloader:
            hidden = (torch.zeros(1, inputs.size(0), hidden_size),
                      torch.zeros(1, inputs.size(0), hidden_size))
            outputs, hidden = model(inputs, hidden)
            loss = criterion(outputs.view(-1, vocab_size), targets.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')

# Load the corpus (placeholder file name).
with open('tang_poetry.txt', 'r', encoding='utf-8') as f:
    data = f.read()

# Build the character vocabulary (sorted for a deterministic mapping).
vocab = {c: i for i, c in enumerate(sorted(set(data)))}
vocab_size = len(vocab)

# Hyperparameters
embed_size = 128
hidden_size = 256
seq_length = 20
batch_size = 32
epochs = 10

# Dataset and data loader
dataset = TangPoetryDataset(data, vocab, seq_length)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Model, loss function, and optimizer
model = PoetryModel(vocab_size, embed_size, hidden_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

# Train the model
train(model, dataloader, criterion, optimizer, epochs)
```

### Code Notes

1. **Dataset definition**: the `TangPoetryDataset` class loads and windows the Tang poetry text.
2. **Model definition**: the `PoetryModel` class stacks an embedding layer, an LSTM layer, and a fully connected layer to generate poems.
3. **Training**: the `train` function runs the optimization loop.
4. **Hyperparameters**: embedding size, LSTM hidden size, sequence length, batch size, and number of epochs.
5. **Data loading**: the Tang poetry text is read from a file and a character vocabulary is built.
6. **Preprocessing**: text is mapped to integer indices and wrapped in a dataset and data loader.
7. **Initialization**: the model, loss function, and optimizer are created.
8. **Training run**: the model is trained.
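To actually produce a poem with the trained model, we still need a sampling loop. A minimal sketch that reuses the `model`, `vocab`, and `hidden_size` defined above (the starting character is arbitrary):

```python
def generate(model, vocab, start_char, length=20):
    """Sample a poem character by character from the trained model."""
    idx_to_char = {i: c for c, i in vocab.items()}
    model.eval()
    hidden = (torch.zeros(1, 1, hidden_size), torch.zeros(1, 1, hidden_size))
    inp = torch.tensor([[vocab[start_char]]])
    result = [start_char]
    with torch.no_grad():
        for _ in range(length - 1):
            out, hidden = model(inp, hidden)
            # Sample from the softmax distribution over the next character.
            probs = torch.softmax(out[0, -1], dim=-1)
            next_idx = torch.multinomial(probs, 1).item()
            result.append(idx_to_char[next_idx])
            inp = torch.tensor([[next_idx]])
    return "".join(result)

print(generate(model, vocab, "月"))
```

Sampling with `torch.multinomial` rather than a greedy argmax keeps the output varied; a temperature parameter could be added to trade coherence against diversity.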