[Datasets] A Roundup of Commonly Used Datasets in Artificial Intelligence

Link:

https://medium.com/startup-grind/fueling-the-ai-gold-rush-7ae438505bc2#.4ogf5l3xu


Original link:

http://weibo.com/1657470871/EvlMEm0EH?ref=home&rid=7_0_202_2669536424773680536&type=comment


It has never been easier to build AI or machine learning-based systems than it is today. The ubiquity of cutting-edge open-source tools such as TensorFlow, Torch, and Spark, coupled with the availability of massive amounts of computation power through AWS, Google Cloud, or other cloud providers, means that you can train cutting-edge models from your laptop over an afternoon coffee.

Though not at the forefront of the AI hype train, the unsung hero of the AI revolution is data — lots and lots of labeled and annotated data, curated with the elbow grease of great research groups and companies who recognize that the democratization of data is a necessary step towards accelerating AI.

However, most products involving machine learning or AI rely heavily on proprietary datasets that are often not released, as this provides implicit defensibility.

With that said, it can be hard to sort through which public datasets are useful to look at, which are viable for a proof of concept, and which can serve as a product or feature validation step before you collect your own proprietary data.

It’s important to remember that good performance on a dataset doesn’t guarantee that a machine learning system will perform well in real product scenarios. Most people in AI forget that the hardest part of building a new AI solution or product is not the AI or the algorithms; it’s the data collection and labeling. Standard datasets can be used for validation or as a good starting point for building a more tailored solution.
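One way to use a standard dataset as a validation step is to hold out part of it and score a model only on the held-out portion. The sketch below is a minimal illustration under stated assumptions, not a method from the original article: `train_validation_split`, `accuracy`, the toy dataset, and the threshold "model" are all hypothetical stand-ins for a real public dataset and a real learner.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    examples: Sequence[T], validation_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of `examples` and hold out a fraction for validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predict: Callable[[int], int], labeled: Sequence[Tuple[int, int]]) -> float:
    """Fraction of (feature, label) pairs for which predict(feature) == label."""
    correct = sum(1 for feature, label in labeled if predict(feature) == label)
    return correct / len(labeled)


# Toy stand-in for a labeled public dataset: the label is 1 when the number is >= 50.
toy_dataset = [(x, int(x >= 50)) for x in range(100)]
train, validation = train_validation_split(toy_dataset, validation_fraction=0.2, seed=42)

# Hypothetical "model": a threshold placed midway between the largest negative
# and the smallest positive example seen in the training split.
largest_negative = max(x for x, y in train if y == 0)
smallest_positive = min(x for x, y in train if y == 1)
threshold = (largest_negative + smallest_positive) / 2


def predict(x: int) -> int:
    return int(x >= threshold)


validation_accuracy = accuracy(predict, validation)
print(f"train={len(train)} validation={len(validation)} accuracy={validation_accuracy:.2f}")
```

A high score here only says the model fits this particular dataset. As noted above, data collected from the real product can look quite different, so the held-out number is best treated as a sanity check.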

This week, a few machine learning experts and I were talking about all this. To make your life easier, we’ve collected an (opinionated) list of some open datasets that you can’t afford not to know about in the AI world.


Computer Vision

  • MNIST

  • CIFAR 10 & CIFAR 100

  • ImageNet

  • LSUN

  • PASCAL VOC

  • SVHN

  • MS COCO

  • Visual Genome

  • Labeled Faces in the Wild

Natural Language

  • Text Classification Datasets 

  • WikiText

  • Question Pairs

  • SQuAD

  • CMU Q/A Dataset

  • Maluuba Datasets

  • Billion Words

  • Common Crawl

  • bAbI

  • The Children’s Book Test

  • Stanford Sentiment Treebank

  • 20 Newsgroups

  • Reuters

  • IMDB

  • UCI’s Spambase

Speech

Most speech recognition datasets are proprietary, because the data holds a lot of value for the company that curates it. Most of the datasets publicly available in the field are quite old.

  • 2000 HUB5 English

  • LibriSpeech

  • VoxForge

  • TIMIT

  • CHiME

  • TED-LIUM

Recommendation and ranking systems

  • Netflix Challenge

  • MovieLens

  • Million Song Dataset

  • Last.fm

Networks and Graphs

  • Amazon Co-Purchasing and Amazon Reviews

  • Friendster Social Network Dataset

Geospatial data

  • OpenStreetMap

  • Landsat 8

  • NEXRAD

