Here are 13 resources on Machine Learning data sets.
Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014. All new Landsat 8 scenes are made available each day, often within hours of production. MathWorks has created a freely-downloadable tool for accessing, processing, and visualizing Landsat on AWS data in MATLAB. With this tool, you can create a map display of scene locations with markers that show each scene’s metadata.
Category: GIS, Sensor Data, Satellite Imagery, Natural Resource
NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. Through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.
A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals. The Amazon mirror contains the complete data set from the project and the data can be found at: s3.amazonaws.com/1000genomes.
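Several of the data sets above live in public Amazon S3 buckets, so a quick way to see what is there is to ask S3 for a listing. The following is a minimal sketch using only the Python standard library. It assumes the bucket permits anonymous listing over the plain S3 REST interface; the helper names (`listing_url`, `parse_listing`, `list_objects`) are hypothetical and not part of any AWS tool.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build a path-style S3 listing URL (assumes anonymous access is allowed)."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://s3.amazonaws.com/{bucket}?{query}"


def parse_listing(xml_data):
    """Return (key, size_in_bytes) pairs from an S3 listing XML document."""
    root = ET.fromstring(xml_data)
    return [
        (entry.findtext(S3_NS + "Key"), int(entry.findtext(S3_NS + "Size")))
        for entry in root.iter(S3_NS + "Contents")
    ]


def list_objects(bucket, prefix="", max_keys=10):
    """Fetch and parse one page of a bucket listing (requires network access)."""
    with urllib.request.urlopen(listing_url(bucket, prefix, max_keys)) as resp:
        return parse_listing(resp.read())
```

Calling `list_objects("1000genomes")` should return the first few object keys and their sizes when the machine has network access. For anything beyond a quick look, an S3 client library is likely the better choice.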
MNIST database of handwritten digits
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
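The MNIST files are distributed in the simple IDX binary format: a 4-byte header (two zero bytes, a type code, and the number of dimensions), followed by one big-endian 32-bit size per dimension, followed by the raw values. The sketch below reads unsigned-byte IDX data such as the MNIST image and label files. The function name `parse_idx` is hypothetical, and the sketch assumes the file has already been decompressed (for example with Python's `gzip` module).

```python
import struct


def parse_idx(data: bytes):
    """Parse unsigned-byte IDX data into (dims, flat list of values)."""
    # Header: two zero bytes, one type byte (0x08 = unsigned byte),
    # and one byte holding the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")

    # One big-endian 32-bit unsigned integer per dimension.
    dims = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])

    expected = 1
    for size in dims:
        expected *= size

    payload = data[4 + 4 * ndim:]
    if len(payload) != expected:
        raise ValueError("payload length does not match the header")
    return dims, list(payload)
```

For an image file the returned dimensions are (number of images, rows, columns), and for a label file they are (number of labels,). The values come back as one flat list, so image `i` occupies the slice `values[i * rows * cols : (i + 1) * rows * cols]`.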
UCI Machine Learning Repository
The UC Irvine Machine Learning Repository currently maintains 333 data sets as a service to the machine learning community.
The Delve datasets and families are available from this page. Every dataset (or family) has a brief overview page and many also have detailed documentation. You can download gzipped-tar files of the datasets, but you will require the delve software environment to get maximum benefit from them. Datasets are categorized as primarily assessment, development or historical according to their recommended use. Within each category we have distinguished datasets as regression or classification according to how their prototasks have been created.
Data sets for nonlinear dimensionality reduction
This collection provides the Swiss roll and Faces data sets, two standard test cases for nonlinear dimensionality reduction.
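For readers who have not met the Swiss roll before, it is a two-dimensional sheet rolled up into a spiral in three dimensions. The sketch below generates synthetic points using one common parameterization. It is an illustration only, not the data distributed on that page, and the function name, constants, and seed are assumptions chosen for the example.

```python
import math
import random


def make_swiss_roll(n_points, seed=0):
    """Sample points on a Swiss roll; returns (points, angles)."""
    rng = random.Random(seed)
    points, angles = [], []
    for _ in range(n_points):
        # Position along the spiral, between 1.5*pi and 4.5*pi.
        t = 1.5 * math.pi * (1.0 + 2.0 * rng.random())
        # Position across the width of the sheet.
        height = 21.0 * rng.random()
        points.append((t * math.cos(t), height, t * math.sin(t)))
        angles.append(t)
    return points, angles
```

A good nonlinear dimensionality reduction method should recover the two underlying coordinates (the angle `t` and the height) from the three-dimensional points, which is why the angles are returned alongside them.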
mldata is a machine learning data set repository. It contains more than 800 public, archived data sets, along with their ratings, views, number of downloads, and comments.
When benchmarking an algorithm, it is advisable to use a standard test database (data set) so that researchers can compare results directly. Most mammographic databases are not publicly available. The most easily accessed, and therefore most commonly used, databases are the Mammographic Image Analysis Society (MIAS) database and the Digital Database for Screening Mammography (DDSM).
Mulan, a Java library for multi-label learning, provides multi-label classification data sets and multi-target regression data sets.
The Auton Lab encourages researchers to examine and replicate their findings. To facilitate this goal, they provide datasets identical to those used in their published works.
Datasets for "The Elements of Statistical Learning"
This page provides the data sets used in "The Elements of Statistical Learning", including Bone Mineral Density, Countries, Galaxy and many more.
http://www.datasciencecentral.com/m/blogpost?id=6448529:BlogPost:341263