Data Visualization and Exploration Sites
- Google Public Data, with dynamic visualization and exploration tools.
- Tableau Public, free software for visualizing and sharing data
- Swivel Public
Data repositories
- KDD Cup center, with all data, tasks, and results.
- UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
- UCI Machine Learning Repository.
- AWS (Amazon Web Services) Public Data Sets, a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
- Bioassay data, described in "Virtual screening of bioassay data" by Amanda Schierz, J. of Cheminformatics, with 21 bioassay datasets (active/inactive compounds) available for download.
- Canada Open Data, pilot project with many government and geospatial datasets.
- Causality Workbench data repository.
- Data.gov.uk, publicly available data from the UK (see also the London Datastore).
- Datamob, public data put to good use.
- DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
- DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many online US Government datasets.
- Delve, Data for Evaluating Learning in Valid Experiments
- Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
- FEDSTATS, a comprehensive source of US statistics and more
- FIMI repository for frequent itemset mining, implementations and datasets.
- Financial Data Finder at OSU, a large catalog of financial data sets
- GEO (Gene Expression Omnibus), a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated online resource for gene expression data browsing, query, and retrieval.
- Google ngrams datasets, text from millions of books scanned by Google.
- Grain Market Research, financial data including stocks, futures, etc.
- ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
- Infobiotics PSP (protein structure prediction) datasets, adjustable real-world family of benchmarks for testing the scalability of classification/regression methods.
- Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
- Investor Links, includes financial data
- Kevin Chai list of datasets, for text, SNA, and other fields.
- MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
- ML Data, the data repository of the EU PASCAL2 network.
- NASDAQ Data Store, providing access to market data.
- National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
- National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
- PubGene(TM) Gene Database and Tools, genomic-related publications database
- SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
- SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users' activities at the project management web site.
- StatLib, CMU Datasets Archive.
- STATOO Datasets part 1 and STATOO Datasets part 2
- UCR Time Series Classification/Clustering page, offering datasets, papers, links, and code.
- United States Census Bureau.
- Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources