Top 10 Open Dataset Resources on Github

Reposted: June 2, 2016, 09:22:33

The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.

Over the past several months we have had a look at a number of top Github repository collections.

This post will be a bit different, in that we are looking at the top open dataset repositories that Github has to offer. The post was inspired by the Github Open Data Showcase, which is good, but which is not very large. Ideally, I would like to make a list of the top open datasets on Github, period; however, this gets tricky, since searching for "open data," or any variant of this search term, is going to lead to complications on a site set up with the explicit goal of sharing open source projects and their data.

I decided to take the offerings in this showcase which were not explicitly noted as being out of date, add in 3 additional strictly-dataset repos with the highest numbers of stars I could find from a simple search, rank them all by star count, and present them here. We have found at KDnuggets that datasets are one of the most sought-after pieces of the data science puzzle for many readers, and hopefully this fresh batch (at least, fresh from our perspective) is of use to some of our readers.

We are currently conducting our latest Annual KDnuggets Analytics Software Poll, and so the particular percentages from last year may change, but we know that open source tools have been used by 73% of data scientists in the past 12 months. While this number reflects software, and not data, it is easy to surmise that open data is a heavily relied-upon commodity in data science and related data-oriented disciplines, for research, practice, and production alike.

So here they are, the open dataset repos with the highest number of stars as of the time of writing.

1. Awesome Public Datasets

Stars: 14137, Forks: 1573

Brought to us by Xiaming (Sammy) Chen, this seems to be the undisputed leader of the open dataset collections available on Github. This curated list is organized by such topics as biology, sports, museums, and natural language, and appears to include several hundred datasets. Most are free, but there is a disclaimer at the top of the list that some are not. Xiaming also points out 2 other awesome-branded repo lists that contain more datasets; however, since those lists contain all sorts of other big data/machine learning/data science links, they will not be included in the list below, despite their high number of stars. Feel free to explore them on your own... obviously.

2. OpenAddresses

Stars: 529, Forks: 510

This is the official repo of OpenAddresses, the free and open global address collection. Why addresses?

Street address data is essential infrastructure. Street names, house numbers and zip codes, when combined with geographic coordinates, are the hub that connects digital to physical places. Precisely because of their connecting role, free and open addresses are rocket fuel for civic and commercial innovation.

3. Congress Legislators

Stars: 417, Forks: 187

This repo is summed up by its description:

Members of the United States Congress, 1789-Present, in YAML, as well as committees, presidents, and vice presidents.

4. Open Exoplanet Catalogue

Stars: 300, Forks: 88

This is a catalog of all known discovered planets existing outside of our solar system. The database is generally updated within 24 hours of new discoveries, which means this is about as up-to-date as one could imagine; that the repo was last updated 20 days ago is encouraging in this respect. The README also points to a companion repo, should you be interested in a simple CSV of the data.

5. CitySDK

Stars: 274, Forks: 92

CitySDK is described as a "[u]ser-friendly [J]avascript SDK for US Census Bureau data," which also includes a number of samples detailing integration of the data with other open datasets. It refers to itself as a "toolbox" for civic hackers, and boasts latitude/longitude and ZIP code translation, and a modular architecture which makes integration with other data services straightforward. Use the API to create your own, custom dataset.

6. openFDA

Stars: 236, Forks: 53

openFDA is a project by the FDA, which aims to bring a collection of FDA public datasets to researchers and developers via APIs, raw data, usage examples, and documentation. Data is noted as not being suited for clinical use, and one should assume no specific validity of any data results included within. Even with these disclaimers, there is no doubt that the data here would be great practice for those interested in the domain.

7. Food Inspections Evaluation

Stars: 100, Forks: 44

In case the name "Chicago Food Inspections Evaluation" didn't give it away, here's what to expect from this repo:

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.

8. GSA Data

Stars: 92, Forks: 40

This contains various data published by the General Services Administration, which handles the basic functioning of federal agencies (offices, supplies, and the like). Specifically, it contains a collection of over 5000 .gov domains and their data.

9. US Congressional Districts

Stars: 82, Forks: 21

From the repo's README:

Historic and current US Congressional districts as GeoJSON, versioned within Git
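As a rough illustration of what working with GeoJSON district boundaries can look like, here is a minimal sketch using only Python's standard library. The sample feature, its "name" property, and its coordinates are hypothetical and hand-written for this example; they are not taken from the repo, whose actual file layout and property names may differ. The helper `bounding_box` is likewise an illustrative name.

```python
import json

# Hypothetical, hand-written GeoJSON Feature for illustration only.
# It is NOT taken from the repository; name and coordinates are made up.
SAMPLE = """
{
  "type": "Feature",
  "properties": {"name": "Example District"},
  "geometry": {
    "type": "Polygon",
    "coordinates": [
      [[-100.0, 40.0], [-99.0, 40.0], [-99.0, 41.0], [-100.0, 41.0], [-100.0, 40.0]]
    ]
  }
}
"""


def bounding_box(feature):
    """Return (min_lon, min_lat, max_lon, max_lat) for a Polygon feature."""
    geometry = feature["geometry"]
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    # The first ring is the exterior boundary; any later rings are holes.
    exterior = geometry["coordinates"][0]
    # GeoJSON positions are ordered longitude first, then latitude.
    lons = [point[0] for point in exterior]
    lats = [point[1] for point in exterior]
    return (min(lons), min(lats), max(lons), max(lats))


feature = json.loads(SAMPLE)
box = bounding_box(feature)
print(feature["properties"]["name"], box)
```

Real district files are likely to use MultiPolygon geometries in places, which this sketch deliberately rejects rather than guessing at.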

10. CERN Open Data Portal

Stars: 79, Forks: 34

This is the source code for the CERN Open Data Portal, described as "the access point to a growing range of data produced through the research performed at CERN."
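The ranking used for this list is simply a sort by star count. A minimal sketch of that step, using the star and fork figures quoted above as of the time of writing (the counts will have changed since; the tuple layout and the function name `rank_by_stars` are illustrative choices, not part of any Github tooling):

```python
# (name, stars, forks) as quoted in the list above at the time of writing.
REPOS = [
    ("Awesome Public Datasets", 14137, 1573),
    ("OpenAddresses", 529, 510),
    ("Congress Legislators", 417, 187),
    ("Open Exoplanet Catalogue", 300, 88),
    ("CitySDK", 274, 92),
    ("openFDA", 236, 53),
    ("Food Inspections Evaluation", 100, 44),
    ("GSA Data", 92, 40),
    ("US Congressional Districts", 82, 21),
    ("CERN Open Data Portal", 79, 34),
]


def rank_by_stars(repos):
    """Return the repos sorted by star count, highest first."""
    return sorted(repos, key=lambda repo: repo[1], reverse=True)


ranked = rank_by_stars(REPOS)
for position, (name, stars, forks) in enumerate(ranked, start=1):
    print(f"{position:2d}. {name} (stars: {stars}, forks: {forks})")
```

To refresh the list, one would replace the hard-coded figures with current counts and re-run the sort.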

