Top 10 Open Dataset Resources on Github

Reposted on June 2, 2016, 09:22:33

The top open dataset repositories on Github include a variety of data, freely available for use by researchers, practitioners, and students alike.

Over the past several months we have had a look at a number of top Github repository collections.

This post will be a bit different, in that we are looking at the top open dataset repositories that Github has to offer. The post was inspired by the Github Open Data Showcase, which is good, but which is not very large. Ideally, I would like to make a list of the top open datasets on Github, period; however, this gets tricky, since searching for "open data," or any variant of this search term, is going to lead to complications on a site set up with the explicit goal of sharing open source projects and their data.

I decided to take the offerings in this showcase that were not explicitly noted as being out of date, add in 3 additional strictly-dataset repos with the highest numbers of stars I could find from a simple search, rank them all accordingly, and present them here. We have found at KDnuggets that datasets are one of the most sought-after pieces of the data science puzzle for many readers, and hopefully this fresh batch (at least, fresh from our perspective) is of use to some of our readers.

We are currently conducting our latest Annual KDnuggets Analytics Software Poll, and so the particular percentages from last year may change, but we know that open source tools have been used by 73% of data scientists in the past 12 months. While this number reflects software, and not data, it is easy to surmise that open data is a heavily relied-upon commodity in data science and related data-oriented disciplines for research, practice, and production alike, for myriad reasons.

So here they are, the open dataset repos with the highest number of stars as of the time of writing.

1. Awesome Public Datasets

Stars: 14137, Forks: 1573

Brought to us by Xiaming (Sammy) Chen, this seems to be the undisputed leader of the open dataset collections available on Github. This curated list is organized by such topics as biology, sports, museums, and natural language, and appears to include several hundred datasets. Most are free, but there is a disclaimer at the top of the list that some are not. Xiaming also points out 2 other awesome-branded repo lists that contain more datasets; however, since those lists contain all sorts of other big data/machine learning/data science links, they will not be included in the list below, despite their high number of stars. Feel free to explore them on your own... obviously.

2. OpenAddresses

Stars: 529, Forks: 510

This is the official repo of OpenAddresses, the free and open global address collection. Why addresses?

Street address data is essential infrastructure. Street names, house numbers and zip codes, when combined with geographic coordinates, are the hub that connects digital to physical places. Precisely because of their connecting role, free and open addresses are rocket fuel for civic and commercial innovation.

3. Congress Legislators

Stars: 417, Forks: 187

This repo is summed up by its description:

Members of the United States Congress, 1789-Present, in YAML, as well as committees, presidents, and vice presidents.

4. Open Exoplanet Catalogue

Stars: 300, Forks: 88

This is a catalog of all known planets discovered outside our solar system. The database is generally updated within 24 hours of new discoveries, too, which means this is about as up-to-date as one could imagine; that the repo was last updated 20 days ago is encouraging in this respect. The README also points to a separate repo, should you be interested in a simple CSV of the data.

5. CitySDK

Stars: 274, Forks: 92

CitySDK is described as a "[u]ser-friendly [J]avascript SDK for US Census Bureau data," which also includes a number of samples detailing integration of the data with other open datasets. It refers to itself as a "toolbox" for civic hackers, and boasts latitude/longitude and ZIP code translation, and a modular architecture which makes integration with other data services straightforward. Use the API to create your own, custom dataset.

6. openFDA

Stars: 236, Forks: 53

openFDA is a project by the FDA, which aims to bring a collection of FDA public datasets to researchers and developers via APIs, raw data, usage examples, and documentation. Data is noted as not being suited for clinical use, and one should assume no specific validity of any data results included within. Even with these disclaimers, there is no doubt that the data here would be great practice for those interested in the domain.
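As a rough illustration of the API access mentioned above, here is a minimal Python sketch that only builds a query URL and does not send any request. The helper name `build_openfda_url` is hypothetical, and the drug adverse event endpoint and the search field used here are assumptions chosen for the example; check the openFDA documentation for the endpoints and fields that actually suit your question.

```python
from urllib.parse import urlencode

# Base URL of the public openFDA API.
OPENFDA_BASE = "https://api.fda.gov"


def build_openfda_url(endpoint: str, search: str, limit: int = 5) -> str:
    """Build a query URL for an openFDA endpoint (hypothetical helper).

    `endpoint` is a path such as "drug/event"; `search` is a query in
    openFDA's field:value syntax; `limit` caps the number of records.
    """
    query = urlencode({"search": search, "limit": limit})
    return f"{OPENFDA_BASE}/{endpoint}.json?{query}"


# Assumed example: adverse event reports that mention aspirin.
url = build_openfda_url(
    "drug/event",
    'patient.drug.medicinalproduct:"aspirin"',
    limit=3,
)
print(url)
```

The resulting URL can be fetched with any HTTP client. Keeping URL construction separate from the request makes it easy to inspect the query before touching the network.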


7. Food Inspections Evaluation

Stars: 100, Forks: 44

In case the name "Chicago Food Inspections Evaluation" didn't give it away, here's what to expect from this repo:

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.

8. GSA Data

Stars: 92, Forks: 40

This contains various data published by the General Services Administration, which handles the basic functioning of federal agencies (offices, supplies, and the like). Specifically, it contains a collection of over 5000 .gov domains and their data.

9. US Congressional Districts

Stars: 82, Forks: 21

From the repo's README:

Historic and current US Congressional districts as GeoJSON, versioned within Git
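Since the districts are published as GeoJSON, they can be read with nothing more than a JSON parser. The sketch below uses Python's standard `json` module on a tiny, made-up FeatureCollection; the `name` property and the coordinates are placeholders, not the repo's actual schema or data, and `summarize_features` is a hypothetical helper.

```python
import json

# Hypothetical, heavily simplified stand-in for a district file. Real files
# carry full boundary geometry and whatever properties the repo defines.
SAMPLE = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"name": "Example District 1"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]
      }
    }
  ]
}
"""


def summarize_features(geojson_text: str) -> list[tuple[str, str, int]]:
    """Return (name, geometry type, outer-ring vertex count) per feature."""
    collection = json.loads(geojson_text)
    summary = []
    for feature in collection["features"]:
        geometry = feature["geometry"]
        name = feature["properties"].get("name", "unknown")
        # For a Polygon, coordinates[0] is the outer ring; other geometry
        # types are not unpacked in this sketch and report 0 vertices.
        if geometry["type"] == "Polygon":
            vertex_count = len(geometry["coordinates"][0])
        else:
            vertex_count = 0
        summary.append((name, geometry["type"], vertex_count))
    return summary


rows = summarize_features(SAMPLE)
for row in rows:
    print(row)
```

For real analysis, a geospatial library would handle MultiPolygon geometries and projections; the point here is only that the files are plain JSON and easy to inspect.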

10. CERN Open Data Portal

Stars: 79, Forks: 34

This is the source code for the CERN Open Data Portal, described as "the access point to a growing range of data produced through the research performed at CERN."

