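GitHub, the world's largest code hosting platform, sees thousands of new projects every hour and has made a lasting contribution to open source. This article uses NetworkX to run a graph analysis of GitHub: drawing on GitHub's rich data, we build a data model that can be used in many different ways. Here, GitHub users, code, and repositories are assembled into an interest graph. The article covers four topics:
The GitHub developer platform and its API
How to draw graphs with NetworkX
Building the GitHub interest graph
Graph algorithms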
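Understanding the GitHub API
As with Twitter and Facebook, the first step is to get to know GitHub's own API. Documentation: https://developer.github.com/v3/
Most of it is not needed here; we only care about users and repositories. All API access is over HTTPS, through https://api.github.com. All data is sent and received as JSON.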
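Creating an API connection
GitHub implements OAuth. Once you have a GitHub account, there are two ways to obtain API access. The first is a personal access token, which you create for your own use. The second is an OAuth application, which implements the web flow that lets other users authorize your application; every developer must register the application before getting started, and each registered OAuth application is assigned a unique client ID and client secret. For simplicity, this article uses a personal access token. On the token settings page, click the button to generate a new token and select the permissions it should carry.
Send a request to the API root to check that the token works. The token goes in the Authorization header, since GitHub no longer accepts it as an access_token query parameter:
curl -H "Authorization: token $TOKEN" https://api.github.com/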
{
"current_user_url": "https://api.github.com/user",
"current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
"authorizations_url": "https://api.github.com/authorizations",
"code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
"emails_url": "https://api.github.com/user/emails",
"emojis_url": "https://api.github.com/emojis",
"events_url": "https://api.github.com/events",
"feeds_url": "https://api.github.com/feeds",
"followers_url": "https://api.github.com/user/followers",
"following_url": "https://api.github.com/user/following{/target}",
"gists_url": "https://api.github.com/gists{/gist_id}",
"hub_url": "https://api.github.com/hub",
"issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
"issues_url": "https://api.github.com/issues",
"keys_url": "https://api.github.com/user/keys",
"notifications_url": "https://api.github.com/notifications",
"organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
"organization_url": "https://api.github.com/orgs/{org}",
"public_gists_url": "https://api.github.com/gists/public",
"rate_limit_url": "https://api.github.com/rate_limit",
"repository_url": "https://api.github.com/repos/{owner}/{repo}",
"repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
"current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
"starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
"starred_gists_url": "https://api.github.com/gists/starred",
"team_url": "https://api.github.com/teams",
"user_url": "https://api.github.com/users/{user}",
"user_organizations_url": "https://api.github.com/user/orgs",
"user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
"user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}
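The same request can be prepared from Python. The following is a minimal sketch using only the standard library; `build_request` and `API_ROOT` are illustrative names that do not come from the article, and the `token` authorization scheme and the v3 `Accept` media type are assumptions based on the v3 API this article targets. The request is only constructed here, not sent.

```python
import urllib.request

API_ROOT = "https://api.github.com"  # base URL quoted in the article


def build_request(path, token):
    """Return an unsent Request for `path` that carries the token in a header.

    Hypothetical helper for illustration; not part of any GitHub library.
    """
    return urllib.request.Request(
        API_ROOT + path,
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github.v3+json",
        },
    )


req = build_request("/user", "YOUR_TOKEN")  # placeholder token
print(req.full_url)
print(req.get_header("Authorization"))

# To actually send it (needs network access and a valid token):
# import json
# with urllib.request.urlopen(req) as resp:
#     profile = json.load(resp)
```

Keeping the token in a header rather than in the URL also keeps it out of server logs and shell history.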
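GitHub's API follows the HATEOAS design: the root response lists the URLs of the other resources. As it shows, to get the current user's information you request api.github.com/user, which returns the following result.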
{
"login": "luzhijun",
"id": 15256911,
"avatar_url": "https://avatars.githubusercontent.com/u/15256911?v=3",
"gravatar_id": "",
"url": "https://api.github.com/users/luzhijun",
"html_url": "https://github.com/luzhijun",
...
}
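Because the root document hands out URL templates instead of fixed paths, a client can look up a link by name and fill in its placeholders. The sketch below shows one simplified way to do that. `ROOT` is a small excerpt of the root response, and `expand` is a hypothetical helper that only handles plain `{name}` placeholders and drops the optional `{?...}`, `{&...}`, and `{/...}` parts; it is not a full URI Template implementation.

```python
import re

# Excerpt of the root document returned by https://api.github.com
ROOT = {
    "current_user_url": "https://api.github.com/user",
    "user_url": "https://api.github.com/users/{user}",
    "user_repositories_url": (
        "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}"
    ),
}


def expand(template, **values):
    """Fill {name} placeholders; drop optional {?..}, {&..}, {/..} parts."""
    # Remove optional expressions such as {?type,page} or {/target}.
    template = re.sub(r"\{[?&/][^}]*\}", "", template)
    # Substitute the remaining simple {name} placeholders.
    return re.sub(r"\{(\w+)\}", lambda m: str(values[m.group(1)]), template)


user_repos_url = expand(ROOT["user_repositories_url"], user="luzhijun")
print(user_repos_url)  # https://api.github.com/users/luzhijun/repos
```

Resolving links by name means the client keeps working if GitHub changes the path behind a given key.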
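PyGithub
Here is a package that makes all of this convenient from Python: PyGithub. It lets you manage GitHub from Python scripts, and its API mirrors GitHub's. For example, to fetch all repositories of a given user: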
from github import Github
# Specify your own access token here
ACCESS_TOKEN = ''
USER = 'luzhijun'
client = Github(ACCESS_TOKEN)
user = client.get_user(USER)
repos = user.get_repos()
print(list(repos))
[Repository(full_name="luzhijun/huxblog-boilerplate"), Repository(full_name="luzhijun/leecode"), Repository(full_name="luzhijun/luzhijun.github.io"), Repository(full_name="luzhijun/Optimization"), Repository(full_name="luzhijun/SVDRecommenderSystem")]
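How to draw graphs with NetworkX
Getting to know NetworkX
NetworkX is a Python tool for graph theory and complex network modeling. It ships with the commonly used graph and complex network analysis algorithms, which makes tasks such as network data analysis and simulation modeling straightforward.
The following creates a directed graph X->Y: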
import networkx as nx
# Create a directed graph
g = nx.DiGraph()
# Add an edge X->Y
g.add_edge('X', 'Y')
# Print summary information about the graph
# (nx.info was removed in NetworkX 3.0; use print(g) there)
print(nx.info(g), '\n')
print("Nodes:", g.nodes())
print("Edges:", g.edges())
# Node attributes (g.node was removed in NetworkX 2.4; use g.nodes)
print("X props:", g.nodes['X'])
print("Y props:", g.nodes['Y'])
# Edge attributes
print("X=>Y props:", g['X']['Y'])
# Update node attributes
g.nodes['X'].update({'prop1': 'value1'})
print("X props:", g.nodes['X'])
# Update edge attributes
g['X']['Y'].update({'label': 'label1'})
print("X=>Y props:", g['X']['Y'])
Name:
Type: DiGraph  # an undirected graph would show Graph
Number of nodes: 2
Number of edges: 1
Average in degree: 0.5000
Average out degree: 0.5000
Nodes: ['Y', 'X']
Edges: [('X', 'Y')]
X props: {}
Y props: {}
X=>Y props: {}
X props: {'prop1': 'value1'}
X=>Y props: {'label': 'label1'}
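Attributes can also be attached at the moment a node or edge is created, and a directed graph keeps incoming and outgoing edges separate. The short sketch below illustrates both, assuming NetworkX 2.0 or later. The attribute names `kind` and `label` and their values are made up for illustration.

```python
import networkx as nx

g = nx.DiGraph()

# Attributes can be passed as keyword arguments when adding nodes and edges.
g.add_node('X', kind='user')
g.add_node('Y', kind='repo')
g.add_edge('X', 'Y', label='stars')

# A directed graph tracks in-degree and out-degree separately.
in_y = g.in_degree('Y')    # number of edges pointing at Y
out_x = g.out_degree('X')  # number of edges leaving X
print(in_y, out_x)         # 1 1

# Walk along, or against, the edge direction.
print(list(g.successors('X')))    # ['Y']
print(list(g.predecessors('Y')))  # ['X']

# Read the attributes back.
print(g.nodes['X']['kind'], g.edges['X', 'Y']['label'])  # user stars
```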
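Both directed and undirected graphs can carry edge weights. The method to use is add_weighted_edges_from, which takes a list of one or more triples (u, v, w), where u is the source node, v is the target node, and w is the weight. For example: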
g.add_weighted_edges_from([('X', 'Y', 10.0)])
print(g.get_edge_data('X', 'Y'))
{'weight': 10.0, 'label': 'label1'}
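NetworkX provides the classic graph algorithms, such as DFS, BFS, shortest paths, minimum spanning trees, and maximum flow. Even if you are not working on complex networks and only need graph theory, NetworkX works well as a base library. See the NetworkX documentation for more.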
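As a small illustration of two of these algorithms, the sketch below builds a weighted directed graph and runs a weighted shortest-path query and a breadth-first traversal. The node names and weights are made up for the example.

```python
import networkx as nx

g = nx.DiGraph()
g.add_weighted_edges_from([
    ('A', 'B', 1.0),
    ('B', 'C', 2.0),
    ('A', 'C', 5.0),
])

# Weighted shortest path: A->B->C costs 3.0, cheaper than the direct edge (5.0).
path = nx.shortest_path(g, 'A', 'C', weight='weight')
length = nx.shortest_path_length(g, 'A', 'C', weight='weight')
print(path, length)  # ['A', 'B', 'C'] 3.0

# Breadth-first search from A, reported as the tree edges it follows.
bfs_edges = list(nx.bfs_edges(g, 'A'))
print(bfs_edges)  # [('A', 'B'), ('A', 'C')]
```

Without `weight='weight'`, `shortest_path` counts hops instead, so it would return the direct edge A->C.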
Building the interest graph with NetworkX
An interest graph is different from a social network graph