Install Dato (GraphLab Create)
Dato requires registration before use and comes with a 30-day trial.
Below, a Python virtual environment is used to set up a clean Dato test environment:
# Create a virtual environment named dato-env
virtualenv dato-env
# Activate the virtual environment
source dato-env/bin/activate
# Make sure pip is up to date
pip install --upgrade pip
# Install IPython Notebook (optional)
pip install "ipython[notebook]"
# Install your licensed copy of GraphLab Create
pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.5.2/EMAIL/KEY/GraphLab-Create-License.tar.gz
To upgrade from an older version, run the following from inside the dato-env directory: bin/pip install graphlab-create==1.5.2
Test that Dato works:
➜ dato-env bin/python
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import graphlab as gl
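If no error is raised, the graphlab Python package is ready to use.
If Python is started from the wrong path, for example from outside dato-env or by typing plain python without activating the environment, the import fails because the graphlab module cannot be found.
The system already has its own Python, and that interpreter does not see packages installed in the virtual environment, so be sure to use the virtual environment's Python.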
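A quick way to confirm which interpreter is running is to ask Python itself. The following is a minimal sketch; the helper names in_virtualenv and can_import are made up for illustration, and the check relies on sys.real_prefix (set by older virtualenv releases) or sys.base_prefix (venv and newer virtualenv releases).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make base_prefix differ from prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def can_import(module_name):
    """Return True when the named module can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


print("interpreter: " + sys.executable)
print("in virtualenv: " + str(in_virtualenv()))
print("graphlab importable: " + str(can_import("graphlab")))
```

If the interpreter path does not point inside dato-env, or the last line reports False, the wrong Python is probably in use.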
Then follow https://dato.com/learn/gallery/notebooks/getting_started_with_graphlab...
Getting Started with GraphLab Create
1. Load the data into SFrames
SFrame: a tabular data structure, well suited to data wrangling and feature engineering
Graph: a structure well suited to sparse data
vertices = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv')
edges = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv')
When reading a CSV file, gl infers the type of each column from the first rows of the file:
bond_vertices: [str,str,int,int]
bond_edges: [str,str,str]
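To make the idea of type inference concrete, here is a minimal pure-Python sketch that guesses column types from the first rows of CSV text. It is not how GraphLab implements read_csv; the helper names infer_type and infer_column_types, the rows_to_inspect parameter, and the int-then-float-then-str fallback order are all assumptions for illustration. The two sample rows are taken from the vertices table.

```python
import csv
import itertools

# Two sample rows in the layout of bond_vertices.csv (header plus data).
SAMPLE = """name,gender,license_to_kill,villian
James Bond,M,1,0
Elliot Carver,M,0,1
"""


def infer_type(values):
    """Return int or float if every value parses as that type, else str."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            continue
        return candidate
    return str


def infer_column_types(text, rows_to_inspect=100):
    """Guess a type per column from the first rows_to_inspect data rows."""
    reader = csv.reader(text.splitlines())
    header = next(reader)
    rows = list(itertools.islice(reader, rows_to_inspect))
    columns = zip(*rows)  # transpose rows into columns
    return dict((name, infer_type(col)) for name, col in zip(header, columns))


for name, kind in sorted(infer_column_types(SAMPLE).items()):
    print(name + ": " + kind.__name__)
```

For this sample the sketch arrives at the same [str,str,int,int] layout listed above for bond_vertices.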
To inspect the vertices and edges, simply enter the variable name:
>>> vertices
+----------------+--------+-----------------+---------+
|      name      | gender | license_to_kill | villian |
+----------------+--------+-----------------+---------+
|   James Bond   |   M    |        1        |    0    |
|       M        |   M    |        1        |    0    |
|   Moneypenny   |   F    |        1        |    0    |
|       Q        |   M    |        1        |    0    |
|    Wai Lin     |   F    |        1        |    0    |
| Inga Bergstorm |   F    |        0        |    0    |
| Elliot Carver  |   M    |        0        |    1    |
|  Paris Carver  |   F    |        0        |    1    |
|   Gotz Otto    |   M    |        0        |    1    |
|  Henry Gupta   |   M    |        0        |    1    |
+----------------+--------+-----------------+---------+
>>> edges
+----------------+------------+------------+
|      src       |    dst     |  relation  |
+----------------+------------+------------+
|    Wai Lin     | James Bond |   friend   |
|       M        | James Bond |  worksfor  |
| Inga Bergstorm | James Bond |   friend   |
| Elliot Carver  | James Bond | killed_by  |
|   Gotz Otto    | James Bond | killed_by  |
|   James Bond   |     M      | managed_by |
|       Q        |     M      | managed_by |
|   Moneypenny   |     M      | managed_by |
|       Q        | Moneypenny | colleague  |
|       M        | Moneypenny |  worksfor  |
+----------------+------------+------------+
2. Create a Graph object and add the vertices and edges
g = gl.SGraph()
g = g.add_vertices(vertices=vertices, vid_field='name')
g = g.add_edges(edges=edges, src_field='src', dst_field='dst')
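Inspect the structure of the graph. Note that the vertex field name has been renamed to __id, and the edge fields src and dst to __src_id and __dst_id. The graph reports 20 edges, whereas the edges table printed earlier shows only 10 rows, presumably because the SFrame display is truncated.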
>>> g
SGraph({'num_edges': 20, 'num_vertices': 10})
Vertex Fields:['__id', 'gender', 'license_to_kill', 'villian']
Edge Fields:['__src_id', '__dst_id', 'relation']
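The renaming can be pictured with plain dictionaries. The sketch below is only an illustration of the field mapping, not GraphLab's SGraph; the function name build_graph_sketch and the dictionary-based layout are assumptions, while the sample rows come from the tables shown earlier.

```python
def build_graph_sketch(vertex_rows, edge_rows, vid_field, src_field, dst_field):
    """Copy the rows, renaming the id fields the way the SGraph output shows."""
    vertices = []
    for row in vertex_rows:
        renamed = dict(row)
        renamed["__id"] = renamed.pop(vid_field)
        vertices.append(renamed)
    edges = []
    for row in edge_rows:
        renamed = dict(row)
        renamed["__src_id"] = renamed.pop(src_field)
        renamed["__dst_id"] = renamed.pop(dst_field)
        edges.append(renamed)
    return {"vertices": vertices, "edges": edges}


sample_vertices = [
    {"name": "James Bond", "gender": "M", "license_to_kill": 1, "villian": 0},
    {"name": "M", "gender": "M", "license_to_kill": 1, "villian": 0},
]
sample_edges = [
    {"src": "M", "dst": "James Bond", "relation": "worksfor"},
    {"src": "James Bond", "dst": "M", "relation": "managed_by"},
]

graph = build_graph_sketch(sample_vertices, sample_edges, "name", "src", "dst")
print(sorted(graph["vertices"][0]))
print(sorted(graph["edges"][0]))
```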
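The graph object provides methods for retrieving the edges and vertices. Their output is similar to that of the original vertices and edges variables.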
g.get_vertices()
g.get_edges()
3. Compute PageRank on the graph
>>> pr = gl.pagerank.create(g)
PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 6.65833               |
PROGRESS: | 2         | 4.65611               |
PROGRESS: | 3         | 3.46298               |
PROGRESS: | 4         | 2.55686               |
PROGRESS: | 5         | 1.95422               |
PROGRESS: | 6         | 1.42139               |
PROGRESS: | 7         | 1.10464               |
PROGRESS: | 8         | 0.806704              |
PROGRESS: | 9         | 0.631771              |
PROGRESS: | 10        | 0.465388              |
PROGRESS: | 11        | 0.364898              |
PROGRESS: | 12        | 0.271257              |
PROGRESS: | 13        | 0.212255              |
PROGRESS: | 14        | 0.159062              |
PROGRESS: | 15        | 0.124071              |
PROGRESS: | 16        | 0.0935911             |
PROGRESS: | 17        | 0.0727674             |
PROGRESS: | 18        | 0.0551714             |
PROGRESS: | 19        | 0.0427744             |
PROGRESS: | 20        | 0.0325555             |
PROGRESS: +-----------+-----------------------+
As shown above, calling gl's pagerank.create method with the Graph object we built returns the pr model object.
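The progress table reports an "L1 change in pagerank" per iteration, which suggests an iterative update that stops once the change is small enough or an iteration cap is reached. The following is a minimal pure-Python sketch of that idea on a hypothetical four-vertex toy graph. The function name pagerank_sketch, the toy edges, and the update rule (every vertex starts at 1.0 and receives reset_probability plus the damped share of its in-neighbors' ranks) are assumptions for illustration, not GraphLab's actual implementation, so the numbers are not comparable with the Bond output.

```python
def pagerank_sketch(edges, reset_probability=0.15, threshold=0.01,
                    max_iterations=20):
    """Iterate a simple PageRank update; return (ranks, L1 change per iteration)."""
    vertices = sorted(set(v for edge in edges for v in edge))
    out_degree = dict((v, 0) for v in vertices)
    for src, _dst in edges:
        out_degree[src] += 1

    rank = dict((v, 1.0) for v in vertices)
    history = []
    for _ in range(max_iterations):
        incoming = dict((v, 0.0) for v in vertices)
        # Each source splits its rank evenly over its outgoing edges.
        # Vertices without outgoing edges never appear as src here.
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        new_rank = dict(
            (v, reset_probability + (1.0 - reset_probability) * incoming[v])
            for v in vertices
        )
        l1_change = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        history.append(l1_change)
        if l1_change < threshold:
            break
    return rank, history


# Hypothetical toy graph: a cycle a -> b -> c -> a, plus d -> a.
TOY_EDGES = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]

ranks, changes = pagerank_sketch(TOY_EDGES)
for iteration, change in enumerate(changes, 1):
    print("%2d  %.6f" % (iteration, change))
```

On this toy graph the L1 change is still above 0.01 after 20 iterations, so the loop stops at the iteration cap. The run above behaves similarly: its last reported change (0.0326) is above the 0.01 threshold shown in the model settings, and it ran for the full 20 iterations.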
>>> pr
Class : PagerankModel
Graph
-----
num_edges : 20
num_vertices : 10
Results
-------
graph : SGraph. See m['graph']
change in last iteration (L1 norm) : 0.0326
vertex pagerank : SFrame. See m['pagerank']
Settings
--------
maximun number of iterations : 20
convergence threshold (L1 norm) : 0.01
probablity of random jumps to any node in the graph: 0.15
Metrics
-------
training time (secs) : 1.0853
number of iterations : 20
Queryable Fields
----------------
training_time : Total training time of the model
graph : A new SGraph with the pagerank as a vertex property
delta : Change in pagerank for the last iteration in L1 norm
reset_probability : The probablity of randomly jumps to any node in the graph
pagerank : An SFrame with each vertex's pagerank
num_iterations : Number of iterations
threshold : The convergence threshold in L1 norm
max_iterations : The maximun number of iterations to run
All of the queryable fields listed above can be retrieved with pr.get():
>>> pr.get('pagerank')
+----------------+----------------+-------------------+
|      __id      |    pagerank    |       delta       |
+----------------+----------------+-------------------+
|   Moneypenny   | 1.18363921275  | 0.00143637385736  |
| Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
|  Henry Gupta   | 0.284762885673 | 1.89255522874e-05 |
|  Paris Carver  | 0.284762885673 | 1.89255522874e-05 |
|       Q        | 1.18363921275  | 0.00143637385736  |
|    Wai Lin     | 0.869872717136 | 0.00477951418076  |
|       M        | 1.87718696576  | 0.00666194771763  |
|   James Bond   | 2.52743578524  |  0.0132914517076  |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
|   Gotz Otto    | 0.284762885673 | 1.89255522874e-05 |
+----------------+----------------+-------------------+
The result above is unsorted, though. Sorting by the pagerank column with topk gives the most important person: Bond!
>>> pr.get('pagerank').topk(column_name='pagerank')
+----------------+----------------+-------------------+
|      __id      |    pagerank    |       delta       |
+----------------+----------------+-------------------+
|   James Bond   | 2.52743578524  |  0.0132914517076  |
|       M        | 1.87718696576  | 0.00666194771763  |
|   Moneypenny   | 1.18363921275  | 0.00143637385736  |
|       Q        | 1.18363921275  | 0.00143637385736  |
| Inga Bergstorm | 0.869872717136 | 0.00477951418076  |
|    Wai Lin     | 0.869872717136 | 0.00477951418076  |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
|  Henry Gupta   | 0.284762885673 | 1.89255522874e-05 |
|  Paris Carver  | 0.284762885673 | 1.89255522874e-05 |
|   Gotz Otto    | 0.284762885673 | 1.89255522874e-05 |
+----------------+----------------+-------------------+
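For readers without GraphLab at hand, the same top-k selection can be sketched in plain Python with heapq.nlargest. The helper name topk_sketch and the default k=10 are assumptions for illustration; the (name, pagerank) pairs are copied from the pr.get('pagerank') output above.

```python
import heapq

# (vertex id, pagerank) pairs copied from the unsorted output above.
PAGERANK_ROWS = [
    ("Moneypenny", 1.18363921275),
    ("Inga Bergstorm", 0.869872717136),
    ("Henry Gupta", 0.284762885673),
    ("Paris Carver", 0.284762885673),
    ("Q", 1.18363921275),
    ("Wai Lin", 0.869872717136),
    ("M", 1.87718696576),
    ("James Bond", 2.52743578524),
    ("Elliot Carver", 0.634064732205),
    ("Gotz Otto", 0.284762885673),
]


def topk_sketch(rows, k=10):
    """Return the k rows with the largest pagerank, highest first."""
    return heapq.nlargest(k, rows, key=lambda row: row[1])


for name, score in topk_sketch(PAGERANK_ROWS, k=3):
    print("%-12s %.4f" % (name, score))
```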