hello dato--graphlab create

weixin_33757911

Tags: python, artificial intelligence

Original link: https://segmentfault.com/a/1190000003043101

Install Dato (GraphLab Create)

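Dato requires registration before use and comes with a 30-day trial.
The steps below use a Python virtual environment to set up a clean Dato test environment: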

# Create a virtual environment named dato-env
virtualenv dato-env

# Activate the virtual environment
source dato-env/bin/activate

# Make sure pip is up to date
pip install --upgrade pip

# Install IPython Notebook (optional)
pip install "ipython[notebook]"

# Install your licensed copy of GraphLab Create
pip install --upgrade --no-cache-dir https://get.dato.com/GraphLab-Create/1.5.2/EMAIL/KEY/GraphLab-Create-License.tar.gz

To upgrade from an older version, run the following from inside the dato-env directory: bin/pip install graphlab-create==1.5.2

Verify that Dato works:

➜  dato-env  bin/python
Python 2.7.8 (default, Oct 20 2014, 15:05:19) 
[GCC 4.9.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import graphlab as gl

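If no error is reported, the graphlab Python package is ready to use.
If you start Python from the wrong place, for example from outside dato-env, or by simply typing python without the virtual environment activated, the import fails because the graphlab module cannot be found.
The system already has its own Python, and that interpreter does not see the packages installed in the virtual environment. Always use the virtual environment's Python.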
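
When the import fails, it helps to confirm which interpreter is actually running. The following is a minimal sketch, not part of the Dato tooling; the helper names in_virtualenv and module_available are made up for illustration.

```python
import sys


def in_virtualenv():
    # virtualenv sets sys.real_prefix; the stdlib venv module makes
    # sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def module_available(name):
    # Try the import and report whether it succeeded.
    try:
        __import__(name)
        return True
    except ImportError:
        return False


print("interpreter: %s" % sys.executable)
print("in virtualenv: %s" % in_virtualenv())
print("graphlab importable: %s" % module_available("graphlab"))
```

If the interpreter path does not point inside dato-env, the shell is picking up the system Python instead.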

Then follow https://dato.com/learn/gallery/notebooks/getting_started_with_graphlab...

Getting Started with GraphLab Create

1. Load the data into SFrames

SFrame: a tabular data structure, well suited to data munging and feature building
SGraph: a graph structure, well suited to sparse data

vertices = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_vertices.csv')
edges = gl.SFrame.read_csv('http://s3.amazonaws.com/dato-datasets/bond/bond_edges.csv')

When reading a CSV file, gl infers the type of each column from the first line of the file:
bond_vertices: [str,str,int,int]
bond_edges: [str,str,str]
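
To make the idea of type inference concrete, here is a rough standard-library sketch. It is not GraphLab's implementation: the inline sample, infer_type, and infer_column_types are illustrative assumptions, and only the first data row is inspected.

```python
import csv

# Two rows copied from the vertices table, inlined so no download is needed.
SAMPLE = (
    "name,gender,license_to_kill,villian\n"
    "James Bond,M,1,0\n"
    "Elliot Carver,M,0,1\n"
)


def infer_type(value):
    # Try the narrowest numeric type first, then fall back to str.
    for caster in (int, float):
        try:
            caster(value)
            return caster
        except ValueError:
            pass
    return str


def infer_column_types(text):
    reader = csv.reader(text.splitlines())
    header = next(reader)     # column names
    first_row = next(reader)  # the row used for inference
    return header, [infer_type(value) for value in first_row]


header, types = infer_column_types(SAMPLE)
print([t.__name__ for t in types])  # ['str', 'str', 'int', 'int']
```

The result matches the [str,str,int,int] reported for bond_vertices.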

To inspect the vertices and edges, just evaluate the variable name:

>>> vertices
+----------------+--------+-----------------+---------+
|      name      | gender | license_to_kill | villian |
+----------------+--------+-----------------+---------+
|   James Bond   |   M    |        1        |    0    |
|       M        |   M    |        1        |    0    |
|   Moneypenny   |   F    |        1        |    0    |
|       Q        |   M    |        1        |    0    |
|    Wai Lin     |   F    |        1        |    0    |
| Inga Bergstorm |   F    |        0        |    0    |
| Elliot Carver  |   M    |        0        |    1    |
|  Paris Carver  |   F    |        0        |    1    |
|   Gotz Otto    |   M    |        0        |    1    |
|  Henry Gupta   |   M    |        0        |    1    |
+----------------+--------+-----------------+---------+

>>> edges
+----------------+------------+------------+
|      src       |    dst     |  relation  |
+----------------+------------+------------+
|    Wai Lin     | James Bond |   friend   |
|       M        | James Bond |  worksfor  |
| Inga Bergstorm | James Bond |   friend   |
| Elliot Carver  | James Bond | killed_by  |
|   Gotz Otto    | James Bond | killed_by  |
|   James Bond   |     M      | managed_by |
|       Q        |     M      | managed_by |
|   Moneypenny   |     M      | managed_by |
|       Q        | Moneypenny | colleague  |
|       M        | Moneypenny |  worksfor  |
+----------------+------------+------------+

2. Create an SGraph object and add the vertices and edges

g = gl.SGraph()
g = g.add_vertices(vertices=vertices, vid_field='name')
g = g.add_edges(edges=edges, src_field='src', dst_field='dst')

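Inspect the structure of the graph. Note that the vertex field name has been renamed to __id, and the edge fields src and dst have been renamed to __src_id and __dst_id.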

>>> g
SGraph({'num_edges': 20, 'num_vertices': 10})
Vertex Fields:['__id', 'gender', 'license_to_kill', 'villian']
Edge Fields:['__src_id', '__dst_id', 'relation']
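
For intuition about what the graph holds, here is a plain-Python sketch of the same structure. It is only an illustration, not how SGraph stores data, and build_graph is a hypothetical helper. It uses the ten edges visible in the printed edges table; the real graph reports 20 edges, because the table display shows only the first rows.

```python
from collections import defaultdict

# The ten edges visible in the edges table: (src, dst, relation).
EDGES = [
    ("Wai Lin", "James Bond", "friend"),
    ("M", "James Bond", "worksfor"),
    ("Inga Bergstorm", "James Bond", "friend"),
    ("Elliot Carver", "James Bond", "killed_by"),
    ("Gotz Otto", "James Bond", "killed_by"),
    ("James Bond", "M", "managed_by"),
    ("Q", "M", "managed_by"),
    ("Moneypenny", "M", "managed_by"),
    ("Q", "Moneypenny", "colleague"),
    ("M", "Moneypenny", "worksfor"),
]


def build_graph(edges):
    # Collect vertex ids and group edge records by source vertex,
    # using the same field names that SGraph reports.
    vertex_ids = set()
    out_edges = defaultdict(list)
    for src, dst, relation in edges:
        vertex_ids.update((src, dst))
        out_edges[src].append(
            {"__src_id": src, "__dst_id": dst, "relation": relation}
        )
    return vertex_ids, out_edges


vertex_ids, out_edges = build_graph(EDGES)
num_edges = sum(len(records) for records in out_edges.values())
print("num_vertices: %d" % len(vertex_ids))
print("num_edges: %d" % num_edges)
```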

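The graph object provides methods for retrieving the edges and vertices. Their output is similar to that of the original vertices and edges variables.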

g.get_vertices()
g.get_edges()

3. Compute PageRank on the graph

>>> pr = gl.pagerank.create(g)
PROGRESS: Counting out degree
PROGRESS: Done counting out degree
PROGRESS: +-----------+-----------------------+
PROGRESS: | Iteration | L1 change in pagerank |
PROGRESS: +-----------+-----------------------+
PROGRESS: | 1         | 6.65833               |
PROGRESS: | 2         | 4.65611               |
PROGRESS: | 3         | 3.46298               |
PROGRESS: | 4         | 2.55686               |
PROGRESS: | 5         | 1.95422               |
PROGRESS: | 6         | 1.42139               |
PROGRESS: | 7         | 1.10464               |
PROGRESS: | 8         | 0.806704              |
PROGRESS: | 9         | 0.631771              |
PROGRESS: | 10        | 0.465388              |
PROGRESS: | 11        | 0.364898              |
PROGRESS: | 12        | 0.271257              |
PROGRESS: | 13        | 0.212255              |
PROGRESS: | 14        | 0.159062              |
PROGRESS: | 15        | 0.124071              |
PROGRESS: | 16        | 0.0935911             |
PROGRESS: | 17        | 0.0727674             |
PROGRESS: | 18        | 0.0551714             |
PROGRESS: | 19        | 0.0427744             |
PROGRESS: | 20        | 0.0325555             |
PROGRESS: +-----------+-----------------------+

As shown above, calling gl's pagerank.create method with the constructed SGraph object returns the pr model object.

>>> pr
Class                                   : PagerankModel

Graph
-----
num_edges                               : 20
num_vertices                            : 10

Results
-------
graph                                   : SGraph. See m['graph']
change in last iteration (L1 norm)      : 0.0326
vertex pagerank                         : SFrame. See m['pagerank']

Settings
--------
maximun number of iterations            : 20
convergence threshold (L1 norm)         : 0.01
probablity of random jumps to any node in the graph: 0.15

Metrics
-------
training time (secs)                    : 1.0853
number of iterations                    : 20

Queryable Fields
----------------
training_time                           : Total training time of the model
graph                                   : A new SGraph with the pagerank as a vertex property
delta                                   : Change in pagerank for the last iteration in L1 norm
reset_probability                       : The probablity of randomly jumps to any node in the graph
pagerank                                : An SFrame with each vertex's pagerank
num_iterations                          : Number of iterations
threshold                               : The convergence threshold in L1 norm
max_iterations                          : The maximun number of iterations to run
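
The settings in this summary (reset probability 0.15, L1 threshold 0.01, at most 20 iterations) map onto a simple iterative update. The sketch below uses one common unnormalized formulation; that GraphLab uses exactly this update is an assumption. Only the ten edges visible in the edges table are included, so the numbers will not match the output of the real 20-edge graph.

```python
VERTICES = [
    "James Bond", "M", "Moneypenny", "Q", "Wai Lin", "Inga Bergstorm",
    "Elliot Carver", "Paris Carver", "Gotz Otto", "Henry Gupta",
]

# (src, dst) pairs for the ten edges visible in the edges table.
EDGES = [
    ("Wai Lin", "James Bond"), ("M", "James Bond"),
    ("Inga Bergstorm", "James Bond"), ("Elliot Carver", "James Bond"),
    ("Gotz Otto", "James Bond"), ("James Bond", "M"), ("Q", "M"),
    ("Moneypenny", "M"), ("Q", "Moneypenny"), ("M", "Moneypenny"),
]


def pagerank(vertices, edges, reset_probability=0.15,
             threshold=0.01, max_iterations=20):
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    rank = {v: 1.0 for v in vertices}
    delta = 0.0
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Each vertex splits its current rank evenly across its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += rank[src] / out_degree[src]
        new_rank = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
        # L1 change between successive iterations.
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < threshold:
            break
    return rank, iteration, delta


ranks, iterations, delta = pagerank(VERTICES, EDGES)
for name in sorted(ranks, key=ranks.get, reverse=True):
    print("%-15s %.4f" % (name, ranks[name]))
```

A vertex with no incoming edges ends up with exactly the reset probability, which is why the sketch gives Q a rank of 0.15.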

Each of the queryable fields listed in the model summary can be retrieved with pr.get():

>>> pr.get('pagerank')
+----------------+----------------+-------------------+
|      __id      |    pagerank    |       delta       |
+----------------+----------------+-------------------+
|   Moneypenny   | 1.18363921275  |  0.00143637385736 |
| Inga Bergstorm | 0.869872717136 |  0.00477951418076 |
|  Henry Gupta   | 0.284762885673 | 1.89255522874e-05 |
|  Paris Carver  | 0.284762885673 | 1.89255522874e-05 |
|       Q        | 1.18363921275  |  0.00143637385736 |
|    Wai Lin     | 0.869872717136 |  0.00477951418076 |
|       M        | 1.87718696576  |  0.00666194771763 |
|   James Bond   | 2.52743578524  |  0.0132914517076  |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
|   Gotz Otto    | 0.284762885673 | 1.89255522874e-05 |
+----------------+----------------+-------------------+

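The result above is unsorted. Sorting by the pagerank column with topk (which returns the top 10 rows by default) gives the most important person: Bond!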

>>> pr.get('pagerank').topk(column_name='pagerank')
+----------------+----------------+-------------------+
|      __id      |    pagerank    |       delta       |
+----------------+----------------+-------------------+
|   James Bond   | 2.52743578524  |  0.0132914517076  |
|       M        | 1.87718696576  |  0.00666194771763 |
|   Moneypenny   | 1.18363921275  |  0.00143637385736 |
|       Q        | 1.18363921275  |  0.00143637385736 |
| Inga Bergstorm | 0.869872717136 |  0.00477951418076 |
|    Wai Lin     | 0.869872717136 |  0.00477951418076 |
| Elliot Carver  | 0.634064732205 | 0.000113553313724 |
|  Henry Gupta   | 0.284762885673 | 1.89255522874e-05 |
|  Paris Carver  | 0.284762885673 | 1.89255522874e-05 |
|   Gotz Otto    | 0.284762885673 | 1.89255522874e-05 |
+----------------+----------------+-------------------+
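
The same top-k selection can be sketched in plain Python with heapq.nlargest. This illustrates the idea only and is not how SFrame.topk is implemented; the scores are copied from the pagerank output above, and the local topk helper is hypothetical.

```python
import heapq

# (vertex id, pagerank) pairs copied from the pr.get('pagerank') output.
PAGERANK_ROWS = [
    ("Moneypenny", 1.18363921275),
    ("Inga Bergstorm", 0.869872717136),
    ("Henry Gupta", 0.284762885673),
    ("Paris Carver", 0.284762885673),
    ("Q", 1.18363921275),
    ("Wai Lin", 0.869872717136),
    ("M", 1.87718696576),
    ("James Bond", 2.52743578524),
    ("Elliot Carver", 0.634064732205),
    ("Gotz Otto", 0.284762885673),
]


def topk(rows, k=10):
    # Return the k rows with the largest score, highest first.
    return heapq.nlargest(k, rows, key=lambda row: row[1])


for name, score in topk(PAGERANK_ROWS, k=3):
    print("%-12s %.4f" % (name, score))
```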

Dato user guide

https://dato.com/learn/userguide/index.html
