PowerGraph Source Code Analysis, Part 2

chihaiheng5264

Original link: https://my.oschina.net/u/2256569/blog/1633013


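Last time I covered the very basics of PowerGraph. Knowing those alone is probably still not enough to write an algorithm or run large-scale graph analysis, so this post gets to the point and covers two things: 1. finalize, 2. GAS.
After loading the graph, call finalize directly, as follows: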

graph.load(vertex_dir, vertex_loader);
graph.load(graph_dir, edge_loader);
graph.finalize();

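Having to call finalize explicitly is admittedly not great productization. So why is it needed? Let's look again at distributed_ingress_base, a key class mentioned in the previous post. There are other important reasons as well, which I will get to later.
1. Build the local graph from the vertices and edges this machine received. Wasn't the graph already built during load? No. The load phase only does the hash assignment and the vertex/edge exchange. During the exchange, vids that are not owned by this machine are sent as messages to the machine that has ownership. Vids that are owned locally are not really loaded either; they are only marked and placed into a buffer. Note that this is edge_buffer_type, not edge_data_type. This is another reason why you need much more memory than the raw data size to run a job. Graph construction is one of the several places where an out-of-memory failure can occur.
2. Metadata annotation. After the first step, the vertex and edge data of your local graph are ready. At this point a concept of "flying" vertices comes in, meaning vertices this machine does not own. Three pieces of information are collected for them: owner, gvid and _mirrors. Once this step is done, the flying vids and vid2lvid_buffer are also merged into graph.lvid2record and vid2lvid, the two key lookup structures (local id to vertex record, and global id to local id). The mirror information is used later to distribute updates to the machines that need them.
3. Finally, based on the metadata, two actions are performed: synchronize_mirrors_to_master_or and synchronize_master_to_mirrors.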
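The bookkeeping in steps 1 and 2 can be pictured with a small standalone sketch. This is not PowerGraph's code: the modulo ownership rule, the struct layout and the function names build_local_tables and count_flying are all assumptions made for illustration; only the field names owner and gvid and the table names lvid2record and vid2lvid come from the text above.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using vid_t = std::uint64_t;   // global vertex id
using lvid_t = std::size_t;    // local vertex id (index into lvid2record)
using proc_t = std::size_t;    // machine id

// Hypothetical per-vertex record holding the global id and its owner.
struct vertex_record {
  vid_t gvid;
  proc_t owner;
};

// Local lookup tables, named after the ones mentioned in the text.
struct local_graph {
  std::vector<vertex_record> lvid2record;
  std::unordered_map<vid_t, lvid_t> vid2lvid;
};

// Assumption: ownership is decided by a plain modulo hash of the vertex id.
proc_t owner_of(vid_t vid, std::size_t num_procs) {
  return static_cast<proc_t>(vid % num_procs);
}

// Assign a local id to every vertex touched by the edges on this machine.
local_graph build_local_tables(
    std::size_t num_procs,
    const std::vector<std::pair<vid_t, vid_t>>& edges) {
  local_graph g;
  auto touch = [&](vid_t vid) {
    if (g.vid2lvid.count(vid)) return;  // already has a local id
    vertex_record rec;
    rec.gvid = vid;
    rec.owner = owner_of(vid, num_procs);
    g.vid2lvid[vid] = g.lvid2record.size();
    g.lvid2record.push_back(rec);
  };
  for (const auto& e : edges) {
    touch(e.first);
    touch(e.second);
  }
  return g;
}

// Count vertices present locally but owned by another machine ("flying").
std::size_t count_flying(const local_graph& g, proc_t me) {
  std::size_t n = 0;
  for (const auto& rec : g.lvid2record) {
    if (rec.owner != me) ++n;
  }
  return n;
}
```

With two machines and the edges (0,1), (1,2), (2,3) on machine 0, vertices 1 and 3 are owned elsewhere, so machine 0 holds them only as non-owned copies that must later be synchronized with their owners.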

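Take a look at pagerank.cpp. After that you can start the engine with omni_engine in sync mode and begin GAS. If the graph data streams in from HDFS, you should implement vertex and edge loaders along these lines: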

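bool vertex_loader(graph_type& graph, const std::string& fname,
                   const std::string& line) {
  // parse line into tokens
  // (assumption: the first token is the vertex id; needs <sstream>)
  std::stringstream strm(line);
  graphlab::vertex_id_type vid;
  if (!(strm >> vid)) return false;  // report a malformed line
  vertex_data vdata;
  // vdata.embedding = line.parse...   (field parsing elided in the original)
  graph.add_vertex(vid, vdata);
  return true;
}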

The edge loader is similar. Pass the vertex loader and the edge loader in when you call graph load.
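The parsing step inside such a loader can be tried out without GraphLab. The sketch below assumes a hypothetical line format of a vertex id followed by whitespace-separated floating point values; the struct parsed_vertex and the function parse_vertex_line are made-up names, not part of PowerGraph.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one input line: "<vid> <x0> <x1> ...".
struct parsed_vertex {
  std::uint64_t vid;
  std::vector<double> embedding;
};

// Returns true if the line starts with an id and every remaining token is a
// number; returns false for an empty line or a non-numeric token.
bool parse_vertex_line(const std::string& line, parsed_vertex& out) {
  std::istringstream strm(line);
  if (!(strm >> out.vid)) return false;
  out.embedding.clear();
  double x;
  while (strm >> x) out.embedding.push_back(x);
  // The loop ends either at end of input (fine) or on a bad token (error).
  return strm.eof();
}
```

Inside a real loader you would call something like this first and only add the vertex to the graph when it returns true.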

GAS breaks down into gather, apply and scatter. There are plenty of articles about it, so go search for them. For the concrete logic, refer to pagerank.cpp:

/**
 * Gather only in edges.
 */
edge_dir_type gather_edges(icontext_type& context,
                           const vertex_type& vertex) const {
  return graphlab::IN_EDGES;
} // end of Gather edges

/* Gather the weighted rank of the adjacent page */
double gather(icontext_type& context, const vertex_type& vertex,
              edge_type& edge) const {
  return (edge.source().data() / edge.source().num_out_edges());
}

/* Use the total rank of adjacent pages to update this page */
void apply(icontext_type& context, vertex_type& vertex,
           const gather_type& total) {

  const double newval = (1.0 - RESET_PROB) * total + RESET_PROB;
  last_change = (newval - vertex.data());
  vertex.data() = newval;
  if (ITERATIONS) context.signal(vertex);
}

/* The scatter edges depend on whether the pagerank has converged */
edge_dir_type scatter_edges(icontext_type& context,
                            const vertex_type& vertex) const {
  // If an iteration counter is set then
  if (ITERATIONS) return graphlab::NO_EDGES;
  // In the dynamic case we run scatter on out edges if we need
  // to maintain the delta cache or the tolerance is above bound.
  if (USE_DELTA_CACHE || std::fabs(last_change) > TOLERANCE) {
    return graphlab::OUT_EDGES;
  } else {
    return graphlab::NO_EDGES;
  }
}

/* The scatter function just signal adjacent pages */
void scatter(icontext_type& context, const vertex_type& vertex,
             edge_type& edge) const {
  if (USE_DELTA_CACHE) {
    context.post_delta(edge.target(), last_change);
  }

  if (last_change > TOLERANCE || last_change < -TOLERANCE) {
    context.signal(edge.target());
  } else {
    context.signal(edge.target()); //, std::fabs(last_change));
  }
}

void save(graphlab::oarchive& oarc) const {
  // If we are using iterations as a counter then we do not need to
  // move the last change in the vertex program along with the
  // vertex data.
  if (ITERATIONS == 0) oarc << last_change;
}

void load(graphlab::iarchive& iarc) {
  if (ITERATIONS == 0) iarc >> last_change;
}

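It is a bit like MapReduce, except that the flow follows how information propagates through a graph. Assume a natural graph whose PageRank we want to compute. The setup is simple: the graph is made of web pages, each page is a vertex, and each page has hyperlinks to other pages, so the vertices are pages and the edges are hyperlinks. Hyperlinks are directed: some go out and some come in, which gives us out-edges and in-edges. Now we compute the PageRank of every page on the graph, which you can also regard as a very simple kind of embedding. We have vertex1...n and edge1...m, every vertex carries a rank, and the initial value is 1.

The engine then starts and enters gather. Pick any vertex from vertex1 to vertexn, say vertex1, and assume its neighbors are vertex2, vertex3 and vertex4, each of which links to vertex1. Gather then runs 3 times and pulls in the ranks of vertex2 through vertex4. Under PageRank, links into a vertex raise its score, while the more pages a source links out to, the less each of its links contributes. As a simple assumption, vertex2 links out to 2 pages in total, vertex4 links out to 3, and vertex3 has no out-link other than the one to vertex1. So the gather logic is as follows: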

return (edge.source().data() / edge.source().num_out_edges());

gather on vertex2 gives 1 / 2 = 0.5

gather on vertex3 gives 1 / 1 = 1

gather on vertex4 gives 1 / 3 = 0.33

Once gather for vertex1 is finished, apply begins. The apply logic is as follows:

const double newval = (1.0 - RESET_PROB) * total + RESET_PROB;
last_change = (newval - vertex.data());
vertex.data() = newval;
if (ITERATIONS) context.signal(vertex);

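The purpose of apply is to update vertex1's rank from what gather collected. Note that total is the aggregate (sum) of the gather results from vertex2 through vertex4, so total is 0.5 + 1 + 0.33 = 1.83. RESET_PROB is set to 0.15, so the result is (1 - 0.15) * 1.83 + 0.15 ≈ 1.71, which is written back as vertex1's rank. If a fixed iteration count is set (ITERATIONS is non-zero), the vertex then signals itself so that it runs again in the next iteration.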

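Scatter is simple. A tolerance is set: if the new value differs from the vertex's previous value by less than 0.01, there is no need to propagate. Here 1.71 - 1 > 0.01, so in the dynamic case (no fixed ITERATIONS) vertex1's change is propagated along its out-edges by signaling the target vertices; in this toy example take those to be vertex2, vertex3 and vertex4. Those vertices then go through their own gather, apply and scatter phases. It is worth noticing USE_DELTA_CACHE: this is PowerGraph's caching optimization for intermediate results. The effect is quite noticeable, but it also causes very high memory usage, which I will cover in detail in a later post on performance tuning PowerGraph for large-scale graphs.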
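The decision of whether to scatter at all can be isolated into a tiny function. The sketch below mirrors the branching of scatter_edges from pagerank.cpp, but the enum edge_dir and the function choose_scatter_edges are stand-ins written for this post, with the globals turned into parameters.

```cpp
#include <cmath>

// Stand-in for graphlab's edge direction values used by scatter_edges.
enum class edge_dir { NO_EDGES, OUT_EDGES };

// Fixed iteration count: never scatter. Otherwise scatter on out-edges when
// the delta cache must be maintained or the rank moved more than tolerance.
edge_dir choose_scatter_edges(unsigned iterations, bool use_delta_cache,
                              double last_change, double tolerance) {
  if (iterations) return edge_dir::NO_EDGES;
  if (use_delta_cache || std::fabs(last_change) > tolerance) {
    return edge_dir::OUT_EDGES;
  }
  return edge_dir::NO_EDGES;
}
```

For the worked example, a change of 0.71 against a tolerance of 0.01 selects OUT_EDGES, while a change of 0.005 would stop the propagation at this vertex.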

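If you are interested, run pagerank and look at the output. In the end you will of course set the number of super steps and a termination condition. Sharp readers will have noticed that as this algorithm keeps running on the graph, the rank values keep changing, so a convergence condition is usually set. With a neural network you would compute gradients instead. You can also explicitly specify a maximum iteration count, for example max_iterations=3.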
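To see both stopping rules together, here is a single-machine sketch that runs the same gather and apply formulas in synchronous rounds. It is not how the PowerGraph engine schedules work (there, vertices are signaled individually); run_pagerank, pagerank_result and the use of a global maximum change as the convergence test are assumptions chosen to keep the example short.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct pagerank_result {
  std::vector<double> rank;
  int iterations;  // number of rounds actually executed
};

// Edges are (source, target) pairs over vertices 0..n-1.
pagerank_result run_pagerank(
    std::size_t n,
    const std::vector<std::pair<std::size_t, std::size_t>>& edges,
    int max_iterations, double tolerance, double reset_prob = 0.15) {
  std::vector<unsigned> out_degree(n, 0);
  for (const auto& e : edges) ++out_degree[e.first];

  std::vector<double> rank(n, 1.0);  // every vertex starts with rank 1
  int done = 0;
  while (done < max_iterations) {
    // Gather: each edge contributes rank(source) / out_degree(source).
    std::vector<double> total(n, 0.0);
    for (const auto& e : edges) {
      total[e.second] += rank[e.first] / out_degree[e.first];
    }
    // Apply: update every rank and track the largest change this round.
    double max_change = 0.0;
    for (std::size_t v = 0; v < n; ++v) {
      const double newval = (1.0 - reset_prob) * total[v] + reset_prob;
      max_change = std::max(max_change, std::fabs(newval - rank[v]));
      rank[v] = newval;
    }
    ++done;
    if (max_change <= tolerance) break;  // converged before the cap
  }
  return {rank, done};
}
```

On a three-page cycle the ranks never move, so the loop stops after the first round; on a graph that keeps changing, max_iterations is what ends the run.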

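That completes a super simple graph computation. Loosely speaking it also counts as a very simple graph embedding, and the amount of computation is small. That said, if you have 10 machines and run this PageRank on 5 billion records, it will probably die in the loading stage. That belongs to the field of distributed graph partitioning, and I will go into the causes and fixes in detail when I get the chance.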

Finally, happy graph embedding ^___^

Reposted from: https://my.oschina.net/u/2256569/blog/1633013
