背景
测试了一个case,用GraphX 1.6跑标准的LPA算法,使用的是内置的LabelPropagation算法包。数据集是Google web graph,(忽略可能这个数据集不是很合适),资源情况是standalone模式,18个worker,每个worker起一个executor,50g内存,32核,数据加载成18个分区。
case里执行200轮迭代,代码:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
// load the graph
val google = GraphLoader.edgeListFile(sc, "/home/admin/benchmark/data/google/web-Google.txt", false, 18)
LabelPropagation.run(google, 200)
GraphX的执行方式
graphx的LPA是使用自己封装的Pregel跑的,先说优点,问题在后面暴露后分析:
1. 包掉了使用VertexRDD和EdgeRDD做BSP的过程,api简单,泛型清晰
2. 某轮迭代完成后,本轮没有msg流动的话,判定早停,任务结束
3. 迭代开始前,graph自动cache,结束后,某些中间结果rdd自动uncache
代码如下:
def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
(graph: Graph[VD, ED],
initialMsg: A,
maxIterations: Int = Int.MaxValue,
activeDirection: EdgeDirection = EdgeDirection.Either)
(vprog: (VertexId, VD, A) => VD,
sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
mergeMsg: (A, A) => A)
: Graph[VD, ED] =
{
var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
// compute the messages
var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
var activeMessages = messages.count()
// Loop
var prevG: Graph[VD, ED] = null
var i = 0
while (activeMessages > 0 && i < maxIterations) {
// Receive the messages and update the vertices.
prevG = g
g = g.joinVertices(messages)(vprog).cache()
val oldMessages = messages
// Send new messages, skipping edges where neither side received a message. We must cache
// messages so it can be materialized on the next line, allowing us to uncache the previous
// iteration.
messages = g.mapReduceTriplets(
sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
// The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
// (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
// and the vertices of g).
activeMessages = messages.count()
logInfo("Pregel finished iteration " + i)
// Unpersist the RDDs hidden by newly-materialized RDDs
oldMessages.unpersist(blocking = false)
prevG.unpersistVertices(blocking = false)
prevG.edges.unpersist(blocking = false)
// count the iteration
i += 1
}
g
} // end of apply

本文通过一个使用GraphX进行图计算的案例,揭示了在执行大量迭代时,Spark Driver成为性能瓶颈的问题。在1.6版本的GraphX中,LabelPropagation算法在迭代过程中由于血缘关系导致内存消耗过大,引发OOM异常。调整Driver内存至10GB后,虽然避免了异常,但GC频繁,延长了计算时间。对比自研框架,GraphX在迭代效率上有明显不足。作者认为,对于大迭代任务,GraphX需要手动控制cache和迭代次数来缓解问题,且其图更新和建模过程不灵活,不适合图数据更新场景。总结来说,GraphX更适合数据分析中的特定环节,而非大规模图计算或生产环境。
最低0.47元/天 解锁文章
970

被折叠的 条评论
为什么被折叠?



