Bottlenecks in GraphX Iteration: An Analysis

张包峰

Article link: https://blog.csdn.net/pelick/article/details/50630003

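Summary (auto-generated by CSDN): Through a GraphX graph-computation case, this article shows that the Spark Driver becomes the performance bottleneck when a large number of iterations is executed. In GraphX 1.6, the LabelPropagation algorithm consumes excessive memory during iteration because of RDD lineage, which triggers an OOM exception. Raising Driver memory to 10 GB avoids the exception, but frequent GC lengthens the computation. Compared with an in-house framework, GraphX is clearly less efficient at iteration. The author argues that for jobs with many iterations, GraphX users have to control caching and the iteration count by hand to mitigate the problem, and that its graph update and modeling process is inflexible, which makes it a poor fit for scenarios where graph data is updated. In short, GraphX is better suited to specific steps within data analysis than to large-scale graph computation or production environments.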

Background

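I tested one case: running the standard LPA algorithm on GraphX 1.6 with the built-in LabelPropagation package. The dataset is the Google web graph (setting aside that this dataset may not be the most suitable one). The cluster runs in standalone mode with 18 workers, each starting one executor with 50 GB of memory and 32 cores, and the data is loaded into 18 partitions.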

The case runs 200 iterations. The code:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// load the graph
val google = GraphLoader.edgeListFile(sc, "/home/admin/benchmark/data/google/web-Google.txt", false, 18)

LabelPropagation.run(google, 200)
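
For readers unfamiliar with LPA, the update rule applied in each round can be sketched on a plain in-memory graph, without Spark. This is only an illustration under stated assumptions: labels travel in both directions along every edge, all vertices update synchronously, and ties are broken by picking the smallest label (the tie-break and the names `LpaStep` and `step` are hypothetical choices for this sketch, not GraphX behavior).

```scala
object LpaStep {
  type VertexId = Long

  // One synchronous LPA round: every vertex adopts the label that is most
  // frequent among its neighbors. Ties go to the smallest label (assumption).
  def step(
      labels: Map[VertexId, VertexId],
      edges: Seq[(VertexId, VertexId)]): Map[VertexId, VertexId] = {
    // Each edge delivers the label of one endpoint to the other, both ways.
    val incoming: Map[VertexId, Seq[VertexId]] =
      edges
        .flatMap { case (src, dst) => Seq(dst -> labels(src), src -> labels(dst)) }
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2) }

    labels.map { case (id, current) =>
      incoming.get(id) match {
        // A vertex with no neighbors keeps its current label.
        case None => id -> current
        case Some(received) =>
          val counts = received.groupBy(identity).map { case (l, xs) => l -> xs.size }
          val best = counts.values.max
          id -> counts.collect { case (l, c) if c == best => l }.min
      }
    }
  }

  // A triangle (1, 2, 3) with vertex 4 attached to vertex 3.
  val edges: Seq[(VertexId, VertexId)] = Seq((1L, 2L), (1L, 3L), (2L, 3L), (3L, 4L))
  // Every vertex starts with its own id as its label.
  val initial: Map[VertexId, VertexId] = Seq(1L, 2L, 3L, 4L).map(id => id -> id).toMap
}
```

Calling `LpaStep.step(LpaStep.initial, LpaStep.edges)` twice in a row leaves all four vertices with label 1 on this toy graph. On a real web graph labels can keep changing for many rounds, which is why the case above fixes the iteration count.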

How GraphX Executes the Job

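GraphX runs LPA on its own Pregel wrapper. The advantages come first; the problems are analyzed later, once they surface:
1. It hides the BSP process built on VertexRDD and EdgeRDD; the API is simple and the generics are clear.
2. If no messages flow in a round after an iteration completes, it stops early and the job ends.
3. The graph is cached automatically before iteration starts, and afterwards some intermediate-result RDDs are uncached automatically.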
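
Points 1 and 2 can be illustrated with a single-machine sketch of the same contract: the caller supplies only `vprog`, `sendMsg`, and `mergeMsg`, and the loop ends as soon as a round produces no messages. This is a simplified model, not GraphX code. `MiniPregel`, `gather`, and `demo` are hypothetical names, the graph lives in ordinary Scala collections instead of RDDs, `sendMsg` takes the two endpoints directly instead of an `EdgeTriplet`, and every edge is visited each round (the real implementation can skip edges whose endpoints received no message). The demo propagates the minimum vertex id instead of running LPA so that the early stop is visible.

```scala
object MiniPregel {
  type VertexId = Long

  // Run sendMsg over every edge and combine messages per target with mergeMsg.
  private def gather[VD, A](
      state: Map[VertexId, VD],
      edges: Seq[(VertexId, VertexId)],
      sendMsg: (VertexId, VD, VertexId, VD) => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Map[VertexId, A] =
    edges.iterator
      .flatMap { case (src, dst) => sendMsg(src, state(src), dst, state(dst)) }
      .foldLeft(Map.empty[VertexId, A]) { case (acc, (id, msg)) =>
        acc.updated(id, acc.get(id).map(old => mergeMsg(old, msg)).getOrElse(msg))
      }

  // Returns the final vertex values and the number of rounds actually executed.
  def run[VD, A](
      vertices: Map[VertexId, VD],
      edges: Seq[(VertexId, VertexId)],
      initialMsg: A,
      maxIterations: Int)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: (VertexId, VD, VertexId, VD) => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): (Map[VertexId, VD], Int) = {
    // Before the loop, every vertex processes the initial message.
    var state = vertices.map { case (id, v) => id -> vprog(id, v, initialMsg) }
    var messages = gather(state, edges, sendMsg, mergeMsg)
    var i = 0
    // Early stop: leave the loop as soon as a round produced no messages.
    while (messages.nonEmpty && i < maxIterations) {
      // Only vertices that received a message run vprog; the rest keep their value.
      state = state.map { case (id, v) =>
        id -> messages.get(id).map(m => vprog(id, v, m)).getOrElse(v)
      }
      messages = gather(state, edges, sendMsg, mergeMsg)
      i += 1
    }
    (state, i)
  }

  // Minimum-id propagation over two small components: 1-2-3 and 4-5.
  def demo(): (Map[VertexId, VertexId], Int) = {
    val vertices = Seq(1L, 2L, 3L, 4L, 5L).map(id => id -> id).toMap
    val edges = Seq((1L, 2L), (2L, 3L), (4L, 5L))
    run(vertices, edges, Long.MaxValue, 200)(
      (_, value, msg) => math.min(value, msg),
      (src, srcVal, dst, dstVal) =>
        if (srcVal < dstVal) Iterator((dst, srcVal))
        else if (dstVal < srcVal) Iterator((src, dstVal))
        else Iterator.empty,
      (a, b) => math.min(a, b))
  }
}
```

Although `demo()` passes 200 as the iteration limit, the loop exits after 2 rounds because no vertex has anything left to send. An algorithm whose `sendMsg` emits on every edge in every round would never hit this exit and would run until `maxIterations`.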

The Pregel implementation in GraphX is as follows:

  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      prevG = g
      g = g.joinVertices(messages)(vprog).cache()

      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = g.mapReduceTriplets(
        sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
      // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
      // and the vertices of g).
      activeMessages = messages.count()

      logInfo("Pregel finished iteration " + i)

      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking = false)
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      // count the iteration
      i +=