A Brief Analysis of Graph Computation in Spark GraphX

low_125

Last modified 2022-10-30 12:00:26


First published 2022-10-29 19:47:02

Article link: https://blog.csdn.net/low_125/article/details/127590371




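I find that an older version is easier to analyze, and since the main goal is to understand the ideas, this post uses spark-0.9.0-incubating as its subject. Below is a test from its PregelSuite.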

  test("chain propagation") {
    withSpark { sc =>
      val n = 5
      val chain = Graph.fromEdgeTuples(
        sc.parallelize((1 until n).map(x => (x: VertexId, x + 1: VertexId)), 3),
        0).cache()
      assert(chain.vertices.collect.toSet === (1 to n).map(x => (x: VertexId, 0)).toSet)
      val chainWithSeed = chain.mapVertices { (vid, attr) => if (vid == 1) 1 else 0 }.cache()
      assert(chainWithSeed.vertices.collect.toSet ===
        Set((1: VertexId, 1)) ++ (2 to n).map(x => (x: VertexId, 0)).toSet)
      val result = Pregel(chainWithSeed, 0)(
        (vid, attr, msg) => math.max(msg, attr),
        et => if (et.dstAttr != et.srcAttr) Iterator((et.dstId, et.srcAttr)) else Iterator.empty,
        (a: Int, b: Int) => math.max(a, b) )
      assert(result.vertices.collect.toSet ===
        chain.vertices.mapValues { (vid, attr) => attr + 1 }.collect.toSet)
    }
  }

Let's look at how the graph is defined:

class GraphImpl[VD: ClassTag, ED: ClassTag] protected (
    @transient val vertices: VertexRDD[VD],
    @transient val edges: EdgeRDD[ED],
    @transient val routingTable: RoutingTable,
    @transient val replicatedVertexView: ReplicatedVertexView[VD])
  extends Graph[VD, ED] with Serializable

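Note the @transient annotations: these fields are skipped when a GraphImpl object is serialized, so the vertex, edge, and other datasets are seldom transferred more than once. Once these datasets (RDDs) have reached the workers, their data rarely changes.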
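To make the effect of @transient concrete, here is a minimal sketch in plain Scala with no Spark involved. `Holder`, `payload`, and `roundTrip` are hypothetical names made up for this illustration; they are not part of GraphX. The sketch only shows the general JVM behavior: a field marked @transient is left out when the object is serialized.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a class like GraphImpl: `payload` plays the role
// of a dataset reference that should not travel with the object.
class Holder(val name: String, @transient val payload: Array[Int]) extends Serializable

object TransientDemo {
  // Serialize an object to bytes and read it back, roughly what happens
  // when an object is shipped to another JVM.
  def roundTrip[T <: AnyRef](value: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val copy = roundTrip(new Holder("g", Array(1, 2, 3)))
    println(copy.name)            // the ordinary field survives the round trip
    println(copy.payload == null) // the @transient field was not serialized
  }
}
```

In this sketch the copy keeps `name` but comes back with a null `payload`, which is why a holder object stays cheap to ship no matter how large the data behind the transient field is.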
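GraphX partitions a distributed graph using a vertex cut. The figure below illustrates the two approaches.
[Figure: edge cut vs. vertex cut]
Rather than splitting the graph along edges, GraphX splits it along vertices, which can reduce communication and storage overhead. Logically, this corresponds to assigning edges to machines and allowing vertices to span multiple machines.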
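The vertex-cut idea can be sketched in plain Scala collections. This is an illustration under stated assumptions, not GraphX code: `partitionOf` is a hypothetical placement rule (source id modulo the partition count) chosen only to keep the example short, and GraphX's real partition strategies differ. The sketch assigns every edge to exactly one partition and then derives, for each vertex, the set of edge partitions that need a copy of it.

```scala
object VertexCutSketch {
  type VertexId = Long
  final case class Edge(src: VertexId, dst: VertexId)

  // Hypothetical placement rule for illustration only: an edge goes to the
  // partition chosen by its source id. GraphX's real strategies differ.
  def partitionOf(e: Edge, numParts: Int): Int = (e.src % numParts).toInt

  // Vertex cut: every edge lives in exactly one partition.
  def cutByVertex(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(partitionOf(_, numParts))

  // Routing information: for each vertex, the edge partitions that
  // reference it and therefore need a replica of its attribute.
  def routing(parts: Map[Int, Seq[Edge]]): Map[VertexId, Set[Int]] = {
    val pairs = for {
      (pid, es) <- parts.toSeq
      e <- es
      vid <- Seq(e.src, e.dst)
    } yield (vid, pid)
    pairs.groupBy(_._1).map { case (vid, ps) => vid -> ps.map(_._2).toSet }
  }

  def main(args: Array[String]): Unit = {
    // The chain 1 -> 2 -> 3 -> 4 -> 5, cut into 3 partitions.
    val edges = (1L until 5L).map(x => Edge(x, x + 1))
    val parts = cutByVertex(edges, 3)
    routing(parts).toSeq.sortBy(_._1).foreach { case (vid, pids) =>
      println(s"vertex $vid is needed in edge partitions ${pids.toSeq.sorted.mkString(", ")}")
    }
  }
}
```

Under this placement rule the interior vertices 2, 3, and 4 each span two partitions, while the end vertices 1 and 5 appear in only one. That per-vertex set of partitions is the kind of information a routing table has to hold.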
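Tracing the PregelSuite example yields the figure below (the graph, vertex table, edge table, and routing table):
[Figure: the graph, vertex table, edge table, and routing table]
As the figure shows, the vertex table and the edge table are partitioned differently. After the Spark nodes are initialized, the layout is as shown in the figure below.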

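In the partition diagram above, the RoutingTable, ReplicatedVertexView, Edges, and other RDDs (datasets) are unlikely to all sit on the same worker. That does not matter: during computation Spark can fetch an already computed dataset directly by its RDD id, which is effectively the same as having everything on one worker.
To reduce data transfer while the program runs in Spark, the RoutingTable dataset, the vertex table dataset (Vertices), and the edge table dataset (Edges) no longer change once they are cached. What actually changes during computation is updatedVerts, that is, the updatedVerts inside ReplicatedVertexView (the replicated vertex view), shown in the red box in the figure above. The change happens in the g.outerJoinVertices call.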

class ReplicatedVertexView[VD: ClassTag](
    updatedVerts: VertexRDD[VD],
    edges: EdgeRDD[_],
    routingTable: RoutingTable,
    prevViewOpt: Option[ReplicatedVertexView[VD]] = None)
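The idea of moving only the vertices that changed can be modeled in a few lines of plain Scala. This is a simplified sketch of the concept, not the GraphX implementation: `changed` and `shipments` are hypothetical helpers, and the routing map in `main` is an assumed example rather than data traced from Spark.

```scala
object ChangedVertsSketch {
  type VertexId = Long

  // Keep only the vertices whose attribute differs between two iterations.
  def changed[VD](oldVerts: Map[VertexId, VD], newVerts: Map[VertexId, VD]): Map[VertexId, VD] =
    newVerts.filter { case (vid, attr) => !oldVerts.get(vid).contains(attr) }

  // Group the changed attributes by the edge partitions that hold a replica,
  // using a routing map (vertex id -> edge partition ids).
  def shipments[VD](updated: Map[VertexId, VD],
                    routing: Map[VertexId, Set[Int]]): Map[Int, Map[VertexId, VD]] = {
    val triples = for {
      (vid, attr) <- updated.toSeq
      pid <- routing.getOrElse(vid, Set.empty[Int]).toSeq
    } yield (pid, vid, attr)
    triples.groupBy(_._1).map { case (pid, ts) => pid -> ts.map(t => t._2 -> t._3).toMap }
  }

  def main(args: Array[String]): Unit = {
    val before = Map(1L -> 1, 2L -> 0, 3L -> 0)
    val after  = Map(1L -> 1, 2L -> 1, 3L -> 0)
    // Assumed routing: vertex 2 is replicated in edge partitions 1 and 2.
    val routing = Map(1L -> Set(1), 2L -> Set(1, 2), 3L -> Set(0, 2))
    shipments(changed(before, after), routing).foreach { case (pid, vs) =>
      println(s"partition $pid receives $vs")
    }
  }
}
```

Here only vertex 2 changed, so only the two partitions that replicate vertex 2 receive anything. The cached routing and edge data are read but never rewritten.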

The Pregel program is as follows.

def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages. Vertices that didn't get any messages do not appear in newVerts.
      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()

      // Update the graph with the new vertices.
      prevG = g
      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }
      g.cache()
      
      val oldMessages = messages
      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
      activeMessages = messages.count()

      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking=false)
      newVerts.unpersist(blocking=false)
      prevG.unpersistVertices(blocking=false)
      // count the iteration
      i += 1
    }

    g
  } // end of apply
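To follow the loop without a cluster, the same control flow can be reproduced on ordinary Scala collections. This is a sketch under stated assumptions, not the GraphX implementation: `LocalPregel`, `Triplet`, and `collect` are hypothetical names, the graph is a plain `Map` plus a list of edges, and every edge is scanned in each round instead of only those next to recently updated vertices (the role of `activeDirection` in the original).

```scala
object LocalPregel {
  type VertexId = Long
  final case class Triplet[VD](srcId: VertexId, srcAttr: VD, dstId: VertexId, dstAttr: VD)

  // Returns the final vertex attributes and the number of loop iterations run.
  def run[VD, A](vertices: Map[VertexId, VD],
                 edges: Seq[(VertexId, VertexId)],
                 initialMsg: A,
                 maxIterations: Int = Int.MaxValue)
                (vprog: (VertexId, VD, A) => VD,
                 sendMsg: Triplet[VD] => Iterator[(VertexId, A)],
                 mergeMsg: (A, A) => A): (Map[VertexId, VD], Int) = {

    // Run sendMsg on every edge and merge the messages per target vertex.
    // Simplification: all edges are scanned, not only the active ones.
    def collect(verts: Map[VertexId, VD]): Map[VertexId, A] =
      edges.iterator
        .flatMap { case (s, d) => sendMsg(Triplet(s, verts(s), d, verts(d))) }
        .toSeq
        .groupBy(_._1)
        .map { case (vid, ms) => vid -> ms.map(_._2).reduce(mergeMsg) }

    // Every vertex first receives the initial message.
    var g = vertices.map { case (vid, attr) => vid -> vprog(vid, attr, initialMsg) }
    var messages = collect(g)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Only vertices that received a message run vprog; the rest keep their value.
      g = g.map { case (vid, attr) =>
        vid -> messages.get(vid).map(m => vprog(vid, attr, m)).getOrElse(attr)
      }
      messages = collect(g)
      i += 1
    }
    (g, i)
  }

  def main(args: Array[String]): Unit = {
    // The chain from the "chain propagation" test: vertex 1 starts at 1, the rest at 0.
    val n = 5
    val verts = (1L to n).map(v => v -> (if (v == 1L) 1 else 0)).toMap
    val edges = (1L until n).map(x => (x, x + 1))
    val (result, iterations) = run(verts, edges, 0)(
      (vid, attr, msg) => math.max(msg, attr),
      t => if (t.dstAttr != t.srcAttr) Iterator((t.dstId, t.srcAttr)) else Iterator.empty,
      (a: Int, b: Int) => math.max(a, b))
    println(result.toSeq.sortBy(_._1))
    println(iterations)
  }
}
```

With the chain from the test, the value 1 moves one hop per iteration, so the loop stops after four iterations with every vertex holding 1. That matches what the test asserts (`attr + 1` on a graph initialized to 0).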