Spark GraphX graph operations: Pregel (an enhanced aggregateMessages)

Author: 温暖会追上来的

Category: Spark小白. Tags: spark, scala

Article link: https://blog.csdn.net/Strawberry_595/article/details/105551236

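1. Pregel API

2. Code implementation

Using Pregel to find the minimum cost from the source vertex to every vertex

Using Pregel to find the maximum depth from the source vertex to every vertex

1. Pregel API: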

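A graph is an inherently recursive data structure, because a vertex's properties may depend on those of its neighbors, whose properties in turn depend on their own neighbors. Many important graph algorithms therefore recompute the properties of every vertex iteratively until a stable state is reached.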

The Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction constrained to the topology of the graph. It executes in a series of super steps in which vertices receive the sum of their inbound messages from the previous super step, compute a new value for the vertex property, and then send messages to neighboring vertices in the next super step. Messages are computed in parallel as a function of the edge triplet, and the message computation has access to both the source and the destination vertex attributes. Vertices that receive no message are skipped within a super step. The iteration stops when no messages remain, and the final graph is returned.

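Definition of pregel:

def pregel[A]
    (initialMsg: A,  // the initial message every vertex receives in the first iteration
     maxIter: Int = Int.MaxValue,  // maximum number of iterations
     activeDir: EdgeDirection = EdgeDirection.Out)
    (vprog: (VertexId, VD, A) => VD,
         // Vertex program, run on each vertex: computes the new attribute from the vertex ID,
         // the current attribute and the inbound message. In the first iteration the inbound
         // message is initialMsg and every vertex runs it once; afterwards only vertices that
         // received a message in the previous iteration run it.
     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
         // Applied to the out-edges of a vertex; produces the messages sent along them
     mergeMsg: (A, A) => A)  // how two messages bound for the same vertex are combined
    : Graph[VD, ED]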

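Rough outline of the implementation:

// Outline only, not runnable as-is: mapReduceTriplets is the older Graph method,
// since deprecated in favor of aggregateMessages.
// Step 1: run vprog once on every vertex with initialMsg, so every vertex attribute is updated once.
var g = mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
var activeMessages = messages.count()
var i = 0
while (activeMessages > 0 && i < maxIterations) {
    // Vertices that received a message run vprog
    g = g.joinVertices(messages)(vprog).cache()
    val oldMessages = messages
    // Only edges touching those vertices (per activeDirection) send new messages
    messages = g.mapReduceTriplets(
        sendMsg,
        mergeMsg,
        Some((oldMessages, activeDirection))
    ).cache()
    activeMessages = messages.count()
    i += 1
}
g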
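
The loop above can be mimicked on plain Scala collections, which makes the superstep mechanics easy to step through without a Spark cluster. This is only a sketch under stated assumptions: `MiniPregel`, its `Edge` case class and `run` are hypothetical names, not part of GraphX, the graph is held in in-memory maps, and only edges leaving a vertex that just received a message are allowed to send (roughly the EdgeDirection.Out behavior).

```scala
object MiniPregel {
  // A directed edge with an attribute, standing in for GraphX's Edge[ED] (hypothetical type).
  final case class Edge[ED](src: Long, dst: Long, attr: ED)

  // Runs the superstep loop on in-memory maps.
  // sendMsg receives the edge plus the current source and destination attributes.
  def run[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[Edge[ED]],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue)(
      vprog: (Long, VD, A) => VD,
      sendMsg: (Edge[ED], VD, VD) => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {

    // Collect the messages produced along edges whose source is active,
    // then merge all messages bound for the same vertex.
    def collect(g: Map[Long, VD], active: Set[Long]): Map[Long, A] =
      edges.iterator
        .filter(e => active(e.src))
        .flatMap(e => sendMsg(e, g(e.src), g(e.dst)))
        .toSeq
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

    // Superstep 0: every vertex runs vprog on initialMsg.
    var g = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    var messages = collect(g, g.keySet)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Only vertices that received a message run vprog; the rest are skipped.
      g = g.map { case (id, attr) =>
        id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
      }
      // Only edges leaving those vertices may send in the next round.
      messages = collect(g, messages.keySet)
      i += 1
    }
    g
  }
}
```

Calling `MiniPregel.run` with a min-based `vprog` and `mergeMsg` and a `sendMsg` that forwards `srcAttr + edge.attr` whenever it is smaller than `dstAttr` reproduces the shortest-path pattern used in section 2.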

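An example of the Pregel algorithm: associate the graph with some initial scores, then have each vertex spread its score outward according to its out-degree while keeping a copy for itself: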

// Attach each vertex's out-degree to its attribute
val graphWithDegree = graph.outerJoinVertices(graph.outDegrees) {
    case (vid, name, deg) => (name, deg match {
        case Some(d) => d + 0.0
        case None => 1.25
    })
}
// Join the graph with the initial scores
val graphWithScoreAndDegree = graphWithDegree.outerJoinVertices(scoreRDD) {
    case (vid, (name, deg), score) => (name, deg, score.getOrElse(0.0))
}
graphWithScoreAndDegree.vertices.foreach(x => println("++++++++++++id:" + x._1 + "; deg: " + x._2._2 + "; score:" + x._2._3))

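The first step of the algorithm: 0.0 (the initMsg passed in) is added to each vertex's value, which leaves the value unchanged, and the result is then divided by the vertex's out-degree. This step matters and must not be overlooked, because vprog runs on every vertex with the initial message; when designing vprog, check whether this initial pass affects the result.

Source of the explanation: https://www.jianshu.com/p/d9170a0723e4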
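
The effect of that first pass can be seen with a small plain-Scala sketch. The score example does not show its vprog, so the function below is an assumption reconstructed from the description (add the message to the score, then divide by the out-degree); `InitialStepSketch` and `Attr` are hypothetical names.

```scala
object InitialStepSketch {
  // Vertex attribute shaped like graphWithScoreAndDegree: (name, out-degree, score).
  type Attr = (String, Double, Double)

  // Assumed vprog: add the incoming message to the score, then divide by the out-degree.
  def vprog(id: Long, attr: Attr, msg: Double): Attr =
    (attr._1, attr._2, (attr._3 + msg) / attr._2)
}
```

For a vertex with out-degree 2.0 and score 3.0, `InitialStepSketch.vprog(1L, ("a", 2.0, 3.0), 0.0)` yields a score of 1.5: although the initial message 0.0 adds nothing, the division still happens, so the attribute has already changed before any real message is exchanged.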

2. Code implementation:

Using Pregel to find the minimum cost from the source vertex to every vertex

package homeWork

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.util.GraphGenerators

object MapGraphX5 {


  def main(args: Array[String]): Unit = {
    // Set up the runtime environment
    val conf = new SparkConf().setAppName("Pregel API GraphX").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Build the graph; edge attributes are the costs
    val myVertices = sc.parallelize(Array((1L, 0), (2L, 0), (3L, 0), (4L, 0),
      (5L, 0)))
    val myEdges = sc.makeRDD(Array(Edge(1L, 2L, 2.5),
      Edge(2L, 3L, 3.6), Edge(3L, 4L, 4.5),
      Edge(4L, 5L, 0.1), Edge(3L, 5L, 5.2)
    ))
    val myGraph = Graph(myVertices, myEdges)

    // Choose the source vertex
    val sourceId: VertexId = 1L
    // Initialize the distances: 0.0 for the source vertex, positive infinity for every other vertex
    val initialGraph = myGraph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    /*
    Decompiled signature of pregel, for reference:

    def pregel[A](
        initialMsg: A,
        maxIterations: Int = { /* compiled code */ },
        activeDirection: EdgeDirection = { /* compiled code */ }
    )(
        vprog: (VertexId, VD, A) => VD,
        sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
        mergeMsg: (A, A) => A
    )(implicit evidence$6: ClassTag[A]): Graph[VD, ED] = { /* compiled code */ }
    */


    val sssp: Graph[Double, Double] = initialGraph.pregel(
      // initialMsg
      Double.PositiveInfinity
      // maxIterations and activeDirection use their default values
    )(
      // vprog: keep the smaller of the current distance and the incoming one
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: look for a cheaper route from vertex 1L to the destination vertex
      triplet => {
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          // source distance + edge cost beats the destination's current distance,
          // so send the sum to the destination vertex to update it
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      // mergeMsg: of two messages bound for the same vertex, keep the smaller
      (a, b) => math.min(a, b)
    )


    sssp.vertices.collect.foreach(println(_))


  }
}
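
To sanity-check what the Pregel job should print, the same five edges can be relaxed with a few lines of plain Scala. This is a cross-check sketch, not the GraphX computation itself; `SsspCheck` and `shortest` are hypothetical names, and the loop is a simple Bellman-Ford style relaxation.

```scala
object SsspCheck {
  // Edges of myGraph: (source, destination, cost).
  val edges: Seq[(Long, Long, Double)] = Seq(
    (1L, 2L, 2.5), (2L, 3L, 3.6), (3L, 4L, 4.5), (4L, 5L, 0.1), (3L, 5L, 5.2))

  // Relax every edge repeatedly until no distance improves.
  def shortest(source: Long): Map[Long, Double] = {
    val ids = edges.flatMap(e => Seq(e._1, e._2)).distinct
    var dist: Map[Long, Double] =
      ids.map(id => id -> (if (id == source) 0.0 else Double.PositiveInfinity)).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((src, dst, cost) <- edges if dist(src) + cost < dist(dst)) {
        dist = dist.updated(dst, dist(src) + cost)
        changed = true
      }
    }
    dist
  }
}
```

With source 1, vertex 5 is reached more cheaply through vertex 4 (about 10.7) than over the direct 3 -> 5 edge (about 11.3), which is the value the `sssp` graph should also report, up to floating-point rounding.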

Using Pregel to find the maximum depth from the source vertex to every vertex

package pregel

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object Demo2 {

  def main(args: Array[String]): Unit = {

    // Set up the runtime environment
    val conf = new SparkConf().setAppName("Pregel API GraphX").setMaster("local")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // Build the graph
    val myVertices = sc.parallelize(Array((1L, "Zhang San"), (2L, "Li Si"), (3L, "Wang Wu"), (4L, "Qian Liu"),
      (5L, "Leader")))
    val myEdges = sc.makeRDD(Array(Edge(1L, 2L, "friend"),
      Edge(2L, 3L, "friend"), Edge(3L, 4L, "friend"),
      Edge(4L, 5L, "superior-subordinate"), Edge(3L, 5L, "superior-subordinate")
    ))

    val myGraph = Graph(myVertices, myEdges)

    // Every vertex starts at depth 0
    val g = myGraph.mapVertices((vid, vd) => 0)

    // Note: on a graph with a cycle the depths keep growing, so pass maxIterations there
    val newGraph: Graph[Int, String] = g.pregel(0)(
      // vprog: keep the larger of the current depth and the incoming one
      (id, attr, maxValue) => math.max(attr, maxValue),
      // sendMsg: propose source depth + 1 if it exceeds the destination's depth
      triplet => {
        if (triplet.srcAttr + 1 > triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + 1))
        } else {
          Iterator.empty
        }
      },
      // mergeMsg: of two proposals for the same vertex, keep the larger
      (a: Int, b: Int) => math.max(a, b)
    )

    newGraph.vertices.collect.foreach(println(_))
  }

}
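
The depth computation relies on the graph being acyclic: with a cycle, `srcAttr + 1 > dstAttr` stays true forever and only an iteration limit stops the job. The plain-Scala sketch below illustrates both cases; `MaxDepthSketch` and `depths` are hypothetical names, and unlike GraphX it re-examines every edge in each round rather than only the active ones, which does not change the result.

```scala
object MaxDepthSketch {
  // Synchronous rounds: each edge proposes (source depth + 1) to its destination,
  // and a vertex adopts the largest proposal that exceeds its current depth.
  def depths(edges: Seq[(Long, Long)], maxIterations: Int): Map[Long, Int] = {
    val ids = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    var depth: Map[Long, Int] = ids.map(_ -> 0).toMap
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val proposals: Map[Long, Int] = edges
        .collect { case (s, d) if depth(s) + 1 > depth(d) => d -> (depth(s) + 1) }
        .groupBy(_._1)
        .map { case (id, ps) => id -> ps.map(_._2).max }
      changed = proposals.nonEmpty
      depth = depth ++ proposals
      i += 1
    }
    depth
  }
}
```

On the article's edges, `MaxDepthSketch.depths(Seq((1L, 2L), (2L, 3L), (3L, 4L), (4L, 5L), (3L, 5L)), 10)` gives depths 0, 1, 2, 3, 4 for vertices 1 to 5. On the two-vertex cycle `Seq((1L, 2L), (2L, 1L))` with a limit of 3 rounds, both vertices end at 3, a value that depends only on the limit.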