Let's Talk About Spark GraphX!

明日菜心

Category: Graph Algorithms

Article link: https://blog.csdn.net/aiyinsimei/article/details/73927877

I. Graph structure

How is a graph defined?

Example A:

val userGraph: Graph[(String, String), String]

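userGraph is a graph variable (shorthand for a variable that holds a graph structure). Here (String, String) is the vertex attribute type and String is the edge attribute type. The vertex (Vertex), edge (Edge), and triplet (Triplet) structures are introduced below.

1) Structure of a vertex

Example B: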

val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

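Here VertexRDD[VD] is a subclass of RDD[(VertexId, VD)], so the two can be read the same way. VD is the vertex attribute type; in Example B it is (String, String). VertexId is the vertex id type, which the Spark GraphX source defines as an alias for Long: type VertexId = Long. In short, a vertex record carries two types: the vertex id type (fixed to Long in the source) and the vertex attribute type (user-defined, i.e. whatever information the vertex carries).

2) Structure of an edge

Example C: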

// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

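Here EdgeRDD[ED] is a subclass of RDD[Edge[ED]]. ED is the edge attribute type; in Example C it is String. In Edge(srcId, dstId, attr), srcId is the source vertex id, dstId is the destination vertex id, and attr is the edge attribute value. So an edge record carries two types and three fields: the source vertex id and the destination vertex id (both fixed to Long in the source), and the attribute of the edge from source to destination (user-defined, i.e. whatever information the edge carries).

3) A structure that carries both vertex and edge attributes

Example D: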

val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
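facts.collect.foreach(println(_))
// The triplets view itself; for this graph VD = (String, String) and ED = String
val triplets: RDD[EdgeTriplet[(String, String), String]] = graph.triplets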

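In RDD[EdgeTriplet[VD, ED]], VD is the vertex attribute type and ED is the edge attribute type. As Example D shows, the difference from Edge[ED] is that an EdgeTriplet[VD, ED] value holds not only the edge attribute but also the attributes of the vertices at both ends of the edge.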

class EdgeTriplet[VD, ED] extends Edge[ED] {
  /**
   * The source vertex attribute
   */
  var srcAttr: VD = _ // nullValue[VD]

  /**
   * The destination vertex attribute
   */
  var dstAttr: VD = _ // nullValue[VD]
……
}
case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
    var srcId: VertexId = 0,
    var dstId: VertexId = 0,
    var attr: ED = null.asInstanceOf[ED])
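
To tie Examples B, C, and D together, here is a minimal sketch that builds the graph and reads its triplets view end to end. The object name TripletFactsExample, the application name, and the local[2] master are assumptions made for illustration; the vertex and edge data are the ones from the examples above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object TripletFactsExample {
  // Builds the graph from Examples B and C and returns one sentence per edge.
  def facts(): Seq[String] = {
    // Assumption: a local master is good enough for a demo run.
    val conf = new SparkConf().setAppName("triplet-facts").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
        (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
      val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
        Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
      val graph: Graph[(String, String), String] = Graph(users, relationships)
      // Each triplet exposes srcAttr, attr, and dstAttr together.
      graph.triplets
        .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
        .collect().toSeq.sorted
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = facts().foreach(println)
}
```

The result is sorted only to make the output order stable; the triplets themselves come back in no particular order.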

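4) The figure below is copied from the official Spark GraphX site; a picture conveys the meaning more clearly than words.

II. Ways to build a graph

1) From vertices and edges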

  /**
   * Construct a graph from a collection of vertices and
   * edges with attributes.  Duplicate vertices are picked arbitrarily and
   * vertices found in the edge collection but not in the input
   * vertices are assigned the default attribute.
   *
   * @tparam VD the vertex attribute type
   * @tparam ED the edge attribute type
   * @param vertices the "set" of vertices and their attributes
   * @param edges the collection of edges in the graph
   * @param defaultVertexAttr the default vertex attribute to use for vertices that are
   *                          mentioned in edges but not in vertices
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   */
  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }
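
The doc comment above says that vertices mentioned only in the edge collection receive the default attribute. A small sketch of that behaviour follows; the object name, the sample data, and the "unknown" default are all made up for illustration, and the local master is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphApplyExample {
  // Returns the vertex attributes of a graph built with Graph.apply.
  def vertexAttrs(): Map[VertexId, String] = {
    val conf = new SparkConf().setAppName("graph-apply").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val vertices: RDD[(VertexId, String)] =
        sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
      // Vertex 3 appears only in the edge list, not in `vertices`.
      val edges: RDD[Edge[Int]] =
        sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
      val graph = Graph(vertices, edges, defaultVertexAttr = "unknown")
      graph.vertices.collect().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(vertexAttrs())
}
```

Without an explicit defaultVertexAttr the missing vertex would get null.asInstanceOf[VD], so passing a meaningful default is usually the safer choice.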

2) From edges

 /**
   * Construct a graph from a collection of edges.
   *
   * @param edges the RDD containing the set of edges in the graph
   * @param defaultValue the default vertex attribute to use for each vertex
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes described by `edges` and vertices
   *         given by all vertices in `edges` with value `defaultValue`
   */
  def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
  }

Here the vertices are generated from the edge information:

    val edgesCached = edges.withTargetStorageLevel(edgeStorageLevel).cache()
    val vertices =
      VertexRDD.fromEdges(edgesCached, edgesCached.partitions.length, defaultVertexAttr)
      .withTargetStorageLevel(vertexStorageLevel)
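
As a quick illustration of Graph.fromEdges, the sketch below builds a graph from two edges and checks that every vertex referenced by an edge exists and carries the default value. The object name, the "follows" edges, and the local master are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object FromEdgesExample {
  // Returns (vertexId, attribute) pairs sorted by id.
  def vertices(): Seq[(VertexId, Int)] = {
    val conf = new SparkConf().setAppName("from-edges").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val edges: RDD[Edge[String]] =
        sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
      // No vertex RDD is supplied; vertices 1, 2, 3 are derived from the edges.
      val graph: Graph[Int, String] = Graph.fromEdges(edges, defaultValue = 0)
      graph.vertices.collect().toSeq.sortBy(_._1)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = vertices().foreach(println)
}
```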

3) From edge tuples (pairs of vertex ids)

 /**
   * Construct a graph from a collection of edges encoded as vertex id pairs.
   *
   * @param rawEdges a collection of edges in (src, dst) form
   * @param defaultValue the vertex attributes with which to create vertices referenced by the edges
   * @param uniqueEdges if multiple identical edges are found they are combined and the edge
   * attribute is set to the sum.  Otherwise duplicate edges are treated as separate. To enable
   * `uniqueEdges`, a [[PartitionStrategy]] must be provided.
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes containing either the count of duplicate edges or 1
   * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
   */
  def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
  {
    val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
    val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
    uniqueEdges match {
      case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
      case None => graph
    }
  }
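
The effect of uniqueEdges is easiest to see side by side. This sketch feeds the same duplicated pair into fromEdgeTuples with and without a partition strategy; the object name, the sample pairs, and the choice of RandomVertexCut are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object FromEdgeTuplesExample {
  type EdgeRow = (VertexId, VertexId, Int)

  // Returns (edges without merging, edges with duplicates merged).
  def edgeRows(): (Seq[EdgeRow], Seq[EdgeRow]) = {
    val conf = new SparkConf().setAppName("from-edge-tuples").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // The pair (1, 2) appears twice.
      val rawEdges: RDD[(VertexId, VertexId)] =
        sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))

      def rows(g: Graph[Int, Int]): Seq[EdgeRow] =
        g.edges.collect().map(e => (e.srcId, e.dstId, e.attr)).sorted.toSeq

      val plain = Graph.fromEdgeTuples(rawEdges, defaultValue = 0)
      val merged = Graph.fromEdgeTuples(rawEdges, defaultValue = 0,
        uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
      (rows(plain), rows(merged))
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(edgeRows())
}
```

With uniqueEdges set, the two copies of (1, 2) collapse into one edge whose attribute is their count; without it they stay as separate edges with attribute 1.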

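III. The graph computation model (key topic): the Pregel programming model

1) Definition

Many graph algorithms are implemented on top of Pregel.

At its core are three functions (the user supplies the function bodies, but their input and output types are fixed), as follows:

(The "official explanation" entries below are translated from the source comments.)

The vprog function:

Official explanation: the user-defined vertex program. It runs on each vertex, receives the inbound message, and computes a new vertex value. On the first iteration the vertex program is invoked on all vertices and is passed the default message. On subsequent iterations it is invoked only on vertices that receive messages.

In plain terms: the user supplies the function body. On the first iteration it is applied to every vertex of the graph; on later iterations it is applied only to vertices that received a message. Its arguments are the vertex id, the current vertex attribute, and the message the vertex received (the message type is user-defined), and it returns the updated vertex attribute. The message a vertex receives is the result of sendMsg followed by mergeMsg.

The sendMsg function:

~~Official explanation: a user-supplied function that is applied to the out-edges of vertices that received messages in the current iteration.~~

In plain terms: the user supplies the function body. On the first round it runs on every edge; on later rounds it runs only on the edges selected by activeDirection (by default, edges where either endpoint received a message in the previous round). Its argument is the edge triplet (the source vertex and its attribute, the destination vertex and its attribute, and the attribute of the edge between them), from which the function body produces the messages to send to vertices (the message type is user-defined).

The mergeMsg function:

Official explanation: a user-supplied function that takes two incoming messages of type A and merges them into a single message of type A. This function must be commutative and associative, and ideally the size of A should not increase.

In plain terms: the user supplies the function body. sendMsg may deliver several messages to the same vertex, and mergeMsg combines them two at a time. If the message type is A, the function takes two messages of type A and merges them into a single message of type A.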

  /**
   * Execute a Pregel-like iterative vertex-parallel abstraction.  The
   * user-defined vertex-program `vprog` is executed in parallel on
   * each vertex receiving any inbound messages and computing a new
   * value for the vertex.  The `sendMsg` function is then invoked on
   * all out-edges and is used to compute an optional message to the
   * destination vertex. The `mergeMsg` function is a commutative
   * associative function used to combine messages destined to the
   * same vertex.
   *
   * On the first iteration all vertices receive the `initialMsg` and
   * on subsequent iterations if a vertex does not receive a message
   * then the vertex-program is not invoked.
   *
   * This function iterates until there are no remaining messages, or
   * for `maxIterations` iterations.
   *
   * @tparam VD the vertex data type
   * @tparam ED the edge data type
   * @tparam A the Pregel message type
   *
   * @param graph the input graph.
   *
   * @param initialMsg the message each vertex will receive at the first
   * iteration
   *
   * @param maxIterations the maximum number of iterations to run for
   *
   * @param activeDirection the direction of edges incident to a vertex that received a message in
   * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only
   * out-edges of vertices that received a message in the previous round will run. The default is
   * `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message
   * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where
   * *both* vertices received a message.
   *
   * @param vprog the user-defined vertex program which runs on each
   * vertex and receives the inbound message and computes a new vertex
   * value.  On the first iteration the vertex program is invoked on
   * all vertices and is passed the default message.  On subsequent
   * iterations the vertex program is only invoked on those vertices
   * that receive messages.
   *
   * @param sendMsg a user supplied function that is applied to out
   * edges of vertices that received messages in the current
   * iteration
   *
   * @param mergeMsg a user supplied function that takes two incoming
   * messages of type A and merges them into a single message of type
   * A.  ''This function must be commutative and associative and
   * ideally the size of A should not increase.''
   *
   * @return the resulting graph at the end of the computation
   *
   */
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]

For the remaining parameters, see the official comments in the source.
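
To make the three functions concrete, here is a sketch of single-source shortest paths written against Pregel.apply as declared above. The object name, the four-vertex sample graph, and the local master are assumptions; the algorithm itself is a standard illustration and not something taken from this article's Louvain code.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, Pregel, VertexId}

object PregelShortestPaths {
  // Returns the shortest distance from sourceId to every vertex.
  def distancesFrom(sourceId: VertexId): Map[VertexId, Double] = {
    val conf = new SparkConf().setAppName("pregel-sssp").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val edges = sc.parallelize(Seq(
        Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
      // Vertex attribute = best known distance; only the source starts at 0.
      val init: Graph[Double, Double] = Graph.fromEdges(edges, defaultValue = 0.0)
        .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

      val result = Pregel(init, Double.PositiveInfinity)(
        // vprog: keep the smaller of the current distance and the incoming one
        (_, dist, msg) => math.min(dist, msg),
        // sendMsg: offer the destination a shorter route, if this edge gives one
        t =>
          if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr))
          else Iterator.empty,
        // mergeMsg: of two offers for the same vertex, keep the shorter
        (a, b) => math.min(a, b))

      result.vertices.collect().toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(distancesFrom(1L))
}
```

Note that mergeMsg here is math.min, which is commutative and associative and never grows the message, matching the requirement quoted in the doc comment.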

A diagram would make this clearer, but time is limited; this is a placeholder to be filled in later.

2) The body of the Pregel function:

def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
      s" but got ${maxIterations}")

    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      prevG = g
      g = g.joinVertices(messages)(vprog).cache()

      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = GraphXUtils.mapReduceTriplets(
        g, sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
      // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
      // and the vertices of g).
      activeMessages = messages.count()

      logInfo("Pregel finished iteration " + i)

      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking = false)
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      // count the iteration
      i += 1
    }
    messages.unpersist(blocking = false)
    g
  } // end of apply

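The function body is actually easy to follow.

Step 1: vprog updates the attribute of every vertex in the graph using the initial (default) message.

Step 2: sendMsg and mergeMsg send messages along the edges and merge them per vertex, producing a VertexRDD[A], i.e. an RDD[(VertexId, message)]. Of course, not every vertex receives a message; that depends on the logic of the user's sendMsg.

Step 3: first check the number of vertices that received a message (the active vertices). If at least one vertex received a message and the current iteration count is below the maximum, call vprog on the vertices that received messages to update their attributes.

Step 4: repeat Steps 2 and 3 until no messages remain or maxIterations is reached.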
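
The four steps can also be mirrored without Spark at all. The following is a deliberately simplified, single-machine sketch using plain Scala collections; MiniPregel, Triplet, and demo are hypothetical names, it assumes every edge endpoint is present in the vertex map, and it only imitates the default EdgeDirection.Either behaviour. It is meant for reading the control flow, not as a replacement for GraphX.

```scala
object MiniPregel {
  // Hypothetical stand-in for GraphX's EdgeTriplet.
  final case class Triplet[VD, ED](srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  def run[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[(Long, Long, ED)],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue)(
      vprog: (Long, VD, A) => VD,
      sendMsg: Triplet[VD, ED] => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {

    // Step 2: run sendMsg on edges touching an active vertex, then merge per target.
    def exchange(vs: Map[Long, VD], isActive: Long => Boolean): Map[Long, A] =
      edges
        .filter { case (src, dst, _) => isActive(src) || isActive(dst) }
        .flatMap { case (src, dst, attr) => sendMsg(Triplet(src, vs(src), dst, vs(dst), attr)) }
        .groupBy(_._1)
        .map { case (id, msgs) => id -> msgs.map(_._2).reduce(mergeMsg) }

    // Step 1: every vertex processes the initial message.
    var vs = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    var messages = exchange(vs, _ => true) // the first exchange covers all edges
    var i = 0
    // Steps 3 and 4: loop while messages remain and the iteration budget allows.
    while (messages.nonEmpty && i < maxIterations) {
      val received = messages
      vs = vs.map { case (id, attr) =>
        id -> received.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
      }
      messages = exchange(vs, id => received.contains(id))
      i += 1
    }
    vs
  }

  // Usage: shortest distances from vertex 1 on a made-up four-vertex graph.
  def demo(): Map[Long, Double] = {
    val inf = Double.PositiveInfinity
    val vertices = Map(1L -> 0.0, 2L -> inf, 3L -> inf, 4L -> inf)
    val edges = Seq((1L, 2L, 1.0), (2L, 3L, 2.0), (1L, 3L, 5.0), (3L, 4L, 1.0))
    run(vertices, edges, inf)(
      (_, dist, msg) => math.min(dist, msg),
      t => if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr)) else Iterator.empty,
      (a, b) => math.min(a, b))
  }
}
```

The real implementation differs mainly in bookkeeping: it caches each new graph and message RDD, materializes them with count(), and unpersists the previous iteration's data.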

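IV. Commonly used graph APIs

(I have only collected the graph API functions I used while implementing the Louvain algorithm; I will add new ones as I come across them. My code should be clear enough that it needs no long-winded explanation.)

1) triplets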

    // Keep only edges that cross communities, and key each weight by the
    // unordered pair of community ids (smaller id first).
    // VertexData is the author's own vertex class (not shown here).
    val edges: RDD[((VertexId, VertexId), Double)] = graph.triplets.filter(edgeTriplet => edgeTriplet.srcAttr._cId != edgeTriplet.dstAttr._cId).map {
      case (edgeTriplet: EdgeTriplet[VertexData, Double]) =>
        val srcCId: VertexId = edgeTriplet.srcAttr._cId
        val dstCId: VertexId = edgeTriplet.dstAttr._cId
        val weight: Double = edgeTriplet.attr
        val minVertexId: VertexId = math.min(srcCId, dstCId)
        val maxVertexId: VertexId = math.max(srcCId, dstCId)
        ((minVertexId, maxVertexId), weight)
    }

2) zip

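    // RDD.zip pairs elements by position, so both RDDs must have the same number of
    // partitions and the same number of elements in each partition.
    val changeCount: Long = graph.vertices.zip(maxChangeInfo).filter {
      case ((vId1: VertexId, vertex: VertexData), (vId2: VertexId, cId: VertexId, maxModularityChange: Double)) =>
        vertex._cId != cId
    }.count()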
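
Because zip matches elements by position, it is safest when one RDD is derived from the other with map. Here is a minimal sketch of the same "count what changed" pattern on made-up data; the object name, the community ids, and the rule that moves even-numbered vertices are all assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZipExample {
  // Counts vertices whose proposed community differs from the current one.
  def changedCount(): Long = {
    val conf = new SparkConf().setAppName("zip-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // (vertexId, communityId) pairs spread over two partitions.
      val current = sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 3L), (4L, 4L)), 2)
      // map keeps the partitioning and the element count per partition,
      // so `proposed` lines up with `current` position by position.
      val proposed = current.map { case (vId, cId) =>
        (vId, if (vId % 2 == 0) cId - 1 else cId)
      }
      current.zip(proposed)
        .filter { case ((_, oldCId), (_, newCId)) => oldCId != newCId }
        .count()
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(changedCount())
}
```

If the two RDDs come from unrelated lineages, a join on the vertex id is the more robust choice.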

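3) connectedComponents (this one deserves a word: it finds the connected components of the graph and labels every vertex with the smallest vertex id in its component, so vertices that share a label belong to the same component)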

      val newMaxChangeInfo: VertexRDD[VertexId] = Graph.fromEdgeTuples(maxChangeInfo.map {
        case (vId: VertexId, cId: VertexId, maxModularityChange: Double) => (vId, cId)
      }, 0)
        .connectedComponents()
        .vertices
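
A small sketch of the labelling described above, on a made-up graph with two components; the object name, the vertex pairs, and the local master are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object ConnectedComponentsExample {
  // Maps each vertex id to the label of its connected component.
  def labels(): Map[VertexId, VertexId] = {
    val conf = new SparkConf().setAppName("cc-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Two components: {1, 2, 3} and {10, 11}.
      val pairs: RDD[(VertexId, VertexId)] =
        sc.parallelize(Seq((1L, 2L), (2L, 3L), (10L, 11L)))
      Graph.fromEdgeTuples(pairs, 0)
        .connectedComponents()
        .vertices
        .collect()
        .toMap
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(labels())
}
```

Each vertex ends up labelled with the smallest id in its component, so 1, 2, and 3 share label 1 while 10 and 11 share label 10.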

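4) joinVertices (updates vertex attributes. Unlike outerJoinVertices, it keeps the old attribute for a vertex that has no matching entry and cannot change the attribute type; outerJoinVertices passes an Option to the function, so the caller supplies the default for missing entries and may return a new attribute type)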

    val updateInfoByCId: Graph[VertexData, Double] = graph.joinVertices(maxChangeInfo)((vId: VertexId, vertexData: VertexData, cId: VertexId) => {
      val newVertexData: VertexData = new VertexData(vId, cId)
      newVertexData._degree = vertexData._degree
      newVertexData._innerDegree = vertexData._innerDegree
      newVertexData._innerVertices = vertexData._innerVertices
      newVertexData
    })

5) outerJoinVertices

    val louvainG = initG.outerJoinVertices(vertexAttr)((vId: VertexId, oldVertexAttr: None.type, newVertexAttr: Option[Double]) => {
      // Every vertex starts in its own community
      val vertexData: VertexData = new VertexData(vId, vId)
      // Vertices with no entry in vertexAttr get weight 0.0
      val weights: Double = newVertexAttr.getOrElse(0.0)
      vertexData._degree = weights
      vertexData._innerVertices += vId
      vertexData._commVertices += vId
      vertexData
    })
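
To see the two joins next to each other, the sketch below applies the same update table with joinVertices and with outerJoinVertices. The object name, the sample graph, the update values, and the "none" placeholder are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object JoinVerticesExample {
  // Returns (attributes after joinVertices, attributes after outerJoinVertices).
  def joined(): (Map[VertexId, Int], Map[VertexId, String]) = {
    val conf = new SparkConf().setAppName("join-vertices").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val graph: Graph[Int, Int] = Graph.fromEdges(
        sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1))), defaultValue = 0)
      // There is no update for vertex 3.
      val updates: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 10), (2L, 20)))

      // joinVertices: same attribute type; unmatched vertices keep their old value.
      val inner = graph.joinVertices(updates)((_, old, u) => old + u)
      // outerJoinVertices: the function sees an Option and may change the type.
      val outer = graph.outerJoinVertices(updates)(
        (_, _, opt) => opt.map(_.toString).getOrElse("none"))

      (inner.vertices.collect().toMap, outer.vertices.collect().toMap)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(joined())
}
```

Vertex 3 keeps its original 0 under joinVertices, while under outerJoinVertices the caller decides what a missing entry becomes.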
