Let's talk about Spark GraphX!

I. Graph structure

How is a graph defined?

Example A:

val userGraph: Graph[(String, String), String]

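       userGraph is a graph variable (short for "a variable that holds a graph structure"), where (String, String) is the vertex attribute type and String is the edge attribute type. The Vertex, Edge, and Triplet structures are introduced below.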

1) Vertex structure

Example B:

// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

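       Here, the vertices are passed in as an RDD[(VertexId, VD)]. Inside the graph they are held as a VertexRDD[VD], which extends RDD[(VertexId, VD)] and additionally guarantees that each VertexId appears only once. VD is the vertex attribute type; in Example B it is (String, String). VertexId is the vertex id type; in the Spark GraphX source it is an alias for Long, i.e. type VertexId = Long. So a vertex involves two types: the vertex id type (fixed to Long in the source) and the vertex attribute type (user-defined, i.e. whatever information the vertex carries).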

2) Edge structure

Example C:

// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

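       Here, the edges are passed in as an RDD[Edge[ED]]. Inside the graph they are held as an EdgeRDD[ED], which extends RDD[Edge[ED]]. ED is the edge attribute type; in Example C it is String. In Edge(srcId, dstId, attr), srcId is the source vertex id, dstId is the destination vertex id, and attr is the edge attribute value. So an edge involves two types and three fields: the source vertex id (same type as the destination vertex id, fixed to Long in the source), the destination vertex id, and the attribute of the edge from source to destination (user-defined type, i.e. whatever information the edge carries).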

3) Structure holding both vertex and edge attributes

Example D:

val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
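facts.collect.foreach(println(_))
val triplets: RDD[EdgeTriplet[(String, String), String]] = graph.triplets

       Here, in RDD[EdgeTriplet[VD, ED]], VD is the vertex attribute type and ED is the edge attribute type (for this graph, VD is (String, String) and ED is String). As Example D shows, the difference from Edge[ED] is that an EdgeTriplet[VD, ED] carries not only the edge attribute but also the attributes of the vertices at both ends of the edge. The relevant definitions in the GraphX source are: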


class EdgeTriplet[VD, ED] extends Edge[ED] {
  /**
   * The source vertex attribute
   */
  var srcAttr: VD = _ // nullValue[VD]

  /**
   * The destination vertex attribute
   */
  var dstAttr: VD = _ // nullValue[VD]
……
}
case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] (
    var srcId: VertexId = 0,
    var dstId: VertexId = 0,
    var attr: ED = null.asInstanceOf[ED])
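
To tie Examples B, C, and D together, here is a minimal runnable sketch. It assumes Spark with GraphX on the classpath and a local master; the object name `TripletExample` and the method `facts` are made up for illustration and do not come from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical wrapper object; only the GraphX calls come from the library.
object TripletExample {
  def facts(sc: SparkContext): Array[String] = {
    // Vertices: (id, (name, role)), as in Example B.
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
      (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
    // Edges: Edge(srcId, dstId, attr), as in Example C.
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
      Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
    val graph: Graph[(String, String), String] = Graph(users, relationships)
    // Each triplet exposes srcAttr, attr and dstAttr together, as in Example D.
    graph.triplets
      .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("TripletExample")
    val sc = new SparkContext(conf)
    try facts(sc).foreach(println) finally sc.stop()
  }
}
```

The order of the collected strings depends on partitioning, so treat the result as a set rather than a sequence.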


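4) The figure below is copied from the official Spark GraphX site; a picture conveys the meaning of the text more clearly.

[Figure: property graph diagram from the Spark GraphX documentation]
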

II. Ways to build a graph

1) From vertices and edges

  /**
   * Construct a graph from a collection of vertices and
   * edges with attributes.  Duplicate vertices are picked arbitrarily and
   * vertices found in the edge collection but not in the input
   * vertices are assigned the default attribute.
   *
   * @tparam VD the vertex attribute type
   * @tparam ED the edge attribute type
   * @param vertices the "set" of vertices and their attributes
   * @param edges the collection of edges in the graph
   * @param defaultVertexAttr the default vertex attribute to use for vertices that are
   *                          mentioned in edges but not in vertices
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   */
  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(vertices, edges, defaultVertexAttr, edgeStorageLevel, vertexStorageLevel)
  }
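
As a small illustration of the `defaultVertexAttr` parameter, the sketch below builds a graph whose edge list mentions a vertex (id 0) that is missing from the vertex RDD. The object name `DefaultVertexExample`, the sample data, and the placeholder attribute are assumptions chosen for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of GraphX.
object DefaultVertexExample {
  def vertexAttrs(sc: SparkContext): Map[VertexId, (String, String)] = {
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
      (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc"))))
    // Vertex 0 appears only in the edge list.
    val relationships: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(3L, 7L, "collab"), Edge(7L, 0L, "unknown")))
    // Attribute assigned to vertices mentioned in edges but not in `users`.
    val defaultUser = ("John Doe", "Missing")
    Graph(users, relationships, defaultUser).vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DefaultVertexExample")
    val sc = new SparkContext(conf)
    try vertexAttrs(sc).foreach(println) finally sc.stop()
  }
}
```

Vertex 0 ends up with the default attribute, while vertices 3 and 7 keep the attributes supplied in `users`.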



2) From edges
 /**
   * Construct a graph from a collection of edges.
   *
   * @param edges the RDD containing the set of edges in the graph
   * @param defaultValue the default vertex attribute to use for each vertex
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes described by `edges` and vertices
   *         given by all vertices in `edges` with value `defaultValue`
   */
  def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {
    GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
  }

Here the vertices are generated from the edge data, as this excerpt from GraphImpl shows:
    val edgesCached = edges.withTargetStorageLevel(edgeStorageLevel).cache()
    val vertices =
      VertexRDD.fromEdges(edgesCached, edgesCached.partitions.length, defaultVertexAttr)
      .withTargetStorageLevel(vertexStorageLevel)
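
A minimal usage sketch of `Graph.fromEdges`, assuming a local Spark setup. The object name `FromEdgesExample`, the edge weights, and the default attribute `"unknown"` are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of GraphX.
object FromEdgesExample {
  def vertexAttrs(sc: SparkContext): Map[VertexId, String] = {
    val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(1L, 2L, 0.5), Edge(2L, 3L, 1.5)))
    // No vertex RDD is supplied: every id seen in an edge becomes a vertex
    // carrying the default value.
    val graph: Graph[String, Double] = Graph.fromEdges(edges, defaultValue = "unknown")
    graph.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FromEdgesExample")
    val sc = new SparkContext(conf)
    try vertexAttrs(sc).foreach(println) finally sc.stop()
  }
}
```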

3) From (src, dst) edge pairs

 /**
   * Construct a graph from a collection of edges encoded as vertex id pairs.
   *
   * @param rawEdges a collection of edges in (src, dst) form
   * @param defaultValue the vertex attributes with which to create vertices referenced by the edges
   * @param uniqueEdges if multiple identical edges are found they are combined and the edge
   * attribute is set to the sum.  Otherwise duplicate edges are treated as separate. To enable
   * `uniqueEdges`, a [[PartitionStrategy]] must be provided.
   * @param edgeStorageLevel the desired storage level at which to cache the edges if necessary
   * @param vertexStorageLevel the desired storage level at which to cache the vertices if necessary
   *
   * @return a graph with edge attributes containing either the count of duplicate edges or 1
   * (if `uniqueEdges` is `None`) and vertex attributes containing the total degree of each vertex.
   */
  def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] =
  {
    val edges = rawEdges.map(p => Edge(p._1, p._2, 1))
    val graph = GraphImpl(edges, defaultValue, edgeStorageLevel, vertexStorageLevel)
    uniqueEdges match {
      case Some(p) => graph.partitionBy(p).groupEdges((a, b) => a + b)
      case None => graph
    }
  }
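
The effect of `uniqueEdges` is easiest to see with a duplicated pair. The sketch below is an assumption-laden illustration: the object name `FromEdgeTuplesExample`, the `dedupe` flag, and the choice of `PartitionStrategy.RandomVertexCut` are mine, picked only to show the two branches of the `match` above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of GraphX.
object FromEdgeTuplesExample {
  def edgeCounts(sc: SparkContext, dedupe: Boolean): Seq[(VertexId, VertexId, Int)] = {
    // The pair (1, 2) appears twice.
    val rawEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq(
      (1L, 2L), (1L, 2L), (2L, 3L)))
    val strategy: Option[PartitionStrategy] =
      if (dedupe) Some(PartitionStrategy.RandomVertexCut) else None
    val graph: Graph[Int, Int] =
      Graph.fromEdgeTuples(rawEdges, defaultValue = 0, uniqueEdges = strategy)
    // Copy each edge into a tuple right away, then sort for a stable result.
    graph.edges.map(e => (e.srcId, e.dstId, e.attr)).collect().sorted.toSeq
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("FromEdgeTuplesExample")
    val sc = new SparkContext(conf)
    try {
      println(edgeCounts(sc, dedupe = false))
      println(edgeCounts(sc, dedupe = true))
    } finally sc.stop()
  }
}
```

With `dedupe = false` the duplicate stays as two separate edges with attribute 1; with `dedupe = true` the graph is repartitioned so identical edges land in the same partition, and `groupEdges` sums their attributes into a single edge with attribute 2.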



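III. The graph computation model (the key topic): the Pregel programming model

1) Definition

Many graph algorithms are implemented on top of Pregel.

At its core are three functions (the user supplies the function bodies, but the input and output types are fixed):

(The "official explanation" entries below are taken from the source comments.)

vprog function:

Official explanation: the user-defined vertex program. It runs on each vertex, receives the inbound message, and computes a new vertex value. On the first iteration the vertex program is invoked on all vertices and is passed the default message. On subsequent iterations it is only invoked on vertices that receive messages.

In plain terms: the user supplies the function body. On the first iteration the function is applied to every vertex of the graph; on later iterations it is applied only to vertices that received a message. Its inputs are the vertex id, the current vertex attribute, and the message received by the vertex (the message type is user-defined), and it returns the new vertex attribute. The message a vertex receives is the result of sendMsg followed by mergeMsg.

sendMsg function:

Official explanation: a user-supplied function that is applied to the out-edges of vertices that received messages in the current iteration.

In plain terms: the user supplies the function body. Its input is an edge triplet (the source vertex and its attribute, the destination vertex and its attribute, and the attribute of the edge between them), and it returns an iterator of (vertex id, message) pairs, so it can send zero or more messages to the vertices of that edge. After the first round it does not run on every edge: it runs only on edges next to vertices that received a message in the previous round, as selected by activeDirection. The message type is user-defined.

mergeMsg function:

Official explanation: a user-supplied function that takes two incoming messages of type A and merges them into a single message of type A. This function must be commutative and associative, and ideally the size of A should not increase.

In plain terms: the user supplies the function body. When sendMsg produces several messages for the same vertex, mergeMsg combines them pairwise. If the message type is A, its inputs are two messages of type A and its output is a single message of type A.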

  /**
   * Execute a Pregel-like iterative vertex-parallel abstraction.  The
   * user-defined vertex-program `vprog` is executed in parallel on
   * each vertex receiving any inbound messages and computing a new
   * value for the vertex.  The `sendMsg` function is then invoked on
   * all out-edges and is used to compute an optional message to the
   * destination vertex. The `mergeMsg` function is a commutative
   * associative function used to combine messages destined to the
   * same vertex.
   *
   * On the first iteration all vertices receive the `initialMsg` and
   * on subsequent iterations if a vertex does not receive a message
   * then the vertex-program is not invoked.
   *
   * This function iterates until there are no remaining messages, or
   * for `maxIterations` iterations.
   *
   * @tparam VD the vertex data type
   * @tparam ED the edge data type
   * @tparam A the Pregel message type
   *
   * @param graph the input graph.
   *
   * @param initialMsg the message each vertex will receive at the first
   * iteration
   *
   * @param maxIterations the maximum number of iterations to run for
   *
   * @param activeDirection the direction of edges incident to a vertex that received a message in
   * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only
   * out-edges of vertices that received a message in the previous round will run. The default is
   * `EdgeDirection.Either`, which will run `sendMsg` on edges where either side received a message
   * in the previous round. If this is `EdgeDirection.Both`, `sendMsg` will only run on edges where
   * *both* vertices received a message.
   *
   * @param vprog the user-defined vertex program which runs on each
   * vertex and receives the inbound message and computes a new vertex
   * value.  On the first iteration the vertex program is invoked on
   * all vertices and is passed the default message.  On subsequent
   * iterations the vertex program is only invoked on those vertices
   * that receive messages.
   *
   * @param sendMsg a user supplied function that is applied to out
   * edges of vertices that received messages in the current
   * iteration
   *
   * @param mergeMsg a user supplied function that takes two incoming
   * messages of type A and merges them into a single message of type
   * A.  ''This function must be commutative and associative and
   * ideally the size of A should not increase.''
   *
   * @return the resulting graph at the end of the computation
   *
   */
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] 


The remaining parameters are explained in the official comments in the source.
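
To make the three functions concrete, here is a minimal sketch of single-source shortest paths written against the `Pregel.apply` signature above. It assumes a local Spark setup; the object name `ShortestPathsExample`, the four-vertex sample graph, and the choice of vertex 1 as the source are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, Pregel, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object; Pregel, Graph and EdgeTriplet come from GraphX.
object ShortestPathsExample {
  def distances(sc: SparkContext): Map[VertexId, Double] = {
    // Edge attribute = edge length.
    val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
    val sourceId: VertexId = 1L
    // Vertex attribute = best known distance from the source.
    val initial: Graph[Double, Double] = Graph.fromEdges(edges, 0.0)
      .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val result = Pregel(initial, Double.PositiveInfinity)(
      // vprog: keep the smaller of the current distance and the incoming one.
      (id: VertexId, dist: Double, msg: Double) => math.min(dist, msg),
      // sendMsg: offer the destination a shorter distance, if there is one.
      (t: EdgeTriplet[Double, Double]) =>
        if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr))
        else Iterator.empty,
      // mergeMsg: of two offers for the same vertex, keep the smaller.
      (a: Double, b: Double) => math.min(a, b))
    result.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ShortestPathsExample")
    val sc = new SparkContext(conf)
    try distances(sc).foreach(println) finally sc.stop()
  }
}
```

Here the message type A is Double, the initial message is positive infinity so the first vprog call leaves every distance unchanged, and the computation stops on its own once no edge can offer a shorter distance.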

A diagram would make this clearer, but time is limited; this is a placeholder to be filled in later.

2) The Pregel function body:

def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
     (graph: Graph[VD, ED],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)
     (vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED] =
  {
    require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
      s" but got ${maxIterations}")

    var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
    // compute the messages
    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
    var activeMessages = messages.count()
    // Loop
    var prevG: Graph[VD, ED] = null
    var i = 0
    while (activeMessages > 0 && i < maxIterations) {
      // Receive the messages and update the vertices.
      prevG = g
      g = g.joinVertices(messages)(vprog).cache()

      val oldMessages = messages
      // Send new messages, skipping edges where neither side received a message. We must cache
      // messages so it can be materialized on the next line, allowing us to uncache the previous
      // iteration.
      messages = GraphXUtils.mapReduceTriplets(
        g, sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache()
      // The call to count() materializes `messages` and the vertices of `g`. This hides oldMessages
      // (depended on by the vertices of g) and the vertices of prevG (depended on by oldMessages
      // and the vertices of g).
      activeMessages = messages.count()

      logInfo("Pregel finished iteration " + i)

      // Unpersist the RDDs hidden by newly-materialized RDDs
      oldMessages.unpersist(blocking = false)
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
      // count the iteration
      i += 1
    }
    messages.unpersist(blocking = false)
    g
  } // end of apply

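The function body is actually easy to follow.

Step 1: vprog updates the attribute of every vertex in the graph using the initial (default) message.

Step 2: sendMsg and mergeMsg send messages along the edges and merge the messages addressed to the same vertex, producing an RDD[(VertexId, message)]. Of course, not every vertex receives a message; that depends on the logic of the user-supplied sendMsg.

Step 3: count the vertices that received a message (the active vertices). If at least one vertex received a message and the current iteration count is below the maximum, call vprog on the vertices that received messages to update their attributes, then compute the next round of messages.

Step 4: repeat until no messages remain or maxIterations is reached.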
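
The four steps can be re-created with public APIs in a few lines. The sketch below is a simplified stand-in, not the library's implementation: it uses `aggregateMessages` in place of the internal `GraphXUtils.mapReduceTriplets`, evaluates the send function on every edge instead of honoring `activeDirection`, and skips the unpersist bookkeeping. The object name `MiniPregel`, the method `propagateMax`, and the sample data are assumptions for illustration. It pushes the largest vertex value downstream along the edges.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Hypothetical, simplified re-creation of the Pregel loop using public APIs.
object MiniPregel {
  // Returns the final vertex values and the number of loop iterations executed.
  def propagateMax(sc: SparkContext, maxIterations: Int = 10): (Map[VertexId, Int], Int) = {
    val vertices: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 5), (2L, 1), (3L, 2)))
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
    val graph: Graph[Int, Int] = Graph(vertices, edges)

    // vprog: keep the larger of the current value and the incoming message.
    val vprog = (id: VertexId, value: Int, msg: Int) => math.max(value, msg)
    // sendMsg + mergeMsg: send the source value downstream when it is larger.
    def messagesOf(g: Graph[Int, Int]): VertexRDD[Int] = g.aggregateMessages[Int](
      ctx => if (ctx.srcAttr > ctx.dstAttr) ctx.sendToDst(ctx.srcAttr),
      (a, b) => math.max(a, b))

    // Step 1: apply vprog to every vertex with the initial message.
    var g = graph.mapVertices((id, value) => vprog(id, value, Int.MinValue)).cache()
    // Step 2: compute the first round of messages.
    var messages = messagesOf(g)
    var active = messages.count()
    var i = 0
    // Steps 3 and 4: update vertices that received a message, resend, repeat.
    while (active > 0 && i < maxIterations) {
      g = g.joinVertices(messages)(vprog).cache()
      messages = messagesOf(g).cache()
      active = messages.count()
      i += 1
    }
    (g.vertices.collect().toMap, i)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MiniPregel")
    val sc = new SparkContext(conf)
    try println(propagateMax(sc)) finally sc.stop()
  }
}
```

On the chain 1 -> 2 -> 3 with values 5, 1, 2, the value 5 reaches vertex 2 in the first loop iteration and vertex 3 in the second, after which no edge produces a message and the loop exits.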


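IV. Commonly used graph APIs

(I only list the graph API functions I used while implementing the Louvain algorithm, and I will add new ones as I run into them. The code should be clear enough not to need long explanations. VertexData is the vertex attribute class from that implementation; its definition is not shown here.)

1) triplets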

    // Keep only edges whose endpoints belong to different communities, keyed by
    // the (smaller community id, larger community id) pair. The result is a
    // pair RDD, so it is typed accordingly rather than as RDD[Edge[Double]].
    val edges: RDD[((VertexId, VertexId), Double)] = graph.triplets
      .filter(edgeTriplet => edgeTriplet.srcAttr._cId != edgeTriplet.dstAttr._cId)
      .map { edgeTriplet =>
        val srcCId: VertexId = edgeTriplet.srcAttr._cId
        val dstCId: VertexId = edgeTriplet.dstAttr._cId
        val weight: Double = edgeTriplet.attr
        val minVertexId: VertexId = math.min(srcCId, dstCId)
        val maxVertexId: VertexId = math.max(srcCId, dstCId)
        ((minVertexId, maxVertexId), weight)
      }

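2) zip (both RDDs must have the same number of partitions and the same number of elements in each partition)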

     val changeCount: Long = graph.vertices.zip(maxChangeInfo).filter {
        case ((vId1: VertexId, vertex: VertexData), (vId2: VertexId, cId: VertexId, maxModularityChange: Double)) =>
          vertex._cId != cId
      }.count()
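
Because `zip` pairs elements purely by position, it is only safe when the second RDD lines up with the first, for example when it was derived from it with `map`. The sketch below compares `zip` with a key-based `join` on plain pair RDDs; the object name `ZipVsJoinExample` and the community-id data are illustrative assumptions, not the Louvain code above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.VertexId
import org.apache.spark.rdd.RDD

// Hypothetical example object comparing positional zip with a key-based join.
object ZipVsJoinExample {
  // Returns (changes counted via zip, changes counted via join).
  def changedCounts(sc: SparkContext): (Long, Long) = {
    // (vertex id, current community id), spread over two partitions.
    val current: RDD[(VertexId, Long)] =
      sc.parallelize(Seq((1L, 1L), (2L, 2L), (3L, 3L)), numSlices = 2)
    // Derived with map, so partition count and per-partition sizes match.
    val proposed: RDD[(VertexId, Long)] =
      current.map { case (id, cId) => (id, if (id == 3L) 1L else cId) }

    val viaZip = current.zip(proposed)
      .filter { case ((_, oldCId), (_, newCId)) => oldCId != newCId }
      .count()
    // join matches by vertex id and does not depend on element order.
    val viaJoin = current.join(proposed)
      .filter { case (_, (oldCId, newCId)) => oldCId != newCId }
      .count()
    (viaZip, viaJoin)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ZipVsJoinExample")
    val sc = new SparkContext(conf)
    try println(changedCounts(sc)) finally sc.stop()
  }
}
```

Both counts agree here because `proposed` was produced by `map` over `current`. When the two RDDs come from unrelated computations, the join is the safer choice at the cost of a shuffle.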

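3) connectedComponents (a word of explanation: this function finds the connected components of the graph and labels every vertex with the smallest vertex id in its component, so vertices with the same label belong to the same connected component)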

      val newMaxChangeInfo: VertexRDD[VertexId] = Graph.fromEdgeTuples(maxChangeInfo.map {
        case (vId: VertexId, cId: VertexId, maxModularityChange: Double) => (vId, cId)
      }, 0)
        .connectedComponents()
        .vertices
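
A minimal sketch of the labeling behavior on a graph with two components, assuming a local Spark setup. The object name `ConnectedComponentsExample` and the edge list are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object, not part of GraphX.
object ConnectedComponentsExample {
  def labels(sc: SparkContext): Map[VertexId, VertexId] = {
    // Two components: {1, 2, 3} and {10, 11}.
    val rawEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq(
      (1L, 2L), (2L, 3L), (10L, 11L)))
    Graph.fromEdgeTuples(rawEdges, 0)
      .connectedComponents() // vertex attribute becomes the component label
      .vertices
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("ConnectedComponentsExample")
    val sc = new SparkContext(conf)
    try labels(sc).foreach(println) finally sc.stop()
  }
}
```

Vertices 1, 2, and 3 are labeled 1, and vertices 10 and 11 are labeled 10, the smallest id in each component.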


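4) joinVertices (updates vertex attributes. Vertices with no matching entry in the input RDD keep their old attribute, and the attribute type cannot change. With outerJoinVertices the map function receives an Option, which is None for unmatched vertices, so it can supply a default value and may change the attribute type)

    // Note: in this call maxChangeInfo has to be an RDD[(VertexId, VertexId)] of
    // (vertex id, community id) pairs, because the map function takes cId: VertexId.
    val updateInfoByCId: Graph[VertexData, Double] = graph.joinVertices(maxChangeInfo)((vId: VertexId, vertexData: VertexData, cId: VertexId) => {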
      val newVertexData: VertexData = new VertexData(vId, cId)
      newVertexData._degree = vertexData._degree
      newVertexData._innerDegree = vertexData._innerDegree
      newVertexData._innerVertices = vertexData._innerVertices
      newVertexData
    })


5) outerJoinVertices

    val louvainG = initG.outerJoinVertices(vertexAttr)((vId: VertexId, oldVertexAttr: None.type, newVertexAttr: Option[Double]) => {
      val vertexData: VertexData = new VertexData(vId, vId)
      // Vertices with no entry in vertexAttr get a weight of 0.0.
      val weights: Double = newVertexAttr.getOrElse(0.0)
      vertexData._degree = weights
      vertexData._innerVertices += vId
      vertexData._commVertices += vId
      vertexData
    })
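
To contrast the two join functions on the same input, here is a minimal sketch assuming a local Spark setup. The object name `JoinVerticesExample`, the Int attributes, and the `"none"` placeholder are illustrative assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical example object contrasting joinVertices and outerJoinVertices.
object JoinVerticesExample {
  def compare(sc: SparkContext): (Map[VertexId, Int], Map[VertexId, String]) = {
    val rawEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq((1L, 2L), (2L, 3L)))
    // Vertices 1, 2, 3, each with attribute 0.
    val graph: Graph[Int, Int] = Graph.fromEdgeTuples(rawEdges, 0)
    // Only vertex 1 has an update.
    val updates: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 10)))

    // joinVertices: same attribute type; unmatched vertices keep their value.
    val joined: Graph[Int, Int] = graph.joinVertices(updates)(
      (id: VertexId, old: Int, update: Int) => old + update)
    // outerJoinVertices: the function sees an Option and may change the type.
    val outer: Graph[String, Int] = graph.outerJoinVertices(updates)(
      (id: VertexId, old: Int, update: Option[Int]) =>
        update.map(_.toString).getOrElse("none"))

    (joined.vertices.collect().toMap, outer.vertices.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("JoinVerticesExample")
    val sc = new SparkContext(conf)
    try println(compare(sc)) finally sc.stop()
  }
}
```

With joinVertices, vertices 2 and 3 keep their attribute 0. With outerJoinVertices, the same vertices go through the map function with None, which here turns them into "none" and changes the vertex attribute type from Int to String.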





