GraphX Second-Degree Relationships (Code)

mbshqqb

Article link: https://blog.csdn.net/mbshqqb/article/details/78491104

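A good reference on GraphX's storage model and storage data structures, along with an explanation of second-degree relationships, is:
http://www.dataguru.cn/article-10425-1.html
That page describes the algorithm for computing second-degree relationships. Below I give a concrete Spark GraphX implementation: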

1. First, the dataset (a simple directed graph, one "source,destination" edge per line):

1,2
1,3
1,4
1,5
2,5
4,3
5,6
6,4

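Desired result:
For every vertex i, compute the vertices that can be reached from i within two steps, together with the number of steps needed.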
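
As a cross-check on what "reachable within two steps" means for this dataset, here is a minimal plain-Scala sketch (no Spark) that computes the same tables by brute force. The object name `TwoHopReference` and the helper `twoHop` are made up for illustration; this is not the GraphX implementation, only a reference for the expected tables.

```scala
object TwoHopReference {
  // Edge list copied from the dataset above: (source, destination)
  val edges: Seq[(Long, Long)] =
    Seq((1L, 2L), (1L, 3L), (1L, 4L), (1L, 5L), (2L, 5L), (4L, 3L), (5L, 6L), (6L, 4L))

  // Out-neighbors of every vertex that has at least one outgoing edge
  val out: Map[Long, Set[Long]] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2).toSet }

  // Vertices reachable from v in one or two steps; a vertex reachable both
  // ways keeps the smaller hop count (1)
  def twoHop(v: Long): Map[Long, Int] = {
    val first = out.getOrElse(v, Set.empty[Long])
    val second = first.flatMap(n => out.getOrElse(n, Set.empty[Long])) -- first
    first.map(_ -> 1).toMap ++ second.map(_ -> 2).toMap
  }

  def main(args: Array[String]): Unit = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct.sorted
    vertices.foreach(v => println(s"$v -> ${twoHop(v)}"))
  }
}
```

The GraphX version below arrives at the same tables with two rounds of message passing instead of direct lookups.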

2. The code

package com.zj.graphx

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}
import org.apache.spark.rdd.RDD

import scala.collection.mutable

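object SecondaryDegreeRelationship {
  // Vertex attribute: table of reachable vertex id -> hop count
  case class VD(map:mutable.Map[Long,Int]){}
  // Edge attribute: carries no data
  case class ED(){}
  case class Message(){}
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SecondaryDegreeRelationship").setMaster("local[3]")
    val sparkContext= SparkContext.getOrCreate(conf)
    // Read the "src,dst" lines from the classpath resource and turn each one into an edge
    val edges:RDD[Edge[ED]]=sparkContext.textFile(getClass.getResource("/tweeter/test.csv").getPath).map(line=>{
      val tokens=line.split(",")
      Edge[ED](tokens(0).toLong,tokens(1).toLong,ED())
    })

    // Every vertex starts with an empty table
    val graph:Graph[VD, ED]=Graph.fromEdges[VD,ED](edges,VD(mutable.Map[Long,Int]()))
    // The graph has to be reversed so that each vertex receives messages from the vertices it points to
    val reversalGraph=graph.reverse
    // Round 1: each vertex collects its direct successors with hop count 1
    val degreeRelation_1:VertexRDD[mutable.Map[Long, Int]]=reversalGraph.aggregateMessages[mutable.Map[Long,Int]](triplet=>{
      triplet.sendToDst(triplet.srcAttr.map.+((triplet.srcId,1)))
    },_++_)
    // Round 2: attach the round-1 tables to the vertices, then send each table (hop counts + 1)
    // plus the sender itself (hop count 1) along every edge of the reversed graph
    val degreeRelation_2:VertexRDD[mutable.Map[Long, Int]]=reversalGraph.outerJoinVertices(degreeRelation_1)((vertexId, oldVD, mapOption) =>mapOption.getOrElse(mutable.Map[Long,Int]())).aggregateMessages[mutable.Map[Long,Int]](triplet=>{
      val message=triplet.srcAttr.map(t=>(t._1,t._2+1)).+((triplet.srcId,1))
      triplet.sendToDst(message)
    },(m1:mutable.Map[Long,Int],m2:mutable.Map[Long,Int])=>{
      // Fold m2 into m1, keeping the smaller hop count when both contain the same vertex
      (m1/:m2){case(m,(k,v))=>m+(k->Math.min(v,m.getOrElse(k,v)))}
    })
    // Attach the final tables to the original graph; vertices that received no message get an empty table
    graph.outerJoinVertices(degreeRelation_2)((vertexId, oldVD, mapOption) =>mapOption.getOrElse(mutable.Map[Long,Int]())).vertices.foreach(println)
  }
}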

3. Output

(4,Map(3 -> 1))
(1,Map(2 -> 1, 5 -> 1, 4 -> 1, 3 -> 1, 6 -> 2))
(6,Map(4 -> 1, 3 -> 2))
(3,Map())
(2,Map(5 -> 1, 6 -> 2))
(5,Map(4 -> 2, 6 -> 1))

4. Key points in the code

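1. There are several ways to create a graph:
GraphLoader.edgeListFile(): builds a Graph directly from a file of "srcId dstId" edges. Drawbacks: the vertex and edge attributes of the resulting Graph are both Int, and vertex attributes cannot be read from the file.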

def edgeListFile(
      sc: SparkContext,
      path: String,
      canonicalOrientation: Boolean = false,
      numEdgePartitions: Int = -1,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY)
    : Graph[Int, Int] ={}

Graph.fromEdgeTuples(): as the source shows, every vertex gets the same attribute value, and the edge attributes are always Int.

def fromEdgeTuples[VD: ClassTag](
      rawEdges: RDD[(VertexId, VertexId)],
      defaultValue: VD,
      uniqueEdges: Option[PartitionStrategy] = None,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, Int] = {}

Graph.fromEdges(): edge attributes can be set through edges: RDD[Edge[ED]], but all vertices share the same attribute value.

  def fromEdges[VD: ClassTag, ED: ClassTag](
      edges: RDD[Edge[ED]],
      defaultValue: VD,
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] = {}

Graph(): the most flexible option. Vertex values are set through vertices: RDD[(VertexId, VD)] and edge values through edges: RDD[Edge[ED]].

  def apply[VD: ClassTag, ED: ClassTag](
      vertices: RDD[(VertexId, VD)],
      edges: RDD[Edge[ED]],
      defaultVertexAttr: VD = null.asInstanceOf[VD],
      edgeStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY,
      vertexStorageLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] =

In the end I picked one of them more or less arbitrarily (Graph.fromEdges), making sure the vertex data type is my own VD type.
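
For comparison with the `Graph.fromEdges` call used in the code above, the following is a small sketch of the `Graph()` constructor with per-vertex attributes. The object name `GraphApplySketch`, the string labels, and the `Double` edge weights are all made up for illustration, and the sketch assumes Spark with GraphX is on the classpath and that a local master is acceptable.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphApplySketch {
  // Build a tiny graph whose vertex attributes differ per vertex
  def build(sc: SparkContext): Graph[String, Double] = {
    // Hypothetical labels; only vertices 1 and 2 are listed explicitly
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "one"), (2L, "two")))
    // Hypothetical edge weights
    val edges: RDD[Edge[Double]] =
      sc.parallelize(Seq(Edge(1L, 2L, 0.5), Edge(2L, 5L, 1.0)))
    // Vertex 5 appears only in an edge, so it receives the default attribute
    Graph(vertices, edges, "unknown")
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphApplySketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    build(sc).vertices.collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

With `Graph.fromEdges`, all three vertices would start with the same default value instead.
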
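2. The first aggregateMessages:
The send function (sendMsg) sends each neighbor's id to the current vertex, and the merge function (mergeMsg) combines the data arriving from several neighbors. Consider why the message is a Map rather than a Tuple2: the merged result must have the same type as a single message, and it has to hold every neighbor, not just one.
The result is a VertexRDD carrying the messages. outerJoinVertices then attaches these messages to the original vertices, because they are needed again in the next step. This is also why my VD type wraps a Map (and, of course, the final result is stored in that Map as well).
I used outerJoinVertices rather than joinVertices. The only difference between the two, and the advantage of the former, is that outerJoinVertices lets you choose both the type and the value for vertices that are missing from the second collection. With joinVertices, such vertices keep their original value, so the attribute type cannot change. I used outerJoinVertices here so that the rest of the code can keep using Map, rather than VD, as the message type. Using joinVertices with VD as the message type would work just as well; I chose outerJoinVertices mainly to practice using it.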
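
The difference between the two joins can be seen in a small sketch. The object name `JoinVerticesSketch`, the three-vertex graph, and the update values are made up for illustration; the sketch assumes Spark with GraphX on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object JoinVerticesSketch {
  // Returns the vertex attributes after joinVertices and after outerJoinVertices
  def run(sc: SparkContext): (Map[VertexId, Int], Map[VertexId, String]) = {
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 0), Edge(2L, 3L, 0)))
    // All vertices start with attribute 0
    val graph: Graph[Int, Int] = Graph.fromEdges(edges, 0)
    // Only vertex 1 has an entry in the second collection
    val updates: RDD[(VertexId, Int)] = sc.parallelize(Seq((1L, 10)))

    // joinVertices: the attribute type stays Int; vertices 2 and 3 keep their old value
    val joined: Graph[Int, Int] =
      graph.joinVertices(updates)((id, old, u) => old + u)

    // outerJoinVertices: the attribute type may change (here to String), and the
    // function decides what vertices without an entry receive
    val outer: Graph[String, Int] =
      graph.outerJoinVertices(updates)((id, old, opt) =>
        opt.map(u => s"updated:$u").getOrElse("none"))

    (joined.vertices.collect().toMap, outer.vertices.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JoinVerticesSketch").setMaster("local[2]")
    val sc = SparkContext.getOrCreate(conf)
    val (joined, outer) = run(sc)
    println(joined)
    println(outer)
    sc.stop()
  }
}
```
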
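3. The second aggregateMessages:
Send function: the current vertex sends its routing table, with every hop count increased by 1, to its neighbor, and also sends itself with hop count 1 (this part is important).
Merge function: combines the routing tables a vertex receives into a single table. If several tables contain the same destination, the entry with the smallest hop count is kept.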

The technique used to merge the two maps is short and neat [http://www.cnblogs.com/tugeler/p/5134862.html]
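
To make the second round concrete, here is a plain-Scala sketch (no Spark) that replays it for vertex 1 of the dataset. The names `SecondRoundSketch`, `messageFrom`, and `mergeMin` are made up for illustration, and immutable maps with `foldLeft` stand in for the mutable maps and the `/:` operator used in the real code.

```scala
object SecondRoundSketch {
  // Message a neighbor sends: its own table with every hop count + 1,
  // plus itself at hop count 1
  def messageFrom(srcId: Long, table: Map[Long, Int]): Map[Long, Int] =
    table.map { case (k, v) => (k, v + 1) } + (srcId -> 1)

  // Merge two tables, keeping the smaller hop count for shared destinations
  def mergeMin(m1: Map[Long, Int], m2: Map[Long, Int]): Map[Long, Int] =
    m2.foldLeft(m1) { case (acc, (k, v)) =>
      acc + (k -> math.min(v, acc.getOrElse(k, v)))
    }

  // Round-1 tables of the out-neighbors of vertex 1 (2, 3, 4 and 5)
  val roundOne: Map[Long, Map[Long, Int]] = Map(
    2L -> Map(5L -> 1),
    3L -> Map.empty[Long, Int],
    4L -> Map(3L -> 1),
    5L -> Map(6L -> 1)
  )

  // Table that vertex 1 ends up with after round 2
  def tableForVertexOne: Map[Long, Int] =
    roundOne
      .map { case (id, table) => messageFrom(id, table) }
      .reduce(mergeMin)

  def main(args: Array[String]): Unit =
    println(tableForVertexOne)
}
```

Vertex 5 shows why the minimum matters: vertex 2 reports it at two hops, while vertex 5 reports itself at one hop, and the merge keeps the one-hop entry.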
