<Zhuuu_ZZ> Graph Data Analysis with Spark GraphX

I. Why Graph Computing

  • Much big data naturally arrives in the form of large-scale graphs or networks
  • Big data that is not graph-structured is often converted into a graph model for analysis
  • The graph data structure expresses the relationships among data items well

II. Basic Concepts of a Graph

  • A graph is a network-like data structure composed of a set of vertices (vertex) and a set of relationships between them (edges, edge)
    • Usually written as a 2-tuple: Graph=(V,E)
    • Can model the relationships between entities
  • Application scenarios
    • Finding the shortest path in map applications
    • Social network relationships
    • Hyperlink relationships between web pages

III. Graph Terminology

1. Vertices and Edges

  • Vertex
  • Edge
Graph=(V,E)
V={v1,v2,v3}
E={(v1,v2),(v1,v3),(v2,v3)}

(figure: the three-vertex example graph)
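
A minimal GraphX sketch of this same three-vertex graph (the attribute values are placeholders; the API itself is covered in Section V):

import org.apache.spark.graphx._
val v=sc.makeRDD(Seq((1L,"v1"),(2L,"v2"),(3L,"v3")))
val e=sc.makeRDD(Seq(Edge(1L,2L,0),Edge(1L,3L,0),Edge(2L,3L,0)))
val g=Graph(v,e)  //Graph=(V,E)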

2. Directed and Undirected Graphs

  • Directed graph
G=(V,E)
V={A,B,C,D,E}
E={<A,B>,<B,C>,<B,D>,<C,E>,<D,A>,<E,D>}

(figure: directed graph example)

  • Undirected graph
G=(V,E)
V={A,B,C,D,E}
E={(A,B),(A,D),(B,C),(B,D),(C,E),(D,E)}

(figure: undirected graph example)

3. Cyclic and Acyclic Graphs

  • Cyclic graph

    • Contains a loop (cycle) formed by a sequence of connected vertices
      (figure: a cyclic graph)
  • Acyclic graph

    • A DAG is a directed acyclic graph
      (figure: a DAG)

4. Degree (degrees)

  • Degree: the number of edges incident to a vertex
    • Out-degree (outDegrees): the number of edges pointing from the current vertex to other vertices
    • In-degree (inDegrees): the number of edges pointing from other vertices to the current vertex
      (figure: in-degree and out-degree example)
  • Inspecting graph information
class Graph[VD, ED] {
  val numEdges: Long  //number of edges
  val numVertices: Long  //number of vertices
  val inDegrees: VertexRDD[Int] //in-degrees
  val outDegrees: VertexRDD[Int] //out-degrees
  val degrees: VertexRDD[Int]  //degrees
}
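
A hedged sketch of using these members, assuming a Graph value named graph such as the one built in Section V:

graph.degrees.collect  //(vertexId, degree) pairs; vertices with no edges at all are absent
graph.degrees.reduce((a,b)=>if(a._2>b._2) a else b)  //the vertex with the largest degree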

IV. The Classic Graph Representation: the Adjacency Matrix

(figure: adjacency matrix of the undirected graph above)

  • For each edge, the corresponding matrix cell holds 1
  • For each self-loop, the corresponding cell holds 2, which makes it easy to obtain a vertex's degree by summing across its row or column
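
A minimal plain-Scala sketch of this representation for the undirected graph above (the variable names are illustrative):

val labels=Vector("A","B","C","D","E")
val edges=Seq((0,1),(0,3),(1,2),(1,3),(2,4),(3,4))  //index pairs for the edge set E
val matrix=Array.fill(5,5)(0)
for((i,j)<-edges){matrix(i)(j)+=1; matrix(j)(i)+=1}  //a self-loop (i==j) would add 2 to a single cell
val degrees=matrix.map(_.sum)  //row sums give each vertex's degree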

V. GraphX API

1. Creating a Graph from Two RDDs

//Import the Spark GraphX package
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

//Create the vertices RDD
scala> val vertices=sc.makeRDD(Seq((1L,1),(2L,2),(3L,3)))
vertices: org.apache.spark.rdd.RDD[(Long, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:27

//Create the edges RDD
scala> val edges=sc.makeRDD(Seq(Edge(1L,2L,1),Edge(2L,3L,2)))
edges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = ParallelCollectionRDD[1] at makeRDD at <console>:27

//Create the Graph object
scala> val graph=Graph(vertices,edges)
graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@1caff0db

scala> graph.
aggregateMessages         getCheckpointFiles   ops                    staticPageRank
cache                     groupEdges           outDegrees             staticParallelPersonalizedPageRank
checkpoint                inDegrees            outerJoinVertices      staticPersonalizedPageRank
collectEdges              isCheckpointed       pageRank               stronglyConnectedComponents
collectNeighborIds        joinVertices         partitionBy            subgraph
collectNeighbors          mapEdges             persist                triangleCount
connectedComponents       mapTriplets          personalizedPageRank   triplets
convertToCanonicalEdges   mapVertices          pickRandomVertex       unpersist
degrees                   mask                 pregel                 unpersistVertices
edges                     numEdges             removeSelfEdges        vertices
filter                    numVertices          reverse

//Get the graph's vertices
scala> graph.vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((1,1), (2,2), (3,3))

//Get the graph's edges
scala> graph.edges.collect
res3: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,2,1), Edge(2,3,2))

//Get the graph's triplets
scala> graph.triplets.collect
res4: Array[org.apache.spark.graphx.EdgeTriplet[Int,Int]] = Array(((1,1),(2,2),1), ((2,2),(3,3),2))
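
A hedged aside (not part of the original session): the degree members listed in Section III can be queried on this same graph; expected values are shown as comments.

scala> graph.numVertices  //3
scala> graph.numEdges     //2
scala> graph.inDegrees.collect  //Array((2,1), (3,1)) (order may vary); vertex 1 has no incoming edge, so it is absent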

2. Creating a Graph by Loading a File

[root@hadoopwei kb09file]# vi followers.txt
[root@hadoopwei kb09file]# cat followers.txt
2 3
3 4
1 4
2 4

scala> val graphLoad=GraphLoader.edgeListFile(sc,"file:///kb09file/followers.txt")
graphLoad: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@cf7146

scala> graphLoad.vertices.collect
res6: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,1), (2,1), (1,1), (3,1))

scala> graphLoad.edges.collect
res7: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,4,1), Edge(2,3,1), Edge(3,4,1), Edge(2,4,1))

scala> graphLoad.triplets.collect
res9: Array[org.apache.spark.graphx.EdgeTriplet[Int,Int]] = Array(((1,1),(4,1),1), ((2,1),(3,1),1), ((3,1),(4,1),1), ((2,1),(4,1),1))
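
A hedged note: GraphLoader.edgeListFile also accepts optional parameters, e.g. to reorient every edge in the positive direction (srcId < dstId) and to control the number of edge partitions:

scala> val canonical=GraphLoader.edgeListFile(sc,"file:///kb09file/followers.txt",canonicalOrientation=true,numEdgePartitions=2)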

3. Building a User-Relationship Property Graph

  • Vertex attributes
    • Username
    • Occupation
  • Edge attribute
    • Collaboration relationship
      (figure: the user-relationship property graph)
scala> val users=sc.parallelize(Array((3L,("rxin","student")),(7L,("jgonzal","postdoc")),(5L,("franklin","professor")),(2L,("istorica","professor"))))
users: org.apache.spark.rdd.RDD[(Long, (String, String))] = ParallelCollectionRDD[52] at parallelize at <console>:27

scala>  val relationship=sc.parallelize(Array(Edge(3L,7L,"Colla"),Edge(5L,3L,"Advison"),Edge(2L,5L,"Colleague"),Edge(5L,7L,"Pi")))
relationship: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[53] at parallelize at <console>:27

scala> val graphUser=Graph(users,relationship)
graphUser: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@47ebbee1

scala> graphUser.edges.collect
res10: Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(3,7,Colla), Edge(5,3,Advison), Edge(2,5,Colleague), Edge(5,7,Pi))

scala> graphUser.vertices.collect
res11: Array[(org.apache.spark.graphx.VertexId, (String, String))] = Array((2,(istorica,professor)), (3,(rxin,student)), (5,(franklin,professor)), (7,(jgonzal,postdoc)))

scala> graphUser.triplets.collect
res12: Array[org.apache.spark.graphx.EdgeTriplet[(String, String),String]] = Array(((3,(rxin,student)),(7,(jgonzal,postdoc)),Colla), ((5,(franklin,professor)),(3,(rxin,student)),Advison), ((2,(istorica,professor)),(5,(franklin,professor)),Colleague), ((5,(franklin,professor)),(7,(jgonzal,postdoc)),Pi))
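
A hedged aside (not in the original transcript): the vertex attributes can be pattern-matched directly, e.g. to keep only the professors:

scala> graphUser.vertices.filter{case (id,(name,pos))=>pos=="professor"}.collect
//expected: Array((2,(istorica,professor)), (5,(franklin,professor)))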

4. Building a User Social Network

  • Vertices: username, age
  • Edges: number of likes
    (figure: the user social network)
  • Properties
    (figure: vertex and edge property tables)
  • Code
val info=sc.makeRDD(Array((1L,("Alice",28)),(2L,("Bob",27)),(3L,("Charlie",65)),(4L,("David",42)),(5L,("Ed",55)),(6L,("Fran",50))))

val rela=sc.makeRDD(Array(Edge(2L,1L,7),Edge(3L,2L,4),Edge(4L,1L,1),Edge(2L,4L,2),Edge(5L,2L,2),Edge(5L,3L,8),Edge(5L,6L,3),Edge(3L,6L,3)))

scala> val userCallGraph=Graph(info,rela)
userCallGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@5c2e46c

scala> userCallGraph.vertices.collect
res13: Array[(org.apache.spark.graphx.VertexId, (String, Int))] = Array((1,(Alice,28)), (2,(Bob,27)), (3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)))

scala> userCallGraph.edges.collect
res14: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(2,1,7), Edge(3,2,4), Edge(4,1,1), Edge(2,4,2), Edge(5,2,2), Edge(5,3,8), Edge(5,6,3), Edge(3,6,3))

scala> userCallGraph.triplets.collect
res15: Array[org.apache.spark.graphx.EdgeTriplet[(String, Int),Int]] = Array(((2,(Bob,27)),(1,(Alice,28)),7), ((3,(Charlie,65)),(2,(Bob,27)),4), ((4,(David,42)),(1,(Alice,28)),1), ((2,(Bob,27)),(4,(David,42)),2), ((5,(Ed,55)),(2,(Bob,27)),2), ((5,(Ed,55)),(3,(Charlie,65)),8), ((5,(Ed,55)),(6,(Fran,50)),3), ((3,(Charlie,65)),(6,(Fran,50)),3))

scala> userCallGraph.vertices.filter(v=>v._2._2>30).collect
res16: Array[(org.apache.spark.graphx.VertexId, (String, Int))] = Array((3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)))

scala> userCallGraph.vertices.filter{ case(id,(name,age))=>age>30}
res22: org.apache.spark.graphx.VertexRDD[(String, Int)] = VertexRDDImpl[109] at RDD at VertexRDD.scala:57

scala> userCallGraph.vertices.filter{ case(id,(name,age))=>age>30}.collect.foreach(println)
(3,(Charlie,65))
(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))

scala> userCallGraph.vertices.filter(v=>v._2._2>30).collect.foreach(x=>{println(x._2._2)})
65
42
55
50

scala> userCallGraph.triplets
res23: org.apache.spark.rdd.RDD[org.apache.spark.graphx.EdgeTriplet[(String, Int),Int]] = MapPartitionsRDD[95] at mapPartitions at GraphImpl.scala:48  //triplets are EdgeTriplet objects, not tuples

scala> userCallGraph.triplets.foreach(println)
((5,(Ed,55)),(2,(Bob,27)),2)
((2,(Bob,27)),(4,(David,42)),2)
((3,(Charlie,65)),(6,(Fran,50)),3)
((5,(Ed,55)),(3,(Charlie,65)),8)
((3,(Charlie,65)),(2,(Bob,27)),4)
((5,(Ed,55)),(6,(Fran,50)),3)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(1,(Alice,28)),7)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.dstAttr))
(Alice,28)
(Bob,27)
(Alice,28)
(David,42)
(Bob,27)
(Charlie,65)
(Charlie,65)
(Fran,50)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.srcAttr))
(Bob,27)
(Charlie,65)
(David,42)
(Bob,27)
(Ed,55)
(Ed,55)
(Ed,55)
(Charlie,65)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.attr))
7
4
1
2
2
8
3
3

scala> userCallGraph.triplets.collect.foreach(x=>println(x.srcAttr._1+" like " + x.dstAttr._1 + " stage:"+x.attr))
Bob like Alice stage:7
Charlie like Bob stage:4
David like Alice stage:1
Bob like David stage:2
Ed like Bob stage:2
Ed like Charlie stage:8
Ed like Fran stage:3
Charlie like Fran stage:3

scala> userCallGraph.triplets.filter(x=>x.attr>5).collect.foreach(x=>println(x.srcAttr._1+" like " + x.dstAttr._1 + " stage:"+x.attr))
Bob like Alice stage:7
Ed like Charlie stage:8

VI. Graph Operators

1. Property Operators: mapVertices & mapEdges

  • mapVertices
    • def mapVertices[VD2](map: (org.apache.spark.graphx.VertexId, (String, Int)) => VD2)(implicit evidence$3: scala.reflect.ClassTag[VD2],implicit eq: =:=[(String, Int),VD2]): org.apache.spark.graphx.Graph[VD2,Int]

    • Modifies the attributes of vertices

import org.apache.spark.graphx._

val info=sc.makeRDD(Array((1L,("Alice",28)),(2L,("Bob",27)),(3L,("Charlie",65)),(4L,("David",42)),(5L,("Ed",55)),(6L,("Fran",50))))

val rela=sc.makeRDD(Array(Edge(2L,1L,7),Edge(3L,2L,4),Edge(4L,1L,1),Edge(2L,4L,2),Edge(5L,2L,2),Edge(5L,3L,8),Edge(5L,6L,3),Edge(3L,6L,3)))

scala> val userCallGraph=Graph(info,rela)
userCallGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@21dc1d99

//mapVertices modifies the vertex attributes
scala> val t1_graph=userCallGraph.mapVertices{case (id,(name,age))=>(id,name)}
//The two definitions are equivalent
scala> val t2_graph=userCallGraph.mapVertices((id,attr)=>(id,attr._1))

scala> t1_graph.vertices.collect.foreach(println)
(1,(1,Alice))
(2,(2,Bob))
(3,(3,Charlie))
(4,(4,David))
(5,(5,Ed))
(6,(6,Fran))

scala> t2_graph.vertices.collect.foreach(println)
(1,(1,Alice))
(2,(2,Bob))
(3,(3,Charlie))
(4,(4,David))
(5,(5,Ed))
(6,(6,Fran))

  • mapEdges
    • def mapEdges[ED2](map: org.apache.spark.graphx.Edge[Int] => ED2)(implicit evidence$4: scala.reflect.ClassTag[ED2]): org.apache.spark.graphx.Graph[(String, Int),ED2]

    • Modifies the attributes of edges

//mapEdges transforms only the edge attribute, so wrapping the result in Edge(...) just nests a whole Edge object inside attr (compare t4_graph below)
scala> val t3_graph=userCallGraph.mapEdges(e=>Edge(e.srcId,e.dstId,e.attr*7.0))
t3_graph: org.apache.spark.graphx.Graph[(String, Int),org.apache.spark.graphx.Edge[Double]] = org.apache.spark.graphx.impl.GraphImpl@3b848cf3

scala> t3_graph.edges.collect.foreach(println)
Edge(2,1,Edge(2,1,49.0))
Edge(3,2,Edge(3,2,28.0))
Edge(4,1,Edge(4,1,7.0))
Edge(2,4,Edge(2,4,14.0))
Edge(5,2,Edge(5,2,14.0))
Edge(5,3,Edge(5,3,56.0))
Edge(5,6,Edge(5,6,21.0))
Edge(3,6,Edge(3,6,21.0))

scala> val t4_graph=userCallGraph.mapEdges(e=>e.attr*7.0)
t4_graph: org.apache.spark.graphx.Graph[(String, Int),Double] = org.apache.spark.graphx.impl.GraphImpl@174600ea

scala> t4_graph.edges.collect.foreach(println)
Edge(2,1,49.0)
Edge(3,2,28.0)
Edge(4,1,7.0)
Edge(2,4,14.0)
Edge(5,2,14.0)
Edge(5,3,56.0)
Edge(5,6,21.0)
Edge(3,6,21.0)

2. Structural Operators: reverse & subgraph

  • reverse
    • def reverse: org.apache.spark.graphx.Graph[(String, Int),Int]
    • Reverses the direction of every edge in the graph, i.e. A->B becomes B->A
scala> userCallGraph.triplets.collect.foreach(println)
((2,(Bob,27)),(1,(Alice,28)),7)
((3,(Charlie,65)),(2,(Bob,27)),4)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob,27)),2)
((5,(Ed,55)),(3,(Charlie,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie,65)),(6,(Fran,50)),3)

scala> val reverse_graph=userCallGraph.reverse
reverse_graph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@21b64cf

scala> reverse_graph.triplets.collect.foreach(println)
((1,(Alice,28)),(2,(Bob,27)),7)
((2,(Bob,27)),(3,(Charlie,65)),4)
((1,(Alice,28)),(4,(David,42)),1)
((4,(David,42)),(2,(Bob,27)),2)
((2,(Bob,27)),(5,(Ed,55)),2)
((3,(Charlie,65)),(5,(Ed,55)),8)
((6,(Fran,50)),(5,(Ed,55)),3)
((6,(Fran,50)),(3,(Charlie,65)),3)
  • subgraph
    • def subgraph(epred: org.apache.spark.graphx.EdgeTriplet[(String, Int),Int] => Boolean,vpred: (org.apache.spark.graphx.VertexId, (String, Int)) => Boolean): org.apache.spark.graphx.Graph[(String, Int),Int]
    • Keeps only the vertices and edges that satisfy the predicates
scala> userCallGraph.triplets.collect.foreach(println)
((2,(Bob,27)),(1,(Alice,28)),7)
((3,(Charlie,65)),(2,(Bob,27)),4)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob,27)),2)
((5,(Ed,55)),(3,(Charlie,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie,65)),(6,(Fran,50)),3)

//epred keeps the triplets whose chosen parts (srcAttr, dstAttr, etc.) satisfy the condition
scala> userCallGraph.subgraph(epred=(ep)=>{println("epred",ep.srcAttr);ep.srcAttr._2<30}).triplets.collect.foreach(println)
(epred,(Charlie,65))
(epred,(Ed,55))
(epred,(Bob,27))
(epred,(Ed,55))
(epred,(Charlie,65))
(epred,(Bob,27))
(epred,(Ed,55))
(epred,(David,42))
((2,(Bob,27)),(1,(Alice,28)),7)
((2,(Bob,27)),(4,(David,42)),2)

//vpred keeps an edge only when both its source and destination vertices satisfy the condition
scala> val sub_graph=userCallGraph.subgraph(vpred=(id,attr)=>{println("subgraph in",id,attr);attr._2<30})
sub_graph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@1018ba99

scala> sub_graph.triplets.collect.foreach(println)
(subgraph in,5,(Ed,55))
(subgraph in,2,(Bob,27))
(subgraph in,4,(David,42))
(subgraph in,5,(Ed,55))
(subgraph in,5,(Ed,55))
(subgraph in,4,(David,42))
(subgraph in,3,(Charlie,65))
(subgraph in,2,(Bob,27))
(subgraph in,1,(Alice,28))
(subgraph in,3,(Charlie,65))
((2,(Bob,27)),(1,(Alice,28)),7)
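
A hedged sketch (not in the original transcript): both predicates can be combined in one call; with the conditions below only Bob -> Alice survives (edge attr > 3 and both endpoints younger than 60):

scala> val both=userCallGraph.subgraph(epred=(ep)=>ep.attr>3,vpred=(id,attr)=>attr._2<60)
scala> both.triplets.collect.foreach(println)
//((2,(Bob,27)),(1,(Alice,28)),7)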

3. joinVertices

  • Loads data from an external RDD to modify vertex attributes; vertices with no match in the RDD keep their original attributes (David, Ed, and Fran below), and keys with no matching vertex (7L below) are ignored
  • def joinVertices[U](table: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, U)])(mapFunc: (org.apache.spark.graphx.VertexId, (String, Int), U) => (String, Int))(implicit evidence$3: scala.reflect.ClassTag[U]): org.apache.spark.graphx.Graph[(String, Int),Int]
scala> val two=sc.parallelize(Array((1L,"kgc.cn"),(2L,"qq.com"),(3L,"163.com"),(7L,"sohu.com")))
two: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[85] at parallelize at <console>:27

scala> val joinGraph=userCallGraph.joinVertices(two)((id,v,cmpy)=>(v._1+"@"+cmpy,v._2))
joinGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@41abf20f
//Equivalent to
scala> val joinGraph1=userCallGraph.joinVertices(two){case (id,(name,age),cmpy)=>(name+"@"+cmpy,age)}

scala> joinGraph.triplets.collect.foreach(println)
((2,(Bob@qq.com,27)),(1,(Alice@kgc.cn,28)),7)
((3,(Charlie@163.com,65)),(2,(Bob@qq.com,27)),4)
((4,(David,42)),(1,(Alice@kgc.cn,28)),1)
((2,(Bob@qq.com,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob@qq.com,27)),2)
((5,(Ed,55)),(3,(Charlie@163.com,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie@163.com,65)),(6,(Fran,50)),3)

scala>joinGraph.vertices.collect.foreach(println)
(1,(Alice@kgc.cn,28))
(2,(Bob@qq.com,27))
(3,(Charlie@163.com,65))
(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))

4. outerJoinVertices

  • Also modifies vertex attributes, but the map function receives an Option (None for vertices missing from the other RDD) and the attribute type may change
  • def outerJoinVertices[U, VD2](other: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, U)])(mapFunc: (org.apache.spark.graphx.VertexId, (String, Int), Option[U]) => VD2)(implicit evidence$13: scala.reflect.ClassTag[U],implicit evidence$14: scala.reflect.ClassTag[VD2],implicit eq: =:=[(String, Int),VD2]): org.apache.spark.graphx.Graph[VD2,Int]
//outerJoinVertices
scala> case class User(name:String,age:Int,inDeg:Int,outDeg:Int)
defined class User

//The first u is the vertex attribute (name, age) of userCallGraph, which gets packed into a User; the u in the second call is therefore a User object
scala> val outJoin=userCallGraph.outerJoinVertices(userCallGraph.inDegrees){case(id,u,indeg)=>User(u._1,u._2,indeg.getOrElse(0),0)}.outerJoinVertices(userCallGraph.outDegrees){case(id,u,outdeg)=>User(u.name,u.age,u.inDeg,outdeg.getOrElse(0))}
outJoin: org.apache.spark.graphx.Graph[User,Int] = org.apache.spark.graphx.impl.GraphImpl@1a1acf9e

scala> outJoin.vertices.collect.foreach(println)
(1,User(Alice,28,2,0))
(2,User(Bob,27,2,2))
(3,User(Charlie,65,1,2))
(4,User(David,42,1,1))
(5,User(Ed,55,0,3))
(6,User(Fran,50,2,0))

//A vertex's in-degree is its number of fans
scala> for((id,prop)<-outJoin.vertices.collect){println(s"User $id is ${prop.name} and is liked by ${prop.inDeg} people")}
User 1 is Alice and is liked by 2 people
User 2 is Bob and is liked by 2 people
User 3 is Charlie and is liked by 1 people
User 4 is David and is liked by 1 people
User 5 is Ed and is liked by 0 people
User 6 is Fran and is liked by 2 people
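
A hedged follow-up (not in the original session): since VertexRDD supports the ordinary RDD operations, the users can be ranked by fan count:

scala> outJoin.vertices.sortBy(_._2.inDeg,ascending=false).first
//one of the inDeg-2 vertices, e.g. (1,User(Alice,28,2,0))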
