<Zhuuu_ZZ> Graph Data Analysis with Spark GraphX

I. Why Graph Computing

  • Much big data naturally arrives in the form of large-scale graphs or networks
  • Big data that is not graph-structured is often converted into a graph model for analysis
  • The graph data structure expresses the relationships among data items well

II. Basic Concepts of a Graph

  • A graph is a network-like data structure composed of a set of vertices (vertex) and a set of relationships between them (edges, edge)
    • Usually written as a 2-tuple: Graph=(V,E)
    • Can model the relationships between entities
  • Application scenarios
    • Finding the shortest path in map applications
    • Social network relationships
    • Hyperlink relationships between web pages

III. Graph Terminology

1. Vertices and Edges

  • Vertex
  • Edge
Graph=(V,E)
V={v1,v2,v3}
E={(v1,v2),(v1,v3),(v2,v3)}

(figure: the three-vertex example graph)
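
A minimal GraphX sketch of this same three-vertex graph (the attribute values are placeholders; the API itself is covered in Section V):

import org.apache.spark.graphx._
val v=sc.makeRDD(Seq((1L,"v1"),(2L,"v2"),(3L,"v3")))
val e=sc.makeRDD(Seq(Edge(1L,2L,0),Edge(1L,3L,0),Edge(2L,3L,0)))
val g=Graph(v,e)  //Graph=(V,E)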

2. Directed and Undirected Graphs

  • Directed graph
G=(V,E)
V={A,B,C,D,E}
E={<A,B>,<B,C>,<B,D>,<C,E>,<D,A>,<E,D>}

(figure: directed graph example)

  • Undirected graph
G=(V,E)
V={A,B,C,D,E}
E={(A,B),(A,D),(B,C),(B,D),(C,E),(D,E)}

(figure: undirected graph example)

3. Cyclic and Acyclic Graphs

  • Cyclic graph

    • Contains a loop (cycle) formed by a sequence of connected vertices
      (figure: a cyclic graph)
  • Acyclic graph

    • A DAG is a directed acyclic graph
      (figure: a DAG)

4. Degree (degrees)

  • Degree: the number of edges incident to a vertex
    • Out-degree (outDegrees): the number of edges pointing from the current vertex to other vertices
    • In-degree (inDegrees): the number of edges pointing from other vertices to the current vertex
      (figure: in-degree and out-degree example)
  • Inspecting graph information
class Graph[VD, ED] {
  val numEdges: Long  //number of edges
  val numVertices: Long  //number of vertices
  val inDegrees: VertexRDD[Int] //in-degrees
  val outDegrees: VertexRDD[Int] //out-degrees
  val degrees: VertexRDD[Int]  //degrees
}
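
A hedged sketch of using these members, assuming a Graph value named graph such as the one built in Section V:

graph.degrees.collect  //(vertexId, degree) pairs; vertices with no edges at all are absent
graph.degrees.reduce((a,b)=>if(a._2>b._2) a else b)  //the vertex with the largest degree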

IV. The Classic Graph Representation: the Adjacency Matrix

(figure: adjacency matrix of the undirected graph above)

  • For each edge, the corresponding matrix cell holds 1
  • For each self-loop, the corresponding cell holds 2, which makes it easy to obtain a vertex's degree by summing across its row or column
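
A minimal plain-Scala sketch of this representation for the undirected graph above (the variable names are illustrative):

val labels=Vector("A","B","C","D","E")
val edges=Seq((0,1),(0,3),(1,2),(1,3),(2,4),(3,4))  //index pairs for the edge set E
val matrix=Array.fill(5,5)(0)
for((i,j)<-edges){matrix(i)(j)+=1; matrix(j)(i)+=1}  //a self-loop (i==j) would add 2 to a single cell
val degrees=matrix.map(_.sum)  //row sums give each vertex's degree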

V. GraphX API

1. Creating a Graph from Two RDDs

//Import the Spark GraphX package
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._

//Create the vertices RDD
scala> val vertices=sc.makeRDD(Seq((1L,1),(2L,2),(3L,3)))
vertices: org.apache.spark.rdd.RDD[(Long, Int)] = ParallelCollectionRDD[0] at makeRDD at <console>:27

//Create the edges RDD
scala> val edges=sc.makeRDD(Seq(Edge(1L,2L,1),Edge(2L,3L,2)))
edges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Int]] = ParallelCollectionRDD[1] at makeRDD at <console>:27

//Create the Graph object
scala> val graph=Graph(vertices,edges)
graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@1caff0db

scala> graph.
aggregateMessages         getCheckpointFiles   ops                    staticPageRank
cache                     groupEdges           outDegrees             staticParallelPersonalizedPageRank
checkpoint                inDegrees            outerJoinVertices      staticPersonalizedPageRank
collectEdges              isCheckpointed       pageRank               stronglyConnectedComponents
collectNeighborIds        joinVertices         partitionBy            subgraph
collectNeighbors          mapEdges             persist                triangleCount
connectedComponents       mapTriplets          personalizedPageRank   triplets
convertToCanonicalEdges   mapVertices          pickRandomVertex       unpersist
degrees                   mask                 pregel                 unpersistVertices
edges                     numEdges             removeSelfEdges        vertices
filter                    numVertices          reverse

//Get the graph's vertices
scala> graph.vertices.collect
res2: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((1,1), (2,2), (3,3))

//Get the graph's edges
scala> graph.edges.collect
res3: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,2,1), Edge(2,3,2))

//Get the graph's triplets
scala> graph.triplets.collect
res4: Array[org.apache.spark.graphx.EdgeTriplet[Int,Int]] = Array(((1,1),(2,2),1), ((2,2),(3,3),2))
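
A hedged aside (not part of the original session): the degree members listed in Section III can be queried on this same graph; expected values are shown as comments.

scala> graph.numVertices  //3
scala> graph.numEdges     //2
scala> graph.inDegrees.collect  //Array((2,1), (3,1)) (order may vary); vertex 1 has no incoming edge, so it is absent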

2. Creating a Graph by Loading a File

[root@hadoopwei kb09file]# vi followers.txt
[root@hadoopwei kb09file]# cat followers.txt
2 3
3 4
1 4
2 4

scala> val graphLoad=GraphLoader.edgeListFile(sc,"file:///kb09file/followers.txt")
graphLoad: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@cf7146

scala> graphLoad.vertices.collect
res6: Array[(org.apache.spark.graphx.VertexId, Int)] = Array((4,1), (2,1), (1,1), (3,1))

scala> graphLoad.edges.collect
res7: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(1,4,1), Edge(2,3,1), Edge(3,4,1), Edge(2,4,1))

scala> graphLoad.triplets.collect
res9: Array[org.apache.spark.graphx.EdgeTriplet[Int,Int]] = Array(((1,1),(4,1),1), ((2,1),(3,1),1), ((3,1),(4,1),1), ((2,1),(4,1),1))
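
A hedged note: GraphLoader.edgeListFile also accepts optional parameters, e.g. to reorient every edge in the positive direction (srcId < dstId) and to control the number of edge partitions:

scala> val canonical=GraphLoader.edgeListFile(sc,"file:///kb09file/followers.txt",canonicalOrientation=true,numEdgePartitions=2)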

3. Building a User-Relationship Property Graph

  • Vertex attributes
    • Username
    • Occupation
  • Edge attribute
    • Collaboration relationship
      (figure: the user-relationship property graph)
scala> val users=sc.parallelize(Array((3L,("rxin","student")),(7L,("jgonzal","postdoc")),(5L,("franklin","professor")),(2L,("istorica","professor"))))
users: org.apache.spark.rdd.RDD[(Long, (String, String))] = ParallelCollectionRDD[52] at parallelize at <console>:27

scala>  val relationship=sc.parallelize(Array(Edge(3L,7L,"Colla"),Edge(5L,3L,"Advison"),Edge(2L,5L,"Colleague"),Edge(5L,7L,"Pi")))
relationship: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[53] at parallelize at <console>:27

scala> val graphUser=Graph(users,relationship)
graphUser: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@47ebbee1

scala> graphUser.edges.collect
res10: Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(3,7,Colla), Edge(5,3,Advison), Edge(2,5,Colleague), Edge(5,7,Pi))

scala> graphUser.vertices.collect
res11: Array[(org.apache.spark.graphx.VertexId, (String, String))] = Array((2,(istorica,professor)), (3,(rxin,student)), (5,(franklin,professor)), (7,(jgonzal,postdoc)))

scala> graphUser.triplets.collect
res12: Array[org.apache.spark.graphx.EdgeTriplet[(String, String),String]] = Array(((3,(rxin,student)),(7,(jgonzal,postdoc)),Colla), ((5,(franklin,professor)),(3,(rxin,student)),Advison), ((2,(istorica,professor)),(5,(franklin,professor)),Colleague), ((5,(franklin,professor)),(7,(jgonzal,postdoc)),Pi))
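
A hedged aside (not in the original transcript): the vertex attributes can be pattern-matched directly, e.g. to keep only the professors:

scala> graphUser.vertices.filter{case (id,(name,pos))=>pos=="professor"}.collect
//expected: Array((2,(istorica,professor)), (5,(franklin,professor)))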

4. Building a User Social Network

  • Vertices: username, age
  • Edges: number of likes
    (figure: the user social network)
  • Properties
    (figure: vertex and edge property tables)
  • Code
val info=sc.makeRDD(Array((1L,("Alice",28)),(2L,("Bob",27)),(3L,("Charlie",65)),(4L,("David",42)),(5L,("Ed",55)),(6L,("Fran",50))))

val rela=sc.makeRDD(Array(Edge(2L,1L,7),Edge(3L,2L,4),Edge(4L,1L,1),Edge(2L,4L,2),Edge(5L,2L,2),Edge(5L,3L,8),Edge(5L,6L,3),Edge(3L,6L,3)))

scala> val userCallGraph=Graph(info,rela)
userCallGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@5c2e46c

scala> userCallGraph.vertices.collect
res13: Array[(org.apache.spark.graphx.VertexId, (String, Int))] = Array((1,(Alice,28)), (2,(Bob,27)), (3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)))

scala> userCallGraph.edges.collect
res14: Array[org.apache.spark.graphx.Edge[Int]] = Array(Edge(2,1,7), Edge(3,2,4), Edge(4,1,1), Edge(2,4,2), Edge(5,2,2), Edge(5,3,8), Edge(5,6,3), Edge(3,6,3))

scala> userCallGraph.triplets.collect
res15: Array[org.apache.spark.graphx.EdgeTriplet[(String, Int),Int]] = Array(((2,(Bob,27)),(1,(Alice,28)),7), ((3,(Charlie,65)),(2,(Bob,27)),4), ((4,(David,42)),(1,(Alice,28)),1), ((2,(Bob,27)),(4,(David,42)),2), ((5,(Ed,55)),(2,(Bob,27)),2), ((5,(Ed,55)),(3,(Charlie,65)),8), ((5,(Ed,55)),(6,(Fran,50)),3), ((3,(Charlie,65)),(6,(Fran,50)),3))

scala> userCallGraph.vertices.filter(v=>v._2._2>30).collect
res16: Array[(org.apache.spark.graphx.VertexId, (String, Int))] = Array((3,(Charlie,65)), (4,(David,42)), (5,(Ed,55)), (6,(Fran,50)))

scala> userCallGraph.vertices.filter{ case(id,(name,age))=>age>30}
res22: org.apache.spark.graphx.VertexRDD[(String, Int)] = VertexRDDImpl[109] at RDD at VertexRDD.scala:57

scala> userCallGraph.vertices.filter{ case(id,(name,age))=>age>30}.collect.foreach(println)
(3,(Charlie,65))
(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))

scala> userCallGraph.vertices.filter(v=>v._2._2>30).collect.foreach(x=>{println(x._2._2)})
65
42
55
50

scala> userCallGraph.triplets
res23: org.apache.spark.rdd.RDD[org.apache.spark.graphx.EdgeTriplet[(String, Int),Int]] = MapPartitionsRDD[95] at mapPartitions at GraphImpl.scala:48  //triplets are EdgeTriplet objects, not tuples

scala> userCallGraph.triplets.foreach(println)
((5,(Ed,55)),(2,(Bob,27)),2)
((2,(Bob,27)),(4,(David,42)),2)
((3,(Charlie,65)),(6,(Fran,50)),3)
((5,(Ed,55)),(3,(Charlie,65)),8)
((3,(Charlie,65)),(2,(Bob,27)),4)
((5,(Ed,55)),(6,(Fran,50)),3)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(1,(Alice,28)),7)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.dstAttr))
(Alice,28)
(Bob,27)
(Alice,28)
(David,42)
(Bob,27)
(Charlie,65)
(Charlie,65)
(Fran,50)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.srcAttr))
(Bob,27)
(Charlie,65)
(David,42)
(Bob,27)
(Ed,55)
(Ed,55)
(Ed,55)
(Charlie,65)

scala> userCallGraph.triplets.collect.foreach(x=>println(x.attr))
7
4
1
2
2
8
3
3

scala> userCallGraph.triplets.collect.foreach(x=>println(x.srcAttr._1+" like " + x.dstAttr._1 + " stage:"+x.attr))
Bob like Alice stage:7
Charlie like Bob stage:4
David like Alice stage:1
Bob like David stage:2
Ed like Bob stage:2
Ed like Charlie stage:8
Ed like Fran stage:3
Charlie like Fran stage:3

scala> userCallGraph.triplets.filter(x=>x.attr>5).collect.foreach(x=>println(x.srcAttr._1+" like " + x.dstAttr._1 + " stage:"+x.attr))
Bob like Alice stage:7
Ed like Charlie stage:8

VI. Graph Operators

1. Property Operators: mapVertices & mapEdges

  • mapVertices
    • def mapVertices[VD2](map: (org.apache.spark.graphx.VertexId, (String, Int)) => VD2)(implicit evidence$3: scala.reflect.ClassTag[VD2],implicit eq: =:=[(String, Int),VD2]): org.apache.spark.graphx.Graph[VD2,Int]

    • Modifies the attributes of vertices

import org.apache.spark.graphx._

val info=sc.makeRDD(Array((1L,("Alice",28)),(2L,("Bob",27)),(3L,("Charlie",65)),(4L,("David",42)),(5L,("Ed",55)),(6L,("Fran",50))))

val rela=sc.makeRDD(Array(Edge(2L,1L,7),Edge(3L,2L,4),Edge(4L,1L,1),Edge(2L,4L,2),Edge(5L,2L,2),Edge(5L,3L,8),Edge(5L,6L,3),Edge(3L,6L,3)))

scala> val userCallGraph=Graph(info,rela)
userCallGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@21dc1d99

//mapVertices modifies the vertex attributes
scala> val t1_graph=userCallGraph.mapVertices{case (id,(name,age))=>(id,name)}
//The two definitions are equivalent
scala> val t2_graph=userCallGraph.mapVertices((id,attr)=>(id,attr._1))

scala> t1_graph.vertices.collect.foreach(println)
(1,(1,Alice))
(2,(2,Bob))
(3,(3,Charlie))
(4,(4,David))
(5,(5,Ed))
(6,(6,Fran))

scala> t2_graph.vertices.collect.foreach(println)
(1,(1,Alice))
(2,(2,Bob))
(3,(3,Charlie))
(4,(4,David))
(5,(5,Ed))
(6,(6,Fran))

  • mapEdges
    • def mapEdges[ED2](map: org.apache.spark.graphx.Edge[Int] => ED2)(implicit evidence$4: scala.reflect.ClassTag[ED2]): org.apache.spark.graphx.Graph[(String, Int),ED2]

    • Modifies the attributes of edges

//mapEdges transforms only the edge attribute, so wrapping the result in Edge(...) just nests a whole Edge object inside attr (compare t4_graph below)
scala> val t3_graph=userCallGraph.mapEdges(e=>Edge(e.srcId,e.dstId,e.attr*7.0))
t3_graph: org.apache.spark.graphx.Graph[(String, Int),org.apache.spark.graphx.Edge[Double]] = org.apache.spark.graphx.impl.GraphImpl@3b848cf3

scala> t3_graph.edges.collect.foreach(println)
Edge(2,1,Edge(2,1,49.0))
Edge(3,2,Edge(3,2,28.0))
Edge(4,1,Edge(4,1,7.0))
Edge(2,4,Edge(2,4,14.0))
Edge(5,2,Edge(5,2,14.0))
Edge(5,3,Edge(5,3,56.0))
Edge(5,6,Edge(5,6,21.0))
Edge(3,6,Edge(3,6,21.0))

scala> val t4_graph=userCallGraph.mapEdges(e=>e.attr*7.0)
t4_graph: org.apache.spark.graphx.Graph[(String, Int),Double] = org.apache.spark.graphx.impl.GraphImpl@174600ea

scala> t4_graph.edges.collect.foreach(println)
Edge(2,1,49.0)
Edge(3,2,28.0)
Edge(4,1,7.0)
Edge(2,4,14.0)
Edge(5,2,14.0)
Edge(5,3,56.0)
Edge(5,6,21.0)
Edge(3,6,21.0)

2. Structural Operators: reverse & subgraph

  • reverse
    • def reverse: org.apache.spark.graphx.Graph[(String, Int),Int]
    • Reverses the direction of every edge in the graph, i.e. A->B becomes B->A
scala> userCallGraph.triplets.collect.foreach(println)
((2,(Bob,27)),(1,(Alice,28)),7)
((3,(Charlie,65)),(2,(Bob,27)),4)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob,27)),2)
((5,(Ed,55)),(3,(Charlie,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie,65)),(6,(Fran,50)),3)

scala> val reverse_graph=userCallGraph.reverse
reverse_graph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@21b64cf

scala> reverse_graph.triplets.collect.foreach(println)
((1,(Alice,28)),(2,(Bob,27)),7)
((2,(Bob,27)),(3,(Charlie,65)),4)
((1,(Alice,28)),(4,(David,42)),1)
((4,(David,42)),(2,(Bob,27)),2)
((2,(Bob,27)),(5,(Ed,55)),2)
((3,(Charlie,65)),(5,(Ed,55)),8)
((6,(Fran,50)),(5,(Ed,55)),3)
((6,(Fran,50)),(3,(Charlie,65)),3)
  • subgraph
    • def subgraph(epred: org.apache.spark.graphx.EdgeTriplet[(String, Int),Int] => Boolean,vpred: (org.apache.spark.graphx.VertexId, (String, Int)) => Boolean): org.apache.spark.graphx.Graph[(String, Int),Int]
    • Keeps only the vertices and edges that satisfy the predicates
scala> userCallGraph.triplets.collect.foreach(println)
((2,(Bob,27)),(1,(Alice,28)),7)
((3,(Charlie,65)),(2,(Bob,27)),4)
((4,(David,42)),(1,(Alice,28)),1)
((2,(Bob,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob,27)),2)
((5,(Ed,55)),(3,(Charlie,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie,65)),(6,(Fran,50)),3)

//epred keeps the triplets whose chosen parts (srcAttr, dstAttr, etc.) satisfy the condition
scala> userCallGraph.subgraph(epred=(ep)=>{println("epred",ep.srcAttr);ep.srcAttr._2<30}).triplets.collect.foreach(println)
(epred,(Charlie,65))
(epred,(Ed,55))
(epred,(Bob,27))
(epred,(Ed,55))
(epred,(Charlie,65))
(epred,(Bob,27))
(epred,(Ed,55))
(epred,(David,42))
((2,(Bob,27)),(1,(Alice,28)),7)
((2,(Bob,27)),(4,(David,42)),2)

//vpred keeps an edge only when both its source and destination vertices satisfy the condition
scala> val sub_graph=userCallGraph.subgraph(vpred=(id,attr)=>{println("subgraph in",id,attr);attr._2<30})
sub_graph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@1018ba99

scala> sub_graph.triplets.collect.foreach(println)
(subgraph in,5,(Ed,55))
(subgraph in,2,(Bob,27))
(subgraph in,4,(David,42))
(subgraph in,5,(Ed,55))
(subgraph in,5,(Ed,55))
(subgraph in,4,(David,42))
(subgraph in,3,(Charlie,65))
(subgraph in,2,(Bob,27))
(subgraph in,1,(Alice,28))
(subgraph in,3,(Charlie,65))
((2,(Bob,27)),(1,(Alice,28)),7)
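
A hedged sketch (not in the original transcript): both predicates can be combined in one call; with the conditions below only Bob -> Alice survives (edge attr > 3 and both endpoints younger than 60):

scala> val both=userCallGraph.subgraph(epred=(ep)=>ep.attr>3,vpred=(id,attr)=>attr._2<60)
scala> both.triplets.collect.foreach(println)
//((2,(Bob,27)),(1,(Alice,28)),7)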

3. joinVertices

  • Loads data from an external RDD to modify vertex attributes; vertices with no match in the RDD keep their original attributes (David, Ed, and Fran below), and keys with no matching vertex (7L below) are ignored
  • def joinVertices[U](table: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, U)])(mapFunc: (org.apache.spark.graphx.VertexId, (String, Int), U) => (String, Int))(implicit evidence$3: scala.reflect.ClassTag[U]): org.apache.spark.graphx.Graph[(String, Int),Int]
scala> val two=sc.parallelize(Array((1L,"kgc.cn"),(2L,"qq.com"),(3L,"163.com"),(7L,"sohu.com")))
two: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[85] at parallelize at <console>:27

scala> val joinGraph=userCallGraph.joinVertices(two)((id,v,cmpy)=>(v._1+"@"+cmpy,v._2))
joinGraph: org.apache.spark.graphx.Graph[(String, Int),Int] = org.apache.spark.graphx.impl.GraphImpl@41abf20f
//Equivalent to
scala> val joinGraph1=userCallGraph.joinVertices(two){case (id,(name,age),cmpy)=>(name+"@"+cmpy,age)}

scala> joinGraph.triplets.collect.foreach(println)
((2,(Bob@qq.com,27)),(1,(Alice@kgc.cn,28)),7)
((3,(Charlie@163.com,65)),(2,(Bob@qq.com,27)),4)
((4,(David,42)),(1,(Alice@kgc.cn,28)),1)
((2,(Bob@qq.com,27)),(4,(David,42)),2)
((5,(Ed,55)),(2,(Bob@qq.com,27)),2)
((5,(Ed,55)),(3,(Charlie@163.com,65)),8)
((5,(Ed,55)),(6,(Fran,50)),3)
((3,(Charlie@163.com,65)),(6,(Fran,50)),3)

scala>joinGraph.vertices.collect.foreach(println)
(1,(Alice@kgc.cn,28))
(2,(Bob@qq.com,27))
(3,(Charlie@163.com,65))
(4,(David,42))
(5,(Ed,55))
(6,(Fran,50))

4. outerJoinVertices

  • Also modifies vertex attributes, but the map function receives an Option (None for vertices missing from the other RDD) and the attribute type may change
  • def outerJoinVertices[U, VD2](other: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, U)])(mapFunc: (org.apache.spark.graphx.VertexId, (String, Int), Option[U]) => VD2)(implicit evidence$13: scala.reflect.ClassTag[U],implicit evidence$14: scala.reflect.ClassTag[VD2],implicit eq: =:=[(String, Int),VD2]): org.apache.spark.graphx.Graph[VD2,Int]
//outerJoinVertices
scala> case class User(name:String,age:Int,inDeg:Int,outDeg:Int)
defined class User

//The first u is the vertex attribute (name, age) of userCallGraph, which gets packed into a User; the u in the second call is therefore a User object
scala> val outJoin=userCallGraph.outerJoinVertices(userCallGraph.inDegrees){case(id,u,indeg)=>User(u._1,u._2,indeg.getOrElse(0),0)}.outerJoinVertices(userCallGraph.outDegrees){case(id,u,outdeg)=>User(u.name,u.age,u.inDeg,outdeg.getOrElse(0))}
outJoin: org.apache.spark.graphx.Graph[User,Int] = org.apache.spark.graphx.impl.GraphImpl@1a1acf9e

scala> outJoin.vertices.collect.foreach(println)
(1,User(Alice,28,2,0))
(2,User(Bob,27,2,2))
(3,User(Charlie,65,1,2))
(4,User(David,42,1,1))
(5,User(Ed,55,0,3))
(6,User(Fran,50,2,0))

//A vertex's in-degree is its number of fans
scala> for((id,prop)<-outJoin.vertices.collect){println(s"User $id is ${prop.name} and is liked by ${prop.inDeg} people")}
User 1 is Alice and is liked by 2 people
User 2 is Bob and is liked by 2 people
User 3 is Charlie and is liked by 1 people
User 4 is David and is liked by 1 people
User 5 is Ed and is liked by 0 people
User 6 is Fran and is liked by 2 people
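
A hedged follow-up (not in the original session): since VertexRDD supports the ordinary RDD operations, the users can be ranked by fan count:

scala> outJoin.vertices.sortBy(_._2.inDeg,ascending=false).first
//one of the inDeg-2 vertices, e.g. (1,User(Alice,28,2,0))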
