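Spark GraphX is a distributed graph processing framework. In social networks such as WeChat, QQ, and Weibo, the relationships between users (friendships, follows, and so on) form one enormous graph. A graph of that size is too large to process on a single machine, so it calls for a distributed graph processing framework such as Spark GraphX.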
1. POM file
Add the Spark GraphX dependency to the project's pom file:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-graphx_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
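For readers who build with sbt rather than Maven, the same dependency could be declared roughly as follows. This is only a sketch: the article itself uses Maven, and the exact `scalaVersion` shown here is an assumption (any 2.10.x release should make `%%` resolve to the `spark-graphx_2.10` artifact).

```scala
// build.sbt (sketch; assumes an sbt project, which the article does not use)
scalaVersion := "2.10.6" // assumption: a 2.10.x release, so %% picks spark-graphx_2.10

libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.6.0"
```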
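2. Setting up the runtime environment
// Imports needed by the code in this article
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Set up the runtime environment
val conf = new SparkConf()
  .setAppName("Simple GraphX")
  .setMaster("spark://master:7077")
  .setJars(Seq("E:\\Intellij\\Projects\\SimpleGraphX\\SimpleGraphX.jar"))
val sc = new SparkContext(conf)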
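3. Constructing the graph
A graph consists of vertices and edges, and a Spark GraphX graph is no different, so before initializing the graph we first define some vertices and edges: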
// Vertices: (id, (name, age))
val vertexArray = Array(
  (1L, ("Alice", 38)),
  (2L, ("Henry", 27)),
  (3L, ("Charlie", 55)),
  (4L, ("Peter", 32)),
  (5L, ("Mike", 35)),
  (6L, ("Kate", 23))
)
// Edges: Edge(srcId, dstId, attribute)
val edgeArray = Array(
  Edge(2L, 1L, 5),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 7),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 3),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 8)
)
Next, create an RDD from each of the two arrays:
// Build vertexRDD and edgeRDD
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
Finally, build the graph from the two RDDs:
// Build the graph
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
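The construction steps above can be tried without a cluster. The following is a minimal sketch, assuming Spark and GraphX are on the classpath and that running in local mode (`local[*]`) is acceptable in place of the `spark://master:7077` master used in this article. The object name `LocalGraphCheck` and the helper `buildGraph` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the article's project
object LocalGraphCheck {

  // Builds the same six-vertex, eight-edge graph from the article's data
  def buildGraph(sc: SparkContext): Graph[(String, Int), Int] = {
    val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(Seq(
      (1L, ("Alice", 38)), (2L, ("Henry", 27)), (3L, ("Charlie", 55)),
      (4L, ("Peter", 32)), (5L, ("Mike", 35)), (6L, ("Kate", 23))))
    val edgeRDD: RDD[Edge[Int]] = sc.parallelize(Seq(
      Edge(2L, 1L, 5), Edge(2L, 4L, 2), Edge(3L, 2L, 7), Edge(3L, 6L, 3),
      Edge(4L, 1L, 1), Edge(5L, 2L, 3), Edge(5L, 3L, 8), Edge(5L, 6L, 8)))
    Graph(vertexRDD, edgeRDD)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so no jar list or cluster master is needed
    val conf = new SparkConf().setAppName("LocalGraphCheck").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val graph = buildGraph(sc)
      // numVertices and numEdges are a quick sanity check on the input data
      println(s"vertices = ${graph.numVertices}, edges = ${graph.numEdges}")
    } finally {
      sc.stop() // always release the context, even if the job fails
    }
  }
}
```

With the data above, the check should report 6 vertices and 8 edges.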
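4. Graph property operations
A Spark GraphX graph exposes the following properties:
(1) Graph.vertices: all vertices in the graph;
(2) Graph.edges: all edges in the graph;
(3) Graph.triplets: one triplet per edge, made up of three parts: the source vertex, the destination vertex, and the edge between them;
(4) Graph.degrees: the degree of every vertex (in-degree plus out-degree);
(5) Graph.inDegrees: the in-degree of every vertex;
(6) Graph.outDegrees: the out-degree of every vertex.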
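The triplet view in item (3) can be pictured as each edge joined with the attributes of its two endpoint vertices. The sketch below models that idea with plain Scala collections and no Spark at all. The `TripletSketch` object and its `Triplet` case class are hypothetical stand-ins, not GraphX's own `EdgeTriplet`, and the data is copied from section 3.

```scala
// Hypothetical plain-Scala model of a triplet view; not GraphX code
object TripletSketch {

  // An edge plus the attributes of both endpoints
  final case class Triplet(srcId: Long, srcAttr: (String, Int),
                           dstId: Long, dstAttr: (String, Int), attr: Int)

  // Vertex id -> (name, age)
  val vertices: Map[Long, (String, Int)] = Map(
    1L -> ("Alice", 38), 2L -> ("Henry", 27), 3L -> ("Charlie", 55),
    4L -> ("Peter", 32), 5L -> ("Mike", 35), 6L -> ("Kate", 23))

  // (srcId, dstId, attr)
  val edges: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 5), (2L, 4L, 2), (3L, 2L, 7), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 3), (5L, 3L, 8), (5L, 6L, 8))

  // One triplet per edge: look up both endpoints in the vertex map
  val triplets: Seq[Triplet] = edges.map { case (src, dst, attr) =>
    Triplet(src, vertices(src), dst, vertices(dst), attr)
  }

  def main(args: Array[String]): Unit =
    triplets.foreach(t => println(s"${t.srcAttr._1} likes ${t.dstAttr._1}"))
}
```

This is why a triplet gives access to `srcAttr` and `dstAttr` directly, whereas a bare edge only carries `srcId`, `dstId`, and `attr`.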
The following code exercises these properties:
// Graph property operations
println("*************************************************************")
println("Property demo")
println("*************************************************************")
// Method 1: pattern matching on the vertex tuple
println("Vertices with age greater than 30, method 1:")
graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
  case (id, (name, age)) => println(s"$name is $age")
}
// Method 2: positional tuple accessors
println("Vertices with age greater than 30, method 2:")
graph.vertices.filter(v => v._2._2 > 30).collect.foreach {
  v => println(s"${v._2._1} is ${v._2._2}")
}
// Edge operations
println("Edges with attribute greater than 5:")
graph.edges.filter(e => e.attr > 5).collect.foreach(e => println(s"${e.srcId} to ${e.dstId} att ${e.attr}"))
println
// Triplet operations
println("All triplets:")
for (triplet <- graph.triplets.collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
println
println("Triplets with edge attribute > 5:")
for (triplet <- graph.triplets.filter(t => t.attr > 5).collect) {
  println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
}
println
// Degree operations
println("Maximum out-degree, in-degree, and degree:")
// Keeps whichever (vertexId, degree) pair has the larger degree
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
println("Max of OutDegrees:" + graph.outDegrees.reduce(max))
println("Max of InDegrees:" + graph.inDegrees.reduce(max))
println("Max of Degrees:" + graph.degrees.reduce(max))
println
Output:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/05/22 20:45:35 INFO Slf4jLogger: Slf4jLogger started
17/05/22 20:45:35 INFO Remoting: Starting remoting
17/05/22 20:45:35 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.1.101:53375]
*************************************************************
Property demo
*************************************************************
Vertices with age greater than 30, method 1:
Peter is 32
Alice is 38
Charlie is 55
Mike is 35
Vertices with age greater than 30, method 2:
Peter is 32
Alice is 38
Charlie is 55
Mike is 35
Edges with attribute greater than 5:
3 to 2 att 7
5 to 3 att 8
5 to 6 att 8
All triplets:
Henry likes Alice
Henry likes Peter
Charlie likes Henry
Charlie likes Kate
Peter likes Alice
Mike likes Henry
Mike likes Charlie
Mike likes Kate
Triplets with edge attribute > 5:
Charlie likes Henry
Mike likes Charlie
Mike likes Kate
Maximum out-degree, in-degree, and degree:
Max of OutDegrees:(5,3)
Max of InDegrees:(1,2)
Max of Degrees:(2,4)
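The degree figures in the output above can be cross-checked by hand. The following is a small plain-Scala sketch (no Spark involved) that recounts out-degree, in-degree, and total degree from the edge list in section 3. The object `DegreeSketch` and its helper `countById` are hypothetical names used only for this illustration.

```scala
// Hypothetical cross-check with plain Scala collections; not GraphX code
object DegreeSketch {

  // (srcId, dstId) pairs copied from edgeArray; edge attributes do not affect degrees
  val edges: Seq[(Long, Long)] = Seq(
    (2L, 1L), (2L, 4L), (3L, 2L), (3L, 6L),
    (4L, 1L), (5L, 2L), (5L, 3L), (5L, 6L))

  // Counts how many times each vertex id occurs in the given sequence
  private def countById(ids: Seq[Long]): Map[Long, Int] =
    ids.groupBy(identity).map { case (id, occurrences) => (id, occurrences.size) }

  // Out-degree counts sources, in-degree counts destinations, degree counts both
  val outDegrees: Map[Long, Int] = countById(edges.map(_._1))
  val inDegrees: Map[Long, Int] = countById(edges.map(_._2))
  val degrees: Map[Long, Int] = countById(edges.flatMap { case (s, d) => Seq(s, d) })

  def main(args: Array[String]): Unit = {
    println("Max out-degree: " + outDegrees.values.max)
    println("Max in-degree: " + inDegrees.values.max)
    println("Max degree: " + degrees.values.max)
  }
}
```

The maxima agree with the run above: vertex 5 has out-degree 3 and vertex 2 has total degree 4. For in-degree, vertices 1, 2, and 6 all tie at 2, so which of them `reduce(max)` reports depends on the order in which the pairs are combined.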