Spark GraphX: Graph-Construction Methods Compared and Summarized Through Examples
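To build a graph, you can call Graph(), which looks like a constructor.
When a Scala class or object defines an apply() method, the method name can be omitted at the call site, so Graph.apply() can be shortened to Graph(). Graph() therefore looks like a constructor call, but it actually invokes the apply() method of the Graph companion object.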
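The apply() shorthand is a general Scala feature, not something specific to GraphX. As a minimal sketch, the hypothetical Point class below (it is not part of Spark) defines apply() on its companion object, so Point(1, 2) and Point.apply(1, 2) are the same call:

```scala
// Hypothetical class used only to illustrate the apply() shorthand.
class Point(val x: Int, val y: Int)

object Point {
  // Invoked whenever client code writes Point(...).
  def apply(x: Int, y: Int): Point = new Point(x, y)
}

object ApplyDemo {
  def main(args: Array[String]): Unit = {
    val a = Point.apply(1, 2) // explicit call to apply()
    val b = Point(1, 2)       // the same call with "apply" omitted
    println(a.x == b.x && a.y == b.y) // prints "true"
  }
}
```

Graph(vertices, edges) works the same way: the call is resolved to Graph.apply on the companion object.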
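The Resilient Distributed Dataset (RDD) is the basic building block of a Spark program; it provides flexible, efficient, parallel data processing along with fault tolerance. In GraphX, the base class for a graph is Graph, which holds two RDDs: one for the edges and one for the vertices.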
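The two RDDs can be read back from any graph through its vertices and edges members. The sketch below assumes a local Spark installation; the object name TwoRddsDemo, the helper build, and the two-vertex data set are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object TwoRddsDemo {
  // Builds a tiny graph with two vertices and one edge (illustrative data).
  def build(sc: SparkContext): Graph[String, Int] = {
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "a"), (2L, "b")))
    val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 10)))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("TwoRddsDemo"))
    val graph = build(sc)
    val vs: VertexRDD[String] = graph.vertices // the vertex RDD
    val es: EdgeRDD[Int] = graph.edges         // the edge RDD
    println(s"vertices: ${vs.count()}, edges: ${es.count()}") // vertices: 2, edges: 1
    sc.stop()
  }
}
```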
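Case 1: analyzing collaboration data
Vertices can be built in two ways:
Way 1: build a plain RDD;
Way 2: build the optimized version, VertexRDD.
VertexRDD[VD] can be understood as an extension and optimization of RDD[(VertexId, VD)].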
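One thing the extension buys is vertex-aware operations such as innerJoin, which a plain pair RDD does not offer in this form. The following is a sketch under the assumption of a local Spark setup; the object name VertexRddDemo, the helper joined, and the sample scores and names are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object VertexRddDemo {
  // Joins a VertexRDD with a plain pair RDD, keeping only the IDs present in both.
  def joined(sc: SparkContext): Array[(VertexId, (Int, String))] = {
    val scores: VertexRDD[Int] =
      VertexRDD(sc.parallelize(Seq((1L, 10), (2L, 20), (3L, 30))))
    val names: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "x"), (3L, "z")))
    scores
      .innerJoin(names)((id: VertexId, score: Int, name: String) => (score, name))
      .collect()
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("VertexRddDemo"))
    joined(sc).foreach(println) // (1,(10,x)) then (3,(30,z)); vertex 2 has no name and is dropped
    sc.stop()
  }
}
```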
import org.apache.spark.graphx.impl.EdgeRDDImpl
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
/**
 * 1 - Create the vertices
 * 2 - Create the edges
 * 3 - Create the graph
 */
object CreateGraph {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkGraphX_helloworld")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")
    // Vertex creation, way 1: a plain RDD of (VertexId, attribute) pairs
    val users: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
      (3L, ("rxin", "student")),
      (7L, ("jg", "postdoc")),
      (5L, ("franklin", "prof")),
      (2L, ("isnoic", "prof"))))
    users.foreach(println(_))
    println("====")
    // Vertex creation, way 2: wrap the plain RDD in a VertexRDD
    val user1: VertexRDD[(String, String)] = VertexRDD[(String, String)](users)
    user1.foreach(println(_))
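Edges can be built in two ways:
Way 1: build a plain RDD;
Way 2: build an EdgeRDD.
EdgeRDD[ED] can be understood as an extension and optimization of RDD[Edge[ED]]; in GraphX, the edges of a graph are stored as an edge RDD whose edge attributes have type ED.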
    // Edge creation, way 1: a plain RDD[Edge[String]]
    val relationship: RDD[Edge[String]] = sc.parallelize(Array(
      Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"), Edge(2L, 5L, "colleague"), Edge(5L, 7L, "PI")))
    relationship.foreach(println(_))
    println("====")
    // Edge creation, way 2: an EdgeRDD built from the plain RDD
    val relationship1: EdgeRDDImpl[String, Nothing] = EdgeRDD.fromEdges(relationship)
    relationship1.foreach(println(_))
    println("====")
    // Bare (source ID, destination ID) pairs without edge attributes, used later by Graph.fromEdgeTuples
    val relationship2: RDD[(VertexId, VertexId)] = sc.parallelize(Array((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L)))
    relationship2.foreach(println(_))
    // Example output (the order may differ between runs):
    // (2,5)
    // (5,7)
    // (3,7)
    // (5,3)
The graph can be built in three ways:
Way 1: the apply method;
Way 2: the fromEdgeTuples method;
Way 3: the fromEdges method, which keeps the edge attributes.
    // 1 - Build the graph from the vertices and relationships with Graph.apply.
    // defaultVertex is assigned to any vertex that appears in the edge RDD but not in the vertex RDD.
    val defaultVertex = ("Jack", "Missing")
    val graph = Graph(users, relationship, defaultVertex)
    graph.vertices.collect.foreach(println(_))
    // (5,(franklin,prof))
    // (2,(isnoic,prof))
    // (3,(rxin,student))
    // (7,(jg,postdoc))
    // 2 - Build the graph with fromEdgeTuples, from (source ID, destination ID) pairs.
    // No vertex RDD is supplied, so defaultValue: VD becomes the attribute of every vertex,
    // and every edge gets the Int attribute 1.
    val graph3: Graph[(String, String), Int] = Graph.fromEdgeTuples[(String, String)](relationship2, defaultVertex)
    graph3.vertices.collect.foreach(println(_))
    // (5,(Jack,Missing))
    // (2,(Jack,Missing))
    // (3,(Jack,Missing))
    // (7,(Jack,Missing))
    // 3 - Build the graph with fromEdges, from an RDD of edges; the edge attributes are kept.
    val graph4: Graph[(String, String), String] = Graph.fromEdges(relationship, defaultVertex)
    graph4.vertices.collect.foreach(println(_))
  }
}
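In the program above every edge endpoint also appears in users, so the default vertex passed to Graph.apply never shows up in the output. The sketch below makes it visible by adding an edge to a vertex ID (9L) that has no entry in the vertex RDD. The object name DefaultVertexDemo, the helper vertexAttrs, and the extra edge are assumptions made for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object DefaultVertexDemo {
  // Returns the vertex attributes of a graph whose edge RDD mentions a vertex (9L)
  // that is missing from the vertex RDD.
  def vertexAttrs(sc: SparkContext): Map[VertexId, (String, String)] = {
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jg", "postdoc"))))
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(3L, 9L, "unknown")))
    val graph = Graph(users, edges, ("Jack", "Missing"))
    graph.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("DefaultVertexDemo"))
    val attrs = vertexAttrs(sc)
    println(attrs(3L)) // (rxin,student)
    println(attrs(9L)) // (Jack,Missing): filled in from the default vertex
    sc.stop()
  }
}
```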
Case 2: analyzing social network data
Social network data analysis:
Definition of the graph:
Analysis code:
// Run in spark-shell, where sc is already defined
import org.apache.spark.graphx._
// Vertex definitions
val myVertices = sc.makeRDD(Array((1L, "Ann"), (2L, "Bill"), (3L, "Charles"), (4L, "Diane"), (5L, "Went to gym this morning")))
// Edge definitions
val myEdge = sc.makeRDD(Array(Edge(1L, 2L, "is_friends-with"), Edge(2L, 3L, "is_friends-with"), Edge(3L, 4L, "is_friends-with"), Edge(4L, 5L, "Like-status"), Edge(3L, 5L, "write-status")))
// Graph definition
val myGraph = Graph(myVertices, myEdge)
myGraph.vertices.collect
Full code:
// Build a graph from the social network data
val myVertices = sc.makeRDD(Array((1L, "Ann"), (2L, "Bill"), (3L, "Charles"), (4L, "Diane"), (5L, "Went to gym this morning")))
val myEdge = sc.makeRDD(Array(Edge(1L, 2L, "is_friend"), Edge(2L, 3L, "is_friend"), Edge(3L, 4L, "is_friend"), Edge(4L, 5L, "Like-Status"), Edge(3L, 5L, "write-status")))
myEdge.collect
val myGraph = Graph(myVertices, myEdge)
myGraph.vertices.collect
myGraph.edges.collect
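// Vertex construction
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD
// 1. A plain RDD of (VertexId, attribute) pairs
val v1: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "Ann"), (2L, "Bill"), (3L, "Charles"), (4L, "Diane"), (5L, "Went to gym this morning")))
// 2. A VertexRDD built from the plain RDD
val v2: VertexRDD[String] = VertexRDD[String](v1)
// Equivalent, letting the compiler infer the type: val v2 = VertexRDD(v1)
// Usage notes quoted from the VertexRDD Scaladoc:
// val someData: RDD[(VertexId, SomeType)] = loadData(someFile)
// val vset = VertexRDD(someData)
// // If there were redundant values in someData we would use a reduceFunc
// val vset2 = VertexRDD(someData, reduceFunc)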
Edge construction:
import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.rdd.RDD
// 1. A plain RDD[Edge[String]]
val r1: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "is_friend"), Edge(2L, 3L, "is_friend"), Edge(3L, 4L, "is_friend"), Edge(4L, 5L, "Like-Status"), Edge(3L, 5L, "write-status")))
// 2. An EdgeRDD built from the plain RDD
val r2: EdgeRDD[String] = EdgeRDD.fromEdges(r1)
r1.collect
r2.collect
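An EdgeRDD adds edge-specific operations on top of a plain RDD[Edge[ED]], for example reverse and mapValues. The sketch below assumes a local Spark setup; the object name EdgeRddDemo, the helper reversedUpper, and the explicit vertex type parameter Int passed to fromEdges are choices made for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.rdd.RDD

object EdgeRddDemo {
  // Reverses every edge and upper-cases its attribute using EdgeRDD operations.
  def reversedUpper(sc: SparkContext): Array[Edge[String]] = {
    val plain: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "is_friend"), Edge(4L, 5L, "like_status")))
    // The second type parameter is the vertex attribute type; Int is an arbitrary placeholder here.
    val edges: EdgeRDD[String] = EdgeRDD.fromEdges[String, Int](plain)
    edges.reverse
      .mapValues(e => e.attr.toUpperCase)
      .collect()
      .sortBy(_.srcId)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("EdgeRddDemo"))
    reversedUpper(sc).foreach(println) // Edge(2,1,IS_FRIEND) then Edge(5,4,LIKE_STATUS)
    sc.stop()
  }
}
```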
Graph construction:
1. apply
val myGraph = Graph(myVertices, myEdge)
// Equivalent explicit call
val myGraph1 = Graph.apply(myVertices, myEdge)
2. fromEdgeTuples
val defaultUser = ("jack", "none")
val r3: RDD[(VertexId, VertexId)] = sc.parallelize(Array((1L, 2L), (2L, 3L), (3L, 4L), (4L, 5L), (3L, 5L)))
val myGraph4 = Graph.fromEdgeTuples[(String, String)](r3, defaultUser)
myGraph4.vertices.collect.foreach(println _)
myGraph4.edges.collect.foreach(println _)
3. fromEdges, for comparison
val defaultUser = ("jack", "none")
val myGraph3 = Graph.fromEdges(r2, defaultUser)
myGraph3.vertices.collect.foreach(println _)
myGraph3.edges.collect.foreach(println _)
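The main difference between the two methods is the edge attribute: fromEdges keeps whatever attribute each Edge carries, while fromEdgeTuples assigns the Int value 1 to every edge and, when uniqueEdges is set, merges duplicate pairs by summing those values. The sketch below illustrates this under the assumption of a local Spark setup; the object name CompareDemo, its two helpers, and the sample data are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

object CompareDemo {
  // fromEdgeTuples with uniqueEdges: duplicate pairs are merged and counted.
  def tupleEdgeCounts(sc: SparkContext): Map[(VertexId, VertexId), Int] = {
    val pairs: RDD[(VertexId, VertexId)] =
      sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))
    val graph: Graph[String, Int] = Graph.fromEdgeTuples(
      pairs, "none", uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
    graph.edges.collect().map(e => ((e.srcId, e.dstId), e.attr)).toMap
  }

  // fromEdges: the original edge attributes are preserved.
  def edgeAttrs(sc: SparkContext): Map[(VertexId, VertexId), String] = {
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(1L, 2L, "is_friend"), Edge(2L, 3L, "is_friend")))
    val graph: Graph[String, String] = Graph.fromEdges(edges, "none")
    graph.edges.collect().map(e => ((e.srcId, e.dstId), e.attr)).toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("CompareDemo"))
    println(tupleEdgeCounts(sc)((1L, 2L))) // 2: the pair (1,2) appeared twice
    println(edgeAttrs(sc)((1L, 2L)))       // is_friend
    sc.stop()
  }
}
```

In both cases the vertex attributes all come from the default value, since neither method takes a vertex RDD; use Graph.apply when vertices need their own attributes.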