Spark GraphX Graph Operations
1. Basic graph information
- In-degree: the number of edges pointing to a vertex
- Out-degree: the number of edges from the current vertex to other vertices
- Degree: in-degree + out-degree
- Number of edges in the graph
- Number of vertices in the graph
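As a minimal sketch of these views (the app name "degreesDemo" and the tiny three-vertex graph are illustrative, not from the original), the degree properties can be read off a small graph like this:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("degreesDemo").setMaster("local[*]"))
sc.setLogLevel("WARN")

// Three vertices; two edges point into 3L and one edge leaves it
val vertices = sc.parallelize(Array((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Array(Edge(1L, 3L, 0), Edge(2L, 3L, 0), Edge(3L, 1L, 0)))
val graph = Graph(vertices, edges, "default")

// inDegrees/outDegrees only list vertices whose degree is non-zero
val in  = graph.inDegrees.collect().toMap  // 3L -> 2, 1L -> 1
val out = graph.outDegrees.collect().toMap // one entry per vertex with outgoing edges
val deg = graph.degrees.collect().toMap    // degree = in-degree + out-degree

println(s"in=$in out=$out deg=$deg")
println(s"numEdges=${graph.numEdges} numVertices=${graph.numVertices}")
sc.stop()
```

Note that `degrees(3L)` is 3, the sum of its in-degree (2) and out-degree (1).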
2. Spark GraphX transformation operations
mapVertices
mapEdges
mapTriplets
3. Spark GraphX structural operations
reverse
subgraph
mask
groupEdges
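Of these, groupEdges is the only one not demonstrated in the listing below, so here is a minimal sketch (the graph data and merge function are illustrative). groupEdges merges parallel edges, but it only merges edges that are colocated in the same partition, so the graph must be repartitioned with partitionBy first:

```scala
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("groupEdgesDemo").setMaster("local[*]"))
sc.setLogLevel("WARN")

// Two parallel edges from 1L to 2L that we want to merge into one
val edges = sc.parallelize(Array(Edge(1L, 2L, 10), Edge(1L, 2L, 5), Edge(2L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// groupEdges only merges parallel edges within a partition,
// so colocate duplicates with partitionBy before calling it
val merged = graph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + b) // sum the attributes of parallel edges

val result = merged.edges.collect().map(e => ((e.srcId, e.dstId), e.attr)).toMap
result.foreach(println)
sc.stop()
```

After merging, the two (1L, 2L) edges collapse into a single edge carrying the summed attribute 15, while the (2L, 3L) edge is unchanged.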
4. Spark GraphX join operations
joinVertices (implemented internally on top of outerJoinVertices)
outerJoinVertices
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * DESC: build a small social-network graph.
 * Complete data processing and modeling steps:
 * 1. Set up the Spark environment
 * 2. Prepare the vertex data
 * 3. Prepare the edge data
 * 4. Build the GraphX graph
 * 5. Print the graph's edges and vertices
 */
object socialGraphTest3 {
  def main(args: Array[String]): Unit = {
    // 1. Set up the Spark environment
    val conf: SparkConf = new SparkConf().setAppName("socialGraphTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("WARN")

    // 2. Prepare the vertex data -- RDD[(VertexId, VD)], here VD = (String, String)
    val vertexData: RDD[(VertexId, (String, String))] = sc.parallelize(Array(
      (3L, ("rxin", "stu")), (5L, ("franklin", "prof")),
      (7L, ("jg", "postdoc")), (2L, ("isca", "prof"))))

    // 3. Prepare the edge data -- RDD[Edge[ED]]
    val edgeData: RDD[Edge[String]] = sc.parallelize(Array(
      Edge(5L, 3L, "Advisor"), Edge(3L, 7L, "collab"), Edge(5L, 7L, "PI"),
      Edge(2L, 5L, "colleague"), Edge(2L, 8L, "colleague")))

    // 4. Build the GraphX graph
    // Graph(vertices: RDD[(VertexId, VD)], edges: RDD[Edge[ED]], defaultVertexAttr: VD)
    // Vertex 8L appears only in the edge data, so it receives the default attribute.
    val defaultAttr = ("Jack", "missing")
    val graph = Graph(vertexData, edgeData, defaultAttr)

    // 5. Print the graph's edges and vertices
    graph.edges.foreach(println(_))
    graph.vertices.foreach(println(_))
    graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.foreach(println(_))
    graph.edges.filter(edge => edge.srcId > edge.dstId).foreach(println(_))

    // Basic graph information: degrees and counts
    graph.inDegrees.foreach(println(_))
    graph.outDegrees.foreach(println(_))
    graph.degrees.foreach(println(_))
    val numEdges = graph.numEdges
    println("numEdges:" + numEdges)
    val numVertices = graph.numVertices
    println("numVertices:" + numVertices)
    println("==" * 100)

    // Transformation operations
    // mapVertices((VertexId, VD) => VD2), e.g. (3,(rxin,stu)) => (3,"rxin:stu")
    graph.mapVertices((id, attr) => attr._1 + ":" + attr._2).vertices.foreach(println(_))
    graph.mapEdges(edge => "name:" + edge.attr).edges.foreach(println(_))
    graph.mapTriplets(tri => "name:" + tri.attr).edges.foreach(println(_))
    graph.mapTriplets(tri => "name:" + tri.attr).vertices.foreach(println(_))

    // Structural operations
    println("reverse graph:")
    graph.reverse.edges.foreach(println(_))
    println("subgraph:")
    graph.subgraph(vpred = (id, attr) => attr._2 != "missing").edges.foreach(println(_))
    val subgraph = graph.subgraph(vpred = (id, attr) => attr._2 != "missing")
    println("subgraph prof:")
    graph.subgraph(triplet => triplet.attr.startsWith("c"), (id, attr) => attr._2.startsWith("pro")).edges.foreach(println(_))
    val conn: Graph[VertexId, String] = graph.connectedComponents()
    val maskGraph: Graph[VertexId, String] = conn.mask(subgraph)
    maskGraph.edges.foreach(println(_))

    // Join operations
    println("joinVertices")
    // def joinVertices[U: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD)
    val join: RDD[(Long, String)] = sc.parallelize(Array((3L, "1234")))
    val graph3: Graph[(String, String), String] = graph.joinVertices(join)((id, attr, u) => (attr._1, attr._2 + u))
    graph3.vertices.foreach(println(_))
    // outerJoinVertices passes Option[U]: vertices without a matching row receive None
    val graph4: Graph[(String, String), String] = graph.outerJoinVertices(join)((id, attr, opt) => (attr._1, attr._2 + opt.getOrElse("")))
    graph4.vertices.foreach(println(_))

    sc.stop()
  }
}
5. Graph caching
In Spark, RDDs are not cached by default. To avoid recomputation, they must be explicitly cached when they are used multiple times. Graphs in GraphX behave the same way: when a graph will be used more than once, make sure to call Graph.cache() on it first.
In iterative computation, uncaching may also be necessary for best performance. By default, cached RDDs and graphs stay in memory until memory pressure forces them to be evicted in LRU order. In iterative computation, intermediate results from previous iterations fill up the cache; although they are eventually evicted, the unneeded data kept in memory slows down garbage collection. It is more efficient to uncache intermediate results as soon as they are no longer needed. However, because a graph is composed of multiple RDDs, it is difficult to unpersist them all correctly. For iterative computation, we recommend using the Pregel API, which correctly unpersists intermediate results.
The caching operations in GraphX are cache, persist, unpersist, and unpersistVertices, with the following signatures:
def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
def cache(): Graph[VD, ED]
def unpersist(blocking: Boolean = true): Graph[VD, ED]
def unpersistVertices(blocking: Boolean = true): Graph[VD, ED]
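A minimal sketch of this lifecycle (the tiny graph and app name are illustrative): cache a graph before reusing it in several actions, then release it when it is no longer needed.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cacheDemo").setMaster("local[*]"))
sc.setLogLevel("WARN")

val edges = sc.parallelize(Array(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

// cache() marks the graph's vertex and edge RDDs for in-memory storage,
// so the actions below reuse them instead of recomputing the graph
val graph = Graph.fromEdges(edges, defaultValue = 0).cache()

val n = graph.numVertices // first action materializes the cache
val m = graph.numEdges    // second action reuses it
println(s"vertices=$n edges=$m")

// Release the cached data once the graph is no longer needed
graph.unpersist(blocking = true)
sc.stop()
```

unpersistVertices can be used instead of unpersist when only the vertex RDD should be released, which is the typical pattern inside iterative algorithms that keep reusing the edge structure.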