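GraphX mainly provides the five categories of operator interfaces shown in the figure below:
To understand what each graph operator does, I ran these methods on a Spark cluster, using the example graph from the official Spark GraphX website, as follows:
First, run the following code in the Spark Shell to store the graph: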
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
// Assume the SparkContext has already been constructed
//val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case a relationship refers to a missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
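Before trying the individual operators, it can help to confirm that the graph was stored as intended. The following is a minimal sketch that assumes it runs in the same Spark Shell session, so `graph` is the value built above; the name `facts` is my own.

```scala
import org.apache.spark.graphx._

// Each triplet joins an edge with the attributes of its two endpoint vertices.
// Mapping to a String before collect() avoids holding on to reused triplet objects.
val facts: Array[String] = graph.triplets
  .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
  .collect()

facts.foreach(println)
```

The order of the printed lines depends on how the edges are partitioned, so it may differ between runs.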
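1. numVertices: Long = graph.vertices.count(). Returns the total number of vertices in the graph as a Long.
2. numEdges: Long = graph.edges.count(). Returns the total number of edges in the graph, also as a Long.
3. degrees: VertexRDD[Int]. Returns the degree of each vertex, counting both incoming and outgoing edges. Vertices without any edge do not appear in the result.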
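The three counting operators above can be tried as follows. This is a small sketch that assumes the same Spark Shell session and the `graph` built earlier; the names `vertexCount`, `edgeCount` and `degreeById` are my own.

```scala
import org.apache.spark.graphx._

val vertexCount: Long = graph.numVertices   // the sample graph has 4 vertices
val edgeCount: Long = graph.numEdges        // the sample graph has 4 edges

// degrees counts every edge that touches a vertex, whatever its direction
val degreeById: Map[VertexId, Int] = graph.degrees.collect().toMap

degreeById.toSeq.sortBy(_._1).foreach { case (id, d) =>
  println(s"vertex $id has degree $d")
}
```

On the sample graph, vertex 5 (franklin) has degree 3, vertices 3 and 7 have degree 2, and vertex 2 has degree 1. The degrees sum to 8, twice the number of edges.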
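4. mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]. Uses a Spark map operation to update the attribute of every vertex, changing the vertex attribute type from VD to VD2. The structure of the graph is left unchanged.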
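5. mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): Graph[VD, ED2]. Uses a Spark map operation to update the attribute of every edge, changing the edge attribute type from ED to ED2.
6. mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]. Also updates the edge attributes from ED to ED2, but the map function receives an EdgeTriplet, so it can read the attributes of the source and destination vertices as well as the edge attribute.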
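The three map operators differ only in what the map function sees and which attribute type changes. Here is a sketch on the sample graph, assuming the same Spark Shell session and the `graph` built earlier; the names `nameGraph`, `lengthGraph` and `roleGraph` are made up for illustration.

```scala
import org.apache.spark.graphx._

// mapVertices: keep only the user name, so VD changes from (String, String) to String
val nameGraph: Graph[String, String] =
  graph.mapVertices((id, attr) => attr._1)

// mapEdges: replace each edge label by its length, so ED changes from String to Int
val lengthGraph: Graph[(String, String), Int] =
  graph.mapEdges(e => e.attr.length)

// mapTriplets: the function can also read the attributes of both endpoints
val roleGraph: Graph[(String, String), String] =
  graph.mapTriplets(t => s"${t.srcAttr._2}-${t.attr}-${t.dstAttr._2}")

roleGraph.edges.map(_.attr).collect().foreach(println)
```

In all three cases the vertices and edges themselves stay the same; only the attributes attached to them are rewritten.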
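7. reverse: Graph[VD, ED]. Reverses the direction of every edge in the graph, that is, swaps srcId and dstId.
8. mask[VD2: ClassTag, ED2: ClassTag](other: Graph[VD2, ED2]): Graph[VD, ED]. Given this graph and the graph other, keeps only the vertices and edges that appear in both, with the vertex and edge attributes taken from this.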
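To see the effect of reverse on the sample graph, the edge endpoints can be compared before and after. This sketch assumes the same Spark Shell session and the `graph` built earlier; `reversed` and `reversedPairs` are names I chose. An example for mask is easier to follow once a subgraph is available, so it is not shown here.

```scala
import org.apache.spark.graphx._

val reversed: Graph[(String, String), String] = graph.reverse

// Every (src, dst) pair of the original graph shows up as (dst, src)
val reversedPairs: Set[(VertexId, VertexId)] =
  reversed.edges.map(e => (e.srcId, e.dstId)).collect().toSet

reversedPairs.foreach { case (src, dst) => println(s"$src -> $dst") }
```

The edge attributes are untouched, so the edge 3 -> 7 labelled "collab" becomes 7 -> 3 with the same label.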
9. subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true), vpred: (VertexId, VD) => Boolean = ((v, d) => true)): Graph[VD, ED]. Computes a subgraph, keeping the vertices and edges that satisfy the following:
{
V' = {v : for all v in V where vpred(v)}
E' = {(u,v): for all (u,v) in E where epred((u,v)) && vpred(u) && vpred(v)}
}
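The sketch below applies subgraph to the sample graph and then uses the result as the argument of mask. It assumes the same Spark Shell session and the `graph` built earlier; `noPostdocs`, `upperGraph` and `masked` are illustrative names. The mask argument is derived from the same graph on purpose, which is the pattern I would expect to be safe because both graphs then share the same edge partitioning.

```scala
import org.apache.spark.graphx._

// Keep only the vertices that are not postdocs; edges touching a removed
// vertex (3 -> 7 and 5 -> 7) are dropped automatically
val noPostdocs: Graph[(String, String), String] =
  graph.subgraph(vpred = (id, attr) => attr._2 != "postdoc")

// A second graph with the same structure but different vertex attributes
val upperGraph: Graph[String, String] =
  graph.mapVertices((id, attr) => attr._1.toUpperCase)

// mask keeps the structure common to both graphs and the attributes of upperGraph
val masked: Graph[String, String] = upperGraph.mask(noPostdocs)

masked.vertices.collect().foreach(println)
masked.edges.collect().foreach(println)
```

Only vertices 2, 3 and 5 and the edges 5 -> 3 and 2 -> 5 survive, and their names are the upper-case ones from upperGraph rather than the tuples from noPostdocs.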
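10. groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]. Merges the attributes of parallel edges using the merge function, so that the graph has at most one edge for each (srcId, dstId) pair. For correct results the graph should first be partitioned with partitionBy, so that parallel edges end up in the same partition.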
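The sample graph has no parallel edges, so the sketch below builds a tiny multigraph of its own to show groupEdges. It assumes a Spark Shell session where `sc` is available; the edge data and the names `multiEdges`, `multiGraph` and `merged` are invented for this example. The call to partitionBy comes first because groupEdges only merges edges that sit in the same partition.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Two parallel edges 1 -> 2 plus a single edge 2 -> 3 (hypothetical data)
val multiEdges: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(1L, 2L, 10), Edge(2L, 3L, 5)))

// fromEdges creates the vertices mentioned by the edges, with attribute 0
val multiGraph: Graph[Int, Int] = Graph.fromEdges(multiEdges, defaultValue = 0)

// Co-locate parallel edges, then merge their attributes by summing them
val merged: Graph[Int, Int] = multiGraph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + b)

merged.edges.collect().foreach(println)
```

After merging, the graph holds one edge 1 -> 2 with attribute 11 and the unchanged edge 2 -> 3 with attribute 5.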
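11. mapReduceTriplets[A: ClassTag](mapFunc: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], reduceFunc: (A, A) => A, activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] = None): VertexRDD[A].
Runs mapFunc on every EdgeTriplet, then combines the messages addressed to each vertex with reduceFunc, producing a VertexRDD of the aggregated values.
Parameter mapFunc: a user-defined function that returns zero or more messages for the neighboring vertices.
Parameter reduceFunc: a user-defined function that combines the messages collected in the map phase.
Parameter activeSetOpt: optional; restricts mapFunc to the edges adjacent to the vertices in the given active set. The EdgeDirection decides which endpoint must be active: In requires the destination, Out requires the source, Either requires at least one endpoint, and Both requires both endpoints.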
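The map and reduce phases can be illustrated by counting the incoming edges of each vertex. Note that mapReduceTriplets was deprecated in Spark 1.2 and removed in Spark 2.0 in favor of aggregateMessages, so the sketch below uses aggregateMessages and shows the older call only as a comment. It assumes the same Spark Shell session and the `graph` built earlier; `incomingCount` is my own name.

```scala
import org.apache.spark.graphx._

// Older Spark 1.x form, kept as a comment because it no longer compiles on Spark 2.x+:
//   graph.mapReduceTriplets[Int](t => Iterator((t.dstId, 1)), (a, b) => a + b)

val incomingCount: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1),   // "map" side: one message per edge, sent to its destination
  (a, b) => a + b,           // "reduce" side: sum the messages that arrive at a vertex
  TripletFields.None)        // no vertex or edge attributes are needed here

incomingCount.collect().foreach { case (id, n) =>
  println(s"vertex $id receives $n message(s)")
}
```

Vertices that receive no message are absent from the result, so vertex 2 (istoica), which has no incoming edge, does not appear.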
12. collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]. For each vertex, collects the VertexId and attribute of its neighbors along the given edge direction.
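As an illustration, the following sketch collects, for every vertex of the sample graph, the neighbors that point to it. It assumes the same Spark Shell session and the `graph` built earlier; `inNeighbors` and `inNeighborIds` are illustrative names.

```scala
import org.apache.spark.graphx._

// EdgeDirection.In: for each vertex, the sources of its incoming edges
val inNeighbors: VertexRDD[Array[(VertexId, (String, String))]] =
  graph.collectNeighbors(EdgeDirection.In)

// Reduce the result to plain id sets so it is easy to inspect
val inNeighborIds: Map[VertexId, Set[VertexId]] =
  inNeighbors.collect().map { case (id, nbrs) => id -> nbrs.map(_._1).toSet }.toMap

inNeighborIds.toSeq.sortBy(_._1).foreach { case (id, nbrs) =>
  println(s"vertex $id <- ${nbrs.mkString(", ")}")
}
```

Unlike the message-based aggregation, every vertex appears in the result; a vertex without incoming edges simply gets an empty array. Because the neighbor attributes are copied into each array, this can become expensive on vertices with very many neighbors.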
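13. cache(): Graph[VD, ED]. Caches the graph in memory. RDDs are not kept in memory by default, so a graph that is used several times during a computation would otherwise be recomputed each time; caching it avoids that cost.
14. unpersistVertices(blocking: Boolean = true): Graph[VD, ED]. Releases the cached vertices while leaving the edges cached. This suits iterative algorithms that modify only the vertex attributes but reuse the edges: the vertex attributes of earlier iterations can be uncached once they are no longer needed, which improves GC performance.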
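The two caching operators are typically used together in an iterative loop. Below is a minimal sketch of that pattern, assuming the same Spark Shell session and the `graph` built earlier; the loop body (adding 1 to a counter in each vertex) is a made-up stand-in for a real per-iteration update, and `g` and `prev` are my own names.

```scala
import org.apache.spark.graphx._

// Start with a counter of 0 in every vertex and cache the graph
var g: Graph[Int, String] = graph.mapVertices((id, attr) => 0).cache()

for (i <- 1 to 3) {
  val prev = g
  // Only the vertex attributes change; the edges are reused as they are
  g = g.mapVertices((id, steps) => steps + 1).cache()
  g.vertices.count()                        // materialize the new vertices
  prev.unpersistVertices(blocking = false)  // drop the previous iteration's vertices
}

g.vertices.collect().foreach(println)
```

After three iterations every vertex holds the value 3. Unpersisting only affects what is kept in memory: if an uncached RDD is needed again, Spark recomputes it from its lineage.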