GraphX: The Connected Components Algorithm and Its Spark Implementation
Connected Components
Source code documentation
Compute the connected component membership of each vertex and return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
Demo
- links.csv
1,2,friend
1,3,sister
2,4,brother
3,2,boss
4,5,client
1,9,friend
6,7,cousin
7,9,coworker
8,9,father
10,11,colleague
10,12,colleague
11,12,colleague
- people.csv
4,Dave,25
6,Faith,21
8,Harvey,47
2,Bob,18
1,Alice,20
3,Charlie,30
7,George,34
9,Ivy,21
5,Eve,30
10,Lily,35
11,Helen,35
12,Ann,35
- Graph structure
- Code
package nj.zb.kb09.suanfa
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx._
object ConnectedComponents {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ConnectedComponents").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    // Define the case class used as the vertex attribute
    case class Person(name: String, age: Int)
    // Read people.csv
    val people: RDD[String] = sc.textFile("in/people.csv")
    // Split each line on "," and build (vertexId, Person) pairs; VertexId is a Long
    val peopleRDD: RDD[(VertexId, Person)] = people.map(x => x.split(",")).map(row => (row(0).toLong, Person(row(1), row(2).toInt)))
    peopleRDD.collect.foreach(println)
    println("----------------------")
    // Read links.csv
    val links: RDD[String] = sc.textFile("in/links.csv")
    // Split each line on "," and build an Edge(srcId, dstId, relationship)
    val linksRDD: RDD[Edge[String]] = links.map({ x => val row = x.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) })
    linksRDD.collect.foreach(println)
    println("-----------------------")
    // Build the graph
    val tinySocial: Graph[Person, String] = Graph(peopleRDD, linksRDD)
    // Each vertex attribute becomes the lowest vertex id in its connected component
    val cc: Graph[VertexId, String] = tinySocial.connectedComponents()
    cc.triplets.collect.foreach(println)
  }
}
Output:
(4,Person(Dave,25))
(6,Person(Faith,21))
(8,Person(Harvey,47))
(2,Person(Bob,18))
(1,Person(Alice,20))
(3,Person(Charlie,30))
(7,Person(George,34))
(9,Person(Ivy,21))
(5,Person(Eve,30))
(10,Person(Lily,35))
(11,Person(Helen,35))
(12,Person(Ann,35))
----------------------
Edge(1,2,friend)
Edge(1,3,sister)
Edge(2,4,brother)
Edge(3,2,boss)
Edge(4,5,client)
Edge(1,9,friend)
Edge(6,7,cousin)
Edge(7,9,coworker)
Edge(8,9,father)
Edge(10,11,colleague)
Edge(10,12,colleague)
Edge(11,12,colleague)
-----------------------
((1,1),(2,1),friend)
((1,1),(3,1),sister)
((1,1),(9,1),friend)
((2,1),(4,1),brother)
((3,1),(2,1),boss)
((4,1),(5,1),client)
((6,1),(7,1),cousin)
((7,1),(9,1),coworker)
((8,1),(9,1),father)
((10,10),(11,10),colleague)
((10,10),(12,10),colleague)
((11,10),(12,10),colleague)
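The output shows that, after the computation, the attribute of each vertex is the smallest vertex id in the connected component that contains it. For example, the smallest vertex id in the component containing vertex 11 is 10, and the smallest vertex id in the component containing vertex 4 is 1.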
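To make the meaning of that attribute concrete, the sketch below recomputes the labels for the demo edges in plain Scala, without Spark. It only illustrates the idea of spreading the smallest id across edges until nothing changes; it is not GraphX's implementation, and the names MinLabelSketch and minLabels are made up for this example.

```scala
object MinLabelSketch {
  // Edges from links.csv as (srcId, dstId); the relationship labels are omitted.
  val edges: Seq[(Long, Long)] = Seq(
    (1L, 2L), (1L, 3L), (2L, 4L), (3L, 2L), (4L, 5L), (1L, 9L),
    (6L, 7L), (7L, 9L), (8L, 9L), (10L, 11L), (10L, 12L), (11L, 12L)
  )

  // Hypothetical helper: give both endpoints of every edge the smaller of
  // their two labels, and repeat until no label changes.
  // Edge direction is ignored.
  def minLabels(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Every vertex starts out labelled with its own id.
    var labels: Map[Long, Long] =
      edges.flatMap { case (s, d) => Seq(s, d) }.distinct.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      edges.foreach { case (s, d) =>
        val m = math.min(labels(s), labels(d))
        if (labels(s) != m || labels(d) != m) {
          labels = labels.updated(s, m).updated(d, m)
          changed = true
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Print one line per vertex, sorted by vertex id.
    minLabels(edges).toSeq.sorted.foreach { case (v, c) =>
      println(s"vertex $v -> component $c")
    }
  }
}
```

With the demo edges, vertices 1 to 9 end up labelled 1 and vertices 10 to 12 end up labelled 10, which matches the second field of each vertex in the triplets printed above.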
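Connected graphs:
In graph theory, a connected graph is defined in terms of connectivity.
In an undirected graph G, vertices i and j are connected if there is a path from i to j (and therefore also a path from j to i).
If G is a directed graph, all edges on the path joining i and j must point in the same direction.
If every pair of vertices in a graph is connected, the graph is called a connected graph. If the graph is directed, it is called strongly connected (note: a path must exist in both directions).
Connectivity is a basic property of a graph.
Note that connectedComponents ignores edge direction: in the demo, vertex 8 is assigned to component 1 even though no directed path leads from 8 to 1.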
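The difference between following edge directions and ignoring them can be checked on the demo edges with a small breadth-first search. This is a plain-Scala sketch written for illustration only; ConnectivitySketch and reachable are hypothetical names, not part of GraphX.

```scala
import scala.annotation.tailrec

object ConnectivitySketch {
  // Edges from links.csv as (srcId, dstId); the relationship labels are omitted.
  val edges: Seq[(Long, Long)] = Seq(
    (1L, 2L), (1L, 3L), (2L, 4L), (3L, 2L), (4L, 5L), (1L, 9L),
    (6L, 7L), (7L, 9L), (8L, 9L), (10L, 11L), (10L, 12L), (11L, 12L)
  )

  // Hypothetical helper: the set of vertices reachable from `start`.
  // With directed = false every edge may be traversed in both directions.
  def reachable(start: Long, edges: Seq[(Long, Long)], directed: Boolean): Set[Long] = {
    val usable = if (directed) edges else edges ++ edges.map(_.swap)
    // Adjacency list: vertex -> vertices one hop away.
    val next: Map[Long, Seq[Long]] =
      usable.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }

    @tailrec
    def bfs(frontier: Set[Long], seen: Set[Long]): Set[Long] =
      if (frontier.isEmpty) seen
      else {
        val newFrontier = frontier.flatMap(v => next.getOrElse(v, Seq.empty[Long])) -- seen
        bfs(newFrontier, seen ++ newFrontier)
      }

    bfs(Set(start), Set(start))
  }

  def main(args: Array[String]): Unit = {
    println(reachable(8L, edges, directed = true).toSeq.sorted)
    println(reachable(8L, edges, directed = false).toSeq.sorted)
  }
}
```

Following directions, vertex 8 reaches only itself and vertex 9. Ignoring directions, it reaches all of vertices 1 to 9, which is the grouping the demo output reports.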
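Extension
The result of connectedComponents tells us which vertices belong to the same connected component, so a large graph can be split into several connected subgraphs.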
package nj.zb.kb09.suanfa
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx._
object ConnectedComponents {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ConnectedComponents").getOrCreate()
    val sc: SparkContext = spark.sparkContext
    // Define the case class used as the vertex attribute
    case class Person(name: String, age: Int)
    // Read people.csv
    val people: RDD[String] = sc.textFile("in/people.csv")
    // Split each line on "," and build (vertexId, Person) pairs; VertexId is a Long
    val peopleRDD: RDD[(VertexId, Person)] = people.map(x => x.split(",")).map(row => (row(0).toLong, Person(row(1), row(2).toInt)))
    peopleRDD.collect.foreach(println)
    println("----------------------")
    // Read links.csv
    val links: RDD[String] = sc.textFile("in/links.csv")
    // Split each line on "," and build an Edge(srcId, dstId, relationship)
    val linksRDD: RDD[Edge[String]] = links.map({ x => val row = x.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) })
    linksRDD.collect.foreach(println)
    println("-----------------------")
    // Build the graph
    val tinySocial: Graph[Person, String] = Graph(peopleRDD, linksRDD)
    // Each vertex attribute becomes the lowest vertex id in its connected component
    val cc: Graph[VertexId, String] = tinySocial.connectedComponents()
    cc.triplets.collect.foreach(println)
    println("-----------------------")
    // Join the original attributes back in: (component id, name, age).
    // p.get assumes every vertex has a record in people.csv.
    val newGraph: Graph[(VertexId, String, Int), String] = cc.outerJoinVertices(peopleRDD)((id, minId, p) => (minId, p.get.name, p.get.age))
    // One iteration per distinct component id (the lowest vertex id of each component)
    cc.vertices.map(_._2).collect.distinct.foreach(id => {
      // Keep only the vertices whose component id equals id
      val sub: Graph[(VertexId, String, Int), String] = newGraph.subgraph(vpred = (vid, attr) => attr._1 == id)
      sub.triplets.collect.foreach(println)
      println()
    })
  }
}
Output:
(4,Person(Dave,25))
(6,Person(Faith,21))
(8,Person(Harvey,47))
(2,Person(Bob,18))
(1,Person(Alice,20))
(3,Person(Charlie,30))
(7,Person(George,34))
(9,Person(Ivy,21))
(5,Person(Eve,30))
(10,Person(Lily,35))
(11,Person(Helen,35))
(12,Person(Ann,35))
----------------------
Edge(1,2,friend)
Edge(1,3,sister)
Edge(2,4,brother)
Edge(3,2,boss)
Edge(4,5,client)
Edge(1,9,friend)
Edge(6,7,cousin)
Edge(7,9,coworker)
Edge(8,9,father)
Edge(10,11,colleague)
Edge(10,12,colleague)
Edge(11,12,colleague)
-----------------------
((1,1),(2,1),friend)
((1,1),(3,1),sister)
((1,1),(9,1),friend)
((2,1),(4,1),brother)
((3,1),(2,1),boss)
((4,1),(5,1),client)
((6,1),(7,1),cousin)
((7,1),(9,1),coworker)
((8,1),(9,1),father)
((10,10),(11,10),colleague)
((10,10),(12,10),colleague)
((11,10),(12,10),colleague)
-----------------------
((1,(1,Alice,20)),(2,(1,Bob,18)),friend)
((1,(1,Alice,20)),(3,(1,Charlie,30)),sister)
((1,(1,Alice,20)),(9,(1,Ivy,21)),friend)
((2,(1,Bob,18)),(4,(1,Dave,25)),brother)
((3,(1,Charlie,30)),(2,(1,Bob,18)),boss)
((4,(1,Dave,25)),(5,(1,Eve,30)),client)
((6,(1,Faith,21)),(7,(1,George,34)),cousin)
((7,(1,George,34)),(9,(1,Ivy,21)),coworker)
((8,(1,Harvey,47)),(9,(1,Ivy,21)),father)
((10,(10,Lily,35)),(11,(10,Helen,35)),colleague)
((10,(10,Lily,35)),(12,(10,Ann,35)),colleague)
((11,(10,Helen,35)),(12,(10,Ann,35)),colleague)
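Analysis:
1. The vertex attributes of the graph returned by connectedComponents no longer carry the original information, so the result has to be joined with the original data, for example: val newGraph: Graph[(VertexId, String, Int), String] = cc.outerJoinVertices(peopleRDD)((id, minId, p) => (minId, p.get.name, p.get.age))
2. cc.vertices.map(_._2).collect.distinct yields the lowest vertex id of every connected component
3. With the lowest vertex id of each component, the subgraph method extracts each connected subgraph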
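The join in point 1 calls p.get, which throws if a vertex appears in the edges but has no record in the vertex data. The sketch below shows one possible variation that falls back to a placeholder and groups the names by component in a single job instead of one subgraph per component. It uses a small in-memory subset of the demo data, with vertex 5 deliberately left without a name; the object name ComponentMembersSketch, the helper membersByComponent and the "unknown" placeholder are assumptions made for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ComponentMembersSketch {
  // Hypothetical helper: component id -> sorted names of its members.
  def membersByComponent(sc: SparkContext): Map[VertexId, List[String]] = {
    // Subset of the demo data; vertex 5 has no name record on purpose.
    val names: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (4L, "Dave"), (10L, "Lily"), (11L, "Helen")))
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "friend"), Edge(2L, 4L, "brother"),
      Edge(4L, 5L, "client"), Edge(10L, 11L, "colleague")))

    // Build the graph from the edges alone; the vertex attribute 0 is a dummy.
    val graph: Graph[Int, String] = Graph.fromEdges(edges, 0)
    val cc: Graph[VertexId, String] = graph.connectedComponents()

    // outerJoinVertices hands over an Option because a vertex may have no
    // match in `names`; getOrElse avoids the exception p.get would throw.
    val labelled: Graph[(VertexId, String), String] =
      cc.outerJoinVertices(names) { (id, component, nameOpt) =>
        (component, nameOpt.getOrElse("unknown"))
      }

    // Group the names by component id and bring the small result to the driver.
    labelled.vertices
      .map { case (_, (component, name)) => (component, name) }
      .groupByKey()
      .mapValues(_.toList.sorted)
      .collect()
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ComponentMembersSketch").getOrCreate()
    membersByComponent(spark.sparkContext).foreach(println)
    spark.stop()
  }
}
```

Grouping on the component id keeps the work distributed. In the same spirit, calling distinct before collect in point 2 would deduplicate the ids on the executors rather than on the driver, which matters only when the graph is large.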