Big Data: The GraphX Connected Components Algorithm and Its Spark Implementation

Latest recommended article published on 2022-05-26 20:18:06

蜂蜜柚子加苦茶

Tags: spark, algorithms, big data, scala, graphx

Article link: https://blog.csdn.net/dsjia2970727/article/details/110197071

The GraphX Connected Components Algorithm and Its Spark Implementation

Connected Components
Demo
Extension

Connected Components

Source code

Compute the connected component membership of each vertex and return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex.
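
To make the description above concrete, here is a minimal single-machine sketch of the idea of labelling every vertex with the lowest vertex id in its component. It is only an illustration in plain Scala, not the GraphX source: the object name MinLabelPropagation and the method components are made up for this example, edges are treated as undirected, and every edge endpoint is assumed to appear in the vertex list.

```scala
object MinLabelPropagation {
  /** Maps each vertex id to the smallest vertex id in its component.
    * Edges are treated as undirected; every endpoint must be listed in `vertices`. */
  def components(vertices: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Every vertex starts out labelled with its own id.
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    // Keep sweeping over the edges until no label changes any more.
    while (changed) {
      changed = false
      for ((src, dst) <- edges) {
        val smallest = math.min(labels(src), labels(dst))
        if (labels(src) != smallest || labels(dst) != smallest) {
          // Both endpoints adopt the smaller of their two labels.
          labels = labels.updated(src, smallest).updated(dst, smallest)
          changed = true
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Same edges as links.csv in the demo, without the relationship column.
    val edges = Seq(
      (1L, 2L), (1L, 3L), (2L, 4L), (3L, 2L), (4L, 5L), (1L, 9L),
      (6L, 7L), (7L, 9L), (8L, 9L), (10L, 11L), (10L, 12L), (11L, 12L))
    components(1L to 12L, edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Run on the demo's edge list, vertices 1 to 9 end up with label 1 and vertices 10 to 12 with label 10. GraphX distributes this kind of propagation across a cluster, but the resulting labels have the same meaning.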

Demo

links.csv

1,2,friend
1,3,sister
2,4,brother
3,2,boss
4,5,client
1,9,friend
6,7,cousin
7,9,coworker
8,9,father
10,11,colleague
10,12,colleague
11,12,colleague

people.csv

4,Dave,25
6,Faith,21
8,Harvey,47
2,Bob,18
1,Alice,20
3,Charlie,30
7,George,34
9,Ivy,21
5,Eve,30
10,Lily,35
11,Helen,35
12,Ann,35

Graph structure
Code

package nj.zb.kb09.suanfa

import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ConnectedComponents {
  // Vertex attribute: one record from people.csv
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ConnectedComponents").getOrCreate()
    val sc: SparkContext = spark.sparkContext

    // Read people.csv
    val people: RDD[String] = sc.textFile("in/people.csv")

    // Split each line on "," and build (vertex id, Person) pairs
    val peopleRDD: RDD[(VertexId, Person)] = people.map(x => x.split(",")).map(row => (row(0).toLong, Person(row(1), row(2).toInt)))
    peopleRDD.collect.foreach(println)

    println("----------------------")
    // Read links.csv
    val links: RDD[String] = sc.textFile("in/links.csv")

    // Split each line on "," and build Edge(source id, destination id, relationship)
    val linksRDD: RDD[Edge[String]] = links.map { x => val row = x.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) }
    linksRDD.collect.foreach(println)
    println("-----------------------")
    // Build the graph, then label every vertex with the smallest vertex id in its component
    val tinySocial: Graph[Person, String] = Graph(peopleRDD, linksRDD)
    val cc: Graph[VertexId, String] = tinySocial.connectedComponents()
    cc.triplets.collect.foreach(println)

    spark.stop()
  }
}

Output:

(4,Person(Dave,25))
(6,Person(Faith,21))
(8,Person(Harvey,47))
(2,Person(Bob,18))
(1,Person(Alice,20))
(3,Person(Charlie,30))
(7,Person(George,34))
(9,Person(Ivy,21))
(5,Person(Eve,30))
(10,Person(Lily,35))
(11,Person(Helen,35))
(12,Person(Ann,35))
----------------------
Edge(1,2,friend)
Edge(1,3,sister)
Edge(2,4,brother)
Edge(3,2,boss)
Edge(4,5,client)
Edge(1,9,friend)
Edge(6,7,cousin)
Edge(7,9,coworker)
Edge(8,9,father)
Edge(10,11,colleague)
Edge(10,12,colleague)
Edge(11,12,colleague)
-----------------------
((1,1),(2,1),friend)
((1,1),(3,1),sister)
((1,1),(9,1),friend)
((2,1),(4,1),brother)
((3,1),(2,1),boss)
((4,1),(5,1),client)
((6,1),(7,1),cousin)
((7,1),(9,1),coworker)
((8,1),(9,1),father)
((10,10),(11,10),colleague)
((10,10),(12,10),colleague)
((11,10),(12,10),colleague)

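The output shows that in the computed graph each vertex's attribute is now the smallest vertex id in the connected component that contains it. For example, the smallest vertex id in the component containing vertex 11 is 10, and the smallest vertex id in the component containing vertex 4 is 1.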
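
As a small illustration of how to read these labels, the sketch below groups (vertex id, component id) pairs by component. It is plain Scala with the pairs hard-coded from the output above; the names ComponentMembers and members are hypothetical and exist only for this example.

```scala
object ComponentMembers {
  /** Groups (vertex id, component id) pairs into component id -> sorted member ids. */
  def members(labels: Seq[(Long, Long)]): Map[Long, Seq[Long]] =
    labels.groupBy(_._2).map { case (componentId, pairs) =>
      componentId -> pairs.map(_._1).sorted
    }

  def main(args: Array[String]): Unit = {
    // Labels copied from the demo output: vertices 1 to 9 carry 1, vertices 10 to 12 carry 10.
    val labels = (1L to 9L).map(v => v -> 1L) ++ (10L to 12L).map(v => v -> 10L)
    members(labels).toSeq.sortBy(_._1).foreach { case (id, vs) =>
      println(s"component $id: ${vs.mkString(", ")}")
    }
  }
}
```

The demo graph therefore has two components, one with nine members and one with three.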

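Connected graphs:
In graph theory, a connected graph is defined in terms of connectivity.
In an undirected graph G, vertices i and j are connected if there is a path from i to j (and therefore also from j to i).
If G is a directed graph, every edge on the path from i to j must point in the same direction.
If every pair of vertices in a graph is connected, the graph is called a connected graph. A directed graph with this property is called strongly connected (note: a path is required in both directions).
Connectivity is a basic property of graphs.
Note that connectedComponents ignores edge direction: in the demo, vertices 6, 7 and 8 are placed in component 1 even though their only link to the rest of it is through edges that point into vertex 9.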
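
The difference between following edge direction and ignoring it can be checked on the demo data with a short plain-Scala sketch. The names Reachability, reachable, directed and undirected are assumptions made for this illustration and are not part of GraphX.

```scala
object Reachability {
  /** All vertices reachable from `start` by following the adjacency map. */
  def reachable(start: Long, adj: Map[Long, Seq[Long]]): Set[Long] = {
    @annotation.tailrec
    def loop(frontier: List[Long], seen: Set[Long]): Set[Long] = frontier match {
      case Nil => seen
      case v :: rest =>
        // Only visit neighbours that have not been seen yet.
        val next = adj.getOrElse(v, Seq.empty).filterNot(seen.contains).toList
        loop(next ::: rest, seen ++ next)
    }
    loop(List(start), Set(start))
  }

  /** Adjacency map that follows each edge from source to destination only. */
  def directed(edges: Seq[(Long, Long)]): Map[Long, Seq[Long]] =
    edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }

  /** Adjacency map that lets every edge be walked in both directions. */
  def undirected(edges: Seq[(Long, Long)]): Map[Long, Seq[Long]] =
    directed(edges ++ edges.map(_.swap))

  def main(args: Array[String]): Unit = {
    // Edges of the first component in links.csv.
    val edges = Seq(
      (1L, 2L), (1L, 3L), (2L, 4L), (3L, 2L), (4L, 5L), (1L, 9L),
      (6L, 7L), (7L, 9L), (8L, 9L))
    println(reachable(6L, directed(edges)).toSeq.sorted)
    println(reachable(6L, undirected(edges)).toSeq.sorted)
  }
}
```

Following direction, vertex 6 reaches only 6, 7 and 9, so this part of the graph is not strongly connected. Ignoring direction, vertex 6 reaches all of vertices 1 to 9, which matches the component reported in the demo output.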

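Extension

The result of connectedComponents tells us which vertices belong to the same connected component, so a large graph can be split into several connected subgraphs.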

package nj.zb.kb09.suanfa

import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ConnectedComponents {
  // Vertex attribute: one record from people.csv
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ConnectedComponents").getOrCreate()
    val sc: SparkContext = spark.sparkContext

    // Read people.csv
    val people: RDD[String] = sc.textFile("in/people.csv")

    // Split each line on "," and build (vertex id, Person) pairs
    val peopleRDD: RDD[(VertexId, Person)] = people.map(x => x.split(",")).map(row => (row(0).toLong, Person(row(1), row(2).toInt)))
    peopleRDD.collect.foreach(println)

    println("----------------------")
    // Read links.csv
    val links: RDD[String] = sc.textFile("in/links.csv")

    // Split each line on "," and build Edge(source id, destination id, relationship)
    val linksRDD: RDD[Edge[String]] = links.map { x => val row = x.split(","); Edge(row(0).toLong, row(1).toLong, row(2)) }
    linksRDD.collect.foreach(println)
    println("-----------------------")
    // Build the graph, then label every vertex with the smallest vertex id in its component
    val tinySocial: Graph[Person, String] = Graph(peopleRDD, linksRDD)
    val cc: Graph[VertexId, String] = tinySocial.connectedComponents()
    cc.triplets.collect.foreach(println)
    println("-----------------------")
    // Join the original attributes back in: (component id, name, age).
    // p.get is safe here because every vertex comes from peopleRDD.
    val newGraph: Graph[(VertexId, String, Int), String] =
      cc.outerJoinVertices(peopleRDD)((_, componentId, p) => (componentId, p.get.name, p.get.age))

    // One subgraph per distinct component id, printed in ascending id order
    cc.vertices.map(_._2).distinct.collect.sorted.foreach { componentId =>
      val sub: Graph[(VertexId, String, Int), String] =
        newGraph.subgraph(vpred = (_, attr) => attr._1 == componentId)
      sub.triplets.collect.foreach(println)
      println()
    }

    spark.stop()
  }
}

Output:

(4,Person(Dave,25))
(6,Person(Faith,21))
(8,Person(Harvey,47))
(2,Person(Bob,18))
(1,Person(Alice,20))
(3,Person(Charlie,30))
(7,Person(George,34))
(9,Person(Ivy,21))
(5,Person(Eve,30))
(10,Person(Lily,35))
(11,Person(Helen,35))
(12,Person(Ann,35))
----------------------
Edge(1,2,friend)
Edge(1,3,sister)
Edge(2,4,brother)
Edge(3,2,boss)
Edge(4,5,client)
Edge(1,9,friend)
Edge(6,7,cousin)
Edge(7,9,coworker)
Edge(8,9,father)
Edge(10,11,colleague)
Edge(10,12,colleague)
Edge(11,12,colleague)
-----------------------
((1,1),(2,1),friend)
((1,1),(3,1),sister)
((1,1),(9,1),friend)
((2,1),(4,1),brother)
((3,1),(2,1),boss)
((4,1),(5,1),client)
((6,1),(7,1),cousin)
((7,1),(9,1),coworker)
((8,1),(9,1),father)
((10,10),(11,10),colleague)
((10,10),(12,10),colleague)
((11,10),(12,10),colleague)
-----------------------
((1,(1,Alice,20)),(2,(1,Bob,18)),friend)
((1,(1,Alice,20)),(3,(1,Charlie,30)),sister)
((1,(1,Alice,20)),(9,(1,Ivy,21)),friend)
((2,(1,Bob,18)),(4,(1,Dave,25)),brother)
((3,(1,Charlie,30)),(2,(1,Bob,18)),boss)
((4,(1,Dave,25)),(5,(1,Eve,30)),client)
((6,(1,Faith,21)),(7,(1,George,34)),cousin)
((7,(1,George,34)),(9,(1,Ivy,21)),coworker)
((8,(1,Harvey,47)),(9,(1,Ivy,21)),father)

((10,(10,Lily,35)),(11,(10,Helen,35)),colleague)
((10,(10,Lily,35)),(12,(10,Ann,35)),colleague)
((11,(10,Helen,35)),(12,(10,Ann,35)),colleague)

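Analysis:
1. The vertex attributes of the graph returned by connectedComponents no longer carry the original information, so the result has to be joined with the original vertex data, for example with cc.outerJoinVertices(peopleRDD).
2. Collecting the distinct values of cc.vertices.map(_._2) yields the smallest vertex id of every connected component.
3. With the smallest vertex id of each component, the subgraph method extracts each connected subgraph.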
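
The three steps above can also be sketched without Spark, which may help when checking the logic on a small sample. The following plain-Scala example is only an illustration under stated assumptions: the names SplitByComponent, Link and split are hypothetical, and the labels map is hard-coded from the demo output rather than computed.

```scala
object SplitByComponent {
  // Hypothetical stand-in for a GraphX edge with a String attribute.
  final case class Link(src: Long, dst: Long, relation: String)

  /** Splits links by component id, keeping a link only when both endpoints share a label. */
  def split(labels: Map[Long, Long], links: Seq[Link]): Map[Long, Seq[Link]] =
    links.flatMap { link =>
      for {
        a <- labels.get(link.src)
        b <- labels.get(link.dst)
        if a == b
      } yield a -> link
    }.groupBy(_._1).map { case (componentId, pairs) => componentId -> pairs.map(_._2) }

  def main(args: Array[String]): Unit = {
    // Component labels as reported by the demo output.
    val labels: Map[Long, Long] =
      ((1L to 9L).map(v => v -> 1L) ++ (10L to 12L).map(v => v -> 10L)).toMap
    val links = Seq(
      Link(1, 2, "friend"), Link(1, 3, "sister"), Link(2, 4, "brother"),
      Link(3, 2, "boss"), Link(4, 5, "client"), Link(1, 9, "friend"),
      Link(6, 7, "cousin"), Link(7, 9, "coworker"), Link(8, 9, "father"),
      Link(10, 11, "colleague"), Link(10, 12, "colleague"), Link(11, 12, "colleague"))
    split(labels, links).toSeq.sortBy(_._1).foreach { case (id, ls) =>
      println(s"component $id")
      ls.foreach(println)
      println()
    }
  }
}
```

Looking the labels up through Option means a link whose endpoint has no label is skipped instead of raising an exception, which is one way to avoid relying on p.get when the vertex data might be incomplete.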
