Spark GraphX Getting-Started Example: Complete Scala Code

Original post, December 20, 2014, 20:49:16

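Graph computation is a natural fit for many Internet scenarios, and it is gaining popularity. Spark GraphX is the member of the Spark stack responsible for graph computation. Plenty of introductions to graph computation and to Spark GraphX concepts are already available online, so I will not repeat them here. This post consolidates the code blocks from a good introductory Spark GraphX article into a single complete, executable class, and adds the necessary comments and the execution output, so that interested readers can quickly get to know Spark GraphX from the API side.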

The code blocks and most of the descriptive text in this post are quoted from the online article graph-analytics-with-graphx. My thanks to its author!


[1] Complete executable Scala code:

package scala.spark.graphx

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark._
import org.apache.spark.SparkContext._

object GraphXExample {
  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setAppName("GraphXExample")
    val sc = new SparkContext(conf)

    // [A] creating the Property Graph from arrays of vertices and edges
    println("[A] creating the Property Graph from arrays of vertices and edges")
    // Each vertex is keyed by a unique 64-bit long identifier (VertexId), like '1L'
    val vertexArray = Array(
      (1L, ("Alice", 28)),
      (2L, ("Bob", 27)),
      (3L, ("Charlie", 65)),
      (4L, ("David", 42)),
      (5L, ("Ed", 55)),
      (6L, ("Fran", 50)))
    // the Edge class stores a srcId, a dstId and the edge property
    val edgeArray = Array(
      Edge(2L, 1L, 7),
      Edge(2L, 4L, 2),
      Edge(3L, 2L, 4),
      Edge(3L, 6L, 3),
      Edge(4L, 1L, 1),
      Edge(5L, 2L, 2),
      Edge(5L, 3L, 8),
      Edge(5L, 6L, 3))

    // construct the following RDDs from the vertexArray and edgeArray variables.
    val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
    val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)

    // build a Property Graph
    val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

    // [B] Extract the vertex and edge RDD views of a graph
    println("[B] Extract the vertex and edge RDD views of a graph")
    // Solution 1
    println("Solution 1:============")
    graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
      case (id, (name, age)) => println(s"$name is $age")
    }

    // Solution 2
    println("Solution 2:============")
    graph.vertices.filter(v => v._2._2 > 30).collect.foreach(v => println(s"${v._2._1} is ${v._2._2}"))

    // Solution 3
    println("Solution 3:============")
    for ((id, (name, age)) <- graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect) {
      println(s"$name is $age")
    }

    // [C] Exposes a triplet view which logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD, ED]]
    println("[C] Exposes a triplet view which logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD, ED]]")
    println("Use the graph.triplets view to display who likes who: ")
    for (triplet <- graph.triplets.collect) {
      println(s"${triplet.srcAttr._1} likes ${triplet.dstAttr._1}")
    }

    // For extra credit, find the lovers.
    // If someone likes someone else more than 5 times, then that relationship is getting pretty serious.
    println("For extra credit, find the lovers if has:============")
    for (triplet <- graph.triplets.filter(t => t.attr > 5).collect) {
      println(s"${triplet.srcAttr._1} loves ${triplet.dstAttr._1}")
    }

    // [D] Graph Operators
    // Property Graphs also have a collection of basic operations
    println("[D] Graph Operators")

    // compute the in-degree of each vertex
    val inDegrees: VertexRDD[Int] = graph.inDegrees

    // Define a class to more clearly model the user property
    case class User(name: String, age: Int, inDeg: Int, outDeg: Int)
    // Create a user Graph
    val initialUserGraph: Graph[User, Int] = graph.mapVertices { case (id, (name, age)) => User(name, age, 0, 0) }

    // Fill in the degree information
    val userGraph = initialUserGraph.outerJoinVertices(initialUserGraph.inDegrees) {
      case (id, u, inDegOpt) => User(u.name, u.age, inDegOpt.getOrElse(0), u.outDeg)
    }.outerJoinVertices(initialUserGraph.outDegrees) {
      case (id, u, outDegOpt) => User(u.name, u.age, u.inDeg, outDegOpt.getOrElse(0))
    }

    // Here we use the outerJoinVertices method of Graph, which has the following (confusing) type signature:
    // def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

    // Using the degreeGraph print the number of people who like each user:
    println("Using the degreeGraph print the number of people who like each user:============")
    for ((id, property) <- userGraph.vertices.collect) {
      println(s"User $id is called ${property.name} and is liked by ${property.inDeg} people.")
    }

    // Print the names of the users who are liked by the same number of people they like.
    userGraph.vertices.filter {
      case (id, u) => u.inDeg == u.outDeg
    }.collect.foreach {
      case (id, property) => println(property.name)
    }

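    // [D.1] The Map Reduce Triplets Operator
    // The mapReduceTriplets operator enables neighborhood aggregation; here it is used to find the oldest follower of each user.
    // Note: mapReduceTriplets was deprecated in Spark 1.2 in favor of aggregateMessages and removed in Spark 2.0.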
    println("[D.1] The Map Reduce Triplets Operator")
    // Find the oldest follower for each user
    println("Find the oldest follower for each user:============")
    val oldestFollower: VertexRDD[(String, Int)] = userGraph.mapReduceTriplets[(String, Int)](
      // For each edge send a message to the destination vertex with the attribute of the source vertex
      edge => Iterator((edge.dstId, (edge.srcAttr.name, edge.srcAttr.age))),
      // To combine messages take the message for the older follower
      (a, b) => if (a._2 > b._2) a else b)
    userGraph.vertices.leftJoin(oldestFollower) { (id, user, optOldestFollower) =>
      optOldestFollower match {
        case None => s"${user.name} does not have any followers."
        case Some((name, age)) => s"${name} is the oldest follower of ${user.name}."
      }
    }.collect.foreach { case (id, str) => println(str) }

    // Try finding the average follower age of the followers of each user
    println("Try finding the average follower age of the followers of each user:============")
    val averageAge: VertexRDD[Double] = userGraph.mapReduceTriplets[(Int, Double)](
      // map function returns a tuple of (1, Age)
      edge => Iterator((edge.dstId, (1, edge.srcAttr.age.toDouble))),
      // reduce function combines (sumOfFollowers, sumOfAge)
      (a, b) => ((a._1 + b._1), (a._2 + b._2))).mapValues((id, p) => p._2 / p._1)

    // Display the results
    userGraph.vertices.leftJoin(averageAge) { (id, user, optAverageAge) =>
      optAverageAge match {
        case None => s"${user.name} does not have any followers."
        case Some(avgAge) => s"The average age of ${user.name}'s followers is $avgAge."
      }
    }.collect.foreach { case (id, str) => println(str) }

    // [D.2] Subgraph
    // The subgraph operator takes vertex and edge predicates and returns the graph
    // containing only the vertices that satisfy the vertex predicate (evaluate to true)
    // and the edges that satisfy the edge predicate and connect vertices that satisfy
    // the vertex predicate.
    println("[D.2] Subgraph")
    // restrict our graph to the users that are 30 or older
    println("restrict our graph to the users that are 30 or older:============")
    val olderGraph = userGraph.subgraph(vpred = (id, user) => user.age >= 30)
    // compute the connected components
    val cc = olderGraph.connectedComponents
    // display the component id of each user:
    olderGraph.vertices.leftJoin(cc.vertices) {
      case (id, user, comp) => s"${user.name} is in component ${comp.get}"
    }.collect.foreach { case (id, str) => println(str) }

    // release the SparkContext's resources before exiting
    sc.stop()

  }

}
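
Section [D.1] above relies on mapReduceTriplets, which no longer exists in Spark 2.0 and later. The sketch below shows how the "oldest follower" step might look with aggregateMessages instead. It is only an illustration under some assumptions: Spark 1.2 or later with GraphX on the classpath, a `local[*]` master for single-machine runs, and the hypothetical names `OldestFollowerSketch` and `oldestFollowers`, which do not appear in the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical object and method names, used only for this sketch.
object OldestFollowerSketch {

  // Builds the same six-user graph as above and returns, for every user
  // that has at least one follower, the (name, age) of the oldest follower.
  def oldestFollowers(sc: SparkContext): Map[Long, (String, Int)] = {
    val vertices = sc.parallelize(Seq(
      (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
      (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))))
    val edges = sc.parallelize(Seq(
      Edge(2L, 1L, 7), Edge(2L, 4L, 2), Edge(3L, 2L, 4), Edge(3L, 6L, 3),
      Edge(4L, 1L, 1), Edge(5L, 2L, 2), Edge(5L, 3L, 8), Edge(5L, 6L, 3)))
    val graph: Graph[(String, Int), Int] = Graph(vertices, edges)

    graph.aggregateMessages[(String, Int)](
      // send the source vertex's (name, age) to the destination vertex
      ctx => ctx.sendToDst(ctx.srcAttr),
      // when merging two messages, keep the older follower
      (a, b) => if (a._2 > b._2) a else b
    ).collect.toMap
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for running on a single machine
    val conf = new SparkConf().setAppName("OldestFollowerSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      oldestFollowers(sc).toSeq.sortBy(_._1).foreach { case (id, (name, age)) =>
        println(s"User $id: oldest follower is $name ($age)")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Unlike mapReduceTriplets, the send function here returns nothing and calls `sendToDst` on the edge context, so no intermediate Iterator is built. Users without followers (Ed in this data set) simply do not appear in the result.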


[2] Execution output:

[A] creating the Property Graph from arrays of vertices and edges
[B] Extract the vertex and edge RDD views of a graph
Solution 1:============
David is 42
Fran is 50
Charlie is 65
Ed is 55
Solution 2:============
David is 42
Fran is 50
Charlie is 65
Ed is 55
Solution 3:============
David is 42
Fran is 50
Charlie is 65
Ed is 55
[C] Exposes a triplet view which logically joins the vertex and edge properties yielding an RDD[EdgeTriplet[VD, ED]]
Use the graph.triplets view to display who likes who: 
Bob likes Alice
Bob likes David
Charlie likes Bob
Charlie likes Fran
David likes Alice
Ed likes Bob
Ed likes Charlie
Ed likes Fran
For extra credit, find the lovers if has:============
Bob loves Alice
Ed loves Charlie
[D] Graph Operators
Using the degreeGraph print the number of people who like each user:============
User 4 is called David and is liked by 1 people.
User 6 is called Fran and is liked by 2 people.
User 2 is called Bob and is liked by 2 people.
User 1 is called Alice and is liked by 2 people.
User 3 is called Charlie and is liked by 1 people.
User 5 is called Ed and is liked by 0 people.
David
Bob
[D.1] The Map Reduce Triplets Operator
Find the oldest follower for each user:============
Bob is the oldest follower of David.
Charlie is the oldest follower of Fran.
Charlie is the oldest follower of Bob.
David is the oldest follower of Alice.
Ed is the oldest follower of Charlie.
Ed does not have any followers.
Try finding the average follower age of the followers of each user:============
The average age of David's followers is 27.0.
The average age of Fran's followers is 60.0.
The average age of Bob's followers is 60.0.
The average age of Alice's followers is 34.5.
The average age of Charlie's followers is 55.0.
Ed does not have any followers.
[D.2] Subgraph
restrict our graph to the users that are 30 or older:============
David is in component 4
Fran is in component 3
Charlie is in component 3
Ed is in component 3
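
The numbers printed above can be cross-checked without a cluster. The following is a rough sketch in plain Scala collections (no Spark) that recomputes the in-degrees, the average follower ages, and the component ids from the same vertex and edge data. The names `PlainScalaCheck`, `inDegree`, `averageFollowerAge` and `components` are hypothetical. The component labelling assumes, as the output above shows, that each component is identified by its smallest vertex id.

```scala
// Hypothetical helper object, not part of the original article.
object PlainScalaCheck {
  val users: Map[Long, (String, Int)] = Seq(
    (1L, ("Alice", 28)), (2L, ("Bob", 27)), (3L, ("Charlie", 65)),
    (4L, ("David", 42)), (5L, ("Ed", 55)), (6L, ("Fran", 50))).toMap

  // (srcId, dstId, number of likes)
  val likes: Seq[(Long, Long, Int)] = Seq(
    (2L, 1L, 7), (2L, 4L, 2), (3L, 2L, 4), (3L, 6L, 3),
    (4L, 1L, 1), (5L, 2L, 2), (5L, 3L, 8), (5L, 6L, 3))

  // follower ids grouped by the user they like
  val followers: Map[Long, Seq[Long]] =
    likes.groupBy(_._2).map { case (dst, es) => (dst, es.map(_._1)) }

  def inDegree(id: Long): Int = followers.getOrElse(id, Seq.empty).size

  // None when the user has no followers, mirroring the leftJoin case above
  def averageFollowerAge(id: Long): Option[Double] =
    followers.get(id).map(fs => fs.map(f => users(f)._2.toDouble).sum / fs.size)

  // Label every user aged minAge or older with the smallest vertex id
  // reachable over the remaining edges (edges are treated as undirected).
  def components(minAge: Int): Map[Long, Long] = {
    val kept = users.filter { case (_, (_, age)) => age >= minAge }.keySet
    val links = likes.collect { case (s, d, _) if kept(s) && kept(d) => (s, d) }
    var label: Map[Long, Long] = kept.map(id => (id, id)).toMap
    var changed = true
    while (changed) {
      val next = links.foldLeft(label) { case (acc, (s, d)) =>
        val m = math.min(acc(s), acc(d))
        acc.updated(s, m).updated(d, m)
      }
      changed = next != label
      label = next
    }
    label
  }

  def main(args: Array[String]): Unit = {
    for (id <- users.keys.toSeq.sorted) {
      val name = users(id)._1
      println(s"$name: inDeg=${inDegree(id)}, avgFollowerAge=${averageFollowerAge(id)}")
    }
    for ((id, comp) <- components(30).toSeq.sortBy(_._1)) {
      println(s"${users(id)._1} is in component $comp")
    }
  }
}
```

David ends up alone in component 4 because both of his edges (Bob to David, David to Alice) touch users younger than 30 and are dropped together with those users.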

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
