Computing second-degree friend relationships two ways: Spark GraphX and Spark DataFrame

chengujun7940

Tags: big data, artificial intelligence, python

Original link: https://my.oschina.net/u/3455048/blog/1609264

Suppose we have the following data:

10010   95555   2016-11-11  15:55:54
10010   95556   2016-11-11  15:55:54
10010   95557   2016-11-11  15:55:54
10086   95555   2016-11-11  15:55:54
10086   95558   2016-11-11  15:55:54
10000   95555   2016-11-11  15:55:54
10000   95558   2016-11-11  15:55:54

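The first column is the user's phone number and the second column is the phone number of one of that user's friends. The goal is to compute, for every pair of users, how many friend numbers they have in common.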
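
Before bringing Spark into it, the computation can be worked through on the sample rows with plain Scala collections. This is only a minimal sketch: the object name `CommonFriendsSketch` and the function `commonFriendCounts` are made up for illustration and do not appear in the code that follows. It inverts the relation (friend number to users), forms every pair of users under the same friend number, and counts the pairs.

```scala
object CommonFriendsSketch {
  // Sample rows from the article: (self_mobile, relation_mobile)
  val contacts: Seq[(String, String)] = Seq(
    "10010" -> "95555", "10010" -> "95556", "10010" -> "95557",
    "10086" -> "95555", "10086" -> "95558",
    "10000" -> "95555", "10000" -> "95558"
  )

  /** Counts common friend numbers for every unordered pair of users. */
  def commonFriendCounts(rows: Seq[(String, String)]): Map[(String, String), Int] = {
    // Step 1: invert the relation, friend number -> sorted users who list it
    val usersByFriend: Map[String, Seq[String]] =
      rows.distinct.groupBy(_._2).map { case (friend, rs) => friend -> rs.map(_._1).sorted }

    // Step 2: every pair of users under the same friend number shares that friend
    val pairs: Seq[(String, String)] =
      usersByFriend.values.toSeq.flatMap(users => users.combinations(2).map(p => (p(0), p(1))))

    // Step 3: count how often each pair occurs
    pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
  }

  def main(args: Array[String]): Unit =
    commonFriendCounts(contacts).toSeq.sorted.foreach(println)
}
```

On the sample data, 10000 and 10086 share two numbers (95555 and 95558), while the other two pairs share only 95555.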

The Spark GraphX version is as follows:

package spark_graph

import org.apache.spark.graphx.{Edge, _}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by dongdong on 18/1/18.
  */
object Spark_Contact_Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Spark_Contact_Test")
      .setMaster("local[4]")

    val sc = new SparkContext(conf)

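    // Build the vertices
    val userVertex = sc.textFile("/Users/dongdong/Desktop/spark_contact/contact/data_test/date_contact").map(line => {
      // Split the line on whitespace
      val words = line.split("\\s+")
      // The user's own number, with any non-digit characters stripped
      val self_mobile = words(0).split("[^0-9]").filter(_.length > 0).mkString("")
      // The friend's number, with any non-digit characters stripped
      val relation_mobile = words(1).split("[^0-9]").filter(_.length > 0).mkString("")
      (self_mobile, relation_mobile)
    }
    ).filter(t =>
      // Drop numbers that are not 5 digits long. The sample data uses 5-digit numbers; change this check if production data uses 11-digit numbers
      t._1.length == 5 && t._2.length == 5
    ).map(x =>
      // The first element is the vertex id and the second is the vertex attribute, e.g. (10010, 95555)
      (x._1.toLong, x._2)
    )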

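    // Build the edges
    val edge = userVertex.map(vertex => {
      // Edge endpoints must be Long vertex ids. Each edge links a user's number to a friend's number, with edge attribute 1
      Edge(vertex._1, vertex._2.toLong, 1)
    })
    // Default vertex attribute, used for vertices that appear only in edges (the friend numbers)
    val defaultVertex = ("00000")

    // Vertices, edges and a default vertex attribute are enough to build a graph
    val graphContact = Graph(userVertex, edge, defaultVertex)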

    // Triplets contain the source vertex, source attribute, destination vertex, destination attribute and edge attribute, e.g.
    //    srcId=10010 srcAttr=95555 attr=1 dstId=95555 dstAttr=00000
    //    graphContact.triplets.map(triplet => {
    //      "srcId=" + triplet.srcId + " srcAttr=" + triplet.srcAttr + " attr=" + triplet.attr + " dstId=" + triplet.dstId + " dstAttr=" + triplet.dstAttr
    //
    //    }).collect().foreach(print(_))

    /*
    Source code of aggregateMessages:
     def aggregateMessages[A: ClassTag](
      sendMsg: EdgeContext[VD, ED, A] => Unit,
      mergeMsg: (A, A) => A,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[A] = {
    aggregateMessagesWithActiveSet(sendMsg, mergeMsg, tripletFields, None)
     */
    /*
    This step collects, for each relation_mobile, all the self_mobile values that list it, joined with ","
    (95555,10010,10086,10000)
    (95556,10010)
    (95558,10000,10086)
    */
    val aggregateMessages: VertexRDD[String] = graphContact.aggregateMessages(msgFun, reduceFun)

    /*
    Sort and deduplicate the values, then drop entries with only one value, such as the tuple (95556,10010), because no two users share that number
    (95558,List(10000, 10086))
    (95555,List(10000, 10010, 10086))
     */
    val sortAndFilter = aggregateMessages.mapValues(
      tuple => {
        val list = tuple.split(",").toList.sorted.distinct
        list
      }
    ).filter(_._2.size > 1)

    /*
    Put every pair of self_mobile values that share a friend number under one key
    Map(10000,10086 -> 95558)
    Map(10000,10086 -> 95555, 10000,10010 -> 95555, 10010,10086 -> 95555)
     */
    val hmRDD = sortAndFilter.map(t => {
      val hm = new scala.collection.mutable.HashMap[String, String]()
      // j always starts after i, so each unordered pair is produced exactly once
      for (i <- 0 until t._2.size; j <- i + 1 until t._2.size) {
        val key = t._2(i) + "," + t._2(j)
        val value = t._1.toString
        hm(key) = value
      }
      hm
    })


    /*
    Flatten each map into tuples, then groupByKey
     (10000,10086,CompactBuffer(95558, 95555))
     (10010,10086,CompactBuffer(95555))
     (10000,10010,CompactBuffer(95555))
     */
    val hm2TupleRDD = hmRDD
      .flatMap(_.toList)
      .groupByKey()


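    /*
    Vertices of the second graph
     (10000,2)
     (10086,2)
     (10010,3)
      The first element is the vertex id and the second is the vertex attribute (the user's friend count)
     */
    val userVertexTwo = userVertex.groupByKey().mapValues(t => {
      t.size
    })
    // Default vertex attribute for the second graph
    val defaultVertexTwo = (0)

    // Build the edges of the second graph; the edge attribute is the number of common friends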
    val edgeTwo = hm2TupleRDD.map(t => {
      val split = t._1.split(",")
      Edge(split(0).toLong, split(1).toLong, t._2.toList.size)
    })

    val graphContactTwo = Graph(userVertexTwo, edgeTwo, defaultVertexTwo)


    /*
    Sample output (fields are tab-separated; line order may vary):
10000 linked to 10086	friends of 10000=2 similarity=1.0 common friends=2.0	10086 linked to 10000	friends of 10086=2 similarity=1.0 common friends=2.0
10000 linked to 10010	friends of 10000=2 similarity=0.5 common friends=1.0	10010 linked to 10000	friends of 10010=3 similarity=0.33333334 common friends=1.0
10010 linked to 10086	friends of 10010=3 similarity=0.33333334 common friends=1.0	10086 linked to 10010	friends of 10086=2 similarity=0.5 common friends=1.0
 */
    graphContactTwo.triplets.map(t => {
      val usr1 = t.srcId
      val usr2 = t.dstId
      val common_friend = t.attr.toFloat
      // Similarity = number of common friends / the user's own friend count
      val usr1_usr2 = common_friend / t.srcAttr.toFloat
      val usr2_usr1 = common_friend / t.dstAttr.toFloat
      s"$usr1 linked to $usr2\tfriends of $usr1=${t.srcAttr} similarity=$usr1_usr2 common friends=$common_friend\t" +
        s"$usr2 linked to $usr1\tfriends of $usr2=${t.dstAttr} similarity=$usr2_usr1 common friends=$common_friend"
    })
      .foreach(println(_))

    sc.stop()
  }

  // Map function: send the self_mobile (source vertex id) to the destination vertex
  def msgFun(triplet: EdgeContext[(String), Int, String]): Unit = {
    triplet.sendToDst(triplet.srcId.toString)
  }

  // Reduce function: relation_mobile acts as the key, much like reduceByKey, joining the messages with ","
  def reduceFun(a: (String), b: (String)): String = a + "," + b

}
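
The similarity printed in the last step above is directional: the number of common friends divided by one user's own friend count. The sketch below isolates that arithmetic in plain Scala. The names `SimilaritySketch` and `similarity` are hypothetical, and the two maps simply restate the friend counts and common-friend counts that the GraphX job derives from the sample data.

```scala
object SimilaritySketch {
  /** Directional similarity: shared friends divided by the user's own friend count. */
  def similarity(commonFriends: Int, ownFriendCount: Int): Float =
    commonFriends.toFloat / ownFriendCount.toFloat

  def main(args: Array[String]): Unit = {
    // Friend counts per user (vertex attributes of the second graph)
    val friendCount = Map(10000L -> 2, 10010L -> 3, 10086L -> 2)
    // Common-friend counts per pair (edge attributes of the second graph)
    val common = Map((10000L, 10086L) -> 2, (10000L, 10010L) -> 1, (10010L, 10086L) -> 1)

    for (((u, v), c) <- common) {
      println(s"$u -> $v: ${similarity(c, friendCount(u))}, $v -> $u: ${similarity(c, friendCount(v))}")
    }
  }
}
```

Because the denominator differs per user, the value is not symmetric: 10000 and 10010 share one number, which is half of 10000's two friends but only a third of 10010's three.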

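After writing the map and reduce functions plus a whole chain of RDD transformations, I was close to sick of it myself. Handling the problem this way makes the code far too complicated.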

The following version uses a DataFrame instead.

The code is as follows:

package spark_graph

import org.apache.spark.sql.SparkSession

/**
  * Created by dongdong on 18/1/18.
  */
object ContactDataFrame {


  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("ContactDataFrame")
      .master("local")
      .getOrCreate()

    import spark.implicits._
    // Load the data into a DataFrame
    val userDataFrame = spark.sparkContext.textFile("/Users/dongdong/Desktop/spark_contact/contact/data_test/date_contact").map(line => {
      // Split the line on whitespace
      val words = line.split("\\s+")
      // The user's own number, with any non-digit characters stripped
      val self_mobile = words(0).split("[^0-9]").filter(_.length > 0).mkString("")
      // The friend's number, with any non-digit characters stripped
      val relation_mobile = words(1).split("[^0-9]").filter(_.length > 0).mkString("")
      (self_mobile, relation_mobile)
    }
    ).filter(t =>
      // Drop numbers that are not 5 digits long. The sample data uses 5-digit numbers; change this check if production data uses 11-digit numbers
      t._1.length == 5 && t._2.length == 5
    ).toDF("self_mobile","relation_mobile")

    // Register the DataFrame as a temporary view
    userDataFrame.createOrReplaceTempView("t_user_contact")
    // Cache the view in memory
    spark.catalog.cacheTable("t_user_contact")

   val resultDataframe= spark.sql(
     """
       |select
       |user_mobile,
       |friend_mobile,
       |count(1) as common_mobile_cnt
       |from
       |(select
       |a.self_mobile as user_mobile,
       |b.self_mobile as friend_mobile,
       |a.relation_mobile as common_mobile
       |from
       |(
       |select
       |distinct
       |self_mobile,
       |relation_mobile
       |from
       |t_user_contact
       |)a
       |inner join
       |(select
       |distinct
       |self_mobile,
       |relation_mobile
       |from
       |t_user_contact
       |)b
       |on
       |a.relation_mobile=b.relation_mobile
       |)c
       |where user_mobile!=friend_mobile
       |group by user_mobile,friend_mobile
       |
     """.stripMargin)

    resultDataframe.show(false)

    // Clear the cache
    spark.catalog.clearCache()
    spark.stop()


  }

}

The result:

+-----------+-------------+-----------------+
|user_mobile|friend_mobile|common_mobile_cnt|
+-----------+-------------+-----------------+
|10010      |10086        |1                |
|10086      |10010        |1                |
|10000      |10010        |1                |
|10086      |10000        |2                |
|10000      |10086        |2                |
|10010      |10000        |1                |
+-----------+-------------+-----------------+
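
Each pair shows up twice in this table, once in each direction, because the self-join only excludes rows where both sides are the same user. The plain-Scala sketch below mimics the query's logic on the sample rows to make that visible. The names `SelfJoinSketch`, `orderedPairCounts` and `unorderedPairCounts` are invented for illustration, and the `a < b` filter is only one assumed way to keep a single row per pair (roughly what replacing `!=` with `<` in the where clause would do for these equal-length numbers).

```scala
object SelfJoinSketch {
  // Sample rows from the article: (self_mobile, relation_mobile)
  val contacts: Seq[(String, String)] = Seq(
    "10010" -> "95555", "10010" -> "95556", "10010" -> "95557",
    "10086" -> "95555", "10086" -> "95558",
    "10000" -> "95555", "10000" -> "95558"
  )

  /** Mimics the query: distinct rows, self-join on relation_mobile, drop self-pairs, group and count. */
  def orderedPairCounts(rows: Seq[(String, String)]): Map[(String, String), Int] = {
    val distinctRows = rows.distinct
    val joined = for {
      (userA, friendA) <- distinctRows
      (userB, friendB) <- distinctRows
      if friendA == friendB && userA != userB
    } yield (userA, userB)
    joined.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  /** Keeps one row per unordered pair by requiring the first number to sort before the second. */
  def unorderedPairCounts(rows: Seq[(String, String)]): Map[(String, String), Int] =
    orderedPairCounts(rows).filter { case ((a, b), _) => a < b }

  def main(args: Array[String]): Unit = {
    orderedPairCounts(contacts).toSeq.sorted.foreach(println)
    unorderedPairCounts(contacts).toSeq.sorted.foreach(println)
  }
}
```

The ordered version yields the same six rows as the table, and the filtered version keeps three of them.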

Extremely concise and clear. My advice: whenever SQL can get the job done, don't hand-write functions.

Reposted from: https://my.oschina.net/u/3455048/blog/1609264
