Spark: Converting a DataFrame to RDD[Vector] for KMeans Clustering, and Merging the Results Back with monotonically_increasing_id()


Converting a DataFrame to RDD[Vector] and applying KMeans clustering

  • Model training
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.sql.{DataFrame, Row}

var res_data = data1          // input DataFrame (see the "Extension" section below for how data1 is built)
var bucketnum = 3
val numClusters = bucketnum   // number of clusters
val numIterations = 20        // maximum number of KMeans iterations

// Extract the AMOUNT column as an RDD[Vector], the input format expected by mllib's KMeans
var p = res_data.select("AMOUNT").rdd.map { case Row(s: Double) => Vectors.dense(Array(s)) }

// Train the model and predict a cluster id for every row
val clusters = KMeans.train(p, numClusters, numIterations)
var tt_data = clusters.predict(p)

tt_data.collect().toList



You can also print the result:

tt_data.collect().foreach(println)

1
1
0
1
0
0
2
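As a quick sanity check (a minimal sketch, not part of the original post), the trained KMeansModel also exposes the learned cluster centers and a cost metric:

// Print the learned cluster centers (one Vector per cluster)
clusters.clusterCenters.foreach(println)

// Within Set Sum of Squared Errors (WSSSE): lower means tighter clusters
val wssse = clusters.computeCost(p)
println(s"Within Set Sum of Squared Errors = $wssse")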






  • Merging the results back into the original table

The key point is how to stitch the original data and the predicted cluster labels together in the same row order:

Use ***monotonically_increasing_id()*** from org.apache.spark.sql.functions to generate a column of monotonically increasing 64-bit IDs. The values are guaranteed to be increasing and unique, but not consecutive, and the number of partitions is unchanged.
Note:
Before Spark 2.0 the function was called monotonicallyIncreasingId;
from 2.0 onward it is monotonically_increasing_id().
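For illustration, here is a minimal sketch (assuming a spark session is in scope) showing that the generated IDs increase but are not consecutive:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Each partition gets its own block of IDs (partition index in the upper 31 bits,
// a per-partition counter in the lower 33 bits), so values jump between partitions.
val demo = spark.range(6).repartition(3).withColumn("mono_id", monotonically_increasing_id())
demo.show()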

// Convert the prediction result from an RDD into a DataFrame and attach a sequential row index
// (toDF on an RDD requires import spark.implicits._, already in scope from the data setup below)

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

var index_data = tt_data.toDF("tmp")
  .withColumn("tindex", monotonically_increasing_id())
  .withColumn("index", row_number().over(Window.orderBy("tindex")))
  .drop("tindex")

// Attach the same sequential row index to the original data

var org_data = data1
  .withColumn("tindex", monotonically_increasing_id())
  .withColumn("index", row_number().over(Window.orderBy("tindex")))
  .drop("tindex")

// Join the cluster labels back onto the original rows via the shared index column
var res_data = org_data.join(index_data, Seq("index"), "left")
  .drop("index")
  .withColumn("tmp", col("tmp").cast("double"))
res_data.show()

+-----+-------+------+----+------+---+
|label| AMOUNT|Pclass|name|MAC_id|tmp|
+-----+-------+------+----+------+---+
|  0.0| 2002.0|   196| 1.5|   bai|1.0|
|  1.0| 4004.0|   192| 2.1|  wang|1.0|
|  0.0| 7007.0|    95| 2.1|  wang|0.0|
|  0.0| 4004.0|     4| 3.4|    wa|1.0|
|  1.0| 7007.0|    15| 3.4|    wa|0.0|
|  1.0| 7007.0|    15| 3.4|    wa|0.0|
|  1.0|    0.0|    14| 4.7|   zhu|2.0|
|  0.0| 9009.0|    96| 1.5|   bai|0.0|
|  0.0| 2002.0|   196| 1.5|   bai|1.0|
|  1.0| 4004.0|   192| 2.1|  wang|1.0|
|  0.0| 7007.0|    95| 2.1|  wang|0.0|
|  0.0| 4004.0|     4| 3.4|    wa|1.0|
|  1.0| 7007.0|    15| 3.4|    wa|0.0|
|  1.0| 7007.0|    15| 3.4|    wa|0.0|
|  1.0|    0.0|    14| 4.7|   zhu|2.0|
|  0.0| 9009.0|    96| 1.5|   bai|0.0|
|  1.0| 9009.0|   126| 1.5|   bai|0.0|
|  1.0| 9009.0|   126| 5.9|   wei|0.0|
|  0.0|10010.0| 19219| 5.9|   wei|0.0|
+-----+-------+------+----+------+---+
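As an alternative (a sketch not taken from the original post), since clusters.predict(p) produces exactly one label per input row in the same order, the original rows and the predictions can also be zipped directly at the RDD level, avoiding the global Window.orderBy (which pulls all rows into a single partition). This relies on both RDDs having identical partitioning and per-partition counts, which should hold here because p was derived from data1 without a shuffle:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Pair each original row with its predicted cluster id, preserving order
val zippedRdd = data1.rdd.zip(tt_data.map(_.toDouble)).map { case (row, cluster) =>
  Row.fromSeq(row.toSeq :+ cluster)
}

// Extend the original schema with the new "tmp" column and rebuild the DataFrame
val zippedSchema = StructType(data1.schema.fields :+ StructField("tmp", DoubleType, nullable = false))
val res_data_alt = spark.createDataFrame(zippedRdd, zippedSchema)
res_data_alt.show()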

Done!

Extension: Converting a Spark DataFrame column to a Vector

import spark.implicits._

var data1 = Seq(
  ("0.0", "2002", "196", "1.5", "bai"),
  ("1.0", "4004", "192", "2.1", "wang"),
  ("0.0", "7007", "95", "2.1", "wang"),
  ("0.0", "4004", "4", "3.4", "wa"),
  ("1.0", "7007", "15", "3.4", "wa"),
  ("1.0", "7007", "15", "3.4", "wa"),
  ("1.0", "0",    "14", "4.7", "zhu"),
  ("0.0", "9009", "96", "1.5", "bai"),
  ("0.0", "2002", "196", "1.5", "bai"),
  ("1.0", "4004", "192", "2.1", "wang"),
  ("0.0", "7007", "95", "2.1", "wang"),
  ("0.0", "4004", "4", "3.4", "wa"),
  ("1.0", "7007", "15", "3.4", "wa"),
  ("1.0", "7007", "15", "3.4", "wa"),
  ("1.0", "0",    "14", "4.7", "zhu"),
  ("0.0", "9009", "96", "1.5", "bai"),
  ("1.0", "9009", "126", "1.5", "bai"),
  ("1.0", "9009", "126", "5.9", "wei"),
  ("0.0","10010", "19219", "5.9", "wei")
).toDF("label", "AMOUNT", "Pclass", "name", "MAC_id")
import org.apache.spark.sql.functions._

// Cast the numeric columns from String to Double
data1 = data1.withColumn("AMOUNT", col("AMOUNT").cast("double"))
data1 = data1.withColumn("name", col("name").cast("double"))
data1 = data1.withColumn("label", col("label").cast("double"))

data1.show()

// Collect the AMOUNT column to the driver and build a local mllib Vector
// (uses the org.apache.spark.mllib.linalg.Vectors import from above)
var q = Vectors.dense(res_data.select("AMOUNT").collect().map(_(0)).map(_.toString.toDouble))

Result:
q: org.apache.spark.mllib.linalg.Vector = [2002.0,4004.0,7007.0,4004.0,7007.0,7007.0,0.0,9009.0,2002.0,4004.0,7007.0,4004.0,7007.0,7007.0,0.0,9009.0,9009.0,9009.0,10010.0]
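Side note (not from the original post): if what you need is a vector column inside the DataFrame rather than a local Vector collected to the driver, spark.ml's VectorAssembler builds it in a distributed way. A minimal sketch using data1 from above (note that it produces org.apache.spark.ml.linalg vectors, not mllib ones):

import org.apache.spark.ml.feature.VectorAssembler

// Assemble one or more numeric columns into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("AMOUNT"))
  .setOutputCol("features")

val assembled = assembler.transform(data1)
assembled.select("AMOUNT", "features").show(5)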
