Common Spark DataFrame operations: filter / groupBy / agg / pivot (Scala)

First, build a small sample dataset:

val dataDF = List(
  ("id1", "click", "0108", 1, 1.0),
  ("id1", "view",  "0101", 2, 1.0),
  ("id2", "buy",   "0105", 3, 7.0),
  ("id2", "click", "0104", 4, 9.0),
  ("id2", "click", "0105", 5, 1.0),
  ("id3", "buy",   "0106", 6, 1.0),
  ("id3", "view",  "0105", 7, 1.0),
  ("id3", "view",  "0105", 8, 1.0),
  ("id3", "view",  "0106", 9, 6.0),
  ("id3", "view",  "0105", 10, 10.0),
  ("id5", "view",  "0103", 11, 1.0),
  ("id5", "click", "0106", 12, 1.0)
).toDF("id", "action", "date", "grade", "rate")
dataDF.show()

Take a look at the result:

scala> dataDF.show()
+---+------+----+-----+----+
| id|action|date|grade|rate|
+---+------+----+-----+----+
|id1| click|0108|    1| 1.0|
|id1|  view|0101|    2| 1.0|
|id2|   buy|0105|    3| 7.0|
|id2| click|0104|    4| 9.0|
|id2| click|0105|    5| 1.0|
|id3|   buy|0106|    6| 1.0|
|id3|  view|0105|    7| 1.0|
|id3|  view|0105|    8| 1.0|
|id3|  view|0106|    9| 6.0|
|id3|  view|0105|   10|10.0|
|id5|  view|0103|   11| 1.0|
|id5| click|0106|   12| 1.0|
+---+------+----+-----+----+

1. The filter method

First, check the data type of each column:

val typeMap = dataDF.dtypes.toMap[String,String]

Result:

scala> val typeMap = dataDF.dtypes.toMap[String,String]
typeMap: scala.collection.immutable.Map[String,String] = Map(rate -> DoubleType, id -> StringType, date -> StringType, grade -> IntegerType, action -> StringType)
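
The type map can be used to branch filter logic on a column's type. A minimal sketch (the targetCol variable and the conditions here are only illustrations, not part of the original example):

// build a filter expression string that matches the column's type
val targetCol = "rate"
val condition = typeMap(targetCol) match {
  case "DoubleType" | "IntegerType" => s"$targetCol > 1.0"    // numeric comparison
  case "StringType"                 => s"$targetCol = 'view'" // string equality
  case other                        => sys.error(s"unhandled type: $other")
}
dataDF.filter(condition).show()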

Filtering on DoubleType / IntegerType columns

//DoubleType
dataDF.filter("rate=1.0").show()
dataDF.filter("rate>1.0").show()
dataDF.filter("rate>=1.0").show()
dataDF.filter("rate!=1.0").show()
//IntegerType
dataDF.filter("grade=1.0").show()
dataDF.filter("grade>1.0").show()
dataDF.filter("grade>=1.0").show()
dataDF.filter("grade!=1.0").show()

// Alternatively, reference columns with the $ syntax
import spark.implicits._ // required for the $ column syntax
val colname = "rate"
dataDF.filter($"rate" === 2).show()
dataDF.filter($"rate" > 2).show()
dataDF.filter($"rate" =!= 2).show()

When the column name and the value are held in variables:

val colname = "rate"
val ratenum = 1.0
dataDF.filter($"$colname" === ratenum).show()
dataDF.filter($"$colname" > ratenum).show()
dataDF.filter($"$colname" =!= ratenum).show()

Filtering on StringType columns

import spark.implicits._ 
val colname = "action"
dataDF.filter($"$colname".equalTo("view")).show()

Filtering out null values

dataDF.filter($"$colname".isNotNull).show()

Combining logical conditions

dataDF.filter($"$colname".isNotNull && ($"grade" < 2.0 || $"grade" > 5.0 ) ).show() 

2. The withColumn / groupBy / agg / pivot methods

This section walks through each method in turn.

withColumn: add a new column

import org.apache.spark.sql.functions.{col, count, explode, lit, max, min, sum, avg}

dataDF.withColumn("total1", lit(1)).show()
dataDF.withColumn("total1", lit("hello")).show()
groupBy: group rows and aggregate

// count rows per group
dataDF.groupBy("id").count().show()
// group-wise max / min / avg / sum
dataDF.groupBy("id").max("rate").show()
dataDF.groupBy("id").min("rate").show()
dataDF.groupBy("id").avg("rate").show()
dataDF.groupBy("id").sum("grade").show()

agg: apply several aggregation functions at once; used together with groupBy

dataDF.groupBy("id").agg(count("action"),max("rate"),min("rate"),avg("grade"),sum("grade")).show

alias: rename an aggregated column

dataDF.groupBy("id").agg(count("action").alias("action_count")).show() //返回值只有两列("id","action_count")
