Common Spark DataFrame operations: the filter/groupBy/agg/pivot methods
First, construct some sample data:
// run in spark-shell; toDF requires spark.implicits._ (pre-imported there)
val dataDF = List(
("id1", "click","0108",1,1.0),
("id1", "view","0101",2,1.0),
("id2", "buy","0105",3,7.0),
("id2", "click","0104",4,9.0),
("id2", "click","0105",5,1.0),
("id3", "buy","0106",6,1.0),
("id3", "view","0105",7,1.0),
("id3", "view","0105",8,1.0),
("id3", "view","0106",9,6.0),
("id3", "view","0105",10,10.0),
("id5", "view","0103",11,1.0),
("id5", "click","0106",12,1.0)).toDF("id", "action","date","grade","rate")
Inspect the result:
scala> dataDF.show()
+---+------+----+-----+----+
| id|action|date|grade|rate|
+---+------+----+-----+----+
|id1| click|0108| 1| 1.0|
|id1| view|0101| 2| 1.0|
|id2| buy|0105| 3| 7.0|
|id2| click|0104| 4| 9.0|
|id2| click|0105| 5| 1.0|
|id3| buy|0106| 6| 1.0|
|id3| view|0105| 7| 1.0|
|id3| view|0105| 8| 1.0|
|id3| view|0106| 9| 6.0|
|id3| view|0105| 10|10.0|
|id5| view|0103| 11| 1.0|
|id5| click|0106| 12| 1.0|
+---+------+----+-----+----+
I. The filter method
First, check the data type of each column:
val typeMap = dataDF.dtypes.toMap[String,String]
Result:
scala> val typeMap = dataDF.dtypes.toMap[String,String]
typeMap: scala.collection.immutable.Map[String,String] = Map(rate -> DoubleType, id -> StringType, date -> StringType, grade -> IntegerType, action -> StringType)
Filtering on DoubleType/IntegerType columns
//DoubleType
dataDF.filter("rate=1.0").show()
dataDF.filter("rate>1.0").show()
dataDF.filter("rate>=1.0").show()
dataDF.filter("rate!=1.0").show()
//IntegerType
dataDF.filter("grade=1.0").show()
dataDF.filter("grade>1.0").show()
dataDF.filter("grade>=1.0").show()
dataDF.filter("grade!=1.0").show()
// or use the Column API
import spark.implicits._ // required for the $ column syntax
dataDF.filter($"rate" === 2).show()
dataDF.filter($"rate" > 2).show()
dataDF.filter($"rate" =!= 2).show()
When the column name and value come from variables:
val colname = "rate"
val ratenum = 1.0
dataDF.filter($"$colname" === ratenum).show()
dataDF.filter($"$colname" > ratenum).show()
dataDF.filter($"$colname" =!= ratenum).show()
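When the whole predicate is dynamic, the same filters can also be built as SQL expression strings via Scala string interpolation; a sketch reusing the variables above:

```scala
val colname = "rate"
val ratenum = 1.0
// build the predicate as a SQL expression string
dataDF.filter(s"$colname > $ratenum").show()  // equivalent to filter($"rate" > 1.0)
dataDF.filter(s"$colname = $ratenum").show()
```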
Filtering on StringType columns
import spark.implicits._
val colname = "action"
dataDF.filter($"$colname".equalTo("view")).show()
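Besides equalTo, the Column API offers other string predicates; a brief sketch on the same action column:

```scala
dataDF.filter($"action".isin("view", "buy")).show() // value in a set
dataDF.filter($"action".startsWith("cl")).show()    // prefix match
dataDF.filter($"action".contains("ie")).show()      // substring match
```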
Keep only rows where the column is non-null:
dataDF.filter($"$colname".isNotNull).show()
Combining logical conditions:
dataDF.filter($"$colname".isNotNull && ($"grade" < 2.0 || $"grade" > 5.0 ) ).show()
II. The withColumn/groupBy/agg/pivot methods
1. Method overview
withColumn adds a new column; lit creates a constant Column (import org.apache.spark.sql.functions.lit):
dataDF.withColumn("total1",lit(1)).show()
dataDF.withColumn("total1",lit("hello")).show()
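withColumn can also derive a column from existing ones rather than a constant; a sketch (the new column names rate2 and is_click are arbitrary):

```scala
import org.apache.spark.sql.functions.col
dataDF.withColumn("rate2", col("rate") * 2).show()              // arithmetic on a column
dataDF.withColumn("is_click", col("action") === "click").show() // boolean column
```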
groupBy performs grouped aggregation; first import the aggregation functions:
import org.apache.spark.sql.functions.{col, count, explode, lit, max, min, sum, avg}
// count rows per group
dataDF.groupBy("id").count().show()
// per-group max/min/avg/sum
dataDF.groupBy("id").max("rate").show()
dataDF.groupBy("id").min("rate").show()
dataDF.groupBy("id").avg("rate").show()
dataDF.groupBy("id").sum("rate").show()
agg aggregates several columns at once and is used together with groupBy:
dataDF.groupBy("id").agg(count("action"),max("rate"),min("rate"),avg("grade"),sum("grade")).show
alias renames an aggregated column:
dataDF.groupBy("id").agg(count("action").alias("action_count")).show() // the result has only two columns: ("id", "action_count")
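The pivot method named in the section title is not demonstrated above; it turns the distinct values of one column into separate output columns. A sketch pivoting on action:

```scala
// one row per id, one column per distinct action value ("buy", "click", "view"),
// cells filled with sum(rate); null where no such (id, action) combination exists
dataDF.groupBy("id").pivot("action").sum("rate").show()
```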