Spark Dataset operations (3): grouping, aggregation, and sorting
The last post promised that the next topic would be grouping and aggregation. There's a lot to cover and not much time, so let's get straight to it.
The data is the same dataset we built in the last post:
scala> val df = spark.createDataset(Seq(
("aaa",1,2),("bbb",3,4),("ccc",3,5),("bbb",4, 6)) ).toDF("key1","key2","key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]
scala> df.printSchema
root
|-- key1: string (nullable = true)
|-- key2: integer (nullable = false)
|-- key3: integer (nullable = false)
scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa| 1| 2|
| bbb| 3| 4|
| ccc| 3| 5|
| bbb| 4| 6|
+----+----+----+
Let's start with the simplest group count:
/*
Equivalent SQL: select key1, count(*) from table group by key1
*/
scala> df.groupBy("key1").count.show
+----+-----+
|key1|count|
+----+-----+
| ccc| 1|
| aaa| 1|
| bbb| 2|
+----+-----+
Note that the count column above is not the total number of records; it is the number of records within each group produced by groupBy. If what you want is how many groups there are, use the code below. On this sample dataset the answer is 3:
/*
Equivalent SQL: select distinct key1 from table
*/
scala> df.select("key1").distinct.show
+----+
|key1|
+----+
| ccc|
| aaa|
| bbb|
+----+
/*
Equivalent SQL:
select count(distinct key1) from table
*/
scala> df.select("key1").distinct.count
res3: Long = 3
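As a side note, the distinct count can also be done in a single step with the countDistinct aggregate, without a separate distinct. This is a quick sketch: countDistinct lives in org.apache.spark.sql.functions, which spark-shell imports for you.
/*
Equivalent SQL: select count(distinct key1) from table
*/
scala> df.select(countDistinct("key1")).show
This prints a single row containing 3; the exact column header (something like count(DISTINCT key1)) varies slightly across Spark versions.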
In the results above, if you have a touch of OCD like me, you'll have noticed that key1 is not sorted. Unacceptable. Fix it like this:
/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by key1
*/
scala> df.groupBy("key1").count.sort("key1").show
+----+-----+
|key1|count|
+----+-----+
| aaa| 1|
| bbb| 2|
| ccc| 1|
+----+-----+
/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by key1 desc
*/
scala> df.groupBy("key1").count.sort($"key1".desc).show
+----+-----+
|key1|count|
+----+-----+
| ccc| 1|
| bbb| 2|
| aaa| 1|
+----+-----+
Note that the first query above is ascending and the second descending, matching "select key1, count(*) from table group by key1 order by key1 desc". When desc is specified, key1 carries a $ prefix; as explained in the last post, that is syntactic sugar for col(column-name). The $ prefix won't be explained again from here on.
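If you'd rather avoid the $ sugar entirely, the desc function from org.apache.spark.sql.functions does the same job (a sketch; orderBy is interchangeable with sort here):
scala> df.groupBy("key1").count.orderBy(desc("key1")).show
+----+-----+
|key1|count|
+----+-----+
| ccc|    1|
| bbb|    2|
| aaa|    1|
+----+-----+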
Let's refine this further. What if we want the groups ordered by their counts, descending? From the earlier show output, the count column is displayed under the name count, so the following naturally comes to mind:
/*
Equivalent SQL:
select key1, count(*) from table
group by key1
order by count(*) desc
*/
scala> df.groupBy("key1").count.sort($"count".desc).show
+----+-----+
|key1|count|
+----+-----+
| bbb| 2|
| ccc| 1|
| aaa| 1|
+----+-----+
Alternatively, rename the column with the withColumnRenamed function:
/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
order by cnt desc
*/
scala> df.groupBy("key1").count.withColumnRenamed("count", "cnt").sort($"cnt".desc).show
+----+---+
|key1|cnt|
+----+---+
| bbb| 2|
| aaa| 1|
| ccc| 1|
+----+---+
The more common approach is to give count(*) an alias directly, like this:
/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1").as("cnt")).show
+----+---+
|key1|cnt|
+----+---+
| ccc| 1|
| aaa| 1|
| bbb| 2|
+----+---+
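Putting the alias together with the sort from earlier gives the full pattern (a sketch; the key1 tiebreak is added only to make the order of the two single-count groups deterministic):
/*
Equivalent SQL:
select key1, count(*) as cnt from table
group by key1
order by cnt desc, key1
*/
scala> df.groupBy("key1").agg(count("key1").as("cnt")).sort($"cnt".desc, $"key1").show
+----+---+
|key1|cnt|
+----+---+
| bbb|  2|
| aaa|  1|
| ccc|  1|
+----+---+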
So we've now introduced the aggregation function agg. It is usually used together with groupBy, and its usage is flexible. Its overloads are listed below, followed by a few examples; note the difference between the Column-typed and String-typed parameters:
def agg(expr: Column, exprs: Column*): DataFrame
def agg(exprs: Map[String, String]): DataFrame
def agg(aggExpr: (String, String),
aggExprs: (String, String)*): DataFrame
/*
Equivalent SQL:
select key1, count(key1), max(key2), avg(key3)
from table
group by key1
*/
scala> df.groupBy("key1").agg(count("key1"), max("key2"), avg("key3")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc| 1| 3| 5.0|
| aaa| 1| 1| 2.0|
| bbb| 2| 4| 5.0|
+----+-----------+---------+---------+
scala> df.groupBy("key1")
.agg("key1"->"count", "key2"->"max", "key3"->"avg").show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc| 1| 3| 5.0|
| aaa| 1| 1| 2.0|
| bbb| 2| 4| 5.0|
+----+-----------+---------+---------+
scala> df.groupBy("key1").agg(Map(("key1","count"), ("key2","max"), ("key3","avg"))).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc| 1| 3| 5.0|
| aaa| 1| 1| 2.0|
| bbb| 2| 4| 5.0|
+----+-----------+---------+---------+
scala> df.groupBy("key1")
.agg(("key1","count"), ("key2","max"), ("key3","avg")).show
+----+-----------+---------+---------+
|key1|count(key1)|max(key2)|avg(key3)|
+----+-----------+---------+---------+
| ccc| 1| 3| 5.0|
| aaa| 1| 1| 2.0|
| bbb| 2| 4| 5.0|
+----+-----------+---------+---------+
/*
Equivalent SQL:
select key1, count(key1) cnt,
max(key2) max_key2, avg(key3) avg_key3
from table
group by key1
order by cnt, max_key2 desc
*/
scala> df.groupBy("key1").agg(count("key1").as("cnt"), max("key2").as("max_key2"), avg("key3").as("avg_key3")).sort($"cnt",$"max_key2".desc).show
+----+---+--------+--------+
|key1|cnt|max_key2|avg_key3|
+----+---+--------+--------+
| ccc| 1| 3| 5.0|
| aaa| 1| 1| 2.0|
| bbb| 2| 4| 5.0|
+----+---+--------+--------+
That's about it for now. sort() and orderBy() are the same thing; orderBy is simply an alias for sort.
If you like this content, feel free to follow my WeChat official account~
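For example, the very first sorted query above can be written either way with identical results (a quick sketch; orderBy also accepts plain String column names):
scala> df.groupBy("key1").count.orderBy("key1").show
+----+-----+
|key1|count|
+----+-----+
| aaa|    1|
| bbb|    2|
| ccc|    1|
+----+-----+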