分组函数groupBy
(1)分组计数
select address,count(1) from people group by address; 等价的算子如下
scala> peopleDF.show()
+--------+---+--------+
| name|age| address|
+--------+---+--------+
|zhangsan| 22| chengdu|
| wangwu| 33| beijing|
| lisi| 28|shanghai|
|xiaoming| 28| beijing|
| mm| 21| chengdu|
|xiaoming| 18| beijing|
| mm| 11| chengdu|
+--------+---+--------+
scala> peopleDF.groupBy("address").count().show
+--------+-----+
| address|count|
+--------+-----+
| beijing| 3|
| chengdu| 3|
|shanghai| 1|
+--------+-----+
(2)分组后求最值、平均值、求和的方法
//等价于select address,max(age) from people group by address;
scala> peopleDF.groupBy("address").max("age").show
+--------+--------+
| address|max(age)|
+--------+--------+
| beijing| 33|
| chengdu| 22|
|shanghai| 28|
+--------+--------+
//等价于select address,avg(age) from people group by address;
scala> peopleDF.groupBy("address").avg("age").show
+--------+------------------+
| address| avg(age)|
+--------+------------------+
| beijing|26.333333333333332|
| chengdu| 18.0|
|shanghai| 28.0|
+--------+------------------+
//等价于select address,min(age) from people group by address;
scala> peopleDF.groupBy("address").min("age").show
+--------+--------+
| address|min(age)|
+--------+--------+
| beijing| 18|
| chengdu| 11|
|shanghai| 28|
+--------+--------+
//等价于select address,sum(age) from people group by address;
scala> peopleDF.groupBy("address").sum("age").show
+--------+--------+
| address|sum(age)|
+--------+--------+
| beijing| 79|
| chengdu| 54|
|shanghai| 28|
+--------+--------+
(3)分组后,求多个聚合值(最值、平均值等)。使用算子groupBy+agg
//等价于select address,count(age),max(age),min(age),avg(age),sum(age) from people group by address;
scala> peopleDF.groupBy("address").agg(count("age"),max("age"),min("age"),avg("age"),sum("age")).show
+--------+----------+--------+--------+------------------+--------+
| address|count(age)|max(age)|min(age)| avg(age)|sum(age)|
+--------+----------+--------+--------+------------------+--------+
| beijing| 3| 33| 18|26.333333333333332| 79|
| chengdu| 3| 22| 11| 18.0| 54|
|shanghai| 1| 28| 28| 28.0| 28|
+--------+----------+--------+--------+------------------+--------+
(4)分组聚合后取别名
scala> peopleDF.groupBy("address").agg(count("age").as("cnt"),avg("age").as("avg")).show
+--------+---+------------------+
| address|cnt| avg|
+--------+---+------------------+
| beijing| 3|26.333333333333332|
| chengdu| 3| 18.0|
|shanghai| 1| 28.0|
+--------+---+------------------+
(5)分组后行转列,使用pivot。
//求同名用户在同一个地址的平均年龄
//把name的不同值作为列名
scala> peopleDF.groupBy("address").pivot("name").avg("age").show
+--------+----+----+------+--------+--------+
| address|lisi| mm|wangwu|xiaoming|zhangsan|
+--------+----+----+------+--------+--------+
| beijing|null|null| 33.0| 23.0| null|
| chengdu|null|16.0| null| null| 22.0|
|shanghai|28.0|null| null| null| null|
+--------+----+----+------+--------+--------+