SparkSQL之内置函数--groupBy()和agg()

创建Dataframe

scala>val df= Seq(
     |     ("01","Jack","08012345566","28","SALES","1000",1),
     |     ("02","Tom", "08056586761","19","MANAGEMENT","2500",1),
     |     ("03","Mike","08009097878","25","MARKET","2000",1),
     |     ("04","Tina","07099661234","30","LOGISTICS","3000",0),
     |     ("05","Alex","08019208960","18","MARKET","3500",1),
     |     ("06","Bob","08011223344","22","CLERK","1500",1),
     |     ("07","Dvaid","08022557788","25","CLERK","2500",1),
     |     ("08","Ben","08080201682","35","MARKET","500",1),
     |     ("09","Allen","08099206680","20","MARKET","2500",1),
     |     ("10","Caesar","09011020806","32","SALES","1000",1)).toDF("id","name","cellphone","age","department","expense","gender")
df: org.apache.spark.sql.DataFrame= [id: string, name: string ...5 more fields]

show

scala> df.show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 04|  Tina|07099661234| 30| LOGISTICS|   3000|     0|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+

根据性别,统计人数。

scala> df.groupBy("gender").count.show()
+------+-----+
|gender|count|
+------+-----+
|     1|    9|
|     0|    1|
+------+-----+

按照department分组后求出组中expense的最大值、最小值和总和,平均年龄。

scala> df.groupBy("department").agg(max("expense"), min("expense"),sum("expense"),mean("age")).show
+----------+------------+------------+------------+--------+
|department|max(expense)|min(expense)|sum(expense)|avg(age)|
+----------+------------+------------+------------+--------+
|     CLERK|        2500|        1500|      4000.0|    23.5|
|     SALES|        1000|        1000|      2000.0|    30.0|
|    MARKET|         500|        2000|      8500.0|    24.5|
| LOGISTICS|        3000|        3000|      3000.0|    30.0|
|MANAGEMENT|        2500|        2500|      2500.0|    19.0|
+----------+------------+------------+------------+--------+

或
scala> df.groupBy("department").agg(max("expense"), min("expense"),sum("expense"),avg("age")).show

过滤出cellphone中包含"080"的电话号码

scala> df.filter($"cellphone".contains("080")).show
+---+------+-----------+---+----------+-------+------+
| id|  name|  cellphone|age|department|expense|gender|
+---+------+-----------+---+----------+-------+------+
| 01|  Jack|08012345566| 28|     SALES|   1000|     1|
| 02|   Tom|08056586761| 19|MANAGEMENT|   2500|     1|
| 03|  Mike|08009097878| 25|    MARKET|   2000|     1|
| 05|  Alex|08019208960| 18|    MARKET|   3500|     1|
| 06|   Bob|08011223344| 22|     CLERK|   1500|     1|
| 07| Dvaid|08022557788| 25|     CLERK|   2500|     1|
| 08|   Ben|08080201682| 35|    MARKET|    500|     1|
| 09| Allen|08099206680| 20|    MARKET|   2500|     1|
| 10|Caesar|09011020806| 32|     SALES|   1000|     1|
+---+------+-----------+---+----------+-------+------+

过滤出cellphone中包含"080"的后,按照department分组后求出组中expense的总和,并按department排序。

scala> df.filter($"cellphone".contains("080")).groupBy($"department").agg(sum($"expense")).orderBy($"department").show(false)
+----------+------------+
|department|sum(expense)|
+----------+------------+
|CLERK     |4000.0      |
|MANAGEMENT|2500.0      |
|MARKET    |8500.0      |
|SALES     |2000.0      |
+----------+------------+

  • 1
    点赞
  • 5
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

大数据翻身

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值