PySpark --- usage of agg

Usage of groupBy with agg

In PySpark, only the agg operation truly aggregates a column; apply-style operations run row by row and do not actually aggregate anything.

PySpark already provides many convenient aggregate functions for agg, which can be called directly.

from:
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |     M  |
+---------+-------------+-----+-------+-------+-------+-------+--------+
|        0|10279.9003906|   13|    0.3|    0.5|    2.5|    3.0|540928.0|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|541278.0|
|        0|10748.7001953|   13|    0.3|    0.5|    2.5|    3.0|541243.0|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+

to:
+---------+-------------+-----+-------+-------+-------+-------+--------------------+
|ID       |           P |index|xinf   |xup    |yinf   |ysup   |                   M|
+---------+-------------+-----+-------+-------+-------+-------+--------------------+
|        0|10514.3002929|   13|    0.3|    0.5|    2.5|    3.0|[540928.0, 541243.0]|
|        2|12024.2998047|   13|    0.3|    0.5|    2.5|    3.0|          [541278.0]|
|        1|      10988.5|   13|    0.3|    0.5|    2.5|    3.0|          [540917.0]|
+---------+-------------+-----+-------+-------+-------+-------+--------------------+

# To aggregate by ID, computing the mean of P and collecting the values of M
from pyspark.sql.functions import first, collect_list, mean

In:
df.groupBy("ID").agg(mean("P"), first("index"),
                     first("xinf"), first("xup"), 
                     first("yinf"), first("ysup"), 
                     collect_list("M"))

Let's look at a few more examples.

from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
import os
os.environ['JAVA_HOME'] = r'C:\servies\Java\jdk8'  
os.environ['SPARK_HOME'] = r'D:\software\spark-2.2.0-bin-hadoop2.7'
os.environ['PYTHONPATH'] = r'D:\software\spark-2.2.0-bin-hadoop2.7\python'

spark = SparkSession.builder.appName('test').getOrCreate()

data = [
    ("Carol","Data Scientist","USA",70000,5),
    ("Peter","Data Scientist","USA",90000,7),
    ("Clark","Data Scientist","UK",111000,10),
    ("Jean","Data Scientist","UK",220000,30),
    ("Bruce","Data Engineer","UK",80000,4),
    ("Thanos","Data Engineer","USA",115000,13),
    ("Scott","Data Engineer","UK",180000,15),
    ("T'challa","CEO","USA",300000,20),
    ("Xavier","Marketing","USA",100000,11),
    ("Wade","Marketing","UK",60000,2)
]

column = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=data, schema=column)
# df.printSchema()
df.show()

Single-column aggregation: compute the highest salary for each Job

# Single-column aggregation: compute the highest salary for each Job
df1 = df.groupBy('Job').max('salary')
df1.show()
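
The same aggregation can be written with agg, which also makes it easy to rename the output column (the alias 'max_salary' here is my own choice):

# Equivalent agg form, with an explicit name for the result column
df.groupBy('Job').agg(fn.max('salary').alias('max_salary')).show()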

Group by multiple keys and apply the same aggregation to several columns: compute the highest salary and the highest seniority for each Job and Country

# Group by multiple keys and apply the same aggregation to several columns:
# the highest salary and the highest seniority for each Job and Country
df.groupBy('Job','Country').agg(fn.max('salary'), fn.max('seniority')).show()
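
agg also accepts a dict that maps column names to aggregate function names, which is compact when each column gets exactly one aggregation (the dict form gives no control over the result column names, which come out as max(salary) and so on):

# Dict form: {column: name-of-aggregate-function}
df.groupBy('Job', 'Country').agg({'salary': 'max', 'seniority': 'max'}).show()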

Apply different aggregations to several columns and rename the resulting columns

# Apply different aggregations to several columns and rename the resulting columns
df.groupBy('job').agg(
    fn.sum("salary").alias("sum_salary"),
    fn.avg("salary").alias("avg_salary"),
    fn.min("salary").alias("min_salary"),
    fn.max("salary").alias("max_salary"),
    fn.mean("salary").alias("mean_salary")
).show(truncate=False)      # truncate=False: do not truncate long values (output is left-aligned)
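
Note that fn.mean is just an alias of fn.avg, so avg_salary and mean_salary above come out identical. If you want tidier numbers, an aggregate can be wrapped in fn.round before aliasing it (a small sketch):

# Round the average to two decimals before naming it
df.groupBy('job').agg(fn.round(fn.avg('salary'), 2).alias('avg_salary')).show()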


Calling agg directly on a DataFrame

Spark's agg can run aggregations on a DataFrame directly, without a groupBy. The simple case is when the columns to aggregate can be listed out explicitly.

The complex case is when an expression for every column cannot be written out by hand, and a collection of expressions has to be built instead.

# ========================= simple case =========================
data.show(5)
+--------+-------+--------+--------------------+-----+--------+
|glass_id|step_id|equip_id|             timekey|label| unit_id|
+--------+-------+--------+--------------------+-----+--------+
|Y95PR090|  14200|A2PDC100|20190601094814153863|    1|A2PDC100|
|Y95PR090|  14207|A2VTM100|20190601120431648744|    1|A2VTM100|
|Y95PR090|  1420V|A2PVD100|20190601120428511673|    1|A2PVD100|
|Y95PR090|  14300|A2RSM100|20190601125957981111|    1|A2RSM100|
|Y95PR090|  14315|A2PHT500|20190601150105054455|    1|A2PHT500|
+--------+-------+--------+--------------------+-----+--------+


data.agg(mean('label')).show()
+------------------+
|        avg(label)|
+------------------+
|0.7411402157164869|
+------------------+
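
Several aggregates can also be computed over the whole DataFrame in one call, for example (a sketch that assumes the data DataFrame shown above):

from pyspark.sql.functions import countDistinct, mean

# Distinct glass count and the mean of label, in a single pass
data.agg(countDistinct('glass_id').alias('n_glass'),
         mean('label').alias('label_rate')).show()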


# ============ build the collection of expressions with a loop ===============
tips_.show(2)
+----------+----+----+
|total_bill| tip|size|
+----------+----+----+
|     16.99|1.01| 2.0|
|     10.34|1.66| 3.0|
+----------+----+----+


agglist = [mean(x) for x in tips_.columns]

agglist
Out[109]: [Column<b'avg(total_bill)'>, Column<b'avg(tip)'>, Column<b'avg(size)'>]

tips_.agg(*agglist)
Out[111]: DataFrame[avg(total_bill): double, avg(tip): double, avg(size): double]

tips_.agg(*agglist).show()
+------------------+----------------+-----------------+
|   avg(total_bill)|        avg(tip)|        avg(size)|
+------------------+----------------+-----------------+
|19.785942643392282|2.99827868821191|2.569672131147541|
+------------------+----------------+-----------------+
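
The same loop trick extends to mixed aggregations: pair each column with its own function and a readable alias (a sketch over the same tips_ DataFrame; the naming scheme is my own):

from pyspark.sql.functions import mean, max as sql_max

# The avg and the max of every column, each under a clear name
exprs = ([mean(c).alias('avg_' + c) for c in tips_.columns]
         + [sql_max(c).alias('max_' + c) for c in tips_.columns])
tips_.agg(*exprs).show()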
