groupBy and agg usage
In PySpark, the agg operation is what actually aggregates a column; apply-style operations run row by row and do not perform a real aggregation.
PySpark already provides many convenient aggregate functions that can be passed to agg directly.
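For reference, a minimal, self-contained sketch of the aggregate functions that are used most often with agg; the demo DataFrame, its column names and the app name below are invented purely for illustration:
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
spark = SparkSession.builder.appName('agg_demo').getOrCreate()
# toy data just to show the built-in aggregate functions
demo = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "val"])
demo.groupBy("key").agg(
    fn.count("val"),         # number of rows per group
    fn.sum("val"),           # sum per group
    fn.avg("val"),           # mean per group (fn.mean is an alias)
    fn.min("val"),
    fn.max("val"),
    fn.first("val"),         # first value seen in each group
    fn.collect_list("val"),  # all values of the group, as an array
    fn.collect_set("val")    # distinct values of the group, as an array
).show()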
For example, starting from:
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID | P |index|xinf |xup |yinf |ysup | M |
+---------+-------------+-----+-------+-------+-------+-------+--------+
| 0|10279.9003906| 13| 0.3| 0.5| 2.5| 3.0|540928.0|
| 2|12024.2998047| 13| 0.3| 0.5| 2.5| 3.0|541278.0|
| 0|10748.7001953| 13| 0.3| 0.5| 2.5| 3.0|541243.0|
| 1| 10988.5| 13| 0.3| 0.5| 2.5| 3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+
to:
+---------+-------------+-----+-------+-------+-------+-------+--------+
|ID | P |index|xinf |xup |yinf |ysup | M |
+---------+-------------+-----+-------+-------+-------+-------+--------+
| 0|10514.3002929| 13| 0.3| 0.5| 2.5| 3.0|540928.0,541243.0|
| 2|12024.2998047| 13| 0.3| 0.5| 2.5| 3.0|541278.0|
| 1| 10988.5| 13| 0.3| 0.5| 2.5| 3.0|540917.0|
+---------+-------------+-----+-------+-------+-------+-------+--------+
# To aggregate by ID, compute the mean of P, and collect the M values:
from pyspark.sql.functions import first, collect_list, mean
In:
df.groupBy("ID").agg(mean("P"), first("index"),
                     first("xinf"), first("xup"),
                     first("yinf"), first("ysup"),
                     collect_list("M"))
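If the result columns should keep their original names, as in the "to" table above, an alias can be chained onto each expression. Note that collect_list actually returns an array column, so the comma-joined M shown above is a simplification (fn.concat_ws(",", ...) would produce such a string). A sketch over the same df:
df.groupBy("ID").agg(
    mean("P").alias("P"),
    first("index").alias("index"),
    first("xinf").alias("xinf"),
    first("xup").alias("xup"),
    first("yinf").alias("yinf"),
    first("ysup").alias("ysup"),
    collect_list("M").alias("M")   # array column, e.g. [540928.0, 541243.0]
).show()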
A few more examples follow.
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
import os
os.environ['JAVA_HOME'] = r'C:\servies\Java\jdk8'
os.environ['SPARK_HOME'] = r'D:\software\spark-2.2.0-bin-hadoop2.7'
os.environ['PYTHONPATH'] = r'D:\software\spark-2.2.0-bin-hadoop2.7\python'
spark = SparkSession.builder.appName('test').getOrCreate()
data = [
    ("Carol", "Data Scientist", "USA", 70000, 5),
    ("Peter", "Data Scientist", "USA", 90000, 7),
    ("Clark", "Data Scientist", "UK", 111000, 10),
    ("Jean", "Data Scientist", "UK", 220000, 30),
    ("Bruce", "Data Engineer", "UK", 80000, 4),
    ("Thanos", "Data Engineer", "USA", 115000, 13),
    ("Scott", "Data Engineer", "UK", 180000, 15),
    ("T'challa", "CEO", "USA", 300000, 20),
    ("Xavier", "Marketing", "USA", 100000, 11),
    ("Wade", "Marketing", "UK", 60000, 2)
]
column = ["Name","Job","Country","salary","seniority"]
df = spark.createDataFrame(data=data, schema=column)
# df.printSchema()
df.show()
Single-column aggregation: the highest salary for each Job
# Single-column aggregation: the highest salary for each Job
df1 = df.groupBy('Job').max('salary')
df1.show()
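The shorthand .max('salary') names the result column max(salary) and offers no way to rename it; the equivalent agg form (the alias max_salary is my own choice) gives control over the output name:
df.groupBy('Job').agg(fn.max('salary').alias('max_salary')).show()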
Group by multiple keys and apply the same aggregation to several columns: the highest salary and the highest seniority for each Job and Country
# Group by multiple keys and apply the same aggregation to several columns: the highest salary and the highest seniority for each Job and Country
df.groupBy('Job','Country').agg(fn.max('salary'), fn.max('seniority')).show()
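The same aggregation can also be written with a dict of {column: aggregate function name}; the result column names are then generated automatically (e.g. max(salary)):
df.groupBy('Job', 'Country').agg({'salary': 'max', 'seniority': 'max'}).show()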
Apply different aggregations to different columns and rename the resulting columns
# Apply different aggregations to different columns and rename the resulting columns
df.groupBy('job').agg(
    fn.sum("salary").alias("sum_salary"),
    fn.avg("salary").alias("avg_salary"),
    fn.min("salary").alias("min_salary"),
    fn.max("salary").alias("max_salary"),
    fn.mean("salary").alias("mean_salary")   # fn.mean is an alias of fn.avg
).show(truncate=False)  # truncate=False: show full cell contents (output is left-aligned)
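The result of agg is itself an ordinary DataFrame, so it can be post-processed further; a sketch (the rounding, the count and the ordering are my own additions, not part of the original example):
df.groupBy('Job').agg(
    fn.round(fn.avg('salary'), 2).alias('avg_salary'),   # round the mean to 2 decimals
    fn.count('*').alias('n_employees')                   # group size
).orderBy(fn.desc('avg_salary')).show(truncate=False)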
Computing aggregates directly with DataFrame agg
Spark's agg can also be applied directly to a whole DataFrame. In the simple case, the aggregate expression for every column can be written out by hand; in the more complex case, the columns cannot be enumerated one by one and a collection of expressions has to be built programmatically.
# ========================= Simple case =========================
data.show(5)
+--------+-------+--------+--------------------+-----+--------+
|glass_id|step_id|equip_id| timekey|label| unit_id|
+--------+-------+--------+--------------------+-----+--------+
|Y95PR090| 14200|A2PDC100|20190601094814153863| 1|A2PDC100|
|Y95PR090| 14207|A2VTM100|20190601120431648744| 1|A2VTM100|
|Y95PR090| 1420V|A2PVD100|20190601120428511673| 1|A2PVD100|
|Y95PR090| 14300|A2RSM100|20190601125957981111| 1|A2RSM100|
|Y95PR090| 14315|A2PHT500|20190601150105054455| 1|A2PHT500|
+--------+-------+--------+--------------------+-----+--------+
data.agg(mean('label')).show()
+------------------+
| avg(label)|
+------------------+
|0.7411402157164869|
+------------------+
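Several aggregate expressions can also be passed to a single DataFrame-level agg call; a sketch, assuming the same data DataFrame as above (the aliases are my own):
from pyspark.sql import functions as fn
data.agg(
    fn.mean('label').alias('label_mean'),
    fn.min('label').alias('label_min'),
    fn.max('label').alias('label_max'),
    fn.count('label').alias('n_rows')
).show()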
# ============ Build the collection of expressions with a loop ===============
tips_.show(2)
+----------+----+----+
|total_bill| tip|size|
+----------+----+----+
| 16.99|1.01| 2.0|
| 10.34|1.66| 3.0|
+----------+----+----+
agglist = [mean(x) for x in tips_.columns]
agglist
Out[109]: [Column<b'avg(total_bill)'>, Column<b'avg(tip)'>, Column<b'avg(size)'>]
tips_.agg(*agglist)
Out[111]: DataFrame[avg(total_bill): double, avg(tip): double, avg(size): double]
tips_.agg(*agglist).show()
+------------------+----------------+-----------------+
| avg(total_bill)| avg(tip)| avg(size)|
+------------------+----------------+-----------------+
|19.785942643392282|2.99827868821191|2.569672131147541|
+------------------+----------------+-----------------+
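The loop idea extends naturally to applying several aggregate functions to every column; a sketch, assuming the same tips_ DataFrame (the particular list of functions is my own choice):
from pyspark.sql import functions as fn
agg_funcs = [fn.mean, fn.min, fn.max]
agglist = [f(c) for c in tips_.columns for f in agg_funcs]
tips_.agg(*agglist).show()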