Grouped Median Calculation
The raw data is as follows:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# create the SparkSession used below
spark = SparkSession.builder.getOrCreate()

col_names = ["name", "date", "score"]
value = [
    ("Ali", "2020-01-01", 10.0),
    ("Ali", "2020-01-02", 15.0),
    ("Ali", "2020-01-03", 20.0),
    ("Ali", "2020-01-04", 25.0),
    ("Ali", "2020-01-05", 30.0),
    ("Bob", "2020-01-01", 15.0),
    ("Bob", "2020-01-02", 20.0),
    ("Bob", "2020-01-03", 30.0)
]
df = spark.createDataFrame(value, col_names)
To compute the median for each person, you can use the following code:
wind = Window.partitionBy('name')
med = F.expr('percentile_approx(score, 0.5)')
df.groupBy('name').agg(med.alias('med_val')).show()
+----+-------+
|name|med_val|
+----+-------+
| Bob| 20.0|
| Ali| 20.0|
+----+-------+
The expr function can parse Hive SQL functions; see the link at the end of this post for details.
If you want not only each person's median but also to keep the original rows, you can use the following code:
df.withColumn('med_val', med.over(wind)).show()
+----+----------+-----+-------+
|name| date|score|med_val|
+----+----------+-----+-------+
| Bob|2020-01-01| 15.0| 20.0|
| Bob|2020-01-02| 20.0| 20.0|
| Bob|2020-01-03| 30.0| 20.0|
| Ali|2020-01-01| 10.0| 20.0|
| Ali|2020-01-03| 20.0| 20.0|
| Ali|2020-01-04| 25.0| 20.0|
| Ali|2020-01-02| 15.0| 20.0|
| Ali|2020-01-05| 30.0| 20.0|
+----+----------+-----+-------+
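The per-group median of 20.0 shown above can be cross-checked with a small pure-Python sketch. This uses a simple nearest-rank rule, not Spark's actual approximation algorithm, but on data this small the two agree:

```python
import math

def nearest_rank(values, p):
    """Return the value at the nearest 1-based rank ceil(p * n)."""
    ordered = sorted(values)
    # clamp to at least rank 1 so p = 0 still returns the minimum
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

scores = {
    "Ali": [10.0, 15.0, 20.0, 25.0, 30.0],
    "Bob": [15.0, 20.0, 30.0],
}
medians = {name: nearest_rank(vals, 0.5) for name, vals in scores.items()}
print(medians)  # {'Ali': 20.0, 'Bob': 20.0}
```

This matches the med_val column: both groups have a median of 20.0.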
Going a step further, if you want quantiles at several different cut points, you can use the following code:
med = F.expr('percentile_approx(score, array(0.25, 0.5, 0.75))')
df.withColumn('med_val', med.over(wind)).show()
+----+----------+-----+------------------+
|name| date|score| med_val|
+----+----------+-----+------------------+
| Bob|2020-01-01| 15.0|[15.0, 20.0, 30.0]|
| Bob|2020-01-02| 20.0|[15.0, 20.0, 30.0]|
| Bob|2020-01-03| 30.0|[15.0, 20.0, 30.0]|
| Ali|2020-01-02| 15.0|[15.0, 20.0, 25.0]|
| Ali|2020-01-05| 30.0|[15.0, 20.0, 25.0]|
| Ali|2020-01-01| 10.0|[15.0, 20.0, 25.0]|
| Ali|2020-01-03| 20.0|[15.0, 20.0, 25.0]|
| Ali|2020-01-04| 25.0|[15.0, 20.0, 25.0]|
+----+----------+-----+------------------+
Single-Column Quantile Calculation
In some cases you need to analyze a single numeric column on its own; for that you can use the DataFrame's approxQuantile method (the third argument is the allowed relative error, where 0 means an exact computation):
df.approxQuantile("score", [0.2,0.3,0.6,1.0], 0)
The result is:
[15.0, 15.0, 20.0, 30.0]
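A quick pure-Python cross-check with a nearest-rank quantile rule (an illustrative sketch, which here coincides with the exact result since the relative error is 0) reproduces this list:

```python
import math

def nearest_rank(values, p):
    """Return the value at the nearest 1-based rank ceil(p * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]

# all eight scores from both groups combined
scores = [10.0, 15.0, 20.0, 25.0, 30.0, 15.0, 20.0, 30.0]
quantiles = [nearest_rank(scores, p) for p in (0.2, 0.3, 0.6, 1.0)]
print(quantiles)  # [15.0, 15.0, 20.0, 30.0]
```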
Reference links: