Here is an example implementation using the DataFrame API in Python (Spark 1.6).
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import FloatType
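Suppose we have monthly salaries for customers in a Spark DataFrame named "salaries", with columns such as:
month | customer_id | salary
and we would like to find the median salary per customer across all months.
Step 1: Write a user-defined function to calculate the median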
def find_median(values_list):
    try:
        median = np.median(values_list)  # get the median of the values in each row's list
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

median_finder = F.udf(find_median, FloatType())
Step 2: Aggregate the salary column by collecting each customer's salaries into a list, one row per customer:
salaries_list = salaries.groupBy("customer_id").agg(F.collect_list("salary").alias("salaries"))
Step 3: Call the median_finder UDF on the salaries column and add the median values as a new column
salaries_list = salaries_list.withColumn("median",median_finder("salaries"))
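
To sanity-check what the three steps should produce, the same group, collect, and median logic can be sketched in plain Python without a Spark context. This is only an illustration: the sample rows and the helper median_by_customer are made up for this example, and statistics.median from the standard library stands in for np.median, so behaviour on edge cases (for instance an empty list) may differ from the NumPy-based UDF above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical sample rows: (month, customer_id, salary)
rows = [
    (1, "c1", 1000.0),
    (2, "c1", 3000.0),
    (3, "c1", 2000.0),
    (1, "c2", 1500.0),
    (2, "c2", 2600.0),
]

def find_median(values_list):
    """Stand-in for the UDF body, using statistics.median instead of np.median."""
    try:
        return round(float(median(values_list)), 2)
    except Exception:
        return None  # anything wrong with the given values

def median_by_customer(rows):
    # Analogue of groupBy("customer_id").agg(F.collect_list("salary"))
    salaries = defaultdict(list)
    for _month, customer_id, salary in rows:
        salaries[customer_id].append(salary)
    # Analogue of withColumn("median", median_finder("salaries"))
    return {cid: find_median(vals) for cid, vals in salaries.items()}

medians = median_by_customer(rows)
print(medians)
```

With these sample rows, each customer ends up with a single median value, which is the shape the "median" column in salaries_list should have: one row per customer_id.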