SparkSQL Built-in Functions: Aggregate Functions, Collection Functions, Date Operations, Math, String Operations...

SparkSQL ships with a rich set of built-in functions that can be used both through the DataFrame API and through the SQL interface.

The built-in functions fall into several categories: aggregate operations, collection operations, date/time, math, string, window operations, and others.

from pyspark.sql import SparkSession

import pyspark.sql.functions as F

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Create the customers DataFrame

customers = [(1, 'James', 21, 'M', 100), (2, "Liz", 25, "F", 150), (3, "John", 31, "M", 120),
             (4, "Jennifer", 45, "F", 300), (5, "Robert", 41, "M", 120), (6, "Sandra", 45, "F", 200)]

df_customers = spark.createDataFrame(customers, ["cID", "name", "age", "gender", "income"])  # list -> DF

## Aggregate operations

- approx_count_distinct / countDistinct: number of distinct values in one or more columns

- avg: average; count: row count; first: first element; last: last element; max / min, mean, sum, sumDistinct

- var_pop: population variance; var_samp / variance: sample variance; stddev_pop, stddev_samp: standard deviation

- covar_pop, covar_samp: covariance; corr: correlation coefficient

- kurtosis: kurtosis; skewness: skewness


df_customers.agg(F.approx_count_distinct(df_customers.age).alias('distinct_ages')).collect()  # [Row(distinct_ages=5)]

df_customers.agg(F.countDistinct("age", "name").alias('distinct_ages')).collect() # [Row(distinct_ages=6)]

df_customers.select(F.avg('age')).collect()

df_customers.select(F.mean('age')).collect() # [Row(avg(age)=34.666666666666664)]

df_customers.select(F.sum('age')).collect() # [Row(sum(age)=208)]

df_customers.select(F.max('age')).collect() # [Row(max(age)=45)]

df_customers.select(F.min('age').alias('min_age')).collect() # [Row(min_age=21)]

df_customers.groupby('gender').agg(F.first('age').alias('first_age')).collect() # [Row(gender='F', first_age=25), Row(gender='M', first_age=21)]

df_customers.select(F.last('age').alias('last_age')).collect() # [Row(last_age=45)]

df_customers.select(F.sumDistinct('age').alias('sumdis_age')).collect() # [Row(sumdis_age=163)]

df_customers.select(F.var_samp('age').alias("variance")).collect() # [Row(variance=109.46666666666665)]

df_customers.select(F.covar_samp("age", "income")).collect() # [Row(covar_samp(age, income)=534.0000000000001)]
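
The remaining statistics from the list above (standard deviation, correlation, kurtosis, skewness) follow the same pattern; a minimal sketch on the same df_customers, with the result rows omitted here:

df_customers.select(F.stddev_samp('age').alias('stddev')).collect()  # sample standard deviation, the square root of var_samp

df_customers.select(F.corr('age', 'income').alias('corr')).collect()  # Pearson correlation coefficient

df_customers.select(F.kurtosis('age').alias('kurt'), F.skewness('age').alias('skew')).collect()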

## Collection operations

- array_contains: whether the array contains a given element

- array_distinct: remove duplicate elements

- array_except: elements of the array in the first column that are not in the array in the second column

- array_intersect(col1, col2): intersection

- array_max, array_min; array_position(col, value): position of the first occurrence of value, 1-based

- array_remove(col, element): remove all occurrences of the element

- array_repeat(col, count): build an array that repeats the value count times

- array_sort(col): sort the array

- array_union(col1, col2): union

- arrays_overlap: whether two arrays share any element

- arrays_zip: merge arrays element-wise, like Python's zip

- reverse: reverse order; shuffle: random shuffle; size, slice, sort_array

df = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])

df.show()

+---------+
|     data|
+---------+
|[a, b, c]|
|       []|
+---------+

df.select(F.array_contains(df.data, "a")).collect()  ## each element of df.data must be an array

# [Row(array_contains(data, a)=True), Row(array_contains(data, a)=False)]

df = spark.createDataFrame([([1, 2, 3, 2],), ([4, 5, 5, 4],)], ['data'])

df.select(F.array_distinct(df.data)).collect()

# [Row(array_distinct(data)=[1, 2, 3]), Row(array_distinct(data)=[4, 5])]

from pyspark.sql import Row

df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])

df.select(F.array_except(df.c1, df.c2)).collect()

# [Row(array_except(c1, c2)=['b'])]

df = spark.createDataFrame([(["c", "b", "a"],), ([],)], ['data'])

df.select(F.array_position(df.data, "a")).collect()

#[Row(array_position(data, a)=3), Row(array_position(data, a)=0)]

df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])

df.select(F.array_union(df.c1, df.c2)).collect()

# [Row(array_union(c1, c2)=['b', 'a', 'c', 'd', 'f'])]

## flatten: flattens nested arrays into a single one-dimensional array

df = spark.createDataFrame([([[1, 2, 3], [4, 5], [6]],), ([None, [4, 5]],)], ['data'])

df.select(F.flatten(df.data).alias('r')).collect()

# [Row(r=[1, 2, 3, 4, 5, 6]), Row(r=None)]
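
A few of the other array functions from the list, sketched on the same kind of two-column frame used above:

df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2=["c", "d", "a", "f"])])

df.select(F.array_intersect(df.c1, df.c2)).collect()  # [Row(array_intersect(c1, c2)=['a', 'c'])]

df.select(F.array_sort(df.c2).alias('sorted')).collect()  # [Row(sorted=['a', 'c', 'd', 'f'])]

df.select(F.size(df.c1).alias('n')).collect()  # [Row(n=3)]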

## Date operations

### Conversions

# Convert a Unix timestamp to a timestamp string

time_df = spark.createDataFrame([(1576425600,)], ['unix_time'])

time_df.select(F.from_unixtime('unix_time').alias('ts')).collect()

# [Row(ts='2019-12-16 00:00:00')]

# Convert a timestamp string to a Unix timestamp

time_df = spark.createDataFrame([('2019-12-16',)], ['dt'])

time_df.select(F.unix_timestamp('dt', 'yyyy-MM-dd').alias('unix_time')).collect()

# [Row(unix_time=1576425600)]

## to_date: converts a Column of pyspark.sql.types.StringType or pyspark.sql.types.TimestampType into pyspark.sql.types.DateType

time_df.select(F.to_date(time_df.dt).alias('date')).collect() # [Row(date=datetime.date(2019, 12, 16))]

time_df.select(F.to_timestamp(time_df.dt).alias('date')).collect() # [Row(date=datetime.datetime(2019, 12, 16, 0, 0))]

time_df.select(F.to_utc_timestamp(time_df.dt,"PST").alias('utc_time')).collect() # [Row(utc_time=datetime.datetime(2019, 12, 16, 8, 0))]

# date_format(date, format)

time_df.select(F.date_format('dt', 'MM/dd/yyy').alias('date')).collect() # [Row(date='12/16/2019')]

### Extraction

# year, month, week,

time_df = spark.createDataFrame([('2019-12-16 13:08:15',)], ['dt'])

time_df.select(F.year(time_df.dt).alias('year')).collect() # [Row(year=2019)]

# Week of the year: weekofyear(col); also dayofyear(col), dayofmonth(col)

time_df.select(F.weekofyear(time_df.dt).alias('week')).collect() # [Row(week=51)]

time_df.select(F.dayofyear(time_df.dt).alias('day')).collect() # [Row(day=350)]

time_df.select(F.dayofmonth(time_df.dt).alias('day')).collect() # [Row(day=16)]

time_df.select(F.month(time_df.dt).alias('month')).collect() # [Row(month=12)]

time_df.select(F.hour(time_df.dt).alias('hour')).collect() # [Row(hour=13)]

time_df.select(F.minute(time_df.dt).alias('minute')).collect() # [Row(minute=8)]

time_df.select(F.second(time_df.dt).alias('second')).collect() # [Row(second=15)]

### Date arithmetic

# The next day

time_df.select(F.date_add(time_df.dt, 1).alias('next_date')).collect() # [Row(next_date=datetime.date(2019, 12, 17))]

# The previous day

time_df.select(F.date_sub(time_df.dt, 1).alias('prev_date')).collect() # [Row(prev_date=datetime.date(2019, 12, 15))]

# Truncate a date

df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['t'])

df.select(F.date_trunc('year', df.t).alias('year')).collect() # [Row(year=datetime.datetime(1997, 1, 1, 0, 0))]

df.select(F.date_trunc('month', df.t).alias('month')).collect() # [Row(month=datetime.datetime(1997, 2, 1, 0, 0))]

# Date difference in days

df = spark.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])

df.select(F.datediff(df.d2, df.d1).alias('diff')).collect() # [Row(diff=32)]

## last_day, next_day, months_between

df = spark.createDataFrame([('1997-02-28 05:02:11',)], ['dt'])

df.select(F.add_months(df.dt, 1).alias('next_month')).collect() # [Row(next_month=datetime.date(1997, 3, 31))]

df.select(F.last_day(df.dt).alias('last_day')).collect() # [Row(last_day=datetime.date(1997, 2, 28))]

df.select(F.next_day(df.dt, 'Sun').alias('next_day')).collect() # [Row(next_day=datetime.date(1997, 3, 2))]

df = spark.createDataFrame([('1997-02-28 10:30:00', '1996-10-30')], ['date1', 'date2'])

df.select(F.months_between(df.date1, df.date2).alias('months')).collect() # [Row(months=3.94959677)]

## Math

Functions for columns of numeric types.

'''

abs, ceil, floor

cos, acos, sin, asin, tan, atan, atan2

exp, factorial, hypot (square root of the sum of squares), log, log10, log2, pow, round, shiftLeft, sqrt, log1p(col): log(x + 1)

hex (hexadecimal)

'''

spark.createDataFrame([(21,)], ['a']).select(F.shiftLeft('a', 3).alias('r')).collect()  # [Row(r=168)], i.e. 21 * 2^3
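
The other math functions work the same way, numeric columns in, numeric columns out; a small sketch on a throwaway frame (the column names e, x, y are only illustrative):

df = spark.createDataFrame([(2.718281828, 3.0, 4.0)], ['e', 'x', 'y'])

df.select(F.round('e', 2).alias('r'), F.floor('e').alias('f')).collect()  # [Row(r=2.72, f=2)]

df.select(F.hypot('x', 'y').alias('h'), F.pow('x', 'y').alias('p')).collect()  # [Row(h=5.0, p=81.0)]

df.select(F.log('e').alias('ln')).collect()  # natural logarithm, approximately 1.0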

## Strings

String splitting, case conversion, trimming leading/trailing whitespace, regular-expression matching.

'''

ltrim, rtrim, trim

ascii, base64, unbase64, decode, encode, format_string

repeat, reverse

instr(str, substr): position of the first occurrence; locate(substr, str, pos=1)

concatenation: concat, concat_ws

splitting and substrings: split, substring, substring_index

isnan, isnull, nanvl(col1, col2): returns col1 if col1 is not NaN, otherwise col2

lower, upper

'''
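
A minimal sketch of a few of these (the sample strings are made up for illustration):

df = spark.createDataFrame([('  Hello', 'World  ')], ['s1', 's2'])

df.select(F.concat_ws('-', F.trim('s1'), F.trim('s2')).alias('s')).collect()  # [Row(s='Hello-World')]

df = spark.createDataFrame([('ab,cd,ef',)], ['s'])

df.select(F.split(df.s, ',').alias('parts')).collect()  # [Row(parts=['ab', 'cd', 'ef'])]

df.select(F.substring(df.s, 1, 2).alias('sub')).collect()  # [Row(sub='ab')], positions start at 1

df.select(F.instr(df.s, 'cd').alias('pos')).collect()  # [Row(pos=4)]

df.select(F.upper(df.s).alias('u')).collect()  # [Row(u='AB,CD,EF')]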

## Others

lit(col): creates a Column of literal value.

rand, randn: generate random numbers; sequence(start, stop, step=None): generate an array of integers; shuffle: randomly shuffle an array
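
A quick sketch of these (rand/randn values change from run to run, so no fixed output is shown):

df = spark.createDataFrame([(1,)], ['x'])

df.select(F.lit(5).alias('five'), F.rand(seed=42).alias('u'), F.randn(seed=42).alias('n')).collect()

df.select(F.sequence(F.lit(1), F.lit(5)).alias('seq')).collect()  # [Row(seq=[1, 2, 3, 4, 5])]

df.select(F.shuffle(F.sequence(F.lit(1), F.lit(5))).alias('shuffled')).collect()  # random permutation of [1, 2, 3, 4, 5]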
