Functions in Spark SQL

Almost every internet company doing big-data work installs and uses Spark, and Spark SQL is a real asset for people who are not yet familiar with the Spark API; for anyone who already knows MySQL it is like adding wings to a tiger. With that said, let's look at some rarely used but very useful functions in Spark SQL.

 lit: Creates a [[Column]] of literal value, i.e. builds a column out of a literal. For example, df.select(lit("2020-02-19").as("now")) directly creates a date column named now.

 typedLit: The difference between this function and [[lit]] is that this function can handle parameterized Scala types, e.g. List, Seq, and Map. In other words, you can pass a collection in as a column.
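A minimal sketch of the two, assuming an existing SparkSession and a DataFrame named `df` (both hypothetical here):

```scala
import org.apache.spark.sql.functions.{lit, typedLit}

// lit: wrap a single literal value as a Column
val withDate = df.select(lit("2020-02-19").as("now"))

// typedLit: additionally handles parameterized Scala types
// such as Seq and Map, producing array/map-typed columns
val withSeq = df.select(typedLit(Seq(1, 2, 3)).as("nums"))
val withMap = df.select(typedLit(Map("a" -> 1, "b" -> 2)).as("kv"))
```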

Sort functions:
  two of them: desc and asc
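A quick sketch of both sort directions, again assuming a hypothetical DataFrame `df` with name and score columns:

```scala
import org.apache.spark.sql.functions.{asc, desc}

// Sort ascending by name, then descending by score within equal names
val sorted = df.orderBy(asc("name"), desc("score"))
```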

Aggregate functions: aggregation needs little introduction; it is the most heavily used category, e.g. computing every student's total score, or each department's head count.
  
 approx_count_distinct: Aggregate function: returns the approximate number of distinct items in a group.
SPARK SQL AGGREGATE FUNCTIONS	FUNCTION DESCRIPTION
approx_count_distinct(e: Column)	Returns the approximate count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)	Returns the approximate count of distinct items in a group; rsd is the maximum relative standard deviation allowed.
avg(e: Column)	Returns the average of values in the input column.
collect_list(e: Column)	Returns all values from an input column with duplicates.
collect_set(e: Column)	Returns all values from an input column with duplicate values eliminated.
corr(column1: Column, column2: Column)	Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)	Returns number of elements in a column.
countDistinct(expr: Column, exprs: Column*)	Returns number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)	Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)	Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)	Returns the first element in a column; when ignoreNulls is set to true, it returns the first non-null element.
first(e: Column): Column	Returns the first element in a column.
grouping(e: Column)	Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)	Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)	Returns the last element in a column; when ignoreNulls is set to true, it returns the last non-null element.
last(e: Column)	Returns the last element in a column.
max(e: Column)	Returns the maximum value in a column.
mean(e: Column)	Alias for Avg. Returns the average of the values in a column.
min(e: Column)	Returns the minimum value in a column.
skewness(e: Column)	Returns the skewness of the values in a group.
stddev(e: Column)	alias for `stddev_samp`.
stddev_samp(e: Column)	Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)	Returns the population standard deviation of the values in a column.
sum(e: Column)	Returns the sum of all values in a column.
sumDistinct(e: Column)	Returns the sum of all distinct values in a column.
variance(e: Column)	alias for `var_samp`.
var_samp(e: Column)	Returns the unbiased variance of the values in a column.
var_pop(e: Column)	returns the population variance of the values in a column.
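To make the table concrete, here is a small sketch of a few of these aggregates, assuming hypothetical DataFrames `scores(student, score)` and `employees(dept, name)`:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, avg, collect_set, count}

// Average score per student
val avgScores = scores.groupBy("student").agg(avg("score").as("avg_score"))

// Head count per department, plus an approximate distinct count
// and the set of member names
val deptCounts = employees.groupBy("dept").agg(
  count("name").as("headcount"),
  approx_count_distinct("name").as("approx_distinct"),
  collect_set("name").as("members")
)
```

approx_count_distinct trades exactness for speed and memory (it uses HyperLogLog internally), which is why it is the better choice on very large groups.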

Next, let's look at window functions, which are not commonly used but are very useful. What is a window function?

a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to concisely express various data processing tasks that are hard (if not impossible) to express without window functions.

Spark Window functions operate on a group of rows (like frame, partition) and return a single value for every input row. Spark SQL supports three kinds of window functions:

WINDOW FUNCTIONS USAGE & SYNTAX	SPARK SQL WINDOW FUNCTIONS DESCRIPTION
row_number(): Column	Returns a sequential number starting from 1 within a window partition.
rank(): Column	Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column	Returns the percentile rank of rows within a window partition.
dense_rank(): Column	Returns the rank of rows within a window partition without any gaps, whereas rank() returns rank with gaps.
ntile(n: Int): Column	Returns the ntile id in a window partition.
cume_dist(): Column	Returns the cumulative distribution of values within a window partition.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows before the current row, and null if there are fewer than `offset` rows before the current row.
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows after the current row, and null if there are fewer than `offset` rows after the current row.
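A sketch tying several of these together, again over a hypothetical `employees(dept, name, score)` DataFrame: a window partitioned by department and ordered by descending score.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, lag, rank, row_number}

// Rank employees within each department by score, and fetch the
// previous (higher) score for comparison; lag returns null for
// the first row of each partition
val w = Window.partitionBy("dept").orderBy(desc("score"))

val ranked = employees.select(
  col("dept"), col("name"), col("score"),
  row_number().over(w).as("row_num"),
  rank().over(w).as("rank"),
  lag("score", 1).over(w).as("prev_score")
)
```

Note that every window function is applied with `.over(windowSpec)`; the same spec can be reused across several columns, as above.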

 
