Functions in Spark SQL

Almost every internet company doing big-data work installs and uses Spark, and Spark SQL is a real asset for people who are not yet familiar with the Spark API; for anyone who already knows MySQL it is like adding wings to a tiger. With that said, let's look at some rarely used but very useful functions in Spark SQL.

 lit: Creates a [[Column]] of literal value, i.e. builds a column out of a literal. For example, df.select(lit("2020-02-19").as("now")) directly creates a date column named now.

 typedLit: The difference between this function and [[lit]] is that this function can handle parameterized Scala types, e.g. List, Seq, and Map. In other words, you can pass a collection in as a column.
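A minimal sketch of the two, assuming an existing SparkSession and a DataFrame named `df` (both hypothetical here):

```scala
import org.apache.spark.sql.functions.{lit, typedLit}

// lit: wrap a single literal value as a Column
val withDate = df.select(lit("2020-02-19").as("now"))

// typedLit: additionally handles parameterized Scala types
// such as Seq and Map, producing array/map-typed columns
val withSeq = df.select(typedLit(Seq(1, 2, 3)).as("nums"))
val withMap = df.select(typedLit(Map("a" -> 1, "b" -> 2)).as("kv"))
```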

Sort functions:
  two of them: desc and asc
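A quick sketch of both sort directions, again assuming a hypothetical DataFrame `df` with name and score columns:

```scala
import org.apache.spark.sql.functions.{asc, desc}

// Sort ascending by name, then descending by score within equal names
val sorted = df.orderBy(asc("name"), desc("score"))
```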

Aggregate functions: aggregation needs little introduction; it is the most heavily used category, e.g. computing every student's total score, or each department's head count.
  
 approx_count_distinct: Aggregate function: returns the approximate number of distinct items in a group.
SPARK SQL AGGREGATE FUNCTIONS	FUNCTION DESCRIPTION
approx_count_distinct(e: Column)	Returns the approximate count of distinct items in a group.
approx_count_distinct(e: Column, rsd: Double)	Returns the approximate count of distinct items in a group; rsd is the maximum relative standard deviation allowed.
avg(e: Column)	Returns the average of values in the input column.
collect_list(e: Column)	Returns all values from an input column with duplicates.
collect_set(e: Column)	Returns all values from an input column with duplicate values eliminated.
corr(column1: Column, column2: Column)	Returns the Pearson Correlation Coefficient for two columns.
count(e: Column)	Returns number of elements in a column.
countDistinct(expr: Column, exprs: Column*)	Returns number of distinct elements in the columns.
covar_pop(column1: Column, column2: Column)	Returns the population covariance for two columns.
covar_samp(column1: Column, column2: Column)	Returns the sample covariance for two columns.
first(e: Column, ignoreNulls: Boolean)	Returns the first element in a column; when ignoreNulls is set to true, it returns the first non-null element.
first(e: Column): Column	Returns the first element in a column.
grouping(e: Column)	Indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
kurtosis(e: Column)	Returns the kurtosis of the values in a group.
last(e: Column, ignoreNulls: Boolean)	Returns the last element in a column; when ignoreNulls is set to true, it returns the last non-null element.
last(e: Column)	Returns the last element in a column.
max(e: Column)	Returns the maximum value in a column.
mean(e: Column)	Alias for Avg. Returns the average of the values in a column.
min(e: Column)	Returns the minimum value in a column.
skewness(e: Column)	Returns the skewness of the values in a group.
stddev(e: Column)	alias for `stddev_samp`.
stddev_samp(e: Column)	Returns the sample standard deviation of values in a column.
stddev_pop(e: Column)	Returns the population standard deviation of the values in a column.
sum(e: Column)	Returns the sum of all values in a column.
sumDistinct(e: Column)	Returns the sum of all distinct values in a column.
variance(e: Column)	alias for `var_samp`.
var_samp(e: Column)	Returns the unbiased variance of the values in a column.
var_pop(e: Column)	returns the population variance of the values in a column.
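To make the table concrete, here is a small sketch of a few of these aggregates, assuming hypothetical DataFrames `scores(student, score)` and `employees(dept, name)`:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, avg, collect_set, count}

// Average score per student
val avgScores = scores.groupBy("student").agg(avg("score").as("avg_score"))

// Head count per department, plus an approximate distinct count
// and the set of member names
val deptCounts = employees.groupBy("dept").agg(
  count("name").as("headcount"),
  approx_count_distinct("name").as("approx_distinct"),
  collect_set("name").as("members")
)
```

approx_count_distinct trades exactness for speed and memory (it uses HyperLogLog internally), which is why it is the better choice on very large groups.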

Next, let's look at window functions, which are not commonly used but are very useful. What is a window function?

a window function calculates a return value for every input row of a table based on a group of rows, called the Frame. Every input row can have a unique frame associated with it. This characteristic of window functions makes them more powerful than other functions and allows users to concisely express various data processing tasks that are hard (if not impossible) to express without window functions.

Spark Window functions operate on a group of rows (like frame, partition) and return a single value for every input row. Spark SQL supports three kinds of window functions:

WINDOW FUNCTIONS USAGE & SYNTAX	SPARK SQL WINDOW FUNCTIONS DESCRIPTION
row_number(): Column	Returns a sequential number starting from 1 within a window partition.
rank(): Column	Returns the rank of rows within a window partition, with gaps.
percent_rank(): Column	Returns the percentile rank of rows within a window partition.
dense_rank(): Column	Returns the rank of rows within a window partition without any gaps, whereas rank() returns rank with gaps.
ntile(n: Int): Column	Returns the ntile id in a window partition.
cume_dist(): Column	Returns the cumulative distribution of values within a window partition.
lag(e: Column, offset: Int): Column
lag(columnName: String, offset: Int): Column
lag(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows before the current row, and null if there are fewer than `offset` rows before the current row.
lead(e: Column, offset: Int): Column
lead(columnName: String, offset: Int): Column
lead(columnName: String, offset: Int, defaultValue: Any): Column	Returns the value that is `offset` rows after the current row, and null if there are fewer than `offset` rows after the current row.
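A sketch tying several of these together, again over a hypothetical `employees(dept, name, score)` DataFrame: a window partitioned by department and ordered by descending score.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, lag, rank, row_number}

// Rank employees within each department by score, and fetch the
// previous (higher) score for comparison; lag returns null for
// the first row of each partition
val w = Window.partitionBy("dept").orderBy(desc("score"))

val ranked = employees.select(
  col("dept"), col("name"), col("score"),
  row_number().over(w).as("row_num"),
  rank().over(w).as("rank"),
  lag("score", 1).over(w).as("prev_score")
)
```

Note that every window function is applied with `.over(windowSpec)`; the same spec can be reused across several columns, as above.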

 
