SQL Built-in Functions

没有合适的昵称

Article link: https://blog.csdn.net/weixin_42411818/article/details/98942225

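Don't ask where I got all this; it took a lot of effort to put together, so just use it as a dictionary. For the names of all built-in functions, see org.apache.spark.sql.catalyst.analysis.FunctionRegistry.
Note: there are a great many built-in functions. Memorize the common ones and keep a rough impression of the rest; bookmark this article and come back to look things up when you need them.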

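Aggregate functions

Prepare some data:

import org.apache.spark.sql.SparkSession

// Assumed row type behind the testData2 view: two integer columns, a and b
case class TestData2(a: Int, b: Int)

val spark = SparkSession
  .builder()
  .appName("AggregateFunctionTest")
  .getOrCreate()

import spark.implicits._

val df = spark.sparkContext.parallelize(
  TestData2(1, 1) ::
    TestData2(1, 2) ::
    TestData2(2, 1) ::
    TestData2(2, 2) ::
    TestData2(3, 1) ::
    TestData2(3, 2) :: Nil, 2).toDF()
df.createOrReplaceTempView("testData2")
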
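1.approx_count_distinct
Returns an estimate of count distinct; useful as an approximation when the data volume is very large. The optional second argument is the maximum relative standard deviation allowed for the estimate.

spark.sql("select approx_count_distinct(a) from testData2").show() // output: 3
spark.sql("select approx_count_distinct(a, 0.04) from testData2").show() // output: 3

2.avg
Returns the average value

spark.sql("select avg(a) from testData2").show() // output: 2.0

3.corr(expr, expr)
Returns the Pearson correlation between the two exprs. For background on Pearson correlation, see: http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

spark.sql("select corr(a, b) from testData2").show() // output: 9.06493303673679E-17 (effectively 0)

4.count(expr)
Returns the number of rows for which expr is not null

spark.sql("select count(a) from testData2").show() // output: 6 (the view has six rows and a is never null)

5.covar_pop
Usage: covar_pop(expr1, expr2) - returns the population covariance between the two expressions

spark.sql("select covar_pop(a, b) from testData2").show() // output: 3.700743415417188E-17 (effectively 0; the digits are floating-point noise)

6.covar_samp
Usage: covar_samp(expr1, expr2) - returns the sample covariance between the two expressions

spark.sql("select covar_samp(a, b) from testData2").show() // output: 4.440892098500626E-17 (effectively 0; the digits are floating-point noise)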
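
To make the difference between the two covariances concrete, here is a minimal plain-Scala sketch (no Spark needed). The object and method names are made up for illustration and are not Spark APIs; it only assumes the textbook definitions, where the population version divides by n and the sample version divides by n - 1.

```scala
object CovarianceSketch {
  // Same (a, b) pairs as the testData2 view above.
  val testData2: Seq[(Double, Double)] =
    Seq((1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2))
      .map { case (a, b) => (a.toDouble, b.toDouble) }

  // Sum of the products of the deviations from the two means.
  def coMoment(xs: Seq[(Double, Double)]): Double = {
    val meanA = xs.map(_._1).sum / xs.size
    val meanB = xs.map(_._2).sum / xs.size
    xs.map { case (a, b) => (a - meanA) * (b - meanB) }.sum
  }

  // Population covariance: divide by n.
  def covarPop(xs: Seq[(Double, Double)]): Double = coMoment(xs) / xs.size

  // Sample covariance: divide by n - 1.
  def covarSamp(xs: Seq[(Double, Double)]): Double = coMoment(xs) / (xs.size - 1)
}
```

For testData2 both functions come out at zero, which is why Spark prints values around 1E-17 there. On a correlated data set such as Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) the two results differ (4/3 versus 2.0).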

7.first
Usage: first(expr[, isIgnoreNull]) - returns the first value of expr; if isIgnoreNull is true, only non-null values are considered
first_value is the same as first

spark.sql("select first(a) from testData2").show() // output: 1

8.kurtosis
Usage: kurtosis(expr) - returns the kurtosis of expr

spark.sql("select kurtosis(a) from testData2").show() // output: -1.5

9.last
Usage: last(expr[, isIgnoreNull]) - returns the last value of expr; if isIgnoreNull is true, only non-null values are considered
last_value is the same as last

spark.sql("select last(a) from testData2").show() // output: 3

10.max
Usage: max(expr) - returns the maximum value of expr

spark.sql("select max(a) from testData2").show() // output: 3

11.mean
Usage: mean(expr) - returns the average value of expr; same as avg

spark.sql("select mean(a) from testData2").show() // output: 2.0

12.min
Usage: min(expr) - returns the minimum value of expr

spark.sql("select min(a) from testData2").show() // output: 1

13.percentile
Usage: percentile(expr, percentage) - returns the exact value of expr at the given percentile; percentage is between 0 and 1 and may also be an array

spark.sql("select percentile(a, 0.5) from testData2").show() // output: 2.0
spark.sql("select percentile(a, array(0.1, 0.2, 0.3, 0.5)) from testData2").show() // output: [1.0, 1.0, 1.5, 2.0]
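
The 1.5 in the array result comes from interpolating between two neighbouring values. Below is a small plain-Scala sketch that reproduces the numbers above, assuming linear interpolation at position p * (n - 1) in the sorted data. The names are illustrative only, and this is not Spark's actual implementation.

```scala
object PercentileSketch {
  // Exact percentile with linear interpolation between the two closest ranks.
  def percentile(values: Seq[Double], p: Double): Double = {
    val sorted = values.sorted
    val pos = p * (sorted.size - 1)   // fractional index into the sorted data
    val lower = math.floor(pos).toInt
    val upper = math.ceil(pos).toInt
    if (lower == upper) sorted(lower)
    else sorted(lower) + (pos - lower) * (sorted(upper) - sorted(lower))
  }
}
```

With the column a = 1, 1, 2, 2, 3, 3, the 0.3 percentile sits at position 1.5, halfway between the second value (1) and the third value (2), hence 1.5.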

14.percentile_approx
Usage: percentile_approx(expr, percentage[, accuracy]) - returns the approximate value of expr at the given percentile

spark.sql("select percentile_approx(a, 0.5, 100) from testData2").show() // output: 2.0
spark.sql("select percentile_approx(a, array(0.1, 0.2, 0.3, 0.5), 100) from testData2").show() // output: [1.0, 1.0, 1.0, 2.0]

15.skewness
Usage: skewness(expr) - returns the skewness of expr

spark.sql("select skewness(a) from testData2").show() // output: -5.09902483316444... (show() truncates the value; it is effectively 0 because a is symmetric around its mean)

16.std
Usage: std(expr) - returns the sample standard deviation of expr
stddev and stddev_samp are the same as std

spark.sql("select std(a) from testData2").show() // output: 0.8944271909999159

17.stddev_pop
Usage: stddev_pop(expr) - returns the population standard deviation of expr

spark.sql("select stddev_pop(a) from testData2").show() // output: 0.816496580927726

18.sum
Usage: sum(expr) - returns the sum of expr

spark.sql("select sum(a) from testData2").show() // output: 12

19.variance
Usage: variance(expr) - returns the sample variance of expr; var_samp is the same as variance

spark.sql("select variance(a) from testData2").show() // output: 0.8

20.collect_list
Usage: collect_list(expr) - returns a list of all values of expr, duplicates included

spark.sql("select collect_list(a) from testData2").show() // output: [1, 1, 2, 2, 3, 3]

21.collect_set
Usage: collect_set(expr) - returns the set of distinct values of expr

spark.sql("select collect_set(a) from testData2").show() // output: [1, 2, 3]
Bitwise functions

22.&
Bitwise AND

spark.sql("select 3 & 5").show() // output: 1

23.~
Bitwise NOT

spark.sql("select ~(5)").show() // output: -6

24.|
Bitwise OR

spark.sql("select 3 | 5").show() // output: 7

25.^
Bitwise XOR

spark.sql("select 3 ^ 5").show() // output: 6

cast (type conversion)

26.cast
Type conversion

spark.sql("select cast('10' as int)").show() // output: 10, with type int

Collection functions

27.array
Usage: array(expr, …) - returns an array made of the elements expr, …

spark.sql("SELECT array(1,2,3)").show(false) // output: [1, 2, 3]

28.array_contains
Usage: array_contains(array, value) - returns true if array contains value

spark.sql("SELECT array_contains(array(1, 2, 3), 2)").show(false) // output: true

29.map
Usage: map(key0, value0, key1, value1, …) - creates a map from the given keys and values

spark.sql("SELECT map(1.0, '2', 3.0, '4')").show(false) // output: Map(1.0 -> 2, 3.0 -> 4)

30.named_struct
Usage: named_struct(name1, val1, name2, val2, …) - creates a struct with the given field names and values

spark.sql("SELECT named_struct(\"a\", 1, \"b\", 2, \"c\", 3)").show(false) // output: [1, 2, 3]

31.map_keys
Usage: map_keys(map) - returns all the keys of the map

spark.sql("SELECT map_keys(map(1, 'a', 2, 'b'))").show(false) // output: [1, 2]

32.map_values
Usage: map_values(map) - returns all the values of the map

spark.sql("SELECT map_values(map(1, 'a', 2, 'b'))").show(false) // output: [a, b]

33.size
Usage: size(expr) - returns the number of elements in an array or a map

spark.sql("SELECT size(array('b', 'd', 'c', 'a'))").show(false) // output: 4
spark.sql("SELECT size(map('b', 'd', 'c', 'a'))").show(false) // output: 2

34.sort_array
Usage: sort_array(array[, ascendingOrder]) - sorts the array in ascending or descending order; ascendingOrder = true means ascending

spark.sql("SELECT sort_array(array('b', 'd', 'c', 'a'), true)").show(false) // output: [a, b, c, d]

Comparison functions

Prepare some data for the demo:

val spark = SparkSession
  .builder()
  .appName("PredicatesFunctionTest")
  .master("local")
  .getOrCreate()

import spark.implicits._

val numberDF = Seq(10, 40, 40, 50, 50, 60, 90, 90).toDF("score")

numberDF.createOrReplaceTempView("numbers")

35.= and ==
These two do the same thing

spark.sql("select * from numbers where score = 40").show()
spark.sql("select * from numbers where score == 40").show()

36.<=>
Usage: null-safe equality. When neither value is null it behaves exactly like =.
If both values being compared are null it returns true; if only one of them is null it returns false.

spark.sql("select * from numbers where score <=> 40").show()
spark.sql("select * from numbers where null <=> null").show()
spark.sql("select * from numbers where score <=> null").show()

The remaining comparison operators (>, >=, <, <=) and negation with ! work as usual:

spark.sql("select * from numbers where score > 40").show()
spark.sql("select * from numbers where score >= 40").show()
spark.sql("select * from numbers where score < 40").show()
spark.sql("select * from numbers where score <= 40").show()
spark.sql("select * from numbers where !(score <= 40)").show()
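
The difference between = and <=> only shows up when a null is involved. Here is a tiny plain-Scala model of the two, using None as a stand-in for SQL NULL. The function names are made up for illustration; this only sketches the three-valued logic described above and is not how Spark implements it.

```scala
object NullSafeEqualitySketch {
  // `=`: the result is NULL (None) as soon as either side is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // `<=>`: never returns NULL; two NULLs compare as equal.
  def nullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b
}
```

In a WHERE clause a NULL condition filters the row out, so `score = null` matches nothing, while `null <=> null` is true for every row.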

Date and time functions

37.add_months
Usage: add_months(start_date, num_months) - adds num_months months to start_date

spark.sql("SELECT add_months('2019-08-09', 1)").show() // output: 2019-09-09

38.current_date
Usage: current_date() - returns the current date at the time the query runs

spark.sql("SELECT current_date()").show() // output: 2019-08-09

39.current_timestamp
Usage: current_timestamp() - returns the current timestamp at the time the query runs
now does the same thing as current_timestamp

spark.sql("SELECT current_timestamp()").show(false) // output: 2019-08-09 22:36:49.54

40.datediff
Usage: datediff(endDate, startDate) - returns the number of days from startDate to endDate

spark.sql("SELECT datediff('2019-07-31', '2019-07-30')").show(false) // output: 1
spark.sql("SELECT datediff('2019-07-30', '2019-07-31')").show(false) // output: -1

41.date_add
Usage: date_add(start_date, num_days) - adds num_days days to start_date

spark.sql("SELECT date_add('2019-07-30', 1)").show(false) // output: 2019-07-31

42.date_sub
Usage: date_sub(start_date, num_days) - subtracts num_days days from start_date

spark.sql("SELECT date_sub('2019-07-30', 1)").show(false) // output: 2019-07-29

43.date_format
Usage: date_format(date, fmt) - returns date formatted according to fmt

spark.sql("SELECT date_format('2019-08-08', 'y')").show(false) // output: 2019
spark.sql("SELECT date_format('2019-08-08', 'yyyy')").show(false) // output: 2019

44.day
Usage: day(date) - returns the day of the month of date; dayofmonth does the same thing as day

spark.sql("SELECT day('2019-07-30')").show(false) // output: 30

45.hour
Usage: hour(timestamp) - returns the hour component of timestamp

spark.sql("SELECT hour('2019-07-30 12:58:59')").show(false) // output: 12

46.last_day
Usage: last_day(date) - returns the last day of the month that date belongs to

spark.sql("SELECT last_day('2019-01-12')").show(false) // output: 2019-01-31

47.minute
Usage: minute(timestamp) - returns the minute component of timestamp

spark.sql("SELECT minute('2009-07-30 12:58:59')").show(false) // output: 58

48.second
Usage: second(timestamp) - returns the second component of timestamp

spark.sql("SELECT second('2009-07-30 12:58:59')").show(false) // output: 59

49.month
Usage: month(date) - returns the month of date

spark.sql("SELECT month('2016-07-30')").show(false) // output: 7

50.year
Usage: year(date) - returns the year of date

spark.sql("SELECT year('2016-07-30')").show(false) // output: 2016

51.dayofyear
Usage: dayofyear(date) - returns the day of the year of date

spark.sql("SELECT dayofyear('2016-04-09')").show(false) // output: 100

52.from_unixtime
Usage: from_unixtime(unix_time, format) - returns unix_time as a string formatted with format

spark.sql("SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss')").show(false) // output: 1970-01-01 08:00:00 (with a UTC+8 session time zone; the result depends on the time zone)

53.from_utc_timestamp
Usage: from_utc_timestamp(timestamp, timezone) - treats timestamp as UTC and returns the corresponding time in timezone

spark.sql("SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul')").show(false) // output: 2016-08-31 09:00:00

54.months_between
Usage: months_between(timestamp1, timestamp2) - returns the number of months between timestamp1 and timestamp2 (positive when timestamp1 is later)

spark.sql("SELECT months_between('1997-02-28 10:30:00', '1996-10-30')").show(false) // output: 3.94959677
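
The fractional part of that result is easier to understand with the arithmetic written out. The plain-Scala sketch below follows the rule in the Spark documentation as I understand it: if both timestamps fall on the same day of the month, or both are the last day of their month, the result is a whole number of months; otherwise the remainder is computed on the basis of 31 days per month and rounded to 8 digits. The names are illustrative, and this is not Spark's own code.

```scala
import java.time.LocalDateTime

object MonthsBetweenSketch {
  def monthsBetween(end: LocalDateTime, start: LocalDateTime): Double = {
    val monthDiff = (end.getYear - start.getYear) * 12 +
      (end.getMonthValue - start.getMonthValue)
    val endIsLastDay = end.getDayOfMonth == end.toLocalDate.lengthOfMonth
    val startIsLastDay = start.getDayOfMonth == start.toLocalDate.lengthOfMonth
    if (end.getDayOfMonth == start.getDayOfMonth || (endIsLastDay && startIsLastDay)) {
      // Same day of month, or both month ends: the time of day is ignored.
      monthDiff.toDouble
    } else {
      // Remaining difference in seconds, expressed as a fraction of a 31-day month.
      val secDiff = (end.getDayOfMonth - start.getDayOfMonth) * 86400L +
        (end.toLocalTime.toSecondOfDay - start.toLocalTime.toSecondOfDay)
      val raw = monthDiff + secDiff / (31 * 86400.0)
      math.round(raw * 1e8) / 1e8   // round to 8 digits
    }
  }
}
```

For the example above: the month difference is 4, the day difference is 28 - 30 = -2 days plus 10:30:00, i.e. -135000 seconds, and 4 - 135000 / 2678400 rounds to 3.94959677.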

55.next_day
Usage: next_day(start_date, day_of_week) - returns the first date after start_date that falls on day_of_week

spark.sql("SELECT next_day('2015-01-14', 'TU')").show(false) // the next Tuesday after 2015-01-14, output: 2015-01-20

56.quarter
Usage: quarter(date) - returns the quarter of the year that date belongs to (1, 2, 3 or 4)

spark.sql("SELECT quarter('2016-08-31')").show(false) // output: 3

57.to_date
Usage: to_date(expr) - returns the date part of expr

spark.sql("SELECT to_date('2009-07-30 04:17:52')").show(false) // output: 2009-07-30

58.to_unix_timestamp
Usage: to_unix_timestamp(expr[, pattern]) - parses expr using pattern and returns the Unix timestamp

spark.sql("SELECT to_unix_timestamp('2016-04-08', 'yyyy-MM-dd')").show(false) // output: 1460044800 (with a UTC+8 session time zone)

59.to_utc_timestamp
Usage: to_utc_timestamp(timestamp, timezone) - treats timestamp as a time in timezone and returns the corresponding UTC timestamp

spark.sql("SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul')").show(false) // output: 2016-08-30 15:00:00

60.trunc
Usage: trunc(date, fmt) - truncates date to the unit given by fmt; in this version only month and year are supported

spark.sql("SELECT trunc('2009-02-12', 'MM')").show(false) // output: 2009-02-01
spark.sql("SELECT trunc('2015-10-27', 'YEAR')").show(false) // output: 2015-01-01

61.unix_timestamp
Usage: unix_timestamp([expr[, pattern]]) - returns the current Unix timestamp, or the Unix timestamp of the given time

spark.sql("SELECT unix_timestamp()").show(false) // output: 1509376609
spark.sql("SELECT unix_timestamp('2016-04-08', 'yyyy-MM-dd')").show(false) // output: 1460044800

62.weekofyear
Usage: weekofyear(date) - returns the week of the year that date falls in

spark.sql("SELECT weekofyear('2008-02-20')").show(false) // output: 8

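63.window
Usage: window(time_column, window_duration[, slide_duration[, start_time]]) - assigns a timestamp to the time window or windows that contain it; it is mainly used in GROUP BY for time-based aggregation
Note: I first thought the result below was a bug, but it appears to be the expected behavior. With a 10 second window that slides every 1 second, the timestamp 2016-04-08 00:00:00 falls into 10 overlapping windows, so the single input row becomes 10 output rows.

spark.sql("SELECT window('2016-04-08', '10 second', '1 second', '0 second')").show(false)

Output:
+------------------------------------------+
|window                                    |
+------------------------------------------+
|[2016-04-07 23:59:51, 2016-04-08 00:00:01]|
|[2016-04-07 23:59:52, 2016-04-08 00:00:02]|
|[2016-04-07 23:59:53, 2016-04-08 00:00:03]|
|[2016-04-07 23:59:54, 2016-04-08 00:00:04]|
|[2016-04-07 23:59:55, 2016-04-08 00:00:05]|
|[2016-04-07 23:59:56, 2016-04-08 00:00:06]|
|[2016-04-07 23:59:57, 2016-04-08 00:00:07]|
|[2016-04-07 23:59:58, 2016-04-08 00:00:08]|
|[2016-04-07 23:59:59, 2016-04-08 00:00:09]|
|[2016-04-08 00:00:00, 2016-04-08 00:00:10]|
+------------------------------------------+
If anyone knows more about how to use it, please leave a comment below. Many thanks!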
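
To see why one timestamp produces ten rows, the windows can be enumerated by hand. This is a minimal plain-Scala sketch that assumes windows are half-open intervals [start, end) whose starts are aligned to the slide interval. The object and method names are made up for illustration, and this is not Spark's implementation.

```scala
import java.time.{LocalDateTime, ZoneOffset}

object SlidingWindowSketch {
  // All [start, end) windows of length windowSec that start every slideSec
  // seconds (shifted by startSec) and contain the timestamp ts.
  def windowsFor(ts: LocalDateTime, windowSec: Long, slideSec: Long,
                 startSec: Long = 0L): Seq[(LocalDateTime, LocalDateTime)] = {
    val t = ts.toEpochSecond(ZoneOffset.UTC)
    // Latest aligned window start that is <= t.
    val lastStart = t - java.lang.Math.floorMod(t - startSec, slideSec)
    // Walk backwards one slide at a time while the window still covers t.
    val starts = Iterator.iterate(lastStart)(_ - slideSec)
      .takeWhile(_ + windowSec > t)
      .toList
      .reverse
    starts.map { s =>
      (LocalDateTime.ofEpochSecond(s, 0, ZoneOffset.UTC),
        LocalDateTime.ofEpochSecond(s + windowSec, 0, ZoneOffset.UTC))
    }
  }
}
```

Calling windowsFor(LocalDateTime.of(2016, 4, 8, 0, 0, 0), 10L, 1L) gives ten windows, from 23:59:51 to 00:00:01 up to 00:00:00 to 00:00:10, matching the table above.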

Grouping functions

Just skim this part; it's fine if you don't follow all of it, I rarely use these myself.
See these articles:
https://msdn.microsoft.com/zh-cn/library/ms175939(SQL.90).aspx
https://msdn.microsoft.com/zh-cn/library/ms189305(v=sql.90).aspx
https://msdn.microsoft.com/zh-cn/library/bb510624(v=sql.105).aspx
Prepare some data for the demo:

val spark = SparkSession
  .builder()
  .appName("GroupingSetsFunctionTest")
  .master("local")
  .getOrCreate()
import spark.implicits._

val dataSeq = Seq("Table,Blue,124", "Table,Red,223", "Chair,Blue,101", "Chair,Red,210")

val df = spark.read.csv(dataSeq.toDS()).toDF("Item", "Color", "Quantity")

df.createOrReplaceTempView("Inventory")

64.cube

spark.sql(
      """
        |SELECT Item, Color, SUM(Quantity) AS QtySum
        |FROM Inventory
        |GROUP BY Item, Color WITH CUBE
      """.stripMargin).show()

65.GROUPING + cube

spark.sql(
      """
       |SELECT CASE WHEN (GROUPING(Item) = 1) THEN 'ALL'
       |            ELSE nvl(Item, 'UNKNOWN')
       |       END AS Item,
       |       CASE WHEN (GROUPING(Color) = 1) THEN 'ALL'
       |            ELSE nvl(Color, 'UNKNOWN')
       |       END AS Color,
       |       SUM(Quantity) AS QtySum
       |FROM Inventory
       |GROUP BY Item, Color WITH CUBE
      """.stripMargin).show()

66.GROUPING + ROLLUP

spark.sql(
      """
       |SELECT CASE WHEN (GROUPING(Item) = 1) THEN 'ALL'
       |            ELSE nvl(Item, 'UNKNOWN')
       |       END AS Item,
       |       CASE WHEN (GROUPING(Color) = 1) THEN 'ALL'
       |            ELSE nvl(Color, 'UNKNOWN')
       |       END AS Color,
       |       SUM(Quantity) AS QtySum
       |FROM Inventory
       |GROUP BY Item, Color WITH ROLLUP
      """.stripMargin).show()

67.GROUPING + ROLLUP + GROUPING_ID

spark.sql(
      """
       |SELECT CASE WHEN (GROUPING(Item) = 1) THEN 'ALL'
       |            ELSE nvl(Item, 'UNKNOWN')
       |       END AS Item,
       |       CASE WHEN (GROUPING(Color) = 1) THEN 'ALL'
       |            ELSE nvl(Color, 'UNKNOWN')
       |       END AS Color,
       |       GROUPING_ID(Item, Color) AS GroupingId,
       |       SUM(Quantity) AS QtySum
       |FROM Inventory
       |GROUP BY Item, Color WITH ROLLUP
      """.stripMargin).show()

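The GroupingId column in the last query is just the GROUPING flags packed into bits. As a rough plain-Scala sketch, based on the documented formula grouping_id(c1, ..., cn) = (grouping(c1) << (n-1)) + ... + grouping(cn) (the object and method names here are made up for illustration):

```scala
object GroupingIdSketch {
  // groupedOut(i) is true when the i-th GROUP BY column is aggregated away
  // in a result row, i.e. GROUPING(col) = 1. The first column ends up in
  // the highest bit.
  def groupingId(groupedOut: Seq[Boolean]): Int =
    groupedOut.foldLeft(0)((acc, g) => (acc << 1) | (if (g) 1 else 0))
}
```

For GROUP BY Item, Color WITH ROLLUP this gives 0 for the detail rows (Item, Color), 1 for the per-Item subtotals (Color shown as ALL) and 3 for the grand total. WITH CUBE would additionally produce 2 for the per-Color subtotals.
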
Math functions

68.acos
Usage: acos(expr) - returns the inverse cosine (a.k.a. arccosine) of expr if -1<=expr<=1, otherwise NaN

spark.sql("SELECT acos(1)").show() // output: 0.0
spark.sql("SELECT acos(2)").show() // output: NaN

69.asin
Usage: asin(expr) - returns the inverse sine (a.k.a. arcsine) of expr if -1<=expr<=1, otherwise NaN

spark.sql("SELECT asin(0)").show() // output: 0.0
spark.sql("SELECT asin(2)").show() // output: NaN

70.atan
Usage: atan(expr) - returns the inverse tangent (a.k.a. arctangent) of expr

spark.sql("SELECT atan(0)").show() // output: 0.0

71.atan2
Usage: atan2(expr1, expr2) - returns the angle in radians between the positive x-axis and the point (expr2, expr1)

spark.sql("SELECT atan2(0, 0)").show() // output: 0.0

72.bin
Usage: bin(expr) - returns the binary string representation of expr treated as a Long

spark.sql("SELECT bin(13)").show() // output: 1101
spark.sql("SELECT bin(-13)").show() // output: 1111111111111111111111111111111111111111111111111111111111110011
spark.sql("SELECT bin(13.3)").show() // output: 1101

73.bround
Usage: bround(expr, d) - rounds expr to d decimal places using HALF_EVEN rounding (banker's rounding: a tie goes to the nearest even digit). This differs from round, which uses HALF_UP.

spark.sql("SELECT bround(2.6, 0)").show() // output: 3
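
The example above does not show where bround and round disagree, because 2.6 is not a tie. A small plain-Scala sketch with BigDecimal shows the two rounding modes side by side. The helper names are illustrative; it only assumes that bround corresponds to HALF_EVEN and round to HALF_UP.

```scala
import scala.math.BigDecimal.RoundingMode

object RoundingSketch {
  // HALF_UP: ties round away from zero (what round does).
  def halfUp(x: BigDecimal, d: Int): BigDecimal = x.setScale(d, RoundingMode.HALF_UP)

  // HALF_EVEN: ties round to the nearest even digit (what bround does).
  def halfEven(x: BigDecimal, d: Int): BigDecimal = x.setScale(d, RoundingMode.HALF_EVEN)
}
```

With these helpers, 2.5 becomes 3 under HALF_UP but 2 under HALF_EVEN, while 3.5 becomes 4 under both and 2.6 becomes 3 under both.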

74.cbrt
Usage: cbrt(expr) - returns the cube root of expr

spark.sql("SELECT cbrt(27)").show() // output: 3.0

75.ceil
Usage: ceil(expr) - returns the smallest integer that is not less than expr

spark.sql("SELECT ceil(-0.1)").show() // output: 0

76.ceiling
Usage: same as ceil

77.cos
Usage: cos(expr) - Returns the cosine of expr.

spark.sql("SELECT cos(0)").show() // output: 1.0

78.cosh
Usage: cosh(expr) - Returns the hyperbolic cosine of expr.

spark.sql("SELECT cosh(0)").show() // output: 1.0

79.conv
Usage: conv(num, from_base, to_base) - Convert num from from_base to to_base.

// converts 100 in base 2 to base 10
spark.sql("SELECT conv('100', 2, 10)").show() // output: 4
spark.sql("SELECT conv(-10, 16, -10)").show() // output: -16

80.degrees
Usage: degrees(expr) - Converts radians to degrees.

spark.sql("SELECT degrees(3.141592653589793)").show() // output: 180.0

81.e
Usage: e() - Returns Euler's number, e.

spark.sql("SELECT e()").show() // output: 2.718281828459045

82.exp
Usage: exp(expr) - Returns e to the power of expr.

spark.sql("SELECT exp(0)").show() // output: 1.0

83.expm1
Usage: expm1(expr) - Returns exp(expr) - 1.

spark.sql("SELECT expm1(0)").show() // output: 0.0

84.floor
Usage: floor(expr) - returns the largest integer that is not greater than expr

spark.sql("SELECT floor(-0.1)").show() // output: -1
spark.sql("SELECT floor(5.4)").show() // output: 5

85.factorial
Usage: factorial(expr) - returns the factorial of expr; expr must be in [0, 20], otherwise null is returned

spark.sql("SELECT factorial(5)").show() // output: 120

86.hex
Usage: hex(expr) - converts expr to hexadecimal

spark.sql("SELECT hex(17)").show() // output: 11
spark.sql("SELECT hex('Spark SQL')").show() // output: 537061726B2053514C


87.sqrt
Usage: sqrt(expr) - returns the square root of expr

spark.sql("SELECT sqrt(4)").show() // output: 2.0

88.hypot
Usage: hypot(expr1, expr2) - returns sqrt(expr1**2 + expr2**2)

spark.sql("SELECT hypot(3, 4)").show() // output: 5.0

89.log
Usage: log(base, expr) - Returns the logarithm of expr with base

spark.sql("SELECT log(10, 100)").show() // output: 2.0

90.log10
Usage: log10(expr) - Returns the logarithm of expr with base 10

spark.sql("SELECT log10(10)").show() // output: 1.0

91.log1p
Usage: log1p(expr) - Returns log(1 + expr)

spark.sql("SELECT log1p(0)").show() // output: 0.0

92.log2
Usage: log2(expr) - Returns the logarithm of expr with base 2

spark.sql("SELECT log2(2)").show() // output: 1.0

93.ln
Usage: ln(expr) - Returns the natural logarithm (base e) of expr

spark.sql("SELECT ln(1)").show() // output: 0.0

94.negative
Usage: negative(expr) - returns the negation of expr

spark.sql("SELECT negative(1)").show() // output: -1

95.pi
Usage: pi() - returns the value of PI

spark.sql("SELECT pi()").show() // output: 3.141592653589793

96.pmod
Usage: pmod(expr1, expr2) - returns the positive value of expr1 mod expr2

spark.sql("SELECT pmod(10, 3)").show() // output: 1
spark.sql("SELECT pmod(-10, 3)").show() // output: 2
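
The second result is where pmod differs from the plain % operator, which keeps the sign of the dividend. A minimal plain-Scala sketch, assuming a positive divisor (the function name mirrors the SQL function but is just a local helper):

```scala
object PmodSketch {
  // Plain % keeps the sign of the dividend: -10 % 3 is -1 on the JVM.
  // Adding the divisor and taking % again moves the result into [0, n).
  def pmod(a: Int, n: Int): Int = ((a % n) + n) % n
}
```

So -10 % 3 is -1, while pmod(-10, 3) is 2; for non-negative inputs the two agree.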

97.positive
Usage: positive(expr) - returns expr unchanged

spark.sql("SELECT positive(-10)").show() // output: -10

98.power (same as pow)
Usage: pow(expr1, expr2) - returns expr1 raised to the power of expr2

spark.sql("SELECT pow(2, 3)").show() // 2 to the power of 3, output: 8.0
spark.sql("SELECT power(2, 3)").show() // 2 to the power of 3, output: 8.0

99.radians
Usage: radians(expr) - converts degrees to radians

spark.sql("SELECT radians(180)").show() // output: 3.141592653589793

100.rint
Usage: rint(expr) - returns the double value that is closest to expr and equal to a mathematical integer

spark.sql("SELECT rint(12.3456)").show() // output: 12.0

101.round
Usage: round(expr, d) - rounds expr to d decimal places using HALF_UP rounding

spark.sql("SELECT round(2.5444, 2)").show() // output: 2.54
spark.sql("SELECT round(2.2, 0)").show() // output: 2

102.shiftleft
Usage: shiftleft(base, expr) - shifts base left by expr bits

spark.sql("SELECT shiftleft(3, 1)").show() // binary 3 shifted left by 1 bit, output: 6

103.shiftright
Usage: shiftright(base, expr) - shifts base right by expr bits (signed)

spark.sql("SELECT shiftright(3, 1)").show() // binary 3 shifted right by 1 bit, output: 1

104.shiftrightunsigned
Usage: shiftrightunsigned(base, expr) - shifts base right by expr bits (unsigned)

spark.sql("SELECT shiftrightunsigned(3, 1)").show() // binary 3 shifted right by 1 bit, unsigned, output: 1
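
With a positive input such as 3 the signed and unsigned shifts give the same answer, so the difference is easy to miss. Here is a short plain-Scala sketch using the JVM operators >> and >>> on 32-bit integers; the helper names are illustrative, and I am assuming the SQL functions behave like these operators for int inputs.

```scala
object ShiftSketch {
  // Signed shift: the sign bit is copied in from the left.
  def shiftRight(x: Int, n: Int): Int = x >> n

  // Unsigned shift: zeros are shifted in from the left.
  def shiftRightUnsigned(x: Int, n: Int): Int = x >>> n
}
```

For -8, the signed shift by 1 gives -4, while the unsigned shift gives 2147483644 because the sign bit is replaced by a zero.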

105.signum (same as sign)
Usage: sign(expr) - returns 0 if expr is 0, -1 if expr is negative, and 1 if expr is positive (as a double)

spark.sql("SELECT sign(40)").show() // output: 1.0
spark.sql("SELECT signum(40)").show() // output: 1.0

106.sin
Usage: sin(expr) - Returns the sine of expr

spark.sql("SELECT sin(0)").show() // output: 0.0

107.sinh
Usage: sinh(expr) - Returns the hyperbolic sine of expr

spark.sql("SELECT sinh(0)").show() // output: 0.0

108.str_to_map
Usage: str_to_map(text[, pairDelim[, keyValueDelim]]) - creates a map by splitting text into key/value pairs; pairDelim separates the pairs and keyValueDelim separates each key from its value

spark.sql("SELECT str_to_map('a:1,b:2,c:3', ',', ':')").show() // output: Map(a -> 1, b -> 2, c -> 3)

109.tan
Usage: tan(expr) - Returns the tangent of expr

spark.sql("SELECT tan(0)").show() // output: 0.0

110.tanh
Usage: tanh(expr) - Returns the hyperbolic tangent of expr

spark.sql("SELECT tanh(0)").show() // output: 0.0

111.+
Usage: expr1 + expr2 - Returns expr1+expr2

spark.sql("SELECT 1 + 2").show() // output: 3

112.-
Usage: expr1 - expr2 - Returns expr1-expr2

spark.sql("SELECT 1 - 2").show() // output: -1

113.*
Usage: expr1 * expr2 - Returns expr1*expr2

spark.sql("SELECT 1 * 2").show() // output: 2

114./
Usage: expr1 / expr2 - Returns expr1/expr2

spark.sql("SELECT 1 / 2").show() // output: 0.5

115.% (remainder)
Usage: expr1 % expr2 - Returns expr1%expr2

spark.sql("SELECT 1 % 2").show() // output: 1


Logical functions

Prepare the data:

val dataSeq = Seq("Table,Blue,124", "Table,Red,223", "Chair,Blue,101", "Chair,Red,210")

val df = spark.read.csv(dataSeq.toDS()).toDF("Item", "Color", "Quantity")

df.createOrReplaceTempView("Inventory")

116.and

spark.sql("select * from Inventory where Item = 'Table' and Color = 'Red'").show()

117.or

spark.sql("select * from Inventory where Item = 'Table' or Color = 'Red'").show()

118.not

spark.sql("select * from Inventory where not(Item = 'Table' and Color = 'Red') ").show()

119.in

spark.sql("select * from Inventory where Item in ('Table', 'Chair')").show()


String functions

Prepare the data:

// Assumed row type behind the testData view: an integer key and its string form
case class TestData(key: Int, value: String)

val df = spark.sparkContext.parallelize(
  (1 to 100).map(i => TestData(i, i.toString))).toDF()
df.createOrReplaceTempView("testData")

120.ascii
Usage: ascii(str) - returns the numeric (ASCII) value of the first character of str

spark.sql("SELECT ascii('222')").show() // output: 50
spark.sql("SELECT ascii(2)").show() // output: 50

121.base64
Usage: base64(bin) - converts the binary argument bin to a base 64 string

spark.sql("SELECT base64('222')").show() // output: MjIy

122.concat
Usage: concat(str1, str2, …, strN) - returns the concatenation of str1, str2, …, strN

spark.sql("SELECT concat('222', 'spark')").show() // output: 222spark

123.concat_ws
Usage: concat_ws(sep, [str | array(str)]+) - returns the concatenation of str1, str2, …, strN, separated by sep

spark.sql("SELECT concat_ws(' ','222', 'spark')").show() // output: 222 spark

124.decode
Usage: decode(bin, charset) - decodes bin using charset

spark.sql("SELECT decode(encode('abc', 'utf-8'), 'utf-8')").show() // output: abc

125.encode
Usage: encode(str, charset) - encodes str using charset

spark.sql("SELECT encode('abc', 'utf-8')").show() // output: [61 62 63]

126.elt
Usage: elt(n, str1, str2, …) - returns the n-th string among str1, str2, …

spark.sql("SELECT elt(1, 'scala', 'java')").show() // output: scala

127.find_in_set
Usage: find_in_set(str, str_array) - returns the position of str in the comma-separated list str_array (1 means the first item)

spark.sql("SELECT find_in_set('ab','abc,b,ab,c,def')").show() // output: 3

128.format_string
Usage: format_string(strfmt, obj, …) - returns a string formatted from the printf-style format strfmt

spark.sql("SELECT format_string('Hello World %d %s', 100, 'days')").show() // output: Hello World 100 days

129.format_number
Usage: format_number(expr1, expr2) - formats expr1 with thousands separators, rounded to expr2 decimal places

spark.sql("SELECT format_number(12332.123456, 4)").show() // output: 12,332.1235


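130.get_json_object
Usage: get_json_object(json_txt, path) - extracts the JSON value at path from json_txt

spark.sql("SELECT get_json_object('{\"a\":\"b\"}', '$.a')").show() // output: b

131.initcap
Usage: initcap(str) - upper-cases the first letter of each word and lower-cases the rest; words are separated by spaces

spark.sql("SELECT initcap('sPark sql')").show() // output: Spark Sql

132.instr
Usage: instr(str, substr) - returns the position of the first occurrence of substr in str (1 based)

spark.sql("SELECT instr('SparkSQL', 'SQL')").show() // output: 6

133.lcase
Usage: lcase(str) - returns str converted to lower case
lower does the same thing as lcase

spark.sql("SELECT lcase('SparkSql')").show() // output: sparksql

134.ucase
Usage: ucase(str) - returns str converted to upper case
upper does the same thing as ucase

spark.sql("SELECT ucase('SparkSql')").show() // output: SPARKSQL

135.length
Usage: length(str) - returns the length of str

spark.sql("SELECT length('SparkSql')").show() // output: 8

136.levenshtein
Usage: levenshtein(str1, str2) - returns the Levenshtein distance between str1 and str2
The edit distance, also called the Levenshtein distance, is the minimum number of edit operations needed to turn one string into the other.
The allowed operations are replacing one character with another, inserting a character, and deleting a character. In general, the smaller the edit distance, the more similar the two strings are.

spark.sql("SELECT levenshtein('kitten', 'sitting')").show() // output: 3 (three edits: k -> s, e -> i, and insert g)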
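
The definition above translates directly into the classic dynamic-programming computation. This is a compact plain-Scala sketch of that textbook algorithm, not Spark's implementation; the object and method names are illustrative.

```scala
object LevenshteinSketch {
  def distance(s: String, t: String): Int = {
    // prev(j) holds the distance between the processed prefix of s and t.take(j).
    var prev = Array.range(0, t.length + 1)
    for (i <- 1 to s.length) {
      val curr = new Array[Int](t.length + 1)
      curr(0) = i // deleting i characters turns the prefix of s into ""
      for (j <- 1 to t.length) {
        val cost = if (s(i - 1) == t(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost)                     // substitution (or match)
      }
      prev = curr
    }
    prev(t.length)
  }
}
```

distance("kitten", "sitting") returns 3, the same value as the SQL example.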

137.like
Usage: expr like str - tests whether expr matches the pattern str

spark.sql("SELECT * from testData where value like '%3%'").show()

138.rlike
Usage: expr rlike str - tests whether expr matches the regular expression str

spark.sql("SELECT * from testData where value rlike '\\\\d+'").show()

139.locate
Usage: locate(substr, str[, pos]) - returns the position of the first occurrence of substr in str at or after position pos

spark.sql("SELECT locate('bar', 'foobarbar', 5)").show() // output: 7

140.lpad
Usage: lpad(str, len, pad) - left-pads str with pad to a length of len; if str is longer than len, it is truncated to len characters

spark.sql("SELECT lpad('hi', 5, '??')").show() // output: ???hi
spark.sql("SELECT lpad('hi', 1, '??')").show() // output: h

141.rpad
Usage: rpad(str, len, pad) - right-pads str with pad to a length of len; if str is longer than len, it is truncated to len characters

spark.sql("SELECT rpad('hi', 5, '??')").show() // output: hi???
spark.sql("SELECT rpad('hi', 1, '??')").show() // output: h

142.ltrim
Usage: ltrim(str) - removes all spaces on the left of str

spark.sql("SELECT ltrim('    SparkSQL')").show() // output: SparkSQL

143.rtrim
Usage: rtrim(str) - removes all spaces on the right of str

spark.sql("SELECT rtrim('SparkSQL    ')").show() // output: SparkSQL

144.trim
Usage: trim(str) - removes all spaces on both sides of str

spark.sql("SELECT trim('   SparkSQL    ')").show() // output: SparkSQL

145.json_tuple
Usage: json_tuple(jsonStr, p1, p2, …, pn) - extracts the values of the fields p1, p2, …, pn from jsonStr

spark.sql("SELECT json_tuple('{\"a\":1, \"b\":2}', 'a', 'b')").show()

146.parse_url
Usage: parse_url(url, partToExtract[, key]) - parses url and extracts one part of it

spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'HOST')").show() // output: spark.apache.org
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY')").show() // output: query=1
spark.sql("SELECT parse_url('http://spark.apache.org/path?query=1', 'QUERY', 'query')").show() // output: 1

147.printf
Usage: printf(strfmt, obj, …) - returns a string formatted from the printf-style format strfmt

spark.sql("SELECT printf('Hello World %d %s', 100, 'days')").show() // output: Hello World 100 days

148.regexp_extract
Usage: regexp_extract(str, regexp[, idx]) - extracts the group idx of the first match of regexp in str

spark.sql("SELECT regexp_extract('100-200', '(\\\\d+)-(\\\\d+)', 1)").show() // output: 100

149.regexp_replace
Usage: regexp_replace(str, regexp, rep) - replaces every part of str that matches regexp with rep

spark.sql("SELECT regexp_replace('100-200', '(\\\\d+)', 'num')").show() // output: num-num

151.repeat
Usage: repeat(str, n) - returns str repeated n times

spark.sql("SELECT repeat('123', 2)").show() // output: 123123

152.reverse
Usage: reverse(str) - reverses str

spark.sql("SELECT reverse('Spark SQL')").show() // output: LQS krapS

153.sentences
Usage: sentences(str[, lang, country]) - splits str into sentences, and each sentence into an array of words
lang is the language and country is the country

spark.sql("SELECT sentences('Hi there! Good morning.')").show(false) // output: [WrappedArray(Hi, there), WrappedArray(Good, morning)]
spark.sql("SELECT sentences('你 好! 早 上好.', 'zh', 'CN')").show(false) // output: [WrappedArray(你, 好), WrappedArray(早, 上好)]

154.soundex
Usage: soundex(str) - returns the Soundex code of the string

spark.sql("SELECT soundex('Miller')").show() // output: M460

155.space
Usage: space(n) - returns n spaces

spark.sql("SELECT concat('hi', space(3), 'hek')").show() // output: hi   hek

156.split
Usage: split(str, regex) - splits str around matches of regex

spark.sql("SELECT split('oneAtwoBthreeC', '[ABC]')").show() // output: ["one", "two", "three", ""]

157.substr
Usage: substr(str, pos[, len]) - returns the substring of str that starts at pos and has length len (to the end of the string if len is omitted)
substring does the same thing as substr

spark.sql("SELECT substr('Spark SQL', 5)").show() // output: k SQL
spark.sql("SELECT substr('Spark SQL', -3)").show() // output: SQL
spark.sql("SELECT substr('Spark SQL', 5, 1)").show() // output: k

158.substring_index
Usage: substring_index(str, delim, count) - returns the part of str before the count-th occurrence of delim

spark.sql("SELECT substring_index('www.apache.org', '.', 2)").show() // output: www.apache

159.translate
Usage: translate(input, from, to) - replaces each character of input that appears in from with the corresponding character in to

spark.sql("SELECT translate('AaBbCc', 'abc', '123')").show() // output: A1B2C3

160.xpath
Usage: xpath(xml, xpath) - extracts the values that match xpath from xml

spark.sql("SELECT xpath('<a><b>b1</b><b>b2</b><b>b3</b><c>c1</c><c>c2</c></a>','a/b/text()')").show() // output: [b1, b2, b3]

161.xpath_boolean
Usage: xpath_boolean(xml, xpath) - returns true if xpath matches something in xml

spark.sql("SELECT xpath_boolean('<a><b>1</b></a>','a/b')").show() // output: true

162.xpath_double
Usage: xpath_double(xml, xpath) - extracts a double value from xml using xpath; returns zero if nothing matches, and NaN if the matched value is not numeric
xpath_number does the same thing as xpath_double

spark.sql("SELECT xpath_double('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show() // output: 3.0

163.xpath_float
Usage: xpath_float(xml, xpath) - extracts a float value from xml using xpath; returns zero if nothing matches, and NaN if the matched value is not numeric

spark.sql("SELECT xpath_float('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show() // output: 3.0

164.xpath_int
Usage: xpath_int(xml, xpath) - extracts an int value from xml using xpath; returns zero if nothing matches, and 0 if the matched value is not numeric

spark.sql("SELECT xpath_int('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show() // output: 3

165.xpath_long
Usage: xpath_long(xml, xpath) - extracts a long value from xml using xpath; returns zero if nothing matches, and 0 if the matched value is not numeric

spark.sql("SELECT xpath_long('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show() // output: 3

166.xpath_short
Usage: xpath_short(xml, xpath) - extracts a short value from xml using xpath; returns zero if nothing matches, and 0 if the matched value is not numeric

spark.sql("SELECT xpath_short('<a><b>1</b><b>2</b></a>', 'sum(a/b)')").show() // output: 3

167.xpath_string
Usage: xpath_string(xml, xpath) - returns the text content of the first node in xml that matches xpath

spark.sql("SELECT xpath_string('<a><b>b</b><c>cc</c><c>c2</c></a>','a/c')").show() // output: cc

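Window functions

val dataSeq = Seq("c001,j2se,t002", "c002,java web,t002",
  "c003,ssh,t001", "c004,oracle,t001", "c005,spark,t003", "c006,c,t003", "c007,js,t002")

val df = spark.read.csv(dataSeq.toDS()).toDF("cno", "cname", "tno")

df.createOrReplaceTempView("course")

// LAG returns the value from the previous row, LEAD the value from the next row
spark.sql("SELECT c.*,LAG(c.cname,1) OVER(ORDER BY c.cno) as lag_result FROM course c").show()
spark.sql("SELECT c.*,LEAD(c.cname,1) OVER(ORDER BY c.cno) as lead_result FROM course c").show()

168.row_number
Generates a sequential, non-repeating number for each row returned by the query.
The rows are first sorted by the ORDER BY inside the OVER clause, and the numbers are then assigned in that order. The ORDER BY inside OVER has nothing to do with the ORDER BY of the SQL statement itself; the two can be completely different.

spark.sql("SELECT c.*,row_number() OVER(ORDER BY c.cno) as rowNo FROM course c").show()

169.Other window functions

val numberDF = Seq(10, 40, 40, 50, 50, 60, 90, 90).toDF("score")

numberDF.createOrReplaceTempView("numbers")

// rank returns the rank of each row within its partition of the result set;
// a row's rank is one plus the number of rows ranked before it

// dense_rank is similar to rank, but the numbers it generates are consecutive,
// whereas the numbers generated by rank may have gaps. When rows tie, dense_rank does not
// skip any numbers: the next rank follows directly after the previous one.
// Within each group, rank() skips: after two rows tied for first place the next row is third.
// dense_rank() is consecutive: after two rows tied for first place the next row is second.

// row_number: numbers the rows 1, 2, 3, ..., n in order, whether or not there are ties
// rank: tied rows share a rank, and the following ranks skip by the number of ties, e.g. 1 2 2 2 5 6 6 8
// dense_rank: tied rows share a rank, and the following ranks do not skip, e.g. 1 2 2 2 3 4 4 5

// cume_dist: (number of rows with a value less than or equal to the current row's value) / (total number of rows)
// percent_rank: (rank of the current row - 1) / (total number of rows - 1)

// ntile splits the rows of an ordered partition into the specified number of groups

// Reference: http://www.cnblogs.com/52XF/p/4209211.html
spark.sql(
  """
    |select ROW_NUMBER() over(order by score) as rownum
    |,score
    |,cume_dist()over(order by score) as cum
    |,percent_rank() over(order by score) as per_rnk
    |,rank() over(order by score) as rnk
    |,dense_rank() over(order by score) as dense_rnk
    |,ntile(4) over(order by score) as nt
    |from numbers
  """.stripMargin).show()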

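To check the ranking formulas by hand, here is a plain-Scala sketch that recomputes row_number, rank, dense_rank, percent_rank and cume_dist for the same eight scores, all in a single partition ordered by score. The case class and function names are made up for illustration; the code only follows the definitions given in the comments of the query and is not Spark's implementation.

```scala
object RankingSketch {
  final case class Ranked(score: Int, rowNumber: Int, rank: Int, denseRank: Int,
                          percentRank: Double, cumeDist: Double)

  def rankAll(scores: Seq[Int]): Seq[Ranked] = {
    val sorted = scores.sorted
    val n = sorted.size
    val distinctScores = sorted.distinct
    sorted.zipWithIndex.map { case (s, i) =>
      val rank = sorted.count(_ < s) + 1              // rows strictly before, plus one
      val denseRank = distinctScores.indexOf(s) + 1   // distinct values before, plus one
      val percentRank = if (n > 1) (rank - 1).toDouble / (n - 1) else 0.0
      val cumeDist = sorted.count(_ <= s).toDouble / n
      Ranked(s, i + 1, rank, denseRank, percentRank, cumeDist)
    }
  }
}
```

For the scores 10, 40, 40, 50, 50, 60, 90, 90 this gives rank 1, 2, 2, 4, 4, 6, 7, 7 and dense_rank 1, 2, 2, 3, 3, 4, 5, 5, which shows the skipping versus consecutive behavior described above.
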
Miscellaneous functions

170.assert_true
Usage: assert_true(expr) - checks that expr is true; throws an exception if it is not

spark.sql("SELECT assert_true(0 < 1)").show(false) // output: null

171.crc32
Usage: crc32(expr) - returns the cyclic redundancy check value of expr

spark.sql("SELECT crc32('spark')").show(false) // output: 2635321133

172.md5
Usage: md5(expr) - Returns an MD5 128-bit checksum as a hex string of expr

spark.sql("SELECT md5('spark')").show(false) // output: 98f11b7a7880169c3bd62a5a507b3965

173.hash
Usage: hash(expr1, expr2, …) - returns the hash value of expr1, expr2, …

spark.sql("SELECT hash('Spark', array(123), 2)").show(false) // output: -1321691492

174.sha
Usage: sha(expr) - Returns a sha1 hash value as a hex string of the expr
sha1 is used the same way as sha

spark.sql("SELECT sha('Spark')").show(false) // output: 85f5955f4b27a9a4c2aab6ffe5d7189fc298b92c

175.sha2
Usage: sha2(expr, bitLength) - Returns a checksum of SHA-2 family as a hex string of expr.
SHA-224, SHA-256, SHA-384, and SHA-512 are supported. Bit length of 0 is equivalent to 256

spark.sql("SELECT sha2('Spark', 256)").show(false) // output: 529bc3b07127ecb7e53a4dcf1991d9152c24537d919178022b2c42657f79a26b

176.spark_partition_id
Usage: spark_partition_id() - returns the partition id of the current row

spark.sql("SELECT spark_partition_id()").show(false)

177.input_file_name
Usage: input_file_name() - returns the name of the file currently being read

spark.sql("SELECT input_file_name()").show(false) // output: ""

178.monotonically_increasing_id
Usage: monotonically_increasing_id() - generates monotonically increasing, unique ids

spark.sql("SELECT monotonically_increasing_id()").show(false) // output: 0

179.current_database
Usage: current_database() - returns the current database

spark.sql("SELECT current_database()").show(false) // output: default

180.reflect
Usage: reflect(class, method[, arg1[, arg2 …]]) - calls the method named method on class using reflection
java_method does the same thing as reflect

spark.sql("SELECT reflect('java.util.UUID', 'randomUUID')").show(false) // output: 8f3d20fa-4e0f-4ef9-9935-5972cc5b0d79
spark.sql("SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2')").show(false) // output: a5cf6c42-0c85-418f-af6c-3e4e5b1328f2

181.abs
Absolute value

spark.sql("SELECT abs(-1)").show() // output: 1

182.coalesce
Returns the first argument that is not null

spark.sql("SELECT coalesce(1, 2)").show() // output: 1
spark.sql("SELECT coalesce(null, 1, 2)").show() // output: 1
spark.sql("SELECT coalesce(null, null, 2)").show() // output: 2
spark.sql("SELECT coalesce(null, null, null)").show() // output: null


183.explode
Turns each element of an array into its own row

spark.sql("SELECT explode(array(10,20))").show()

184.greatest
Returns the largest of the arguments (at least 2 arguments are required); null arguments are skipped, and if all arguments are null the result is null

spark.sql("SELECT greatest(10, 3, 5, 13, -1)").show() // output: 13

185.least
Usage: least(expr, …) - returns the smallest of all the arguments, skipping null values

spark.sql("SELECT least(10, 3, 5, 13, -1)").show() // output: -1

186.if
Usage: if(expr1, expr2, expr3) - returns expr2 if expr1 is true, otherwise returns expr3

spark.sql("SELECT if(1 < 2, 'a', 'b')").show() // output: a

187.inline
Turns an array of structs into a table

spark.sql("SELECT inline(array(struct(1, 'a'), struct(2, 'b')))").show()

188.isnan
Tests whether the argument is NaN (not a number)

spark.sql("SELECT isnan(cast('NaN' as double))").show() // output: true

189.ifnull
Usage: ifnull(expr1, expr2) - returns expr2 if expr1 is null, otherwise returns expr1

spark.sql("SELECT ifnull(NULL, array('2'))").show() // output: [2]

190.isnull
Usage: isnull(expr) - returns true if expr is null, otherwise false

spark.sql("SELECT isnull(1)").show() // output: false

191.isnotnull
Usage: isnotnull(expr) - returns true if expr is not null, otherwise false

spark.sql("SELECT isnotnull(1)").show() // output: true

192.nanvl
Usage: nanvl(expr1, expr2) - returns expr1 if it is not NaN, otherwise returns expr2

spark.sql("SELECT nanvl(cast('NaN' as double), 123)").show() // output: 123.0

193.nullif
Usage: nullif(expr1, expr2) - returns null if expr1 equals expr2, otherwise returns expr1

spark.sql("SELECT nullif(2, 2)").show() // output: null

194.nvl
Usage: nvl(expr1, expr2) - returns expr2 if expr1 is null, otherwise returns expr1

spark.sql("SELECT nvl(NULL, array('2'))").show() // output: [2]

195.nvl2
Usage: nvl2(expr1, expr2, expr3) - returns expr2 if expr1 is not null, otherwise returns expr3

spark.sql("SELECT nvl2(NULL, 2, 1)").show() // output: 1
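
The null-handling functions (coalesce, nvl/ifnull, nvl2, nullif) are easy to mix up, so here is a small plain-Scala model of them with None standing in for SQL NULL. The functions below only borrow the SQL names for readability; they are local helpers written for this sketch, not Spark APIs.

```scala
object NullHandlingSketch {
  // coalesce: the first argument that is not null, or null if there is none.
  def coalesce[A](xs: Option[A]*): Option[A] = xs.flatten.headOption

  // nvl / ifnull: the fallback b when a is null, otherwise a.
  def nvl[A](a: Option[A], b: Option[A]): Option[A] = a.orElse(b)

  // nvl2: picks between two other values depending on whether a is null.
  def nvl2[A, B](a: Option[A], ifNotNull: Option[B], ifNull: Option[B]): Option[B] =
    if (a.isDefined) ifNotNull else ifNull

  // nullif: null when both values are equal, otherwise a.
  def nullif[A](a: Option[A], b: Option[A]): Option[A] =
    if (a.isDefined && a == b) None else a
}
```

For example, coalesce[Int](None, Some(1), Some(2)) is Some(1), nvl2[Int, Int](None, Some(2), Some(1)) is Some(1), and nullif(Some(2), Some(2)) is None, mirroring the SQL results above.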

196.posexplode
Usage: posexplode(expr) - turns each element of an array expr into a row with its position; turns each key/value pair of a map expr into a row with its position

spark.sql("SELECT posexplode(array(10,20))").show()
spark.sql("SELECT posexplode(map(1.0, '2', 3.0, '4'))").show()

197.rand
Usage: rand([seed]) - returns a random value with independent and identically distributed (i.i.d.) values, uniformly distributed in [0, 1)

spark.sql("SELECT rand()").show() // output: 0.30627416170191424, different on every run
spark.sql("SELECT rand(0)").show() // output: 0.8446490682263027, the same on every run
spark.sql("SELECT rand(null)").show() // output: 0.8446490682263027, the same on every run

198.randn
Usage: randn([seed]) - returns a random value with independent and identically distributed (i.i.d.) values drawn from the standard normal distribution

spark.sql("SELECT randn()").show() // output: -1.499402805473817, different on every run
spark.sql("SELECT randn(0)").show() // output: 1.1164209726833079, the same on every run
spark.sql("SELECT randn(null)").show() // output: 1.1164209726833079, the same on every run

199.stack
Usage: stack(n, expr1, …, exprk) - splits expr1, …, exprk into n rows

spark.sql("SELECT stack(2, 1, 2, 3)").show()

200.when
Usage: CASE WHEN expr1 THEN expr2 [WHEN expr3 THEN expr4]* [ELSE expr5] END

spark.sql("SELECT CASE WHEN 1 > 2 THEN 'a' WHEN 2 > 3 THEN 'B' ELSE 'C' END").show() // output: C
