一、说明
- spark框架源码functions.scala文件包含了大量的内置函数,尤其在agg函数中会广泛应用。
- 使用之前要导入import org.apache.spark.sql.functions._
- 这里只列举了一些常用的内置函数,其他的要参考官网。
二、代码示例
package com.dt.spark.Test
import com.dt.spark.Test.AggTest.Student
import org.apache.spark.sql.{DataFrame, SparkSession}
/**
* 测试spark SQL的内置函数
*/
object FuncTest {
case class Student(classId: Int, name: String, gender: String, age: Int, birthday: String)
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().master("local[*]").appName("FuncTest").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext
sc.setLogLevel("WARN")
val stuDF: DataFrame = Seq(
Student(1001, "zhangsan", "F", 20, "1995-12-11 12:12:13"),
Student(1002, "lisi", "M", 16, "2000-01-14 10:10:57"),
Student(1003, "wangwu", "M", 21, "1994-05-13 01:12:00"),
Student(1004, "zhaoliu", "F", 21, "1998-02-03 03:02:48"),
Student(1004, "zhouqi", "M", 22, "1996-06-04 05:28:19"),
Student(1001, "qianba", "M", 19, "1997-10-14 08:09:47"),
Student(1003, "liuliu", "F", 23, "2004-09-14 16:25:30")
).toDF()
import org.apache.spark.sql.functions._
/**
* 主要日期函数,支持 date/timestamp/string
*/
stuDF.withColumn("b01", date_format($"birthday", "yyyy-MM-dd"))
.withColumn("date_add", date_add(col("birthday"), 3))
.withColumn("date_sub", date_sub('birthday, 3))
.withColumn("datediff", datediff('date_add, 'date_sub))
.withColumn("hour", hour(col("birthday"))) //提取小时
.withColumn("month", month(col("birthday"))) //提取月
.withColumn("quarter", quarter(col("birthday"))) //提取季度
.show()
/**
* 字符串函数:
* initcap(e: Column): 单词首字母大写
* lower(e: Column): 转小写
* upper(e: Column): 转大写
* instr(str: Column, substring: String): substring在str中第一次出现的位置
* length(e: Column): 字符串长度
*/
stuDF.withColumn("initcap", initcap($"name"))
.withColumn("lower", lower('gender))
.withColumn("upper", upper(col("name")))
.withColumn("instr", instr(column("classId"), "00"))
.withColumn("length", length(column("name")))
.show()
/**
* 其他非聚合函数:
* struct(cols: Column*):多列组合成新的struct column
* when(condition: Column, value: Any):当condition为true返回value
* lit(literal: Any):将字面量(literal)创建一个Column
* isnan(e: Column):如果列为非数字值得特殊值NaN,则返回true
* isnull(e: Column):如果为空,则返回true
* array(cols: Column*)多列合并为array,cols必须为同类型
*/
stuDF.withColumn("struct", struct(col("name"), $"gender"))
.withColumn("array", array(col("name"), $"gender"))
.withColumn("when", when($"gender" === "F", 1).
when(col("gender") === "M", 0)
.otherwise("2"))
.withColumn("isnan", isnan($"gender"))
.withColumn("isnan1", isnan($"age"))
.withColumn("isnull", isnull($"gender"))
.withColumn("lit", lit("00"))
.show()
/**
* 聚合函数:
* avg:平均值
* collect_list:聚合指定字段的值到list
* count:计数
* mean:平均值
*/
stuDF.agg(avg($"age"), mean($"age"), collect_list($"name"), count($"gender"))
.show()
}
}
三、运行结果
+-------+--------+------+---+-------------------+----------+----------+----------+--------+----+-----+-------+
|classId| name|gender|age| birthday| b01| date_add| date_sub|datediff|hour|month|quarter|
+-------+--------+------+---+-------------------+----------+----------+----------+--------+----+-----+-------+
| 1001|zhangsan| F| 20|1995-12-11 12:12:13|1995-12-11|1995-12-14|1995-12-08| 6| 12| 12| 4|
| 1002| lisi| M| 16|2000-01-14 10:10:57|2000-01-14|2000-01-17|2000-01-11| 6| 10| 1| 1|
| 1003| wangwu| M| 21|1994-05-13 01:12:00|1994-05-13|1994-05-16|1994-05-10| 6| 1| 5| 2|
| 1004| zhaoliu| F| 21|1998-02-03 03:02:48|1998-02-03|1998-02-06|1998-01-31| 6| 3| 2| 1|
| 1004| zhouqi| M| 22|1996-06-04 05:28:19|1996-06-04|1996-06-07|1996-06-01| 6| 5| 6| 2|
| 1001| qianba| M| 19|1997-10-14 08:09:47|1997-10-14|1997-10-17|1997-10-11| 6| 8| 10| 4|
| 1003| liuliu| F| 23|2004-09-14 16:25:30|2004-09-14|2004-09-17|2004-09-11| 6| 16| 9| 3|
+-------+--------+------+---+-------------------+----------+----------+----------+--------+----+-----+-------+
+-------+--------+------+---+-------------------+--------+-----+--------+-----+------+
|classId| name|gender|age| birthday| initcap|lower| upper|instr|length|
+-------+--------+------+---+-------------------+--------+-----+--------+-----+------+
| 1001|zhangsan| F| 20|1995-12-11 12:12:13|Zhangsan| f|ZHANGSAN| 2| 8|
| 1002| lisi| M| 16|2000-01-14 10:10:57| Lisi| m| LISI| 2| 4|
| 1003| wangwu| M| 21|1994-05-13 01:12:00| Wangwu| m| WANGWU| 2| 6|
| 1004| zhaoliu| F| 21|1998-02-03 03:02:48| Zhaoliu| f| ZHAOLIU| 2| 7|
| 1004| zhouqi| M| 22|1996-06-04 05:28:19| Zhouqi| m| ZHOUQI| 2| 6|
| 1001| qianba| M| 19|1997-10-14 08:09:47| Qianba| m| QIANBA| 2| 6|
| 1003| liuliu| F| 23|2004-09-14 16:25:30| Liuliu| f| LIULIU| 2| 6|
+-------+--------+------+---+-------------------+--------+-----+--------+-----+------+
+-------+--------+------+---+-------------------+------------+-------------+----+-----+------+------+---+
|classId| name|gender|age| birthday| struct| array|when|isnan|isnan1|isnull|lit|
+-------+--------+------+---+-------------------+------------+-------------+----+-----+------+------+---+
| 1001|zhangsan| F| 20|1995-12-11 12:12:13|[zhangsan,F]|[zhangsan, F]| 1|false| false| false| 00|
| 1002| lisi| M| 16|2000-01-14 10:10:57| [lisi,M]| [lisi, M]| 0|false| false| false| 00|
| 1003| wangwu| M| 21|1994-05-13 01:12:00| [wangwu,M]| [wangwu, M]| 0|false| false| false| 00|
| 1004| zhaoliu| F| 21|1998-02-03 03:02:48| [zhaoliu,F]| [zhaoliu, F]| 1|false| false| false| 00|
| 1004| zhouqi| M| 22|1996-06-04 05:28:19| [zhouqi,M]| [zhouqi, M]| 0|false| false| false| 00|
| 1001| qianba| M| 19|1997-10-14 08:09:47| [qianba,M]| [qianba, M]| 0|false| false| false| 00|
| 1003| liuliu| F| 23|2004-09-14 16:25:30| [liuliu,F]| [liuliu, F]| 1|false| false| false| 00|
+-------+--------+------+---+-------------------+------------+-------------+----+-----+------+------+---+
+------------------+------------------+--------------------+-------------+
| avg(age)| avg(age)| collect_list(name)|count(gender)|
+------------------+------------------+--------------------+-------------+
|20.285714285714285|20.285714285714285|[zhangsan, lisi, ...| 7|
+------------------+------------------+--------------------+-------------+