Working with Complex Data Types
This is an excerpt from my personal translation of Chapter 6 of *Spark: The Definitive Guide*; I felt the book did not cover this topic in enough depth, so I have added some extra material.
For more, see my translation notes for the whole series (*Spark: The Definitive Guide*), collected at: josonle/Spark-The-Definitive-Guide-Learning
Working with Structs
A struct, like a C struct, can contain fields of different types. Using the same data as before, first create a DataFrame that contains a struct column:
scala> val complexDF = df.selectExpr("struct(Description,InvoiceNo) as complex","Description","InvoiceNo")
scala> complexDF.printSchema
root
|-- complex: struct (nullable = false)
| |-- Description: string (nullable = true)
| |-- InvoiceNo: string (nullable = true)
|-- Description: string (nullable = true)
|-- InvoiceNo: string (nullable = true)
complexDF now carries a complex (struct) column. It is used like any other DataFrame; the difference lies in how you reach the fields inside the struct complex. There are several ways:
complexDF.select(col("complex").getField("Description")).show(5, false) // getField works; getItem does too, though the two differ
complexDF.select("complex.Description").show(5, false) // or use dot notation [`.`]; `.*` selects every field
// SQL
complexDF.createOrReplaceTempView("complex_df")
spark.sql("select complex.* from complex_df").show(5,false)
spark.sql("select complex.Description from complex_df").show(5,false)
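To make the getField/getItem distinction mentioned above concrete, here is a minimal self-contained sketch. The SparkSession setup and the two sample rows are my own stand-ins for the book's retail dataset, not taken from it:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

val spark = SparkSession.builder().master("local[1]").appName("struct-demo").getOrCreate()
import spark.implicits._

// two hypothetical rows standing in for the retail data used above
val df = Seq(
  ("WHITE HANGING HEART T-LIGHT HOLDER", "536365"),
  ("WHITE METAL LANTERN", "536365")
).toDF("Description", "InvoiceNo")

val complexDF = df.select(struct(col("Description"), col("InvoiceNo")).alias("complex"))

// getField is documented for StructType: extract a field by name
complexDF.select(col("complex").getField("Description")).show(false)

// getItem is documented for arrays (by ordinal) and maps (by key);
// passing a field name also resolves struct fields in practice
complexDF.select(col("complex").getItem("Description")).show(false)
```

Both calls produce the same column here; the difference is what each method is documented for, which starts to matter when the column is an array or a map rather than a struct.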
Working with Arrays
As the name suggests, an array can only hold elements of a single type. To create a DataFrame with an Array column, the book introduces the string split function, which splits a string around its second argument, a regular-expression pattern, and returns a Column of array type:
def split(str: Column, pattern: String): Column — Splits str around pattern (pattern is a regular expression).
// Scala
scala> import org.apache.spark.sql.functions.split
import org.apache.spark.sql.functions.split
// split Description on spaces
scala> df.select(split(col("Description")," ")).printSchema
root
|-- split(Description, ): array (nullable = true)
| |-- element: string (containsNull = true)
scala> df.select(split(col("Description")," ")).show(2)
+---------------------+
|split(Description, )|
+---------------------+
| [WHITE, HANGING, ...|
| [WHITE, METAL, LA...|
+---------------------+
// SQL version: SELECT split(Description, ' ') FROM dfTable
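Since the second argument of split is a regular expression, patterns beyond a literal space work too. A small sketch, assuming the same df with a Description column as above:

```scala
import org.apache.spark.sql.functions.{col, split}

// "\\s+" collapses runs of whitespace into a single delimiter,
// unlike the literal single-space pattern used above
df.select(split(col("Description"), "\\s+").alias("array_col")).show(2, false)
```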
Spark materializes this complex type as a column of its own, which you can then query with array-index syntax much like Python's:
scala> df.select(split(col("Description")," ").alias("array_col")).select(expr("array_col[0]")).show(2)
+------------+
|array_col[0]|
+------------+
| WHITE|
| WHITE|
+------------+
// SQL version: SELECT split(Description, ' ')[0] FROM dfTable
// getItem works as well
scala> df.select(split(col("Description")," ").alias("array_col")).select(col("array_col").getItem(0)).show(2)
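On Spark 2.4 and later there is also element_at, which for arrays uses 1-based indexing (a hedged sketch, assuming the same df as above):

```scala
import org.apache.spark.sql.functions.{col, element_at, split}

// element_at indexes arrays starting from 1 (and looks up keys for maps)
df.select(split(col("Description"), " ").alias("array_col"))
  .select(element_at(col("array_col"), 1))
  .show(2)
```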
To get the length of an array, use the size function (it also works on maps):
def size(e: Column): Column — Returns length of array or map.
scala> import org.apache.spark.sql.functions.size
import org.apache.spark.sql.functions.size
// columns here are written with the $ interpolator
scala> df.select(split($"Description", " ").alias("array_col")).withColumn("no_of_array",size($"array_col")).show(2,false)
+----------------------------------------+-----------+
|array_col |no_of_array|
+----------------------------------------+-----------+
|[WHITE, HANGING, HEART, T-LIGHT, HOLDER]|5 |
|[WHITE, METAL, LANTERN] |3 |
+----------------------------------------+-----------+
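Since size also works on maps, here is a quick sketch; the one-entry map column built from InvoiceNo is my own illustration, not from the book:

```scala
import org.apache.spark.sql.functions.{col, lit, map, size}

// build a one-entry MapType column, then measure it with size
val mapDF = df.select(map(lit("invoice"), col("InvoiceNo")).alias("m"))
mapDF.select(size(col("m")).alias("no_of_entries")).show(2)
```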
To test whether an array contains a given element, use array_contains:
def array_contains(column: Column, value: Any): Column — Returns null if the array is null, true if the array contains value, and false otherwise.
It is most often used as a where-clause predicate.
scala> import org.apache.spark.sql.functions.array_contains
import org.apache.spark.sql.functions.array_contains
scala> df.select(split(col("Description"), " ").alias("array_col")).withColumn("contains_WHITE",array_contains($"array_col","WHITE")).show(5,false)
+------------------------------------------+--------------+
|array_col |contains_WHITE|
+------------------------------------------+--------------+
|[WHITE, HANGING, HEART, T-LIGHT, HOLDER] |true |
|[WHITE, METAL, LANTERN] |true |
|[CREAM, CUPID, HEARTS, COAT, HANGER] |false |
|[KNITTED, UNION, FLAG, HOT, WATER, BOTTLE]|false |
|[RED, WOOLLY, HOTTIE, WHITE, HEART.] |true |
+------------------------------------------+--------------+
// the SQL version is the same
scala> val df1 = df.select(split(col("Description"), " ").alias("array_col"))
df1: org.apache.spark.sql.DataFrame = [array_col: array<string>]
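As noted above, array_contains is typically used as a where predicate; a minimal sketch continuing from df1:

```scala
import org.apache.spark.sql.functions.array_contains
import spark.implicits._ // needed for the $ syntax outside the shell

// keep only the rows whose Description mentions WHITE
df1.where(array_contains($"array_col", "WHITE")).show(5, false)
```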