Spark: The Definitive Guide, Chapter 6: Working with Different Types of Data

Author: lzw2016

Original article: https://blog.csdn.net/lzw2016/article/details/91125941

Chapter 6: Working with Different Types of Data

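As the title says, this chapter covers how to use DataFrame methods to work with different types of data: Booleans, numbers, strings, dates and timestamps, nulls, the complex types Array, Map, and Struct, and user-defined functions.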

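Where to Look for APIs

Where do you find the DataFrame (or Dataset) methods? A DataFrame is just a Dataset of Row objects, so in the end these are Dataset methods, and the only authoritative place to look is the official API documentation (linked in the original post).

Dataset also has several sub-modules, such as DataFrameStatFunctions, which holds statistics-related functions, and DataFrameNaFunctions, which handles null data.

Column methods are documented separately (linked in the original post).

There are also a number of SQL-related functions (linked in the original post).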

Working with Booleans

The data file used here is data/retail-data/by-day/2010-12-01.csv

scala> val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/2010-12-01.csv")
df: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]

scala> df.printSchema
root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

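There is not much to say here. Booleans come down to true and false, comparisons (equal, not equal, greater than, less than, and so on), and the and/or/not operators. In Spark they are used as follows: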

import org.apache.spark.sql.functions.{col, expr, not}

// Equal to
df.where(col("InvoiceNo").equalTo(536365)).show()
df.where(expr("InvoiceNo=536365")).show()
df.where("InvoiceNo=536365").show()
df.where(col("InvoiceNo")===536365).show()
// Not equal to
df.where(not(col("InvoiceNo").equalTo(536365))).show()
df.where(!col("InvoiceNo").equalTo(536365)).show()
df.where(col("InvoiceNo")=!=536365).show()
df.where(expr("InvoiceNo!=536365")).show()
df.where("InvoiceNo!=536365").show()
// In both Scala and Python, the SQL-style <> operator also works
df.where("InvoiceNo <> 536365").show()

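On and, or, and not: as mentioned before, filters chained one after another (an implicit and) are merged by Spark into a single statement and evaluated together, whereas or conditions must be written inside the same statement. not simply negates a condition, as in the code above.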

val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.where(col("StockCode").isin("DOT")).where(priceFilter.or(descripFilter)).show()
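
To make the point about chained filters concrete, here is a small plain-Python sketch. It is not Spark code: the three rows are made-up sample data and the two predicate helpers are hypothetical names. It only shows that filtering twice in a row keeps the same rows as one combined and condition.

```python
# Made-up rows standing in for a few retail records (not taken from the real CSV).
rows = [
    {"StockCode": "DOT", "UnitPrice": 700.0, "Description": "DOTCOM POSTAGE"},
    {"StockCode": "DOT", "UnitPrice": 5.0, "Description": "MISC"},
    {"StockCode": "22423", "UnitPrice": 900.0, "Description": "CAKESTAND"},
]

def is_dot(row):
    """Mirrors the StockCode filter."""
    return row["StockCode"] == "DOT"

def pricey_or_postage(row):
    """Mirrors priceFilter.or(descripFilter): both halves live in one predicate."""
    return row["UnitPrice"] > 600 or "POSTAGE" in row["Description"]

# Two filters applied one after another ...
chained = [r for r in [r for r in rows if is_dot(r)] if pricey_or_postage(r)]
# ... keep exactly the rows that a single "and" condition keeps.
combined = [r for r in rows if is_dot(r) and pricey_or_postage(r)]

print(chained == combined)
print([r["Description"] for r in combined])
```

An or condition cannot be split across two chained filters this way, which is why it has to be written as a single expression.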

Boolean expressions can also be used elsewhere, for example to add a new column:

val DOTCodeFilter = col("StockCode") === "DOT"
val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter)))
  .as("isExpensive") // aliasing the DataFrame is unnecessary here
  .select("unitPrice", "isExpensive").show(5)

// Keep only the rows flagged as expensive
df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter)))
  .where("isExpensive = true").show()

If the column being compared may contain null, it is best to use eqNullSafe:

scala> df.where(col("Description").equalTo("LOVE BUILDING BLOCK WORD")).show
scala> df.where(col("Description").eqNullSafe("LOVE BUILDING BLOCK WORD")).show
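
To see why this matters, the sketch below models the two kinds of comparison in plain Python. It is not Spark code: None stands in for null, and the helper names sql_equal and null_safe_equal are made up for illustration.

```python
from typing import Optional

def sql_equal(a, b) -> Optional[bool]:
    """Model of ordinary SQL equality: comparing with null yields null (None)."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_equal(a, b) -> bool:
    """Model of null-safe equality (eqNullSafe): always True or False."""
    if a is None or b is None:
        return a is None and b is None
    return a == b

descriptions = ["LOVE BUILDING BLOCK WORD", None, "POSTAGE"]
target = "LOVE BUILDING BLOCK WORD"

# A filter keeps a row only when the predicate is exactly True,
# so a None result drops the row just as False does.
plain = [sql_equal(d, target) for d in descriptions]
safe = [null_safe_equal(d, target) for d in descriptions]

print(plain)
print(safe)
print(sql_equal(None, None), null_safe_equal(None, None))
```

Against a non-null literal both versions keep the same rows. The difference appears when both sides can be null, or when the comparison is negated: negating an unknown result is still unknown, so those rows stay filtered out.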

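Additional notes

How do you remove duplicates?
df.distinct() // deduplicate on entire rows
df.dropDuplicates("InvoiceNo", "InvoiceDate") // deduplicate on specific columns
How do you check for null?
// Use isNull, isNotNull, and isNaN (although NaN does not really count as "empty")
df.where(col("Description").isNull).show
What is the difference between NaN and null?
null is a missing value, whereas NaN ("not a number") is the result of an undefined floating-point operation such as 0.0/0.0. In PySpark you can create a NaN with float("nan"); in Scala, use Double.NaN.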
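
The distinction is easy to check in plain Python, where None plays the role of null. This is only an illustration of the two concepts, not Spark code.

```python
import math

nan = float("nan")

# NaN is a real floating-point value, but it never equals anything, itself included.
print(nan == nan)
print(math.isnan(nan))

# None means "no value at all"; it is not a number, so it is tested by identity.
missing = None
print(missing is None)

# Undefined floating-point arithmetic produces NaN; inf - inf is one example.
# (Python raises ZeroDivisionError for 0.0 / 0.0, so that case is not used here.)
undefined = math.inf - math.inf
print(math.isnan(undefined))
```

This is why the null check and the NaN check are separate methods: a column can hold NaN without holding null, and the other way round.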

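Working with Numbers

This is mostly ordinary arithmetic (add, subtract, multiply, divide) plus a few functions such as pow. Two more functions are mentioned: round, for rounding, and corr, which computes the Pearson correlation coefficient.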
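
As a reminder of what a Pearson correlation coefficient measures, here is a minimal plain-Python sketch of the formula. The function name pearson and the two sample lists are made up for illustration; this is not how Spark computes it internally.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: price falls steadily as quantity rises.
quantity = [1, 2, 3, 4, 5]
unit_price = [10.0, 8.0, 6.0, 4.0, 2.0]

r = pearson(quantity, unit_price)
print(r)
```

A result near -1 means the two columns move in opposite directions, near 1 means they move together, and near 0 means no linear relationship.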

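round() rounds half up, while bround() uses HALF_EVEN (banker's) rounding: a tie goes to the nearest even number rather than always being rounded down.
import org.apache.spark.sql.functions.{bround, lit, round}
// round gives 3.0, bround gives 2.0
df.select(round(lit(2.5)), bround(lit(2.5))).show(2)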
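
The value 2.5 is a tie case, which is exactly where the two functions differ. Spark documents round as using HALF_UP mode and bround as using HALF_EVEN mode. The sketch below reproduces those two modes with Python's decimal module; the helper names are made up, and this only models the rounding rule, it does not call Spark.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def round_half_up(value: str) -> Decimal:
    """Ties go away from zero (the mode documented for Spark's round)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP)

def round_half_even(value: str) -> Decimal:
    """Ties go to the nearest even digit (the mode documented for bround)."""
    return Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN)

# Only the values ending in .5 are ties; the others round the same way in both modes.
for v in ["1.5", "2.5", "3.5", "2.4", "2.6"]:
    print(v, round_half_up(v), round_half_even(v))
```

Note that 3.5 goes to 4 in both modes, so bround does not simply round down; it only differs from round on ties whose lower neighbour is even.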

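Working with Strings

These are the usual string operations, such as changing case, trimming leading and trailing whitespace, splitting, and taking substrings. See the String functions section of the documentation (linked in the original post).

Working with Dates and Timestamps

Open the documentation link and search for: Date time functions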

Working with Nulls in Data

Back to basics: a pandas DataFrame offers methods such as fillna, dropna, isnull, and isna for missing data, and Spark SQL has counterparts. They live under the DataFrame's na accessor (df.na._) and in sql.functions._.
For null checks, isNull, isNotNull, and isNaN were covered above. SQL also provides ifnull, nullif, nvl, and nvl2:

ifnull(expr1, expr2) and nvl(expr1, expr2): return expr2 if expr1 is null, otherwise expr1
nullif(expr1, expr2): returns null if expr1 equals expr2, otherwise expr1
nvl2(expr1, expr2, expr3): returns expr3 if expr1 is null, otherwise expr2
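
The argument order of these four functions is easy to mix up, so here they are modelled as plain Python functions, with None standing in for null. These are illustrative re-implementations of the rules listed above, not the Spark functions themselves.

```python
def ifnull(expr1, expr2):
    """Return expr2 when expr1 is null, otherwise expr1."""
    return expr2 if expr1 is None else expr1

# nvl follows the same rule as ifnull.
nvl = ifnull

def nullif(expr1, expr2):
    """Return null when the two values are equal, otherwise expr1."""
    return None if expr1 == expr2 else expr1

def nvl2(expr1, expr2, expr3):
    """Return expr3 when expr1 is null, otherwise expr2."""
    return expr3 if expr1 is None else expr2

print(ifnull(None, "fallback"), nvl("value", "fallback"))
print(nullif("same", "same"), nullif("left", "right"))
print(nvl2(None, "not null", "was null"), nvl2("x", "not null", "was null"))
```

Note that nvl2 never returns its first argument; it only uses it to decide which of the other two to return.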

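Then there is drop, which removes rows containing null, and fill, which fills in one or more columns (documentation linked in the original post).

// By default, drop any row that contains a null value
df.na.drop() // same as df.na.drop("any")
df.na.drop("all") // drop a row only if every column is null
df.na.drop("all", Seq("col1","col2")) // the check can also be limited to specific columns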
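
The difference between "any" and "all", with and without a column subset, can be sketched in plain Python over a list of dictionaries. The function drop_nulls and the sample rows are made up for illustration; this models the rule and is not the Spark implementation.

```python
def drop_nulls(rows, how="any", subset=None):
    """Drop rows where any (or all) of the checked columns are None."""
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row.keys())
        null_flags = [row[c] is None for c in columns]
        should_drop = any(null_flags) if how == "any" else all(null_flags)
        if not should_drop:
            kept.append(row)
    return kept

# Made-up rows: one complete, one partly null, one entirely null.
rows = [
    {"StockCode": "DOT", "Description": "POSTAGE"},
    {"StockCode": "85123A", "Description": None},
    {"StockCode": None, "Description": None},
]

print(len(drop_nulls(rows)))                               # default "any"
print(len(drop_nulls(rows, "all")))                        # only fully null rows go
print(len(drop_nulls(rows, "all", ["Description"])))       # check one column only
```

With a subset, only the listed columns are inspected, so "all" on a single column behaves like "any" on that column.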
