"Spark: The Definitive Guide" Chapter 6: Working with Different Types of Data


As the title suggests, this chapter is about using DataFrame methods to work with different types of data, specifically: booleans, numbers, strings, dates and timestamps, nulls, complex types (Array, Map, Struct), and user-defined functions.

Where to find the right methods

DataFrame (or Dataset) methods: since a DataFrame is just a Dataset of Row, these are ultimately Dataset methods. Where to find them? Only the official docs: link here

Dataset also has several sub-modules, such as DataFrameStatFunctions, which contains the various statistics-related functions, and DataFrameNaFunctions, which handles null data.

Column-related methods are here: link here

There are also the SQL-related functions: link here

Working with Booleans

The data file used this time is data/retail-data/by-day/2010-12-01.csv

scala> val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/2010-12-01.csv")
df: org.apache.spark.sql.DataFrame = [InvoiceNo: string, StockCode: string ... 6 more fields]

scala> df.printSchema
root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)

There isn't really much to say here: booleans come down to true, false, logical comparisons (equals, not equals, greater than, less than, etc.), and the and/or/not operators. Here is how they look in Spark:

// in spark-shell these column helpers come from sql.functions
import org.apache.spark.sql.functions._

// equals
df.where(col("InvoiceNo").equalTo(536365)).show()
df.where(expr("InvoiceNo = 536365")).show()
df.where("InvoiceNo = 536365").show()
df.where(col("InvoiceNo") === 536365).show()
// not equals
df.where(not(col("InvoiceNo").equalTo(536365))).show()
df.where(!(col("InvoiceNo").equalTo(536365))).show()
df.where(col("InvoiceNo") =!= 536365).show()
df.where(expr("InvoiceNo != 536365")).show()
df.where("InvoiceNo != 536365").show()
// in Scala and Python you can also write
df.where("InvoiceNo <> 536365").show()

As for and/or/not: as mentioned earlier, filters chained one after another with and are flattened by Spark into a single statement and executed together, while or conditions must be written within a single statement; not simply negates, as in the code above.

val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.where(col("StockCode").isin("DOT")).where(priceFilter.or(descripFilter)).show()

Boolean expressions can also be used in other places, such as adding a new column:

val DOTCodeFilter = col("StockCode") === "DOT"
val priceFilter = col("UnitPrice") > 600
val descripFilter = col("Description").contains("POSTAGE")
df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter)))
  .as("isExpensive") // this Dataset alias is unnecessary
  .select("unitPrice", "isExpensive").show(5)

df.withColumn("isExpensive", DOTCodeFilter.and(priceFilter.or(descripFilter))).where("isExpensive = true").show()

If the column being compared may contain nulls, it is safer to use eqNullSafe:

scala> df.where(col("Description").equalTo("LOVE BUILDING BLOCK WORD")).show
scala> df.where(col("Description").eqNullSafe("LOVE BUILDING BLOCK WORD")).show

Additional notes

  • How to deduplicate?
df.distinct() // deduplicate across all columns
df.dropDuplicates("InvoiceNo", "InvoiceDate") // deduplicate by specific columns
  • How to test for null?
// specifically isNull, isNotNull and isNaN (though NaN is not really "null")
df.where(col("Description").isNull).show
  • What is the difference between NaN and NULL?
    null is a missing value, while NaN means "not a number", the result of a meaningless mathematical operation such as 0/0; in PySpark you can create one with float("nan") (a small Scala check follows below).
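In Scala the equivalent is Double.NaN, and sql.functions provides isnan for detecting it. A minimal check (the single-row range is just a throwaway DataFrame for illustration):

import org.apache.spark.sql.functions.{col, isnan, lit}
// lit(Double.NaN) builds a NaN column; isnan() is true for it, while isNull would not be
spark.range(1).select(lit(Double.NaN).alias("n")).where(isnan(col("n"))).show()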

Working with Numbers

Just the normal add/subtract/multiply/divide operations, plus a few functions such as pow. Two more functions get a mention here: round, for rounding, and corr, for computing the Pearson correlation coefficient.
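A quick sketch of pow and corr on the same retail df (assuming the sql.functions wildcard import from earlier; the derived "realQuantity" column is just an illustration):

// (Quantity * UnitPrice)^2 + 5 as a made-up derived column
val fabricatedQuantity = pow(col("Quantity") * col("UnitPrice"), 2) + 5
df.select(col("CustomerID"), fabricatedQuantity.alias("realQuantity")).show(2)

// Pearson correlation between Quantity and UnitPrice, via the function or via df.stat
df.select(corr("Quantity", "UnitPrice")).show()
df.stat.corr("Quantity", "UnitPrice")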

round() rounds up on a tie (2.5 becomes 3.0), while bround() rounds the other way on a tie (2.5 becomes 2.0, using HALF_EVEN, i.e. banker's rounding).

// one gives 3.0, the other 2.0
df.select(round(lit("2.5")), bround(lit("2.5"))).show(2)

Working with Strings

These are the usual string operations: case conversion, trimming leading and trailing whitespace, splitting, substrings and so on; see "String functions" under the link above.
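A few representative examples on the Description column (a sketch, again assuming the sql.functions wildcard import; the color list for the regex is just an illustration):

// case conversion and whitespace trimming
df.select(initcap(col("Description")), lower(col("Description")), upper(col("Description"))).show(2)
df.select(ltrim(lit("    HELLO    ")), rtrim(lit("    HELLO    ")), trim(lit("    HELLO    "))).show(2)

// regexp_replace: collapse several color names in the description into the word "COLOR"
val regexString = Seq("black", "white", "red", "green", "blue").map(_.toUpperCase).mkString("|")
df.select(regexp_replace(col("Description"), regexString, "COLOR").alias("color_clean"), col("Description")).show(2)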

Working with Dates and Timestamps

Open the link and search for "Date time functions".
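A few common ones, sketched on a throwaway DataFrame (assuming the sql.functions wildcard import; the date literal and format string are just examples, and the two-argument to_date needs Spark 2.2 or later):

val dateDF = spark.range(10)
  .withColumn("today", current_date())
  .withColumn("now", current_timestamp())

// add or subtract days, and take the difference between two dates
dateDF.select(date_sub(col("today"), 5), date_add(col("today"), 5)).show(1)
dateDF.select(datediff(col("today"), date_add(col("today"), 7))).show(1)

// parse a string into a date with an explicit format
spark.range(1).select(to_date(lit("2017-20-12"), "yyyy-dd-MM")).show(1)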

Working with Nulls

Back to basics: pandas DataFrames have methods for handling null data, such as fillna, dropna, isnull and so on, and Spark SQL has corresponding methods, found on the DataFrame's na sub-object (df.na, a DataFrameNaFunctions) as well as in sql.functions._.
For testing whether a value is null, we already covered isNull (and isNaN) and isNotNull; there are also a few null-related functions meant for SQL: ifnull, nullif, nvl and nvl2 (a quick spark.sql check follows the list below).

  • ifnull(expr1, expr2) / nvl(expr1, expr2): returns expr2 if expr1 is null, otherwise expr1
  • nullif(expr1, expr2): returns null if expr1 equals expr2, otherwise expr1
  • nvl2(expr1, expr2, expr3): returns expr3 if expr1 is null, otherwise expr2
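These are SQL functions, so the quickest way to try them is through spark.sql. A minimal sketch, reusing the hypothetical dfTable temp view registered earlier (any single-row source would do):

spark.sql("""
  SELECT
    ifnull(null, 'return_value')                   AS ifnull_demo,
    nullif('value', 'value')                       AS nullif_demo,
    nvl(null, 'return_value')                      AS nvl_demo,
    nvl2('not_null', 'return_value', 'else_value') AS nvl2_demo
  FROM dfTable LIMIT 1
""").show()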

Then there is drop, which removes rows containing null, and fill, which fills nulls in one or more columns; documentation link here.

// by default, drop removes a row if any value in it is null
df.na.drop() // same as df.na.drop("any")
df.na.drop("all") // drop only if all columns are null
df.na.drop("all", Seq("col1", "col2")) // can also restrict the check to specific columns