"Spark: The Definitive Guide" Chapter 5: Basic Structured API Operations

This post walks through the basic operations of Spark's structured APIs: creating DataFrames, working with columns, and transforming and querying data. It focuses on processing data with select, selectExpr, withColumn, drop, and withColumnRenamed, as well as filtering, deduplicating, sorting, and sampling rows. It also covers DataFrame operations such as join, union, repartition, and sortWithinPartitions, and how to bring results back to the driver program.

Chapter 5: Basic Structured API Operations

Preface

"Spark: The Definitive Guide" study plan

Schemas

I am using the 2015-summary.csv file from the data that accompanies the book.

scala> val df = spark.read.format("csv").option("header","true").option("inferSchema","true").load("data/2015-summary.csv")
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> df.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)

Calling printSchema prints the Schema of df. A Schema can come from two places: it can be inferred from the data as it is read, as above (schema-on-read), or it can be defined by hand. Which one to use depends on the situation: if you do not know the format of the input data, rely on inference; if you do know it, or you are cleaning data in an ETL pipeline, define the Schema explicitly, because an inferred Schema will change whenever the format of the incoming data changes.

Let's look at what a Schema actually is. As the output below shows, a hand-written Schema is built from two types, StructType and StructField, and each field carries a name, a data type, and a flag saying whether the value may be null or missing.

scala> spark.read.format("csv").load("data/2015-summary.csv").schema
res1: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,IntegerType,true))

Here is an example of a hand-written Schema: first import StructType, StructField, and the relevant built-in data types (the Spark types covered in Chapter 4), then define your own Schema, and finally pass it in through the schema method when reading the data.

scala> import org.apache.spark.sql.types.{StructType,StructField,StringType,LongType}
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

scala> val mySchema = StructType(Array(
     |  StructField("DEST_COUNTRY_NAME",StringType,true),
     |  StructField("ORIGIN_COUNTRY_NAME",StringType,true),
     |  StructField("count",LongType,true)
     | ))
mySchema: org.apache.spark.sql.types.StructType = StructType(StructField(DEST_COUNTRY_NAME,StringType,true), StructField(ORIGIN_COUNTRY_NAME,StringType,true), StructField(count,LongType,true))

scala> val df = spark.read.format("csv").schema(mySchema).load("data/2015-summary.csv")
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> df.printSchema
root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: long (nullable = true)

Note that StringType and LongType here are simply the Spark types discussed in Chapter 4. Also, where a hand-written Schema really earns its keep is in converting an RDD into a DataFrame; see my earlier notes.
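As a rough sketch of that RDD-to-DataFrame case (my own example, not one of the book's listings; the two sample rows are made up), the mySchema defined above can be attached to an RDD of Row objects with createDataFrame:

import org.apache.spark.sql.Row

// two hand-written records matching mySchema (count is LongType, hence the Long literals)
val myRows = Seq(Row("China", "United States", 1L), Row("Japan", "China", 5L))

// parallelize them into an RDD, then attach the schema explicitly
val myRDD = spark.sparkContext.parallelize(myRows)
val myDF  = spark.createDataFrame(myRDD, mySchema)

myDF.printSchema   // same layout as above: two strings and a long, all nullable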

Columns and Expressions

I think the book spends more time on this than it needs to; the essence is simply how to refer to a column in Spark SQL. The options are listed below.

df.select("count").show
df.select(df("count")).show
df.select(df.col("count")).show // col can also be written as column; you can drop df and call col directly
df.select($"count").show // Scala-only syntax with no performance benefit, just good to know (the book also mentions the symbol form, e.g. 'count)
df.select(expr("count")).show
df.select(expr("count"),expr("count")+1 as "count+1").show(5) // as assigns an alias
df.select(expr("count+1")+1).show(5)
df.select(col("count")+1).show(5)

That about covers it. The main ones to note are col and expr; the difference is that expr can take a whole expression string as its argument, so expr("count + 1") is equivalent to expr("count") + 1.

One more note: SQL's select * from xxx can be written in Spark SQL as df.select("*"), df.select(expr("*")), or df.select(col("*")).

The book also explains why the different spellings of the same expression above (for example expr("count + 1"), expr("count") + 1, and col("count") + 1) behave identically: Spark compiles them into the same logical tree, which is then evaluated in the same order. If you have studied compilers, think of top-down parsing and LL(1) leftmost derivation.
For example, (((col("someCol") + 5) * 200) - 6) < col("otherCol") corresponds to the logical tree below.

(Figure: the logical tree for the expression above)
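To check the equivalence yourself, here is a small sketch (my own addition, not a listing from the book): spell the same arithmetic three different ways and compare what explain prints; the plans should match apart from internal expression IDs.

import org.apache.spark.sql.functions.{col, expr}

// three spellings of the same expression tree: count + 1
df.select(expr("count + 1")).explain()
df.select(col("count") + 1).explain()
df.select(expr("count") + 1).explain()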

Records and Rows

As discussed in Chapter 4, DataFrame = Dataset[Row], so each record in a DataFrame is an object of type Row. Spark manipulates Row objects with column expressions to produce usable values. Internally a Row is represented as an array of bytes, and because we only ever touch Rows through column expressions, that byte representation is never exposed to the end user.

Let's build a Row object of our own:

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val myRow = Row("China",null,1,true)
myRow: org.apache.spark.sql.Row = [China,null,1,true]
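To read values back out of a Row, you access it by position; a minimal sketch against the myRow just defined (the generic apply returns Any, while the typed getters return concrete types):

myRow(0)                       // Any = China
myRow(0).asInstanceOf[String]  // String = China
myRow.getString(0)             // String = China
myRow.getInt(2)                // Int = 1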