Spark 2.1: Spark SQL, DataFrame, and Dataset Guide

hopeatme

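1 Overview

Spark SQL is the Spark module for processing structured data. Unlike the basic Spark RDD API, the Spark SQL interfaces give Spark extra information about both the structure of the data and the computation being performed, and Spark SQL uses this extra information internally to optimize execution. There are several ways to use Spark SQL, including SQL and the Dataset API. The same execution engine computes the result regardless of which API or language expresses the computation. In other words, as long as the data-processing logic stays the same, developers can switch freely between the APIs.

All examples in this guide use sample data included in the Spark distribution and can be run in the spark-shell, pyspark, or sparkR shell.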

1.1 SQL

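One use of Spark SQL is to execute SQL queries. Spark SQL can also read data from an existing Hive installation. For details on integrating Spark SQL with Hive, see the Hive Tables section: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

When SQL is run from within another programming language, the result is returned as a Dataset or DataFrame. Besides the API, developers can also use the SQL interface from the command line or over JDBC/ODBC.

1.2 Datasets and DataFrames

A Dataset is a distributed collection of data. The Dataset API was added in Spark 1.6 to combine the benefits of RDDs (strong typing and the ability to use lambda functions) with Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, and so on). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python's dynamic nature many of its benefits are already available (for example, a field of a row can be read by name as row.columnName). The case for R is similar.

A DataFrame is a Dataset organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from many sources, such as structured data files, Hive tables, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API developers need to use Dataset<Row> to represent a DataFrame.

Throughout this guide, Scala/Java Datasets of Rows are referred to as DataFrames.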

2 Getting Started

2.1 Starting Point: SparkSession

The entry point to all functionality in Spark is the SparkSession class. To create a SparkSession, use SparkSession.builder():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession
      .builder()
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // For implicit conversions like converting RDDs to DataFrames
    import spark.implicits._

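For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.

SparkSession in Spark 2.0 has built-in support for Hive features, including writing queries in HiveQL, using Hive UDFs, and reading data from Hive tables. None of this requires an existing Hive installation.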
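As a hedged illustration of how these Hive features are typically switched on, the sketch below builds a session with enableHiveSupport(). It assumes the spark-hive module is on the classpath; the local master setting and the application name are placeholders chosen for this sketch, not values from the original guide.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: the spark-hive module is on the classpath; without it,
// the call to enableHiveSupport() fails.
val spark = SparkSession
  .builder()
  .master("local[*]")                 // assumed: local mode, for illustration only
  .appName("Spark SQL Hive sketch")   // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// With Hive support enabled, the session catalog is backed by a Hive metastore.
// When no Hive deployment is configured, Spark uses an embedded local metastore.
val catalogImpl = spark.conf.get("spark.sql.catalogImplementation")
println(catalogImpl)
```

Reading the catalog implementation back is just a quick way to check which catalog the session ended up with.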

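2.2 Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from other Spark data sources. The following example creates a DataFrame from a JSON file: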

    val df = spark.read.json("examples/src/main/resources/people.json")

    // Displays the content of the DataFrame to stdout
    df.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

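For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.

2.3 Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language (DSL) for structured data manipulation in Scala, Java, Python, and R. As mentioned above, in the Spark 2.0 Scala and Java APIs a DataFrame is just a Dataset of Rows. These operations are therefore called "untyped transformations", in contrast to the "typed transformations" of strongly typed Scala/Java Datasets.

Here are some basic examples of structured data processing using Datasets: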

 
    // This import is needed to use the $-notation
    import spark.implicits._

    // Print the schema in a tree format
    df.printSchema()
    // root
    // |-- age: long (nullable = true)
    // |-- name: string (nullable = true)

    // Select only the "name" column
    df.select("name").show()
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+

    // Select everybody, but increment the age by 1
    df.select($"name", $"age" + 1).show()
    // +-------+---------+
    // |   name|(age + 1)|
    // +-------+---------+
    // |Michael|     null|
    // |   Andy|       31|
    // | Justin|       20|
    // +-------+---------+

    // Select people older than 21
    df.filter($"age" > 21).show()
    // +---+----+
    // |age|name|
    // +---+----+
    // | 30|Andy|
    // +---+----+

    // Count people by age
    df.groupBy("age").count().show()
    // +----+-----+
    // | age|count|
    // +----+-----+
    // |  19|    1|
    // |null|    1|
    // |  30|    1|
    // +----+-----+

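For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.

For the complete list of operations that can be performed on a Dataset, see the API documentation: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

In addition to simple column references and expressions, Datasets have a rich library of functions, including string manipulation, date arithmetic, and common math operations. See the DataFrame function reference: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$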
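To give a feel for that function library, here is a minimal sketch that applies a few functions from org.apache.spark.sql.functions. It assumes a local SparkSession and rebuilds the people data in memory so that no file is needed. The master setting, application name, and output column names are illustrative choices, not part of the original guide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, length, lit, upper}

val spark = SparkSession.builder()
  .master("local[*]")              // assumed: local mode, for illustration only
  .appName("functions sketch")     // hypothetical application name
  .getOrCreate()
import spark.implicits._

// In-memory stand-in for people.json (Michael has no age)
val people: Seq[(String, Option[Long])] =
  Seq(("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L)))
val df = people.toDF("name", "age")

val summary = df.select(
  upper(col("name")).as("name_upper"),            // string function
  length(col("name")).as("name_length"),          // string length as an integer
  coalesce(col("age"), lit(0L)).as("age_or_zero") // replace a null age with 0
)
summary.show()
```

Each function returns a Column, so the results compose with select, filter, and groupBy in the same way as the $-notation expressions shown earlier.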

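2.4 Running SQL Queries Programmatically

The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a DataFrame.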

 
    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    val sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

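For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.

2.5 Global Temporary View

Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates. To have a temporary view that is shared among all sessions and stays alive until the Spark application terminates, create a global temporary view. Global temporary views are tied to the system-preserved database global_temp, and must be referenced by their qualified name, e.g. SELECT * FROM global_temp.view1.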

 
    // Register the DataFrame as a global temporary view
    df.createGlobalTempView("people")

    // Global temporary view is tied to a system preserved database `global_temp`
    spark.sql("SELECT * FROM global_temp.people").show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

    // Global temporary view is cross-session
    spark.newSession().sql("SELECT * FROM global_temp.people").show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.
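As a complement, the following minimal sketch contrasts the two scopes by querying a session-scoped view and a global temporary view from a second session. It assumes a local SparkSession and made-up in-memory rows. The view names people_local and people_shared, the master setting, and the application name are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")              // assumed: local mode, for illustration only
  .appName("view scope sketch")    // hypothetical application name
  .getOrCreate()
import spark.implicits._

val df = Seq(("Andy", 30L), ("Justin", 19L)).toDF("name", "age")

df.createOrReplaceTempView("people_local") // visible only in this session
df.createGlobalTempView("people_shared")   // visible in every session of the application

val otherSession = spark.newSession()

// Querying the session-scoped view from another session fails during analysis
val localVisible = Try(otherSession.sql("SELECT * FROM people_local")).isSuccess

// The global view resolves through the global_temp database
val sharedCount = otherSession.sql("SELECT * FROM global_temp.people_shared").count()

println(s"session-scoped view visible elsewhere: $localVisible")
println(s"rows read through the global view: $sharedCount")
```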

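2.6 Creating Datasets

Datasets are similar to RDDs, but instead of Java serialization or Kryo they use a specialized Encoder to serialize objects for processing or for transmission over the network. Both encoders and standard serialization turn an object into bytes, but encoders are dynamically generated code and use a format that lets Spark perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.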

 
    // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
    // you can use custom classes that implement the Product interface
    case class Person(name: String, age: Long)

    // Encoders are created for case classes
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    caseClassDS.show()
    // +----+---+
    // |name|age|
    // +----+---+
    // |Andy| 32|
    // +----+---+

    // Encoders for most common types are automatically provided by importing spark.implicits._
    val primitiveDS = Seq(1, 2, 3).toDS()
    primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

    // DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
    val path = "examples/src/main/resources/people.json"
    val peopleDS = spark.read.json(path).as[Person]
    peopleDS.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.
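To make the contrast between the untyped operations of section 2.3 and typed Dataset operations concrete, here is a minimal sketch. It assumes a local SparkSession and made-up in-memory rows. The master setting, application name, sample values, and the age_next column name are illustrative and not taken from the original example.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Long)

val spark = SparkSession.builder()
  .master("local[*]")                   // assumed: local mode, for illustration only
  .appName("typed vs untyped sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

val people = Seq(Person("Andy", 30L), Person("Justin", 19L)).toDS()

// Untyped: columns are referenced by name and the result is a DataFrame (Dataset[Row]);
// a misspelled column name is only detected when the query is analyzed at runtime
val untyped = people.select($"name", ($"age" + 1).as("age_next"))

// Typed: a lambda over Person objects; the result is a Dataset[Person],
// and a misspelled field name is a compile-time error
val typed = people.map(p => p.copy(age = p.age + 1))

untyped.show()
typed.show()
```

The typed version goes through the Person encoder, so it deserializes each row into an object before applying the lambda. The untyped version stays within Spark SQL's column expressions.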

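2.7 Interoperating with RDDs

Spark SQL supports two methods for converting existing RDDs into Datasets. The first uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when the schema is already known while writing the Spark application.

The second method uses a programmatic interface to construct a schema and then apply it to an existing RDD. Although more verbose, it allows Datasets to be constructed when the columns and their types are not known until runtime.

2.7.1 Inferring the Schema Using Reflection

The Scala interface of Spark SQL supports automatically converting an RDD of case classes to a DataFrame. The case class defines the schema of the table: the names of its arguments are read using reflection and become the column names. Case classes can be nested and can contain complex types such as Seqs or Arrays. The RDD can be implicitly converted to a DataFrame and registered as a table, which can then be queried with SQL.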

 
    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.Encoder

    // For implicit conversions from RDDs to DataFrames
    import spark.implicits._

    // Create an RDD of Person objects from a text file, convert it to a Dataframe
    val peopleDF = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()

    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")

    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

    // The columns of a row in the result can be accessed by field index
    teenagersDF.map(teenager => "Name: " + teenager(0)).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+

    // or by field name
    teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+

    // No pre-defined encoders for Dataset[Map[K,V]], define explicitly
    implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
    // Primitive types and case classes can be also defined as
    // implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()

    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String,T]
    teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
    // Array(Map("name" -> "Justin", "age" -> 19))

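For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.

2.7.2 Programmatically Specifying the Schema

The reflection-based approach requires the number and types of the columns to be known in advance. When they are not, the schema has to be specified programmatically, in three steps:

1 Create an RDD of Rows from the original RDD

2 Create the schema as a StructType that matches the structure of the Rows created in step 1

3 Apply the schema to the RDD of Rows via the createDataFrame method of SparkSession

For example: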

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Create an RDD
    val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

    // The schema is encoded in a string
    val schemaString = "name age"

    // Generate the schema based on the string of schema
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // Convert records of the RDD (people) to Rows
    val rowRDD = peopleRDD
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))

    // Apply the schema to the RDD
    val peopleDF = spark.createDataFrame(rowRDD, schema)

    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")

    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")

    // The results of SQL queries are DataFrames and support all the normal RDD operations
    // The columns of a row in the result can be accessed by field index or by field name
    results.map(attributes => "Name: " + attributes(0)).show()
    // +-------------+
    // |        value|
    // +-------------+
    // |Name: Michael|
    // |   Name: Andy|
    // | Name: Justin|
    // +-------------+

For the complete example, see examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala in the Spark repository.
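The example above declares every column as a string. As a hedged variation on the same three steps, the sketch below reads the column types from a runtime string as well, so the age column becomes an integer. The "name:type" spec format, the two supported type names, and the in-memory records are assumptions made for this sketch, as are the local master setting and application name.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[*]")                  // assumed: local mode, for illustration only
  .appName("runtime schema sketch")    // hypothetical application name
  .getOrCreate()

// Hypothetical runtime inputs: a "name:type" spec and raw comma-separated records
val schemaSpec = "name:string,age:int"
val rawRDD = spark.sparkContext.parallelize(Seq("Michael, 29", "Andy, 30", "Justin, 19"))

// Build the StructType from the spec (only two type names are handled here)
val fields = schemaSpec.split(",").map { entry =>
  val Array(fieldName, typeName) = entry.split(":")
  val dataType: DataType = typeName match {
    case "string" => StringType
    case "int"    => IntegerType
    case other    => throw new IllegalArgumentException(s"Unsupported type: $other")
  }
  StructField(fieldName, dataType, nullable = true)
}
val schema = StructType(fields)
val columnTypes = fields.map(_.dataType)

// Convert each record to a Row whose values match the declared column types
val rowRDD = rawRDD.map(_.split(",").map(_.trim)).map { values =>
  val typed = values.zip(columnTypes).map {
    case (v, IntegerType) => v.toInt
    case (v, _)           => v
  }
  Row.fromSeq(typed.toSeq)
}

// Apply the schema to the RDD of Rows
val peopleDF = spark.createDataFrame(rowRDD, schema)
peopleDF.printSchema()
val olderThan20 = peopleDF.filter("age > 20").count()
```

Because age is now an IntegerType column, numeric predicates such as age > 20 compare integers instead of relying on an implicit cast from strings.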
