[Spark 2.0] Spark SQL, DataFrames and Datasets Guide

yhao浩

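Overview

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. The same execution engine is used to compute a result, regardless of which API or language expresses the computation.

All of the examples on this page use sample data included in the Spark distribution and can be run in spark-shell, pyspark, or sparkR.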

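SQL

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, refer to the Hive Tables section. When SQL is run from within another programming language, the results are returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.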

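Datasets and DataFrames

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java. Python does not support the Dataset API, but because of Python's dynamic nature many of its benefits are already available (for example, you can access a field of a row by name naturally, as in row.columnName). The case for R is similar.

A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row], while in the Java API users need to use Dataset<Row> to represent a DataFrame.

Throughout this document, we will often refer to Scala/Java Datasets of Rows as DataFrames.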

Getting Started

Starting Point: SparkSession

The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
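A minimal sketch of building a session with the builder. The application name and the local[*] master are illustrative assumptions, not values taken from this guide; inside spark-shell a session named spark already exists.

```scala
import org.apache.spark.sql.SparkSession

// Build a new session, or reuse one that is already running in this JVM.
// The app name and the local master are assumptions made for this sketch.
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .master("local[*]")
  .getOrCreate()

// Enables implicit conversions such as Seq/RDD to DataFrame and the $"col" syntax.
import spark.implicits._
```

On a cluster the master is normally supplied by spark-submit rather than hard-coded in the builder.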

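The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

SparkSession in Spark 2.0 provides built-in support for Hive features, including the ability to write queries using HiveQL, access Hive UDFs, and read data from Hive tables. You do not need an existing Hive setup to use these features.

Creating DataFrames

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The following example creates a DataFrame from the contents of a JSON file: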
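A short sketch of reading a JSON file into a DataFrame. It assumes the code is run from the root of a Spark distribution, so that the bundled sample file examples/src/main/resources/people.json resolves; the session settings are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; app name and master are assumptions.
val spark = SparkSession.builder().appName("CreateDataFrame").master("local[*]").getOrCreate()

// people.json ships with Spark and holds one JSON object per line.
val df = spark.read.json("examples/src/main/resources/people.json")

// Print the contents of the DataFrame to stdout.
df.show()
```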

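The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

Untyped Dataset Operations (aka DataFrame Operations)

DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python, and R.

As mentioned above, in Spark 2.0 DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.

Here are some basic examples of structured data processing using Datasets: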

    // This import is needed to use the $-notation
    import spark.implicits._

    // Print the schema in a tree format
    df.printSchema()
    // root
    // |-- age: long (nullable = true)
    // |-- name: string (nullable = true)

    // Select only the "name" column
    df.select("name").show()
    // +-------+
    // |   name|
    // +-------+
    // |Michael|
    // |   Andy|
    // | Justin|
    // +-------+

    // Select everybody, but increment the age by 1
    df.select($"name", $"age" + 1).show()
    // +-------+---------+
    // |   name|(age + 1)|
    // +-------+---------+
    // |Michael|     null|
    // |   Andy|       31|
    // | Justin|       20|
    // +-------+---------+

    // Select people older than 21
    df.filter($"age" > 21).show()
    // +---+----+
    // |age|name|
    // +---+----+
    // | 30|Andy|
    // +---+----+

    // Count people by age
    df.groupBy("age").count().show()
    // +----+-----+
    // | age|count|
    // +----+-----+
    // |  19|    1|
    // |null|    1|
    // |  30|    1|
    // +----+-----+

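The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

For a complete list of the types of operations that can be performed on a Dataset, refer to the API documentation.

In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations, and more. The complete list is available in the DataFrame Function Reference.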
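As an illustration of that function library, the sketch below applies one string function, one math function, and one date function from org.apache.spark.sql.functions. The in-memory rows and the column names (name, age, joined) are hypothetical and chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{current_date, datediff, length, sqrt, to_date, upper}

val spark = SparkSession.builder().appName("FunctionsSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data: (name, age, sign-up date stored as text).
val df = Seq(("Andy", 30, "2016-07-26"), ("Justin", 19, "2016-01-04"))
  .toDF("name", "age", "joined")

val result = df.select(
  upper($"name").as("name_upper"),                                     // string manipulation
  length($"name").as("name_length"),                                   // string length
  sqrt($"age").as("age_sqrt"),                                         // common math
  datediff(current_date(), to_date($"joined")).as("days_since_joined") // date arithmetic
)
result.show()
```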

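Running SQL Queries Programmatically

The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame: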

    // Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    val sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

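The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

Creating Datasets

Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network. While both encoders and standard serialization are responsible for turning an object into bytes, encoders are code generated dynamically and use a format that allows Spark to perform many operations, such as filtering, sorting, and hashing, without deserializing the bytes back into an object.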

    // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
    // you can use custom classes that implement the Product interface
    case class Person(name: String, age: Long)

    // Encoders are created for case classes
    val caseClassDS = Seq(Person("Andy", 32)).toDS()
    caseClassDS.show()
    // +----+---+
    // |name|age|
    // +----+---+
    // |Andy| 32|
    // +----+---+

    // Encoders for most common types are automatically provided by importing spark.implicits._
    val primitiveDS = Seq(1, 2, 3).toDS()
    primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)

    // DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name
    val path = "examples/src/main/resources/people.json"
    val peopleDS = spark.read.json(path).as[Person]
    peopleDS.show()
    // +----+-------+
    // | age|   name|
    // +----+-------+
    // |null|Michael|
    // |  30|   Andy|
    // |  19| Justin|
    // +----+-------+

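The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that lets you construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Seqs or Arrays. This RDD can be implicitly converted to a DataFrame and then registered as a table, which can be used in subsequent SQL statements.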

    import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
    import org.apache.spark.sql.Encoder

    // For implicit conversions from RDDs to DataFrames
    import spark.implicits._

    // Create an RDD of Person objects from a text file, convert it to a DataFrame
    val peopleDF = spark.sparkContext
      .textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
      .toDF()
    // Register the DataFrame as a temporary view
    peopleDF.createOrReplaceTempView("people")

    // SQL statements can be run by using the sql methods provided by Spark
    val teenagersDF = spark.sql("SELECT name, age FROM people WHERE age BETWEEN 13 AND 19")

    // The columns of a row in the result can be accessed by field index
    teenagersDF.map(teenager => "Name: " + teenager(0)).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+

    // or by field name
    teenagersDF.map(teenager => "Name: " + teenager.getAs[String]("name")).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+

    // No pre-defined encoders for Dataset[Map[K,V]], define explicitly
    implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
    // Primitive types and case classes can be also defined as
    implicit val stringIntMapEncoder: Encoder[Map[String, Int]] = ExpressionEncoder()

    // row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
    teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
    // Array(Map("name" -> "Justin", "age" -> 19))

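The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically in three steps:

1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, that matches the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.

For example: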

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Create an RDD
    val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")

    // The schema is encoded in a string
    val schemaString = "name age"

    // Generate the schema based on the string of schema
    val fields = schemaString.split(" ")
      .map(fieldName => StructField(fieldName, StringType, nullable = true))
    val schema = StructType(fields)

    // Convert records of the RDD (people) to Rows
    val rowRDD = peopleRDD
      .map(_.split(","))
      .map(attributes => Row(attributes(0), attributes(1).trim))

    // Apply the schema to the RDD of Rows to obtain a DataFrame
    val peopleDF = spark.createDataFrame(rowRDD, schema)

    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")

    // SQL can be run over a temporary view created using DataFrames
    val results = spark.sql("SELECT name FROM people")

    // The results of SQL queries are DataFrames and support all the normal RDD operations
    // The columns of a row in the result can be accessed by field index or by field name
    results.map(attributes => "Name: " + attributes(0)).show()
    // +-------------+
    // |        value|
    // +-------------+
    // |Name: Michael|
    // |   Name: Andy|
    // | Name: Justin|
    // +-------------+

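The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repository.

Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark Data Sources and then goes into the specific options that are available for the built-in data sources.

Generic Load/Save Functions

In the simplest form, the default data source (parquet, unless otherwise configured by spark.sql.sources.default) will be used for all operations.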

    val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
    usersDF.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

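The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Manually Specifying Options

You can also manually specify the data source that will be used, along with any extra options that you would like to pass to it. Data sources are specified by their fully qualified name (e.g., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names (json, parquet, jdbc). DataFrames loaded from any data source type can be converted into other types using this syntax.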

    val peopleDF = spark.read.format("json").load("examples/src/main/resources/people.json")
    peopleDF.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

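The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Run SQL on Files Directly

Instead of using the read API to load a file into a DataFrame and then querying it, you can also query that file directly with SQL.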
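A minimal sketch of querying a file in place. To stay self-contained it first writes a tiny Parquet dataset to a temporary directory; the sample rows and the directory name are hypothetical. The key part is the FROM clause: the format name followed by the back-quoted path.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlOnFiles").master("local[*]").getOrCreate()
import spark.implicits._

// Write a small Parquet dataset to a temporary location (hypothetical sample data).
val path = Files.createTempDirectory("sql-on-files").resolve("users.parquet").toString
Seq(("Alyssa", "red"), ("Ben", "blue")).toDF("name", "favorite_color").write.parquet(path)

// Query the files directly, without registering a view first.
val sqlDF = spark.sql(s"SELECT name FROM parquet.`$path`")
sqlDF.show()
```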

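The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Save Modes

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realize that these save modes do not use any locking and are not atomic. Additionally, when performing an Overwrite, the existing data is deleted before the new data is written out.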

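Scala/Java	Any Language	Meaning
SaveMode.ErrorIfExists (default)	"error" (default)	When saving a DataFrame to a data source, if data already exists, an exception is thrown.
SaveMode.Append	"append"	When saving a DataFrame to a data source, if data/table already exists, the contents of the DataFrame are appended to the existing data.
SaveMode.Overwrite	"overwrite"	Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, the existing data is overwritten by the contents of the DataFrame.
SaveMode.Ignore	"ignore"	Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation does not save the contents of the DataFrame and does not change the existing data. This is similar to CREATE TABLE IF NOT EXISTS in SQL.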
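The modes in the table can be exercised with a small sketch like the one below. The output directory and the two sample rows are hypothetical; the point is how the row count on disk changes after each write.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("SaveModes").master("local[*]").getOrCreate()
import spark.implicits._

// A path that does not exist yet, inside a fresh temporary directory.
val out = Files.createTempDirectory("save-modes").resolve("people.parquet").toString
val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

df.write.mode(SaveMode.ErrorIfExists).parquet(out) // succeeds: nothing exists at the path yet
df.write.mode(SaveMode.Append).parquet(out)        // adds the same 2 rows again: 4 rows on disk
df.write.mode("ignore").parquet(out)               // data exists, so this write is skipped
val afterIgnore = spark.read.parquet(out).count()

df.write.mode(SaveMode.Overwrite).parquet(out)     // replaces the existing data: 2 rows on disk
val afterOverwrite = spark.read.parquet(out).count()
```

The string form ("ignore") and the enum form (SaveMode.Ignore) are interchangeable; a second write with ErrorIfExists at the same path would throw.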

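Saving to Persistent Tables

DataFrames can also be saved as persistent tables in the Hive metastore using the saveAsTable command. Notice that an existing Hive deployment is not necessary to use this feature. Spark will create a default local Hive metastore (using Derby) for you. Unlike the createOrReplaceTempView command, saveAsTable materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore. Persistent tables still exist after your Spark program has restarted, as long as you keep connecting to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.

By default, saveAsTable creates a "managed table", meaning that the location of the data is controlled by the metastore. Managed tables also have their data deleted automatically when the table is dropped.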
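A sketch of the round trip through a persistent table. It assumes the Hive classes (the spark-hive module) are on the classpath so that enableHiveSupport() succeeds; the table name people_managed and the sample rows are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Hive support makes the table persistent in the (local Derby) metastore.
val spark = SparkSession.builder()
  .appName("PersistentTable")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

val df = Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age")

// Materialize the DataFrame as a managed table; "overwrite" keeps the sketch re-runnable.
df.write.mode("overwrite").saveAsTable("people_managed")

// Later, possibly from another session on the same metastore, load it by name.
val restored = spark.table("people_managed")
restored.show()
```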

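Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When writing Parquet files, all columns are automatically converted to be nullable for compatibility reasons.

Loading Data Programmatically

Using the data from the above example: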

    // Encoders for most common types are automatically provided by importing spark.implicits._
    import spark.implicits._

    val peopleDF = spark.read.json("examples/src/main/resources/people.json")

    // DataFrames can be saved as Parquet files, maintaining the schema information
    peopleDF.write.parquet("people.parquet")

    // Read in the parquet file created above
    // Parquet files are self-describing so the schema is preserved
    // The result of loading a Parquet file is also a DataFrame
    val parquetFileDF = spark.read.parquet("people.parquet")

    // Parquet files can also be used to create a temporary view and then used in SQL statements
    parquetFileDF.createOrReplaceTempView("parquetFile")
    val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
    namesDF.map(attributes => "Name: " + attributes(0)).show()
    // +------------+
    // |       value|
    // +------------+
    // |Name: Justin|
    // +------------+

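The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store all our previously used data in a partitioned table using the following directory structure, with two extra columns, gender and country, as partitioning columns: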

    path
    └── to
        └── table
            ├── gender=male
            │   ├── country=US
            │   │   └── data.parquet
            │   └── country=CN
            │       └── data.parquet
            └── gender=female
                ├── country=US
                │   └── data.parquet
                └── country=CN
                    └── data.parquet

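By passing path/to/table to either SparkSession.read.parquet or SparkSession.read.load, Spark SQL will automatically extract the partitioning information from the paths. The schema of the returned DataFrame becomes: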

    root
    |-- name: string (nullable = true)
    |-- age: long (nullable = true)
    |-- gender: string (nullable = true)
    |-- country: string (nullable = true)

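Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported. Sometimes users may not want the data types of the partitioning columns to be inferred automatically. For these use cases, automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true. When type inference is disabled, string type is used for the partitioning columns.

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the above example, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered a partitioning column. If users need to specify the base path that partition discovery should start from, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table, gender will be a partitioning column.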
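The effect of basePath can be seen in a small sketch. The table location is a temporary directory and the sample rows are hypothetical; partitionBy is used only to produce the gender=.../country=... directory layout described above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PartitionDiscovery").master("local[*]").getOrCreate()
import spark.implicits._

// Write hypothetical data partitioned by gender and country.
val table = Files.createTempDirectory("partition-discovery").resolve("table").toString
Seq(("Andy", 30L, "male", "US"), ("Li", 25L, "female", "CN"))
  .toDF("name", "age", "gender", "country")
  .write.partitionBy("gender", "country").parquet(table)

// Reading the table root discovers both partitioning columns.
val full = spark.read.parquet(table)

// Reading a single partition directory: discovery starts there, so gender is not a column.
val maleOnly = spark.read.parquet(s"$table/gender=male")

// With basePath set to the table root, gender is recovered as a partitioning column.
val maleWithGender = spark.read.option("basePath", table).parquet(s"$table/gender=male")
maleWithGender.printSchema()
```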

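Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge the schemas of all these files.

Since schema merging is a relatively expensive operation and is not a necessity in most cases, it is turned off by default starting from 1.5.0. You may enable it by either of the following:

1. Setting the data source option mergeSchema to true when reading Parquet files (as in the example below).
2. Setting the global SQL option spark.sql.parquet.mergeSchema to true.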

    // This is used to implicitly convert an RDD to a DataFrame.
    import spark.implicits._

    // Create a simple DataFrame, store into a partition directory
    val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
    squaresDF.write.parquet("data/test_table/key=1")

    // Create another DataFrame in a new partition directory,
    // adding a new column and dropping an existing column
    val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
    cubesDF.write.parquet("data/test_table/key=2")

    // Read the partitioned table
    val mergedDF = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    mergedDF.printSchema()

    // The final schema consists of all 3 columns in the Parquet files together
    // with the partitioning column appeared in the partition directory paths
    // root
    // |-- value: int (nullable = true)
    // |-- square: int (nullable = true)
    // |-- cube: int (nullable = true)
    // |-- key : int (nullable = true)

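The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Hive Metastore Parquet Table Conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration and is turned on by default.

Hive/Parquet Schema Reconciliation

There are two key differences between Hive and Parquet from the perspective of table schema processing.

1. Hive is case insensitive, while Parquet is not.
2. Hive considers all columns nullable, while nullability in Parquet is significant.

For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

1. Fields that have the same name in both schemas must have the same data type regardless of nullability. The reconciled field should have the data type of the Parquet side, so that nullability is respected.
2. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
- Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
- Any fields that only appear in the Hive metastore schema are added as nullable fields in the reconciled schema.

Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, the metadata of those converted tables is also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.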
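A minimal sketch of a manual refresh through the session catalog. The table my_table and its single row are hypothetical and exist only to make the example self-contained; in practice the table would be one that an external tool has just modified.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RefreshMetadata").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical table registered in the session catalog.
Seq((1, "a")).toDF("id", "value").write.mode("overwrite").saveAsTable("my_table")

// After the table's files change outside this session, drop the cached metadata
// so that the next query re-reads it.
spark.catalog.refreshTable("my_table")

val rows = spark.table("my_table").count()
```

The same effect is available from SQL with the REFRESH TABLE my_table statement.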

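Configuration

Configuration of Parquet can be done through the session's runtime configuration (spark.conf.set on a Spark 2.0 SparkSession, the successor of the SQLContext setConf method) or by running SET key=value commands using SQL.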

Property Name	Default	Meaning
spark.sql.parquet.binaryAsString	false	Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
spark.sql.parquet.int96AsTimestamp	true	Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
spark.sql.parquet.cacheMetadata	true	Turns on caching of Parquet schema metadata. Can speed up querying of static data.
spark.sql.parquet.compression.codec	gzip	Sets the compression codec used when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
spark.sql.parquet.filterPushdown	true	Enables Parquet filter push-down optimization when set to true.
spark.sql.hive.convertMetastoreParquet	true	When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
spark.sql.parquet.mergeSchema	false	When true, the Parquet data source merges schemas collected from all data files; otherwise the schema is picked from the summary file or a random data file if no summary file is available.
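Both ways of setting these properties can be sketched as follows. The chosen values (snappy, mergeSchema=true) are arbitrary examples rather than recommendations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParquetConf").master("local[*]").getOrCreate()

// Programmatic form: the runtime configuration attached to the session.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

// SQL form: a SET key=value command.
spark.sql("SET spark.sql.parquet.mergeSchema=true")

// Both settings are visible through the same runtime configuration.
val codec = spark.conf.get("spark.sql.parquet.compression.codec")
val merge = spark.conf.get("spark.sql.parquet.mergeSchema")
```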

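JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either an RDD of String or a JSON file.

Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.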

    // A JSON dataset is pointed to by path.
    // The path can be either a single text file or a directory storing text files
    val path = "examples/src/main/resources/people.json"
    val peopleDF = spark.read.json(path)

    // The inferred schema can be visualized using the printSchema() method
    peopleDF.printSchema()
    // root
    //  |-- age: long (nullable = true)
    //  |-- name: string (nullable = true)

    // Creates a temporary view using the DataFrame
    peopleDF.createOrReplaceTempView("people")

    // SQL statements can be run by using the sql methods provided by spark
    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
    teenagerNamesDF.show()
    // +------+
    // |  name|
    // +------+
    // |Justin|
    // +------+

    // Alternatively, a DataFrame can be created for a JSON dataset represented by
    // an RDD[String] storing one JSON object per string
    val otherPeopleRDD = spark.sparkContext.makeRDD(
      """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
    val otherPeople = spark.read.json(otherPeopleRDD)
    otherPeople.show()
    // +---------------+----+
    // |        address|name|
    // +---------------+----+
    // |[Columbus,Ohio]| Yin|
    // +---------------+----+

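Note that each element of the RDD[String] must be a string holding one JSON object.

The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala" in the Spark repository.

Hive Tables

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark will load them automatically. Note that these Hive dependencies must also be present on all of the worker nodes, as they need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in conf/.

When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions. When not configured by hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory in which the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of databases in the warehouse. You may need to grant write privilege to the user who starts the Spark application.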

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.SparkSession

    case class Record(key: Int, value: String)

    // warehouseLocation points to the default location for managed databases and tables
    val warehouseLocation = "file:${system:user.dir}/spark-warehouse"

    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
    sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

    // Queries are expressed in HiveQL
    sql("SELECT * FROM src").show()
    // +---+-------+
    // |key|  value|
    // +---+-------+
    // |238|val_238|
    // | 86| val_86|
    // |311|val_311|
    // ...

    // Aggregation queries are also supported.
    sql("SELECT COUNT(*) FROM src").show()
    // +--------+
    // |count(1)|
    // +--------+
    // |    500 |
    // +--------+

    // The results of SQL queries are themselves DataFrames and support all normal functions.
    val sqlDF = sql("SELECT key, value FROM src WHERE key < 10 ORDER BY key")

    // The items in DataFrames are of type Row, which allows you to access each column by ordinal.
    val stringsDS = sqlDF.map {
      case Row(key: Int, value: String) => s"Key: $key, Value: $value"
    }
    stringsDS.show()
    // +--------------------+
    // |               value|
    // +--------------------+
    // |Key: 0, Value: val_0|
    // |Key: 0, Value: val_0|
    // |Key: 0, Value: val_0|
    // ...

    // You can also use DataFrames to create temporary views within a HiveContext.
    val recordsDF = spark.createDataFrame((1 to 100).map(i => Record(i, s"val_$i")))
    recordsDF.createOrReplaceTempView("records")

    // Queries can then join DataFrame data with data stored in Hive.
    sql("SELECT * FROM records r JOIN src s ON r.key = s.key").show()
    // +---+------+---+------+
    // |key| value|key| value|
    // +---+------+---+------+
    // |  2| val_2|  2| val_2|
    // |  2| val_2|  2| val_2|
    // |  4| val_4|  4| val_4|
    // ...

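The complete example code can be found at "examples/src/main/scala/org/apache/spark/examples/sql/hive/SparkHiveExample.scala" in the Spark repository.

Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore, which enables Spark SQL to access the metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores. Independent of the Hive version used to talk to the metastore, Spark SQL internally compiles against Hive 1.2.1 and uses those classes for internal execution (SerDes, UDFs, UDAFs, etc.).

The following options can be used to configure the version of Hive that is used to retrieve metadata: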

Property Name	Default	Meaning
spark.sql.hive.metastore.version	1.2.1	Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
spark.sql.hive.metastore.jars	builtin	Location of the jars that should be used to instantiate the HiveMetastoreClient. This property can be one of three options: (1) builtin: use Hive 1.2.1, which is bundled with the Spark assembly when -Phive is enabled. When this option is chosen, spark.sql.hive.metastore.version must be either 1.2.1 or not defined. (2) maven: use Hive jars of the specified version downloaded from Maven repositories. This configuration is not generally recommended for production deployments. (3) A classpath in the standard format for the JVM. This classpath must include all of Hive and its dependencies, including the correct version of Hadoop. These jars only need to be present on the driver, but if you are running in yarn cluster mode then you must ensure they are packaged with your application.
spark.sql.hive.metastore.sharedPrefixes	com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc	A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example, custom appenders that are used by log4j.
spark.sql.hive.metastore.barrierPrefixes	(empty)	A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).

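JDBC to Other Databases

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, as it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)

To get started, you need to include the JDBC driver for your particular database on the Spark classpath. For example, to connect to postgres from the Spark shell, launch the shell with the PostgreSQL driver JAR supplied through the --driver-class-path and --jars options.

Tables from the remote database can be loaded as a DataFrame or a Spark SQL temporary view using the Data Sources API. The following options are supported: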

Property Name	Meaning
url	The JDBC URL to connect to.
dbtable	The JDBC table that should be read. Note that anything that is valid in a FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
driver	The class name of the JDBC driver to use to connect to this URL.
partitionColumn, lowerBound, upperBound, numPartitions	These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned.
fetchsize	The JDBC fetch size, which determines how many rows to fetch per round trip. This can help performance on JDBC drivers which default to a low fetch size (e.g. Oracle with 10 rows).

    val jdbcDF = spark.read.format("jdbc").options(
      Map("url" -> "jdbc:postgresql:dbserver",
          "dbtable" -> "schema.tablename")).load()

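Troubleshooting

The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class performs a security check when a connection is opened, and as a result it ignores every driver that is not visible to the primordial class loader. One convenient way to satisfy this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
Some databases, such as H2, convert all names to upper case. You need to use upper case to refer to those names in Spark SQL.

Performance Tuning

For some workloads it is possible to improve performance by either caching data in memory or by turning on some experimental options.

Caching Data in Memory

Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and automatically tune compression to minimize memory usage and GC pressure. You can call spark.catalog.uncacheTable("tableName") to remove the table from memory.

Configuration of in-memory caching can be done through the session's runtime configuration (spark.conf.set on a Spark 2.0 SparkSession, the successor of the SQLContext setConf method) or by running SET key=value commands using SQL.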

Property Name	Default	Meaning
spark.sql.inMemoryColumnarStorage.compressed	true	When set to true, Spark SQL will automatically select a compression codec for each column based on statistics of the data.
spark.sql.inMemoryColumnarStorage.batchSize	10000	Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.
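A sketch of caching and uncaching a table through the session catalog. The view name people, its rows, and the batch size of 5000 are hypothetical values picked for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CacheSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical data registered as a temporary view.
Seq(("Andy", 30), ("Justin", 19)).toDF("name", "age").createOrReplaceTempView("people")

// Tune the columnar cache before caching; 5000 is an arbitrary example value.
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "5000")

spark.catalog.cacheTable("people")   // marks the view for in-memory columnar caching
spark.table("people").count()        // an action, so the cached data is actually built
val cached = spark.catalog.isCached("people")

spark.catalog.uncacheTable("people") // removes the table from memory
val stillCached = spark.catalog.isCached("people")
```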

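Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in future releases as more optimizations are performed automatically.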

Property Name	Default	Meaning
spark.sql.files.maxPartitionBytes	134217728 (128 MB)	The maximum number of bytes to pack into a single partition when reading files.
spark.sql.files.openCostInBytes	4194304 (4 MB)	The estimated cost to open a file, measured by the number of bytes that could be scanned in the same time. This is used when putting multiple files into a partition. It is better to overestimate; then the partitions with small files will be faster than partitions with bigger files (which are scheduled first).
spark.sql.autoBroadcastJoinThreshold	10485760 (10 MB)	Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.
spark.sql.shuffle.partitions	200	Configures the number of partitions to use when shuffling data for joins or aggregations.
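A sketch of adjusting two of these options at runtime. The values (8 shuffle partitions, broadcasting disabled) and the sample keys are arbitrary. On Spark 2.0 the aggregation below is shuffled into exactly the configured number of partitions; later releases with adaptive query execution may coalesce them into fewer.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TuningSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Fewer shuffle partitions suit small data; the default is 200.
spark.conf.set("spark.sql.shuffle.partitions", "8")
// -1 disables broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// groupBy triggers a shuffle, so the result is laid out in shuffle partitions.
val counts = Seq("a", "b", "a", "c").toDF("key").groupBy("key").count()
val numPartitions = counts.rdd.getNumPartitions
println(s"shuffle output partitions: $numPartitions")
```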

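Distributed SQL Engine

Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.

Running the Thrift JDBC/ODBC Server

The Thrift JDBC/ODBC server implemented here corresponds to the HiveServer2 in Hive 1.2.1. You can test the JDBC server with the beeline script that comes with either Spark or Hive 1.2.1.

To start the JDBC/ODBC server, run ./sbin/start-thriftserver.sh from the Spark directory.

This script accepts all bin/spark-submit command-line options, plus a --hiveconf option to specify Hive properties. You may run ./sbin/start-thriftserver.sh --help for a complete list of all available options. By default, the server listens on localhost:10000. You may override this behaviour via environment variables, for example: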