Spark SQL Official Documentation (annotated)

Overview

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section.

DataFrames

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R.
All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell, pyspark shell, or sparkR shell.

Starting Point: SQLContext

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is only packaged separately to avoid including all of Hive's dependencies in the default Spark build. If these dependencies are not a problem for your application then using HiveContext is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with a HiveContext.
The specific variant of SQL that is used to parse queries can also be selected using the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or by using a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is "hiveql", though "sql" is also available. Since the HiveQL parser is much more complete, this is recommended for most use cases.
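For example, the dialect can be switched either programmatically or with a SQL command. The lines below are a minimal sketch, assuming a HiveContext named sqlContext is already in scope (with a plain SQLContext only the "sql" dialect is valid):

// Switch the parser dialect programmatically ...
sqlContext.setConf("spark.sql.dialect", "hiveql")
// ... or with the equivalent SQL command
sqlContext.sql("SET spark.sql.dialect=sql")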

Creating DataFrames

With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:

Sample JSON data (testpeople.json):

{"name":"Yin", "age":"25"}
{"name":"Michael", "age":"26"}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()


DataFrame Operations

DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, and Python.
Here we include some basic examples of structured data processing using DataFrames:

val sc: SparkContext // An existing SparkContext.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Show the content of the DataFrame
df.show()

// Print the schema in a tree format
df.printSchema()

// Select only the "name" column
df.select("name").show()

// Select everybody, but increment the age by 1
df.select(df("name"), df("age") + 1).show()

// Select people older than 21
df.filter(df("age") > 21).show()

// Count people by age
df.groupBy("age").count().show()


For a complete list of the types of operations that can be performed on a DataFrame refer to the API Documentation.
In addition to simple column references and expressions, DataFrames also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the DataFrame Function Reference.
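As a small illustration (a sketch that is not part of the original page), such functions are exposed as column expressions in org.apache.spark.sql.functions; upper and col are two of the standard helpers:

import org.apache.spark.sql.functions._

// Upper-case the name column and compute an arithmetic expression on age
df.select(upper(col("name")), col("age") + 1).show()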

Running SQL Queries Programmatically

The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
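A minimal sketch, assuming a temporary table named people has been registered as shown in the sections below:

// Run a SQL query; the result comes back as a DataFrame
val sqlDF = sqlContext.sql("SELECT name, age FROM people")
sqlDF.show()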

Interoperating with RDDs

Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.

Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table. Tables can be used in subsequent SQL statements.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index:
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

// or by field name:
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)

// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
// Map("name" -> "Justin", "age" -> 19)

Sample text data (testpeople.txt):

Yin,25
Michael,26

Walkthrough of the example above:
1. Define the Person case class as the table schema.
2. Load and split the text data (testpeople.txt) and convert it to a DataFrame (people).
3. Register the people DataFrame as a temporary table named "people" (the table name is arbitrary).
4. Run the SQL query, which returns a DataFrame (teenagers).
5. Inspect the teenagers DataFrame by field index, by field name, or as a map of fields, using map to define the output format.

Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

  1. Create an RDD of Rows from the original RDD;
  2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.
  3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

For example:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row;

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType,StructField,StringType};

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)

Walkthrough of the example above:
1. Load the people text file as an RDD.
2. Define the schema fields as a string (schemaString).
3. Import the required Row and data type classes.
4. Turn the schema string into a StructType (schema).
5. Split the people RDD into rows (rowRDD).
6. Call createDataFrame(rowRDD, schema) to get peopleDataFrame.
7. Register peopleDataFrame as a temporary table named "people" (the table name is arbitrary).
8. Query it with SQL.

Data Sources

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as normal RDDs and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark Data Sources and then goes into specific options that are available for the built-in data sources.

Generic Load/Save Functions

In the simplest form, the default data source (parquet unless otherwise configured by spark.sql.sources.default) will be used for all operations.

// Parquet (columnar) format data
val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")

Manually Specifying Options

You can also manually specify the data source that will be used along with any extra options that you would like to pass to the data source. Data sources are specified by their fully qualified name (i.e., org.apache.spark.sql.parquet), but for built-in sources you can also use their short names (json, parquet, jdbc). DataFrames of any type can be converted into other types using this syntax.

val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")

Summary:

read data:
    sqlContext.read.load(path)
    sqlContext.read.json("xxx.json")
    sqlContext.read.format("parquet").load("xxx.parquet")
save data:
    DataFrame.select("columns").write.format("json").save("path")
sql:
    sqlContext.sql("SQL statement")
    DataFrame.select("columns")

Save Modes

Save operations can optionally take a SaveMode, which specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. Additionally, when performing an Overwrite, the data will be deleted before writing out the new data.

The four save modes:

SaveMode.ErrorIfExists, "error" (default)
When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.

SaveMode.Append, "append"
When saving a DataFrame to a data source, if data/table already exists, contents of the DataFrame are expected to be appended to existing data.

SaveMode.Overwrite, "overwrite"
Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.

SaveMode.Ignore, "ignore"
Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected to not save the contents of the DataFrame and to not change the existing data. This is similar to a CREATE TABLE IF NOT EXISTS in SQL.
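A minimal sketch of choosing a save mode (df stands for any DataFrame, such as the one created earlier; the output paths are placeholders):

import org.apache.spark.sql.SaveMode

// Append to any existing data instead of throwing an exception
df.write.mode(SaveMode.Append).parquet("people_output.parquet")

// The mode can also be passed as a string
df.select("name", "age").write.format("json").mode("overwrite").save("namesAndAgesJson")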

Saving to Persistent Tables

When working with a HiveContext, DataFrames can also be saved as persistent tables using the saveAsTable command. Unlike the registerTempTable command, saveAsTable will materialize the contents of the dataframe and create a pointer to the data in the HiveMetastore. Persistent tables will still exist even after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SQLContext with the name of the table.
By default saveAsTable will create a "managed table", meaning that the location of the data will be controlled by the metastore. Managed tables will also have their data deleted automatically when a table is dropped.
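For illustration (a sketch assuming sqlContext is a HiveContext; "people_saved" is a hypothetical table name):

// Persist the DataFrame as a managed table in the Hive metastore
df.write.saveAsTable("people_saved")

// Even after a restart, the table can be loaded back as a DataFrame
val restored = sqlContext.table("people_saved")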

Parquet Files

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data.

// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a DataFrame by implicits, allowing it to be stored using Parquet.
//save as parquet
people.write.parquet("people.parquet")

// Read in the parquet file created above.  Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
// Read the Parquet file back
val parquetFile = sqlContext.read.parquet("people.parquet")

//Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Partition Discovery

Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store all our previously used population data into a partitioned table using the following directory structure, with two extra columns, gender and country as partitioning columns:

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

By passing path/to/table to either SQLContext.read.parquet or SQLContext.read.load, Spark SQL will automatically extract the partitioning information from the paths. Now the schema of the returned DataFrame becomes:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported. Sometimes users may not want to automatically infer the data types of the partitioning columns. For these use cases, the automatic type inference can be configured by spark.sql.sources.partitionColumnTypeInference.enabled, which defaults to true. When type inference is disabled, string type will be used for the partitioning columns.
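A minimal sketch of turning that inference off:

// Partition columns such as key=1 will then be read as strings instead of integers
sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")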

Schema Merging

Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge the schemas of all these files. (In other words, data written with different but compatible schemas can be read back under a single, merged schema.)
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, it is turned off by default starting from 1.5.0. You may enable it by

  1. setting the data source option mergeSchema to true when reading Parquet files (as shown in the example below), or
  2. setting the global SQL option spark.sql.parquet.mergeSchema to true.

// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.write.parquet("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.write.parquet("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.read.option("mergeSchema", "true").parquet("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column appeared in the partition directory paths.
// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key : int (nullable = true)

Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.

Hive/Parquet Schema Reconciliation

There are two key differences between Hive and Parquet from the perspective of table schema processing.

  1. Hive is case insensitive, while Parquet is not.
  2. Hive considers all columns nullable, while nullability in Parquet is significant.

Due to this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:

  1. Fields that have the same name in both schemas must have the same data type regardless of nullability. The reconciled field should have the data type of the Parquet side, so that nullability is respected.
  2. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
    ○ Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    ○ Any fields that only appear in the Hive metastore schema are added as nullable fields in the reconciled schema.

Metadata Refreshing

Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table conversion is enabled, metadata of those converted tables are also cached. If these tables are updated by Hive or other external tools, you need to refresh them manually to ensure consistent metadata.

// sqlContext is an existing HiveContext
sqlContext.refreshTable("my_table")

Configuration

Configuration of Parquet can be done using the setConf method on SQLContext or by running SET key=value commands using SQL. The individual options are listed in a table in the official documentation.
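A minimal sketch of both styles (the option values are only examples):

// Programmatically, through setConf
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Or through a SQL command
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")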

JSON Datasets

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SQLContext.read.json() on either an RDD of String, or a JSON file.
Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
// Load the JSON file
val people = sqlContext.read.json(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
//  |-- age: integer (nullable = true)
//  |-- name: string (nullable = true)

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
// An RDD of JSON strings
val anotherPeopleRDD = sc.parallelize(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.read.json(anotherPeopleRDD)

Hive Tables

Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml file in conf/. Please note that when running a query on a YARN cluster (yarn-cluster mode), the datanucleus jars under the lib_managed/jars directory and hive-site.xml under the conf/ directory need to be available on the driver and all executors launched by the YARN cluster. The convenient way to do this is to add them through the --jars and --files options of the spark-submit command.

When working with Hive one must construct a HiveContext, which inherits from SQLContext, and adds support for finding tables in the MetaStore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)


Mixing Hive and Parquet tables (adapted from another example)

Step 1: create a Hive table and load data from the local file system.
// Create a Hive table and load data into it; note that the second column of people.txt
// contains a leading space, so age is declared as string here.
scala>hiveContext.sql("CREATE TABLE hiveTable(name string,age string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' ")
scala>hiveContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/upload/class6/people.txt' INTO TABLE hiveTable")
Step 2: create a Parquet-backed table from HDFS.
// Register a table parquetTable2 backed by a Parquet file, then mix it with hiveTable
scala>hiveContext.parquetFile("hdfs://hadoop1:9000/class6/people.parquet").registerTempTable("parquetTable2")
Step 3: query the two tables together.
scala>hiveContext.sql("select a.name,a.age,b.age from hiveTable a join parquetTable2 b on a.name=b.name").collect().foreach(println)

Interacting with Different Versions of Hive Metastore

One of the most important pieces of Spark SQL's Hive support is interaction with the Hive metastore, which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below. Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).

The following options can be used to configure the version of Hive that is used to retrieve metadata; the full table of options is given in the official documentation.
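For illustration only (a sketch; the exact version string and jar location depend on your Hive deployment), the relevant properties include spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars, set before the contexts are created:

// Hypothetical sketch: talk to a Hive 0.13.1 metastore.
// "maven" downloads the matching Hive jars; "builtin" or an explicit classpath can be given instead.
val conf = new org.apache.spark.SparkConf()
  .set("spark.sql.hive.metastore.version", "0.13.1")
  .set("spark.sql.hive.metastore.jars", "maven")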

JDBC To Other Databases

Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD. This is because the results are returned as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python as it does not require the user to provide a ClassTag. (Note that this is different than the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)

To get started you will need to include the JDBC driver for your particular database on the spark classpath. For example, to connect to postgres from the Spark Shell you would run the following command:

SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell

Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary table using the Data Sources API. The following options are supported (among others):

url     — the JDBC URL to connect to
dbtable — the JDBC table that should be read
driver  — the class name of the JDBC driver needed to connect to this URL

val jdbcDF = sqlContext.read.format("jdbc").options( 
          Map("url" -> "jdbc:postgresql:dbserver",
         "dbtable" -> "schema.tablename")).load()


Troubleshooting

● The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
● Some databases, such as H2, convert all names to upper case. You'll need to use upper case to refer to those names in Spark SQL.

Performance Tuning

For some workloads it is possible to improve performance by either caching data in memory, or by turning on some experimental options.

Caching Data In Memory

Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.

Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.

Property Name: spark.sql.inMemoryColumnarStorage.compressed
Default: true
Meaning: When set to true Spark SQL will automatically select a compression codec for each column based on statistics of the data.

Property Name: spark.sql.inMemoryColumnarStorage.batchSize
Default: 10000
Meaning: Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs when caching data.
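A minimal sketch, assuming the people table registered earlier:

// Cache the table in the in-memory columnar format
sqlContext.cacheTable("people")

// Tune the columnar batch size (the value here is only an example)
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")

// Remove the table from memory when done
sqlContext.uncacheTable("people")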

Test notes:

Spark SQL caching can be triggered in two ways:
the cacheTable() method, or
the CACHE TABLE command.

Do not cache the SchemaRDD with RDD.cache() and then call registerAsTable: RDD.cache() uses the generic RDD cache rather than the in-memory columnar storage that is optimized for SQL.

Other Configuration Options

The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in future releases as more optimizations are performed automatically. (They are listed in a table in the official documentation; I have not tuned them myself yet.)


Distributed SQL Engine

Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.
