Spark SQL and DataFrame Guide

Overview

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

DataFrames

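A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, and Python.
All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell or the pyspark shell.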

Starting Point: `SQLContext`

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

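In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need an existing Hive setup, and all of the data sources available to a SQLContext are still available. HiveContext is packaged separately only to avoid including all of Hive's dependencies in the default Spark build. If these dependencies are not a problem for your application, then using HiveContext is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext up to feature parity with HiveContext.
The specific variant of SQL used to parse queries can also be selected with the spark.sql.dialect option. This parameter can be changed using either the setConf method on a SQLContext or a SET key=value command in SQL. For a SQLContext, the only dialect available is "sql", which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the default is "hiveql", though "sql" is also available. Since the HiveQL parser is much more complete, it is the recommended choice.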
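
As a rough sketch of the two ways of setting the dialect mentioned above, assuming an existing SQLContext (the helper name useSimpleSqlDialect is hypothetical and not part of Spark):

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical helper (not part of Spark): selects the simple "sql" dialect
// on an existing SQLContext using both mechanisms described above.
def useSimpleSqlDialect(sqlContext: SQLContext): String = {
  // Programmatic configuration through setConf.
  sqlContext.setConf("spark.sql.dialect", "sql")
  // The same setting issued as a SET key=value command in SQL.
  sqlContext.sql("SET spark.sql.dialect=sql")
  // Read the setting back so the caller can confirm it.
  sqlContext.getConf("spark.sql.dialect")
}
```

Either call on its own should be enough; both are shown only to put the two forms side by side.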

Creating DataFrames

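With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
As an example, the following creates a DataFrame based on the content of a JSON file: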

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()

DataFrame Operations

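DataFrames provide a domain-specific language (DSL) for structured data manipulation in Scala, Java, and Python.
Here are some basic examples of structured data processing using DataFrames: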

val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.jsonFile("examples/src/main/resources/people.json")

// Show the content
df.show()
// age  name
// null Michael
// 30   Andy
// 19   Justin

// Print the schema in a tree format.
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show()
// name
// Michael
// Andy
// Justin

// Select everybody, but increment the age by 1
df.select("name", df("age") + 1).show()
// name    (age + 1)
// Michael null
// Andy    31
// Justin  20

// Select people older than 21
df.filter(df("age") > 21).show()
// age name
// 30  Andy

// Count people by age
df.groupBy("age").count().show()
// age  count
// null 1
// 19   1
// 30   1

Running SQL Queries Programmatically

The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.

val sqlContext = ... // An existing SQLContext.
val df = sqlContext.sql("SELECT * FROM table")

Interoperating with RDDs

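Spark SQL supports two different methods for converting existing RDDs into DataFrames. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating DataFrames is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct DataFrames when the columns and their types are not known until runtime.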

Inferring the Schema Using Reflection

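The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then registered as a table. Tables can be used in subsequent SQL statements.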

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
//       you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the results can be accessed by ordinal.
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Programmatically Specifying the Schema

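When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and its fields projected differently for different users), a DataFrame can be created programmatically in three steps.
1. Create an RDD of Rows from the original RDD.
2. Create the schema, represented by a StructType, matching the structure of the Rows in the RDD created in step 1.
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

For example: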

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

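// Import Row and DataFrame, plus the Spark SQL data types (which live in the types package as of 1.3).
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StructType, StructField, StringType}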

// Generate the schema based on the string of schema
val schema =
    StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD(people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results: DataFrame = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)

Data Sources

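Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on like a normal RDD and can also be registered as a temporary table. Registering a DataFrame as a table allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark data sources and then goes into the specific options that are available for the built-in data sources.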

Generic Load/Save Functions

In the simplest form, the default data source (parquet, unless otherwise configured by spark.sql.sources.default) is used for all operations.

val df = sqlContext.load("people.parquet")
df.select("name", "age").save("namesAndAges.parquet")

Manually Specifying Options

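You can also manually specify the data source to use, along with any extra options that you would like to pass to it. Data sources are specified by their fully qualified name (for example, org.apache.spark.sql.parquet), but for built-in sources you can also use the short names (json, parquet, jdbc). Using this syntax, DataFrames of any type can be converted into other types.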

val df = sqlContext.load("people.json", "json")
df.select("name", "age").save("namesAndAges.parquet", "parquet")

Save Modes

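Save operations can optionally take a SaveMode that specifies how to handle existing data if present. It is important to realize that these save modes do not utilize any locking and are not atomic. It is therefore not safe to have multiple writers attempting to write to the same location.
Additionally, when performing an Overwrite, the existing data is deleted before the new data is written out.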

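Scala/Java	Python	Meaning
SaveMode.ErrorIfExists (default)	"error" (default)	When saving a DataFrame to a data source, if data already exists, an exception is expected to be thrown.
SaveMode.Append	"append"	When saving a DataFrame to a data source, if data/table already exists, the contents of the DataFrame are expected to be appended to the existing data.
SaveMode.Overwrite	"overwrite"	When saving a DataFrame to a data source, if data/table already exists, the existing data is expected to be overwritten by the contents of the DataFrame.
SaveMode.Ignore	"ignore"	Ignore mode means that when saving a DataFrame to a data source, if data already exists, the save operation is expected not to save the contents of the DataFrame and not to change the existing data. This is similar to `CREATE TABLE IF NOT EXISTS` in SQL.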
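
A minimal sketch of passing a save mode explicitly, assuming a DataFrame is already available (the helper name appendAsParquet is hypothetical):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper: appends a DataFrame to a Parquet location instead of
// failing when data is already there (the default ErrorIfExists behaviour).
def appendAsParquet(df: DataFrame, path: String): Unit =
  df.save(path, "parquet", SaveMode.Append)
```

Passing SaveMode.Ignore in the same position would instead leave any existing data untouched.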

Saving to Persistent Tables

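When working with a HiveContext, DataFrames can also be saved as persistent tables using the saveAsTable command. Unlike the registerTempTable command, saveAsTable materializes the contents of the DataFrame and creates a pointer to the data in the Hive metastore. Persistent tables will still exist after your Spark program has restarted, as long as you maintain your connection to the same metastore. A DataFrame for a persistent table can be created by calling the table method on a SQLContext with the name of the table.
By default, saveAsTable creates a "managed table", meaning that the location of the data is controlled by the metastore. Managed tables also have their data deleted automatically when the table is dropped.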

Parquet Files

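Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files, automatically preserving the schema of the original data.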

Loading Data Programmatically

Using the data from the above example:

// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

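val people: RDD[Person] = ... // An RDD of case class objects, from the previous example.

// toDF() (enabled by the implicits import) converts the RDD to a DataFrame, allowing it to be stored using Parquet.
people.toDF().saveAsParquetFile("people.parquet")

// Read in the Parquet file created above. Parquet files are self-describing, so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFile = sqlContext.parquetFile("people.parquet")

// Parquet files can also be registered as tables and then used in SQL statements.
parquetFile.registerTempTable("parquetFile")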
val teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)

Partition discovery

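Table partitioning is a common optimization approach used in systems like Hive. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically. For example, we can store population data in a partitioned table using the following directory structure, with two extra columns, gender and country, as partitioning columns: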

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

By passing path/to/table to either SQLContext.parquetFile or SQLContext.load, Spark SQL will automatically extract the partitioning information from the paths. The schema of the returned DataFrame becomes:

root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)

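Notice that the data types of the partitioning columns are automatically inferred. Currently, numeric data types and string type are supported.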
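
To make the path encoding concrete, here is a small plain-Scala sketch of how key=value directory names can be read back as column values. It is only an illustration and is not Spark's actual partition discovery code; the function name partitionValues is hypothetical.

```scala
// Illustrative only: this is NOT Spark's implementation of partition discovery.
// It shows how key=value directory names can be turned into column values.
def partitionValues(path: String): Map[String, String] =
  path.split("/").toSeq
    .filter(_.contains("="))               // keep only partition directories
    .map { segment =>
      val Array(key, value) = segment.split("=", 2)
      key -> value                         // e.g. "gender" -> "male"
    }
    .toMap

val values = partitionValues("path/to/table/gender=male/country=US/data.parquet")
println(values)
```

The values come back as strings here; Spark additionally infers a numeric or string type for each partitioning column, as noted above.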

Schema merging

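Like ProtocolBuffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with a simple schema and gradually add more columns to it as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to detect this case automatically and merge the schemas of all these files.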

// sqlContext from the previous example is used in this example.
// This is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Create a simple DataFrame, stored into a partition directory.
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.saveAsParquetFile("data/test_table/key=1")

// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column.
val df2 = sc.makeRDD(6 to 10).map(i => (i, i * 3)).toDF("single", "triple")
df2.saveAsParquetFile("data/test_table/key=2")

// Read the partitioned table
val df3 = sqlContext.parquetFile("data/test_table")
df3.printSchema()

// The final schema consists of all 3 columns in the Parquet files together
// with the partitioning column that appears in the partition directory paths.

// root
// |-- single: int (nullable = true)
// |-- double: int (nullable = true)
// |-- triple: int (nullable = true)
// |-- key: int (nullable = true)

Configuration

Configuration of Parquet can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.

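Property Name	Default	Meaning
spark.sql.parquet.binaryAsString	false	Some other Parquet-producing systems, in particular Impala and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
spark.sql.parquet.int96AsTimestamp	true	Some Parquet-producing systems, in particular Impala, store Timestamp as INT96. Spark also stores Timestamp as INT96 because we need to avoid precision loss of the nanoseconds field. This flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
spark.sql.parquet.cacheMetadata	true	Turns on caching of Parquet schema metadata. Can speed up querying of static data.
spark.sql.parquet.compression.codec	gzip	Sets the compression codec to use when writing Parquet files. Acceptable values include: uncompressed, snappy, gzip, lzo.
spark.sql.parquet.filterPushdown	false	Turns on the Parquet filter pushdown optimization. This feature is turned off by default because of a known bug in Parquet 1.6.0rc3 (PARQUET-136). However, if your table does not contain any nullable string or binary columns, it is still safe to turn this feature on.
spark.sql.hive.convertMetastoreParquet	true	When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.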

JSON Datasets

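Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using one of two methods in a SQLContext:
- jsonFile - loads data from a directory of JSON files where each line of the files is a JSON object.
- jsonRDD - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.

Note that the file offered to jsonFile is not a typical JSON file. Each line must contain a separate, self-contained, valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.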

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files.
val path = "examples/src/main/resources/people.json"
// Create a DataFrame from the file(s) pointed to by path
val people = sqlContext.jsonFile(path)

// The inferred schema can be visualized using the printSchema() method.
people.printSchema()
// root
// |-- age: integer (nullable = true)
// |-- name: string (nullable = true)

// Register this DataFrame as a table.
people.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// an RDD[String] storing one JSON object per string.
val anotherPeopleRDD = sc.parallelize(
    """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)

Hive Tables

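Spark SQL also supports reading and writing data stored in Apache Hive. However, since Hive has a large number of dependencies, it is not included in the default Spark assembly. Hive support is enabled by adding the -Phive and -Phive-thriftserver flags to Spark's build. This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.
Configuration of Hive is done by placing your hive-site.xml file in conf/.
When working with Hive, one must construct a HiveContext, which inherits from SQLContext and adds support for finding tables in the metastore and writing queries using HiveQL. Users who do not have an existing Hive deployment can still create a HiveContext. When not configured by hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.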

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

// Queries are expressed in HiveQL
sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)

JDBC To Other Databases

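Spark SQL also includes a data source that can read data from other databases using JDBC. This functionality should be preferred over using JdbcRDD, because the results are returned as a DataFrame and can easily be processed in Spark SQL or joined with other data sources. The JDBC data source is also easier to use from Java or Python, as it does not require the user to provide a ClassTag. (Note that this is different from the Spark SQL JDBC server, which allows other applications to run queries using Spark SQL.)

To get started, you will need to include the JDBC driver for your particular database on the Spark classpath. For example, to connect to postgres from the Spark shell, you would run the following command: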

SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar bin/spark-shell

Tables from the remote database can be loaded as a DataFrame or a Spark SQL temporary table using the Data Sources API. The following options are supported:

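Property Name	Meaning
url	The JDBC URL to connect to.
dbtable	The JDBC table that should be read. Note that anything that is valid in the FROM clause of a SQL query can be used. For example, instead of a full table you could also use a subquery in parentheses.
driver	The class name of the JDBC driver needed to connect to this URL. This class is loaded on the master and workers before running any JDBC commands, to allow the driver to register itself with the JDBC subsystem.
partitionColumn, lowerBound, upperBound, numPartitions	These options must all be specified if any of them is specified. They describe how to partition the table when reading in parallel from multiple workers. `partitionColumn` must be a numeric column from the table in question.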

val jdbcDF = sqlContext.load("jdbc", Map(
        "url" -> "jdbc:postgresql:dbserver",
        "dbtable" -> "schema.tablename")
    )
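
The partitioning options can be passed in the same options map. The following is a sketch only: the column name "id", the bounds, and the partition count are illustrative assumptions rather than values from this guide, and the helper name loadPartitioned is hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Options for a parallel JDBC read. The column name "id", the bounds and the
// partition count are illustrative assumptions, not values from the guide.
val partitionedJdbcOptions: Map[String, String] = Map(
  "url" -> "jdbc:postgresql:dbserver",
  "dbtable" -> "schema.tablename",
  "partitionColumn" -> "id",   // must be a numeric column of the table
  "lowerBound" -> "1",
  "upperBound" -> "100000",
  "numPartitions" -> "4"
)

// Hypothetical helper: loads the table with the options above.
def loadPartitioned(sqlContext: SQLContext): DataFrame =
  sqlContext.load("jdbc", partitionedJdbcOptions)
```

All four partitioning options appear together, since specifying only some of them is not allowed.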

Troubleshooting

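- The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.
- Some databases, such as H2, convert all names to upper case. You will need to use upper case to refer to those names in Spark SQL.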

Performance Tuning

For some workloads it is possible to improve performance by either caching data in memory or by turning on some experimental options.

Caching Data In Memory

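Spark SQL can cache tables using an in-memory columnar format by calling sqlContext.cacheTable("tableName") or dataFrame.cache(). Spark SQL will then scan only the required columns and automatically tune compression to minimize memory usage and GC pressure. You can call sqlContext.uncacheTable("tableName") to remove the table from memory.
Configuration of in-memory caching can be done using the setConf method on SQLContext or by running SET key=value commands using SQL.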

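Property Name	Default	Meaning
spark.sql.inMemoryColumnarStorage.compressed	true	When set to true, Spark SQL will automatically select a compression codec for each column based on statistics of the data.
spark.sql.inMemoryColumnarStorage.batchSize	10000	Controls the size of batches for columnar caching. Larger batch sizes can improve memory utilization and compression, but risk OOMs (out of memory errors) when caching data.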
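
A brief sketch of how the caching calls and a cache setting might be combined, assuming a table has already been registered under the given name (the helper name countWithCache and the batch size of 5000 are illustrative assumptions):

```scala
import org.apache.spark.sql.SQLContext

// Hypothetical helper: caches a registered table with a smaller batch size,
// runs a query against it, then releases the memory.
def countWithCache(sqlContext: SQLContext, tableName: String): Long = {
  // A smaller batch size than the default lowers the risk of OOMs.
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "5000")
  sqlContext.cacheTable(tableName)
  val rows = sqlContext.sql("SELECT * FROM " + tableName).count()
  // Remove the table from memory once it is no longer needed.
  sqlContext.uncacheTable(tableName)
  rows
}
```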

Other Configuration Options

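The following options can also be used to tune the performance of query execution. It is possible that these options will be deprecated in a future release as more optimizations are performed automatically.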

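Property Name	Default	Meaning
spark.sql.autoBroadcastJoinThreshold	10485760 (10 MB)	Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. Setting this value to -1 disables broadcasting. Note that currently statistics are only supported for Hive Metastore tables where the command `ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan` has been run.
spark.sql.codegen	false	When true, code is dynamically generated at runtime for expression evaluation in a specific query. For some queries with complicated expressions this option can lead to significant speed-ups. However, for simple queries it can actually slow down query execution.
spark.sql.shuffle.partitions	200	Configures the number of partitions to use when shuffling data for joins or aggregations.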

Distributed SQL Engine

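Spark SQL can also act as a distributed query engine using its JDBC/ODBC or command-line interface. In this mode, end users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.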

Running the Thrift JDBC/ODBC server

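The Thrift JDBC/ODBC server implemented here corresponds to the HiveServer2 in Hive 0.13. You can test the JDBC server with the beeline script that comes with either Spark or Hive 0.13.
To start the JDBC/ODBC server, run the following in the Spark directory: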

./sbin/start-thriftserver.sh

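This script accepts all bin/spark-submit command line options, plus a --hiveconf option to specify Hive properties. You may run ./sbin/start-thriftserver.sh --help for a complete list of all available options. By default, the server listens on localhost:10000. You may override this behaviour via either environment variables, for example: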

export HIVE_SERVER2_THRIFT_PORT=<listening-port>
export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
./sbin/start-thriftserver.sh \
  --master <master-uri> \
  ...

or system properties:

./sbin/start-thriftserver.sh \
  --hiveconf hive.server2.thrift.port=<listening-port> \
  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
  --master <master-uri> \
  ...

Now you can use beeline to test the Thrift JDBC/ODBC server:

./bin/beeline

Connect to the JDBC/ODBC server in beeline with:

beeline> !connect jdbc:hive2://localhost:10000

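Beeline will ask you for a username and password. In non-secure mode, simply enter the username on your machine and a blank password. For secure mode, please follow the instructions given in the beeline documentation.
Configuration of Hive is done by placing your hive-site.xml file in conf/.
You may also use the beeline script that comes with Hive.
The Thrift JDBC server also supports sending thrift RPC messages over HTTP transport. Use the following settings to enable HTTP mode, either as system properties or in the hive-site.xml file in conf/: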

hive.server2.transport.mode - Set this to value: http
hive.server2.thrift.http.port - HTTP port number to listen on; default is 10001
hive.server2.http.endpoint - HTTP endpoint; default is cliservice

To test, use beeline to connect to the JDBC/ODBC server in HTTP mode with:

beeline> !connect jdbc:hive2://<host>:<port>/<database>?hive.server2.transport.mode=http;hive.server2.thrift.http.path=<http_endpoint>

Running the Spark SQL CLI

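The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode and execute queries input from the command line. Note that the Spark SQL CLI cannot talk to the Thrift JDBC server.

To start the Spark SQL CLI, run the following in the Spark directory: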

./bin/spark-sql

Configuration of Hive is done by placing your hive-site.xml file in conf/. You may run ./bin/spark-sql --help for a complete list of all available options.

Migration Guide

Upgrading from Spark SQL 1.0-1.2 to 1.3

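In Spark 1.3 we removed the "Alpha" label from Spark SQL and, as part of this, cleaned up the available APIs. From Spark 1.3 onwards, Spark SQL will provide binary compatibility with other releases in the 1.x series. This compatibility guarantee excludes APIs that are explicitly marked as unstable (that is, DeveloperAPI or Experimental).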

Rename of SchemaRDD to DataFrame

Unification of the Java and Scala APIs

Isolation of Implicit Conversions and Removal of dsl Package (Scala-only)

Removal of the type aliases in org.apache.spark.sql for DataType (Scala-only)

UDF Registration Moved to sqlContext.udf (Java & Scala)

Python DataTypes No Longer Singletons

Migration Guide for Shark User

Scheduling

Reducer number

Caching

Compatibility with Apache Hive

Deploying in Existing Hive Warehouses

Supported Hive Features

Unsupported Hive Functionality

Data Types

Spark SQL and DataFrames support the following data types:

Numeric types:
- ByteType: 1-byte signed integers; range -128 to 127
- ShortType: 2-byte signed integers; range -32768 to 32767
- IntegerType: 4-byte signed integers; range -2147483648 to 2147483647
- LongType: 8-byte signed integers; range -9223372036854775808 to 9223372036854775807
- FloatType: 4-byte single-precision floating point numbers
- DoubleType: 8-byte double-precision floating point numbers
- DecimalType: arbitrary-precision signed decimal numbers, backed internally by java.math.BigDecimal
String type:
- StringType
Binary type:
- BinaryType
Boolean type:
- BooleanType
Datetime types:
- TimestampType
- DateType
Complex types:
- ArrayType(elementType, containsNull)
- MapType
- StructType(fields)
  - StructField(name, dataType, nullable)

All data types of Spark SQL are located in the package org.apache.spark.sql.types. You can access them by doing:

import org.apache.spark.sql.types._

Data type	Value type in Scala	API to access or create a data type
ByteType	Byte	ByteType
ShortType	Short	ShortType
IntegerType	Int	IntegerType
LongType	Long	LongType
FloatType	Float	FloatType
DoubleType	Double	DoubleType
DecimalType	java.math.BigDecimal	DecimalType
StringType	String	StringType
BinaryType	Array[Byte]	BinaryType
BooleanType	Boolean	BooleanType
TimestampType	java.sql.Timestamp	TimestampType
DateType	java.sql.Date	DateType
ArrayType	scala.collection.Seq	ArrayType(elementType, [containsNull]). Note: The default value of containsNull is true.
MapType	scala.collection.Map	MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is true.
StructType	org.apache.spark.sql.Row	StructType(fields). Note: fields is a Seq of StructFields. Also, two fields with the same name are not allowed.
StructField	The value type in Scala of the data type of this field (For example, Int for a StructField with the data type IntegerType)	StructField(name, dataType, nullable)
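
As a small sketch of how the constructors in the table above fit together, the following builds a schema with simple and complex types. The field names are assumptions chosen purely for illustration.

```scala
import org.apache.spark.sql.types._

// Illustrative schema; the field names are assumptions chosen for this sketch.
val personSchema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true),
  // An array of strings that may not contain null elements.
  StructField("nicknames", ArrayType(StringType, containsNull = false), nullable = true),
  // A map from string keys to double values; values may be null.
  StructField("scores", MapType(StringType, DoubleType, valueContainsNull = true), nullable = true)
))

// Print the schema in the same tree format used by printSchema().
println(personSchema.treeString)
```

A schema built this way can be passed to createDataFrame together with an RDD of Rows whose values match the declared types.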