Learning Spark 9: Spark SQL


岸芷汀兰whu


Article link: https://blog.csdn.net/u012432611/article/details/48135651


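This chapter discusses Spark SQL, Spark's interface for working with structured and semistructured data. Structured data is data that has a schema, that is, a known set of fields for each record. Spark SQL provides three main capabilities:

1. It can load data from a variety of structured sources (JSON, Hive, Parquet).
2. It lets you query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau.
3. When used within a Spark program, Spark SQL provides rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables.

To implement these capabilities, Spark SQL provides a special type of RDD called SchemaRDD.
A SchemaRDD is an RDD of Row objects, each representing a record. A SchemaRDD also knows the schema (the data fields) of its rows. SchemaRDDs look like regular RDDs, but they use the schema to store data more efficiently. They also provide operations that regular RDDs lack, such as running SQL queries. SchemaRDDs can be created from external data sources, from the results of queries, or from regular RDDs.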
## Linking with Spark SQL ##
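Spark SQL's Hive support gives us access to Hive tables, UDFs, SerDes, and the Hive query language (HiveQL). Including the Hive libraries does not require an existing Hive installation. In general, it is best to build Spark SQL with Hive support to get these features. The binary distributions of Spark are already built with Hive support; if you build Spark from source, you should run sbt/sbt -Phive assembly.
Depending on whether we need Hive support, there are two entry points into Spark SQL. The recommended entry point is HiveContext, which provides HiveQL and other Hive-dependent functionality. The more basic SQLContext provides a subset of the Spark SQL support that does not depend on Hive. Using a HiveContext does not require an existing Hive installation.

HiveQL is the recommended query language.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to the $SPARK_HOME/conf directory. If you have no Hive installation, Spark SQL will still run, but it will create its own Hive metastore (metadata DB) in your program's working directory, called metastore_db. In addition, if you create tables with HiveQL's CREATE TABLE statement (not CREATE EXTERNAL TABLE), they will be placed in the /user/hive/warehouse directory on your default filesystem (the local filesystem, or HDFS if you have an hdfs-site.xml on your classpath).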

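Using Spark SQL in Applications

The most powerful way to use Spark SQL is inside a Spark application. Based on our SparkContext we create a HiveContext, which provides the functionality for querying and interacting with Spark SQL data. With it we can build SchemaRDDs, which represent our structured data, and operate on them with SQL or with regular RDD operations.

Initializing Spark SQL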

// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext

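Note that, unlike with SparkContext, we did not import HiveContext._ to get access to the implicit conversions. These implicits convert RDDs that carry the required type information into Spark SQL's specialized RDDs for querying. Instead, once we have created an instance of HiveContext, we can import the implicits by adding the following code: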

Example 9-3. Scala SQL implicits
// Create a Spark SQL HiveContext
val hiveCtx = ...
// Import the implicit conversions
import hiveCtx._

Example 9-4. Java SQL imports
// Import Spark SQL
import org.apache.spark.sql.hive.HiveContext;
// Or if you can't have the hive dependencies
import org.apache.spark.sql.SQLContext;
// Import the JavaSchemaRDD
import org.apache.spark.sql.SchemaRDD;
import org.apache.spark.sql.Row;
Example 9-5. Python SQL imports
# Import Spark SQL
from pyspark.sql import HiveContext, Row
# Or if you can't include the hive requirements
from pyspark.sql import SQLContext, Row

Once we have added our imports, we need to create a HiveContext, or a SQLContext if we cannot bring in the Hive dependencies. Both of these classes take a SparkContext to run on.

Example 9-6. Constructing a SQL context in Scala
val sc = new SparkContext(...)
val hiveCtx = new HiveContext(sc)
Example 9-7. Constructing a SQL context in Java
JavaSparkContext ctx = new JavaSparkContext(...);
SQLContext sqlCtx = new HiveContext(ctx);
Example 9-8. Constructing a SQL context in Python
hiveCtx = HiveContext(sc)

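Basic Query Example

To query a table, we call the sql() method on the HiveContext or SQLContext. First we need to tell Spark SQL about some data to query. Here we load some Twitter data from JSON and register it as a temporary table so that we can query it with SQL.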

Example 9-9. Loading and querying tweets in Scala
val input = hiveCtx.jsonFile(inputFile)
// Register the input schema RDD
input.registerTempTable("tweets")
// Select tweets based on the retweetCount
val topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")

Example 9-10. Loading and querying tweets in Java
SchemaRDD input = hiveCtx.jsonFile(inputFile);
// Register the input schema RDD
input.registerTempTable("tweets");
// Select tweets based on the retweetCount
SchemaRDD topTweets = hiveCtx.sql("SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10");

Example 9-11. Loading and querying tweets in Python
input = hiveCtx.jsonFile(inputFile)
# Register the input schema RDD
input.registerTempTable("tweets")
# Select tweets based on the retweetCount
topTweets = hiveCtx.sql("""SELECT text, retweetCount FROM
tweets ORDER BY retweetCount LIMIT 10""")

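If you have an existing Hive installation and have copied your hive-site.xml file to $SPARK_HOME/conf, you can also simply run hiveCtx.sql to query your existing Hive tables.

SchemaRDD

Both loading data and executing queries return SchemaRDDs. A SchemaRDD is similar to a table in a traditional database. It is an RDD made up of Row objects plus schema information about the type of each column. Row objects are just wrappers around arrays of basic types.

Row Objects

Row objects represent records inside a SchemaRDD and are fixed-length arrays of fields. In Scala/Java, Row objects have a number of getter methods that return the value of each field given its index. The standard getter, get (or apply in Scala), takes a column number and returns an Object type (Any in Scala), which we are responsible for casting to the correct type. For Boolean, Byte, Double, Float, Int, Long, Short, and String there is a getType() method that returns that type. For example, getString(0) returns field 0 as a string.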

Example 9-12. Accessing the text column (also first column) in the topTweets SchemaRDD in Scala
val topTweetText = topTweets.map(row => row.getString(0))

Example 9-13. Accessing the text column (also first column) in the topTweets SchemaRDD in Java
JavaRDD<String> topTweetText = topTweets.toJavaRDD().map(new Function<Row, String>() {
  public String call(Row row) {
    return row.getString(0);
  }});
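
The Scala and Java examples above read a field by position. The same idea can be sketched in plain Python without Spark. The `Row` class below is a hypothetical stand-in written for illustration (it is not PySpark's `Row`), and the two sample rows are made up; it only models a fixed-length array of fields with one untyped getter and one typed getter.

```python
class Row(object):
    """Hypothetical stand-in for a Spark SQL row: a fixed-length array of fields."""

    def __init__(self, *values):
        self._values = tuple(values)  # fixed length, ordered by column

    def get(self, i):
        # Untyped getter: the caller is responsible for knowing the type.
        return self._values[i]

    def get_string(self, i):
        # Typed getter: fails loudly if the column does not hold a string.
        value = self._values[i]
        if not isinstance(value, str):
            raise TypeError("column %d is not a string" % i)
        return value

    def __len__(self):
        return len(self._values)


# Two made-up rows shaped like the (text, retweetCount) result of the tweets query.
top_tweets = [Row("hello spark", 12), Row("sql on rdds", 7)]

# Equivalent in spirit to topTweets.map(row => row.getString(0)).
top_tweet_text = [row.get_string(0) for row in top_tweets]
```

The typed getter is the design point here: asking for column 1 as a string raises an error instead of silently returning an integer, which is roughly the safety the getType() methods give over a bare get.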

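Caching

Caching in Spark SQL works a bit differently. Because we know the type of each column, Spark can store the data more efficiently. To cache a table we use the hiveCtx.cacheTable("tableName") method. You can also cache and uncache tables with HiveQL/SQL statements: CACHE TABLE tableName or UNCACHE TABLE tableName.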

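Loading and Saving Data

Spark SQL supports a number of structured data sources out of the box, so you can get Row objects from them without any complicated loading process. These sources include Hive tables, JSON, and Parquet files. In addition, if you query these sources with SQL and select only a subset of the fields, Spark SQL can scan only the data for those fields, instead of possibly scanning all the data as SparkContext.hadoopFile would.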
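
The benefit of reading only the selected fields can be illustrated with a small plain-Python sketch. It does not use Spark, the three records are made-up sample data, and the helper names (`to_columns`, `scan_rows`, `scan_columns`) are hypothetical; it simply counts how many values each storage layout has to touch to answer a query that needs one field.

```python
# Made-up sample records; every record has the same three fields.
records = [
    {"name": "Holden", "age": 30, "favouriteAnimal": "panda"},
    {"name": "Sparky The Bear", "age": 5, "favouriteAnimal": "honey badger"},
    {"name": "Panda Fan", "age": 41, "favouriteAnimal": "panda"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per field (a column layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def scan_rows(rows, wanted):
    """Row layout: every field of every record is read, then projected."""
    values_read = 0
    result = []
    for row in rows:
        values_read += len(row)
        result.append(tuple(row[field] for field in wanted))
    return result, values_read


def scan_columns(columns, wanted):
    """Column layout: only the requested columns are read."""
    values_read = sum(len(columns[field]) for field in wanted)
    result = list(zip(*(columns[field] for field in wanted)))
    return result, values_read


row_result, row_read = scan_rows(records, ["name"])
col_result, col_read = scan_columns(to_columns(records), ["name"])
```

Both scans return the same names, but the row-oriented scan touches all nine values while the column-oriented scan touches three. This is only a counting model of the idea, not a description of how Spark SQL implements it.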

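Apart from these data sources, you can also convert regular RDDs into SchemaRDDs by assigning them a schema. SQL queries are often more concise when you compute many quantities at once (such as the average age, the maximum age, and the number of IDs). In addition, you can easily join these RDDs with SchemaRDDs from any other Spark SQL data source.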
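
To see why a single SQL statement is convenient when several statistics are needed at once, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in SQL engine rather than Spark SQL, and the `users` table and its values are made up for illustration.

```python
import sqlite3

# Made-up sample records: (user id, age). User 2 appears twice on purpose.
users = [(1, 25), (2, 35), (2, 35), (3, 60)]

# sqlite3 stands in for a SQL engine here; this is not Spark SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", users)

# One statement computes the average age, the maximum age,
# and the number of distinct user ids.
avg_age, max_age, distinct_ids = conn.execute(
    "SELECT AVG(age), MAX(age), COUNT(DISTINCT id) FROM users"
).fetchone()
conn.close()
```

Doing the same with plain collection operations would take three separate aggregations or a hand-written combined reducer; in SQL it is one line. With Spark SQL the query text would be passed to hiveCtx.sql() after registering the data as a temporary table.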

Loading Hive Tables

Example 9-15. Hive load in Python
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
rows = hiveCtx.sql("SELECT key, value FROM mytable")
# Pull out the first column (key) of every row
keys = rows.map(lambda row: row[0])

Example 9-16. Hive load in Scala
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM mytable")
// Pull out the first column (key) of every row as an Int
val keys = rows.map(row => row.getInt(0))

Example 9-17. Hive load in Java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SchemaRDD;
HiveContext hiveCtx = new HiveContext(sc);
SchemaRDD rows = hiveCtx.sql("SELECT key, value FROM mytable");
// Pull out the first column (key) of every row as an Integer
JavaRDD<Integer> keys = rows.toJavaRDD().map(new Function<Row, Integer>() {
  public Integer call(Row row) { return row.getInt(0); }
});

Parquet

Parquet is a popular column-oriented storage format that stores records with nested fields efficiently.
You can load data with the HiveContext.parquetFile or SQLContext.parquetFile method:

Example 9-18. Parquet load in Python
# Load some data in from a Parquet file with fields name and favouriteAnimal
rows = hiveCtx.parquetFile(parquetFile)
names = rows.map(lambda row: row.name)
print("Everyone")
print(names.collect())

You can also register a Parquet file as a Spark SQL temp table and write queries against it. Example 9-19 continues from Example 9-18, where we loaded the data.

Example 9-19. Parquet query in Python
# Find the panda lovers
tbl = rows.registerTempTable("people")
pandaFriends = hiveCtx.sql("SELECT name FROM people WHERE favouriteAnimal = \"panda\"")
print("Panda friends")
print(pandaFriends.map(lambda row: row.name).collect())

Example 9-20. Parquet file save in Python
pandaFriends.saveAsParquetFile("hdfs://...")

JSON

To load JSON data, all you need to do is call the jsonFile() method on your hiveCtx.

Example 9-21. Input records
{"name": "Holden"}
{"name":"Sparky The Bear", "lovesPandas":true, "knows":{"friends": ["holden"]}}
Example 9-22. Loading JSON with Spark SQL in Python
input = hiveCtx.jsonFile(inputFile)
Example 9-23. Loading JSON with Spark SQL in Scala
val input = hiveCtx.jsonFile(inputFile)
Example 9-24. Loading JSON with Spark SQL in Java
SchemaRDD input = hiveCtx.jsonFile(jsonFile);
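
jsonFile() needs no explicit schema because Spark SQL infers one from the records. The sketch below shows the general idea in plain Python on the two input records from Example 9-21. It is a simplified illustration under stated assumptions, not Spark's actual inference algorithm: the function names and the type labels are hypothetical, a field keeps the type of its first occurrence, and conflicting types are not reconciled.

```python
import json

# The two input records from Example 9-21, one JSON object per line.
lines = [
    '{"name": "Holden"}',
    '{"name":"Sparky The Bear", "lovesPandas":true, "knows":{"friends": ["holden"]}}',
]


def type_name(value):
    """Describe a JSON value's type, recursing into objects and arrays."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        inner = type_name(value[0]) if value else "unknown"
        return "array<%s>" % inner
    if isinstance(value, dict):
        fields = ", ".join(
            "%s: %s" % (key, type_name(val)) for key, val in sorted(value.items())
        )
        return "struct<%s>" % fields
    return "null"


def infer_schema(json_lines):
    """Union the fields seen across all records (first-seen type wins)."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type_name(value))
    return schema


schema = infer_schema(lines)
```

The first record contributes only `name`; the second adds `lovesPandas` and the nested `knows` field. A record that lacks a field, like the first one here, would simply have no value for it when queried.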
