Spark 2.0 SQL Study Notes

Main reference: the official Spark documentation:
http://spark.apache.org/docs/latest/sql-programming-guide.html
This post only gives a rough translation of parts of that guide. Other references include:
DataFrame:
http://blog.csdn.net/cq1982/article/details/45953401
A Tale of Three Apache Spark 2.0 APIs: RDD, DataFrame and Dataset:
http://www.tuicool.com/articles/IjMrmuZ
SparkSession: a brand-new entry point for Spark:
http://blog.csdn.net/lw_ghy/article/details/51471832

Overview

Spark SQL is Spark's module for structured data processing. Unlike the basic Spark RDD API, Spark SQL exposes more information about the structure of the data and of the computation being performed, and it uses that extra information to perform additional optimizations.
You can interact with Spark SQL through SQL and through the Dataset API; whichever API or programming language you use, the same execution engine runs the computation.
One use of Spark SQL is to execute SQL queries; it can also read data from an existing Hive installation. When SQL is run from another programming language, the results come back as a Dataset/DataFrame. You can also interact with the SQL interface through the command line or over JDBC/ODBC.
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL provide Spark with more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations. There are several ways to interact with Spark SQL including SQL and the Dataset API.

When computing a result the same execution engine is used, independent of which API/language you are using to express the computation. This unification means that developers can easily switch back and forth between different APIs based on which provides the most natural way to express a given transformation.

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation.
When running SQL from within another programming language the results will be returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command-line or over JDBC/ODBC.

Datasets and DataFrames

In Spark 2.0, users can convert seamlessly among the three data abstractions RDD, DataFrame and Dataset, using very simple API methods.
A DataFrame is a special Dataset whose rows are weakly typed, generic JVM objects (Row); a Dataset, by contrast, is a collection of strongly typed JVM objects.

A Dataset is a distributed collection of data that was only added in Spark 1.6.
The Dataset API is not yet supported in Python, but because of Python's dynamic nature there are alternative ways to get its benefits. The situation in R is similar.

A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda functions) with the benefits of Spark SQL’s optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.). The Dataset API is available in Scala and Java.

Python does not have the support for the Dataset API. But due to Python’s dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally row.columnName). The case for R is similar.

A DataFrame is a distributed dataset organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python, but with many optimizations under the hood.
DataFrames can be built from a wide range of data sources, such as structured data files, Hive tables, external databases, or existing RDDs.
In the Java API, users need to use Dataset<Row> to represent a DataFrame.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows. In the Scala API, DataFrame is simply a type alias of Dataset[Row].
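
To illustrate the conversion between these representations in Java, here is a minimal sketch. The Person bean is hypothetical and assumes the same name/age columns as the DataFrame df created in the Create DataFrame section below:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Hypothetical JavaBean matching the columns of people.json (name, age)
public static class Person implements java.io.Serializable {
    private String name;
    private Long age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Long getAge() { return age; }
    public void setAge(Long age) { this.age = age; }
}

// DataFrame (Dataset<Row>) -> strongly typed Dataset<Person>
Dataset<Person> people = df.as(Encoders.bean(Person.class));
// typed Dataset<Person> -> back to a DataFrame
Dataset<Row> backToDf = people.toDF();
// Dataset -> RDD of Person objects
JavaRDD<Person> personRdd = people.javaRDD();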

SparkSession: the New Entry Point

In earlier versions of Spark, SparkContext was the entry point into Spark. RDDs, the central API at the time, were created and manipulated through the API that SparkContext provides; for anything beyond RDDs you needed a different context, e.g. StreamingContext for streaming, SQLContext for SQL and HiveContext for Hive. As the Dataset and DataFrame APIs gradually become the new standard, a single entry point is needed to build them, so Spark 2.0 introduces a new one: SparkSession.
SparkSession is essentially a combination of SQLContext and HiveContext (StreamingContext may be added in the future), so any API available on SQLContext or HiveContext is also available on SparkSession. SparkSession wraps a SparkContext internally, so the actual computation is still carried out by that SparkContext.

import org.apache.spark.sql.SparkSession;

        // Named "spark" because that is the variable used in the examples below
        SparkSession spark = SparkSession
                .builder()
                .appName("Java Spark SQL Example")
                //.config("spark.some.config.option", "some-value")
                .master("local")
                .enableHiveSupport()
                .getOrCreate();

Once the SparkSession has been defined, the variable spark is used for all of the operations that follow.
If you run this in Eclipse you may get a JVM heap space error. To fix it, select the class to run, open the menu Run -> Run..., go to the (x)= Arguments tab, enter -Xmx512m in the VM arguments box, then save and run.

Create DataFrame

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

Untyped Dataset Operations

DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.

As mentioned above, in Spark 2.0, DataFrames are just Datasets of Rows in the Scala and Java APIs. These operations are also referred to as “untyped transformations”, in contrast to the “typed transformations” that come with strongly typed Scala/Java Datasets.

Here we include some basic examples of structured data processing using Datasets:
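
For the df loaded from people.json above, the basic untyped operations from the official guide look like this in Java (the column names name and age come from that file):

import static org.apache.spark.sql.functions.col;

// Print the schema in a tree format
df.printSchema();

// Select only the "name" column
df.select("name").show();

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();

// Select people older than 21
df.filter(col("age").gt(21)).show();

// Count people by age
df.groupBy("age").count().show();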

Merging tables

How to combine columns from different DataFrames into a single DataFrame; an insert method is not available.

        //Columns from different DataFrames can be combined this way:
        //take the name column from df and the age column from df2
        Dataset<Row> df3 = df.select(df.col("name"), df2.col("age"));

        //"Inserting into an RDD-based table is not allowed."
        //You cannot insert data into a view created with createOrReplaceTempView
        //spark.sql("insert into table df2view select name from dfview");

Sampling

df.sample(false, 0.5); // sampling: false = without replacement, true = with replacement, 0.5 = sampling fraction
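
The fraction is only an approximate proportion of rows, not an exact count. If a reproducible sample is needed, the overload that takes a seed can be used (a small sketch on the df from above):

// sample(withReplacement, fraction, seed): a fixed seed makes the sample reproducible
Dataset<Row> sampled = df.sample(false, 0.5, 42);
sampled.show();
System.out.println("sampled rows: " + sampled.count());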