Spark修炼之道(进阶篇)——Spark入门到精通:第八节 Spark SQL与DataFrame(一)

本节主要内宾

  1. Spark SQL简介
  2. DataFrame

1. Spark SQL简介

Spark SQL是Spark的五大核心模块之一,用于在Spark平台之上处理结构化数据,利用Spark SQL可以构建大数据平台上的数据仓库,它具有如下特点: 
(1)能够无缝地将SQL语句集成到Spark应用程序当中 
这里写图片描述 
(2)统一的数据访问方式 
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources. 
这里写图片描述 
(3) 兼容Hive 
这里写图片描述 
(4) 可采用JDBC or ODBC连接 
这里写图片描述 
具体见:http://spark.apache.org/sql/

关于Spark SQL的运行原理可参见:http://blog.csdn.net/book_mmicky/article/details/39956809,文章写得非常好 ,这里不再赘述,在此向作者致敬

2. DataFrame

(1)DataFrame简介

本文部分内容译自https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

DataFrames在Spark-1.3.0中引入,主要解决使用Spark RDD API使用的门槛,使熟悉R语言等的数据分析师能够快速上手Spark下的数据分析工作,极大地扩大了Spark使用者的数量,由于DataFrames脱胎自SchemaRDD,因此它天然适用于分布式大数据场景。相信在不久的将来,Spark将是大数据分析的终极归宿。

在Spark中,DataFrame是一种以RDD为基础的分布式数据集,与传统RDBMS的表结构类似。与一般的RDD不同的是,DataFrame带有schema元信息,即DataFrame所表示的表数据集的每一列都带有名称和类型,它对于数据的内部结构具有很强的描述能力。因此Spark SQL可以对藏于DataFrame背后的数据源以及作用于DataFrame之上的变换进行了针对性的优化,最终达到大幅提升运行时效率。

DataFrames具有如下特点:

(1)Ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster(支持单机KB级到集群PB级的数据处理) 
(2)Support for a wide array of data formats and storage systems(支持多种数据格式和存储系统,如图所示) 
这里写图片描述 
(3)State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer(通过Spark SQL Catalyst优化器可以进行高效的代码生成和优化) 
(4)Seamless integration with all big data tooling and infrastructure via Spark(能够无缝集成所有的大数据处理工具) 
(5)APIs for Python, Java, Scala, and R (in development via SparkR)(提供Python, Java, Scala, R语言API)

(2)DataFrame 实战

本节部分内容来自:http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes

将people.json上传到HDFS上,放置在/data目录下,people.json文件内容如下:

<code class="hljs ruby has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">root<span class="hljs-variable" style="color: rgb(102, 0, 102); box-sizing: border-box;">@sparkslave01</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:~</span><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;"># hdfs dfs -cat /data/people.json</span>
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Michael"</span></span>}
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Andy"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">30</span>}
{<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"name"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:<span class="hljs-string" style="box-sizing: border-box;">"Justin"</span></span>, <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"age"</span><span class="hljs-symbol" style="color: rgb(0, 102, 102); box-sizing: border-box;">:</span><span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">19</span>}</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li></ul>

由于json文件中已经包括了列名称的信息,因此它可以直接创建DataFrame

<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">scala> val df = sqlContext.read.json("/data/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

<span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//显示DataFrame完整信息</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.show()
+----+-------+</span>
<span class="hljs-header" style="box-sizing: border-box;">| age|   name|
+----+-------+</span>
|null|Michael|
|  30|   Andy|
<span class="hljs-header" style="box-sizing: border-box;">|  19| Justin|
+----+-------+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li><li style="box-sizing: border-box; padding: 0px 5px;">11</li><li style="box-sizing: border-box; padding: 0px 5px;">12</li><li style="box-sizing: border-box; padding: 0px 5px;">13</li></ul>

这里写图片描述

<code class="hljs cs has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//查看DataFrame元数据信息</span>
scala> df.printSchema()
root
 |-- age: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">long</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>)
 |-- name: <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">string</span> (nullable = <span class="hljs-keyword" style="color: rgb(0, 0, 136); box-sizing: border-box;">true</span>)
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//返回DataFrame某列所有数据</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.select("name").show()
+-------+</span>
<span class="hljs-header" style="box-sizing: border-box;">|   name|
+-------+</span>
|Michael|
|   Andy|
<span class="hljs-header" style="box-sizing: border-box;">| Justin|
+-------+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//DataFrame数据过滤</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.filter(df("age") > 21).show()
+---+----+</span>
<span class="hljs-header" style="box-sizing: border-box;">|age|name|
+---+----+</span>
<span class="hljs-header" style="box-sizing: border-box;">| 30|Andy|
+---+----+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>
<code class="hljs asciidoc has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;"><span class="hljs-comment" style="color: rgb(136, 0, 0); box-sizing: border-box;">//按年龄分组</span>
<span class="hljs-header" style="box-sizing: border-box;">scala> df.groupBy("age").count().show()
+----+-----+</span>
<span class="hljs-header" style="box-sizing: border-box;">| age|count|
+----+-----+</span>
|null|    1|
|  19|    1|
<span class="hljs-header" style="box-sizing: border-box;">|  30|    1|
+----+-----+</span>
</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li><li style="box-sizing: border-box; padding: 0px 5px;">9</li><li style="box-sizing: border-box; padding: 0px 5px;">10</li></ul>
<code class="hljs avrasm has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-radius: 0px; word-wrap: normal; background: transparent;">//注册成表
scala> df<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.registerTempTable</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"people"</span>)
//执行SparkSQL
scala> val teenagers = sqlContext<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span>(<span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"SELECT name, age FROM people WHERE age >= 13 AND age <= 19"</span>)
<span class="hljs-label" style="box-sizing: border-box;">teenagers:</span> org<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.apache</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.spark</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.sql</span><span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.DataFrame</span> = [name: string, age: bigint]
//结果格式化输出
scala> teenagers<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.map</span>(t => <span class="hljs-string" style="color: rgb(0, 136, 0); box-sizing: border-box;">"Name: "</span> + t(<span class="hljs-number" style="color: rgb(0, 102, 102); box-sizing: border-box;">0</span>))<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.collect</span>()<span class="hljs-preprocessor" style="color: rgb(68, 68, 68); box-sizing: border-box;">.foreach</span>(println)
<span class="hljs-label" style="box-sizing: border-box;">Name:</span> Justin</code><ul class="pre-numbering" style="box-sizing: border-box; position: absolute; width: 50px; top: 0px; left: 0px; margin: 0px; padding: 6px 0px 40px; border-right-width: 1px; border-right-style: solid; border-right-color: rgb(221, 221, 221); list-style: none; text-align: right; background-color: rgb(238, 238, 238);"><li style="box-sizing: border-box; padding: 0px 5px;">1</li><li style="box-sizing: border-box; padding: 0px 5px;">2</li><li style="box-sizing: border-box; padding: 0px 5px;">3</li><li style="box-sizing: border-box; padding: 0px 5px;">4</li><li style="box-sizing: border-box; padding: 0px 5px;">5</li><li style="box-sizing: border-box; padding: 0px 5px;">6</li><li style="box-sizing: border-box; padding: 0px 5px;">7</li><li style="box-sizing: border-box; padding: 0px 5px;">8</li></ul>

这里写图片描述

转载: http://blog.csdn.net/lovehuangjiaju/article/details/48661847

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值