Data Sources
Spark SQL supports operating on a variety of data sources through the SchemaRDD interface. A SchemaRDD can be operated on as an ordinary RDD, and it can also be registered as a temporary table. Registering a SchemaRDD as a table lets you run SQL queries over its data. This section describes the various ways to load data into a SchemaRDD:
- RDDs
- Parquet files
- JSON datasets
- Hive tables
## RDDs as a Data Source ##
Spark supports two methods for converting existing RDDs into SchemaRDDs. The first method uses reflection to infer the schema of an RDD that contains a specific type of object. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating SchemaRDDs is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. Although this method is more verbose, it lets you construct SchemaRDDs when the columns and their types are not known until runtime.
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes into a SchemaRDD. The case class defines the schema of the table. The names of the case class's parameters are read via reflection and become the names of the columns. Case classes can be nested or contain complex types such as Seqs or Arrays. Such an RDD can be implicitly converted into a SchemaRDD and then registered as a table, which can then be used in subsequent SQL statements.
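As a loose local illustration of this reflection step (Spark 1.x actually uses Scala reflection over the case class's constructor parameters; plain Java reflection over the compiled class is used here only because it runs without a Spark cluster and shows the same idea), the column names can be recovered from the case class like this:

```scala
// A minimal local sketch, no Spark runtime needed: the case class
// parameters become the table's columns.
case class Person(name: String, age: Int)

// Java reflection on the compiled case class exposes one field per
// constructor parameter; these names would become the column names.
val columnNames = classOf[Person].getDeclaredFields.map(_.getName).toList
println(columnNames)
```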
Setting up the environment
First start HDFS (hadoop-dfs), the Spark cluster, and spark-shell.
Writing the Spark SQL code
1. Create a SQLContext:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@669ee61e
2. Import the implicit conversions used to turn an RDD into a SchemaRDD:
import sqlContext._
3. Define a case class to describe and hold each row of the SQL table:
case class Person(name:String,age:Int)
defined class Person
4. Next, prepare the data. The test data here is the file people.txt, with the following contents:
Michael,29
Andy,30
Justin,19
5. Load the data:
val people = sc.textFile("hdfs://localhost:9000/user/sqlData/people.txt").map(_.split(",")).map(p=> Person(p(0),p(1).trim.toInt))
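The map pipeline above can be sketched locally with an in-memory Seq standing in for the HDFS-backed RDD; the parsing logic (split on commas, trim, convert the age to Int) is identical:

```scala
// Local sketch of the load step, assuming the same three input lines;
// a Seq replaces the RDD, but the two map calls are unchanged.
case class Person(name: String, age: Int)

val lines = Seq("Michael,29", "Andy,30", "Justin,19")
val people = lines.map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
println(people)
```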
6. Register it as a table named people:
people.registerAsTable("people")
At this point people is still a MappedRDD:
scala> people.toDebugString
15/08/24 11:12:10 INFO mapred.FileInputFormat: Total input paths to process : 1
res6: String =
MappedRDD[7] at map at <console>:19 (2 partitions)
MappedRDD[6] at map at <console>:19 (2 partitions)
MappedRDD[5] at textFile at <console>:19 (2 partitions)
HadoopRDD[4] at textFile at <console>:19 (2 partitions)
7. Next, run a SQL query:
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age>= 13 AND age<=19")
15/08/24 11:13:47 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
15/08/24 11:13:47 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
15/08/24 11:13:47 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
15/08/24 11:13:47 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
teenagers: org.apache.spark.sql.SchemaRDD =
SchemaRDD[10] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 13) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174
At this point teenagers has been implicitly converted into a SchemaRDD; let's look at its lineage:
scala> teenagers.toDebugString
res7: String =
SchemaRDD[10] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 13) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174 (2 partitions)
MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:38 (2 partitions)
MapPartitionsRDD[11] at mapPartitions at basicOperators.scala:51 (2 partitions)
MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174 (2 partitions)
MappedRDD[7] at map at <console>:19 (2 partitions)
MappedRDD[6] at map at <console>:19 (2 partitions)
MappedRDD[5] at textFile at <console>:19 (2 partitions)
HadoopRDD[4] at textFile at <console>:19 (2 partitions)
8. Trigger job submission and execution with a collect action:
scala> teenagers.map(t=> "Name:"+t(0)).collect().foreach(println)
Removed TaskSet 1.0, whose tasks have all completed, from pool
15/08/24 11:15:44 INFO scheduler.DAGScheduler: Stage 1 (collect at <console>:20) finished in 0.114 s
15/08/24 11:15:44 INFO spark.SparkContext: Job finished: collect at <console>:20, took 0.316110269 s
Name:Justin
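To make the query's semantics concrete, here is what the SQL predicate `age >= 13 AND age <= 19` and the `"Name:" + t(0)` formatting compute, sketched over the same three rows with plain Scala collections (no Spark runtime):

```scala
// Local analogue of the SQL query plus the collect/format step,
// assuming the three rows from people.txt.
case class Person(name: String, age: Int)

val people = List(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))
val names = people.filter(p => p.age >= 13 && p.age <= 19).map(p => "Name:" + p.name)
names.foreach(println)
```

Only Justin (age 19) satisfies the predicate, which matches the spark-shell output above.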
Hands-on with the Spark SQL DSL
DSL stands for Domain Specific Language. With the DSL we can run SQL-style operations directly on the RDD we read, without registering it as a table.
Exit and restart spark-shell:
bin$ ./spark-shell --master spark://moon:7077
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1f52053e
scala> import sqlContext._
import sqlContext._
scala> case class Person(name: String,age:Int)
defined class Person
scala> val people=sc.textFile("hdfs://localhost:9000/user/sqlData/people.txt").map(_.split(",")).map(p =>Person(p(0),p(1).trim.toInt))
Now we can run the query directly:
scala> val teenagers = people.where('age >= 10).where('age <=19).select('name)
Use toDebugString to inspect the lineage:
scala> teenagers.toDebugString
15/08/24 11:26:10 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String =
SchemaRDD[8] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 10) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174 (2 partitions)
MapPartitionsRDD[10] at mapPartitions at basicOperators.scala:38 (2 partitions)
MapPartitionsRDD[9] at mapPartitions at basicOperators.scala:51 (2 partitions)
MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174 (2 partitions)
MappedRDD[3] at map at <console>:19 (2 partitions)
MappedRDD[2] at map at <console>:19 (2 partitions)
MappedRDD[1] at textFile at <console>:19 (2 partitions)
HadoopRDD[0] at textFile at <console>:19 (2 partitions)
Notice that with the DSL, teenagers has internally been implicitly converted into a SchemaRDD instance.
Print the results:
scala> teenagers.map(t=>"Name:"+t(0)).collect().foreach(println)
Name:Justin
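The chained DSL calls `where('age >= 10).where('age <= 19).select('name)` behave like successive filters followed by a projection. A local sketch over the same three rows (plain Scala collections, no Spark) makes that composition explicit:

```scala
// Each where clause corresponds to a filter; select corresponds to a
// projection (map). Rows assumed to be the same three from people.txt.
case class Person(name: String, age: Int)

val people = List(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))
val teenagers = people.filter(_.age >= 10).filter(_.age <= 19).map(_.name)
println(teenagers)
```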