Data Sources
Spark SQL supports operating on a variety of data sources through the SchemaRDD interface. A SchemaRDD can be operated on as an ordinary RDD, and it can also be registered as a temporary table. Registering a SchemaRDD as a table lets you run SQL queries over its data. This section describes the various ways to load data into a SchemaRDD:
- RDDs
- Parquet files
- JSON datasets
- Hive tables
## RDDs as a Data Source ##
Spark supports two methods for converting existing RDDs into SchemaRDDs. The first method uses reflection to infer the schema of an RDD that contains a specific type of object. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating SchemaRDDs is a programmatic interface that lets you construct a schema and then apply it to an existing RDD. Although this method is more verbose, it lets you construct SchemaRDDs when the columns and their types are not known until runtime.
Inferring the Schema Using Reflection
The Scala interface for Spark SQL supports automatically converting an RDD containing case classes into a SchemaRDD. The case class defines the schema of the table. The names of the case class's parameters are read via reflection and become the names of the columns. Case classes can be nested or contain complex types such as Seqs or Arrays. Such an RDD can be implicitly converted into a SchemaRDD and then registered as a table, which can then be used in subsequent SQL statements.
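As a loose local illustration of this reflection step (Spark 1.x actually uses Scala reflection over the case class's constructor parameters; plain Java reflection over the compiled class is used here only because it runs without a Spark cluster and shows the same idea), the column names can be recovered from the case class like this:

```scala
// A minimal local sketch, no Spark runtime needed: the case class
// parameters become the table's columns.
case class Person(name: String, age: Int)

// Java reflection on the compiled case class exposes one field per
// constructor parameter; these names would become the column names.
val columnNames = classOf[Person].getDeclaredFields.map(_.getName).toList
println(columnNames)
```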
Setting up the environment
First start HDFS (hadoop-dfs), the Spark cluster, and spark-shell.
Writing the Spark SQL code
1. Create a SQLContext:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@669ee61e
2. Import the implicit conversions used to turn an RDD into a SchemaRDD:
import sqlContext._
3. Define a case class to describe and hold each row of the SQL table:
case class Person(name:String,age:Int)
defined class Person
4. Next, prepare the data. The test data here is the file people.txt, with the following contents:
Michael,29
Andy,30
Justin,19
5. Load the data:
val people = sc.textFile("hdfs://localhost:9000/user/sqlData/people.txt").map(_.split(",")).map(p=> Person(p(0),p(1).trim.toInt))
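The map pipeline above can be sketched locally with an in-memory Seq standing in for the HDFS-backed RDD; the parsing logic (split on commas, trim, convert the age to Int) is identical:

```scala
// Local sketch of the load step, assuming the same three input lines;
// a Seq replaces the RDD, but the two map calls are unchanged.
case class Person(name: String, age: Int)

val lines = Seq("Michael,29", "Andy,30", "Justin,19")
val people = lines.map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
println(people)
```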
6. Register it as a table named people:
people.registerAsTable("people")
At this point people is still a MappedRDD:
scala> people.toDebugString
15/08/24 11:12:10 INFO mapred.FileInputFormat: Total input paths to process : 1
res6: String =
MappedRDD[7] at map at <console>:19 (2 partitions)
MappedRDD[6] at map at <console>:19 (2 partitions)
MappedRDD[5] at textFile at <console>:19 (2 partitions)
HadoopRDD[4] at textFile at <console>:19 (2 partitions)
7. Next, run a SQL query:
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age>= 13 AND age<=19")
15/08/24 11:13:47 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
15/08/24 11:13:47 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
15/08/24 11:13:47 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Add exchange
15/08/24 11:13:47 INFO sql.SQLContext$$anon$1: Max iterations (2) reached for batch Prepare Expressions
teenagers: org.apache.spark.sql.SchemaRDD =
SchemaRDD[10] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 13) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174
At this point teenagers has been implicitly converted into a SchemaRDD; let's look at its lineage:
scala> teenagers.toDebugString
res7: String =
SchemaRDD[10] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 13) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174 (2 partitions)
MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:38 (2 partitions)
MapPartitionsRDD[11] at mapPartitions at basicOperators.scala:51 (2 partitions)
MapPartitionsRDD[8] at mapPartitions at basicOperators.scala:174 (2 partitions)
MappedRDD[7] at map at <console>:19 (2 partitions)
MappedRDD[6] at map at <console>:19 (2 partitions)
MappedRDD[5] at textFile at <console>:19 (2 partitions)
HadoopRDD[4] at textFile at <console>:19 (2 partitions)
8. Trigger job submission and execution with a collect action:
scala> teenagers.map(t=> "Name:"+t(0)).collect().foreach(println)
Removed TaskSet 1.0, whose tasks have all completed, from pool
15/08/24 11:15:44 INFO scheduler.DAGScheduler: Stage 1 (collect at <console>:20) finished in 0.114 s
15/08/24 11:15:44 INFO spark.SparkContext: Job finished: collect at <console>:20, took 0.316110269 s
Name:Justin
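To make the query's semantics concrete, here is what the SQL predicate `age >= 13 AND age <= 19` and the `"Name:" + t(0)` formatting compute, sketched over the same three rows with plain Scala collections (no Spark runtime):

```scala
// Local analogue of the SQL query plus the collect/format step,
// assuming the three rows from people.txt.
case class Person(name: String, age: Int)

val people = List(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))
val names = people.filter(p => p.age >= 13 && p.age <= 19).map(p => "Name:" + p.name)
names.foreach(println)
```

Only Justin (age 19) satisfies the predicate, which matches the spark-shell output above.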
Hands-on with the Spark SQL DSL
DSL stands for Domain Specific Language. With the DSL we can run SQL-style operations directly on the RDD we read, without registering it as a table.
Exit and restart spark-shell:
bin$ ./spark-shell --master spark://moon:7077
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@1f52053e
scala> import sqlContext._
import sqlContext._
scala> case class Person(name: String,age:Int)
defined class Person
scala> val people=sc.textFile("hdfs://localhost:9000/user/sqlData/people.txt").map(_.split(",")).map(p =>Person(p(0),p(1).trim.toInt))
Now we can run the query directly:
scala> val teenagers = people.where('age >= 10).where('age <=19).select('name)
Use toDebugString to inspect the lineage:
scala> teenagers.toDebugString
15/08/24 11:26:10 INFO mapred.FileInputFormat: Total input paths to process : 1
res0: String =
SchemaRDD[8] at RDD at SchemaRDD.scala:98
== Query Plan ==
Project [name#0:0]
Filter ((age#1:1 >= 10) && (age#1:1 <= 19))
ExistingRdd [name#0,age#1], MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174 (2 partitions)
MapPartitionsRDD[10] at mapPartitions at basicOperators.scala:38 (2 partitions)
MapPartitionsRDD[9] at mapPartitions at basicOperators.scala:51 (2 partitions)
MapPartitionsRDD[4] at mapPartitions at basicOperators.scala:174 (2 partitions)
MappedRDD[3] at map at <console>:19 (2 partitions)
MappedRDD[2] at map at <console>:19 (2 partitions)
MappedRDD[1] at textFile at <console>:19 (2 partitions)
HadoopRDD[0] at textFile at <console>:19 (2 partitions)
Notice that with the DSL, teenagers has internally been implicitly converted into a SchemaRDD instance.
Print the results:
scala> teenagers.map(t=>"Name:"+t(0)).collect().foreach(println)
Name:Justin
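The chained DSL calls `where('age >= 10).where('age <= 19).select('name)` behave like successive filters followed by a projection. A local sketch over the same three rows (plain Scala collections, no Spark) makes that composition explicit:

```scala
// Each where clause corresponds to a filter; select corresponds to a
// projection (map). Rows assumed to be the same three from people.txt.
case class Person(name: String, age: Int)

val people = List(Person("Michael", 29), Person("Andy", 30), Person("Justin", 19))
val teenagers = people.filter(_.age >= 10).filter(_.age <= 19).map(_.name)
println(teenagers)
```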