1 Introduction to Spark SQL
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and serves as a distributed SQL query engine.
We have already covered Hive, which translates HiveQL into MapReduce jobs and submits them to the cluster, greatly reducing the complexity of writing MapReduce programs; the MapReduce computation model, however, executes slowly. Spark SQL arose to address this: it translates SQL into RDD operations and submits them to the cluster, where they execute far faster.
1.1 Features
- Easy to integrate
- Uniform data access
- Hive compatibility
- Standard data connectivity
2 DataFrames
2.1 Introduction to DataFrames
Like an RDD, a DataFrame is a distributed data container. A DataFrame, however, is closer to a two-dimensional table in a traditional database: besides the data itself, it also records the data's structure, i.e. its schema. Like Hive, DataFrames support nested data types (struct, array, and map). From the standpoint of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Because it resembles the DataFrames of R and pandas, Spark's DataFrame carries over the development experience of traditional single-machine data analysis.
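To illustrate the nested types, here is a minimal sketch with hypothetical case classes, runnable in the spark-shell (where sqlContext.implicits._ is imported automatically):
// Case classes model a nested schema: struct (Address), array (phones), map (tags)
case class Address(city: String, zip: String)
case class Contact(id: Long, phones: Array[String], tags: Map[String, String], addr: Address)

val contactDF = sc.parallelize(Seq(
  Contact(1L, Array("555-0100"), Map("role" -> "admin"), Address("Beijing", "100000"))
)).toDF
contactDF.printSchema   // shows struct, array<string> and map<string,string> columns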
2.2 Creating DataFrames
In Spark SQL, the SQLContext is the entry point for creating DataFrames and executing SQL.
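The spark-shell creates one automatically and binds it to the variable sqlContext; in a standalone application you construct it from a SparkContext, for example:
val sc = new SparkContext(new SparkConf().setAppName("demo"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)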
2.2.1 Test Data
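The test file person.txt holds one comma-separated (id,name,age) record per line; its contents, as confirmed by the collect output in 2.2.3, are:
1,Mike,25
2,John,24
3,Henny,20
4,Mary,23
5,Ken,26
6,Lurxi,25
7,George,24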
2.2.2 Uploading the Data to HDFS
[hadoop@node1 ~]$ ll
total 148296
drwxrwxr-x. 6 hadoop hadoop 95 Oct 16 10:18 apps
-rw-rw-r--. 1 hadoop hadoop 707 Oct 23 10:19 derby.log
drwxrwxr-x. 4 hadoop hadoop 28 Sep 14 19:02 hbase
drwxrwxr-x. 4 hadoop hadoop 32 Sep 14 14:44 hdfsdir
-rw-r--r--. 1 hadoop hadoop 93886005 Oct 16 23:35 hellospark-1.0-SNAPSHOT.jar
-rw-r--r--. 1 hadoop hadoop 57953026 Oct 22 14:29 ipdata.txt
drwxrwxr-x. 5 hadoop hadoop 133 Oct 23 10:19 metastore_db
-rw-r--r--. 1 hadoop hadoop 78 Oct 23 10:22 person.txt
-rw-r--r--. 1 hadoop hadoop 48 Oct 12 17:12 words1.txt
drwxrwxr-x. 4 hadoop hadoop 29 Sep 16 22:58 zookeeper
[hadoop@node1 ~]$ hadoop fs -put person.txt /
2.2.3 Creating a DataFrame in the spark-shell
scala> val rdd = sc.textFile("hdfs://node1:9000/person.txt").map(_.split(","))
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:27
scala> rdd.collect
res1: Array[Array[String]] = Array(Array(1, Mike, 25), Array(2, John, 24),
Array(3, Henny, 20), Array(4, Mary, 23), Array(5, Ken, 26),
Array(6, Lurxi, 25), Array(7, George, 24))
scala> case class Person(id : Long, name : String,age : Int)
defined class Person
scala> val personRDD = rdd.map(x => Person(x(0).toLong,x(1),x(2).toInt))
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[3] at map at <console>:31
scala> personRDD.toDF
res2: org.apache.spark.sql.DataFrame = [id: bigint, name: string, age: int]
scala> val df = personRDD.toDF
df: org.apache.spark.sql.DataFrame = [id: bigint, name: string, age: int]
scala> df.show()
+---+------+---+
| id| name|age|
+---+------+---+
| 1| Mike| 25|
| 2| John| 24|
| 3| Henny| 20|
| 4| Mary| 23|
| 5| Ken| 26|
| 6| Lurxi| 25|
| 7|George| 24|
+---+------+---+
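Beyond show(), printSchema prints the inferred schema as a tree; the output should look roughly like this (Scala Long appears as bigint in the DataFrame signature above):
scala> df.printSchema
root
 |-- id: long (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)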
2.3 Common DataFrame Operations
2.3.1 DSL-Style Syntax
scala> df.select("id","name").show
+---+------+
| id| name|
+---+------+
| 1| Mike|
| 2| John|
| 3| Henny|
| 4| Mary|
| 5| Ken|
| 6| Lurxi|
| 7|George|
+---+------+
scala> df.filter(col("age") > 23).show
+---+------+---+
| id| name|age|
+---+------+---+
| 1| Mike| 25|
| 2| John| 24|
| 5| Ken| 26|
| 6| Lurxi| 25|
| 7|George| 24|
+---+------+---+
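Other DSL operators follow the same pattern; for example, grouping and sorting (a sketch, output omitted):
scala> df.groupBy("age").count().show
scala> df.sort(col("age").desc).show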
2.3.2 SQL-Style Syntax
To use SQL-style syntax, the DataFrame must first be registered as a temporary table:
scala> df.registerTempTable("t_person")
scala> sqlContext.sql("select * from t_person order by age limit 2").show
+---+-----+---+
| id| name|age|
+---+-----+---+
| 3|Henny| 20|
| 4| Mary| 23|
+---+-----+---+
scala> sqlContext.sql("desc t_person").show
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
| id| bigint| |
| name| string| |
| age| int| |
+--------+---------+-------+
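Once the table is registered, any standard SQL works the same way, aggregations included (a sketch, output omitted):
scala> sqlContext.sql("select age, count(*) as cnt from t_person group by age").show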
2.3.3 Writing Results as JSON Strings
scala> case class Person(id:Long,name:String,age:Int)
scala> val rdd = sc.textFile("hdfs://node1:9000/person.txt").map(_.split(","))
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:27
scala> val personRDD = rdd.map(x => Person(x(0).toLong,x(1),x(2).toInt))
personRDD: org.apache.spark.rdd.RDD[Person] = MapPartitionsRDD[3] at map at <console>:31
scala> val personDF = personRDD.toDF
personDF: org.apache.spark.sql.DataFrame = [id: bigint, name: string, age: int]
scala> personDF.select("id","name").show
+---+------+
| id| name|
+---+------+
| 1| Mike|
| 2| John|
| 3| Henny|
| 4| Mary|
| 5| Ken|
| 6| Lurxi|
| 7|George|
+---+------+
scala> personDF.select("id","name").write.json("hdfs://node1:9000/json")
[hadoop@node1 ~]$ hadoop fs -ls /json
Found 3 items
-rw-r--r-- 3 hadoop supergroup 0 2018-10-23 14:47 /json/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 93 2018-10-23 14:47 /json/part-r-00000-cfad9e06-9186-45ab-93b5-3bfce5738eb8
-rw-r--r-- 3 hadoop supergroup 71 2018-10-23 14:47 /json/part-r-00001-cfad9e06-9186-45ab-93b5-3bfce5738eb8
[hadoop@node1 ~]$ hadoop fs -cat /json/p*
{"id":1,"name":"Mike"}
{"id":2,"name":"John"}
{"id":3,"name":"Henny"}
{"id":4,"name":"Mary"}
{"id":5,"name":"Ken"}
{"id":6,"name":"Lurxi"}
{"id":7,"name":"George"}
[hadoop@node1 ~]$
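The JSON output can be loaded straight back into a DataFrame via the read API (a sketch):
scala> val jsonDF = sqlContext.read.json("hdfs://node1:9000/json")
scala> jsonDF.show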
2.4 Running Spark SQL Programmatically
2.4.1 The pom File
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.tzb.com</groupId>
    <artifactId>hellospark</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.10.6</scala.version>
        <spark.version>1.6.3</spark.version>
        <hadoop.version>2.6.4</hadoop.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.47</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-make:transitive</arg>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
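With the shade plugin bound to the package phase, the fat jar is built with Maven:
mvn clean package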
2.4.2 Inferring the Schema via Reflection
package mysparksql

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SQLDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SQLDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Access HDFS as the hadoop user when running from the local IDE
    System.setProperty("user.name", "hadoop")
    // Parse each line of person.txt into a Person case class instance
    val personRdd = sc.textFile("hdfs://node1:9000/person.txt").map(line => {
      val fields = line.split(",")
      Person(fields(0).toLong, fields(1), fields(2).toInt)
    })
    // toDF infers the schema from the Person fields via reflection
    import sqlContext.implicits._
    val personDF = personRdd.toDF
    personDF.registerTempTable("person")
    sqlContext.sql("select * from person where age>23 order by age desc limit 2").show()
    sc.stop()
  }
}

// Defined outside main so that reflection-based schema inference can see it
case class Person(id: Long, name: String, age: Int)
+---+-----+---+
| id| name|age|
+---+-----+---+
| 5| Ken| 26|
| 6|Lurxi| 25|
+---+-----+---+
Packaging the jar
To run on the cluster, remove the hard-coded master from the source (spark-submit supplies it):
val conf = new SparkConf().setAppName("SQLDemo")
Set the main class in the pom's shade-plugin transformer:
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
    <mainClass>mysparksql.SQLDemo</mainClass>
</transformer>
Run it on the cluster:
[hadoop@node1 ~]$ /home/hadoop/apps/spark-1.6.3-bin-hadoop2.6/bin/spark-submit --class mysparksql.SQLDemo --master spark://node1:7077 /home/hadoop/sqldemo.jar
2.4.3 Specifying the Schema Directly with StructType
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.apache.spark.{SparkContext, SparkConf}

object SpecifyingSchema {
  def main(args: Array[String]) {
    // Create a SparkConf and set the application name
    val conf = new SparkConf().setAppName("SQL-2")
    // SQLContext depends on a SparkContext
    val sc = new SparkContext(conf)
    // Create the SQLContext
    val sqlContext = new SQLContext(sc)
    // Create an RDD from the given path; person.txt fields are comma-separated
    val personRDD = sc.textFile(args(0)).map(_.split(","))
    // Specify each field's schema directly with StructType
    val schema = StructType(
      List(
        StructField("id", IntegerType, true),
        StructField("name", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Map the array RDD to an RDD of Rows
    val rowRDD = personRDD.map(p => Row(p(0).toInt, p(1).trim, p(2).toInt))
    // Apply the schema to the row RDD
    val personDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Register a temporary table
    personDataFrame.registerTempTable("t_person")
    // Run SQL
    val df = sqlContext.sql("select * from t_person order by age desc limit 4")
    // Write the result to the given path as JSON
    df.write.json(args(1))
    // Stop the SparkContext
    sc.stop()
  }
}
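A run command following the pattern of 2.4.2 would look roughly like this (the jar name and output path are placeholders; the output directory must not already exist):
[hadoop@node1 ~]$ /home/hadoop/apps/spark-1.6.3-bin-hadoop2.6/bin/spark-submit --class SpecifyingSchema --master spark://node1:7077 /home/hadoop/sqldemo.jar hdfs://node1:9000/person.txt hdfs://node1:9000/schema_out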