Spark SQL Essentials and Hive Integration
- SQL on Hadoop
- The Predecessor of Spark SQL
- Spark SQL Architecture
- How Spark SQL Works
- Spark SQL API (1)
- Spark SQL API (2)
- Spark SQL API (3)
- Spark SQL API (4)
- Spark SQL API (5)
- Spark SQL API (6)
- Spark SQL API (7)
- Spark SQL API (8)
- Spark SQL API (9)
- Spark SQL and External Data Sources (1)
- Spark SQL and External Data Sources (2)
- Spark SQL and External Data Sources (3)
- Spark SQL and External Data Sources (4)
- Spark SQL Functions (1)
- Spark SQL Functions (2)
- Spark SQL Functions (3)
- Spark SQL Functions (4)
- Spark SQL Functions (5)
- Spark SQL CLI
- Spark Performance Tuning (1)
- Spark Performance Tuning (2)
- Spark Performance Tuning (3)
- Spark Performance Tuning (4)
SQL on Hadoop
SQL is the traditional standard for data analysis
- Hive: the original SQL-on-Hadoop solution
- Impala: like Hive, provides a way to write SQL queries against existing Hadoop data
- Presto: similar to Impala, but not backed by a major vendor
- Shark: the predecessor of Spark SQL, designed as a complement to Hive
- Phoenix: an open-source SQL query engine on top of HBase
The Predecessor of Spark SQL
Shark's original goal: run Hive on top of Spark
- It was a retrofit of Hive that inherited a large amount of Hive code, which made optimization and maintenance a major headache
Spark SQL Architecture
Spark SQL is one of Spark's core components (introduced in Spark 1.0, April 2014)
Direct access to existing Hive data
JDBC/ODBC interfaces that let third-party tools process data through Spark
Higher-level interfaces for convenient data handling
Multiple ways to work with data: SQL and programmatic APIs
Multiple external data sources: Parquet, JSON, RDBMS, etc.
How Spark SQL Works
The Catalyst optimizer is the core of Spark SQL
Catalyst Optimizer: converts logical plans into physical plans
Catalyst Optimizer (1)
Logical plan
SELECT name FROM
(
SELECT id, name FROM people
) p
WHERE p.id = 1
is parsed into a logical plan tree
Catalyst Optimizer (2)
Optimization
- 1. In the unoptimized plan, the filter sits above the projection
- 2. Check whether the filter can be pushed down (predicate pushdown)
Catalyst Optimizer (3)
Physical plan
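The plan stages above can be inspected directly with explain(). A minimal sketch, assuming a spark-shell session in which a `people` temporary view with columns `id` and `name` already exists:

```scala
// Inspect Catalyst's work on the query from the previous slide.
val df = spark.sql(
  "SELECT name FROM (SELECT id, name FROM people) p WHERE p.id = 1")

// explain(true) prints the parsed, analyzed and optimized logical plans
// plus the physical plan; in the optimized logical plan the filter on id
// has been pushed below the projection (predicate pushdown).
df.explain(true)
```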
Spark SQL API (1)
SparkContext
SQLContext
- The programming entry point of Spark SQL
HiveContext
- A subclass of SQLContext with additional functionality
SparkSession (recommended since Spark 2.x)
- SparkSession: merges SQLContext and HiveContext
- Provides a single entry point for interacting with Spark, and supports programming Spark with the DataFrame and Dataset APIs
val spark = SparkSession.builder
.master("local[*]") // or your cluster's master URL
.appName("appName")
.getOrCreate()
Notes:
1. Unless stated otherwise, "spark" below refers to a SparkSession instance
2. In spark-shell, "sc" and "spark" are created automatically
Spark SQL API (2)
Dataset (Spark 1.6+)
- A strongly typed collection of domain-specific objects
scala> spark.createDataset(1 to 3).show
scala> spark.createDataset(List(("a",1),("b",2),("c",3))).show
scala> spark.createDataset(sc.parallelize(List(("a",1,1),("b",2,2)))).show
1. createDataset() accepts a Seq, an Array or an RDD
2. The three lines above produce Dataset[Int], Dataset[(String,Int)] and Dataset[(String,Int,Int)] respectively
3. Dataset = RDD + schema, so Datasets share most functions with RDDs, such as map and filter
Example
scala> spark.createDataset(1 to 10)
scala> res0.show
scala> spark.createDataset(List(("a",1),("b",2),("c",3)))
scala> res2.show
scala> sc.parallelize(List(("a",1),("b",2),("c",3)))
scala> spark.createDataset(res6)
scala> res7.show
Spark SQL API (3)
Creating a Dataset with a case class
case class Point(label:String,x:Double,y:Double)
case class Category(id:Long,name:String)
val points=Seq(Point("bar",3.0,5.6),Point("foo",-1.0,3.0)).toDS
val categories=Seq(Category(1,"foo"), Category(2,"bar")).toDS
points.join(categories,points("label")===categories("name")).show
//Typical use case: RDD -> Dataset
case class Point(label:String,x:Double,y:Double)
case class Category(id:Long,name:String)
val pointsRDD=sc.parallelize(List(("bar",3.0,5.6),("foo",-1.0,3.0)))
val categoriesRDD=sc.parallelize(List((1,"foo"),(2,"bar")))
val points=pointsRDD.map(line=>Point(line._1,line._2,line._3)).toDS
val categories=categoriesRDD.map(line=>Category(line._1,line._2)).toDS
points.join(categories,points("label")===categories("name")).show
In Scala, adding the case keyword before class turns the class into a case class. Differences from an ordinary class:
1. Objects can be created without new
2. Serializable by default
3. toString(), equals() and hashCode() are overridden automatically
Example
scala> case class Point(label:String,x:Double,y:Double)
scala> val ds=Seq(Point("hello",21,22),Point("sparksql",23,24)).toDS
scala> ds.show
scala> ds.select("label","x").show
scala> case class Category(id:Long,name:String)
scala> val ds2=Seq(Category(1,"hello"),Category(2,"sparksql")).toDS
scala> ds2.show
scala> ds.join(ds2,ds("label")===ds2("name")).show
scala> sc.parallelize(List(("kb09",2,6),("kb10",3,7)))
scala> res13.toDS
scala> res14.printSchema
scala> res14.map(x=>Point(x._1,x._2,x._3))
scala> res16.show
Spark SQL API (4)
DataFrame (Spark 1.4+)
- DataFrame = Dataset[Row]
- Similar to a two-dimensional table in a traditional database
- Adds a schema (structural information) on top of an RDD
- DataFrame schemas support the nested data types struct, map and array
- Provides more SQL-like APIs
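The three nested types can be combined in one schema. A minimal sketch for spark-shell (field names and values are illustrative):

```scala
// A DataFrame schema using struct, map and array nested types.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("name", StringType),                          // plain column
  StructField("address", StructType(Array(                  // struct
    StructField("city", StringType),
    StructField("street", StringType)))),
  StructField("scores", MapType(StringType, IntegerType)),  // map
  StructField("phones", ArrayType(StringType))))            // array

val rows = sc.parallelize(Seq(
  Row("Alice", Row("Nanjing", "Main St"), Map("math" -> 90), Seq("123", "456"))))
val df = spark.createDataFrame(rows, schema)
df.printSchema   // shows the nested struct, map and array fields
```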
Spark SQL API (5)
RDD vs. DataFrame
Person attributes: Name, Age, Height
Spark SQL API (6)
Creating a DataFrame
/** Convert a JSON file into a DataFrame
* people.json contains:
* {"name":"Michael"}
* {"name":"Andy", "age":30}
* {"name":"Justin", "age":19}
*/
val df = spark.read.json("file:///home/hadoop/data/people.json")
// Print the DataFrame's contents with show
df.show
Spark SQL API (7)
Common DataFrame API operations
val df = spark.read.json("file:///home/hadoop/data/people.json")
// Print the DataFrame's schema with printSchema
df.printSchema()
// Pick the columns we need with select
df.select("name").show()
// Select columns, adding 1 to the age column
df.select(df("name"), df("age") + 1).show()
// Filter rows with filter
df.filter(df("age") > 21).show()
// Group with groupBy, then count each group
df.groupBy("age").count().show()
// Run SQL queries with sql()
// (registerTempTable is deprecated since Spark 2.0; prefer createOrReplaceTempView)
df.registerTempTable("people")
spark.sql("SELECT * FROM people").show
Example
scala> spark.read.format("json").option("header","true").load("file:///opt/kb09file/users.json")
scala> res24.printSchema
scala> res24.show
scala> val df=spark.read.format("json").option("header","true").load("file:///opt/kb09file/users.json")
scala> df.select("name").show
scala> df.select(df("name"),df("Age")+1).show
scala> df.filter(df("Age")>21).show
scala> df.groupBy("Age").count().show
Spark SQL API (8)
RDD -> DataFrame
- people.txt
people.txt:
Michael,29
Andy,30
Justin,19
//Method 1: infer the schema via reflection, using a case class
case class Person(name:String,age:Int)
import spark.implicits._
val people=sc.textFile("file:///home/hadoop/data/people.txt")
.map(_.split(","))
.map(p => Person(p(0), p(1).trim.toInt)).toDF()
people.show
people.registerTempTable("people")
val teenagers = spark.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
teenagers.show()
//Method 2: specify the schema programmatically (no case class needed)
val people=sc.textFile("file:///home/hadoop/data/people.txt")
// Define the DataFrame's schema as a string
val schemaString = "name age"
// Import the required classes
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}
// Build the DataFrame's schema from the schema string
val schema = StructType(schemaString.split(" ").map(fieldName =>StructField(fieldName,StringType, true)))
// Convert the RDD's records to Rows
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
// Apply the schema to the RDD
val peopleDataFrame = spark.createDataFrame(rowRDD, schema)
// Register the DataFrame as a temporary table
peopleDataFrame.registerTempTable("people")
val results = spark.sql("SELECT name FROM people")
results.show
Example
package nj.zb.kb09
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.{DataFrame, SparkSession}
case class Person(name:String,age:Int)
object a {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("a").getOrCreate()
val sc: SparkContext = spark.sparkContext
//Method 1: convert the RDD to a DataFrame via reflection
import spark.implicits._
val frame: DataFrame = sc.textFile("in/people.txt").map(_.split(",")).map(x=>Person(x(0),x(1).toInt)).toDF()
frame.printSchema()
frame.show()
//Method 2: convert the RDD to a DataFrame with an explicit schema
val people: RDD[String] = sc.textFile("in/people.txt")
val schemaString="name age"
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
// import org.apache.spark.sql.types._
//choose the field type by name
val schema: StructType =
StructType(schemaString.split(" ")
.map(fieldName=>{
if(fieldName.equals("name"))
StructField(fieldName,StringType,true)
else
StructField(fieldName,IntegerType,true)}
))
// val schema: StructType =
// StructType(schemaString.split(" ")
// .map(fieldName=>StructField(fieldName,StringType,true)))
val rowRDD: RDD[Row] = people.map(_.split(",")).map(x=>Row(x(0),x(1).toInt))
val peopleDataFrame: DataFrame = spark.createDataFrame(rowRDD,schema)
peopleDataFrame.printSchema()
peopleDataFrame.show()
//Temporary view (created if absent, replaced if present)
peopleDataFrame.createOrReplaceTempView("aa")
spark.sql("select * from aa")
//Temporary table (registerTempTable is deprecated since Spark 2.0)
peopleDataFrame.registerTempTable("aaa")
//Convert the DataFrame back to an RDD
val rdd: RDD[Row] = peopleDataFrame.rdd
}
}
Output:
Spark SQL API (9)
DataFrame -> RDD
/** people.json contains:
* {"name":"Michael"}
* {"name":"Andy", "age":30}
* {"name":"Justin", "age":19}
*/
val df = spark.read.json("file:///home/hadoop/data/people.json")
//Convert the DF to an RDD
df.rdd.collect
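Besides df.rdd, a DataFrame can also be turned back into a strongly typed Dataset with as[T]. A sketch using the same people.json (note that spark.read.json infers integers as LongType, and age is missing for Michael, so it is modeled as Option[Long]):

```scala
// DataFrame -> typed Dataset via a case class
case class Person(name: String, age: Option[Long])
import spark.implicits._

val ds = df.as[Person]                    // Dataset[Person]
ds.filter(p => p.age.exists(_ > 20)).show // typed lambda instead of Column expressions
```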
Spark SQL and External Data Sources (1)
External data sources supported by Spark SQL
Spark SQL and External Data Sources (2)
Parquet files: a popular columnar storage format; data is stored in binary, and each file contains both the data and its metadata
//Writing Parquet files with Spark SQL
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType,ArrayType,IntegerType}
val schema=StructType(Array(StructField("name",StringType),
StructField("favorite_color",StringType),
StructField("favorite_numbers",ArrayType(IntegerType))))
val rdd=sc.parallelize(List(("Alyssa",null,Array(3,9,15,20)),("Ben","red",null)))
val rowRDD=rdd.map(p=>Row(p._1,p._2,p._3))
val df=spark.createDataFrame(rowRDD,schema)
df.write.parquet("/data/users") //writes Parquet files into this directory
//Reading Parquet files with Spark SQL
val df=spark.read.parquet("/data/users") //a directory containing Parquet files
df.show
df.printSchema
Example
package nj.zb.kb09.sql
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
object ParDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("ParquetDemo").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val list=List(
("zhangsan","red",Array(3,4,5)),
("lisi","black",Array(12,14,55)),
("wangwu","orange",Array(23,64,15)),
("zhaoliu","red",Array(13,84,15)))
val rdd1: RDD[(String, String, Array[Int])] = sc.parallelize(list)
val schema = StructType(
Array(
StructField("name", StringType),
StructField("color", StringType),
StructField("numbers", ArrayType(IntegerType))
))
val rowRdd: RDD[Row] = rdd1.map(x=>Row(x._1,x._2,x._3))
val df: DataFrame = spark.createDataFrame(rowRdd,schema)
df.show()
df.write.parquet("out/color")
val frame: DataFrame = spark.read.parquet("out/color")
frame.printSchema()
frame.show()
}
}
Output:
Spark SQL and External Data Sources (3)
Hive tables
#Create a Hive table
hive>create table toronto(full_name string, ssn string, office_address string);
hive>insert into toronto(full_name, ssn, office_address) values('John S. ', '111-222-333 ', '123 Yonge Street ');
//With Hive integration in place, Hive tables can be accessed directly from spark-shell
val df=spark.table("toronto")
df.printSchema
df.show
val spark = SparkSession.builder()
.config("spark.sql.warehouse.dir", warehouseLocation) // warehouseLocation points at your warehouse directory
.enableHiveSupport()
.getOrCreate()
val df = spark.sql("select * from toronto")
df.filter($"ssn".startsWith("111")).write.saveAsTable("t1")
//$"ssn" is equivalent to col("ssn") and df("ssn")
Integrating Spark SQL with Hive:
1. Copy hive-site.xml into ${SPARK_HOME}/conf
2. Check that hive.metastore.uris is correct
3. Start the metastore service: hive --service metastore
4. Create your own SparkSession, configuring the warehouse location and enabling Hive support
Example
Linux version
- Start Hadoop
[root@hadoop100 ~]# start-all.sh
- Start Spark
[root@hadoop100 ~]# cd /opt/spark245/sbin/
[root@hadoop100 sbin]# ./start-all.sh
[root@hadoop100 sbin]# jps
- Copy Hive's hive-site.xml into Spark
[root@hadoop100 sbin]# cd /opt/hive/conf
[root@hadoop100 conf]# cp hive-site.xml /opt/spark245/conf/
[root@hadoop100 conf]# cd ../../spark245/conf
[root@hadoop100 conf]# ll
- Start the Hive metastore
[root@hadoop100 conf]# nohup hive --service metastore &
- Press Enter once more to get the prompt back
- Check the processes
[root@hadoop100 conf]# jps
- Start Hive
[root@hadoop100 conf]# hive
- Use the default database, create a table and insert data
hive> show databases;
hive> use default;
hive> show tables;
hive> create table toronto(full_name String,ssn String,office_address String);
hive> insert into toronto(full_name,ssn,office_address) values('John S.','111-222-333','123 Yonge Street');
hive> select * from toronto;
- Start spark-shell
[root@hadoop100 ~]# spark-shell
- Connect to Hive
scala> val df=spark.table("toronto")
scala> df.printSchema
scala> df.show
Maven project version
- Add these two dependencies
//pick versions that match your own environment
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>
- Full dependency list
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.1</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.17</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.21</version>
</dependency>
<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.38</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.1.1</version>
</dependency>
</dependencies>
- Connect to Hive
package nj.zb.kb09.sql
import org.apache.spark.sql.{DataFrame, SparkSession}
object SparksqlOnHiveDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("sparkOnHiveDemo")
  .config("hive.metastore.uris","thrift://192.168.136.100:9083")
  .enableHiveSupport()
  .getOrCreate()
//Print all databases
spark.sql("show databases").collect().foreach(println)
//Spark connects to Hive's default database by default
println("-----------------------------")
val df: DataFrame = spark.sql("select * from toronto")
df.printSchema()
df.show()
println("-------------------------------")
//To query a table in another database, qualify it as db.table (see hello.abc below)
/*
Alternatively, switch the current database:
spark.sql("use shopping")
spark.sql("select * from vw_store_review").show()
*/
val df2: DataFrame = spark.sql("select * from hello.abc")
df2.printSchema()
df2.show()
df.where(df("ssn").startsWith("111")).show()
df.filter(df("ssn").startsWith("111")).show()
}
}
Output:
Spark SQL and External Data Sources (4)
RDBMS tables
$spark-shell --jars /opt/spark/ext_jars/mysql-connector-java-5.1.38.jar
val url = "jdbc:mysql://localhost:3306/metastore"
val tableName = "TBLS"
// Set the connection user, password and JDBC driver class
val prop = new java.util.Properties
prop.setProperty("user","hive")
prop.setProperty("password","mypassword")
prop.setProperty("driver","com.mysql.jdbc.Driver")
// Read the table into a DataFrame
val jdbcDF = spark.read.jdbc(url,tableName,prop)
jdbcDF.show
//Save the DF as a new table
jdbcDF.write.mode("append").jdbc(url,"t1",prop)
Example
Linux version
- Put the driver jar mysql-connector-java-5.1.38.jar into Spark's jars directory
[root@hadoop100 ~]# cd /opt/spark245/jars
- Start spark-shell
[root@hadoop100 conf]# spark-shell
- Connect to MySQL
scala> val url="jdbc:mysql://192.168.136.100:3306/hive"
scala> val tbname="TBLS"
scala> val prop=new java.util.Properties
scala> prop.setProperty("user","root")
scala> prop.setProperty("password","ok")
scala> prop.setProperty("driver","com.mysql.jdbc.Driver")
scala> val jdbcDF=spark.read.jdbc(url,tbname,prop)
scala> jdbcDF.printSchema
Maven project version
package nj.zb.kb09.sql
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
object SparksqlOnMysqlDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("sparksqlOnMysql").getOrCreate()
val url="jdbc:mysql://192.168.136.100:3306/hive"
val user="root"
val pwd="ok"
val driver="com.mysql.jdbc.Driver"
val prop=new Properties()
prop.setProperty("user",user)
prop.setProperty("password",pwd)
prop.setProperty("driver",driver)
val df: DataFrame = spark.read.jdbc(url,"TBLS",prop)
df.printSchema()
df.show()
df.where(df("CREATE_TIME").startsWith("159")).show()
df.groupBy("DB_ID").count().show()
}
}
Output:
Spark SQL Functions (1)
Built-in functions (org.apache.spark.sql.functions)
Category | Examples |
---|---|
Aggregate functions | countDistinct, sumDistinct |
Collection functions | sort_array, explode |
Date/time functions | hour, quarter, next_day |
Math functions | asin, atan, sqrt, tan, round |
Window functions | row_number |
String functions | concat, format_number, regexp_extract |
Other functions | isnan, sha1, randn, callUDF |
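The row_number window function deserves a quick sketch, since it needs a WindowSpec rather than groupBy. Column names and data below are illustrative; assumes spark-shell (where spark.implicits._ is imported automatically):

```scala
// Rank rows within each gender partition, youngest-to-oldest reversed.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val stuDF = Seq(("zhangsan", "F", 20), ("lisi", "M", 16), ("wangwu", "M", 21))
  .toDF("name", "gender", "age")

val w = Window.partitionBy("gender").orderBy(desc("age"))
stuDF.withColumn("rank", row_number().over(w)).show
```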
Spark SQL Functions (2)
Using the built-in functions
//Simulated user access log
val accessLog = Array(
"2016-12-27,001",
"2016-12-27,001",
"2016-12-27,002",
"2016-12-28,003",
"2016-12-28,004",
"2016-12-28,002",
"2016-12-28,002",
"2016-12-28,001"
)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
//Build an RDD from the collection
val accessLogRDD = sc.parallelize(accessLog).map(row => {
val splited = row.split(",")
Row(splited(0), splited(1).toInt)
})
//Define the DataFrame's structure
val structTypes = StructType(Array(
StructField("day", StringType, true),
StructField("userId", IntegerType, true)
))
//Create the DataFrame from the data plus the schema
val accessLogDF = spark.createDataFrame(accessLogRDD, structTypes)
//Import Spark SQL's built-in functions
import org.apache.spark.sql.functions._
//Page views (pv) per day
accessLogDF.groupBy("day").agg(count("userId").as("pv")).select("day", "pv").collect.foreach(println)
//Unique visitors (uv) per day
accessLogDF.groupBy("day").agg(countDistinct('userId).as("uv")).select("day", "uv").collect.foreach(println)
Example
- Example 1
package nj.zb.kb09.sql
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
object InnerFunctionDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("InnerFunctionDemo").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val accessLog = Array(
"2016-12-27,001",
"2016-12-27,001",
"2016-12-27,002",
"2016-12-28,003",
"2016-12-28,004",
"2016-12-28,002",
"2016-12-28,002",
"2016-12-28,001"
)
val schema = StructType(Array(
StructField("day", StringType,true),
StructField("userID", IntegerType,true)
))
val rdd: RDD[Row] = sc.parallelize(accessLog).map(_.split(",")).map(x=>Row(x(0),x(1).toInt))
val df: DataFrame = spark.createDataFrame(rdd,schema)
df.printSchema()
df.show()
import org.apache.spark.sql.functions._
df.groupBy("day").agg(count("userID").as("pv")).collect().foreach(println)
println("-------------------")
df.groupBy("day").agg(countDistinct("userID").as("uv")).collect().foreach(println)
}
}
Output:
- Example 2
package nj.zb.kb09.sql
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
case class Student(id:Integer,name:String,gender:String,age:Integer)
object InnerFunctionDemo2 {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("InnerFunctionDemo").getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._
val sc: SparkContext = spark.sparkContext
val stuDF: DataFrame = Seq(
Student(1001, "zhangsan", "F", 20),
Student(1002, "lisi", "M", 16),
Student(1003, "wangwu", "M", 21),
Student(1004, "zhaoliu", "F", 21),
Student(1005, "zhouqi", "M", 22),
Student(1006, "qianba", "M", 19),
Student(1007, "liuliu", "F", 23)
).toDF()
stuDF.printSchema()
stuDF.show()
stuDF.groupBy("gender").agg(count("age")).show()
stuDF.groupBy("gender").agg(max("age")).show()
stuDF.groupBy("gender").agg(min("age")).show()
stuDF.groupBy("gender").agg("age"->"max","age"->"min","age"->"avg","id"->"count").show()
stuDF.groupBy("gender","age").count().show()
}
}
Output:
Spark SQL Functions (3)
User-defined functions (UDFs)
- 1. Define the function
- 2. Register the function
  SparkSession.udf.register(): usable only inside sql()
  functions.udf(): also works with the DataFrame API
- 3. Call the function
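The two registration routes can be put side by side. A sketch assuming a DataFrame `df` with a string column `name` (the UDF, view name and column are illustrative):

```scala
// Route 1: functions.udf() yields a Column function for the DataFrame API
import org.apache.spark.sql.functions.udf
val strLen = udf((s: String) => s.length)
df.select(df("name"), strLen(df("name")).as("name_len")).show

// Route 2: SparkSession.udf.register() makes the function callable from sql()
spark.udf.register("str_len", (s: String) => s.length)
df.createOrReplaceTempView("t")
spark.sql("select name, str_len(name) as name_len from t").show
```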
Spark SQL Functions (4)
A UDF example
- Requirement: count the number of hobbies per user
- Input format
hobbies.txt:
alice jogging,Coding,cooking
lina travel,dance
- Output format:
alice jogging,Coding,cooking 3
lina travel,dance 2
Spark SQL Functions (5)
A UDF example (continued)
case class Hobbies(name: String, hobbies: String)
val info = sc.textFile("/data/hobbies.txt")
//Import the implicit conversions manually, otherwise the RDD cannot be converted to a DF
import spark.implicits._
val hobbyDF = info.map(_.split("\t")).map(p => Hobbies(p(0), p(1))).toDF
hobbyDF.show
hobbyDF.registerTempTable("hobbies")
//Register the UDF; note that it is an anonymous function
spark.udf.register("hobby_num", (s: String) => s.split(',').size)
spark.sql("select name, hobbies, hobby_num(hobbies) as hobby_num from hobbies").show
Example
- hobbies.txt
alice jogging,Coding,cooking
lina travel,dance
- Code
package nj.zb.kb09.sql
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
case class Hobbies(name:String,hobbies:String)
object SparkUDFDemo {
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder().master("local[*]").appName("SparkUDFDemo").getOrCreate()
import spark.implicits._
val sc: SparkContext = spark.sparkContext
val rdd: RDD[String] = sc.textFile("in/hobbies.txt")
val df: DataFrame = rdd.map(x=>x.split(" ")).map(x=>Hobbies(x(0),x(1))).toDF()
df.printSchema()
df.show()
df.registerTempTable("hobbies")
spark.udf.register("hobby_num",(v:String)=>v.split(",").size)
val frame: DataFrame = spark.sql("select name,hobbies,hobby_num(hobbies) as hobbynum from hobbies")
frame.show()
}
}
Output:
Spark SQL CLI
The Spark SQL CLI is a convenient tool that runs the Hive metastore service in local mode and executes queries typed on the command line
Note that the Spark SQL CLI cannot talk to the Thrift JDBC server
The Spark SQL CLI plays the same role as the Hive CLI (old CLI) and Beeline (new CLI)
To start the Spark SQL CLI, run the following from the Spark directory:
./bin/spark-sql
Copy hive-site.xml, hdfs-site.xml and core-site.xml into $SPARK_HOME/conf first
Spark SQL example
$spark-sql
spark-sql> show databases;
default
spark-sql> show tables;
default toronto false
spark-sql> select * from toronto where ssn like '111%';
John S. 111-222-333 123 Yonge Street
spark-sql> create table montreal(full_name string, ssn string, office_address string);
spark-sql> insert into montreal values('Winnie K. ', '111-222-333 ', '62 John Street');
spark-sql> select t.full_name, m.ssn, t.office_address, m.office_address from toronto t inner join montreal m on t.ssn = m.ssn;
John S. 111-222-333 123 Yonge Street 62 John Street
Spark Performance Tuning (1)
Serialization
- Java serialization: Spark's default
- Kryo serialization: roughly 10x faster than Java serialization, but does not support every serializable type
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//Register your own classes with Kryo
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
If the classes to be serialized are not registered, Kryo still works, but it stores the full class name with every object, which often wastes more space than default Java serialization
Spark Performance Tuning (2)
Use arrays of objects and primitive types instead of Java/Scala collection classes (such as HashMap)
Avoid nested structures
Prefer numeric keys over strings
Use MEMORY_ONLY_SER for larger RDDs
When loading CSV or JSON, load only the fields you need
Persist intermediate results (RDD/DS/DF) only when necessary
Avoid generating unnecessary intermediate results (RDD/DS/DF)
DataFrames execute roughly 3x faster than Datasets
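The MEMORY_ONLY_SER tip above looks like this in practice. A sketch for spark-shell; the file path is illustrative:

```scala
// Persist a large, reused RDD in serialized form to save memory.
import org.apache.spark.storage.StorageLevel

val bigRDD = sc.textFile("in/big.txt")         // illustrative path
  .map(_.split(","))
bigRDD.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized in-memory storage
val n = bigRDD.count()                         // first action materializes the cache
bigRDD.unpersist()                             // release when no longer needed
```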
Spark Performance Tuning (3)
Custom RDD partitioning and spark.default.parallelism
- This parameter sets the default number of tasks per stage
Broadcast large variables instead of referencing them directly
Try to process data locally and minimize data transfer across worker nodes
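Broadcasting a large read-only variable ships it to each executor once, instead of once per task closure. A minimal sketch with an illustrative lookup map:

```scala
// Broadcast a lookup map once per executor.
val lookup = Map(1 -> "foo", 2 -> "bar")   // imagine this is large
val bc = sc.broadcast(lookup)

val rdd = sc.parallelize(Seq(1, 2, 1))
// Tasks read bc.value instead of capturing `lookup` in their closures.
rdd.map(id => bc.value.getOrElse(id, "unknown")).collect()
bc.destroy()                               // free it when done
```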
Spark Performance Tuning (4)
Table joins
- Include predicates for all tables
select * from t1 join t2 on t1.name = t2.full_name
where t1.name = 'mike' and t2.full_name = 'mike'
- Put the largest table first
- Broadcast the smallest table
- Minimize the number of joins
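Broadcasting the smallest table can be requested explicitly with the broadcast hint. A sketch assuming two DataFrames, a large `t1` and a small `t2` (names are illustrative):

```scala
// Ask Catalyst to broadcast the small side of the join.
import org.apache.spark.sql.functions.broadcast

val joined = t1.join(broadcast(t2), t1("name") === t2("full_name"))
joined.explain()   // the physical plan should show a BroadcastHashJoin
```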