Spark SQL
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and acts as a distributed SQL query engine. In Hive, Hive SQL is translated into MapReduce jobs that are submitted to the cluster; this greatly simplifies writing MapReduce programs, but the MapReduce computation model is relatively slow. Spark SQL arose to address this: it translates SQL into RDD operations and submits them to the cluster, so execution is much faster.
SparkConf and SparkSession
SparkContext
Every Spark program starts with a SparkContext. Initializing a SparkContext requires a SparkConf object, which carries the various configuration parameters for the Spark cluster.
Once initialized, the SparkContext's methods can be used to create and manipulate RDDs and shared variables.
Creating a SparkContext, method 1: via SparkConf
val conf = new SparkConf().setMaster("local[*]").setAppName("sparkTest")
val sc = SparkContext.getOrCreate(conf)
Creating a SparkContext, method 2: via SparkSession
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
RDD, DataSet (DS), DataFrame (DF)
DataFrame and DataSet are two new abstractions that Spark SQL provides on top of RDD. How do they differ from RDD?
DataSet vs. RDD: an RDD knows nothing about the columns of its own elements, whereas a DataSet knows the data's columns and their types.
DataFrame vs. DataSet: the element type of a DataSet is determined by the actual data (for example a case class), whereas the element type of a DataFrame is fixed to Row.
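To make the distinction concrete, here is a minimal, self-contained sketch (object and case class names are illustrative) contrasting how the three abstractions see their elements:
import org.apache.spark.sql.SparkSession
object AbstractionSketch extends App {
  case class Point(label: String, x: Double, y: Double)
  val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
  import spark.implicits._
  //RDD[(String, Double, Double)]: Spark knows nothing about column names or types
  val rdd = spark.sparkContext.makeRDD(List(("bar", 3.0, 4.0), ("foo", 2.0, 2.5)))
  //Dataset[Point]: strongly typed, field access is checked at compile time
  val ds = rdd.map(t => Point(t._1, t._2, t._3)).toDS()
  ds.map(p => p.x).show()
  //DataFrame = Dataset[Row]: column names are known, but every element is a generic Row
  val df = ds.toDF()
  df.select($"x" + 1).show()
  spark.stop()
}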
Converting an RDD to a DS
import org.apache.spark.sql.SparkSession
object CreateDataSetDemo {
case class Point(label:String,x:Double,y:Double)
case class Category(id:Long,name:String)
def main(args: Array[String]): Unit = {
//1. Create a SparkSession object
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
//Import implicits for RDD-to-Dataset conversions
import spark.implicits._
//2. Create RDDs
val pointRDD = sc.makeRDD(List(("bar",3.0,4.0),("foo",2.0,2.5)))
val categoryRDD = sc.makeRDD(List((1,"foo"),(2,"bar")))
//3. Create Datasets from the RDDs
val ds1 = pointRDD.toDS()
ds1.show()
println("------------------")
val pointDS = pointRDD.map(x=>Point(x._1,x._2,x._3)).toDS()
val categoryDS = categoryRDD.map(x=>Category(x._1,x._2)).toDS()
pointDS.show()
categoryDS.show()
val df = pointDS.join(categoryDS,pointDS("label")===categoryDS("name"))
df.show()
spark.stop()
}
}
(The same applies to DF; see the sketch below.)
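A short fragment, assuming the pointRDD, categoryRDD, and the spark.implicits._ import from the listing above are in scope:
//Sketch: the same RDDs converted to DataFrames; tuple columns are named explicitly
val pointDF = pointRDD.toDF("label", "x", "y")
val categoryDF = categoryRDD.toDF("id", "name")
pointDF.printSchema()
pointDF.join(categoryDF, pointDF("label") === categoryDF("name")).show()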
Demo: loading data and converting it to a DS
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
object test01 extends App {
case class Order(id:String,date:String,customerId:String,status:String)
case class Product(id:String,categoryId:String,name:String,description:String,price:String,image:String)
case class Customer(id:String,fname:String,lname:String,email:String,password:String,street:String,city:String,state:String,zipcode:String)
case class Order_Item(id:String,order_id:String,product_id:String,quantity:String,subtotal:String,product_price:String)
//1. Create a SparkSession object
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
//Import implicits
import spark.implicits._
private val orderRDD: RDD[String] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\orders.csv")
private val productRDD: RDD[String] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\products.csv")
private val customerRDD: RDD[String] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\customers.csv")
private val order_ItemRDD: RDD[String] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\order_items.csv")
private val orderDS: Dataset[Order] = orderRDD.map(x => {
val fields = x.split(",").map(y => y.replace("\"", ""))
Order(fields(0), fields(1), fields(2), fields(3))
}).toDS()
private val productDS: Dataset[Product] = productRDD.map(x => {
val fields = x.split(",").map(y => y.replace("\"", ""))
Product(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
}).toDS()
private val customerDS: Dataset[Customer] = customerRDD.map(x => {
val fields = x.split(",").map(y => y.replace("\"", ""))
Customer(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5), fields(6), fields(7), fields(8))
}).toDS()
private val order_ItemDS: Dataset[Order_Item] = order_ItemRDD.map(x => {
val fields = x.split(",").map(y => y.replace("\"", ""))
Order_Item(fields(0), fields(1), fields(2), fields(3), fields(4), fields(5))
}).toDS()
order_ItemDS.show()
customerDS.show()
productDS.show()
orderDS.show()
spark.stop()
}
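Once loaded, the Datasets can be joined like tables. A brief sketch, assuming the orderDS and order_ItemDS from the listing above are in scope:
//Sketch: join orders with their line items and pick a few columns
orderDS.join(order_ItemDS, orderDS("id") === order_ItemDS("order_id"))
  .select(orderDS("id"), orderDS("date"), order_ItemDS("subtotal"))
  .show()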
Reading a JSON file and converting it to a DF
import org.apache.spark.sql.{DataFrame, SparkSession}
object CreateDataFrameDemo extends App {
//1. Create a SparkSession object
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
//Import implicits
import spark.implicits._
//Read the JSON file with spark.read to produce a DataFrame
private val jsontoDF: DataFrame = spark.read.json("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\people.json")
jsontoDF.show()
//Print the schema information
/**
* root
* |-- age: long (nullable = true)
* |-- name: string (nullable = true)
*    field type        whether nullable
*/
jsontoDF.printSchema()
//select query
jsontoDF.select("name").show()
jsontoDF.select("age").show()
//Select multiple columns
jsontoDF.select(jsontoDF("name"),jsontoDF("age")).show()
//age+1
jsontoDF.select(jsontoDF("name"),jsontoDF("age")+1).show()
//Another form, using the $ column syntax
println("------------")
jsontoDF.select($"name",$"age"+1).show()
//filter, like WHERE in SQL
jsontoDF.filter($"age">20).show()
//groupBy
jsontoDF.groupBy("age").count().show()
//Register the DataFrame as a temporary view
jsontoDF.createOrReplaceTempView("people")
private val df2: DataFrame = spark.sql("select * from people where age>20")
df2.show()
//A global temp view lives in the global_temp database and is visible to other sessions
jsontoDF.createOrReplaceGlobalTempView("p1")
spark.newSession().sql("select * from global_temp.p1 where age>20").show()
spark.stop()
}
Converting an RDD to a DF via a case class
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object DataFrameDemo02 extends App {
case class People(name:String,age:Int)
//1. Create a SparkSession object
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
//Import implicits
import spark.implicits._
private val peopleRDD: RDD[Array[String]] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\people.txt")
.map(_.split(","))
//Convert the RDD to a DF via the case class
private val peopleDF: DataFrame = peopleRDD.map(x=>People(x(0),x(1).trim.toInt)).toDF()
peopleDF.printSchema()
peopleDF.show()
spark.stop()
}
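The conversion also works in the other direction: a DataFrame can be turned back into a typed Dataset with as[T], provided the column names match the case class fields. A fragment assuming the peopleDF and the implicits import from the listing above:
//Sketch: DataFrame back to a strongly typed Dataset
val peopleDS = peopleDF.as[People]
peopleDS.filter(_.age > 20).show()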
Creating a DF from a schema
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession, types}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
/**
* @author WGY
*/
object createDataFrameBySchame extends App {
//1. Create a SparkSession object
val spark = SparkSession.builder().master("local[*]").appName("testSparkSQL").getOrCreate()
val sc = spark.sparkContext
//Import implicits
import spark.implicits._
private val textRDD: RDD[Array[String]] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\people.txt").map(_.split(","))
//Define the schema: name, age
private val schema = StructType(Array(
StructField("name", StringType, true),
StructField("age", IntegerType, true)
))
//Map the RDD elements to Row objects
private val mapRDD: RDD[Row] = textRDD.map(x=>Row(x(0),x(1).trim.toInt))
//Convert the RDD into a DataFrame
private val df1: DataFrame = spark.createDataFrame(mapRDD,schema)
df1.printSchema()
df1.show()
println("--------------")
//DataFrame->RDD
private val r1: RDD[Row] = df1.rdd
println(r1.collect().mkString(" "))
spark.stop()
}
Converting a DF to an RDD
/** people.json contains:
* {"name":"Michael"}
* {"name":"Andy", "age":30}
* {"name":"Justin", "age":19}
*/
val df = spark.read.json("file:///home/hadoop/data/people.json")
//Convert the DF to an RDD
df.rdd.collect
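Each element of the resulting RDD is a Row, so fields must be extracted by name or position. A small fragment assuming the df above:
//Sketch: pull the name field out of each Row
df.rdd.map(row => row.getAs[String]("name")).collect.foreach(println)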
Spark SQL with external data sources
Reading a Parquet file via a DF
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._
object TestParquet extends App {
private val spark: SparkSession = SparkSession.builder().master("local[4]").appName("Demo2").getOrCreate()
import spark.implicits._
private val sc: SparkContext = spark.sparkContext
private val rdd1: RDD[(String, String, Array[Int])] = sc.parallelize(List(
("zhangsan", "green", Array(3, 5, 6, 9)),
("lisi", "red", Array(3, 5, 6, 10)),
("wangwu", "black", Array(3, 5, 6, 11)),
("zhaoliu", "yellow", Array(3, 5, 6, 12)),
("wuqi", "green", Array(3, 5, 6, 13))
))
private val structType = StructType(Array(
StructField("name", StringType),
StructField("color", StringType),
StructField("numbers", ArrayType(IntegerType))
))
private val rowRDD: RDD[Row] = rdd1.map(x=>Row(x._1,x._2,x._3))
private val df: DataFrame = spark.createDataFrame(rowRDD,structType)
//Write and read Parquet files
// df.write.parquet("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\user")
private val parquetRDD: DataFrame = spark
.read
.parquet("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\user")
parquetRDD.printSchema()
parquetRDD.show()
spark.stop()
}
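Note that the commented-out write above fails if the target directory already exists; a save mode can be passed to control that behaviour. A sketch with an illustrative output path:
//Sketch: write the DataFrame as Parquet, overwriting any existing output
df.write.mode("overwrite").parquet("file:///tmp/user_parquet")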
Connecting Spark to a database (SparkToMySQL)
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
object SparkToMySQL extends App{
private val spark: SparkSession = SparkSession.builder().master("local[2]").appName("mysql").getOrCreate()
val url = "jdbc:mysql://hadoop1:3306/test"
val tableName="student" //table name in MySQL
private val prop = new Properties()
prop.setProperty("user","kb07")//用户名
prop.setProperty("password","ok")//密码
prop.getProperty("driver","com.mysql.jdbc.Driver")
//连接
private val mysqlDF: DataFrame = spark.read.jdbc(url,tableName,prop)
mysqlDF.printSchema()
mysqlDF.show()
spark.stop()
}
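Writing goes through DataFrameWriter.jdbc in the same way; a sketch reusing the connection settings above (the target table name is illustrative):
//Sketch: append the DataFrame to a MySQL table, creating it if it does not exist
mysqlDF.write.mode("append").jdbc(url, "student_copy", prop)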
Connecting Spark to Hive (HiveOnSpark)
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SparkSession}
object HiveOnSpark extends App {
private val spark: SparkSession = SparkSession.builder()
.master("local[4]")
.appName("Demo2")
.enableHiveSupport()
.config("hive.metastore.uris","thrift://192.168.226.101:9083")
.getOrCreate()
import spark.implicits._
private val sc: SparkContext = spark.sparkContext
private val df1: DataFrame = spark.sql("select * from toronto")
df1.printSchema()
df1.show()
spark.stop()
}
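With Hive support enabled, a DataFrame can also be written back as a Hive table; a sketch (the target table name is illustrative):
//Sketch: save the query result as a new Hive table
df1.write.mode("overwrite").saveAsTable("toronto_backup")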
Spark SQL built-in functions
package SQL0813
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
object testInterFunction extends App {
case class Student(id:Int, name:String,sex:String,age:Int)
private val spark: SparkSession = SparkSession.builder().master("local[4]").appName("Demo2").getOrCreate()
import spark.implicits._
private val sc: SparkContext = spark.sparkContext
//Import the Spark SQL built-in functions
import org.apache.spark.sql.functions._
private val accessLog = Array(
"2020-08-13,1",
"2020-08-13,1",
"2020-08-13,2",
"2020-08-13,2",
"2020-08-13,3",
"2020-08-13,3",
"2020-08-14,1",
"2020-08-14,1",
"2020-08-14,2",
"2020-08-14,3",
"2020-08-15,1",
"2020-08-15,1",
"2020-08-15,2",
"2020-08-15,3"
)
private val accessLogRDD: RDD[Row] = sc.parallelize(accessLog).map(x => {
val strings = x.split(",")
Row(strings(0), strings(1).toInt)
})
private val structType = StructType(Array(
StructField("day", StringType),
StructField("userID", IntegerType)
))
private val logDF: DataFrame = spark.createDataFrame(accessLogRDD,structType)
logDF.printSchema()
logDF.show()
//Page views (PV) per day
//agg aggregation
logDF.groupBy("day").agg(count("userID").as("pv")).show()
//Distinct visitors (UV) per day
logDF.groupBy("day").agg(countDistinct("userID").as("uv")).show()
//The same query via SQL
logDF.createOrReplaceTempView("logs")
spark.sql(
"""
|select day,count(distinct userID) as uv
|from logs
|group by day
""".stripMargin
).show()
private val stuDF: DataFrame = Seq(
Student(1001, "zhangsan", "F", 20),
Student(1002, "zhangsan1", "M", 18),
Student(1003, "zhangsan2", "F", 20),
Student(1004, "zhangsan3", "F", 20),
Student(1005, "zhangsan4", "M", 20),
Student(1006, "zhangsan5", "F", 20),
Student(1007, "zhangsan6", "M", 20),
Student(1008, "zhangsan7", "F", 25),
Student(1009, "zhangsan8", "M", 20),
Student(1010, "zhangsan9", "F", 20)
).toDF()
stuDF.printSchema()
stuDF.show()
//Average age grouped by sex
stuDF.groupBy("sex").agg(avg("age")).show()
stuDF.groupBy("sex").avg("age").show()
//avg, max, and min age grouped by sex
stuDF.groupBy("sex").agg("age"->"avg","age"->"max","age"->"min").show()
//Group by both sex and age
stuDF.groupBy("sex","age").count().show()
//Sort by age
stuDF.sort("age").show()
stuDF.sort($"age".desc).show()
//Sort by age ascending, then by id descending
stuDF.sort($"age",$"id".desc).show()
//orderBy (an alias for sort), age descending
stuDF.orderBy($"age".desc).show()
spark.stop()
}
Spark SQL user-defined functions (UDF)
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
object TestUGFDemo extends App {
private val spark: SparkSession = SparkSession.builder().master("local[4]").appName("Demo2").getOrCreate()
import spark.implicits._
private val sc: SparkContext = spark.sparkContext
//Import the Spark SQL built-in functions
import org.apache.spark.sql.functions._
private val info: RDD[String] = sc.textFile("file:///D:\\IntelliJ IDEA 2018.2.2\\Project\\testscala\\spark\\src\\main\\scala\\data\\sparkSQL\\hobbies.txt")
private val hobbyDF: DataFrame = info.map(_.split("\t")).map(x=>Hobbies(x(0),x(1))).toDF()
hobbyDF.createOrReplaceTempView("hobby")
//Register the UDF and use it in SQL
spark.udf.register("hobby_num",(s:String)=>s.split(",").size)
spark.sql(
"""
|select name,hobbies,hobby_num(hobbies) as hobbyNum
|from hobby
""".stripMargin
).show()
spark.stop()
}
case class Hobbies(name:String, hobbies:String)
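spark.udf.register makes the function visible to SQL. The same logic can also be wrapped with org.apache.spark.sql.functions.udf and used directly in the DataFrame DSL; a fragment assuming the hobbyDF and implicits import from the listing above:
//Sketch: the same UDF, usable from the DataFrame API instead of SQL
import org.apache.spark.sql.functions.udf
val hobbyNum = udf((s: String) => s.split(",").length)
hobbyDF.withColumn("hobbyNum", hobbyNum($"hobbies")).show()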