Spark SQL Operations on External Data Sources
1. Creating a Dataset
Dataset = RDD + Schema
The argument to createDataset() can be a Seq, an Array, or an RDD:
scala> spark.createDataset(1 to 3).show
scala> spark.createDataset(List(("a",1),("b",2),("c",3))).show
scala> spark.createDataset(sc.parallelize(List(("a",1,1),("b",2,2)))).show
Output:
2. Creating a Dataset/DataFrame from a Case Class
Contents of the file src/data/people:
张三,13
李四,15
import org.apache.spark.sql.SparkSession

object DataFrame {
  // Case class that defines the schema
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sparkSession").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    // Build a DataFrame from the text file (use .toDS() instead to get a Dataset[Person])
    val people = sc.textFile("src/data/people")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt)).toDF()
    // Register a temporary view so the data can be queried with SQL
    people.createOrReplaceTempView("people")
    val teenagers = spark.sql("SELECT name, age FROM people WHERE age >= 14 AND age <= 19")
    teenagers.show()
  }
}
Output:
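The same query can also be written with the DataFrame API instead of SQL; a minimal sketch reusing the people DataFrame built above (no temporary view required):

// Equivalent filter using the untyped DataFrame API
people.filter($"age" >= 14 && $"age" <= 19)
  .select("name", "age")
  .show()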
3. Reading and Writing Parquet with Spark SQL
Parquet is a popular columnar storage format; files are stored in binary form and contain both the data and its metadata.
// Required imports
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}

object Parquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sparkSession").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    // Define the schema of the DataFrame
    val schema = StructType(Array(StructField("name", StringType),
      StructField("favorite_color", StringType),
      StructField("favorite_numbers", ArrayType(IntegerType))))
    // Build an RDD from a collection
    val rdd = sc.parallelize(List(("Alyssa", "blue", Array(3, 6)), ("Ben", "red", Array(2, 8))))
    // Convert each element of the RDD into a Row
    val rowRDD = rdd.map(p => Row(p._1, p._2, p._3))
    // Apply the schema to the RDD of Rows
    val df = spark.createDataFrame(rowRDD, schema)
    // If no format is specified, the default output format is Parquet
    df.write.save("src/data/parquet")
    // Explicitly specify the Parquet format
    df.write.mode("overwrite").parquet("src/data/parquet")
    // Read the Parquet files under that directory
    val df1 = spark.read.parquet("src/data/parquet")
    df1.show
  }
}
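Because the schema is stored in the Parquet footer alongside the data, it does not have to be redefined when the files are read back. A minimal sketch; the partitionBy call and its output path are illustrative extras, not part of the original example:

// The schema travels with the data, so printSchema works without any StructType on read
spark.read.parquet("src/data/parquet").printSchema()
// Columnar storage also pairs well with partitioned writes, e.g. one directory per color
df.write.mode("overwrite").partitionBy("favorite_color").parquet("src/data/parquet_by_color")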
4. Converting a JSON File into a DataFrame
// Contents of the JSON file
{"id":1,"name":"Michael","age":20}
{"id":2,"name":"Andy", "age":30}
{"id":3,"name":"Justin", "age":19}
import org.apache.spark.sql.SparkSession

object Json {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sparkSession").getOrCreate()
    val sc = spark.sparkContext
    // Load the JSON file into a DataFrame (one JSON object per line)
    val df = spark.read.json("src/data/json.txt") // file path
    df.show()
  }
}
Output:
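spark.read.json expects one JSON object per line, as in the file above; a file containing a single pretty-printed document can be read by enabling the multiLine option. A minimal sketch (the file path is an assumption):

// Read a multi-line (pretty-printed) JSON document instead of line-delimited JSON
val multi = spark.read.option("multiLine", "true").json("src/data/json_multiline.json")
multi.printSchema() // the schema is inferred from the JSON structure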
5. Loading a CSV Data Source
/* File preview
id|name|age
1| darren |18
2|anne|18
3|"test"|18
4|'test2'|18
*/
import org.apache.spark.sql.SparkSession

object Csv {
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("sparkSession").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._
    val df = spark.read.format("csv")
      .option("delimiter", "|")      // field delimiter, "," by default
      .option("header", "true")      // treat the first line as the header rather than data
      .option("quote", "'")          // quote character, double quote (") by default
      .option("nullValue", "\\N")    // string that should be read as null
      .option("inferSchema", "true") // infer column types automatically
      .load("src/data/CSV")
    df.show()
  }
}
Output:
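The Person case class declared in Csv is not used above; one way to put it to work is to select the matching columns and convert the DataFrame into a typed Dataset. A minimal sketch (column names follow the header shown in the preview):

// Turn the untyped DataFrame into a Dataset[Person]; relies on import spark.implicits._
val people = df.select($"name", $"age".cast("int").as("age")).as[Person]
people.show()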
6. Integrating Spark SQL with Hive
Option 1: access Hive tables from spark-shell
1. Copy hive-site.xml to ${SPARK_HOME}/conf
2. Copy the MySQL JDBC driver JAR to ${SPARK_HOME}/jars
Query result:
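For example, once the two files above are in place, Hive tables can be queried straight from the spark-shell prompt; a minimal sketch using the sale table from the IDEA example below:
scala> spark.sql("show tables").show
scala> spark.sql("select * from sale").show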
Option 2: access Hive from IDEA
1. Add hive-site.xml to the project's resources directory
2. Add the dependencies to pom.xml
Add the spark-hive dependency and the MySQL JDBC driver; their versions must match the Spark and MySQL versions in use (see the sketch below).
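The exact pom.xml entries depend on the build; as a rough guide, the coordinates involved are shown here in sbt form instead of Maven XML (versions are placeholders to adjust to your cluster):

// sbt-style equivalent of the pom.xml dependencies (versions are placeholders)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.x.x",          // must match the Spark version in use
  "org.apache.spark" %% "spark-hive" % "2.x.x",          // same version as spark-sql
  "mysql"            %  "mysql-connector-java" % "5.1.x" // JDBC driver for the MySQL-backed metastore
)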
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object Hive {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() connects the session to the Hive metastore described in hive-site.xml
    val spark = SparkSession.builder().master("local[1]").enableHiveSupport()
      .appName("sparkSession").getOrCreate()
    val df = spark.sql("select * from sale")
    // Aggregate sales per shop
    df.groupBy("shop").agg(sum("money")).show
  }
}
Output:
7. Integrating Spark SQL with MySQL
import org.apache.spark.sql.SparkSession

object Mysql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]")
      .appName("sparkSession").getOrCreate()
    val url = "jdbc:mysql://192.168.37.200:3306/kb10mysqltestdb"
    val tableName = "course"
    // Connection user, password and JDBC driver class
    val prop = new java.util.Properties
    prop.setProperty("user", "root")
    prop.setProperty("password", "javakb10")
    prop.setProperty("driver", "com.mysql.jdbc.Driver")
    // Read the table into a DataFrame over JDBC
    val jdbcDF = spark.read.jdbc(url, tableName, prop)
    jdbcDF.show
    // Save the DataFrame into a new table t1, appending if it already exists
    jdbcDF.write.mode("append").jdbc(url, "t1", prop)
  }
}
Output:
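Reading a large table through a single JDBC connection can be slow; spark.read.jdbc also has an overload that splits the read into parallel partitions over a numeric column. A minimal sketch (the id column and its bounds are assumptions about the course table):

// Parallel JDBC read: 4 partitions over the numeric column "id" between 1 and 1000
val partitionedDF = spark.read.jdbc(
  url, tableName,
  columnName = "id", lowerBound = 1L, upperBound = 1000L, numPartitions = 4,
  connectionProperties = prop)
partitionedDF.show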