1. Spark SQL
1.1. Spark SQL Overview
1.1.1. What is Spark SQL
Spark SQL is the Spark module for processing structured data. It provides a programming abstraction called DataFrame and also acts as a distributed SQL query engine.
1.1.2. Why learn Spark SQL
We have already learned Hive, which translates Hive SQL into MapReduce jobs and submits them to the cluster for execution, greatly reducing the complexity of writing MapReduce programs. However, the MapReduce computation model is relatively slow, so Spark SQL emerged: it translates SQL into RDD operations and submits them to the cluster, where they execute much faster. Its main advantages are:
1. Easy to integrate
2. Uniform data access
3. Hive compatibility
4. Standard data connectivity
1.2. DataFrames
1.2.1. What is a DataFrame (introduced in Spark 1.3)
Like an RDD, a DataFrame is a distributed data container. However, a DataFrame is more like a two-dimensional table in a traditional database: besides the data itself, it also records the structure of the data, i.e. the schema. Like Hive, DataFrames also support nested data types (struct, array and map). From an API usability point of view, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Because it resembles the DataFrames of R and Pandas, Spark's DataFrame carries over the development experience of traditional single-machine data analysis.
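As an illustration of the schema and the nested types mentioned above, a minimal sketch of a StructType that combines struct, array and map fields might look like this (all field names here are made up for the example):

import org.apache.spark.sql.types._

// a schema with a nested struct, an array and a map column (illustrative field names)
val nestedSchema = StructType(List(
  StructField("name", StringType),
  StructField("address", StructType(List(
    StructField("city", StringType),
    StructField("zip", StringType)
  ))),
  StructField("phones", ArrayType(StringType)),
  StructField("scores", MapType(StringType, IntegerType))
))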
1.2.2. Creating a DataFrame
1. Create a local file with three columns, id, name and age, separated by spaces, then upload it to HDFS:
hdfs dfs -put person.txt /
2. In the spark shell, run the following command to read the data and split each line on the column separator (a space):
val lineRDD = sc.textFile("hdfs://node1.xiaoniu.com:9000/person.txt").map(_.split(" "))
3. Define a case class (this serves as the table's schema):
case class Person(id:Int, name:String, age:Int)
4. Associate the RDD with the case class:
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
5. Convert the RDD into a DataFrame:
val personDF = personRDD.toDF
6. Work with the DataFrame:
personDF.show
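Back in the same shell, the DataFrame can also be queried with SQL; a minimal sketch, assuming the sqlContext instance provided by the Spark 1.x shell (the table name t_person is chosen here just for illustration):

// register the DataFrame as a temporary table and query it with SQL
personDF.registerTempTable("t_person")
sqlContext.sql("SELECT id, name, age FROM t_person WHERE age > 20 ORDER BY age DESC").show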
1.3. Common DataFrame Operations
1.3.1. SQLDemo1
In older Spark versions (1.x), Spark SQL is used through a SQLContext, as shown below.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object SQLDemo1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SQLDemo1").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // the implicits must be imported before calling toDF
    import sqlContext.implicits._
    // load the data: first create an RDD, attach a schema, then convert the RDD into a DataFrame
    val lines = sc.parallelize(List("laoduan 99 30", "laozhao 9999 28", "laoyang 99 28", "laoxue 98 26"))
    // RDD -> DataFrame
    val boyRDD: RDD[Boy] = lines.map(line => {
      val fields = line.split(" ")
      val n = fields(0)
      val f = fields(1).toDouble
      val a = fields(2).toInt
      Boy(n, f, a)
    })
    val boyDF: DataFrame = boyRDD.toDF
    // register the DataFrame as a table, then query it with SQL
    boyDF.registerTempTable("t_boy")
    // write the SQL
    val df1: DataFrame = sqlContext.sql("SELECT * FROM t_boy ORDER BY fv DESC, age ASC")
    // view the result (triggers an Action)
    df1.show()
    sc.stop()
  }
}

// case class describing the schema; serializable so it can be shipped to the executors
case class Boy(name: String, fv: Double, age: Int) extends Serializable
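With the four sample rows above, df1.show() prints roughly the following (sorted by fv descending, then age ascending). Note that in Spark 2.x registerTempTable is deprecated in favor of createOrReplaceTempView.

+-------+------+---+
|   name|    fv|age|
+-------+------+---+
|laozhao|9999.0| 28|
|laoyang|  99.0| 28|
|laoduan|  99.0| 30|
| laoxue|  98.0| 26|
+-------+------+---+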
1.3.2. SQLDemo2
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object SQLDemo2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SQLDemo2").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // create a SQLContext
    val sqlContext = new SQLContext(sc)
    // load the data: first create an RDD of Rows, define a schema, then combine them into a DataFrame
    val lines = sc.parallelize(List("laoduan 99 30", "laozhao 9999 28", "laoyang 99 28", "laoxue 98 26"))
    // RDD -> RDD[Row]
    val rowRdd: RDD[Row] = lines.map(line => {
      val fields = line.split(" ")
      val n = fields(0)
      val f = fields(1).toDouble
      val a = fields(2).toInt
      Row(n, f, a)
    })
    val schema = StructType(
      List(
        StructField("name", StringType),
        StructField("fv", DoubleType),
        StructField("age", IntegerType)
      )
    )
    // associate the RDD[Row] with the schema via createDataFrame
    val df: DataFrame = sqlContext.createDataFrame(rowRdd, schema)
    // register the DataFrame as a table, then query it with SQL
    df.registerTempTable("t_boy")
    val df1: DataFrame = sqlContext.sql("SELECT * FROM t_boy ORDER BY fv DESC, age ASC")
    // view the result (triggers an Action)
    df1.show()
    sc.stop()
  }
}
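The difference from SQLDemo1 is that the schema is described with an explicit StructType over an RDD[Row] instead of a case class. That pays off when the columns are only known at runtime; a small illustrative sketch with the same imports as above (the column list here is made up):

// build a schema from column names and types that are only known when the job runs
val cols = List(("name", StringType), ("fv", DoubleType), ("age", IntegerType))
val dynamicSchema = StructType(cols.map { case (n, t) => StructField(n, t) })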
1.3.3. SQLDemo3
A newer Spark SQL feature: writing queries in DSL style instead of SQL strings.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.{SparkConf, SparkContext}

object SQLDemo3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SQLDemo3").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // load the data: first create an RDD, define a schema, then convert the RDD into a DataFrame
    val lines = sc.parallelize(List("laoduan 99 30", "laozhao 9999 28", "laoyang 99 28", "laoxue 98 26"))
    // RDD -> RDD[Row]
    val rowRdd: RDD[Row] = lines.map(line => {
      val fields = line.split(" ")
      val n = fields(0)
      val f = fields(1).toDouble
      val a = fields(2).toInt
      Row(n, f, a)
    })
    val schema = StructType(
      List(
        StructField("name", StringType),
        StructField("fv", DoubleType),
        StructField("age", IntegerType)
      )
    )
    // combine the RDD[Row] with the schema
    val df1: DataFrame = sqlContext.createDataFrame(rowRdd, schema)

    // DSL style: call methods on the DataFrame directly
    //val df2: DataFrame = df1.select("name", "fv")
    //val df3: DataFrame = df2.where("fv > 98")
    import sqlContext.implicits._
    //val df4: DataFrame = df3.orderBy($"fv" desc)
    val r = df1.select("name", "fv").where("fv > 98").orderBy($"fv" desc)
    r.show()
    sc.stop()
  }
}
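The string-based select/where calls above can also be written with Column expressions; a sketch of an equivalent query against the same df1:

// equivalent DSL query using Column expressions instead of string conditions
val r2 = df1.select($"name", $"fv").where($"fv" > 98).orderBy($"fv".desc)
r2.show()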
1.3.4. DataSetWordCount
Word count written with the Spark SQL DSL.
In Spark 2.0, the entry point for working with Dataset, DataFrame and SQL is SparkSession.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object DataSetWordCount {
  def main(args: Array[String]): Unit = {
    // in Spark 2.0, SparkSession is the entry point for Dataset, DataFrame and SQL
    val session: SparkSession = SparkSession.builder()
      .appName("DataSetWordCount")
      .master("local[*]")
      .getOrCreate()
    // specify where to read the data from
    val lines: Dataset[String] = session.read.textFile(args(0))
    // import the implicit conversions
    import session.implicits._
    val words: Dataset[String] = lines.flatMap(_.split(" "))
    import org.apache.spark.sql.functions._
    //val r: DataFrame = words.groupBy($"value" as "word").agg(count("*") as "counts").sort($"counts" desc)
    // the first count aggregates within each group; a second count() would count how many rows the DataFrame has
    //val r = words.groupBy($"value" as "word").count().count() // 5
    val r = words.groupBy($"value" as "word").count().withColumnRenamed("count", "counts").sort($"counts" desc)
    r.show()
    session.stop()
  }
}
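A quick way to see why the withColumnRenamed call is needed: the Dataset produced by groupBy(...).count() names its aggregate column count, which is then renamed to counts before sorting. A minimal check in the same session:

// before the rename: word (string), count (long)
words.groupBy($"value" as "word").count().printSchema()
// after the rename used above: word (string), counts (long)
r.printSchema()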
1.3.5. SQLWordCount
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object SQLWordCount {
  def main(args: Array[String]): Unit = {
    val session: SparkSession = SparkSession.builder()
      .appName("SQLWordCount")
      .master("local[*]")
      .getOrCreate()
    // specify where to read the data from
    val lines: Dataset[String] = session.read.textFile(args(0))
    // import the implicit conversions
    import session.implicits._
    val words: Dataset[String] = lines.flatMap(_.split(" "))
    // second approach: use SQL
    // create a temporary view
    words.createTempView("v_words")
    val r: DataFrame = session.sql("SELECT value word, COUNT(*) counts FROM v_words GROUP BY word ORDER BY counts DESC")
    r.show()
    session.stop()
  }
}
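One caveat: createTempView fails if a view with the same name already exists in the session; when the code may run more than once per session, createOrReplaceTempView is the safer call:

// replaces the view if it already exists instead of throwing an exception
words.createOrReplaceTempView("v_words")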
1.3.6. DataSourceDemo1
Spark SQL can read files in multiple formats and write them back out in multiple formats.

import org.apache.spark.sql.{DataFrame, SparkSession}

object DataSourceDemo1 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSourceDemo1")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    // specify where to read the data from
    //val lines: DataFrame = spark.read.text(args(0))
    val lines: DataFrame = spark.read.csv(args(0))
    val df = lines.withColumnRenamed("_c0", "name").withColumnRenamed("_c1", "age")
    //val result: Dataset[Row] = lines.select("name", "age").where("age >= 13")
    //result.write.json(args(1))
    // a write mode can be set: overwrite, append, and so on
    //result.write.mode("overwrite").csv(args(1))
    df.write.parquet(args(1))
    //df.show()

    spark.stop()
  }
}
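Instead of renaming the default _c0/_c1 columns, the CSV can also be read with an explicit schema; a minimal sketch, assuming the same two-column name/age file:

import org.apache.spark.sql.types._

// read the CSV with an explicit schema so the columns are already named and typed
val csvSchema = StructType(List(
  StructField("name", StringType),
  StructField("age", IntegerType)
))
val named: DataFrame = spark.read.schema(csvSchema).csv(args(0))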
1.3.7. ParquetSource
Reading data stored as Parquet gives the best processing performance.

import org.apache.spark.sql.{DataFrame, SparkSession}

object ParquetSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetSource")
      .master("local[*]")
      .getOrCreate()
    val lines: DataFrame = spark.read.parquet(args(0))
    lines.show()
    spark.stop()
  }
}
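Because Parquet is a columnar format, Spark only reads the columns a query actually touches; a small sketch (the name column is an assumption about the data written in the previous section):

// selecting a single column only reads that column from the Parquet files
spark.read.parquet(args(0)).select("name").show()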
1.3.8. JDBCSource
Read data from a relational database and write the results back into a database.

import java.util.Properties

import org.apache.spark.sql.{DataFrame, SparkSession}

object JDBCSource {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("JDBCSource")
      .master("local[*]")
      .getOrCreate()
    // read from the database: first set up the JDBC connection options
    val logs: DataFrame = spark.read.format("jdbc").options(
      Map("url" -> "jdbc:mysql://localhost:3306/bigdata",
        "driver" -> "com.mysql.jdbc.Driver",
        "dbtable" -> "logs",
        "user" -> "root",
        "password" -> "123568")
    ).load()

    val filtered = logs.where("age >= 13")

    // put the connection properties into a Properties object
    val prop = new Properties()
    prop.put("user", "root")
    prop.put("password", "123568")
    filtered.write.mode("append").jdbc("jdbc:mysql://localhost:3306/bigdata", "bigdata.logs", prop)
    //logs.show()

    spark.stop()
  }
}
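The read side can also use the jdbc() shortcut with the same Properties object, mirroring the write call; a sketch under the same connection settings, assuming the MySQL driver is on the classpath:

// equivalent read using the jdbc() shortcut and the Properties object
val logs2: DataFrame = spark.read.jdbc("jdbc:mysql://localhost:3306/bigdata", "logs", prop)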
1.3.9. IPLocation
Find the province each IP address belongs to. Note that the function must be registered with a name: spark.udf.register("ip2Long", ip2Long).

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object IPLocation_SQL {

  // convert a dotted-decimal IP string into a Long
  val ip2Long = (ip: String) => {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IPLocation_SQL")
      .master("local[*]")
      .getOrCreate()
    // import the implicit conversions
    import spark.implicits._
    // read the IP rules first
    val ipLines: Dataset[String] = spark.read.textFile(args(0))
    // parse the IP rule data
    val ruleDF: DataFrame = ipLines.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    }).toDF("start_num", "end_num", "province")

    // read and parse the access log data
    val logLine: Dataset[String] = spark.read.textFile(args(1))
    val ipDF: DataFrame = logLine.map(line => {
      val fields = line.split("[|]")
      val ip = fields(1)
      ip
    }).toDF("ip")

    ruleDF.createTempView("v_rules")
    ipDF.createTempView("v_logs")

    // register the function with a name so it can be used in SQL
    spark.udf.register("ip2Long", ip2Long)

    val r = spark.sql("SELECT province FROM v_logs JOIN v_rules ON ip2Long(ip) >= start_num AND ip2Long(ip) <= end_num")

    r.createTempView("v_temp")
    r.show(10)

    spark.stop()
  }
}
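Once registered, the UDF can be sanity-checked directly in SQL before running the full join. Also note that the join condition is a range comparison rather than an equality, so Spark falls back to a nested-loop style join; it works, but can be slow on large logs. A one-line check:

// quick check of the registered UDF on a literal IP address
spark.sql("SELECT ip2Long('1.2.3.4')").show()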