Lesson 70: Spark SQL Built-in Functions Explained, with Hands-on Practice (Study Notes)
Topics for this session:
1. Overview of Spark SQL's built-in functions
2. Hands-on practice with Spark SQL's built-in functions
Spark SQL's DataFrame API ships with a large number of built-in functions. These functions generally support CG (code generation), so they are heavily optimized at both compile time and execution time.
Question: Is using Spark SQL to operate on Hive the same thing as Hive on Spark?
=> No. When Spark SQL operates on Hive, Hive serves only as the data-warehouse source; the compute engine is Spark SQL itself. Hive on Spark is a subproject of Hive whose core idea is to swap Hive's execution engine for Spark. As is well known, Hive currently computes with MapReduce, and because of its poor performance the Hive project wants to replace that engine.
Using Spark SQL on data stored in Hive is called Spark on Hive; Hive on Spark keeps Hive at the core and merely replaces the compute engine, MapReduce, with Spark.
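To make the Spark on Hive side concrete, here is a minimal sketch (Spark 1.x API; it assumes a configured Hive metastore, and the table name "sales" is made up purely for illustration — it needs an actual Spark/Hive runtime to execute):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SparkOnHiveSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkOnHiveSketch").setMaster("local")
    val sc = new SparkContext(conf)
    // HiveContext only reads table metadata and data locations from the Hive metastore;
    // the query itself is planned and executed by Spark SQL, not by Hive's MapReduce engine.
    val hiveContext = new HiveContext(sc)
    // "sales" is a hypothetical Hive table used purely for illustration.
    hiveContext.sql("SELECT deptId, SUM(salary) AS total FROM sales GROUP BY deptId").show()
  }
}
```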
Spark官网上DataFrame 的API Docs:
http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrame
class DataFrame extends Queryable with Serializable
Experimental
A distributed collection of data organized into named columns.
A DataFrame is equivalent to a relational table in Spark SQL. The following example creates a DataFrame by pointing Spark SQL to a Parquet data set.
val people = sqlContext.read.parquet("...") // in Scala
DataFrame people = sqlContext.read().parquet("...") // in Java
Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.
To select a column from the data frame, use apply method in Scala and col in Java.
val ageCol = people("age") // in Scala
Column ageCol = people.col("age") // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10.
people("age") + 10 // in Scala
people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create DataFrame using SQLContext
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")
people.filter("age > 30")
.join(department, people("deptId") === department("id"))
.groupBy(department("name"), "gender")
.agg(avg(people("salary")), max(people("age")))
and in Java:
// To create DataFrame using SQLContext
DataFrame people = sqlContext.read().parquet("...");
DataFrame department = sqlContext.read().parquet("...");
people.filter(people.col("age").gt(30))
.join(department, people.col("deptId").equalTo(department.col("id")))
.groupBy(department.col("name"), "gender")
.agg(avg(people.col("salary")), max(people.col("age")));
The join, groupBy, and agg calls above are all Spark SQL built-in functions.
Spark 1.5.x and later introduced many more built-in functions; by an informal count there are over one hundred of them.
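To get a feel for them, a minimal sketch (Spark 1.5.x in local mode assumed; the toy DataFrame and its column names are made up for illustration) exercising a few built-ins from org.apache.spark.sql.functions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

object BuiltinFunctionsTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BuiltinFunctionsTour").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    // A toy DataFrame; the data and column names exist only for this sketch.
    val df = sc.parallelize(Seq(("spark", 1.44), ("hive", 2.25))).toDF("name", "value")
    // upper is a string built-in; sqrt and round are math built-ins; each returns a Column.
    df.select(upper($"name"), round(sqrt($"value"), 2)).show()
  }
}
```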
Let's now develop a hands-on aggregation example:
package com.dt.spark.sql
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.functions._
/**
 * Analyzing data with Spark SQL's built-in functions.
 * @author DT大数据梦工厂
 * Sina Weibo: http://weibo.com/ilovepains/
 * Created by hp on 2016/3/28.
 *
 * Unlike the rest of the Spark SQL API, the DataFrame built-in functions return Column objects, and a
 * DataFrame is by nature "A distributed collection of data organized into named columns." This lays a solid
 * foundation for complex data analysis and is extremely convenient: while operating on a DataFrame we can
 * invoke built-in functions at any point to implement the processing the business requires. This greatly
 * reduces unnecessary effort when building business logic (the code is essentially a direct mapping of the
 * actual model) and lets us focus on the analysis itself, which is very valuable for engineer productivity.
 * Spark 1.5.x introduced a large number of built-in functions, for example agg:
 * def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame = {
 *   groupBy().agg(aggExpr, aggExprs : _*)
 * }
 * as well as max, mean, min, sum, avg, explode, size, sort_array, day, to_date, abs, acos, asin, atan.
 * Overall, the built-in functions fall into seven broad categories:
 * 1. aggregate functions, e.g. countDistinct, sumDistinct;
 * 2. collection functions, e.g. sort_array, explode;
 * 3. date/time functions, e.g. hour, quarter, next_day;
 * 4. math functions, e.g. asin, atan, sqrt, tan, round;
 * 5. window functions, e.g. rowNumber;
 * 6. string functions, e.g. concat, format_number, regexp_extract;
 * 7. other functions, e.g. isNaN, sha, randn, callUDF.
 *
 */
object SparkSQLAgg {
def main (args: Array[String]) {
/**
 * Step 1: Create the Spark configuration object SparkConf and set the application's runtime
 * configuration, e.g. call setMaster to set the URL of the Spark cluster Master the program
 * should connect to. Setting it to "local" runs the program locally, which is especially
 * suitable for beginners with very limited hardware (e.g. only 1 GB of RAM).
 */
val conf = new SparkConf() // create the SparkConf object
conf.setAppName("SparkSQLInnerFunctions") // set the application name, visible in the job-monitoring UI
// conf.setMaster("spark://Master:7077") // with this, the program runs on the Spark cluster
conf.setMaster("local")
/**
 * Step 2: Create the SparkContext object.
 * SparkContext is the sole entry point to all of Spark's functionality; whether you use Scala, Java,
 * Python, or R, you must have a SparkContext. Its core job is to initialize the components the Spark
 * application needs to run, including the DAGScheduler, TaskScheduler, and SchedulerBackend, and it is
 * also responsible for registering the program with the Master.
 * SparkContext is the single most important object in the entire Spark application.
 */
val sc = new SparkContext(conf) // create SparkContext, passing in SparkConf to customize the runtime parameters and configuration
val sqlContext = new SQLContext(sc) // build the SQL context
// To use Spark SQL's built-in functions, you must import the implicit conversions under SQLContext
import sqlContext.implicits._
/**
 * Step 3: Simulate e-commerce access data (real data would be far more complex) and turn it into an RDD
 */
val userData = Array(
"2016-3-27,001,http://spark.apache.org/,1000",
"2016-3-27,001,http://hadoop.apache.org/,1001",
"2016-3-27,002,http://flink.apache.org/,1002",
"2016-3-28,003,http://kafka.apache.org/,1020",
"2016-3-28,004,http://spark.apache.org/,1010",
"2016-3-28,002,http://hive.apache.org/,1200",
"2016-3-28,001,http://parquet.apache.org/,1500",
"2016-3-28,001,http://spark.apache.org/,1800"
)
val userDataRDD = sc.parallelize(userData) // create a distributed RDD from the local collection
/**
 * Step 4: Preprocess the data as the business requires and build a DataFrame. To convert the RDD into
 * a DataFrame, the RDD's element type must first be changed to Row, and at the same time we must supply
 * the metadata describing the DataFrame's columns
 */
val userDataRDDRow = userDataRDD.map(row => {
  val splited = row.split(",")
  Row(splited(0), splited(1).toInt, splited(2), splited(3).toInt)
})
val structTypes = StructType(Array(
StructField("time", StringType, true),
StructField("id", IntegerType, true),
StructField("url", StringType, true),
StructField("amount", IntegerType, true)
))
val userDataDF = sqlContext.createDataFrame(userDataRDDRow,structTypes)
/**
 * Step 5: Operate on the DataFrame with the built-in functions Spark SQL provides. Note in particular
 * that the built-in functions produce Column objects and are automatically code-generated (CG)
 */
userDataDF.groupBy("time").agg('time, countDistinct('id))
  .map(row => Row(row(1), row(2))).collect.foreach(println)
userDataDF.groupBy("time").agg('time, sum('amount)).show()
}
}
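As a variant, the same kind of per-day aggregation can be written with the agg(aggExpr: (String, String), ...) overload quoted in the comment above, which names the column and the aggregate function as strings. A self-contained sketch (Spark 1.5.x local mode assumed; the inline data just mirrors a slice of the example above):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQLAggVariant {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSQLAggVariant").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val userDataDF = sc.parallelize(Seq(
      ("2016-3-27", 1, 1000), ("2016-3-27", 2, 1002), ("2016-3-28", 1, 1800)
    )).toDF("time", "id", "amount")
    // The (column -> aggregate-name) string form resolves the function by name;
    // this is equivalent to agg(sum('amount), max('amount)) with the Column-based built-ins.
    userDataDF.groupBy("time").agg("amount" -> "sum", "amount" -> "max").show()
  }
}
```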
Running it in Eclipse produces output like the following (excerpt):
16/04/10 23:54:04 INFO TaskSetManager: Finished task 58.0 in stage 6.0 (TID 461) in 18 ms on localhost (59/199)
16/04/10 23:54:04 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/04/10 23:54:04 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/04/10 23:54:04 INFO Executor: Finished task 59.0 in stage 6.0 (TID 462). 1652 bytes result sent to driver
16/04/10 23:54:04 INFO TaskSetManager: Starting task 60.0 in stage 6.0 (TID 463, localhost, partition 61,NODE_LOCAL, 1999 bytes)
16/04/10 23:54:04 INFO Executor: Running task 60.0 in stage 6.0 (TID 463)
16/04/10 23:54:04 INFO TaskSetManager: Finished task 59.0 in stage 6.0 (TID 462) in 15 ms on localhost (60/199)