Spark Series Table of Contents
Chapter 1: Getting to Know Spark
Chapter 2: Spark-Core Core Model (Part 1)
Chapter 2: Spark-Core Core Model (Part 2)
Chapter 3: Advanced Spark-Core Programming (Part 1)
Chapter 3: Advanced Spark-Core Programming (Part 2)
Chapter 4: Spark-SQL Basics (Part 1)
Chapter 4: Spark-SQL Basics (Part 2)
Chapter 5: Advanced Spark-SQL (Part 1)
Chapter 5: Advanced Spark-SQL (Part 2)
Chapter 5: Advanced Spark-SQL (Part 3)
Chapter 4: Spark-SQL Basics (Part 2)
8. Core Objects
8.1 SparkSession
SparkSession is the session object of the Spark-SQL component and the unified entry point for Spark SQL functionality.
The core source code for building a SparkSession, taken from the SparkSession companion object, is as follows:
@Stable
object SparkSession extends Logging {
/**
* Builder for [[SparkSession]].
*/
@Stable
class Builder extends Logging {
private[this] val options = new scala.collection.mutable.HashMap[String, String]
private[this] val extensions = new SparkSessionExtensions
private[this] var userSuppliedContext: Option[SparkContext] = None
private[spark] def sparkContext(sparkContext: SparkContext): Builder = synchronized {
userSuppliedContext = Option(sparkContext)
this
}
/**
* Sets a name for the application, which will be shown in the Spark web UI.
* If no application name is set, a randomly generated name will be used.
*
* @since 2.0.0
*/
def appName(name: String): Builder = config("spark.app.name", name)
/**
* Sets a config option. Options set using this method are automatically propagated to
* both `SparkConf` and SparkSession's own configuration.
*
* @since 2.0.0
*/
def config(key: String, value: String): Builder = synchronized {
options += key -> value
this
}
/**
* Sets a config option. Options set using this method are automatically propagated to
* both `SparkConf` and SparkSession's own configuration.
*
* @since 2.0.0
*/
def config(key: String, value: Long): Builder = synchronized {
options += key -> value.toString
this
}
/**
* Sets a config option. Options set using this method are automatically propagated to
* both `SparkConf` and SparkSession's own configuration.
*
* @since 2.0.0
*/
def config(key: String, value: Double): Builder = synchronized {
options += key -> value.toString
this
}
/**
* Sets a config option. Options set using this method are automatically propagated to
* both `SparkConf` and SparkSession's own configuration.
*
* @since 2.0.0
*/
def config(key: String, value: Boolean): Builder = synchronized {
options += key -> value.toString
this
}
/**
* Sets a list of config options based on the given `SparkConf`.
*
* @since 2.0.0
*/
def config(conf: SparkConf): Builder = synchronized {
conf.getAll.foreach { case (k, v) => options += k -> v }
this
}
/**
* Sets the Spark master URL to connect to, such as "local" to run locally, "local[4]" to
* run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
*
* @since 2.0.0
*/
def master(master: String): Builder = config("spark.master", master)
/**
* Enables Hive support, including connectivity to a persistent Hive metastore, support for
* Hive serdes, and Hive user-defined functions.
*
* @since 2.0.0
*/
def enableHiveSupport(): Builder = synchronized {
if (hiveClassesArePresent) {
config(CATALOG_IMPLEMENTATION.key, "hive")
} else {
throw new IllegalArgumentException(
"Unable to instantiate SparkSession with Hive support because " +
"Hive classes are not found.")
}
}
/**
* Inject extensions into the [[SparkSession]]. This allows a user to add Analyzer rules,
* Optimizer rules, Planning Strategies or a customized parser.
*
* @since 2.2.0
*/
def withExtensions(f: SparkSessionExtensions => Unit): Builder = synchronized {
f(extensions)
this
}
/**
* Gets an existing [[SparkSession]] or, if there is no existing one, creates a new
* one based on the options set in this builder.
*
* This method first checks whether there is a valid thread-local SparkSession,
* and if yes, return that one. It then checks whether there is a valid global
* default SparkSession, and if yes, return that one. If no valid global default
* SparkSession exists, the method creates a new SparkSession and assigns the
* newly created SparkSession as the global default.
*
* In case an existing SparkSession is returned, the non-static config options specified in
* this builder will be applied to the existing SparkSession.
*
* @since 2.0.0
*/
def getOrCreate(): SparkSession = synchronized {
assertOnDriver()
// Get the session from current thread's active session.
var session = activeThreadSession.get()
if ((session ne null) && !session.sparkContext.isStopped) {
applyModifiableSettings(session)
return session
}
// Global synchronization so we will only set the default session once.
SparkSession.synchronized {
// If the current thread does not have an active session, get it from the global session.
session = defaultSession.get()
if ((session ne null) && !session.sparkContext.isStopped) {
applyModifiableSettings(session)
return session
}
// No active nor global default session. Create a new one.
val sparkContext = userSuppliedContext.getOrElse {
val sparkConf = new SparkConf()
options.foreach { case (k, v) => sparkConf.set(k, v) }
// set a random app name if not given.
if (!sparkConf.contains("spark.app.name")) {
sparkConf.setAppName(java.util.UUID.randomUUID().toString)
}
SparkContext.getOrCreate(sparkConf)
// Do not update `SparkConf` for existing `SparkContext`, as it's shared by all sessions.
}
applyExtensions(
sparkContext.getConf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS).getOrElse(Seq.empty),
extensions)
session = new SparkSession(sparkContext, None, None, extensions)
options.foreach { case (k, v) => session.initialSessionOptions.put(k, v) }
setDefaultSession(session)
setActiveSession(session)
registerContextListener(sparkContext)
}
return session
}
private def applyModifiableSettings(session: SparkSession): Unit = {
val (staticConfs, otherConfs) =
options.partition(kv => SQLConf.staticConfKeys.contains(kv._1))
otherConfs.foreach { case (k, v) => session.sessionState.conf.setConfString(k, v) }
if (staticConfs.nonEmpty) {
logWarning("Using an existing SparkSession; the static sql configurations will not take" +
" effect.")
}
if (otherConfs.nonEmpty) {
logWarning("Using an existing SparkSession; some spark core configurations may not take" +
" effect.")
}
}
}
/**
* Creates a [[SparkSession.Builder]] for constructing a [[SparkSession]].
*
* @since 2.0.0
*/
def builder(): Builder = new Builder
/**
* Changes the SparkSession that will be returned in this thread and its children when
* SparkSession.getOrCreate() is called. This can be used to ensure that a given thread receives
* a SparkSession with an isolated session, instead of the global (first created) context.
*
* @since 2.0.0
*/
def setActiveSession(session: SparkSession): Unit = {
activeThreadSession.set(session)
}
/**
* Clears the active SparkSession for current thread. Subsequent calls to getOrCreate will
* return the first created context instead of a thread-local override.
*
* @since 2.0.0
*/
def clearActiveSession(): Unit = {
activeThreadSession.remove()
}
/**
* Sets the default SparkSession that is returned by the builder.
*
* @since 2.0.0
*/
def setDefaultSession(session: SparkSession): Unit = {
defaultSession.set(session)
}
/**
* Clears the default SparkSession that is returned by the builder.
*
* @since 2.0.0
*/
def clearDefaultSession(): Unit = {
defaultSession.set(null)
}
/**
* Returns the active SparkSession for the current thread, returned by the builder.
*
* @note Return None, when calling this function on executors
*
* @since 2.2.0
*/
def getActiveSession: Option[SparkSession] = {
if (TaskContext.get != null) {
// Return None when running on executors.
None
} else {
Option(activeThreadSession.get)
}
}
/**
* Returns the default SparkSession that is returned by the builder.
*
* @note Return None, when calling this function on executors
*
* @since 2.2.0
*/
def getDefaultSession: Option[SparkSession] = {
if (TaskContext.get != null) {
// Return None when running on executors.
None
} else {
Option(defaultSession.get)
}
}
/**
* Returns the currently active SparkSession, otherwise the default one. If there is no default
* SparkSession, throws an exception.
*
* @since 2.4.0
*/
def active: SparkSession = {
getActiveSession.getOrElse(getDefaultSession.getOrElse(
throw new IllegalStateException("No active or default Spark session found")))
}
}
The builder is typically used as follows:
// 1. Build the SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf() // any pre-built SparkConf can be passed in via config(conf)
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark_sql")
  //.config("spark.app.name", "spark_sql2")
  //.config("spark.master", "local[*]")
  .config(conf)
  .config("spark.sql.warehouse.dir", SparkConfUtil.sparkSqlWarehouse)
  .config("spark.serializer", SparkConfUtil.sparkSerializer)
  .config("spark.sql.streaming.checkpointLocation", SparkConfUtil.sparkStreamCheckPointLocaltion)
  // Enable Hive support
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._
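For reference, a minimal sketch (with hypothetical app names, local master for illustration only) of how getOrCreate() behaves when a session already exists: it returns the active (or global default) session and only applies the modifiable options set in the builder.

import org.apache.spark.sql.SparkSession

val first = SparkSession.builder().master("local[*]").appName("first").getOrCreate()
val second = SparkSession.builder().appName("second").getOrCreate() // reuses the existing session
println(first eq second) // true: getOrCreate() returned the same SparkSession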
8.2 DataFrame
- A distributed collection of data organized into named columns.
- Conceptually equivalent to a table in a relational database.
- An RDD of Row objects, together with schema information describing the data type of each column. A Row object represents one record of the DataFrame and is essentially a fixed-length array of fields.
- In Scala and Java, Row provides a family of get accessor methods that retrieve each field's value by its index.
The source code of the get accessors is as follows:
def apply(i: Int): Any = get(i)
def get(i: Int): Any
def getBoolean(i: Int): Boolean = getAnyValAs[Boolean](i)
def getByte(i: Int): Byte = getAnyValAs[Byte](i)
def getShort(i: Int): Short = getAnyValAs[Short](i)
def getInt(i: Int): Int = getAnyValAs[Int](i)
def getLong(i: Int): Long = getAnyValAs[Long](i)
def getFloat(i: Int): Float = getAnyValAs[Float](i)
def getDouble(i: Int): Double = getAnyValAs[Double](i)
def getString(i: Int): String = getAs[String](i)
def getDecimal(i: Int): java.math.BigDecimal = getAs[java.math.BigDecimal](i)
def getDate(i: Int): java.sql.Date = getAs[java.sql.Date](i)
def getTimestamp(i: Int): java.sql.Timestamp = getAs[java.sql.Timestamp](i)
def getSeq[T](i: Int): Seq[T] = getAs[Seq[T]](i)
def getList[T](i: Int): java.util.List[T] =
getSeq[T](i).asJava
def getMap[K, V](i: Int): scala.collection.Map[K, V] = getAs[Map[K, V]](i)
def getAs[T](i: Int): T = get(i).asInstanceOf[T]
def getAs[T](fieldName: String): T = getAs[T](fieldIndex(fieldName))
Example:
import org.apache.spark.sql.Row

val input = Row("tom", 13)
val name = input.getString(0) // typed getter, access by index
// or
val name2 = input(0)          // apply(i), returns Any
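In practice, Row objects usually come from a DataFrame. A minimal sketch (assuming an existing SparkSession named spark) of extracting fields from the rows of a DataFrame, both by index and by field name:

import spark.implicits._

val df = Seq(("tom", 13), ("jerry", 12)).toDF("name", "age")
val names = df.rdd.map(row => row.getString(0)).collect()      // typed getter, by index
val ages  = df.rdd.map(row => row.getAs[Int]("age")).collect() // getter by field name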
Question 1: What are the advantages and disadvantages of the RDD abstraction?

Advantages and disadvantages of the RDD abstraction:
- Advantages
  - 1. Powerful functionality
    - Many built-in operations, such as group, map, filter, etc.
    - Convenient for processing both structured and unstructured data
  - 2. Object-oriented programming
    - Objects are stored directly
    - Type conversions are checked at compile time (type-safe)
- Disadvantages
  - 1. Too general-purpose
    - There are no optimizations for specific scenarios; for example, processing structured data with the RDD API is far more cumbersome than with SQL (see the sketch after this list).
  - 2. Large serialization overhead
    - Java serialization is used by default, and the data lives on the JVM heap, which leads to frequent GC.
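To make the first disadvantage concrete, here is a minimal sketch (assuming an existing SparkSession named spark) that computes the average age per name first with the RDD API and then with Spark SQL; the RDD version has to encode the structure of the data by hand:

val people = Seq(("tom", 13), ("tom", 15), ("jerry", 12))

// RDD API: which field means what exists only in the developer's head
val avgByRdd = spark.sparkContext.parallelize(people)
  .map { case (name, age) => (name, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }
avgByRdd.collect()

// Spark SQL: the schema lets the engine understand and optimize the same computation
import spark.implicits._
val df = people.toDF("name", "age")
df.groupBy("name").avg("age").show()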
Question 2: Learning from the strengths and weaknesses of RDD, what does DataFrame do differently?

DataFrame introduces a schema and off-heap storage:
- schema
  - The structural information of the data: each column's name and type.
  - With the schema, Spark can understand the data it holds (a quick way to inspect it is sketched below).
- off-heap
  - Memory outside the JVM heap, managed directly by the operating system rather than by the JVM.
  - Spark can serialize data into a binary format and store it off-heap.

Note: think of off-heap memory as Spark's own territory and the schema as its map of that territory. With both the map and the territory, Spark manages the data itself, no longer constrained by the JVM and no longer troubled by GC.
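A minimal sketch (assuming an existing SparkSession named spark) of inspecting the schema that Spark keeps for a DataFrame:

import spark.implicits._

val df = Seq(("tom", 13)).toDF("name", "age")
df.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: integer (nullable = false)
println(df.schema) // the StructType describing each column's name and data type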
Question 3: Does DataFrame fix the weaknesses of RDD, and at what cost?

Advantages and disadvantages of DataFrame:
- Advantages
  - 1. Very convenient for processing structured data
  - 2. Lower serialization overhead and fewer GC pauses
  - 3. Compatible with Hive, with support for HQL, UDFs, etc.
- Disadvantages
  - 1. No compile-time type-safety checks; type errors only surface at runtime (see the sketch after this list)
  - 2. Serialization overhead still leaves room for improvement
  - 3. Weak support for domain objects
    - An RDD stores data directly as user objects
    - A DataFrame stores Row objects, not user-defined objects
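A minimal sketch (assuming an existing SparkSession named spark) of the first disadvantage: the commented lines below compile without complaint but would only fail at runtime.

import spark.implicits._

val df = Seq(("tom", 13)).toDF("name", "age")
// df.select("nmae")                 // typo in the column name: AnalysisException at runtime
// df.rdd.map(_.getInt(0)).collect() // column 0 is a String: ClassCastException at runtime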
Question 4: Can the weaknesses of DataFrame be improved upon further?
8.3 Dataset
- A feature added in Spark 1.6, together with optimizations to the Spark SQL execution engine.
- The API is currently available only in Scala and Java.
- Combines the strengths of RDD and DataFrame and introduces a new concept, the Encoder.
  - When serializing data, the Encoder generates bytecode that works directly against off-heap memory, so individual fields can be accessed on demand without deserializing the entire object.
- Spark does not yet provide a public API for writing custom Encoders.
Datasets are very similar to RDDs. The difference lies in how objects are serialized for transfer across the network: an RDD uses Java or Kryo serialization, while a Dataset uses a specialized Encoder.

Encoders vs. serializers
- Similarities
  - Both an encoder and a serializer are responsible for converting objects into bytes.
- Differences
  - An encoder is dynamically generated code, and the bytes do not need to be deserialized back into objects before use.
  - Spark can therefore perform many operations (filter, sort, shuffle, etc.) directly on the serialized data (a minimal Encoder sketch follows this list).
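A minimal sketch of what an Encoder looks like from the user's side (the Person case class is purely illustrative): an Encoder is generated per type and carries the schema used for Spark's internal binary format.

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(name: String, age: Int) // illustrative type

val personEncoder: Encoder[Person] = Encoders.product[Person]
println(personEncoder.schema)   // the columnar layout the encoder serializes objects into
println(Encoders.STRING.schema) // Spark also ships encoders for primitive types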
Advantages of Dataset (illustrated in the sketch after this list):
- Dataset combines the advantages of RDD and DataFrame and supports both structured and unstructured data;
- like RDD, it can store user-defined objects;
- like DataFrame, it supports SQL-style queries over structured data;
- it stores data off-heap, which is GC-friendly;
- type conversions are checked at compile time, which makes the code friendlier;
- Dataset is the officially recommended API.
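A minimal sketch (assuming an existing SparkSession named spark; the Person case class is illustrative) of how a Dataset combines typed, object-style operations with structured SQL-style queries:

import spark.implicits._

case class Person(name: String, age: Int)

val ds = Seq(Person("tom", 13), Person("jerry", 12)).toDS()

// typed, object-oriented operations, checked at compile time
val teenNames = ds.filter(p => p.age > 12).map(p => p.name)

// structured, SQL-style operations on the same data
ds.groupBy("name").avg("age").show()
ds.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 12").show()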
Summary:
- The RDD API is functional and emphasizes immutability; in most scenarios it prefers creating new objects over mutating existing ones.
  - Advantage: this yields a clean, tidy API.
  - Disadvantage: at runtime it tends to create large numbers of temporary objects, which puts pressure on the GC.
To mitigate this disadvantage:
- mapPartitions can be used to control how data is created within a single RDD partition (see the sketch after this list);
- reusable mutable objects can be used to reduce object allocation and GC overhead.
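A minimal sketch of this pattern (the MutablePoint class and the input data are hypothetical; assumes an existing SparkContext named sc): one mutable object is allocated per partition and reused for every record.

class MutablePoint(var x: Double, var y: Double) // hypothetical reusable buffer

val points = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0)))
val scaled = points.mapPartitions { iter =>
  val reused = new MutablePoint(0.0, 0.0) // allocated once per partition
  iter.map { case (x, y) =>
    reused.x = x * 2
    reused.y = y * 2
    (reused.x, reused.y) // emit plain values; the mutable object never escapes the partition
  }
}
scaled.collect()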
But this creates new problems:
- code readability suffers;
- developers must understand Spark's runtime mechanics, which raises the barrier to entry.
Spark SQL already reuses objects internally wherever possible:
- Disadvantage: immutability is broken inside the framework.
- Advantage: when data is returned to the user, it is converted back into immutable form.
Note: when developing with the DataFrame/Dataset API, you get these optimizations automatically; a quick way to see them is to inspect a query plan, as sketched below.
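A minimal sketch (assuming an existing SparkSession named spark): explain(true) prints the parsed, analyzed, optimized, and physical plans that Spark produces for a DataFrame query.

import spark.implicits._

val df = Seq(("tom", 13), ("jerry", 12)).toDF("name", "age")
df.filter($"age" > 12).select("name").explain(true) // inspect the optimized query plan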