
Spark: Spark Core Advanced Topics, Part 1


| Term | Meaning | Notes |
| --- | --- | --- |
| Application | User program built on Spark. Consists of a driver program and executors on the cluster. | A driver program plus executors. |
| Application jar | A jar containing the user’s Spark application. In some cases users will want to create an “uber jar” containing their application along with its dependencies. The user’s jar should never include Hadoop or Spark libraries; these are added at runtime. | The jar of the user's Spark program. |
| Driver program | The process running the main() function of the application and creating the SparkContext. | The entry point of the program; it also creates the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN). | An external service used to request resources for a job. |
| Deploy mode | Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster. | Whether the driver process runs on the cluster side or the client side; the client side is wherever the application is submitted from. |
| Worker node | Any node that can run application code in the cluster. | A machine that runs the Spark program. |
| Executor | A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors. | A process that runs tasks and keeps data in memory or on disk. Each application has its own executors, similar to a Container in MR. |
| Task | A unit of work that will be sent to one executor. | The smallest unit of work sent to an executor. |
| Job | A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you’ll see this term used in the driver’s logs. | Each action operator produces one job. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs. | A stage boundary is created at every shuffle. A job consists of multiple stages, and a stage consists of multiple tasks. Stages in the same job depend on each other: a stage runs only after the stages it depends on have finished, much as reduce depends on map in MR. |
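The job, stage, and task relationship in the glossary can be sketched with a toy model. This is not how Spark's scheduler is implemented: it assumes a purely linear lineage (a real lineage is a DAG), and `Op`, `split_into_stages`, and `total_tasks` are hypothetical names made up for illustration. It only shows the rule from the notes: a new stage starts at every shuffle, and each stage contributes one task per partition.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One step in a linear lineage (hypothetical toy type, not a Spark class)."""
    name: str
    shuffle: bool  # True if this step needs a shuffle (a "wide" dependency)


def split_into_stages(ops):
    """Cut a linear chain of operations into stages at each shuffle boundary."""
    stages, current = [], []
    for op in ops:
        # A shuffle ends the current stage; the shuffling op starts the next one.
        if op.shuffle and current:
            stages.append(current)
            current = []
        current.append(op.name)
    if current:
        stages.append(current)
    return stages


def total_tasks(partitions_per_stage):
    """One task per partition per stage, so the job's task count is the sum."""
    return sum(partitions_per_stage)


# A word-count style lineage; only reduceByKey requires a shuffle.
lineage = [
    Op("textFile", shuffle=False),
    Op("flatMap", shuffle=False),
    Op("map", shuffle=False),
    Op("reduceByKey", shuffle=True),
    Op("mapValues", shuffle=False),
]
stages = split_into_stages(lineage)
print(stages)               # two stages, split before reduceByKey
print(total_tasks([4, 2]))  # assumed 4 and 2 partitions for the two stages
```

The partition counts 4 and 2 are assumed values; in a real job they come from the input splits and the shuffle's partition setting.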



Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
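Which cluster manager the SparkContext connects to is chosen by the master URL given to the application. The helper below is a hypothetical illustration, not part of Spark: it only classifies the URL forms for the managers named above (`spark://HOST:PORT` for standalone, `mesos://HOST:PORT`, and `yarn`), plus `local` for running without a cluster. Other schemes are rejected here simply because this sketch does not cover them.

```python
def cluster_manager_for(master: str) -> str:
    """Classify a Spark master URL by the cluster manager it targets.

    Hypothetical helper for illustration; Spark does its own parsing internally.
    """
    if master == "local" or master.startswith("local["):
        return "local"       # no cluster manager, e.g. local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"  # Spark's own standalone cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"        # cluster location is read from the Hadoop config
    raise ValueError(f"master URL not covered by this sketch: {master!r}")


# Host names and ports below are placeholders.
for url in ["local[*]", "spark://master-host:7077", "mesos://mesos-host:5050", "yarn"]:
    print(url, "->", cluster_manager_for(url))
```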

There are several useful things to note about this architecture:

  • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
  • Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  • The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
  • Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
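The first point above (tasks run as threads inside each application's own executor, and data stays private to the application) can be illustrated with a toy model in plain Python. `ToyExecutor` and `demo` are hypothetical names, a thread pool stands in for an executor JVM, and a temporary JSON file stands in for the external storage system; none of this is Spark's API.

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


class ToyExecutor:
    """Toy stand-in for one application's executor process (not Spark's API)."""

    def __init__(self, app_name, threads=2):
        self.app_name = app_name
        self.cache = {}  # in-memory data, private to this application
        self.pool = ThreadPoolExecutor(max_workers=threads)

    def run_tasks(self, func, partitions):
        # Tasks run concurrently as threads; results keep partition order.
        return list(self.pool.map(func, partitions))

    def shutdown(self):
        self.pool.shutdown()


def demo():
    app_a = ToyExecutor("app-a")
    app_b = ToyExecutor("app-b")

    # app-a computes one result per partition and keeps it in its own memory.
    app_a.cache["squares"] = app_a.run_tasks(
        lambda part: [x * x for x in part], [[1, 2], [3, 4]]
    )
    visible_in_b = "squares" in app_b.cache  # app-b cannot see app-a's data

    # Sharing has to go through external storage (here: a temporary file).
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "squares.json")
        with open(path, "w") as f:
            json.dump(app_a.cache["squares"], f)
        with open(path) as f:
            app_b.cache["squares"] = json.load(f)

    app_a.shutdown()
    app_b.shutdown()
    return visible_in_b, app_b.cache["squares"]


print(demo())
```

In a real deployment the two applications would be separate sets of JVMs on the cluster, which is why the isolation is enforced by the process boundary rather than by convention as in this sketch.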




