Spark Core Explained

The Spark Runtime Architecture

1. Glossary of Terms

| Term | Meaning |
| --- | --- |
| Application | User program built on Spark, consisting of a driver program and executors on the cluster. |
| Application jar | A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; these are added at runtime. |
| Driver program | The process that runs the application's main() function and creates the SparkContext. |
| Cluster manager | An external service for acquiring resources on the cluster (e.g. the standalone manager, Mesos, YARN). |
| Deploy mode | Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside the cluster; in "client" mode, the submitter launches the driver locally, outside the cluster. |
| Worker node | Any node that can run application code in the cluster (on YARN this corresponds to a NodeManager node). |
| Executor | A process launched for an application on a worker node (on YARN, inside a container) that runs tasks and keeps data in memory or on disk across them. Each application has its own executors. |
| Task | A unit of work that will be sent to one executor. |
| Job | A parallel computation consisting of multiple tasks, spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs. |
| Stage | Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs. |
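
To see how these terms fit together, here is a minimal driver-program sketch in Scala (the application name, sample data, and the local[2] master are purely illustrative): the whole program is the application, the process that runs main() and creates the SparkContext is the driver program, and the collect action at the end spawns a job that the scheduler splits into stages on either side of the reduceByKey shuffle and finally into tasks that run on the executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A hypothetical word-count driver used only to illustrate the glossary terms.
object GlossaryDemo {
  def main(args: Array[String]): Unit = {
    // Driver program: runs main() and creates the SparkContext.
    val conf = new SparkConf().setAppName("GlossaryDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val counts = sc
      .parallelize(Seq("spark core", "spark sql", "spark core"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // the shuffle here forms a stage boundary

    // Action: triggers one job, which is divided into stages and run as tasks on executors.
    counts.collect().foreach(println)

    sc.stop()
  }
}
```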

2. Translation of the Official "Components" Section

Source: http://spark.apache.org/docs/latest/cluster-overview.html

This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved. Read through the application submission guide to learn about launching applications on a cluster.

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object created in the application's main() function; this coordinating program is what we call the driver program.

To run on a cluster, the SparkContext connects to a cluster manager (e.g. Mesos or YARN), which allocates resources across applications. Once the resources are granted, Spark acquires executors on the cluster's worker nodes; executors are processes that run the application's computations and store its data. Next, the driver program sends the application code to the executors, and finally the SparkContext sends tasks to the executors to run.
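
As an illustration of this hand-shake between the driver and the cluster manager, the sketch below shows a driver describing, through SparkConf, which cluster manager to connect to and which executors it wants before creating the SparkContext. The master URL, deploy mode, and resource values are placeholder assumptions, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ClusterOverviewDemo {
  def main(args: Array[String]): Unit = {
    // Illustrative only: the master URL, deploy mode, and resource sizes are placeholders.
    val conf = new SparkConf()
      .setAppName("ClusterOverviewDemo")
      .setMaster("yarn")                        // which cluster manager to connect to
      .set("spark.submit.deployMode", "client") // launch the driver outside the cluster
      .set("spark.executor.instances", "2")     // ask the cluster manager for two executors
      .set("spark.executor.memory", "1g")       // memory per executor process
      .set("spark.executor.cores", "1")         // cores (task threads) per executor

    // Creating the SparkContext connects to the cluster manager and acquires the executors;
    // the application code and, later, tasks are shipped to those executors.
    val sc = new SparkContext(conf)

    sc.stop()
  }
}
```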

There are several useful things to note about this architecture:

  1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
  2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
  4. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

Notes:

1. Each application gets its own executor processes, which stay up for the lifetime of the whole application and run tasks in multiple threads. This isolates applications from one another, on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across Spark applications unless it is written to an external storage system (see the sketch after these notes).

2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes and those processes can communicate with each other, Spark can run even on a cluster manager that also supports other applications, such as Mesos or YARN.

3. The driver program must listen for and accept incoming connections from its executors throughout its lifetime (see spark.driver.port in the network configuration), so the driver program must be network addressable from the worker nodes.

4. Because the driver schedules tasks on the cluster, it should run close to the worker nodes (the NodeManagers on YARN), preferably on the same local area network. If you need to send requests to the cluster remotely, it is better to open an RPC to the driver and have it submit operations from nearby than to run the driver far away from the worker nodes.
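
Point 1 above says that data can only be shared between applications through external storage. The sketch below illustrates that with a local placeholder path and two SparkContexts created one after the other to stand in for two separate applications; on a real cluster the path would typically be HDFS, S3, or another shared store.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of note 1: applications share data only through external storage.
object ShareViaStorage {
  def main(args: Array[String]): Unit = {
    // "Application A" writes its result to external storage (placeholder local path) ...
    val scA = new SparkContext(new SparkConf().setAppName("AppA").setMaster("local[2]"))
    scA.parallelize(1 to 10).map(_.toString).saveAsTextFile("/tmp/shared-output")
    scA.stop()

    // ... and "Application B", with its own SparkContext, reads it back.
    val scB = new SparkContext(new SparkConf().setAppName("AppB").setMaster("local[2]"))
    println(scB.textFile("/tmp/shared-output").count())
    scB.stop()
  }
}
```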
