(Repost) Spark Cluster Mode Overview

This document gives a short overview of how Spark runs on clusters, to make it easier to understand the components involved.

Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program). Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager or Mesos/YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are worker processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.
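That flow can be sketched as a minimal PySpark driver program. This is an illustration only, not code from this document: the master URL, application name, and dependency file are placeholders, and running it assumes PySpark is installed and a cluster manager is reachable.

```python
# Sketch of a driver program (assumes PySpark is installed and a
# standalone master is reachable at the placeholder URL).
from pyspark import SparkContext

sc = SparkContext(
    master="spark://master-host:7077",  # or mesos://..., or local[4] for testing
    appName="MyApp",
    # pyFiles=["deps.zip"],             # optionally ship code to the executors
)

# A simple computation: the driver turns this into tasks that the
# executors run in parallel and collects the result back.
counts = sc.parallelize(range(1000)).map(lambda x: x % 10).countByValue()
print(dict(counts))

sc.stop()  # release the executors when the application finishes
```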

[Figure: Spark cluster components]



There are several useful things to note about this architecture:


  1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
  2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  3. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.


Cluster Manager Types

The system currently supports three cluster managers:

  • Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
  • Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications.
  • Hadoop YARN – the resource manager in Hadoop 2.0.

In addition, Spark’s EC2 launch scripts make it easy to launch a standalone cluster on Amazon EC2.
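For instance, a standalone cluster on EC2 can be launched roughly as follows. The key pair, identity file, slave count, and cluster name below are placeholders; the commands assume you run them from Spark's ec2/ directory with AWS credentials configured.

```shell
# Launch a standalone cluster with 2 slave nodes on EC2 (placeholder names).
./spark-ec2 -k my-keypair -i my-keypair.pem -s 2 launch my-spark-cluster

# Later, log in to or tear down the same cluster:
./spark-ec2 -k my-keypair -i my-keypair.pem login my-spark-cluster
./spark-ec2 destroy my-spark-cluster
```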

Shipping Code to the Cluster

The recommended way to ship your code to the cluster is to pass it through SparkContext’s constructor, which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to worker nodes. You can also dynamically add new files to be sent to executors with SparkContext.addJar and addFile.
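As an illustration of the Python side, dependencies can be bundled into a .zip with the standard library before being handed to the constructor. The module name mylib is hypothetical, and the SparkContext call appears only in a comment because it needs a running cluster.

```python
import os
import tempfile
import zipfile

# Bundle a (hypothetical) helper module into deps.zip so it could be
# shipped to worker nodes via SparkContext's pyFiles argument.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "mylib.py")
with open(module_path, "w") as f:
    f.write("def double(x):\n    return 2 * x\n")

zip_path = os.path.join(workdir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # Store the module at the archive root so executors can import it.
    zf.write(module_path, arcname="mylib.py")

print(sorted(zipfile.ZipFile(zip_path).namelist()))

# The archive would then be disseminated to the workers, e.g.:
#   sc = SparkContext("spark://host:7077", "MyApp", pyFiles=[zip_path])
#   sc.addFile("/path/to/data.txt")   # or files added dynamically later
```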


Monitoring

Each driver program has a web UI, typically on port 4040, that displays information about running tasks, executors, and storage usage. Simply go to http://<driver-node>:4040 in a web browser to access this UI. The monitoring guide also describes other monitoring options.

Job Scheduling

Spark gives control over resource allocation both across applications (at the level of the cluster manager) and within applications (if multiple computations are happening on the same SparkContext). The job scheduling overview describes this in more detail.


Glossary

The following table summarizes terms you’ll see used to refer to cluster concepts:

  • Application: User program built on Spark. Consists of a driver program and executors on the cluster.
  • Driver program: The process running the main() function of the application and creating the SparkContext.
  • Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
  • Worker node: Any node that can run application code in the cluster.
  • Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
  • Task: A unit of work that will be sent to one executor.
  • Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
  • Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.

Reposted from: http://www.oschina.net/translate/spark-cluster-overview?cmp
