Quoting the official Job Scheduling documentation:
> Second, within each Spark application, multiple “jobs” (Spark actions) may be running concurrently if they were submitted by different threads.
In other words, multiple threads can share a single SparkContext instance and submit multiple Spark jobs that may or may not run in parallel.
Whether the Spark jobs run in parallel depends on the number of available CPUs (Spark does not track memory usage for scheduling). If there are enough CPUs to run the tasks of several Spark jobs, they run concurrently.
If there are not enough CPUs, however, you may consider using FAIR scheduling mode (FIFO is the default):
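A minimal sketch of submitting two jobs from separate threads against one SparkContext (local mode; the application name and datasets are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

object MultiThreadJobs extends App {
  val spark = SparkSession.builder()
    .appName("multi-thread-jobs") // hypothetical app name
    .master("local[*]")
    .getOrCreate()
  val sc = spark.sparkContext

  // Each thread triggers its own action, i.e. its own Spark job.
  // The scheduler is thread-safe, so the two jobs may run concurrently
  // if there are enough CPUs for their tasks.
  val t1 = new Thread(() => println(sc.parallelize(1 to 100).sum()))
  val t2 = new Thread(() => println(sc.parallelize(1 to 100).count()))
  t1.start(); t2.start()
  t1.join(); t2.join()

  spark.stop()
}
```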
> Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
>
> By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
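FAIR mode can be switched on with the `spark.scheduler.mode` configuration property, e.g. (a configuration sketch; the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Enable the FAIR scheduler instead of the default FIFO ordering,
// so concurrently submitted jobs share resources in a round-robin fashion.
val spark = SparkSession.builder()
  .appName("fair-scheduling-demo") // hypothetical app name
  .master("local[*]")
  .config("spark.scheduler.mode", "FAIR")
  .getOrCreate()
```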
Just to clear things up a bit:
> spark-submit submits a Spark *application* for execution, not a Spark job. A single Spark application can have at least one Spark job.
> RDD actions may or may not block. SparkContext comes with two methods to submit (or run) a Spark job, namely SparkContext.runJob and SparkContext.submitJob, so it does not really matter whether an action blocks or not; what matters is which SparkContext method is used to get the non-blocking behaviour.
Note that the RDD action methods are already written and their implementations use whatever the Spark developers chose (mostly SparkContext.runJob, as in count):
```scala
// RDD.count
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
```
You would have to write your own RDD actions (on a custom RDD) to have the required non-blocking feature in your Spark application.
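For reference, SparkContext.submitJob returns immediately with a future-like handle, which is where the non-blocking behaviour comes from. A sketch of using it directly (the per-partition function and partition bookkeeping here are illustrative, not a definitive implementation):

```scala
import scala.concurrent.Await
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 100, 4)

// submitJob returns a SimpleFutureAction immediately; the job runs
// asynchronously while the calling thread is free to do other work.
val partialSums = Array.fill(rdd.getNumPartitions)(0L)
val future = sc.submitJob(
  rdd,
  (it: Iterator[Int]) => it.map(_.toLong).sum,          // per-partition computation
  0 until rdd.getNumPartitions,                          // partitions to compute
  (index: Int, res: Long) => partialSums(index) = res,   // collect partition results
  partialSums.sum                                        // final result once all partitions finish
)

// Block only when the result is actually needed.
println(Await.result(future, 1.minute))
```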