



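  • The main abstraction Spark provides is a resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.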

  • RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it.

  • Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

  • Shared variables: Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.


  • By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can ask for a higher number of partitions by passing a larger value as the optional second argument of textFile. Note that you cannot have fewer partitions than blocks.
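The partitioning rule above can be sketched as simple arithmetic. This is a simplified model in plain Python, not Spark's actual split logic: the function name estimate_partitions and its parameters are hypothetical, and it ignores details such as non-splittable files.

```python
import math

MB = 1024 * 1024
HDFS_BLOCK_SIZE = 128 * MB  # default HDFS block size mentioned above


def estimate_partitions(file_size_bytes, requested_partitions=None,
                        block_size=HDFS_BLOCK_SIZE):
    """Simplified model: one partition per block, more only if a larger number is requested."""
    blocks = max(1, math.ceil(file_size_bytes / block_size))
    if requested_partitions is None:
        return blocks
    # Asking for fewer partitions than blocks has no effect in this model.
    return max(blocks, requested_partitions)


one_gb = 1024 * MB
print(estimate_partitions(one_gb))      # 8
print(estimate_partitions(one_gb, 4))   # 8
print(estimate_partitions(one_gb, 16))  # 16
```

Under these assumptions, a 1 GB file yields 8 partitions unless more than 8 are requested.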


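counter = 0
rdd = sc.parallelize(data)  # sc is an existing SparkContext, data a local collection of numbers

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x
rdd.foreach(increment_counter)  # each executor updates its own copy of counter

print("Counter value: ", counter)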
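
  • The variables within the closure sent to each executor are copies, so when counter is referenced inside the foreach function, it is no longer the counter on the driver node. There is still a counter in the memory of the driver node, but the executors can no longer see it!

  • Closures: constructs such as loops or locally defined methods should not be used to mutate global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures.

  • To ensure well-defined behavior in these scenarios, use an accumulator. Accumulators in Spark exist specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster.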
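The copy semantics described above can be imitated without a cluster. The following is a minimal pure-Python sketch, not Spark code: run_on_executor is a hypothetical helper, and a pickle round-trip stands in for shipping the closure to an executor. Merging the per-partition results on the driver is roughly the role an accumulator plays.

```python
import pickle


def run_on_executor(task_state, partition):
    """Model an executor: it works on a serialized copy of the driver's state."""
    local_state = pickle.loads(pickle.dumps(task_state))  # copy, as if sent over the network
    for x in partition:
        local_state["counter"] += x
    return local_state["counter"]  # partial result; the driver's dict is untouched


driver_state = {"counter": 0}
partitions = [[1, 2], [3, 4]]  # the data, split into two partitions

partials = [run_on_executor(driver_state, p) for p in partitions]
print(driver_state["counter"])  # 0: updates made on the copies never reach the driver

accumulated = sum(partials)  # merging partial results on the driver
print(accumulated)  # 10
```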

Printing elements of an RDD
  • Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println). On a single machine, this will generate the expected output and print all the RDD’s elements. However, in cluster mode, the print calls run on the executors and write to each executor’s stdout, not the driver’s, so stdout on the driver won’t show them!

  • To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).

  • In short, collect() tries to pull all the data onto a single machine, so it can cause an out-of-memory error.
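The difference between the two approaches can be modeled in plain Python. This is an illustrative sketch only: collect and take here are hypothetical stand-ins operating on a list of lists, not the Spark methods, and they only show how much data ends up on the driver.

```python
from itertools import chain


def collect(partitions):
    """Model collect(): every element of every partition is brought to the driver."""
    return list(chain.from_iterable(partitions))


def take(partitions, n):
    """Model take(n): read partitions in order and stop once n elements are gathered."""
    taken = []
    for part in partitions:
        for x in part:
            if len(taken) == n:
                return taken
            taken.append(x)
    return taken


partitions = [list(range(0, 1000)), list(range(1000, 2000)), list(range(2000, 3000))]
print(len(collect(partitions)))  # 3000: the whole dataset is held by the driver
print(take(partitions, 5))       # [0, 1, 2, 3, 4]: only five elements are held
```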

Shared Variables

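  • Normally, when a function is executed on a remote node, it works on separate copies of all the variables used in the function, so the copies on each node are independent.
  • These variables are copied to each machine, and updates to the variables on a remote machine are not propagated back to the driver program. Supporting general read-write shared variables across tasks would be inefficient.
  • However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.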


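  • A broadcast variable is simply a variable that every node can see.
  • A broadcast variable is read-only. Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
  • They are useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
  • In addition, the object v passed to sc.broadcast(v) should not be modified after it is broadcast, to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).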
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>

>>> broadcastVar.value
[1, 2, 3]
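
To see why keeping one copy per machine is cheaper than shipping a copy with every task, here is a back-of-envelope sketch in plain Python. The task and node counts are made-up numbers, and the model only counts serialized bytes; it does not reflect how Spark actually distributes broadcast data.

```python
import pickle

# A lookup table that every task needs (size chosen arbitrarily for illustration).
lookup = {i: str(i) for i in range(10_000)}
payload_bytes = len(pickle.dumps(lookup))

num_tasks = 100  # assumed number of tasks in the job
num_nodes = 4    # assumed number of machines running those tasks

shipped_with_tasks = payload_bytes * num_tasks    # one copy per task
shipped_as_broadcast = payload_bytes * num_nodes  # one cached copy per machine

print(shipped_with_tasks // shipped_as_broadcast)  # 25
```

With these assumed numbers, broadcasting moves 25 times less data than sending the table along with each task.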


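  • Accumulators can be used to implement counters. An accumulator is a variable shared across all nodes, similar to a global variable that tasks can only add to.

  • Tasks running on the cluster can add to an accumulator, but they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

  • For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will be applied only once, i.e. restarted tasks will not update the value.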

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
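
The "applied only once" guarantee for restarted tasks can be pictured with a toy model in plain Python. DriverSideAccumulator and its merge method are hypothetical names used for illustration; this is not how Spark implements accumulators internally.

```python
class DriverSideAccumulator:
    """Toy model of the driver merging per-task updates at most once per task."""

    def __init__(self):
        self.value = 0
        self._merged_task_ids = set()

    def merge(self, task_id, partial_sum):
        # A restarted task reports the same task_id again; its update is ignored.
        if task_id in self._merged_task_ids:
            return False
        self._merged_task_ids.add(task_id)
        self.value += partial_sum
        return True


accum = DriverSideAccumulator()
accum.merge(task_id=0, partial_sum=1 + 2)
accum.merge(task_id=1, partial_sum=3 + 4)
accum.merge(task_id=1, partial_sum=3 + 4)  # re-run of task 1: not counted twice
print(accum.value)  # 10
```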

Deploying to a Cluster

The application submission guide describes how to submit applications to a cluster. In short, once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.

  • If your code depends on other projects, you will need to package them alongside your application in order to distribute the code to a Spark cluster. To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies.
  • For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files we recommend packaging them into a .zip or .egg.

Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and can support different cluster managers and deploy modes that Spark supports:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://HOST:PORT)
  • --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
  • --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap “key=value” in quotes.
  • application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.
  • application-arguments: Arguments passed to the main method of your main class, if any
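
As a rough illustration of how these options fit together, the sketch below assembles a spark-submit command line in Python. build_spark_submit is a hypothetical helper, and my_app.py and deps.zip are placeholder file names; only the flags themselves come from the list above.

```python
import shlex


def build_spark_submit(application, master, deploy_mode="client",
                       main_class=None, conf=None, py_files=None, app_args=()):
    """Assemble a spark-submit argument list from the options described above."""
    cmd = ["./bin/spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if main_class:
        cmd += ["--class", main_class]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    cmd.append(application)  # options must come before the application
    cmd += list(app_args)    # everything after it goes to the application itself
    return cmd


cmd = build_spark_submit(
    application="my_app.py",  # hypothetical application file
    master="local[4]",
    conf={"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"},
    py_files=["deps.zip"],    # hypothetical dependency archive
    app_args=["1000"],
)
print(shlex.join(cmd))  # shlex.join quotes arguments that contain spaces
```

Building the command as a list and quoting it only for display keeps the --conf value with spaces as a single argument, which is the point of the quoting advice above.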

Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL | Meaning
local | Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[K,F] | Run Spark locally with K worker threads and F maxFailures (see spark.task.maxFailures for an explanation of this variable).
local[*] | Run Spark locally with as many worker threads as logical cores on your machine.
local[*,F] | Run Spark locally with as many worker threads as logical cores on your machine and F maxFailures.
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
spark://HOST1:PORT1,HOST2:PORT2 | Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default.
mesos://HOST:PORT | Connect to the given Mesos cluster. The port must be whichever one your master is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://.... To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn | Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
k8s://HOST:PORT | Connect to a Kubernetes cluster in cluster mode. Client mode is currently unsupported and will be supported in future releases. The HOST and PORT refer to the Kubernetes API Server. It connects using TLS by default. In order to force it to use an unsecured connection, you can use k8s://http://HOST:PORT.
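
The formats above are regular enough to classify with a few string checks. The following is a minimal sketch in plain Python; describe_master is a hypothetical helper written for illustration and is not part of Spark, which does its own parsing.

```python
import re

# local, local[K], local[K,F], local[*], local[*,F]
_LOCAL = re.compile(r"^local(?:\[(\*|\d+)(?:,\s*(\d+))?\])?$")

_SCHEMES = (
    ("spark://", "standalone"),
    ("mesos://", "mesos"),
    ("k8s://", "kubernetes"),
)


def describe_master(url):
    """Classify a master URL string into one of the formats listed above."""
    match = _LOCAL.match(url)
    if match:
        threads, max_failures = match.group(1), match.group(2)
        # Plain "local" means a single worker thread.
        return {"mode": "local", "threads": threads or "1", "max_failures": max_failures}
    if url == "yarn":
        return {"mode": "yarn"}
    for scheme, mode in _SCHEMES:
        if url.startswith(scheme):
            return {"mode": mode, "address": url[len(scheme):]}
    raise ValueError(f"unrecognized master URL: {url}")


print(describe_master("local[*,3]"))
print(describe_master("spark://HOST1:7077,HOST2:7077"))
```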
