Spark Programming Guide
- Overview
- Linking with Spark
- Initializing Spark
- Using the Shell
- Resilient Distributed Datasets (RDDs)
- Parallelized Collections
- External Datasets
- RDD Operations
- Basics
- Passing Functions to Spark
- Understanding closures
- Example
- Local vs. cluster modes
- Printing elements of an RDD
- Example
- Working with Key-Value Pairs
- Transformations
- Actions
- Shuffle operations
- RDD Persistence
- Which Storage Level to Choose?
- Removing Data
- Shared Variables
- Deploying to a Cluster
- Launching Spark jobs from Java / Scala
- Unit Testing
- Migrating from pre-1.0 Versions of Spark
- Where to Go from Here
Overview
At a high level, every Spark application consists of a driver program that runs the user’s main
function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
Linking with Spark
Spark 1.6.3 uses Scala 2.10. To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.10.X).
To write a Spark application, you need to add a Maven dependency on Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
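If you build with sbt instead of Maven, the same coordinates can be expressed in a build.sbt along these lines. This is a sketch: the hadoop-client line is only needed if you access HDFS, and the version placeholder must be filled in to match your cluster.

```scala
// build.sbt sketch -- mirrors the Maven coordinates above
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.3",
  // Only needed for HDFS access; substitute your actual HDFS version
  "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
)
```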
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
(Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.)
Initializing Spark
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
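A sketch of what that looks like in practice; the class name, jar name, and master URL below are hypothetical placeholders, not values from this guide:

```shell
# Submit to a standalone cluster, passing the master on the command line
# instead of hardcoding it in the program (names are hypothetical)
./bin/spark-submit \
  --class com.example.MyApp \
  --master spark://host:7077 \
  app.jar

# For local testing, run in-process on 4 cores
./bin/spark-submit --class com.example.MyApp --master "local[4]" app.jar
```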
Using the Shell
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. Sonatype) can be passed to the --repositories argument. For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the more general spark-submit script.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation:
scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08
Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows: distFile.map(s => s.length).reduce((a, b) => a + b).
Some notes on reading files with Spark:
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
- All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
Apart from text files, Spark’s Scala API also supports several other data formats:
- SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
- For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
- For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
- RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
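As a sketch of the object-file round trip (the path below is hypothetical, and an active SparkContext `sc` is assumed):

```scala
// Sketch: save an RDD as serialized Java objects, then load it back.
// Assumes an active SparkContext `sc`; the output path is hypothetical.
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5))
rdd.saveAsObjectFile("hdfs:///tmp/ints")           // writes serialized objects
val restored = sc.objectFile[Int]("hdfs:///tmp/ints")
val total = restored.reduce(_ + _)                 // 15
```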
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Basics
To illustrate RDD basics, consider the simple program below:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)
The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.
If we also wanted to use lineLengths again later, we could add:
lineLengths.persist()
before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.
Passing Functions to Spark
Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:
- Anonymous function syntax, which can be used for short pieces of code.
- Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)
Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:
class MyClass {
def func1(s: String): String = { ... }
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map inside there references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will reference the whole object:
class MyClass {
val field = "Hello"
def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
val field_ = this.field
rdd.map(x => field_ + x)
}
Understanding closures
One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.
Example
Consider the naive RDD element sum below, which may behave differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in local mode (--master = local[n]) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN):
var counter = 0
var rdd = sc.parallelize(data)
// Wrong: Don't do this!!
rdd.foreach(x => counter += x)
println("Counter value: " + counter)
Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.
The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.
In local mode, in some circumstances the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.
To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
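The broken counter example above can be rewritten safely with an accumulator. This is a sketch against the Spark 1.6 API, assuming an active SparkContext `sc` and the `rdd` from the earlier snippet:

```scala
// Sketch: the safe version of the counter example, using an accumulator.
// Assumes an active SparkContext `sc` and the `rdd` defined earlier.
val accum = sc.accumulator(0)
rdd.foreach(x => accum += x)
// Task-side updates are merged back on the driver, so this is well-defined
println("Counter value: " + accum.value)
```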
Printing elements of an RDD
Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead, not the one on the driver, so stdout on the driver won’t show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).
Working with Key-Value Pairs
While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.
In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.
For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.
Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals() method is accompanied with a matching hashCode() method. For full details, see the contract outlined in the Object.hashCode() documentation.
Transformations
The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Actions
The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.
Action | Meaning |
---|---|
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel. |
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data. |
count() | Return the number of elements in the dataset. |
first() | Return the first element of the dataset (similar to take(1)). |
take(n) | Return an array with the first n elements of the dataset. |
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed. |
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator. |
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file. |
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). |
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile() . |
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key. |
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details. |
Shuffle operations
Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.
Background
To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.
Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:
- mapPartitions to sort each partition using, for example, .sorted
- repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
- sortBy to make a globally ordered RDD
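For instance, a sortBy sketch (assumes an active SparkContext `sc`):

```scala
// Sketch: sortBy produces a globally ordered RDD after the shuffle.
// Assumes an active SparkContext `sc`.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val ordered = pairs.sortBy(_._1)   // total order across all partitions
ordered.collect()                  // Array((a,1), (b,2), (c,3))
```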
Operations which can cause a shuffle include repartition operations like repartition and coalesce, ‘ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.
Performance Impact
The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.
Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.
Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.
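For example, spark.local.dir (mentioned above) can be set on the SparkConf before the context is created; the application name and path below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                          // placeholder application name
  .set("spark.local.dir", "/mnt/spark-scratch") // placeholder scratch directory
val sc = new SparkContext(conf)
```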
RDD Persistence
One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.
You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:
Storage Level | Meaning |
---|---|
MEMORY_ONLY | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level. |
MEMORY_AND_DISK | Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed. |
MEMORY_ONLY_SER | Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read. |
MEMORY_AND_DISK_SER | Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed. |
DISK_ONLY | Store the RDD partitions only on disk. |
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. | Same as the levels above, but replicate each partition on two cluster nodes. |
OFF_HEAP (experimental) | Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon out-of-the-box. Please refer to this page for the suggested version pairings. |
Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.
Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.
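As a minimal sketch (the file name is a placeholder), a non-default storage level is chosen by passing a StorageLevel to persist():

```scala
scala> import org.apache.spark.storage.StorageLevel
scala> val lines = sc.textFile("data.txt")
scala> lines.persist(StorageLevel.MEMORY_ONLY_SER)
scala> lines.count()  // first action computes the RDD and caches its partitions
scala> lines.count()  // later actions reuse the serialized in-memory copy
```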
Which Storage Level to Choose?
Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
- If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
- If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
- Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.
- Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
- In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:
  - It allows multiple executors to share the same pool of memory in Tachyon.
  - It significantly reduces garbage collection costs.
  - Cached data is not lost if individual executors crash.
Removing Data
Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
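Continuing the snippet above, the broadcast value is referenced through value inside closures, so the array travels to each node once rather than with every task (a minimal sketch):

```scala
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
scala> // index into the broadcast array inside the closure
scala> sc.parallelize(1 to 3).map(i => broadcastVar.value(i - 1) * 10).collect()
// returns Array(10, 20, 30)
```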
Accumulators
Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).
An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator's value, using its value method.
The code below shows an accumulator being used to add up the elements of an array:
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Int = 10
While this code used the built-in support for accumulators of type Int, programmers can also create their own types by subclassing AccumulatorParam. The AccumulatorParam interface has two methods: zero for providing a "zero value" for your data type, and addInPlace for adding two values together. For example, supposing we had a Vector class representing mathematical vectors, we could write:
object VectorAccumulatorParam extends AccumulatorParam[Vector] {
def zero(initialValue: Vector): Vector = {
Vector.zeros(initialValue.size)
}
def addInPlace(v1: Vector, v2: Vector): Vector = {
v1 += v2
}
}
// Then, create an Accumulator of this type:
val vecAccum = sc.accumulator(new Vector(...))(VectorAccumulatorParam)
In Scala, Spark also supports the more general Accumulable interface to accumulate data where the resulting type is not the same as the elements added (e.g. build a list by collecting together elements), and the SparkContext.accumulableCollection method for accumulating common Scala collection types.
For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). The code fragment below demonstrates this property:
val accum = sc.accumulator(0)
data.map { x => accum += x; f(x) }
// Here, accum is still 0 because no actions have caused the map to be computed.
Deploying to a Cluster
The application submission guide describes how to submit applications to a cluster. In short, once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
Launching Spark jobs from Java / Scala
The org.apache.spark.launcher package provides classes for launching Spark jobs as child processes using a simple Java API.
Unit Testing
Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework's tearDown method, as Spark does not support two contexts running concurrently in the same program.
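A minimal skeleton of that pattern (the test framework is omitted and the names here are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("unit-test").setMaster("local"))
try {
  val counts = sc.parallelize(Seq("a", "b", "a"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
    .collectAsMap()
  assert(counts("a") == 2)
} finally {
  sc.stop() // Spark does not support two concurrent contexts in one JVM
}
```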
Migrating from pre-1.0 Versions of Spark
Spark 1.0 freezes the API of Spark Core for the 1.X series, in that any API available today that is not marked "experimental" or "developer API" will be supported in future versions. The only change for Scala users is that the grouping operations, e.g. groupByKey, cogroup and join, have changed from returning (Key, Seq[Value]) pairs to (Key, Iterable[Value]).
Migration guides are also available for Spark Streaming, MLlib and GraphX.
Where to Go from Here
You can see some example Spark programs on the Spark website. In addition, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run Java and Scala examples by passing the class name to Spark's bin/run-example script; for instance:
./bin/run-example SparkPi
For Python examples, use spark-submit instead:
./bin/spark-submit examples/src/main/python/pi.py
For R examples, use spark-submit instead:
./bin/spark-submit examples/src/main/r/dataframe.R
For help on optimizing your programs, the configuration and tuning guides provide information on best practices. They are especially important for making sure that your data is stored in memory in an efficient format. For help on deploying, the cluster mode overview describes the components involved in distributed operation and supported cluster managers.
Finally, full API documentation is available in Scala, Java, Python and R.