Spark 1.6.3 Study Notes 02: Spark Programming Guide

Spark Programming Guide

Overview

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.

Linking with Spark 

Spark 1.6.3 uses Scala 2.10. To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.10.X).

To write a Spark application, you need to add a Maven dependency on Spark. Spark is available through Maven Central at:

groupId = org.apache.spark
artifactId = spark-core_2.10
version = 1.6.3

In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
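
If you build with sbt instead of Maven, the same coordinates translate to roughly the build definition below. This is only a sketch: the "provided" scope and the Scala patch version are assumptions, and <your-hdfs-version> still has to be replaced with the version your cluster runs.

name := "simple-spark-app"

scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.3" % "provided",       // %% appends _2.10
  "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"    // only if you access HDFS
)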

Finally, you need to import some Spark classes into your program. Add the following lines:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

(Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.)

Initializing Spark

The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext, you first need to build a SparkConf object that contains information about your application.

Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
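
Putting the two snippets together, a minimal self-contained application looks roughly like the sketch below (the object name and the local[4] master are illustrative; on a real cluster you would drop setMaster and pass --master to spark-submit instead):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      // A trivial job, just to show the context being used.
      val total = sc.parallelize(1 to 100).reduce(_ + _)
      println("Sum: " + total)
    } finally {
      sc.stop()  // only one SparkContext may be active per JVM
    }
  }
}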

Using the Shell

In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. SonaType) can be passed to the --repositories argument. For example, to run bin/spark-shell on exactly four cores, use:

$ ./bin/spark-shell --master local[4]

Or, to also add code.jar to its classpath, use:

$ ./bin/spark-shell --master local[4] --jars code.jar

To include a dependency using maven coordinates:

$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"

For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the more general spark-submit script.

Resilient Distributed Datasets (RDDs)

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelized Collections

Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.

One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
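
As a small sketch (assuming the shell's sc), requesting an explicit partition count looks like this:

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data, 10)      // ask for 10 partitions instead of the default
println(distData.partitions.length)          // 10
println(distData.reduce((a, b) => a + b))    // 15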

External Datasets

Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.

Text file RDDs can be created using SparkContext's textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc. URI) and reads it as a collection of lines. Here is an example invocation:

scala> val distFile = sc.textFile("data.txt")
distFile: RDD[String] = MappedRDD@1d4cee08

Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows: distFile.map(s => s.length).reduce((a, b) => a + b).

Some notes on reading files with Spark:

  • If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.

  • All of Spark's file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").

  • The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 64MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks.
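
A brief sketch of the last two points, assuming a directory of text files exists at the illustrative path below and is reachable from every worker:

val logs = sc.textFile("/my/directory/*.txt", 16)  // second argument = minimum number of partitions
println(logs.partitions.length)                    // at least 16 for splittable text input
println(logs.count())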

Apart from text files, Spark’s Scala API also supports several other data formats:

  • SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.

  • For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.

  • For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).

  • RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
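
A minimal round trip with the object-file methods might look like this sketch (the output path is illustrative and must not already exist):

val nums = sc.parallelize(1 to 1000)
nums.saveAsObjectFile("/tmp/nums-obj")             // writes serialized Java objects
val reloaded = sc.objectFile[Int]("/tmp/nums-obj") // loads them back as an RDD[Int]
println(reloaded.count())                          // 1000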

RDD Operations 

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

Basics 

To illustrate RDD basics, consider the simple program below:

val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

The first line defines a base RDD from an external file. This dataset is not loaded in memory or otherwise acted on: lines is merely a pointer to the file. The second line defines lineLengths as the result of a map transformation. Again, lineLengths is not immediately computed, due to laziness. Finally, we run reduce, which is an action. At this point Spark breaks the computation into tasks to run on separate machines, and each machine runs both its part of the map and a local reduction, returning only its answer to the driver program.

If we also wanted to use lineLengths again later, we could add:

lineLengths.persist()

before the reduce, which would cause lineLengths to be saved in memory after the first time it is computed.

Passing Functions to Spark 

Spark’s API relies heavily on passing functions in the driver program to run on the cluster. There are two recommended ways to do this:

  • Anonymous function syntax, which can be used for short pieces of code.
  • Static methods in a global singleton object. For example, you can define object MyFunctions and then pass MyFunctions.func1, as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunctions.func1)

Note that while it is also possible to pass a reference to a method in a class instance (as opposed to a singleton object), this requires sending the object that contains that class along with the method. For example, consider:

class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}

Here, if we create a new MyClass and call doStuff on it, the map inside there references the func1 method of that MyClass instance, so the whole object needs to be sent to the cluster. It is similar to writing rdd.map(x => this.func1(x)).

In a similar way, accessing fields of the outer object will reference the whole object:

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}

is equivalent to writing rdd.map(x => this.field + x), which references all of this. To avoid this issue, the simplest way is to copy field into a local variable instead of accessing it externally:

def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}

Understanding closures

One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.

Example

Consider the naive RDD element sum below, which may behave differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in local mode (--master = local[n]) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN):

var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

Local vs. cluster modes

The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.

The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.

In local mode, in some circumstances the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.

To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.

In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
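
For reference, here is a sketch of the counter example rewritten with an accumulator, which is the supported way to do this kind of global aggregation:

val data = Array(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)
val counter = sc.accumulator(0)

rdd.foreach(x => counter += x)              // each task adds to the accumulator

println("Counter value: " + counter.value)  // 15, read on the driver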

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD's elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor's stdout instead, not the one on the driver, so stdout on the driver won't show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use take(): rdd.take(100).foreach(println).

Working with Key-Value Pairs

While most Spark operations work on RDDs containing any type of objects, a few special operations are only available on RDDs of key-value pairs. The most common ones are distributed “shuffle” operations, such as grouping or aggregating the elements by a key.

In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)). The key-value pair operations are available in the PairRDDFunctions class, which automatically wraps around an RDD of tuples.

For example, the following code uses the reduceByKey operation on key-value pairs to count how many times each line of text occurs in a file:

val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

We could also use counts.sortByKey(), for example, to sort the pairs alphabetically, and finally counts.collect() to bring them back to the driver program as an array of objects.

Note: when using custom objects as the key in key-value pair operations, you must be sure that a custom equals() method is accompanied with a matching hashCode() method. For full details, see the contract outlined in the Object.hashCode() documentation.
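
A Scala case class gives you consistent equals() and hashCode() for free, which is the simplest way to satisfy this contract in a compiled application. A small sketch with a hypothetical Endpoint key:

case class Endpoint(host: String, port: Int)

val requests = sc.parallelize(Seq(
  (Endpoint("a.example.com", 80), 1),
  (Endpoint("a.example.com", 80), 1),
  (Endpoint("b.example.com", 443), 1)
))
val hits = requests.reduceByKey(_ + _)
hits.collect().foreach(println)  // (Endpoint(a.example.com,80),2) and (Endpoint(b.example.com,443),1), in some order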

Transformations

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Transformation / Meaning

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

filter(func): Return a new dataset formed by selecting those elements of the source on which func returns true.

flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

mapPartitions(func): Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.

mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.

sample(withReplacement, fraction, seed): Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.

union(otherDataset): Return a new dataset that contains the union of the elements in the source dataset and the argument.

intersection(otherDataset): Return a new RDD that contains the intersection of elements in the source dataset and the argument.

distinct([numTasks]): Return a new dataset that contains the distinct elements of the source dataset.

groupByKey([numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

sortByKey([ascending], [numTasks]): When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

join(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.

cogroup(otherDataset, [numTasks]): When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.

cartesian(otherDataset): When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).

pipe(command, [envVars]): Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.

coalesce(numPartitions): Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

repartition(numPartitions): Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

repartitionAndSortWithinPartitions(partitioner): Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
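
A short sketch chaining a few of these transformations (everything stays lazy until the final action):

val lines = sc.parallelize(Seq("to be or not to be", "that is the question"))
val words = lines.flatMap(line => line.split(" "))
val longWords = words.filter(_.length > 2).distinct()
println(longWords.count())  // 4: "not", "that", "the", "question"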

Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Action / Meaning

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect(): Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count(): Return the number of elements in the dataset.

first(): Return the first element of the dataset (similar to take(1)).

take(n): Return an array with the first n elements of the dataset.

takeSample(withReplacement, num, [seed]): Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering]): Return the first n elements of the RDD using either their natural order or a custom comparator.

saveAsTextFile(path): Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala): Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala): Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().

countByKey(): Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.

foreach(func): Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.

Shuffle operations

Certain operations within Spark trigger an event known as the shuffle. The shuffle is Spark’s mechanism for re-distributing data so that it’s grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Background

To understand what happens during the shuffle we can consider the example of the reduceByKey operation. The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple - the key and the result of executing a reduce function against all values associated with that key. The challenge is that not all values for a single key necessarily reside on the same partition, or even the same machine, but they must be co-located to compute the result.

In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation. During computations, a single task will operate on a single partition - thus, to organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation. It must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key - this is called the shuffle.

Although the set of elements in each partition of newly shuffled data will be deterministic, and so is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably ordered data following shuffle then it’s possible to use:

  • mapPartitions to sort each partition using, for example, .sorted
  • repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
  • sortBy to make a globally ordered RDD
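
For example, a sortBy call that produces a globally ordered RDD might look like this sketch:

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val ordered = pairs.sortBy(_._1)    // total order across all partitions (involves a shuffle)
ordered.collect().foreach(println)  // (a,1) (b,2) (c,3)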

Operations which can cause a shuffle include repartition operations like repartition and coalesce, 'ByKey operations (except for counting) like groupByKey and reduceByKey, and join operations like cogroup and join.

Performance Impact 

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it. This nomenclature comes from MapReduce and does not directly relate to Spark’s map and reduce operations.

Internally, results from individual map tasks are kept in memory until they can’t fit. Then, these are sorted based on the target partition and written to a single file. On the reduce side, tasks read the relevant sorted blocks.

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don't need to be re-created if the lineage is re-computed. Garbage collection may happen only after a long period of time, if the application retains references to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may consume a large amount of disk space. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.

Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the ‘Shuffle Behavior’ section within the Spark Configuration Guide.

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:
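
Choosing a non-default level is just a matter of passing it to persist(); a minimal sketch:

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("data.txt")
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spilling to disk when needed
println(lines.count())                           // the first action computes and caches the RDD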

Storage Level / Meaning

MEMORY_ONLY: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.

MEMORY_AND_DISK: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.

MEMORY_ONLY_SER: Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.

MEMORY_AND_DISK_SER: Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.

DISK_ONLY: Store the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.: Same as the levels above, but replicate each partition on two cluster nodes.

OFF_HEAP (experimental): Store RDD in serialized format in Tachyon. Compared to MEMORY_ONLY_SER, OFF_HEAP reduces garbage collection overhead and allows executors to be smaller and to share a pool of memory, making it attractive in environments with large heaps or multiple concurrent applications. Furthermore, as the RDDs reside in Tachyon, the crash of an executor does not lead to losing the in-memory cache. In this mode, the memory in Tachyon is discardable. Thus, Tachyon does not attempt to reconstruct a block that it evicts from memory. If you plan to use Tachyon as the off heap store, Spark is compatible with Tachyon out-of-the-box. Please refer to this page for the suggested version pairings.

Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level.

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

Which Storage Level to Choose?

Spark’s storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:

  • If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.

  • If not, try using MEMORY_ONLY_SER and selecting a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.

  • Don’t spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from disk.

  • Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.

  • In environments with high amounts of memory or multiple applications, the experimental OFF_HEAP mode has several advantages:

    • It allows multiple executors to share the same pool of memory in Tachyon.
    • It significantly reduces garbage collection costs.
    • Cached data is not lost if individual executors crash.

Removing Data

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. If you would like to manually remove an RDD instead of waiting for it to fall out of the cache, use the RDD.unpersist() method.

Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
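
A sketch of the usual pattern, shipping a lookup table once per node instead of once per task (the table contents are illustrative):

val countryNames = Map("cn" -> "China", "us" -> "United States")
val bcNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("cn", "us", "cn"))
val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
resolved.collect().foreach(println)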

Accumulators

Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

The code below shows an accumulator being used to add up the elements of an array:

scala> val accum = sc.accumulator(0, "My Accumulator")
accum: spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Int = 10

While this code used the built-in support for accumulators of type Int, programmers can also create their own types by subclassing AccumulatorParam. The AccumulatorParam interface has two methods: zero for providing a “zero value” for your data type, and addInPlace for adding two values together. For example, supposing we had a Vector class representing mathematical vectors, we could write:

object VectorAccumulatorParam extends AccumulatorParam[Vector] {
  def zero(initialValue: Vector): Vector = {
    Vector.zeros(initialValue.size)
  }
  def addInPlace(v1: Vector, v2: Vector): Vector = {
    v1 += v2
  }
}

// Then, create an Accumulator of this type:
val vecAccum = sc.accumulator(new Vector(...))(VectorAccumulatorParam)

In Scala, Spark also supports the more general Accumulable interface to accumulate data where the resulting type is not the same as the elements added (e.g. build a list by collecting together elements), and the SparkContext.accumulableCollection method for accumulating common Scala collection types.

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.

Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). The below code fragment demonstrates this property:

val accum = sc.accumulator(0)
data.map { x => accum += x; f(x) }
// Here, accum is still 0 because no actions have caused the map to be computed.

Deploying to a Cluster

The application submission guide describes how to submit applications to a cluster. In short, once you package your application into a JAR (for Java/Scala) or a set of .py or .zip files (for Python), the bin/spark-submit script lets you submit it to any supported cluster manager.
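
A typical submission command has roughly this shape (the class name, jar path and master URL below are placeholders, not values from this guide):

./bin/spark-submit \
  --class com.example.SimpleApp \
  --master spark://your-master:7077 \
  target/scala-2.10/simple-spark-app_2.10-0.1.jar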

Launching Spark jobs from Java / Scala

The org.apache.spark.launcher package provides classes for launching Spark jobs as child processes using a simple Java API.

Unit Testing

Spark is friendly to unit testing with any popular unit test framework. Simply create a SparkContext in your test with the master URL set to local, run your operations, and then call SparkContext.stop() to tear it down. Make sure you stop the context within a finally block or the test framework’s tearDown method, as Spark does not support two contexts running concurrently in the same program.
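
A bare-bones sketch of such a test, with no particular test framework assumed (the names and data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

def testWordCount(): Unit = {
  val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
  try {
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("a") == 2 && counts("b") == 1)
  } finally {
    sc.stop()  // Spark does not support two contexts running concurrently in the same JVM
  }
}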

Migrating from pre-1.0 Versions of Spark

Spark 1.0 freezes the API of Spark Core for the 1.X series, in that any API available today that is not marked "experimental" or "developer API" will be supported in future versions. The only change for Scala users is that the grouping operations, e.g. groupByKey, cogroup and join, have changed from returning (Key, Seq[Value]) pairs to (Key, Iterable[Value]).

Migration guides are also available for Spark Streaming, MLlib and GraphX.

Where to Go from Here

You can see some example Spark programs on the Spark website. In addition, Spark includes several samples in the examples directory (Scala, Java, Python, R). You can run Java and Scala examples by passing the class name to Spark's bin/run-example script; for instance:

./bin/run-example SparkPi

For Python examples, use spark-submit instead:

./bin/spark-submit examples/src/main/python/pi.py

For R examples, use spark-submit instead:

./bin/spark-submit examples/src/main/r/dataframe.R

For help on optimizing your programs, the configuration and tuning guides provide information on best practices. They are especially important for making sure that your data is stored in memory in an efficient format. For help on deploying, the cluster mode overview describes the components involved in distributed operation and supported cluster managers.

Finally, full API documentation is available in Scala, Java, Python and R.
