Spark Official Documentation Translation: Spark Programming Guide (Part 1)

A Spark application consists of a driver program and resilient distributed datasets (RDDs), which are fault-tolerant collections that can be operated on in parallel. Creating a SparkContext is the first step in initializing Spark, and a SparkConf is used to configure the connection to the cluster. Spark applications can be written in Scala and Python, which requires adding the corresponding dependencies. RDDs can be created from existing collections or from external data such as text files and HDFS. RDD operations fall into transformations and actions: transformations are computed lazily, while actions trigger the actual computation. RDDs can be cached in memory with the persist method to improve performance.

Overview

At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.

At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. The most important abstraction Spark provides is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs can be created from a file in the Hadoop file system (or any other Hadoop-supported file system), or from an existing Scala collection in the driver program, and then transformed. Users can also ask Spark to persist an RDD in memory so that it can be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
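
A minimal sketch of these ideas, assuming a SparkContext named sc is already available (for example, the one provided by the interactive shell) and using a placeholder input path:

// Create an RDD from a driver-side Scala collection and from a file.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
val fromFile = sc.textFile("hdfs://path/to/input.txt")   // placeholder path

// Ask Spark to keep the RDD in memory so later parallel operations can reuse it.
val cached = fromFile.persist()
println(cached.count())   // the first action materializes and caches the data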

A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.

The second abstraction Spark provides is shared variables, which can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to every task. Sometimes a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that can only be "added" to, such as counters and sums.
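
As a brief sketch of both kinds of shared variables (again assuming an existing SparkContext sc; the data here is made up for illustration):

// Broadcast variable: a read-only value cached in memory on every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: tasks can only add to it; the driver reads the result.
val missing = sc.longAccumulator("missing keys")

sc.parallelize(Seq("a", "b", "c")).foreach { key =>
  if (!lookup.value.contains(key)) missing.add(1L)
}
println(missing.value)   // read back on the driver (here: 1, for "c")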

This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.

This guide walks through each of these features in each of the languages Spark supports. It is easiest to follow along if you launch Spark's interactive shell: bin/spark-shell for the Scala shell, or bin/pyspark for the Python one.
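
For example, from the root of a Spark distribution (both shells start with a ready-made SparkContext bound to the variable sc):

./bin/spark-shell   # Scala shell
./bin/pyspark       # Python shell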

Linking with Spark

Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. (Spark can be built to work with other versions of Scala, too.) To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.11.X).

Spark 2.2.0 is built and distributed against Scala 2.11 by default (it can also be built against other Scala versions). To write applications in Scala, you need to use a compatible Scala version (e.g. 2.11.x).

To write a Spark application, you need to add a Maven dependency on Spark. Spark is available through Maven Central at:
To write a Spark application, you need to add a Maven dependency on Spark, which is available through Maven Central:

groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.2.0

In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.

In addition, if you want to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS:

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
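
Expressed as a pom.xml fragment, the two coordinate blocks above would look roughly like this (the hadoop-client version is left as a placeholder for the version of your cluster):

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- replace with the version of your HDFS cluster -->
    <version>your-hdfs-version</version>
  </dependency>
</dependencies>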

Finally, you need to import some Spark classes into your program. Add the following lines:
Finally, you need to import some Spark classes into your program. Add the following lines:

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
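
With these imports in place, a typical next step (sketched here with a placeholder application name and a local master URL, neither of which comes from this guide) is to build a SparkConf and create the SparkContext used in the examples above:

// Placeholder app name and master URL -- replace with your own.
val conf = new SparkConf().setAppName("MyApp").setMaster("local[*]")
val sc = new SparkContext(conf)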