Quick Start
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. See the programming guide for a more complete reference.
To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
Interactive Analysis with the Spark Shell
Basics
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:
./bin/spark-shell
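If you would rather follow along in Python, the same tour works in Spark’s Python shell, started the same way from the Spark directory:

```
./bin/pyspark
```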
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
More on RDD Operations
RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use the Math.max() function to make this code easier to understand:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
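Because the arguments to map and reduce are ordinary Scala closures, the same logic can be tried on a plain local collection before involving Spark at all. A sketch using Scala’s standard List (the sample lines are made up for illustration):

```scala
// The same map/reduce pipeline on a local Scala collection:
// count the words in each line, then keep the maximum.
val lines = List("# Apache Spark", "Spark is a fast engine", "ok")
val maxWords = lines.map(line => line.split(" ").size)
                    .reduce((a, b) => Math.max(a, b))
// maxWords == 5
```

The RDD version behaves the same way, except that the map and reduce steps run distributed across the cluster.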
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
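As a small extension beyond what the guide covers, the counts can also be ordered before collecting; sortBy is a standard RDD transformation, and take is an action that returns only the first few elements:

```scala
// Sketch: the five most frequent words in the file.
// sortBy shuffles the RDD, ordering pairs by their count, descending.
val top5 = wordCounts.sortBy(pair => pair._2, ascending = false).take(5)
```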
Caching
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27
scala> linesWithSpark.count()
res8: Long = 19
scala> linesWithSpark.count()
res9: Long = 19
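When a cached dataset is no longer needed, it can be dropped from the cache explicitly; unpersist is the standard counterpart of cache:

```scala
// Remove linesWithSpark from the executors' in-memory cache.
// Later actions will recompute the RDD from its lineage if needed.
linesWithSpark.unpersist()
```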
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.
Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.
We’ll create a very simple Spark application in Scala, so simple, in fact, that it’s named SimpleApp.scala:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program.
We pass the SparkContext constructor a SparkConf object which contains information about our application.
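For quick local testing, the master can also be set inside the program rather than on the command line. A sketch (in production the master is normally left to the spark-submit script, which this setting would override):

```scala
// Sketch: run the application on 2 local threads, no cluster needed.
// Hard-coding the master is convenient only for local experiments.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
```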
Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt, which explains that Spark is a dependency. This file also adds a repository that Spark depends on:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"
For sbt to work correctly, we’ll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23