Quick Start
This tutorial provides a quick introduction to using Spark. We will first introduce the API through Spark’s interactive shell (in Python or Scala), then show how to write applications in Java, Scala, and Python. See the programming guide for a more complete reference.
To follow along with this guide, first download a packaged release of Spark from the Spark website. Since we won’t be using HDFS, you can download a package for any version of Hadoop.
Interactive Analysis with the Spark Shell
Basics
Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala (which runs on the Java VM and is thus a good way to use existing Java libraries) or Python. Start it by running the following in the Spark directory:
./bin/spark-shell
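If you would rather follow along in Python, the same tour works in Spark’s Python shell, started the same way from the Spark directory:

```
./bin/pyspark
```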
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs. Let’s make a new RDD from the text of the README file in the Spark source directory:
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:25
RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Let’s start with a few actions:
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
Now let’s use a transformation. We will use the filter transformation to return a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
We can chain together transformations and actions:
scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
res3: Long = 15
More on RDD Operations
RDD actions and transformations can be used for more complex computations. Let’s say we want to find the line with the most words:
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
This first maps a line to an integer value, creating a new RDD. reduce is called on that RDD to find the largest line count. The arguments to map and reduce are Scala function literals (closures), and can use any language feature or Scala/Java library. For example, we can easily call functions declared elsewhere. We’ll use the Math.max() function to make this code easier to understand:
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
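Because the arguments to map and reduce are ordinary Scala closures, the same logic can be tried on a plain local collection before involving Spark at all. A sketch using Scala’s standard List (the sample lines are made up for illustration):

```scala
// The same map/reduce pipeline on a local Scala collection:
// count the words in each line, then keep the maximum.
val lines = List("# Apache Spark", "Spark is a fast engine", "ok")
val maxWords = lines.map(line => line.split(" ").size)
                    .reduce((a, b) => Math.max(a, b))
// maxWords == 5
```

The RDD version behaves the same way, except that the map and reduce steps run distributed across the cluster.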
One common data flow pattern is MapReduce, as popularized by Hadoop. Spark can implement MapReduce flows easily:
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[8] at reduceByKey at <console>:28
Here, we combined the flatMap, map, and reduceByKey transformations to compute the per-word counts in the file as an RDD of (String, Int) pairs. To collect the word counts in our shell, we can use the collect action:
scala> wordCounts.collect()
res6: Array[(String, Int)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
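As a small extension beyond what the guide covers, the counts can also be ordered before collecting; sortBy is a standard RDD transformation, and take is an action that returns only the first few elements:

```scala
// Sketch: the five most frequent words in the file.
// sortBy shuffles the RDD, ordering pairs by their count, descending.
val top5 = wordCounts.sortBy(pair => pair._2, ascending = false).take(5)
```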
Caching
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank. As a simple example, let’s mark our linesWithSpark dataset to be cached:
scala> linesWithSpark.cache()
res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27
scala> linesWithSpark.count()
res8: Long = 19
scala> linesWithSpark.count()
res9: Long = 19
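When a cached dataset is no longer needed, it can be dropped from the cache explicitly; unpersist is the standard counterpart of cache:

```scala
// Remove linesWithSpark from the executors' in-memory cache.
// Later actions will recompute the RDD from its lineage if needed.
linesWithSpark.unpersist()
```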
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is that these same functions can be used on very large data sets, even when they are striped across tens or hundreds of nodes. You can also do this interactively by connecting bin/spark-shell to a cluster, as described in the programming guide.
Self-Contained Applications
Suppose we wish to write a self-contained application using the Spark API. We will walk through a simple application in Scala (with sbt), Java (with Maven), and Python.
We’ll create a very simple Spark application in Scala, so simple, in fact, that it’s named SimpleApp.scala:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
def main(args: Array[String]) {
val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
}
}
Note that applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in the Spark README. Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program.
We pass the SparkContext constructor a SparkConf object which contains information about our application.
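For quick local testing, the master can also be set inside the program rather than on the command line. A sketch (in production the master is normally left to the spark-submit script, which this setting would override):

```scala
// Sketch: run the application on 2 local threads, no cluster needed.
// Hard-coding the master is convenient only for local experiments.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
```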
Our application depends on the Spark API, so we’ll also include an sbt configuration file, simple.sbt, which explains that Spark is a dependency. This file also adds a repository that Spark depends on:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1"
For sbt to work correctly, we’ll need to lay out SimpleApp.scala and simple.sbt according to the typical directory structure. Once that is in place, we can create a JAR package containing the application’s code, then use the spark-submit script to run our program.
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23