Learning Spark from the example source code (2): SparkPi

In this second part we keep computing Pi, but this time with the SparkPi example that ships with the Spark package, which is clearly a step up!

As before, the simple.sbt file:

name := "SparkPi"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

The SparkPi.scala file:

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: SparkPi <master> [<slices>]")
      System.exit(1)
    }
    val spark = new SparkContext(args(0), "SparkPi",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = 100000 * slices
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Directory layout:

$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SparkPi.scala

Run it (note that this time it takes an argument, "local[3]"):

$ sbt "project sparkpi" "run local[3]"
[info] Set current project to SparkPi (in build file:/home/jpan/Mywork/spark-example/exspark/SparkPi/)
[info] Set current project to SparkPi (in build file:/home/jpan/Mywork/spark-example/exspark/SparkPi/)
[info] Updating {file:/home/jpan/Mywork/spark-example/exspark/SparkPi/}sparkpi...
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[info] Done updating.
[info] Compiling 1 Scala source to /home/jpan/Mywork/spark-example/exspark/SparkPi/target/scala-2.10/classes...
[info] Running SparkPi local[3]
..............................................
Pi is roughly 3.14652
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/static,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/metrics/json,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/executors,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/environment,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/stages,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/stages/pool,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/stages/stage,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/storage,null}
14/05/09 16:10:46 INFO handler.ContextHandler: stopped o.e.j.s.h.ContextHandler{/storage/rdd,null}
14/05/09 16:10:48 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/05/09 16:10:48 INFO network.ConnectionManager: Selector thread was interrupted!
14/05/09 16:10:48 INFO network.ConnectionManager: ConnectionManager stopped
14/05/09 16:10:48 INFO storage.MemoryStore: MemoryStore cleared
14/05/09 16:10:48 INFO storage.BlockManager: BlockManager stopped
14/05/09 16:10:48 INFO storage.BlockManagerMasterActor: Stopping BlockManagerMaster
14/05/09 16:10:48 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
14/05/09 16:10:48 INFO spark.SparkContext: Successfully stopped SparkContext
14/05/09 16:10:48 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/05/09 16:10:48 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
[success] Total time: 12 s, completed May 9, 2014 4:10:48 PM
Source code analysis

    if (args.length == 0) {
      System.err.println("Usage: SparkPi <master> [<slices>]")
      System.exit(1)
    }

This block requires the user to supply command-line arguments. Angle brackets <> mark a required argument: <master> is the master URL of the cluster the Spark program runs on. Since I am running locally, passing local (here local[3]) is enough.

Square brackets [ ] mark an optional argument: <slices> is the number of slices (partitions) the work is split into, defaulting to 2 when omitted. Note that the 3 in local[3] is not this argument; it is part of the master URL and tells Spark to run with 3 local worker threads.
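For example (a hypothetical invocation of my own, following the same pattern as the run shown above), you could pass a second argument to split the work into 10 slices while still running on 3 local threads:

$ sbt "project sparkpi" "run local[3] 10"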

    val spark = new SparkContext(args(0), "SparkPi",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))

SparkContext is one of the most important classes in Spark; it sets up the runtime configuration and environment. Its API documentation describes it as:

Main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
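As an aside, a SparkContext can also be built from a SparkConf object (available from Spark 0.9 onwards). The sketch below is mine, not part of the example; it just shows an equivalent way to express the same configuration:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch, assuming the SparkConf API of Spark 0.9.x:
// the master URL and application name play the same roles as
// args(0) and "SparkPi" in the constructor call above.
val conf = new SparkConf()
  .setMaster("local[3]")
  .setAppName("SparkPi")
val spark = new SparkContext(conf)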

Next comes the parallel computation of Pi, which is a Monte Carlo estimate: random points are drawn uniformly from the square [-1, 1] x [-1, 1] (area 4), and the unit circle inside it has area π, so the fraction of points that land inside the circle, multiplied by 4, approximates π. The snippet uses parallelize, whose declaration follows below:

    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

Distribute a local Scala collection to form an RDD.
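To make the pattern concrete, here is a tiny sketch of my own (not from the example) that reuses the spark context created above with the same parallelize / map / reduce shape, just summing squares instead of counting hits:

// A minimal sketch: split 1 to 10 into 2 slices, square each element
// within its slice, then reduce the partial results into a single sum.
val squares = spark.parallelize(1 to 10, 2).map(i => i * i)
val total = squares.reduce(_ + _)
println(total)  // prints 385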

OK, that wraps up this section. By now you should have a basic feel for how Spark programs are written and run.
