Spark Installation

1. Introduction to Spark

First, a brief introduction to Spark and its ecosystem. Its main characteristics can be summarized as follows:

  • fast: Spark introduces the concept of the RDD (covered in detail in the next post); the project officially claims up to 100x the performance of MapReduce.
  • fault-tolerant: an RDD records its lineage, i.e. the sequence of transformations that produced it, so when a node dies the lost data can be recomputed, giving jobs automatic fault tolerance.
  • scalable: like MapReduce, Spark uses a Master-Worker architecture and scales out simply by adding Workers.
  • compatible: Spark's storage interface is Hadoop-compatible; it reads data on HDFS, HBase, Cassandra, S3, etc. through InputFormat/OutputFormat.
  • concise: Spark is written in Scala and takes full advantage of Scala's terse syntax and functional programming style; the whole project is only about 20,000 lines of code. Besides the Scala API, Spark also provides Java and Python APIs. (The word-count sketch right after this list gives a feel for how concise the Scala API is.)
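
To give a feel for that conciseness, here is a minimal word-count sketch in Scala. It is not part of this installation; the SparkContext arguments and the HDFS path are purely illustrative:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._   // implicit conversions, e.g. pair-RDD operations like reduceByKey

    object WordCount {
      def main(args: Array[String]) {
        // "local" master and the input path are hypothetical, just for illustration
        val sc = new SparkContext("local", "WordCount")
        val counts = sc.textFile("hdfs:///tmp/input.txt")
                       .flatMap(_.split("\\s+"))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
        counts.take(10).foreach(println)
        sc.stop()
      }
    }

The same logic in plain MapReduce would require separate Mapper and Reducer classes plus job-configuration boilerplate.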

Where Spark fits (the scenarios it is suited to):

The MapReduce framework has been hugely successful in distributed processing, but its strength is large-scale acyclic data flows, which makes it a good fit for batch jobs. In the following two scenarios, however, MapReduce is not efficient:

  • Iterative jobs: many machine learning algorithms (e.g. KMeans, logistic regression) make multiple passes over the same data set; between iterations only the model parameters change while the data set barely changes. Implemented on MapReduce (as in Mahout), each iteration becomes a separate MapReduce job, and every job reloads the data from HDFS, computes, and persists its result back to HDFS, which is an enormous amount of I/O and wasted time. If the data set being iterated over could be cached in memory, performance would improve dramatically; Spark achieves exactly this with RDDs (see the caching sketch right after this list).
  • Interactive analytics: SQL-like engines such as Hive and Pig were built on top of MapReduce to make ad-hoc queries easier, but query latency remains a constant complaint: a single query typically takes minutes or even hours, nowhere near the interactive feel of a database. The root cause is that MapReduce was designed for batch processing rather than interactive use: the Map phase persists intermediate results to local disk (disk I/O), the shuffle phase fetches those Map outputs over HTTP to the Reduce side (network I/O), each job's Reduce output must be written to HDFS before the next job can run, and the next job then reloads that output from HDFS. This model is fine for large batch workloads, but it cannot provide fast, second-level interactive ad-hoc queries. Spark abandons the blunt Map-Shuffle-Reduce programming model in favour of a Transformation/Action model; with RDDs, intermediate results can be cached rather than persisted, and Shark provides a Hive-compatible SQL interface that reaches the interactive query efficiency of MPP distributed databases (sub-second response times on a 39 GB data set).
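
To make the iterative-job point concrete, here is a minimal sketch of what caching buys you. The input path, the parsing, and the update rule are all hypothetical; the only thing that matters is that cache() keeps the parsed data in memory so later passes avoid re-reading HDFS:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "IterativeDemo")

    // Load and parse the data set once, then pin it in memory.
    val points = sc.textFile("hdfs:///tmp/points.txt")      // hypothetical input
                   .map(_.split(",").map(_.toDouble))
                   .cache()                                  // later passes hit memory, not HDFS

    var w = 0.0                                              // the only thing that changes per iteration
    for (i <- 1 to 10) {
      val gradient = points.map(p => p(0) * w - p(1)).reduce(_ + _) / points.count()
      w -= 0.1 * gradient
    }
    println("final w = " + w)

With MapReduce, each of those ten passes would be a separate job that re-reads the input and re-writes its output to HDFS; here, assuming the cached data fits in memory, only the first pass touches the distributed file system.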

2. Prerequisites

Spark requires JDK 6.0 and Scala 2.9.3 or later, so first make sure suitable versions of the JDK and Scala are installed and added to the PATH. This part is straightforward and is not covered in detail here. Once installed, verify the JDK and Scala versions (for example with java -version and scala -version).

2.1 Verify the Java version

    [root@hadoop spark-0.8.0]# java -version
    java version "1.6.0_24"
    Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    Java HotSpot(TM) Client VM (build 19.1-b02, mixed mode, sharing)

2.2 Install Scala 2.10.3

Spark 0.8.0 depends on Scala 2.9.3; here we install the latest release, Scala 2.10.3.

    [root@hadoop ~]# wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
    --2013-12-06 06:32:14--  http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
    Resolving www.scala-lang.org... 128.178.154.159
    Connecting to www.scala-lang.org|128.178.154.159|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 30531249 (29M) [application/x-gzip]
    Saving to: “scala-2.10.3.tgz”


    100%[======================================>] 30,531,249  38.0K/s   in 13m 56s 


    2013-12-06 06:46:10 (35.7 KB/s) - “scala-2.10.3.tgz” saved [30531249/30531249]

    [root@hadoop ~]# tar -zxf scala-2.10.3.tgz

    [root@hadoop ~]# vi /etc/profile

    export SCALA_HOME=/root/scala-2.10.3
    export PATH=.:$PATH:$JAVA_HOME/bin:$MAVEN_HOME/bin:/usr/local/protobuf/bin:/root/hadoop-2.2.0/sbin:/root/hadoop-2.2.0/bin:$SCALA_HOME/bin

3. Download a pre-built Spark release

    [root@hadoop ~]# wget http://spark-project.org/download/spark-0.8.0-incubating-bin-hadoop1.tgz
    --2013-12-06 07:03:30--  http://spark-project.org/download/spark-0.8.0-incubating-bin-hadoop1.tgz
    Resolving spark-project.org... 128.32.37.248
    Connecting to spark-project.org|128.32.37.248|:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: http://d3kbcqa49mib13.cloudfront.net/spark-0.8.0-incubating-bin-hadoop1.tgz [following]
    --2013-12-06 07:03:31--  http://d3kbcqa49mib13.cloudfront.net/spark-0.8.0-incubating-bin-hadoop1.tgz
    Resolving d3kbcqa49mib13.cloudfront.net... 54.240.168.9, 54.240.168.148, 54.230.156.40, ...
    Connecting to d3kbcqa49mib13.cloudfront.net|54.240.168.9|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 133589594 (127M) [application/x-compressed]
    Saving to: “spark-0.8.0-incubating-bin-hadoop1.tgz”


    100%[======================================>] 133,589,594  752K/s   in 2m 49s  


    2013-12-06 07:06:22 (771 KB/s) - “spark-0.8.0-incubating-bin-hadoop1.tgz” saved [133589594/133589594]

4. Local mode

4.1 Extract the archive

    [root@hadoop ~]# tar zxf spark-0.8.0-incubating-bin-hadoop1.tgz

    [root@hadoop ~]# mv spark-0.8.0-incubating-bin-hadoop1 spark-0.8.0

4.2 Set the SPARK_EXAMPLES_JAR environment variable

The bundled example programs read this variable to locate their jar so it can be shipped to the worker processes (note the "Added JAR ..." line in the SparkPi log below).

    [root@hadoop ~]# vi ~/.bash_profile

    export SPARK_EXAMPLES_JAR=/root/spark-0.8.0/examples/target/spark-examples_2.9.3-0.8.0-incubating.jar

4.3 Set the SPARK_HOME environment variable and add SPARK_HOME/bin to the PATH

    [root@hadoop ~]#  vi /etc/profile

    export SPARK_HOME=/root/spark-0.8.0
    export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME
    export PATH
4.4 Now you can run SparkPi

    [root@hadoop spark-0.8.0]# run-example org.apache.spark.examples.SparkPi local
    13/12/06 07:29:30 INFO slf4j.Slf4jEventHandler: Slf4jEventHandler started
    13/12/06 07:29:31 INFO spark.SparkEnv: Registering BlockManagerMaster
    13/12/06 07:29:31 INFO storage.MemoryStore: MemoryStore started with capacity 160.8 MB.
    13/12/06 07:29:31 INFO storage.DiskStore: Created local directory at /tmp/spark-local-20131206072931-d10f
    13/12/06 07:29:31 INFO network.ConnectionManager: Bound socket to port 55557 with id = ConnectionManagerId(hadoop,55557)
    13/12/06 07:29:31 INFO storage.BlockManagerMaster: Trying to register BlockManager
    13/12/06 07:29:31 INFO storage.BlockManagerMasterActor$BlockManagerInfo: Registering block manager hadoop:55557 with 160.8 MB RAM
    13/12/06 07:29:31 INFO storage.BlockManagerMaster: Registered BlockManager
    13/12/06 07:29:32 INFO server.Server: jetty-7.x.y-SNAPSHOT
    13/12/06 07:29:32 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:53517
    13/12/06 07:29:32 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.223.129:53517
    13/12/06 07:29:32 INFO spark.SparkEnv: Registering MapOutputTracker
    13/12/06 07:29:32 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-c0e7a5ff-6c44-4311-a45d-372abc6bcf51
    13/12/06 07:29:32 INFO server.Server: jetty-7.x.y-SNAPSHOT
    13/12/06 07:29:32 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:39780
    13/12/06 07:29:33 INFO server.Server: jetty-7.x.y-SNAPSHOT
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage/rdd,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/storage,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/stage,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages/pool,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/stages,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/environment,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/executors,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/metrics/json,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/static,null}
    13/12/06 07:29:33 INFO handler.ContextHandler: started o.e.j.s.h.ContextHandler{/,null}
    13/12/06 07:29:33 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
    13/12/06 07:29:33 INFO ui.SparkUI: Started Spark Web UI at http://hadoop:4040
    13/12/06 07:29:33 INFO spark.SparkContext: Added JAR /root/spark-0.8.0/examples/target/spark-examples_2.9.3-0.8.0-incubating.jar at http://192.168.223.129:39780/jars/spark-examples_2.9.3-0.8.0-incubating.jar with timestamp 1386343773274
    13/12/06 07:29:34 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:39
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:39) with 2 output partitions (allowLocal=false)
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Final stage: Stage 0 (reduce at SparkPi.scala:39)
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Parents of final stage: List()
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Missing parents: List()
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkPi.scala:35), which has no missing parents
    13/12/06 07:29:34 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (MappedRDD[1] at map at SparkPi.scala:35)
    13/12/06 07:29:34 INFO local.LocalTaskSetManager: Size of task 0 is 1442 bytes
    13/12/06 07:29:35 INFO local.LocalScheduler: Running 0
    13/12/06 07:29:36 INFO local.LocalScheduler: Fetching http://192.168.223.129:39780/jars/spark-examples_2.9.3-0.8.0-incubating.jar with timestamp 1386343773274
    13/12/06 07:29:37 INFO util.Utils: Fetching http://192.168.223.129:39780/jars/spark-examples_2.9.3-0.8.0-incubating.jar to /tmp/fetchFileTemp8568286284020440036.tmp
    13/12/06 07:29:39 INFO local.LocalScheduler: Adding file:/tmp/spark-06477ca4-b294-449a-8fca-03dad945ab80/spark-examples_2.9.3-0.8.0-incubating.jar to class loader
    13/12/06 07:29:39 INFO local.LocalScheduler: Finished 0
    13/12/06 07:29:39 INFO local.LocalTaskSetManager: Size of task 1 is 1442 bytes
    13/12/06 07:29:39 INFO scheduler.DAGScheduler: Completed ResultTask(0, 0)
    13/12/06 07:29:39 INFO local.LocalScheduler: Running 1
    13/12/06 07:29:39 INFO local.LocalScheduler: Finished 1
    13/12/06 07:29:39 INFO scheduler.DAGScheduler: Completed ResultTask(0, 1)
    13/12/06 07:29:39 INFO scheduler.DAGScheduler: Stage 0 (reduce at SparkPi.scala:39) finished in 5.021 s
    13/12/06 07:29:39 INFO spark.SparkContext: Job finished: reduce at SparkPi.scala:39, took 5.358120684 s
    Pi is roughly 3.13888
    13/12/06 07:29:39 INFO local.LocalScheduler: Remove TaskSet 0.0 from pool 
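
For reference, the core of the SparkPi example is a Monte Carlo estimate: it scatters random points over the unit square and counts how many fall inside the unit circle, which is why repeated runs print slightly different values such as the 3.13888 above. The following is a paraphrased sketch of that logic, not a verbatim copy of SparkPi.scala:

    import scala.math.random
    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "SparkPiSketch")
    val n = 100000                                   // number of random samples
    val inside = sc.parallelize(1 to n).map { _ =>
      val x = random * 2 - 1                         // random point in [-1, 1] x [-1, 1]
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0                // 1 if it lands inside the unit circle
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)     // circle area / square area = pi / 4

You can paste code like this into the interactive spark-shell that ships with the distribution (it pre-creates the SparkContext as sc), which is also the quickest way to try the kind of interactive analysis described in section 1.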


