Deploying a Spark cluster on Amazon EC2

1. Launch a Spark cluster with Spark's bundled EC2 script

From the Spark installation directory, run the following command:

./ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>

Here <keypair> is the name of your EC2 key pair (the name you gave it when you created it), <key-file> is the private key file for that key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), <vpc-id> is the ID of your VPC, <subnet-id> is the ID of your subnet, and <cluster-name> is the name to give your cluster.

For example:

$ export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
$ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123

$ ./ec2/spark-ec2 --key-pair=spark_study --identity-file=/home/ubuntu/spark_study.pem --region=ap-northeast-1 --zone=ap-northeast-1a  launch my-spark-cluster

2. Build the Scala program locally

First, create the following directory structure:

./src

./src/main

./src/main/scala


Then create the file ./src/main/scala/SimpleApp.scala

with the following contents:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
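
The input path above is hard-coded. As an optional variation (a sketch of my own, not part of the original walkthrough), the path can also be read from the command line, so the same jar can later be pointed at a different file, such as an HDFS location, simply by appending an argument to the spark-submit command:

/* SimpleApp.scala, hypothetical variation: take the input path from the command line */
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    // Use the first program argument if one was given, otherwise fall back to the hard-coded path
    val logFile =
      if (args.nonEmpty) args(0)
      else "/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    sc.stop()
  }
}

Anything placed after the application jar on the spark-submit command line is passed to main as args.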

We use sbt to package the Scala program into a jar, so sbt needs to be installed first.

Install sbt with the following commands:

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt


Create the file ./simple.sbt

with the following contents:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"

Then run:

sbt package

You should see a jar generated under the target directory (for this build, target/scala-2.10/simple-project_2.10-1.0.jar); this is the package we will run on the cluster.


3. Upload the jar to the Spark master node (for example with scp, using the same .pem identity file).

4. Note the logFile path used in our program:

/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md

On the Spark cluster, this path is resolved against HDFS by default, so we first need to create a matching text file in HDFS.

First, log in to the master node (e.g. via ssh with the key pair's .pem file) and change into the HDFS installation directory:

cd ~/ephemeral-hdfs

Create a text file ~/README.md, then copy it into the /home/ubuntu/spark-1.6.0-bin-hadoop2.6/ directory on HDFS.

Copy it with the following commands:

bin/hadoop fs -mkdir /home/ubuntu/spark-1.6.0-bin-hadoop2.6/

bin/hadoop fs -put ~/README.md /home/ubuntu/spark-1.6.0-bin-hadoop2.6/

Then check that the copy succeeded:

bin/hadoop fs -ls /home/ubuntu/spark-1.6.0-bin-hadoop2.6/

You should see the README.md file listed; this is the file we copied to HDFS.
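
Alternatively, a program can name the HDFS location explicitly instead of relying on the cluster's default filesystem. A minimal sketch (my own assumption, reusing the path above; the hdfs:/// form resolves against the cluster's configured default namenode):

// Hypothetical sketch: address the file in HDFS explicitly rather than via the default filesystem
val logFile = "hdfs:///home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md"
val logData = sc.textFile(logFile, 2).cache()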


5. Submit and run the jar that was copied to the master node

Change into the Spark installation directory and run the following command:

./bin/spark-submit --class SimpleApp --master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077 --deploy-mode client /home/ec2-user/simple-project_2.10-1.0.jar

The output of the run looks like this:

16/02/07 12:55:31 INFO spark.SecurityManager: Changing view acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: Changing modify acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/02/07 12:55:32 INFO util.Utils: Successfully started service 'sparkDriver' on port 36092.
16/02/07 12:55:32 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/02/07 12:55:32 INFO Remoting: Starting remoting
16/02/07 12:55:33 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.31.12.26:33336]
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 33336.
16/02/07 12:55:33 INFO spark.SparkEnv: Registering MapOutputTracker
16/02/07 12:55:33 INFO spark.SparkEnv: Registering BlockManagerMaster
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/blockmgr-76f22bc5-a78a-4847-ab29-b7292f7a7cff
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/blockmgr-dd59300b-9590-41c3-b94a-d76a3c7fe8db
16/02/07 12:55:33 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/02/07 12:55:33 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/02/07 12:55:33 INFO ui.SparkUI: Started SparkUI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:33 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:33 INFO spark.HttpServer: Starting HTTP Server
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56378
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'HTTP file server' on port 56378.
16/02/07 12:55:33 INFO spark.SparkContext: Added JAR file:/home/ec2-user/simple-project_2.10-1.0.jar at http://172.31.12.26:56378/jars/simple-project_2.10-1.0.jar with timestamp 1454849733709
16/02/07 12:55:33 INFO client.AppClient$ClientEndpoint: Connecting to master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077...
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160207125534-0003
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor added: app-20160207125534-0003/0 on worker-20160207085403-172.31.2.135-39140 (172.31.2.135:39140) with 2 cores
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160207125534-0003/0 on hostPort 172.31.2.135:39140 with 2 cores, 6.0 GB RAM
16/02/07 12:55:34 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47128.
16/02/07 12:55:34 INFO netty.NettyBlockTransferService: Server created on 47128
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/02/07 12:55:34 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.31.12.26:47128 with 511.5 MB RAM, BlockManagerId(driver, 172.31.12.26, 47128)
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Registered BlockManager
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160207125534-0003/0 is now RUNNING
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 46.3 KB, free 46.3 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.4 KB, free 50.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.12.26:47128 (size: 4.4 KB, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/07 12:55:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/07 12:55:35 WARN snappy.LoadSnappy: Snappy native library not loaded
16/02/07 12:55:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/07 12:55:35 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:12
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 53.9 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1891.0 B, free 55.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.12.26:47128 (size: 1891.0 B, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/07 12:55:38 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-2-135.ap-northeast-1.compute.internal:38142) with ID 0
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-2-135.ap-northeast-1.compute.internal:46994 with 4.1 GB RAM, BlockManagerId(0, ip-172-31-2-135.ap-northeast-1.compute.internal, 46994)
16/02/07 12:55:38 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1891.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 4.4 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 3.1 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 2.9 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1477 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1502 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 3.923 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 4.132609 s
16/02/07 12:55:39 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:13
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 58.8 KB)
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1893.0 B, free 60.7 KB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.12.26:47128 (size: 1893.0 B, free: 511.5 MB)
16/02/07 12:55:39 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1893.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 65 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.078 s
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.100431 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
Lines with a: 32, Lines with b: 11
16/02/07 12:55:39 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/02/07 12:55:39 INFO ui.SparkUI: Stopped Spark web UI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/02/07 12:55:39 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/07 12:55:39 INFO storage.MemoryStore: MemoryStore cleared
16/02/07 12:55:39 INFO storage.BlockManager: BlockManager stopped
16/02/07 12:55:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/02/07 12:55:39 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/07 12:55:39 INFO spark.SparkContext: Successfully stopped SparkContext
16/02/07 12:55:39 INFO util.ShutdownHookManager: Shutdown hook called
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt2/spark/spark-8d522cb1-beec-4d21-9a8c-e9c1cf635ea7
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf


One open question: here the application is submitted from the master node in client deploy mode; submitting with --deploy-mode cluster fails with an error. A likely cause (still to be verified) is that in cluster mode the driver is launched on a worker node, so the application jar must be reachable from the workers (for example by placing it on HDFS) rather than only on the master's local filesystem. This still needs further investigation.