1. Launch a Spark cluster with the spark-ec2 script bundled with Spark
From the Spark installation directory, run the following command:
$ ./ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> --vpc-id=<vpc-id> --subnet-id=<subnet-id> launch <cluster-name>
where <keypair> is the name of your EC2 key pair (the name you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), <vpc-id> is the ID of your VPC, <subnet-id> is the ID of your subnet, and <cluster-name> is the name to give to your cluster.
For example:
$ export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
$ export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
$ ./ec2/spark-ec2 --key-pair=spark_study --identity-file=/home/ubuntu/spark_study.pem --region=ap-northeast-1 --zone=ap-northeast-1a launch my-spark-cluster
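The same script also manages the cluster after launch. For example, reusing the arguments above (subcommands as documented by spark-ec2):

```
# Log in to the master node of a running cluster
./ec2/spark-ec2 -k spark_study -i /home/ubuntu/spark_study.pem --region=ap-northeast-1 login my-spark-cluster
# Tear the cluster down when you are done (instances are terminated)
./ec2/spark-ec2 --region=ap-northeast-1 destroy my-spark-cluster
```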
2. Create the Scala program locally
First create the following directory layout:
./src
./src/main
./src/main/scala
Then create the file ./src/main/scala/SimpleApp.scala with the following contents:
/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
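Before involving the cluster, the two counts can be sanity-checked with plain grep on any text file; the sample file below is a hypothetical three-line stand-in for README.md:

```shell
# Create a tiny stand-in for README.md (hypothetical contents)
printf 'apache spark\nbig data\nhello world\n' > sample.md
# These mirror the two filter/count jobs in SimpleApp.scala
grep -c a sample.md   # number of lines containing "a"
grep -c b sample.md   # number of lines containing "b"
```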
We use sbt to build the Scala jar, so sbt must be installed first. Install it with the following commands:
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 642AC823
sudo apt-get update
sudo apt-get install sbt
Create the file ./simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.3"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
Then run
sbt package
You should see a jar generated under the target directory (target/scala-2.10/simple-project_2.10-1.0.jar); this is the package we can run on the cluster.
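To confirm the class actually made it into the package, you can list the jar's contents (the jar tool ships with the JDK):

```
# SimpleApp.class should appear in the listing
jar tf target/scala-2.10/simple-project_2.10-1.0.jar | grep SimpleApp
```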
3. Upload the jar to the Spark master node.
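For example, a hypothetical scp upload reusing the key pair and master hostname from the earlier steps (the remote user and destination path are examples; spark-ec2 clusters log in as root):

```
# Hypothetical: adjust key path, hostname and destination to your setup
scp -i /home/ubuntu/spark_study.pem \
    target/scala-2.10/simple-project_2.10-1.0.jar \
    root@ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:/home/ec2-user/
```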
4. Note the logFile path inside our program:
/home/ubuntu/spark-1.6.0-bin-hadoop2.6/README.md
On the Spark cluster, this path is resolved against HDFS by default, so we need to create a text file at that path in HDFS first.
Log in to the master node, then change into the HDFS installation directory:
cd ~/ephemeral-hdfs
Create a text file ~/README.md, then copy it into the HDFS directory
/home/ubuntu/spark-1.6.0-bin-hadoop2.6/
Copy it with the following commands:
bin/hadoop fs -mkdir /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
bin/hadoop fs -put ~/README.md /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
Then verify the copy with:
bin/hadoop fs -ls /home/ubuntu/spark-1.6.0-bin-hadoop2.6/
You should now see the README.md file; this is the file we copied to HDFS.
5. Submit the jar that was uploaded to the master node
Change into the Spark installation directory, then run the following command:
./bin/spark-submit --class SimpleApp --master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077 --deploy-mode client /home/ec2-user/simple-project_2.10-1.0.jar
The run produces output like this:
16/02/07 12:55:31 INFO spark.SecurityManager: Changing view acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: Changing modify acls to: root
16/02/07 12:55:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/02/07 12:55:32 INFO util.Utils: Successfully started service 'sparkDriver' on port 36092.
16/02/07 12:55:32 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/02/07 12:55:32 INFO Remoting: Starting remoting
16/02/07 12:55:33 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@172.31.12.26:33336]
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 33336.
16/02/07 12:55:33 INFO spark.SparkEnv: Registering MapOutputTracker
16/02/07 12:55:33 INFO spark.SparkEnv: Registering BlockManagerMaster
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt/spark/blockmgr-76f22bc5-a78a-4847-ab29-b7292f7a7cff
16/02/07 12:55:33 INFO storage.DiskBlockManager: Created local directory at /mnt2/spark/blockmgr-dd59300b-9590-41c3-b94a-d76a3c7fe8db
16/02/07 12:55:33 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/02/07 12:55:33 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/02/07 12:55:33 INFO ui.SparkUI: Started SparkUI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:33 INFO spark.HttpFileServer: HTTP File server directory is /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:33 INFO spark.HttpServer: Starting HTTP Server
16/02/07 12:55:33 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/02/07 12:55:33 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56378
16/02/07 12:55:33 INFO util.Utils: Successfully started service 'HTTP file server' on port 56378.
16/02/07 12:55:33 INFO spark.SparkContext: Added JAR file:/home/ec2-user/simple-project_2.10-1.0.jar at http://172.31.12.26:56378/jars/simple-project_2.10-1.0.jar with timestamp 1454849733709
16/02/07 12:55:33 INFO client.AppClient$ClientEndpoint: Connecting to master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077...
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160207125534-0003
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor added: app-20160207125534-0003/0 on worker-20160207085403-172.31.2.135-39140 (172.31.2.135:39140) with 2 cores
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20160207125534-0003/0 on hostPort 172.31.2.135:39140 with 2 cores, 6.0 GB RAM
16/02/07 12:55:34 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47128.
16/02/07 12:55:34 INFO netty.NettyBlockTransferService: Server created on 47128
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/02/07 12:55:34 INFO storage.BlockManagerMasterEndpoint: Registering block manager 172.31.12.26:47128 with 511.5 MB RAM, BlockManagerId(driver, 172.31.12.26, 47128)
16/02/07 12:55:34 INFO storage.BlockManagerMaster: Registered BlockManager
16/02/07 12:55:34 INFO client.AppClient$ClientEndpoint: Executor updated: app-20160207125534-0003/0 is now RUNNING
16/02/07 12:55:34 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 46.3 KB, free 46.3 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.4 KB, free 50.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.31.12.26:47128 (size: 4.4 KB, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
16/02/07 12:55:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/07 12:55:35 WARN snappy.LoadSnappy: Snappy native library not loaded
16/02/07 12:55:35 INFO mapred.FileInputFormat: Total input paths to process : 1
16/02/07 12:55:35 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:12
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (count at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.1 KB, free 53.9 KB)
16/02/07 12:55:35 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1891.0 B, free 55.7 KB)
16/02/07 12:55:35 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.31.12.26:47128 (size: 1891.0 B, free: 511.5 MB)
16/02/07 12:55:35 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:35 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
16/02/07 12:55:35 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/02/07 12:55:38 INFO cluster.SparkDeploySchedulerBackend: Registered executor NettyRpcEndpointRef(null) (ip-172-31-2-135.ap-northeast-1.compute.internal:38142) with ID 0
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,NODE_LOCAL, 2286 bytes)
16/02/07 12:55:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-172-31-2-135.ap-northeast-1.compute.internal:46994 with 4.1 GB RAM, BlockManagerId(0, ip-172-31-2-135.ap-northeast-1.compute.internal, 46994)
16/02/07 12:55:38 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1891.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 4.4 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 3.1 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added rdd_1_1 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 2.9 KB, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1477 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1502 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 0 (count at SimpleApp.scala:12) finished in 3.923 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 4.132609 s
16/02/07 12:55:39 INFO spark.SparkContext: Starting job: count at SimpleApp.scala:13
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (count at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Missing parents: List()
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.1 KB, free 58.8 KB)
16/02/07 12:55:39 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1893.0 B, free 60.7 KB)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 172.31.12.26:47128 (size: 1893.0 B, free: 511.5 MB)
16/02/07 12:55:39 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 0,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, ip-172-31-2-135.ap-northeast-1.compute.internal, partition 1,PROCESS_LOCAL, 2286 bytes)
16/02/07 12:55:39 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-2-135.ap-northeast-1.compute.internal:46994 (size: 1893.0 B, free: 4.1 GB)
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 65 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (1/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: ResultStage 1 (count at SimpleApp.scala:13) finished in 0.078 s
16/02/07 12:55:39 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 79 ms on ip-172-31-2-135.ap-northeast-1.compute.internal (2/2)
16/02/07 12:55:39 INFO scheduler.DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.100431 s
16/02/07 12:55:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
Lines with a: 32, Lines with b: 11
16/02/07 12:55:39 INFO spark.SparkContext: Invoking stop() from shutdown hook
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/02/07 12:55:39 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/02/07 12:55:39 INFO ui.SparkUI: Stopped Spark web UI at http://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:4040
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/02/07 12:55:39 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/02/07 12:55:39 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/02/07 12:55:39 INFO storage.MemoryStore: MemoryStore cleared
16/02/07 12:55:39 INFO storage.BlockManager: BlockManager stopped
16/02/07 12:55:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/02/07 12:55:39 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/02/07 12:55:39 INFO spark.SparkContext: Successfully stopped SparkContext
16/02/07 12:55:39 INFO util.ShutdownHookManager: Shutdown hook called
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf/httpd-77ec3777-a82b-4ef3-9fca-5a4c790a5747
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt2/spark/spark-8d522cb1-beec-4d21-9a8c-e9c1cf635ea7
16/02/07 12:55:39 INFO util.ShutdownHookManager: Deleting directory /mnt/spark/spark-6ce5d4ee-c69e-40d1-9053-5d613461f9bf
An open question: the program is currently submitted from the master node in client deploy mode; submitting in cluster deploy mode fails with an error. This needs further investigation.
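A likely direction for the cluster-mode failure, per the Spark documentation on submitting applications: in cluster deploy mode the driver runs on a worker node, so the application jar must be globally visible to the whole cluster (for example an hdfs:// URL, or a local path present on every node). A hedged sketch of that approach:

```
# Put the jar somewhere every node can read it (paths are examples)
~/ephemeral-hdfs/bin/hadoop fs -mkdir /jars
~/ephemeral-hdfs/bin/hadoop fs -put /home/ec2-user/simple-project_2.10-1.0.jar /jars/
# Resubmit, pointing spark-submit at the HDFS copy
./bin/spark-submit --class SimpleApp \
  --master spark://ec2-52-192-126-225.ap-northeast-1.compute.amazonaws.com:7077 \
  --deploy-mode cluster \
  hdfs:///jars/simple-project_2.10-1.0.jar
```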