1 Installing the Scala IDE
Download the Scala IDE from http://scala-ide.org/.
2 Installing Hadoop for Windows
Running Hadoop on Windows normally requires recompiling it. A prebuilt Hadoop 2.2.0 distribution is available here:
https://github.com/srccodes/hadoop-common-2.2.0-bin/archive/master.zip
This build is sufficient for running a Spark Driver program on Windows.
Download it, unpack it, and set the HADOOP_HOME environment variable to the Hadoop installation directory.
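Before launching the driver, it can help to verify that HADOOP_HOME actually points at a usable directory. A minimal sketch using only the JDK — the object name and check logic are my own, not from the original post:

```scala
import java.nio.file.{Files, Paths}

object HadoopHomeCheck {
  // Returns None when HADOOP_HOME looks usable, or an error message otherwise.
  def check(env: Map[String, String]): Option[String] =
    env.get("HADOOP_HOME") match {
      case None => Some("HADOOP_HOME is not set")
      case Some(home) =>
        // The prebuilt 2.2.0 distribution ships bin\winutils.exe,
        // which Hadoop's Windows shims load at runtime.
        if (Files.isRegularFile(Paths.get(home, "bin", "winutils.exe"))) None
        else Some(s"winutils.exe not found under $home")
    }

  def main(args: Array[String]): Unit =
    check(sys.env).foreach(msg => sys.error(msg))
}
```

Running this before the Spark job fails fast with a clear message instead of an obscure Hadoop error later.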
3 Machine list
SparkMaster 192.168.245.134 (Ubuntu)
SparkWorker1 192.168.245.135 (Ubuntu)
SparkWorker2 192.168.245.136 (Ubuntu)
Driver machine 192.168.245.1 (Windows 7)
4 Hosts file configuration
On every machine in the Spark cluster, add a mapping from the Driver machine's hostname to its IP in the hosts file. For example:
192.168.245.1 DriverHostname
where DriverHostname is the hostname of the machine running the Driver.
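To confirm the mappings work before starting a job, each hostname can be resolved from the driver machine. A small sketch using only the JDK — the helper is my own; the host list matches the machine list above:

```scala
import java.net.InetAddress
import scala.util.Try

object HostCheck {
  // Resolves a hostname, returning its address or an error message.
  def resolve(host: String): Either[String, String] =
    Try(InetAddress.getByName(host).getHostAddress)
      .toEither.left.map(_ => s"cannot resolve $host")

  def main(args: Array[String]): Unit =
    Seq("SparkMaster", "SparkWorker1", "SparkWorker2")
      .foreach(h => println(s"$h -> ${resolve(h)}"))
}
```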
5 Test code
Create a HelloSpark project in the Scala IDE, add the Spark 1.2.0 jars, and set the Scala version to 2.10.4.
Create a file HelloSpark.scala with the following code:
package spark.test

import org.apache.spark.{SparkConf, SparkContext}

object HelloSpark {
  def main(args: Array[String]): Unit = {
    // Connect from the Windows driver to the remote standalone master.
    val conf = new SparkConf()
      .setAppName("HelloSpark")
      .setMaster("spark://SparkMaster:7077")
    val sc = new SparkContext(conf)
    // Count the lines of a file stored on the cluster's HDFS.
    val file = sc.textFile("hdfs://SparkMaster:9000/sparksql/people.txt")
    println("Num lines: " + file.count)
    sc.stop()
  }
}
6 Run log
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/01/22 18:15:31 INFO SecurityManager: Changing view acls to: warren
15/01/22 18:15:31 INFO SecurityManager: Changing modify acls to: warren
15/01/22 18:15:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(warren); users with modify permissions: Set(warren)
15/01/22 18:15:32 INFO Slf4jLogger: Slf4jLogger started
15/01/22 18:15:32 INFO Remoting: Starting remoting
15/01/22 18:15:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@kylin:13514]
15/01/22 18:15:32 INFO Utils: Successfully started service 'sparkDriver' on port 13514.
15/01/22 18:15:32 INFO SparkEnv: Registering MapOutputTracker
15/01/22 18:15:32 INFO SparkEnv: Registering BlockManagerMaster
........
15/01/22 18:15:46 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 8002 ms on SparkWorker1 (1/2)
15/01/22 18:15:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 8532 ms on SparkWorker2 (2/2)
15/01/22 18:15:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/01/22 18:15:47 INFO DAGScheduler: Stage 0 (count at HelloSpark.scala:11) finished in 10.554 s
15/01/22 18:15:47 INFO DAGScheduler: Job 0 finished: count at HelloSpark.scala:11, took 10.705972 s
Num lines: 3
The key is to ensure that the Workers can reach akka.tcp://sparkDriver@kylin:13514.
Here kylin is the driver machine's hostname; kylin and its IP must be configured in the hosts files.
In a virtual-machine environment the Windows host has two virtual NICs, so the IP the Driver binds to at startup may differ from the one configured in the cluster's hosts files. In that case, pin kylin's IP in the Windows hosts file as well.
For example, add to the hosts file:
192.168.245.1 kylin
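Alternatively, the address the driver advertises can be pinned explicitly instead of relying on hostname resolution. A sketch, assuming Spark's standard spark.driver.host property and the driver IP from the machine list above:

```
# spark-defaults.conf style; can also be set in code via SparkConf.set
spark.driver.host   192.168.245.1
```

With this set, the workers connect back to the given IP directly, which sidesteps the two-NIC binding ambiguity described above.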