Environment:
Scala 2.10.x
JDK 1.7.0_55
Spark 1.0.0
Maven 3.0.4
Step 1: Download the source code
Scala: http://www.scala-lang.org/download/2.10.3.html
Spark: http://www.apache.org/dyn/closer.cgi/spark/spark-1.0.0/spark-1.0.0.tgz
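Note that the Spark link above is the Apache mirror selector, not a direct download. A minimal fetch-and-unpack sketch, assuming you substitute an actual mirror host for <mirror>:
wget http://<mirror>/spark/spark-1.0.0/spark-1.0.0.tgz
tar -xzf spark-1.0.0.tgz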
Step 2: Install Java
Step 3: Install Scala
Make sure JAVA_HOME and SCALA_HOME are set and that their bin directories are on your PATH.
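As a minimal sketch, assuming the JDK and Scala were unpacked under /usr/local (adjust the paths to your actual install locations), add to ~/.bashrc:
export JAVA_HOME=/usr/local/jdk1.7.0_55    # assumed install path
export SCALA_HOME=/usr/local/scala-2.10.3  # assumed install path
export PATH=$JAVA_HOME/bin:$SCALA_HOME/bin:$PATH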
Step 4: Give Maven more memory, because the build consumes a lot of it:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
Go to the extracted source directory
and compile with the following command:
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
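The build takes quite a while. As a rough sanity check afterwards, the assembly jar should have been produced; the exact file name here is an assumption based on the versions above:
ls assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar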
Step 5 (alternative): the same build can also be done as follows.
Edit pom.xml in the extracted directory,
changing <hadoop.version>1.0.4</hadoop.version> to <hadoop.version>2.2.0</hadoop.version>,
then run the following command:
./make-distribution.sh --hadoop 2.2 --with-yarn --with-hive --with-tachyon --skip-java-test
The finished Spark build is under dist/, and a packaged tarball is also generated in the Spark directory.
Step 6:
Add Scala and Spark to your environment variables; if you prefer, you can also set them in conf/spark-env.sh.
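A minimal conf/spark-env.sh sketch; the paths, hostname, and memory size below are assumptions for illustration:
export JAVA_HOME=/usr/local/jdk1.7.0_55    # assumed install path
export SCALA_HOME=/usr/local/scala-2.10.3  # assumed install path
export SPARK_MASTER_IP=master              # assumed master hostname
export SPARK_WORKER_MEMORY=2g              # assumed per-worker memory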
Step 7: Edit conf/slaves
Write the hostnames of the slave machines in it, one per line.
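For example, assuming two slave machines named slave1 and slave2, conf/slaves would contain:
slave1
slave2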
Step 8: Start Spark
sbin/start-all.sh
Check whether the startup succeeded:
jps
On the master machine there should be a Master process.
On each slave machine there should be a Worker process.
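For illustration, the jps output should look roughly like this (the PIDs are arbitrary):
on the master:
12345 Master
12346 Jps
on each slave:
23456 Worker
23457 Jps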
Check that the cluster really works:
1. Run an example:
cd bin/
./run-example SparkPi
If an error occurs, see my other blog post: http://blog.csdn.net/robinsonmhj/article/details/32327251
2. Read a file from HDFS and run WordCount:
$ hdfs dfs -put README.md /test
$ MASTER=spark://master:7077 ./spark-shell
scala> val file = sc.textFile("hdfs://master:9000/test/README.md")
scala> val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_+_)
scala> count.collect()
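As a small follow-up in the same shell session, the counts can be sorted by frequency; this is just a sketch reusing the count RDD from above:
scala> count.map { case (word, n) => (n, word) }.sortByKey(false).take(10)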
References:
http://spark.apache.org/docs/latest/building-with-maven.html