Installing Spark 1.0 on Hadoop 2.2.0
Basics
Server OS: Ubuntu 12.04
JDK
Download
tar -zxvf ... -C /usr/local
ln -s /usr/local/jdk1.7.0_60 /usr/local/jvm
Set environment variables
export JAVA_HOME=/usr/local/jvm
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=$JAVA_HOME/bin:$PATH
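Before moving on, it is worth checking that the JDK paths line up. A minimal sketch, assuming the /usr/local/jvm symlink created above (the helper function and the fallback path are this sketch's own additions):

```shell
# Sanity-check the JDK wiring created above (/usr/local/jvm symlink).
java_home_ok() {
    # True if the given JAVA_HOME contains an executable java binary.
    [ -x "$1/bin/java" ]
}

JAVA_HOME=${JAVA_HOME:-/usr/local/jvm}
if java_home_ok "$JAVA_HOME"; then
    "$JAVA_HOME/bin/java" -version
else
    echo "no java under $JAVA_HOME -- check the symlink and JAVA_HOME"
fi
```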
Scala
Download
wget http://www.scala-lang.org/files/archive/scala-2.10.3.tgz
tar -zxf scala-2.10.3.tgz -C /usr/local/
ln -s /usr/local/scala-2.10.3 /usr/local/scala
Set environment variables
export SCALA_HOME=/usr/local/scala
export PATH=$SCALA_HOME/bin:$PATH
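Spark 1.0 is built against Scala 2.10, so a quick check that the symlink points at a 2.10.x release can save a failed build later. A minimal sketch (the helper function is this sketch's own addition; it inspects the symlink target rather than launching the JVM):

```shell
# Verify the Scala on PATH is the 2.10 line that Spark 1.0 is built against.
is_scala_210() {
    case "$1" in *2.10*) return 0 ;; *) return 1 ;; esac
}

target=$(readlink -f /usr/local/scala 2>/dev/null || true)
if is_scala_210 "$target"; then
    echo "ok: /usr/local/scala -> $target"
else
    echo "warning: expected a scala-2.10.x target, got: $target"
fi
```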
Maven
Download
wget http://mirror.bit.edu.cn/apache/maven/maven-3/3.1.1/binaries/apache-maven-3.1.1-bin.tar.gz
tar xvzf apache-maven-3.1.1-bin.tar.gz -C /usr/local
ln -s /usr/local/apache-maven-3.1.1 /usr/local/maven
Set environment variables
export MAVEN_HOME=/usr/local/maven
export PATH=$MAVEN_HOME/bin:$PATH

Installing protobuf (master node)
Install dependency packages
apt-get install g++ autoconf automake libtool cmake zlib1g-dev pkg-config libssl-dev make

Download and install
wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xvzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr/local/protobuf
make && make install
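Hadoop 2.2.0's build expects protoc 2.5.0 exactly, so it pays to verify the freshly installed binary before building Hadoop. A sketch, assuming the --prefix used above (the helper function is this sketch's own addition):

```shell
# Hadoop 2.2.0 requires protoc 2.5.0; verify the freshly built binary.
check_protoc() {
    # $1 = output of `protoc --version`, e.g. "libprotoc 2.5.0"
    case "$1" in "libprotoc 2.5.0") return 0 ;; *) return 1 ;; esac
}

PROTOC=/usr/local/protobuf/bin/protoc
if [ -x "$PROTOC" ]; then
    ver=$("$PROTOC" --version)
    check_protoc "$ver" && echo "ok: $ver" || echo "warning: got $ver, need 2.5.0"
else
    echo "warning: $PROTOC not found; did make install succeed?"
fi
```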
Set environment variables
export PATH=/usr/local/protobuf/bin:$PATH

Installing Hadoop 2.2.0
Background
http://blog.changecong.com/2013/10/ubuntu-%E7%BC%96%E8%AF%91%E5%AE%89%E8%A3%85-hadoop-2-2-0/
http://blog.csdn.net/licongcong_0224/article/details/12972889

Create the hadoop user and enable passwordless SSH login
Namenode
useradd -s /bin/bash -m hadoop
mkdir -p /home/hadoop/.ssh
chown hadoop.hadoop /home/hadoop/.ssh
su - hadoop
ssh-keygen -t rsa
# If the master node also acts as a datanode, run:
cat id_rsa.pub > authorized_keys

Datanode
useradd -s /bin/bash -m hadoop
mkdir -p /home/hadoop/.ssh
chown hadoop.hadoop /home/hadoop/.ssh
echo "<contents of the master node's id_rsa.pub>" > id_rsa.pub
cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys

Disable the firewall on every node
ufw disable

Hadoop installation problems
[ERROR] class file for org.mortbay.component.AbstractLifeCycle not found
Fix: edit hadoop-common-project/hadoop-auth/pom.xml and add the dependency:
<dependency>
  <groupId>org.mortbay.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <scope>test</scope>
</dependency>

Also check /etc/hosts: the entry 192.168.137.100 must map to namenode.
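The hosts mapping mentioned above matters on every node. A sketch of the needed entries (192.168.137.100 is the namenode IP used throughout this guide; the datanode1 address is an assumed example, substitute your real node IPs):

```shell
# Hostname mappings every node needs in /etc/hosts.
# 192.168.137.100 is this guide's namenode IP; the datanode1 line
# is a placeholder example -- use your own addresses.
entries='192.168.137.100 namenode
192.168.137.101 datanode1'
printf '%s\n' "$entries"   # review, then append to /etc/hosts as root
```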
Installing Spark 1.0
Download the Spark 1.0 source
Go to the download page: http://spark.apache.org/downloads.html
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.0.0.tgz

Configuration
Unpack and set paths
sudo tar zxf spark-1.0.0.tgz
sudo cp -rf spark-1.0.0 /usr/local/
cd /usr/local
sudo ln -s spark-1.0.0/ spark

Set variables
Edit /etc/profile (with sudo) and add:
export SPARK_HOME=/usr/local/spark
export SCALA_HOME=/usr/local/scala
export PATH=$SCALA_HOME/bin:$PATH
source /etc/profile

Dependencies
apt-get install unzip

Build Spark
cd $SPARK_HOME
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
#mvn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -Pnew-yarn -DskipTests package
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package
#sudo ./sbt/sbt assembly
SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true ./sbt/sbt assembly

Distribute the built files to each data node
1. ./bin/
2. ./sbin/
3. ./assembly/...
4. ./conf/
scp -r spark/sbin hadoop@datanode1.ejushang.com:/home/hadoop/spark
scp -r spark/bin hadoop@datanode1.ejushang.com:/home/hadoop/spark
scp -r spark/conf hadoop@datanode1.ejushang.com:/home/hadoop/spark
scp -r spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar hadoop@datanode1.ejushang.com:/home/hadoop/spark/assembly/target/scala-2.10

Run a test
After the build succeeds, two jars are generated under the spark directory:
1. examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar
2. assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar

Configure the run scripts
1. sudo cp conf/spark-env.sh.template conf/spark-env.sh
and add the following:
export JAVA_HOME=/usr/local/jvm
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/home/hadoop/hadoop
2. sudo cp conf/log4j.properties.template conf/log4j.properties
3. cd $SPARK_HOME
vi conf/slaves
Add the data node: datanode1

Configure the Spark path
vi /etc/profile
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
source /etc/profile

Run tests
Local mode
Standalone cluster mode
With HDFS
./bin/spark-shell
val file = sc.textFile("hdfs://namenode:9000/user/spark/hdfs.cmd")
val count = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
count.collect()

YARN mode
In the Spark directory, run mkdir test && cd test
vi run_spark_shell.sh
SPARK_JAR=../assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop2.2.0.jar \
spark-class org.apache.spark.deploy.yarn.Client \
--jar ../examples/target/scala-2.10/spark-examples-1.0.0-hadoop2.2.0.jar \
--class org.apache.spark.examples.SparkPi \
--num-workers 3 \
--master-memory 1g \
--worker-memory 1g \
--worker-cores 1

Monitoring run status
When ./bin/spark-shell starts, it prints a log line like:
14/06/23 14:36:43 INFO SparkUI: Started SparkUI at http://namenode:4040
You can then visit: http://192.168.137.100:4040/

Installing Shark
Adding a DataNode
Set up passwordless login
Update the hostname and the hadoop/slaves and spark/slaves files
Sync the hadoop files
Sync the spark files
Sync /etc/profile
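The sync steps above can be sketched as one small loop. The hostname and paths below follow this guide's conventions but are assumptions; the loop only prints the scp commands so they can be reviewed before running (note that /etc/profile would actually need root on the remote side):

```shell
# Dry-run sketch: print the scp commands that would push the hadoop tree,
# the spark runtime files, and the profile to a new data node.
# NEW_NODE and the paths are assumptions based on this guide -- adjust them.
NEW_NODE=datanode2.ejushang.com

sync_to_node() {
    node=$1; shift
    for path in "$@"; do
        echo "scp -r $path hadoop@$node:$path"
        # Uncomment to actually copy:
        # scp -r "$path" "hadoop@$node:$path"
    done
}

sync_to_node "$NEW_NODE" /home/hadoop/hadoop /usr/local/spark /etc/profile
```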
Installation problems
The spark directory must be owned by the hadoop user
chown hadoop.hadoop /usr/local/spark

SCALA_HOME not set
When building Spark under the spark account, SCALA_HOME was set but this error still appeared; switching to the root account made it go away.

JAVA_HOME is not set
Add export JAVA_HOME=/usr/local/jvm to conf/spark-env.sh.

wrap: java.lang.reflect.InvocationTargetException: PermGen space
This error means the JVM ran out of PermGen space: the build was given too little. Even with the officially recommended export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" the error still occurred; raising MaxPermSize to 1g fixed it.

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform
WARN TaskSchedulerImpl: Initial job has not accepted any resources
Caused by insufficient worker memory; adjust the number of workers per datanode and the memory per worker.

Appendix
1. Spark on YARN environment setup: http://sofar.blog.51cto.com/353572/1352713