Setting up a Hadoop cluster
hadoop2.7.3 + spark1.6.1 + scala2.11.8 + jdk1.8.0_101
Download Hadoop 2.7 and edit the hadoop-env.sh file under $HADOOP_HOME/etc/hadoop:
export JAVA_HOME=/soft/jdk1.8.0_101
Edit core-site.xml (here the data directory is simply placed under $HADOOP_HOME):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.186.128:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/spark/hadoop-2.7.3/data</value>
  </property>
</configuration>
Edit hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
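A replication factor of 3 assumes at least three DataNodes. When first bringing HDFS up on a single machine (as the daemon commands below do), a single-node variant would set it to 1 so blocks are not reported as under-replicated; this alternative is a suggestion, not part of the original setup:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```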
Format the NameNode first:
$HADOOP_HOME/bin/hdfs namenode -format
Start the NameNode and DataNode (the scripts live under $HADOOP_HOME/sbin):
./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start datanode
Disable iptables (these commands apply to CentOS 6-style systems):
service iptables stop
chkconfig --level 35 iptables off
Change the hostname
# several ways to do it
hostname <new-hostname>
vim /etc/sysconfig/network
sysctl kernel.hostname=<new-hostname>
vim /etc/hosts
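Whichever method is used, every node should be able to resolve the other members' hostnames. A minimal /etc/hosts sketch, assuming the hostname vm128 that the YARN configuration below uses:

```
192.168.186.128 vm128
```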
Setting up Hadoop YARN
Edit yarn-env.sh:
JAVA=/soft/jdk1.8.0_101/bin/java
Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>vm128</value>
  </property>
</configuration>
Start the YARN daemons:
./yarn-daemon.sh start resourcemanager
./yarn-daemon.sh start nodemanager
Setting up Spark
Download Scala (the latest version is fine) and configure the Scala home:
JAVA_HOME=/soft/jdk1.8.0_101
SCALA_HOME=/root/spark/scala-2.11.8
PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:
export PATH USER LOGNAME MAIL HOSTNAME HISTSIZE HISTCONTROL JAVA_HOME SCALA_HOME
Edit spark-env.sh under $SPARK_HOME/conf:
export SCALA_HOME=/root/spark/scala-2.11.8
export JAVA_HOME=/soft/jdk1.8.0_101
export SPARK_MASTER_IP=192.168.186.128
export SPARK_WORKER_MEMORY=512M
export HADOOP_CONF_DIR=/root/spark/hadoop-2.7.3/etc/hadoop
Start the master and worker:
$SPARK_HOME/sbin/start-master.sh
./start-slave.sh spark://192.168.186.128:7077
jps output
NameNode and DataNode are the HDFS processes,
ResourceManager and NodeManager are the YARN processes,
Master and Worker are the Spark processes:
6368 Master
7666 Jps
6756 Worker
4343 DataNode
5052 NodeManager
4446 NameNode
4798 ResourceManager
Running a simple example
$SPARK_HOME/bin/spark-shell
First upload a file to HDFS:
$HADOOP_HOME/bin/hdfs dfs -mkdir /test
./hdfs dfs -put /root/spark/spark-2.0.0-bin-hadoop2.7/conf/spark-defaults.conf.template /test/xx
val textFile = sc.textFile("hdfs://192.168.186.128:9000/test/xx")
val line = textFile.filter(line => line.contains("spark"))
# transformations are lazy; calling count() triggers the actual computation
line.count()
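For intuition, the filter-plus-count logic can be checked locally in plain Python, with a few made-up stand-in lines in place of the uploaded file:

```python
# Hypothetical stand-ins for the lines of /test/xx.
lines = [
    "spark.master    spark://192.168.186.128:7077",
    "# a comment line without the keyword",
    "spark.executor.memory    512m",
]

# Equivalent of textFile.filter(line => line.contains("spark"))
matching = [line for line in lines if "spark" in line]

# Equivalent of line.count()
print(len(matching))  # 2
```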
# map, filter and collect
sc.parallelize(1 to 100).map(_*2).filter(_>50).filter(_<180).collect
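The chained transformations are easy to verify without a cluster; a plain-Python rendering of the same pipeline:

```python
# Mirrors sc.parallelize(1 to 100).map(_*2).filter(_>50).filter(_<180).collect:
# double 1..100, then keep values strictly between 50 and 180.
doubled = [x * 2 for x in range(1, 101)]
result = [x for x in doubled if x > 50 and x < 180]
print(result[0], result[-1], len(result))  # 52 178 64
```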
Web UI ports
# Hadoop UI
http://192.168.186.128:50070/dfshealth.html#tab-datanode
# YARN UI
http://192.168.186.128:8088/cluster/apps/RUNNING
# Spark master UI
http://192.168.186.128:8080/
# job monitoring UI on the driver once spark-shell is running
http://192.168.186.134:4040/
References