1前期准备
1.1软件
- jdk-8u191-linux-x64.tar.gz
- hadoop-3.1.1.tar.gz
- scala-2.12.8.gz
- spark-2.4.0-bin-hadoop2.7.gz
1.2集群规划
IP | Hostname | 安装内容 |
192.168.56.11 | spark1 | Hadoop\scala\spark |
192.168.56.12 | Spark2 | Hadoop\scala\spark |
192.168.56.13 | Spark3 | Hadoop\scala\spark |
2安装步骤
2.1上传软件(spark1)
将JDK、hadoop、scala、spark等软件上传至spark1服务器
2.2修改hosts文件(spark1\spark2\spark3)
vi /etc/hosts
192.168.56.11 spark1
192.168.56.12 spark2
192.168.56.13 spark3
2.3 ssh互信(免密码登录)
在三台机器上执行以下命令:
mkdir ~/.ssh
chmod -R 700 ~/.ssh
cd .ssh
ssh-keygen -t rsa
分别将其他两台机器的密钥文件拷贝至spark1
1)在spark2上执行:
scp id_rsa.pub root@spark1:~/.ssh/id_rsa.pub.spark2
2)在spark3上执行:
scp id_rsa.pub root@spark1:~/.ssh/id_rsa.pub.spark3
3)在spark1上合并密钥文件
cat id_rsa.pub >> authorized_keys
cat id_rsa.pub.spark2 >> authorized_keys
cat id_rsa.pub.spark3 >> authorized_keys
chmod 600 ~/.ssh/authorized_keys
4)将合并后的密钥文件分发至其他两台机器(spark2\spark3)
scp ~/.ssh/authorized_keys root@spark2:~/.ssh/authorized_keys
scp ~/.ssh/authorized_keys root@spark3:~/.ssh/authorized_keys
5)在三台机器上分别进行ssh测试
ssh spark1
ssh spark2
ssh spark3
2.4解压相关软件(spark1)
cd /opt
tar -xvf jdk-8u191-linux-x64.tar.gz
tar -xvf hadoop-3.1.1.tar.gz
tar -xvf scala-2.12.8.tgz
tar -xvf spark-2.4.0-bin-hadoop2.7.tgz
2.5分发基础环境
将jdk和scala分发至其他两台机器(spark2\spark3)
scp -r /opt/jdk1.8.0_191 root@spark2:/opt/jdk1.8.0_191
scp -r /opt/jdk1.8.0_191 root@spark3:/opt/jdk1.8.0_191
scp -r /opt/scala-2.12.8 root@spark2:/opt/scala-2.12.8
scp -r /opt/scala-2.12.8 root@spark3:/opt/scala-2.12.8
2.6修改配置文件
在三台机器上修改配置文件,并使配置文件生效
vi /etc/profile
export JAVA_HOME=/opt/jdk1.8.0_191
export HADOOP_HOME=/opt/hadoop-3.1.1
export SCALA_HOME=/opt/scala-2.12.8
export SPARK_HOME=/opt/spark-2.4.0-bin-hadoop2.7/
export PATH=$SPARK_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
执行source /etc/profile,使配置文件生效
接下来,测试基础环境是否配置无误。
java -version
scala –version
2.7 Hadoop完全分布式搭建
2.7.1创建hadoop相关目录
在spark1上创建hadoop相关目录
mkdir -p /opt/hadoop-3.1.1/hadoop/tmp
mkdir -p /opt/hadoop-3.1.1/hadoop/data
mkdir -p /opt/hadoop-3.1.1/hadoop/name
2.7.2修改相应的配置文件
以下内容在spark1上执行
2.7.2.1 修改hadoop-env.sh文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/hadoop-env.sh
头部添加以下内容:
export JAVA_HOME=/opt/jdk1.8.0_191
export HADOOP_PREFIX=/opt/hadoop-3.1.1
2.7.2.2修改yarn-env.sh文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/yarn-env.sh
头部添加以下内容:
export JAVA_HOME=/opt/jdk1.8.0_191
2.7.2.3修改core-site.xml文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://spark1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop-3.1.1/hadoop/tmp</value>
</property>
</configuration>
2.7.2.4修改hdfs-site.xml文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>spark1:50070</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop-3.1.1/hadoop/name</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop-3.1.1/hadoop/data</value>
</property>
</configuration>
2.7.2.5修改mapred-site.xml文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/</value>
</property>
</configuration>
注意此处标红的部分,该部分在测试hadoop的mapreduce时,需要制定该内容,否则会报以下错误
2.7.2.6修改workers文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/workers
spark1
spark2
spark3
2.7.2.7修改yarn-site.xml文件
命令:vi /opt/hadoop-3.1.1/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>spark1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>spark1:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>spark1:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>spark1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>spark1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>spark1:8088</value>
</property>
</configuration>
2.7.2.8修改start-dfs.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/start-dfs.sh
头部添加以下内容:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
2.7.2.9修改stop-dfs.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/ stop-dfs.sh
头部添加以下内容:
HDFS_DATANODE_USER=root
HADOOP_SECURE_DN_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
2.7.2.10修改start-yarn.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/start-yarn.sh
头部添加以下内容:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
2.7.2.11修改stop-yarn.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/stop-yarn.sh
头部添加以下内容:
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
2.7.2.12修改start-all.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/start-all.sh
头部添加以下内容:
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
2.7.2.13修改stop-all.sh文件
命令:vi /opt/hadoop-3.1.1/sbin/stop-all.sh
头部添加以下内容:
HDFS_DATANODE_USER=root
HDFS_DATANODE_SECURE_USER=hdfs
HDFS_NAMENODE_USER=root
HDFS_SECONDARYNAMENODE_USER=root
YARN_RESOURCEMANAGER_USER=root
HADOOP_SECURE_DN_USER=yarn
YARN_NODEMANAGER_USER=root
2.7.3分发hadoop 文件
将配置好的hadoop文件夹分发至spark2和spark3
scp -r /opt/hadoop-3.1.1 root@spark2:/opt/hadoop-3.1.1
scp -r /opt/hadoop-3.1.1 root@spark3:/opt/hadoop-3.1.1
2.7.4格式化hadoop集群
在spark1上执行格式化命令:
/opt/hadoop-3.1.1/bin/hdfs namenode –format
2.7.5启动hadoop集群
在spark1上启动hadoop集群:
/opt/hadoop-3.1.1/sbin/start-all.sh
验证hadoop集群
在三台机器上分别执行jps
在spark1上执行jps:
在spark2上执行jps:
在spark3上执行jps:
访问http://192.168.56.11:8088
至此,hadoop分布式集群已经搭建完成。
2.8 Spark完全分布式搭建
2.8.1修改相应的配置文件
2.8.1.1修改spark-env.sh文件
命令:
cp /opt/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh.template /opt/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh
vi /opt/spark-2.4.0-bin-hadoop2.7/conf/spark-env.sh
头部添加以下内容:
export SCALA_HOME=/opt/scala-2.12.8
export JAVA_HOME=/opt/jdk1.8.0_191
export SPARK_MASTER_IP=spark1
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/opt/hadoop-3.1.1/etc/hadoop
2.8.1.2修改slaves文件
命令:
cp /opt/spark-2.4.0-bin-hadoop2.7/conf/slaves.template /opt/spark-2.4.0-bin-hadoop2.7/conf/slaves
vi /opt/spark-2.4.0-bin-hadoop2.7/conf/slaves
spark1
spark2
spark3
2.8.2分发spark文件夹
将修改好的spark文件夹分发至其他两台机器(spark2\spark3)
scp -r /opt/spark-2.4.0-bin-hadoop2.7 root@spark2:/opt/spark-2.4.0-bin-hadoop2.7
scp -r /opt/spark-2.4.0-bin-hadoop2.7 root@spark3:/opt/spark-2.4.0-bin-hadoop2.7
2.8.3启动spark集群
在spark1上执行启动集群的命令。
/opt/spark-2.4.0-bin-hadoop2.7/sbin/start-all.sh
验证spark集群,在三台机器上分别执行jps
在spark1上执行
在sark2上执行
在spark3上执行
3测试集群
这里采用最简单最常用的Wordcount来测试,测试的源文件/opt/wordcount.txt的内容为:
Hello hadoop
hello spark
hello bigdata
3.1测试hadoop集群
3.1.1上传文件至hdfs
执行下列命令:
hadoop fs -mkdir -p /Hadoop/Input
hadoop fs -put /opt/wordcount.txt /Hadoop/Input
将文件上传至hdfs
3.1.2执行mapreduce
命令:
hadoop jar /opt/hadoop-3.1.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /Hadoop/Input /Hadoop/Output
在执行job的过程中,我们可以查看http://192.168.56.11:8088/cluster的状态
执行成功之后
在spark1上查看mapreduce执行过程
在spark1上查看结果
hadoop fs -cat /Hadoop/Output/*
hadoop集群搭建成功!
3.2测试spark集群
我们使用spark-shell,做一个简单的worcount测试,直接使用已经在hdfs上存储了测试的源文件。
3.2.1启动spark-shell
命令:spark-shell
执行以下命令:
val file=sc.textFile("hdfs://spark1:9000/Hadoop/Input/wordcount.txt")
val rdd = file.flatMap(line => line.split(" ")).map(word => (word,1)).reduceByKey(_+_)
rdd.collect()
rdd.foreach(println)
运行后的结果和mapreduce是一样的,spark集群也搭建成功!