1. Background
The previous post covered installing Ubuntu 16.04 and setting up Java, SSH, and vim.
The JAVA_HOME from that post is /usr/lib/jvm/java-8-openjdk-amd64.
This post is about setting up the Hadoop platform itself (hadoop-2.8.5).
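Before continuing, it may be worth a quick check that the Java setup from the previous post is still in place (a minimal sketch, assuming the OpenJDK 8 path above):
$ java -version
$ echo $JAVA_HOME    # should print /usr/lib/jvm/java-8-openjdk-amd64 if it was exported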
2. Preparing the files
The Hadoop I am using is the official binary release. It is a stable release, but unexpected bugs are still possible, so the project recommends not running the pre-built binaries in production. My focus is on learning the Hadoop platform and the other components in its ecosystem, and I wanted to use Hive 3.1 and Spark 2.3, so this release is good enough for me.
Update, 2019-03-20: after several failed attempts and a look at the official sites, it turns out that Spark 2.3.3 does not yet support Hadoop 3.0 or the Hive 3.1 metastore, so Spark cannot read Hive tables with that combination; support is expected in Spark 3.0 (see https://issues.apache.org/jira/browse/SPARK-24360?filter=-4&jql=project%20%3D%20SPARK%20AND%20text%20~%20%22hive%20metastore%22%20order%20by%20created%20DESC). I therefore dropped Hadoop and Hive back to the 2.x line while keeping Spark 2.3.3, and revised this post accordingly; the basic configuration stays the same.
https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
After the download finishes, extract it. I extract everything into /usr/local/, where I install all of my components; pick a location that suits your own habits.
Before extracting there, change the owner of /usr/local/ to the current user:
$ sudo chown -R hadoop:hadoop /usr/local/
Then extract into /usr/local/:
$ tar zxvf hadoop-2.8.5.tar.gz -C /usr/local/
The configuration files that need to be edited live under /usr/local/hadoop-2.8.5/etc/hadoop:
- core-site.xml
- hadoop-env.sh
- hdfs-site.xml
- mapred-site.xml
- master
- slaves
- yarn-site.xml
My configuration files are reproduced below for reference, with comments on some of the parameters.
core-site.xml (master is the alias of the master node in my hosts file)
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- NameNode (HDFS) address; fs.default.name is the deprecated alias of fs.defaultFS -->
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>
<!-- Base directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/tmp</value>
</property>
<property>
<name>io.file.buffer.size</name>
<value>131072</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.httpfs.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.root.groups</name>
<value>*</value>
</property>
</configuration>
hadoop-env.sh needs JAVA_HOME set explicitly; add the following to the file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin
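A quick way to confirm that hadoop-env.sh picks up JAVA_HOME is to ask Hadoop for its version using the full path (no environment variables are needed yet):
$ /usr/local/hadoop-2.8.5/bin/hadoop version    # should report Hadoop 2.8.5 without complaining about JAVA_HOME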
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- HDFS replication factor -->
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<!-- Disable HDFS permission checking -->
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<!-- Secondary NameNode web UI address -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>master:50090</value>
</property>
<!-- NameNode web UI address -->
<property>
<name>dfs.namenode.http-address</name>
<value>master:50070</value>
</property>
<!-- Enable WebHDFS -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<!-- User/group used by the HDFS web interfaces -->
<property>
<name>dfs.web.ugi</name>
<value>supergroup</value>
</property>
<!-- Where the NameNode keeps its metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/namenode</value>
</property>
<!-- DataNode directory. It is a good idea to number the final path per node,
e.g. node 1: file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode1
     node 2: file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode2
Here master also acts as a DataNode.
-->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode</value>
</property>
<!-- Where the edit logs are stored -->
<property>
<name>dfs.namenode.edits.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/edits</value>
</property>
<!-- Directory where the Secondary NameNode stores checkpoint files -->
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints</value>
</property>
<!-- Directory where the Secondary NameNode stores checkpoint edits -->
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints/edits</value>
</property>
</configuration>
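The data directories referenced in core-site.xml and hdfs-site.xml above do not all exist yet. HDFS can create some of them itself, but I prefer to create them up front on every node so ownership and permissions are under control (paths taken from the configs above):
$ mkdir -p /usr/local/hadoop-2.8.5/hadoop_data/hdfs/tmp \
           /usr/local/hadoop-2.8.5/hadoop_data/hdfs/namenode \
           /usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode \
           /usr/local/hadoop-2.8.5/hadoop_data/hdfs/edits \
           /usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints/edits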
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<!--<property>
<name>mapred.job.tracker</name>
<value>master:54311</value>
</property>
-->
<!-- JobHistory server web UI address -->
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
<!-- JobHistory server RPC address -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<!-- Uber mode (run small jobs inside the AM JVM); disabled here -->
<property>
<name>mapreduce.job.ubertask.enable</name>
<value>false</value>
</property>
<!-- Staging directory used while jobs are running -->
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>hdfs://master:9000/tmp/hadoop-yarn/staging</value>
<description>The staging dir used while submitting jobs.</description>
</property>
<property>
<name>mapreduce.jobhistory.intermediate-done-dir</name>
<value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
</property>
<!-- Where the MR JobHistory server keeps completed job history -->
<property>
<name>mapreduce.jobhistory.done-dir</name>
<value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>1024</value>
<description>Physical memory limit for each map task</description>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx819m</value>
<description>JVM heap for each map task (roughly 0.8 x mapreduce.map.memory.mb)</description>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>2048</value>
<description>Physical memory limit for each reduce task</description>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx1638m</value>
<description>JVM heap for each reduce task (roughly 0.8 x mapreduce.reduce.memory.mb)</description>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>
/usr/local/hadoop-2.8.5/etc/hadoop,
/usr/local/hadoop-2.8.5/share/hadoop/common/*,
/usr/local/hadoop-2.8.5/share/hadoop/common/lib/*,
/usr/local/hadoop-2.8.5/share/hadoop/hdfs/*,
/usr/local/hadoop-2.8.5/share/hadoop/hdfs/lib/*,
/usr/local/hadoop-2.8.5/share/hadoop/mapreduce/*,
/usr/local/hadoop-2.8.5/share/hadoop/mapreduce/lib/*,
/usr/local/hadoop-2.8.5/share/hadoop/yarn/*,
/usr/local/hadoop-2.8.5/share/hadoop/yarn/lib/*
</value>
</property>
</configuration>
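One thing to note: start-all.sh does not start the JobHistory server that the two jobhistory addresses above point at. Once the cluster is running, it has to be started separately on master, e.g.:
$ $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver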
master (new file)
master
slaves
master
data1
data2
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<!-- Use the Fair Scheduler -->
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
<!-- Enable log aggregation -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<!-- How long aggregated logs are kept on HDFS, in seconds -->
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>86400</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
<description>Minimum memory a single container can request (default 1024 MB)</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
<description>Maximum memory a single container can request (default 8192 MB)</description>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<!-- Memory available to YARN on this machine; may differ per node -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>4096</value>
</property>
<!-- Number of vcores available to YARN on this machine; may differ per node -->
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>2</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>Minimum number of vcores a single container can request (default 1)</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>5</value>
<description>Maximum number of vcores a single container can request</description>
</property>
<property>
<name>yarn.scheduler.increment-allocation-mb</name>
<value>256</value>
<description>Rounding increment for container memory requests (Fair Scheduler)</description>
</property>
<!--
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
-->
<property>
<description>Ratio between virtual memory to physical memory when
setting memory limits for containers. Container allocations are
expressed in terms of physical memory, and virtual memory usage
is allowed to exceed this allocation by this ratio.
</description>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>3.5</value>
</property>
</configuration>
That completes the parameter configuration.
Because Hive only needs to be installed on the node that will run it, and Spark has to be built before it can be used, the next step is to clone the virtual machine and set up passwordless SSH.
First edit /etc/hosts (the 127.0.1.1 line needs to be commented out) so every node can resolve the cluster hostnames, for example:
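For reference, the relevant part of my hosts file looks roughly like this (the IP addresses are placeholders; use the static IPs you assign to your own VMs):
192.168.1.100   master
192.168.1.101   data1
192.168.1.102   data2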
Then clone the virtual machine. Keep your physical machine's resources in mind and allocate memory and cores sensibly.
The per-node memory, the minimum and maximum container memory, and the minimum and maximum vcores configured above all need to be adapted to your machine's actual memory and core count.
Assign static IPs as described in the previous post, then name the master node master and number the data nodes in order (data1, data2).
To change a hostname, edit /etc/hostname:
$ sudo vim /etc/hostname
Setting up passwordless SSH
Run the following on every node:
$ ssh-keygen -t rsa
Answer the prompts as appropriate; the default key path is fine and an empty passphrase is simplest.
Then add the key to the ssh-agent (this requires a running agent):
$ ssh-add
Now copy the public key to each node, substituting the target hostname for the bracketed part:
$ ssh-copy-id -i ./.ssh/id_rsa.pub [master/data1/data2]
Type yes and then enter that host's password.
Run ssh <hostname> to confirm that passwordless login works.
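For instance, assuming the hostnames used above:
$ ssh data1      # should log in without asking for a password
$ exit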
Add the Hadoop environment variables
$ vim ~/.bashrc
Append the following lines:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
#export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/usr/local/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.8.5/lib/*:.
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop-2.8.5/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
Then reload the environment:
$ source ~/.bashrc
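A quick sanity check that the new variables are in effect:
$ echo $HADOOP_HOME    # should print /usr/local/hadoop-2.8.5
$ hadoop version       # should now resolve via PATH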
Before running Hadoop for the first time, format the NameNode:
$ hdfs namenode -format
Start the cluster:
$ start-all.sh
Run jps on each node to check the running daemons.
On master (which here is also a DataNode) you should see NameNode, SecondaryNameNode, ResourceManager, DataNode and NodeManager; on the data nodes you should see DataNode and NodeManager.
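Once the daemons are up, you can also ask HDFS directly whether all DataNodes have registered (with the slaves file above there should be three live nodes):
$ hdfs dfsadmin -report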
Check the web UI at master:50070 (HDFS)
and at master:8088 (YARN).
With that, the Hadoop cluster is up.
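As a final smoke test, you can submit the bundled pi example; the jar name below matches the 2.8.5 binary distribution, adjust it if your layout differs:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 2 10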