Big Data Platform Learning Path (2): Setting Up the Hadoop Platform

1. Background

The previous post covered installing Ubuntu 16.04 and setting up Java, SSH, and vim.

There, JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64.

This post focuses on setting up the Hadoop platform (hadoop-2.8.5).

2. Preparing the Files

I am using the officially released binary package of a stable Hadoop release. It may still contain unexpected bugs, so it is not recommended as-is for production; my focus, however, is on learning the Hadoop platform and the other components of its ecosystem, and I wanted to use Hive 3.1 and Spark 2.3, which is why I chose this version.

Update, March 20, 2019: after several failed attempts I checked upstream and found that Spark 2.3.3 does not yet support Hadoop 3.0 or the Hive 3.1 metastore, so Spark could not read Hive tables in that combination. Official support is expected in Spark 3.0; see the Spark JIRA at https://issues.apache.org/jira/browse/SPARK-24360?filter=-4&jql=project%20%3D%20SPARK%20AND%20text%20~%20%22hive%20metastore%22%20order%20by%20created%20DESC. I therefore moved Hadoop and Hive down to 2.x while keeping Spark 2.3.3, and revised this post accordingly; the basic configuration is the same.

https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz

$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
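Mirrors drop older releases over time; if the link above stops working, the Apache archive keeps every past version (same file, slower download):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz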

After the download finishes, extract the archive. I extracted it under /usr/local/, where I install all of my components; choose whatever location suits your habits.

Before extracting there, change the owner of the /usr/local/ directory to the current user:

sudo chown -R hadoop:hadoop /usr/local/

Extract to /usr/local:

tar zxvf hadoop-2.8.5.tar.gz -C /usr/local/

The configuration files that need editing are under /usr/local/hadoop-2.8.5/etc/hadoop:

  • core-site.xml
  • hadoop-env.sh
  • hdfs-site.xml
  • mapred-site.xml
  • master
  • slaves
  • yarn-site.xml

My configuration files are included below for reference, with comments on some of the parameters.

core-site.xml (master is the hostname alias of my master node in the hosts file)

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- NameNode address (fs.default.name is the deprecated name of fs.defaultFS; it still works in 2.8.5) -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
<!-- Base directory for files Hadoop generates at runtime -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/tmp</value>
  </property>
<!-- Read/write buffer size in bytes -->
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
  </property>
<!-- Proxy-user (impersonation) settings for the httpfs and root users -->
  <property>
    <name>hadoop.proxyuser.httpfs.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.httpfs.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.root.groups</name>
    <value>*</value>
  </property>
</configuration>

The JAVA_HOME variable must be set in hadoop-env.sh.

Add the following to hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin

hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- HDFS replication factor -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
<!-- Disable HDFS permission checking -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
<!-- Secondary NameNode web UI address -->
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>master:50090</value>
  </property>
<!-- NameNode web UI address -->
  <property>
    <name>dfs.namenode.http-address</name>
    <value>master:50070</value>
  </property>
<!-- Enable WebHDFS -->
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
<!-- User/group used for WebHDFS operations -->
  <property>
    <name>dfs.web.ugi</name>
    <value>supergroup</value>
  </property>
<!-- Where the NameNode stores its metadata (fsimage) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/namenode</value>
  </property>
<!-- DataNode data directory. It helps to number the last path component per node,
     e.g. node 1: file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode1
          node 2: file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode2
     I use master as a DataNode as well. -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode</value>
  </property>
<!-- Where the NameNode stores its edit logs -->
  <property>
    <name>dfs.namenode.edits.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/edits</value>
  </property>
<!-- Checkpoint directory used by the Secondary NameNode -->
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints</value>
  </property>
<!-- Checkpoint edits directory used by the Secondary NameNode -->
  <property>
    <name>dfs.namenode.checkpoint.edits.dir</name>
    <value>file:/usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints/edits</value>
  </property>

</configuration>
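Before formatting HDFS it is worth creating the local directories that core-site.xml and hdfs-site.xml point to, on every node. A sketch using the paths from my configuration (Hadoop can usually create them itself as long as it owns the parent directory, so this is only a precaution):

$ mkdir -p /usr/local/hadoop-2.8.5/hadoop_data/hdfs/{tmp,namenode,edits}
$ mkdir -p /usr/local/hadoop-2.8.5/hadoop_data/hdfs/data/datanode
$ mkdir -p /usr/local/hadoop-2.8.5/hadoop_data/hdfs/checkpoints/edits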

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<!-- Run MapReduce on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
<!--
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
-->
<!-- JobHistory server web UI address -->
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>master:19888</value>
  </property>
<!-- JobHistory server RPC address -->
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>master:10020</value>
  </property>
<!-- Uber mode (run small jobs inside the ApplicationMaster JVM) -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>false</value>
  </property>
<!-- Temporary staging directory used while jobs run -->
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>hdfs://master:9000/tmp/hadoop-yarn/staging</value>
    <description>The staging dir used while submitting jobs.</description>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</value>
  </property>
<!-- Where the MR JobHistory Server keeps completed-job history files -->
  <property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>${yarn.app.mapreduce.am.staging-dir}/history/done</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value>
    <description>Physical memory limit for each Map task</description>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx819m</value>
    <description>JVM heap for Map tasks, roughly 80% of mapreduce.map.memory.mb</description>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>2048</value>
    <description>Physical memory limit for each Reduce task</description>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1638m</value>
    <description>JVM heap for Reduce tasks, roughly 80% of mapreduce.reduce.memory.mb</description>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>
      /usr/local/hadoop-2.8.5/etc/hadoop,
      /usr/local/hadoop-2.8.5/share/hadoop/common/*,
      /usr/local/hadoop-2.8.5/share/hadoop/common/lib/*,
      /usr/local/hadoop-2.8.5/share/hadoop/hdfs/*,
      /usr/local/hadoop-2.8.5/share/hadoop/hdfs/lib/*,
      /usr/local/hadoop-2.8.5/share/hadoop/mapreduce/*,
      /usr/local/hadoop-2.8.5/share/hadoop/mapreduce/lib/*,
      /usr/local/hadoop-2.8.5/share/hadoop/yarn/*,
      /usr/local/hadoop-2.8.5/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>

master (new file)

master

slaves

master
data1
data2

yarn-site.xml

<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

<!-- Site specific YARN configuration properties -->
<!-- Auxiliary shuffle service for MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
<!-- Use the Fair Scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>
<!-- ResourceManager addresses -->
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>master:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>master:8088</value>
  </property>
<!-- Enable log aggregation -->
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
<!-- How long aggregated logs are kept on HDFS, in seconds -->
  <property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>86400</value>
  </property>
  <property>
    <description>Minimum memory a single container can request (default 1024 MB)</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <description>Maximum memory a single container can request (default 8192 MB)</description>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
<!-- Disable the virtual/physical memory checks so containers are not killed on small test VMs -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
<!-- Memory available to containers on this node; may differ per node -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>4096</value>
  </property>
<!-- Number of vcores available to containers on this node; may differ per node -->
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
    <description>Number of CPU cores that can be allocated for containers.</description>
  </property>
  <property>
    <description>Minimum vcores a single container can request (default 1)</description>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <description>Maximum vcores a single container can request</description>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>5</value>
  </property>
  <property>
    <description>Allocation increment (rounding step) used by the Fair Scheduler</description>
    <name>yarn.scheduler.increment-allocation-mb</name>
    <value>256</value>
  </property>
<!--
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
  </property>
-->
  <property>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers. Container allocations are
    expressed in terms of physical memory, and virtual memory usage
    is allowed to exceed this allocation by this ratio.
    </description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3.5</value>
  </property>

</configuration>

That completes the configuration.

Since Hive only needs to be installed on the node that runs it, and Spark has to be built before it can be used, the next step is to clone the virtual machines and set up passwordless SSH.

First edit the /etc/hosts file (the 127.0.1.1 line needs to be commented out).
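A sketch of what /etc/hosts can look like on every node; the addresses are placeholders for whatever static IPs you assign:

# 127.0.1.1    ubuntu      <- comment this line out
192.168.1.100  master
192.168.1.101  data1
192.168.1.102  data2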

Then clone the virtual machines. Keep your computer's total memory in mind and allocate RAM and cores to each VM sensibly.

The per-node available memory, the minimum and maximum memory per container, and the minimum and maximum vcores all need to be adjusted to your machine's actual memory and core count.

Assign static IPs using the method from the previous post, then name the master node master and number the data nodes in sequence.

To change a hostname, edit the /etc/hostname file:

$ sudo vim /etc/hostname
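A non-interactive alternative, sketched here for a node named data1 (substitute the right name on each machine); the new name takes full effect after a reboot:

$ echo data1 | sudo tee /etc/hostname
$ sudo hostname data1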

 

Set up passwordless SSH

Run the following steps on every node.

$  ssh-keygen -t rsa 

Answer the prompts as appropriate; accepting the default file path and leaving the passphrase empty is fine.

Then add the key to the SSH agent:

$ ssh-add
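ssh-add needs a running ssh-agent; if it complains that it cannot connect to the agent, start one in the current shell first (a minimal sketch):

$ eval "$(ssh-agent -s)"
$ ssh-add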

Now distribute the public keys.

The bracketed part below stands for the target hostname; run the command once for each node (master, data1, data2):

$ ssh-copy-id -i ./.ssh/id_rsa.pub [master/data1/data2]

Type yes and then the target host's password when prompted.

Run ssh followed by a hostname to check that you can now log in without a password.
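A quick way to check all three nodes at once, assuming the hostnames above:

$ for h in master data1 data2; do ssh $h hostname; done

If each line prints the expected hostname without asking for a password, passwordless login is working.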

Add the Hadoop environment variables

$ vim ~/.bashrc

Append the following:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
#export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:$JAVA_HOME/lib
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/usr/local/hadoop-2.8.5
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export CLASSPATH=$CLASSPATH:/usr/local/hadoop-2.8.5/lib/*:. 
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CONF_DIR=/usr/local/hadoop-2.8.5/etc/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Then reload the environment:

$ source ~/.bashrc
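A quick check that the variables took effect:

$ hadoop version    # should report Hadoop 2.8.5
$ echo $HADOOP_HOME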

The NameNode must be formatted before Hadoop is run for the first time.

Format the NameNode (on master only):

$ hdfs namenode -format

Start the cluster:

$ start-all.sh
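start-all.sh starts HDFS and YARN (it is equivalent to running start-dfs.sh and then start-yarn.sh), but not the MapReduce JobHistory server configured in mapred-site.xml; that one can be started separately:

$ mr-jobhistory-daemon.sh start historyserver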

Run jps to see which daemons are running.

On the data nodes you should see DataNode and NodeManager.
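Roughly what jps should report with this configuration (PIDs omitted; master also runs DataNode and NodeManager here because it is listed in slaves):

$ jps    # on master
NameNode
SecondaryNameNode
ResourceManager
DataNode
NodeManager
Jps

$ jps    # on data1 / data2
DataNode
NodeManager
Jps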

Open master:50070 to check the NameNode web UI.

Open master:8088 to check the ResourceManager web UI.

With that, the Hadoop cluster is set up.
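As a final sanity check, a sketch of running the example job that ships with the 2.8.5 binary distribution; if it completes and prints an estimate of Pi, then HDFS, YARN, and MapReduce are all working together:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar pi 2 10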

 

 
