Hadoop 2.7 Cluster Setup Notes

尘客.

Last modified 2023-10-16 14:19:46

Categories: Bigdata # install    Tags: writing configuration files, cluster setup, hadoop

First published 2019-08-20 22:56:34

Article link: https://blog.csdn.net/qq_34901049/article/details/99858834

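These notes record the basic settings in the handful of configuration files needed to set up a Hadoop cluster, then extend them with history-log configuration for MapReduce/YARN.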

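Overview

Setting up a Hadoop cluster requires four basic XML files, all located in $HADOOP_HOME/etc/hadoop:

 Each of these files overrides the settings inherited from its corresponding *-default.xml file.
 core-site.xml
 hdfs-site.xml
 yarn-site.xml
 mapred-site.xml (cp/mv from mapred-site.xml.template)

and three environment scripts:

 Their main job is to set JAVA_HOME and any other required runtime environment.
 hadoop-env.sh
 yarn-env.sh
 mapred-env.sh

Finally, configure the slaves file so that the whole cluster can be started with one command.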
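As a quick sanity check before editing, the file list above can be turned into a small script. This is only a sketch: the function names `missing_conf_files` and `prepare_mapred_site` are made up for illustration, and the configuration directory is passed in explicitly rather than assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop) for the files listed above.

# Print every expected configuration file that is absent from the given directory.
missing_conf_files() {
  local conf_dir="$1" f
  for f in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml \
           hadoop-env.sh yarn-env.sh mapred-env.sh slaves; do
    [ -f "$conf_dir/$f" ] || echo "$f"
  done
}

# Create mapred-site.xml from its template unless it already exists.
prepare_mapred_site() {
  local conf_dir="$1"
  if [ ! -f "$conf_dir/mapred-site.xml" ]; then
    cp "$conf_dir/mapred-site.xml.template" "$conf_dir/mapred-site.xml"
  fi
}

# Usage (assuming HADOOP_HOME is set):
#   prepare_mapred_site "$HADOOP_HOME/etc/hadoop"
#   missing_conf_files  "$HADOOP_HOME/etc/hadoop"
```

An empty result from `missing_conf_files` means all eight files are in place.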

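Test environment
The cluster is built on three CentOS 6.8 virtual machines whose hostnames are chdp01, chdp02 and chdp03; basic networking and DNS are already configured.

Cluster plan
All three nodes act as DataNode and NodeManager. chdp01 is the NameNode, chdp02 the SecondaryNameNode, and chdp03 the ResourceManager.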

I. Configuration files

The full configuration follows; the comments explain what each setting does.

1. core-site.xml

 <configuration>
        <!-- Address of the HDFS NameNode -->
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://chdp01:9000</value>
        </property>
        <!-- Base directory for the files Hadoop generates at runtime -->
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/usr/SFT/hadoop-2.7.2/data/tmp</value>
        </property>

        <!-- Trash configuration; descriptions taken from core-default.xml -->
        <property>
                <name>fs.trash.interval</name>
                <value>10</value>
                <description>
                        Number of minutes after which the checkpoint gets deleted.
                        If zero, the trash feature is disabled.
                </description>
        </property>
        <property>
                <name>fs.trash.checkpoint.interval</name>
                <value>0</value>
                <description>
                        Number of minutes between trash checkpoints.
                        Should be smaller or equal to fs.trash.interval. If zero,
                        the value is set to the value of fs.trash.interval.
                </description>
        </property>
 </configuration>

2. hdfs-site.xml


 <configuration>
        <!-- Replication factor for files; choose it according to the number of DataNodes -->
        <property>
                <name>dfs.replication</name>
                <value>3</value>
        </property>

        <!-- NameNode storage directory (deprecated key; Hadoop 2.x prefers dfs.namenode.name.dir) -->
        <property>
                <name>dfs.name.dir</name>
                <value>/data/hadoop/name</value>
        </property>
        <!-- DataNode storage directory (deprecated key; Hadoop 2.x prefers dfs.datanode.data.dir) -->
        <property>
                <name>dfs.data.dir</name>
                <value>/data/hadoop/data</value>
        </property>

        <!-- CLUSTER: HTTP address of the SecondaryNameNode. Any port works as long as it is not already in use or conventionally reserved -->
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>chdp02:50090</value>
        </property>
 </configuration>

3. yarn-site.xml

 <configuration>
        <!-- How reducers fetch map output (the shuffle auxiliary service) -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <!-- Hostname of the YARN ResourceManager -->
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>chdp03</value>
        </property>

        <!-- FOR LOGS: enable log aggregation -->
        <property>
                <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>
        <!-- FOR LOGS: how long to keep aggregated logs, in seconds (604800 s = 7 days) -->
        <property>
                <name>yarn.log-aggregation.retain-seconds</name>
                <value>604800</value>
        </property>
        <!-- URL of the job history server's log pages -->
        <property>
                <name>yarn.log.server.url</name>
                <value>http://chdp01:20000/jobhistory/logs/</value>
        </property>
        <!-- Advanced -->
        <!-- Maximum memory (MB) the ResourceManager grants to a single container request -->
        <property>
                <name>yarn.scheduler.maximum-allocation-mb</name>
                <value>2048</value>
        </property>
        <!-- Minimum memory (MB) the ResourceManager grants to a single container request -->
        <property>
                <name>yarn.scheduler.minimum-allocation-mb</name>
                <value>1024</value>
        </property>
        <!-- Ratio of virtual to physical memory: a container may use up to vmem-pmem-ratio times its physical memory allocation as virtual memory (default 2.1) -->
        <property>
                <name>yarn.nodemanager.vmem-pmem-ratio</name>
                <value>2.1</value>
        </property>
        <!-- Maximum JVM heap a task may request (a MapReduce property, conventionally placed in mapred-site.xml) -->
        <property>
                <name>mapred.child.java.opts</name>
                <value>-Xmx1024m</value>
        </property>
 </configuration>
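The numeric values in yarn-site.xml are easier to adjust once the arithmetic behind them is explicit. The sketch below recomputes the 7-day retention value and the virtual-memory ceiling implied by the ratio; the function names `retain_seconds` and `vmem_limit_mb` are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers that reproduce the arithmetic behind the yarn-site.xml values.

# Convert a retention period in days to the seconds expected by
# yarn.log-aggregation.retain-seconds.
retain_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

# Virtual-memory ceiling (MB) for a container with the given physical allocation (MB)
# and yarn.nodemanager.vmem-pmem-ratio. awk is used because shell arithmetic is integer-only.
vmem_limit_mb() {
  awk -v mb="$1" -v ratio="$2" 'BEGIN { printf "%.1f\n", mb * ratio }'
}

retain_seconds 7          # 604800
vmem_limit_mb 1024 2.1    # 2150.4
vmem_limit_mb 2048 2.1    # 4300.8
```

With the settings above, a 1024 MB container may therefore use roughly 2150 MB of virtual memory, and a 2048 MB container roughly 4300 MB.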

4. mapred-site.xml

 <configuration>
        <!-- Run MapReduce on YARN -->
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>

        <!-- FOR LOGS: address of the job history server. Any port not already in use works; the default is 10020 -->
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>chdp01:20001</value>
        </property>
        <!-- FOR LOGS: web address of the job history server. Any port not already in use works; the default is 19888 -->
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>chdp01:20000</value>
        </property>
 </configuration>

5. hadoop-env.sh / yarn-env.sh / mapred-env.sh
The same change applies to all three: set JAVA_HOME.

export JAVA_HOME=/usr/SFT/jdk1.8.0_191

6. slaves

List the IP address or hostname of every node to start, one per line. The file must not contain blank lines, and no line may end with spaces.
chdp01
chdp02
chdp03
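
Because stray whitespace in slaves is easy to miss by eye, a small check can catch it. This is a minimal sketch; `check_slaves` is a made-up helper name and the file path is supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail if the slaves file contains blank lines
# or lines that end with whitespace.
check_slaves() {
  local file="$1"
  # -n prints the offending line numbers; the pattern matches
  # whitespace-only (or empty) lines and trailing whitespace.
  if grep -nE '^[[:space:]]*$|[[:space:]]+$' "$file"; then
    echo "slaves file has blank lines or trailing whitespace" >&2
    return 1
  fi
  return 0
}

# Usage (assuming HADOOP_HOME is set):
#   check_slaves "$HADOOP_HOME/etc/hadoop/slaves"
```

The function prints the offending lines with their line numbers and returns a non-zero status, so it can gate a start script.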

7. Environment variables
Append the following to the end of /etc/profile:

export HADOOP_HOME=/usr/SFT/hadoop-2.7.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

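Notes:
1. The configuration above is written for my own virtual machines; adjust it to your actual environment.
2. NameNode and ResourceManager placement: each batch start script must be run on the node that hosts the corresponding master daemon, otherwise startup fails. For example, starting YARN from the wrong node in a cluster brings up the NodeManagers but not the ResourceManager.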

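II. Formatting and startup

1. Format
On the node configured as the NameNode, run bin/hdfs namenode -format. I have written up some problems caused by errors at the format stage, and their fixes, in this blog post: "A hadoop namenode startup failure and how it was solved (it died a few seconds after starting)".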
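
Formatting a NameNode directory that already holds metadata discards that metadata, so it can be worth guarding the command. The sketch below assumes the name directory from hdfs-site.xml above (/data/hadoop/name) and relies on the current/VERSION file that a formatted NameNode directory contains; `is_formatted` and `format_if_needed` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the default path comes from dfs.name.dir in hdfs-site.xml above.

# A formatted NameNode directory contains current/VERSION.
is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Format only when the directory has not been formatted yet.
format_if_needed() {
  local name_dir="${1:-/data/hadoop/name}"
  if is_formatted "$name_dir"; then
    echo "NameNode directory $name_dir is already formatted; skipping"
  else
    hdfs namenode -format
  fi
}

# Usage on chdp01:
#   format_if_needed /data/hadoop/name
```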

2. Start the cluster

chdp01: start-dfs.sh
chdp03: start-yarn.sh

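You can also write a script that starts everything in one go (passwordless SSH must be configured; see this blog post: "Host key verification failed.").
Using sbin/start-all.sh to start YARN and HDFS is not recommended: in most deployments the NameNode and the ResourceManager are on different nodes, and the script then fails ("Error starting ResourceManager").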

# start HDFS and YARN
ssh chdp01 '/usr/SFT/hadoop-2.7.2/sbin/start-dfs.sh'
ssh chdp03 '/usr/SFT/hadoop-2.7.2/sbin/start-yarn.sh'
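
After the start scripts return, the JDK's jps tool can confirm that each node runs the daemons assigned to it in the cluster plan. The following is a sketch only: `missing_daemons` is a made-up helper that reads jps output on stdin, and it assumes jps is on the PATH of the non-interactive SSH session.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every expected daemon that does not appear
# in `jps` output (lines of the form "<pid> <ClassName>") read from stdin.
missing_daemons() {
  local out d
  out="$(cat)"
  for d in "$@"; do
    # Anchor the match so that NameNode does not match SecondaryNameNode.
    printf '%s\n' "$out" | grep -qE "^[0-9]+ $d\$" || echo "$d"
  done
}

# Usage, following the cluster plan above:
#   ssh chdp01 jps | missing_daemons NameNode DataNode NodeManager
#   ssh chdp02 jps | missing_daemons SecondaryNameNode DataNode NodeManager
#   ssh chdp03 jps | missing_daemons ResourceManager DataNode NodeManager
```

No output for a node means every expected daemon was found there.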

3. Start the history log service

Run this on chdp01, the host configured in mapreduce.jobhistory.address:

mr-jobhistory-daemon.sh start historyserver
