Hadoop: A Quick Summary

This article covers Hadoop's core components — how HDFS, YARN, and MapReduce work — and walks through YARN resource configuration and scheduling, in particular how to configure multiple queues in capacity-scheduler.xml to improve resource allocation. It also lists the default and site-specific settings of the main configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) as an aid to understanding Hadoop cluster administration and tuning.

As is well known, Hadoop has three core components:

        HDFS: responsible for storing data

                DataNode: the nodes (servers) that actually store the data

                NameNode: the master; handles client requests, manages the namespace, and configures the replication policy

                The default HDFS block size is 128 MB (64 MB in 1.x; it is configurable, and the choice is driven mainly by disk transfer speed)

                SecondaryNameNode: assists the NameNode with metadata management. Note that it is NOT a hot standby for the NameNode, although its copy of the metadata can be used to help recover one. It merges the Edits and Fsimage files, performing a checkpoint once an hour by default, or once 1,000,000 operations have accumulated.
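To make the block-size rule concrete, here is a small illustrative sketch (plain Python, not Hadoop code) of how many blocks a file occupies under the default 128 MB block size. Note that HDFS blocks are not padded: a file smaller than one block only consumes its actual size on disk.

```python
# Illustrative sketch (not Hadoop code): how many HDFS blocks a file
# occupies under the default 128 MB block size.
import math

def num_blocks(file_size_mb: float, block_size_mb: int = 128) -> int:
    """Each block holds up to block_size_mb; the last block may be partial."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(300))   # a 300 MB file -> 3 blocks (128 + 128 + 44)
print(num_blocks(100))   # a 100 MB file -> 1 block (blocks are not padded)
```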

HDFS write flow:

HDFS read flow:

        YARN: responsible for resource scheduling (in Hadoop 1.x, resource scheduling was handled by MapReduce itself; YARN was introduced in 2.x to take over scheduling)

                ResourceManager:

                        handles client requests;

                        allocates and schedules resources;

                        monitors NodeManagers;

                        starts and monitors ApplicationMasters;

                NodeManager:

                        manages the resources on a single node;

                        handles commands from the ResourceManager;

                        handles commands from the ApplicationMaster;

                ApplicationMaster:

                        requests resources for the application and assigns them to its internal tasks;

                        monitors tasks and handles failures;

                Container:

                        a resource abstraction that encapsulates a node's multi-dimensional resources (memory, CPU, etc.); the place where tasks actually run;

How YARN works:

Key YARN parameter configuration:

Configuring multiple queues:

Why create multiple queues?

(1) To guard against a careless employee submitting runaway code (e.g. infinite recursion) that exhausts all cluster resources.

(2) To support degrading less important work, so that critical queues are guaranteed sufficient resources at peak times such as the 11.11 and 6.18 shopping festivals.

Business unit 1 (critical) => business unit 2 (important) => order placement (normal) => shopping cart (normal) => login/registration (minor)

capacity-scheduler.xml

<!-- Declare the queues: add a hive queue alongside the default queue -->

<property>

    <name>yarn.scheduler.capacity.root.queues</name>

    <value>default,hive</value>

    <description>

      The queues at this level (root is the root queue).

    </description>

</property>

<!-- Lower the default queue's guaranteed capacity to 40% (default 100%) -->

<property>

    <name>yarn.scheduler.capacity.root.default.capacity</name>

    <value>40</value>

</property>

<!-- Lower the default queue's maximum capacity to 60% (default 100%) -->

<property>

    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>

    <value>60</value>

</property>

<!-- Set the hive queue's guaranteed capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.capacity</name>

    <value>60</value>

</property>

<!-- Maximum fraction of the queue's configured capacity that a single user may use; 1 means one user can consume at most the queue's full capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>

    <value>1</value>

</property>

<!-- Set the hive queue's maximum capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>

    <value>80</value>

</property>

<!-- Enable the hive queue (state RUNNING) -->

<property>

    <name>yarn.scheduler.capacity.root.hive.state</name>

    <value>RUNNING</value>

</property>

<!-- Which users may submit applications to the queue -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>

    <value>*</value>

</property>

<!-- Which users may administer the queue (view/kill applications) -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>

    <value>*</value>

</property>

<!-- Which users may set application priority on submission -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>

    <value>*</value>

</property>

<!-- An application's timeout can be set with: yarn application -appId appId -updateLifetime Timeout

Reference: Enforcing application lifetime SLAs on YARN - Cloudera Blog -->

<!-- If an application specifies a timeout, then applications submitted to this queue may not specify a timeout larger than this value.
-->

<property>

    <name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime</name>

    <value>-1</value>

</property>

<!-- If an application does not specify a timeout, default-application-lifetime is used as the default -->

<property>

    <name>yarn.scheduler.capacity.root.hive.default-application-lifetime</name>

    <value>-1</value>

</property>
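To make the capacity vs. maximum-capacity numbers above concrete, here is a small arithmetic sketch. The 100 GB of schedulable cluster memory is an assumption for illustration, not something the config specifies:

```python
# Illustrative: guaranteed vs. maximum memory for the two queues configured
# above, on a hypothetical cluster with 100 GB of schedulable memory.
CLUSTER_MEM_GB = 100  # assumption, not from the config

queues = {
    # name: (capacity %, maximum-capacity %)
    "default": (40, 60),
    "hive":    (60, 80),
}

for name, (cap, max_cap) in queues.items():
    guaranteed = CLUSTER_MEM_GB * cap / 100
    ceiling = CLUSTER_MEM_GB * max_cap / 100
    print(f"{name}: guaranteed {guaranteed:.0f} GB, "
          f"may elastically grow to {ceiling:.0f} GB when idle capacity exists")
```

Note that the guaranteed capacities of the children of root must sum to 100%. A job can be directed to the new queue at submission time with `-D mapreduce.job.queuename=hive` (otherwise it lands in default).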

        MapReduce: responsible for computation (e.g. wordcount)

Input splitting: the split size defaults to the block size, 128 MB

CombineTextInputFormat is for scenarios with many small files: it logically packs multiple small files into a single split

How it works:
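The two phases of CombineTextInputFormat (virtual storage, then slicing) can be sketched as a simplified model. The 4 MB `setMaxInputSplitSize` and the file sizes below are assumed for illustration; real Hadoop works on byte counts and also considers data locality:

```python
# Simplified model of CombineTextInputFormat's two phases
# (assumes max_split = 4 MB; sizes in MB for readability).

def virtual_chunks(file_sizes, max_split):
    """Phase 1 (virtual storage): a file <= max stays whole; a file between
    max and 2*max is halved; a larger file sheds max-sized chunks first."""
    chunks = []
    for size in file_sizes:
        while size > 2 * max_split:
            chunks.append(max_split)
            size -= max_split
        if size > max_split:
            chunks.extend([size / 2, size / 2])
        else:
            chunks.append(size)
    return chunks

def combine_splits(chunks, max_split):
    """Phase 2 (slicing): merge consecutive chunks until >= max_split."""
    splits, current = [], 0
    for c in chunks:
        current += c
        if current >= max_split:
            splits.append(current)
            current = 0
    if current:
        splits.append(current)
    return splits

files = [1.7, 5.1, 3.4, 6.8]           # small-file sizes in MB (assumed)
chunks = virtual_chunks(files, 4)       # [1.7, 2.55, 2.55, 3.4, 3.4, 3.4]
print([round(s, 2) for s in combine_splits(chunks, 4)])  # [4.25, 5.95, 6.8]
```

Four small files thus produce only 3 splits (and 3 MapTasks) instead of 4, which is the point of the format: avoiding one MapTask per tiny file.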

The shuffle mechanism:

The circular in-memory buffer is 100 MB by default. When it reaches 80% full, writing reverses direction and the filled portion spills to disk: the records are quick-sorted by partition and then by key, and written out as a temporary spill file. (The buffer holds serialized key/value data on one side and metadata on the other; the metadata records each record's partition and the indices of its key and value within the buffer.) In the merge phase, the MapTask merge-sorts all of its spill files, partition by partition, into one large file. Each ReduceTask then pulls the data for its partition, spilling to disk when a memory threshold is exceeded and otherwise keeping it in memory; it groups the records by key, performs one final merge sort, and writes the result to HDFS.
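The buffer numbers in the paragraph above come from two configurable properties, and the arithmetic is simple:

```python
# Default shuffle buffer: mapreduce.task.io.sort.mb = 100 (MB) and
# mapreduce.map.sort.spill.percent = 0.80 -> a spill starts at 80 MB,
# leaving 20 MB so the map task can keep writing while the spill runs.
SORT_MB = 100
SPILL_PERCENT = 0.80

spill_at_mb = round(SORT_MB * SPILL_PERCENT)
remaining_mb = SORT_MB - spill_at_mb
print(spill_at_mb, remaining_mb)  # 80 20
```

Raising `mapreduce.task.io.sort.mb` is a common tuning step: a larger buffer means fewer spills and fewer merge passes.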

Default configuration files:

        Core configuration file: core-default.xml
        HDFS configuration file: hdfs-default.xml
        YARN configuration file: yarn-default.xml
        MapReduce configuration file: mapred-default.xml

Custom (site-specific) configuration files:

       Core configuration file: core-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- NameNode address; port 9820 is configurable and is the port for intra-cluster communication -->

    <property>

        <name>fs.defaultFS</name>

        <value>hdfs://localhost:9820</value>

    </property>

    <!-- Hadoop data storage directory -->

    <property>

        <name>hadoop.tmp.dir</name>

        <value>/path/to/hadoop/data</value>

    </property>

    <!-- Static user that the HDFS web UI operates as -->

    <property>

        <name>hadoop.http.staticuser.user</name> 

        <value>xxx</value>

    </property>

</configuration>

        HDFS configuration file: hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- NameNode web UI address -->

    <property>

        <name>dfs.namenode.http-address</name>

        <value>localhost:9870</value>

    </property>

    <!-- SecondaryNameNode web UI address -->

    <property>

        <name>dfs.namenode.secondary.http-address</name>

        <value>localhost:9868</value>

    </property>

</configuration>

        YARN configuration file: yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- Enable the MapReduce shuffle auxiliary service -->

    <property>

        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

    </property>

    <!-- ResourceManager address -->

    <property>

        <name>yarn.resourcemanager.hostname</name>

        <value>localhost</value>

    </property>

    <!-- Environment variables inherited by containers -->

    <property>

        <name>yarn.nodemanager.env-whitelist</name>

        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

    </property>

<!-- Enable log aggregation -->

<property>

    <name>yarn.log-aggregation-enable</name>

    <value>true</value>

</property>

<!-- Log aggregation server URL -->

<property>  

    <name>yarn.log.server.url</name>  

    <value>http://localhost:19888/jobhistory/logs</value>

</property>

<!-- Retain aggregated logs for 7 days (7 × 24 × 3600 = 604800 seconds) -->

<property>

    <name>yarn.log-aggregation.retain-seconds</name>

    <value>604800</value>

</property>

<!-- Choose the scheduler; the Capacity Scheduler is the default -->

<property>

<description>The class to use as the resource scheduler.</description>

<name>yarn.resourcemanager.scheduler.class</name>

<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>

</property>

<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. If more than 50 tasks are submitted concurrently, this can be raised, but it should not exceed the cluster's total core count, e.g. 3 nodes × 4 cores = 12 threads (and, leaving cores for other processes, effectively no more than 8) -->

<property>

<description>Number of threads to handle scheduler interface.</description>

<name>yarn.resourcemanager.scheduler.client.thread-count</name>

<value>8</value>

</property>

<!-- Whether YARN auto-detects hardware for its configuration; default false. If the node runs many other applications, configure manually; if it runs nothing else, auto-detection is fine -->

<property>

<description>Enable auto-detection of node capabilities such as

memory and CPU.

</description>

<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>

<value>false</value>

</property>

<!-- Whether to count logical (hyper-threaded) processors as cores; default false, i.e. use the physical core count -->

<property>

<description>Flag to determine if logical processors(such as

hyperthreads) should be counted as cores. Only applicable on Linux

when yarn.nodemanager.resource.cpu-vcores is set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true.

</description>

<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>

<value>false</value>

</property>

<!-- Multiplier from physical cores to vcores; default 1.0 -->

<property>

<description>Multiplier to determine how to convert physical cores to

vcores. This value is used if yarn.nodemanager.resource.cpu-vcores

is set to -1(which implies auto-calculate vcores) and

yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.

</description>

<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>

<value>1.0</value>

</property>

<!-- Memory the NodeManager offers to containers; default 8 GB, lowered here to 4 GB -->

<property>

<description>Amount of physical memory, in MB, that can be allocated

for containers. If set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true, it is

automatically calculated(in case of Windows and Linux).

In other cases, the default is 8192MB.

</description>

<name>yarn.nodemanager.resource.memory-mb</name>

<value>4096</value>

</property>

<!-- vcores the NodeManager offers; defaults to 8 when not auto-detected from hardware, lowered here to 4 -->

<property>

<description>Number of vcores that can be allocated

for containers. This is used by the RM scheduler when allocating

resources for containers. This is not used to limit the number of

CPUs used by YARN containers. If it is set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true, it is

automatically determined from the hardware in case of Windows and Linux.

In other cases, number of vcores is 8 by default.</description>

<name>yarn.nodemanager.resource.cpu-vcores</name>

<value>4</value>

</property>

<!-- Minimum container memory; default 1 GB -->

<property>

<description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have less memory than this value will be shut down by the resource manager.

</description>

<name>yarn.scheduler.minimum-allocation-mb</name>

<value>1024</value>

</property>

<!-- Maximum container memory; default 8 GB, lowered here to 2 GB -->

<property>

<description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.

</description>

<name>yarn.scheduler.maximum-allocation-mb</name>

<value>2048</value>

</property>

<!-- Minimum container vcores; default 1 -->

<property>

<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.

</description>

<name>yarn.scheduler.minimum-allocation-vcores</name>

<value>1</value>

</property>

<!-- Maximum container vcores; default 4, lowered here to 2 -->

<property>

<description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an

InvalidResourceRequestException.</description>

<name>yarn.scheduler.maximum-allocation-vcores</name>

<value>2</value>

</property>

<!-- Virtual-memory check; enabled by default, disabled here -->

<property>

<description>Whether virtual memory limits will be enforced for

containers.</description>

<name>yarn.nodemanager.vmem-check-enabled</name>

<value>false</value>

</property>

<!-- Ratio of virtual to physical memory per container; default 2.1 -->

<property>

<description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.

</description>

<name>yarn.nodemanager.vmem-pmem-ratio</name>

<value>2.1</value>

</property>

</configuration>
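With the settings above, the virtual-memory ceiling for a container follows directly from vmem-pmem-ratio (arithmetic sketch; this check only actually kills containers when vmem-check-enabled is true, which the config above turns off):

```python
# With yarn.scheduler.minimum-allocation-mb = 1024 and
# yarn.nodemanager.vmem-pmem-ratio = 2.1, a minimum (1 GB) container may use
# up to 1024 * 2.1 MB of virtual memory before the vmem check would kill it.
PMEM_MB = 1024
VMEM_RATIO = 2.1

vmem_limit_mb = PMEM_MB * VMEM_RATIO
print(round(vmem_limit_mb, 1))  # 2150.4
```

Disabling the vmem check (as above) is a common workaround for JVMs that reserve large amounts of virtual memory without ever touching it.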

        MapReduce configuration file: mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Run MapReduce programs on YARN -->

    <property>

        <name>mapreduce.framework.name</name>

        <value>yarn</value>

    </property>

<!-- JobHistory server address -->

<property>

    <name>mapreduce.jobhistory.address</name>

    <value>localhost:10020</value>

</property>

<!-- JobHistory server web UI address -->

<property>

    <name>mapreduce.jobhistory.webapp.address</name>

    <value>localhost:19888</value>

</property>

</configuration>
