Hadoop: A Quick Summary

This article covers Hadoop's core components — how HDFS, YARN, and MapReduce work — and walks through YARN resource configuration and scheduling, in particular how to configure multiple queues in capacity-scheduler.xml to improve resource allocation. It also lists the default and site-specific settings of the main configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) as an aid to understanding Hadoop cluster administration and tuning.

As is well known, Hadoop has three core components:

        HDFS: responsible for storing data

                DataNode: the nodes (servers) that actually store the data

                NameNode: the master; handles client requests, manages the namespace, and configures the replication policy

                The default HDFS block size is 128 MB (64 MB in 1.x; it is configurable, and the choice is driven mainly by disk transfer speed)

                SecondaryNameNode: assists the NameNode with metadata management. Note that it is NOT a hot standby for the NameNode, although its copy of the metadata can be used to help recover one. It merges the Edits and Fsimage files, performing a checkpoint once an hour by default, or once 1,000,000 operations have accumulated.
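To make the block-size rule concrete, here is a small illustrative sketch (plain Python, not Hadoop code) of how many blocks a file occupies under the default 128 MB block size. Note that HDFS blocks are not padded: a file smaller than one block only consumes its actual size on disk.

```python
# Illustrative sketch (not Hadoop code): how many HDFS blocks a file
# occupies under the default 128 MB block size.
import math

def num_blocks(file_size_mb: float, block_size_mb: int = 128) -> int:
    """Each block holds up to block_size_mb; the last block may be partial."""
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(300))   # a 300 MB file -> 3 blocks (128 + 128 + 44)
print(num_blocks(100))   # a 100 MB file -> 1 block (blocks are not padded)
```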

HDFS write flow:

HDFS read flow:

        YARN: responsible for resource scheduling (in Hadoop 1.x, resource scheduling was handled by MapReduce itself; YARN was introduced in 2.x to take over scheduling)

                ResourceManager:

                        handles client requests;

                        allocates and schedules resources;

                        monitors NodeManagers;

                        starts and monitors ApplicationMasters;

                NodeManager:

                        manages the resources on a single node;

                        handles commands from the ResourceManager;

                        handles commands from the ApplicationMaster;

                ApplicationMaster:

                        requests resources for the application and assigns them to its internal tasks;

                        monitors tasks and handles failures;

                Container:

                        a resource abstraction that encapsulates a node's multi-dimensional resources (memory, CPU, etc.); the place where tasks actually run;

How YARN works:

Key YARN parameter configuration:

Configuring multiple queues:

Why create multiple queues?

(1) To guard against a careless employee submitting runaway code (e.g. infinite recursion) that exhausts all cluster resources.

(2) To support degrading less important work, so that critical queues are guaranteed sufficient resources at peak times such as the 11.11 and 6.18 shopping festivals.

Business unit 1 (critical) => business unit 2 (important) => order placement (normal) => shopping cart (normal) => login/registration (minor)

capacity-scheduler.xml

<!-- Declare the queues: add a hive queue alongside the default queue -->

<property>

    <name>yarn.scheduler.capacity.root.queues</name>

    <value>default,hive</value>

    <description>

      The queues at this level (root is the root queue).

    </description>

</property>

<!-- Lower the default queue's guaranteed capacity to 40% (default 100%) -->

<property>

    <name>yarn.scheduler.capacity.root.default.capacity</name>

    <value>40</value>

</property>

<!-- Lower the default queue's maximum capacity to 60% (default 100%) -->

<property>

    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>

    <value>60</value>

</property>

<!-- Set the hive queue's guaranteed capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.capacity</name>

    <value>60</value>

</property>

<!-- Maximum fraction of the queue's configured capacity that a single user may use; 1 means one user can consume at most the queue's full capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>

    <value>1</value>

</property>

<!-- Set the hive queue's maximum capacity -->

<property>

    <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>

    <value>80</value>

</property>

<!-- Enable the hive queue (state RUNNING) -->

<property>

    <name>yarn.scheduler.capacity.root.hive.state</name>

    <value>RUNNING</value>

</property>

<!-- Which users may submit applications to the queue -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>

    <value>*</value>

</property>

<!-- Which users may administer the queue (view/kill applications) -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>

    <value>*</value>

</property>

<!-- Which users may set application priority on submission -->

<property>

    <name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>

    <value>*</value>

</property>

<!-- An application's timeout can be set with: yarn application -appId appId -updateLifetime Timeout

Reference: Enforcing application lifetime SLAs on YARN - Cloudera Blog -->

<!-- If an application specifies a timeout, then applications submitted to this queue may not specify a timeout larger than this value.
-->

<property>

    <name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime</name>

    <value>-1</value>

</property>

<!-- If an application does not specify a timeout, default-application-lifetime is used as the default -->

<property>

    <name>yarn.scheduler.capacity.root.hive.default-application-lifetime</name>

    <value>-1</value>

</property>
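To make the capacity vs. maximum-capacity numbers above concrete, here is a small arithmetic sketch. The 100 GB of schedulable cluster memory is an assumption for illustration, not something the config specifies:

```python
# Illustrative: guaranteed vs. maximum memory for the two queues configured
# above, on a hypothetical cluster with 100 GB of schedulable memory.
CLUSTER_MEM_GB = 100  # assumption, not from the config

queues = {
    # name: (capacity %, maximum-capacity %)
    "default": (40, 60),
    "hive":    (60, 80),
}

for name, (cap, max_cap) in queues.items():
    guaranteed = CLUSTER_MEM_GB * cap / 100
    ceiling = CLUSTER_MEM_GB * max_cap / 100
    print(f"{name}: guaranteed {guaranteed:.0f} GB, "
          f"may elastically grow to {ceiling:.0f} GB when idle capacity exists")
```

Note that the guaranteed capacities of the children of root must sum to 100%. A job can be directed to the new queue at submission time with `-D mapreduce.job.queuename=hive` (otherwise it lands in default).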

        MapReduce: responsible for computation (e.g. wordcount)

Input splitting: the split size defaults to the block size, 128 MB

CombineTextInputFormat is for scenarios with many small files: it logically packs multiple small files into a single split

How it works:
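The two phases of CombineTextInputFormat (virtual storage, then slicing) can be sketched as a simplified model. The 4 MB `setMaxInputSplitSize` and the file sizes below are assumed for illustration; real Hadoop works on byte counts and also considers data locality:

```python
# Simplified model of CombineTextInputFormat's two phases
# (assumes max_split = 4 MB; sizes in MB for readability).

def virtual_chunks(file_sizes, max_split):
    """Phase 1 (virtual storage): a file <= max stays whole; a file between
    max and 2*max is halved; a larger file sheds max-sized chunks first."""
    chunks = []
    for size in file_sizes:
        while size > 2 * max_split:
            chunks.append(max_split)
            size -= max_split
        if size > max_split:
            chunks.extend([size / 2, size / 2])
        else:
            chunks.append(size)
    return chunks

def combine_splits(chunks, max_split):
    """Phase 2 (slicing): merge consecutive chunks until >= max_split."""
    splits, current = [], 0
    for c in chunks:
        current += c
        if current >= max_split:
            splits.append(current)
            current = 0
    if current:
        splits.append(current)
    return splits

files = [1.7, 5.1, 3.4, 6.8]           # small-file sizes in MB (assumed)
chunks = virtual_chunks(files, 4)       # [1.7, 2.55, 2.55, 3.4, 3.4, 3.4]
print([round(s, 2) for s in combine_splits(chunks, 4)])  # [4.25, 5.95, 6.8]
```

Four small files thus produce only 3 splits (and 3 MapTasks) instead of 4, which is the point of the format: avoiding one MapTask per tiny file.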

The shuffle mechanism:

The circular in-memory buffer is 100 MB by default. When it reaches 80% full, writing reverses direction and the filled portion spills to disk: the records are quick-sorted by partition and then by key, and written out as a temporary spill file. (The buffer holds serialized key/value data on one side and metadata on the other; the metadata records each record's partition and the indices of its key and value within the buffer.) In the merge phase, the MapTask merge-sorts all of its spill files, partition by partition, into one large file. Each ReduceTask then pulls the data for its partition, spilling to disk when a memory threshold is exceeded and otherwise keeping it in memory; it groups the records by key, performs one final merge sort, and writes the result to HDFS.
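The buffer numbers in the paragraph above come from two configurable properties, and the arithmetic is simple:

```python
# Default shuffle buffer: mapreduce.task.io.sort.mb = 100 (MB) and
# mapreduce.map.sort.spill.percent = 0.80 -> a spill starts at 80 MB,
# leaving 20 MB so the map task can keep writing while the spill runs.
SORT_MB = 100
SPILL_PERCENT = 0.80

spill_at_mb = round(SORT_MB * SPILL_PERCENT)
remaining_mb = SORT_MB - spill_at_mb
print(spill_at_mb, remaining_mb)  # 80 20
```

Raising `mapreduce.task.io.sort.mb` is a common tuning step: a larger buffer means fewer spills and fewer merge passes.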

Default configuration files:

        Core configuration file: core-default.xml
        HDFS configuration file: hdfs-default.xml
        YARN configuration file: yarn-default.xml
        MapReduce configuration file: mapred-default.xml

Custom (site-specific) configuration files:

       Core configuration file: core-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- NameNode address; port 9820 is configurable and is the port for intra-cluster communication -->

    <property>

        <name>fs.defaultFS</name>

        <value>hdfs://localhost:9820</value>

    </property>

    <!-- Hadoop data storage directory -->

    <property>

        <name>hadoop.tmp.dir</name>

        <value>/path/to/hadoop/data</value>

    </property>

    <!-- Static user that the HDFS web UI operates as -->

    <property>

        <name>hadoop.http.staticuser.user</name> 

        <value>xxx</value>

    </property>

</configuration>

        HDFS configuration file: hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- NameNode web UI address -->

    <property>

        <name>dfs.namenode.http-address</name>

        <value>localhost:9870</value>

    </property>

    <!-- SecondaryNameNode web UI address -->

    <property>

        <name>dfs.namenode.secondary.http-address</name>

        <value>localhost:9868</value>

    </property>

</configuration>

        YARN configuration file: yarn-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

    <!-- Enable the MapReduce shuffle auxiliary service -->

    <property>

        <name>yarn.nodemanager.aux-services</name>

        <value>mapreduce_shuffle</value>

    </property>

    <!-- ResourceManager address -->

    <property>

        <name>yarn.resourcemanager.hostname</name>

        <value>localhost</value>

    </property>

    <!-- Environment variables inherited by containers -->

    <property>

        <name>yarn.nodemanager.env-whitelist</name>

        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>

    </property>

<!-- Enable log aggregation -->

<property>

    <name>yarn.log-aggregation-enable</name>

    <value>true</value>

</property>

<!-- Log aggregation server URL -->

<property>  

    <name>yarn.log.server.url</name>  

    <value>http://localhost:19888/jobhistory/logs</value>

</property>

<!-- Retain aggregated logs for 7 days (7 × 24 × 3600 = 604800 seconds) -->

<property>

    <name>yarn.log-aggregation.retain-seconds</name>

    <value>604800</value>

</property>

<!-- Choose the scheduler; the Capacity Scheduler is the default -->

<property>

<description>The class to use as the resource scheduler.</description>

<name>yarn.resourcemanager.scheduler.class</name>

<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>

</property>

<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. If more than 50 tasks are submitted concurrently, this can be raised, but it should not exceed the cluster's total core count, e.g. 3 nodes × 4 cores = 12 threads (and, leaving cores for other processes, effectively no more than 8) -->

<property>

<description>Number of threads to handle scheduler interface.</description>

<name>yarn.resourcemanager.scheduler.client.thread-count</name>

<value>8</value>

</property>

<!-- Whether YARN auto-detects hardware for its configuration; default false. If the node runs many other applications, configure manually; if it runs nothing else, auto-detection is fine -->

<property>

<description>Enable auto-detection of node capabilities such as

memory and CPU.

</description>

<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>

<value>false</value>

</property>

<!-- Whether to count logical (hyper-threaded) processors as cores; default false, i.e. use the physical core count -->

<property>

<description>Flag to determine if logical processors(such as

hyperthreads) should be counted as cores. Only applicable on Linux

when yarn.nodemanager.resource.cpu-vcores is set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true.

</description>

<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>

<value>false</value>

</property>

<!-- Multiplier from physical cores to vcores; default 1.0 -->

<property>

<description>Multiplier to determine how to convert physical cores to

vcores. This value is used if yarn.nodemanager.resource.cpu-vcores

is set to -1(which implies auto-calculate vcores) and

yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The number of vcores will be calculated as number of CPUs * multiplier.

</description>

<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>

<value>1.0</value>

</property>

<!-- Memory the NodeManager offers to containers; default 8 GB, lowered here to 4 GB -->

<property>

<description>Amount of physical memory, in MB, that can be allocated

for containers. If set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true, it is

automatically calculated(in case of Windows and Linux).

In other cases, the default is 8192MB.

</description>

<name>yarn.nodemanager.resource.memory-mb</name>

<value>4096</value>

</property>

<!-- vcores the NodeManager offers; defaults to 8 when not auto-detected from hardware, lowered here to 4 -->

<property>

<description>Number of vcores that can be allocated

for containers. This is used by the RM scheduler when allocating

resources for containers. This is not used to limit the number of

CPUs used by YARN containers. If it is set to -1 and

yarn.nodemanager.resource.detect-hardware-capabilities is true, it is

automatically determined from the hardware in case of Windows and Linux.

In other cases, number of vcores is 8 by default.</description>

<name>yarn.nodemanager.resource.cpu-vcores</name>

<value>4</value>

</property>

<!-- Minimum container memory; default 1 GB -->

<property>

<description>The minimum allocation for every container request at the RM in MBs. Memory requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have less memory than this value will be shut down by the resource manager.

</description>

<name>yarn.scheduler.minimum-allocation-mb</name>

<value>1024</value>

</property>

<!-- Maximum container memory; default 8 GB, lowered here to 2 GB -->

<property>

<description>The maximum allocation for every container request at the RM in MBs. Memory requests higher than this will throw an InvalidResourceRequestException.

</description>

<name>yarn.scheduler.maximum-allocation-mb</name>

<value>2048</value>

</property>

<!-- Minimum container vcores; default 1 -->

<property>

<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.

</description>

<name>yarn.scheduler.minimum-allocation-vcores</name>

<value>1</value>

</property>

<!-- Maximum container vcores; default 4, lowered here to 2 -->

<property>

<description>The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an

InvalidResourceRequestException.</description>

<name>yarn.scheduler.maximum-allocation-vcores</name>

<value>2</value>

</property>

<!-- Virtual-memory check; enabled by default, disabled here -->

<property>

<description>Whether virtual memory limits will be enforced for

containers.</description>

<name>yarn.nodemanager.vmem-check-enabled</name>

<value>false</value>

</property>

<!-- Ratio of virtual to physical memory per container; default 2.1 -->

<property>

<description>Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.

</description>

<name>yarn.nodemanager.vmem-pmem-ratio</name>

<value>2.1</value>

</property>

</configuration>
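With the settings above, the virtual-memory ceiling for a container follows directly from vmem-pmem-ratio (arithmetic sketch; this check only actually kills containers when vmem-check-enabled is true, which the config above turns off):

```python
# With yarn.scheduler.minimum-allocation-mb = 1024 and
# yarn.nodemanager.vmem-pmem-ratio = 2.1, a minimum (1 GB) container may use
# up to 1024 * 2.1 MB of virtual memory before the vmem check would kill it.
PMEM_MB = 1024
VMEM_RATIO = 2.1

vmem_limit_mb = PMEM_MB * VMEM_RATIO
print(round(vmem_limit_mb, 1))  # 2150.4
```

Disabling the vmem check (as above) is a common workaround for JVMs that reserve large amounts of virtual memory without ever touching it.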

        MapReduce configuration file: mapred-site.xml

<?xml version="1.0" encoding="UTF-8"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Run MapReduce programs on YARN -->

    <property>

        <name>mapreduce.framework.name</name>

        <value>yarn</value>

    </property>

<!-- JobHistory server address -->

<property>

    <name>mapreduce.jobhistory.address</name>

    <value>localhost:10020</value>

</property>

<!-- JobHistory server web UI address -->

<property>

    <name>mapreduce.jobhistory.webapp.address</name>

    <value>localhost:19888</value>

</property>

</configuration>
