Overview
A pseudo-cluster means installing Hadoop across several machines without high availability or fault tolerance; this is suitable for development environments.
First we download the Hadoop installation package; I am using the CDH 5.14.0 release, which can be downloaded from Cloudera.
First, a word about how Hadoop's configuration files are organized:
Hadoop has two types of configuration files.
One type is the read-only defaults: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
The other type is the site-specific files that we edit ourselves: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.
In addition, we can configure two scripts: conf/hadoop-env.sh and conf/yarn-env.sh.
Detailed configuration
We start with conf/hadoop-env.sh and conf/yarn-env.sh.
Step 1: set JAVA_HOME.
Step 2 (optional): administrators can tune each daemon individually through the following environment variables:
Daemon | Environment Variable |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS |
ResourceManager | YARN_RESOURCEMANAGER_OPTS |
NodeManager | YARN_NODEMANAGER_OPTS |
WebAppProxy | YARN_PROXYSERVER_OPTS |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS |
For example, we can make the NameNode use the parallel garbage collector:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Some other useful settings:
HADOOP_LOG_DIR / YARN_LOG_DIR - the directories where the daemons write their logs; created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size, in MB.
Daemon | Environment Variable |
---|---|
ResourceManager | YARN_RESOURCEMANAGER_HEAPSIZE |
NodeManager | YARN_NODEMANAGER_HEAPSIZE |
WebAppProxy | YARN_PROXYSERVER_HEAPSIZE |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE |
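Putting the settings above together, the two env scripts might look like this; every path and size here is an assumption for a small development machine, not a recommendation:

```shell
# Hypothetical additions to conf/hadoop-env.sh; adjust to your machine.
export HADOOP_LOG_DIR=/var/log/hadoop   # daemon log directory (created if missing)
export HADOOP_HEAPSIZE=1024             # maximum daemon heap, in MB

# Hypothetical additions to conf/yarn-env.sh: per-daemon heap sizes, in MB.
export YARN_RESOURCEMANAGER_HEAPSIZE=1024
export YARN_NODEMANAGER_HEAPSIZE=768
```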
Now we can configure the daemons in detail:
conf/core-site.xml
Parameter | Value | Notes |
---|---|---|
fs.defaultFS | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | 131072 | Size of read/write buffer used in SequenceFiles. |
fs.defaultFS: the URI through which clients access the distributed file system.
conf/hdfs-site.xml
- Configurations for NameNode:
Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
- Configurations for DataNode:
Parameter | Value | Notes |
---|---|---|
dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
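To make the redundancy behavior concrete, a hypothetical hdfs-site.xml fragment with comma-delimited directories might look like this (the /data/1 and /data/2 paths are assumptions; point them at different physical disks):

```xml
<!-- NameNode metadata is replicated into every listed directory. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
</property>
<!-- DataNode blocks are spread across the listed directories. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```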
conf/yarn-site.xml
- Configurations for ResourceManager and NodeManager:
Parameter | Value | Notes |
---|---|---|
yarn.acl.enable | true / false | Enable ACLs? Defaults to false. |
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access. |
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation. |
- Configurations for ResourceManager:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.hostname | ResourceManager host. | host. Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs. |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs. |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |

- Configurations for NodeManager:

Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Available physical memory, in MB, for the given NodeManager. | Defines the total resources on the NodeManager made available to running containers. |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio, and the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Needs appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |

- Configurations for History Server (needs to be moved elsewhere):

Parameter | Value | Notes |
---|---|---|
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them. -1 disables. Be careful: set this too small and you will spam the name node. |
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value, the value is computed as one-tenth of the aggregated log retention time. Be careful: set this too small and you will spam the name node. |
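For instance, to turn log aggregation on and keep aggregated logs for a week, a yarn-site.xml fragment could look like this (the retention period is an arbitrary example, not a recommendation):

```xml
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<!-- 604800 seconds = 7 days; pick a value that fits your needs. -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
```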
conf/mapred-site.xml
- Configurations for MapReduce Applications:
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps. |

- Configurations for MapReduce JobHistory Server:

Parameter | Value | Notes |
---|---|---|
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
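As a sketch, pointing the JobHistory Server at a host named node-master (a hostname assumed here, matching the one used later in this guide) would look like this in mapred-site.xml:

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node-master:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node-master:19888</value>
</property>
```

In Hadoop 2.x the server itself is started with mr-jobhistory-daemon.sh start historyserver.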
Basic configuration
The sections above describe the available options; how you tune them is up to you. Below is a basic configuration that yields a working environment.
hadoop-env.sh (set JAVA_HOME):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
slaves
node01
node02
node03
Explanation
slaves
A word about what this configuration file does:
Typically, you pick one machine in the cluster as the NameNode and another machine as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are referred to as slaves. This file lists the hostnames of the nodes that run the DataNode and NodeManager daemons.
Memory
There are also the memory-related settings in yarn-site.xml and mapred-site.xml:
Two kinds of processes run on a YARN cluster:
One is the ApplicationMaster (AM), which monitors the application and coordinates the distributed executors across the cluster.
The other is the executors, which are created by the AM and run the job; for MapReduce jobs they carry out the map and reduce operations in parallel. Note that YARN can run far more than MapReduce; in the end, a MapReduce program is just one kind of application submitted to the YARN cluster.
Both run inside containers on the slave nodes. Each slave node runs a NodeManager daemon, which is responsible for creating containers on that node. The whole cluster is managed by the ResourceManager, which schedules container allocation on all slave nodes according to capacity requirements and current demand.
The official documentation explains these settings as follows:
- How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node. This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.
- How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail, and it is always allocated as a multiple of the minimum amount of RAM. Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.
- How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit within the container maximum size. This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.
- How much memory will be allocated to each map or reduce operation. This should be less than the maximum size. This is configured in mapred-site.xml with the properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
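Plugging in the sample values from the basic configuration above, a quick back-of-the-envelope check shows how the numbers fit together (a sketch only; real scheduling also accounts for vcores and rounds requests up to a multiple of the minimum allocation):

```shell
# Values from the basic configuration above.
NM_MEM=1536      # yarn.nodemanager.resource.memory-mb
MIN_ALLOC=128    # yarn.scheduler.minimum-allocation-mb
AM_MEM=512       # yarn.app.mapreduce.am.resource.mb
MAP_MEM=256      # mapreduce.map.memory.mb

# At most this many minimum-size containers fit on one node:
echo $((NM_MEM / MIN_ALLOC))             # prints 12

# Once the AM container is placed, memory left for map/reduce tasks:
echo $(((NM_MEM - AM_MEM) / MAP_MEM))    # prints 4
```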
Running the cluster
Now we can start the cluster. As with a new disk, HDFS must be formatted before first use:
hdfs namenode -format
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
Common problems
See: https://blog.csdn.net/baidu_16757561/article/details/53698746