Overview
A pseudo-cluster means installing Hadoop across several machines without high availability or fault tolerance; this is suitable for development environments.
First we download the Hadoop installation package; I am using the CDH 5.14.0 release, which can be downloaded from Cloudera.
First, a word about how Hadoop's configuration files are organized:
Hadoop has two types of configuration files.
One type is the read-only defaults: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
The other type is the site-specific files that we edit ourselves: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.
In addition, we can configure two scripts: conf/hadoop-env.sh and conf/yarn-env.sh.
Detailed configuration
We start with conf/hadoop-env.sh and conf/yarn-env.sh.
Step 1: set JAVA_HOME.
Step 2 (optional): administrators can tune each daemon individually through the following environment variables:
Daemon | Environment Variable |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
Secondary NameNode | HADOOP_SECONDARYNAMENODE_OPTS |
ResourceManager | YARN_RESOURCEMANAGER_OPTS |
NodeManager | YARN_NODEMANAGER_OPTS |
WebAppProxy | YARN_PROXYSERVER_OPTS |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_OPTS |
For example, we can make the NameNode use the parallel garbage collector:
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Some other useful settings:
HADOOP_LOG_DIR / YARN_LOG_DIR - the directories where the daemons write their logs; created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size, in MB.
Daemon | Environment Variable |
---|---|
ResourceManager | YARN_RESOURCEMANAGER_HEAPSIZE |
NodeManager | YARN_NODEMANAGER_HEAPSIZE |
WebAppProxy | YARN_PROXYSERVER_HEAPSIZE |
Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE |
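Putting the settings above together, the two env scripts might look like this; every path and size here is an assumption for a small development machine, not a recommendation:

```shell
# Hypothetical additions to conf/hadoop-env.sh; adjust to your machine.
export HADOOP_LOG_DIR=/var/log/hadoop   # daemon log directory (created if missing)
export HADOOP_HEAPSIZE=1024             # maximum daemon heap, in MB

# Hypothetical additions to conf/yarn-env.sh: per-daemon heap sizes, in MB.
export YARN_RESOURCEMANAGER_HEAPSIZE=1024
export YARN_NODEMANAGER_HEAPSIZE=768
```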
Now we can configure the daemons in detail:
conf/core-site.xml
Parameter | Value | Notes |
---|---|---|
fs.defaultFS | NameNode URI | hdfs://host:port/ |
io.file.buffer.size | 131072 | Size of read/write buffer used in SequenceFiles. |
fs.defaultFS: the URI through which clients access the distributed file system.
conf/hdfs-site.xml
- Configurations for NameNode:
Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
- Configurations for DataNode:
Parameter | Value | Notes |
---|---|---|
dfs.datanode.data.dir | Comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
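To make the redundancy behavior concrete, a hypothetical hdfs-site.xml fragment with comma-delimited directories might look like this (the /data/1 and /data/2 paths are assumptions; point them at different physical disks):

```xml
<!-- NameNode metadata is replicated into every listed directory. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn</value>
</property>
<!-- DataNode blocks are spread across the listed directories. -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>
```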
conf/yarn-site.xml
- Configurations for ResourceManager and NodeManager:
Parameter | Value | Notes |
---|---|---|
yarn.acl.enable | true / false | Enable ACLs? Defaults to false. |
yarn.admin.acl | Admin ACL | ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access. |
yarn.log-aggregation-enable | false | Configuration to enable or disable log aggregation. |
- Configurations for ResourceManager:
Parameter | Value | Notes |
---|---|---|
yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname. |
yarn.resourcemanager.hostname | ResourceManager host. | host. Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components. |
yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. |
yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the Resource Manager. | In MBs. |
yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the Resource Manager. | In MBs. |
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers. |

- Configurations for NodeManager:

Parameter | Value | Notes |
---|---|---|
yarn.nodemanager.resource.memory-mb | Available physical memory, in MB, for the given NodeManager. | Defines the total resources on the NodeManager made available to running containers. |
yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio, and the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio. |
yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o. |
yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. |
yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where the application logs are moved on application completion. Needs appropriate permissions. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled. |
yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for Map Reduce applications. |

- Configurations for History Server (needs to be moved elsewhere):

Parameter | Value | Notes |
---|---|---|
yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them. -1 disables. Be careful: set this too small and you will spam the name node. |
yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value, the value is computed as one-tenth of the aggregated log retention time. Be careful: set this too small and you will spam the name node. |
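For instance, to turn log aggregation on and keep aggregated logs for a week, a yarn-site.xml fragment could look like this (the retention period is an arbitrary example, not a recommendation):

```xml
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<!-- 604800 seconds = 7 days; pick a value that fits your needs. -->
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604800</value>
</property>
```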
conf/mapred-site.xml
- Configurations for MapReduce Applications:
Parameter | Value | Notes |
---|---|---|
mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN. |
mapreduce.map.memory.mb | 1536 | Larger resource limit for maps. |
mapreduce.map.java.opts | -Xmx1024M | Larger heap-size for child jvms of maps. |
mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces. |
mapreduce.reduce.java.opts | -Xmx2560M | Larger heap-size for child jvms of reduces. |
mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency. |
mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files. |
mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps. |

- Configurations for MapReduce JobHistory Server:

Parameter | Value | Notes |
---|---|---|
mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020. |
mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888. |
mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs. |
mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server. |
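As a sketch, pointing the JobHistory Server at a host named node-master (a hostname assumed here, matching the one used later in this guide) would look like this in mapred-site.xml:

```xml
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>node-master:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>node-master:19888</value>
</property>
```

In Hadoop 2.x the server itself is started with mr-jobhistory-daemon.sh start historyserver.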
Basic configuration
The sections above describe the available options; how you tune them is up to you. Below is a basic configuration that yields a working environment.
hadoop-env.sh (set JAVA_HOME):
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node-master:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/hadoop/data/nameNode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/hadoop/data/dataNode</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>512</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>256</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>256</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.acl.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
slaves
node01
node02
node03
Explanation
slaves
A word about what this configuration file does:
Typically, you pick one machine in the cluster as the NameNode and another machine as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are referred to as slaves. This file lists the hostnames of the nodes that run the DataNode and NodeManager daemons.
Memory
There are also the memory-related settings in yarn-site.xml and mapred-site.xml:
Two kinds of processes run on a YARN cluster:
One is the ApplicationMaster (AM), which monitors the application and coordinates the distributed executors across the cluster.
The other is the executors, which are created by the AM and run the job; for MapReduce jobs they carry out the map and reduce operations in parallel. Note that YARN can run far more than MapReduce; in the end, a MapReduce program is just one kind of application submitted to the YARN cluster.
Both run inside containers on the slave nodes. Each slave node runs a NodeManager daemon, which is responsible for creating containers on that node. The whole cluster is managed by the ResourceManager, which schedules container allocation on all slave nodes according to capacity requirements and current demand.
The official documentation explains these settings as follows:
- How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node. This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.
- How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail, and it is always allocated as a multiple of the minimum amount of RAM. Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.
- How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit within the container maximum size. This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.
- How much memory will be allocated to each map or reduce operation. This should be less than the maximum size. This is configured in mapred-site.xml with the properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
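Plugging in the sample values from the basic configuration above, a quick back-of-the-envelope check shows how the numbers fit together (a sketch only; real scheduling also accounts for vcores and rounds requests up to a multiple of the minimum allocation):

```shell
# Values from the basic configuration above.
NM_MEM=1536      # yarn.nodemanager.resource.memory-mb
MIN_ALLOC=128    # yarn.scheduler.minimum-allocation-mb
AM_MEM=512       # yarn.app.mapreduce.am.resource.mb
MAP_MEM=256      # mapreduce.map.memory.mb

# At most this many minimum-size containers fit on one node:
echo $((NM_MEM / MIN_ALLOC))             # prints 12

# Once the AM container is placed, memory left for map/reduce tasks:
echo $(((NM_MEM - AM_MEM) / MAP_MEM))    # prints 4
```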
Running the cluster
Now we can start the cluster. As with a new disk, HDFS must be formatted before first use:
hdfs namenode -format
Start HDFS:
start-dfs.sh
Start YARN:
start-yarn.sh
Common problems
See: https://blog.csdn.net/baidu_16757561/article/details/53698746