[综合]Apache Hadoop 2.2.0集群安装(1)[翻译]

用途

此文档描述了如何安装、配置和维护一个重大集群从几个节点到上千节点。

初次接触hadoop建议先从单节点集群开始。

 

前提

Apache 上下载了稳定的版本。

 

安装

安装hadoop集群通常需要在所有的节点上解压软件或者prm安装。

通常集群中的某一个节点被当做NameNode,其他节点作为ResourceManager,这些是主控节点。其他节点被当做DataNode和NodeManager,这些是从节点。

 

非安全模式启动Hadoop

接下来的章节将会阐述如何配置hadoop集群。

配置文件

hadoop中的配置文件有两大类型:

只读型默认配置:core-default.xmlhdfs-default.xmlyarn-default.xml and mapred-default.xml

定制化配置:conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.

此外:你可以自己操作hadoop的脚本,在bin目录下可以找到,还有一些配置的环境变量在conf/hadoop-env.sh and yarn-env.sh中。

站点配置:

配置hadoop集群你首先要配置hadoop守护进程执行的环境。

hadoop的守护进程包括NameNode/DataNode and ResourceManager/NodeManager.

hadoop守护进程环境配置

管理员需要使用conf/hadoop-env.sh and conf/yarn-env.sh脚本对hadoop守护进程做环境配置。

首先你要验证JAVA_HOME在所有的节点上是否正确

有时候你需要 HADOOP_PID_DIR and HADOOP_SECURE_DN_PID_DIR目录只能被启动守护进程的用户执行写操作。否则就会出现软连接攻击。

管理员可以利用配置项单独配置进程,配置项如下:

DaemonEnvironment Variable
NameNodeHADOOP_NAMENODE_OPTS
DataNodeHADOOP_DATANODE_OPTS
Secondary NameNodeHADOOP_SECONDARYNAMENODE_OPTS
ResourceManagerYARN_RESOURCEMANAGER_OPTS
NodeManagerYARN_NODEMANAGER_OPTS
WebAppProxyYARN_PROXYSERVER_OPTS
Map Reduce Job History ServerHADOOP_JOB_HISTORYSERVER_OPTS

如要配置Namenode 为parallelGC,那么可以添加如下到hadoop-env.sh中:

 

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

其他有用的可定制化参数包括:

 

HADOOP_LOG_DIR / YARN_LOG_DIR :进程日志目录,如果不存在会自动创建。

HADOOP_HEAPSIZE / YARN_HEAPSIZE:内存堆大小默认单位为M,如果变量设置成1000 那么堆内存会设置成1000M,默认为1000,如果你需要配置他那么你可以为每个节点单独配置。

 

DaemonEnvironment Variable
ResourceManagerYARN_RESOURCEMANAGER_HEAPSIZE
NodeManagerYARN_NODEMANAGER_HEAPSIZE
WebAppProxyYARN_PROXYSERVER_HEAPSIZE
Map Reduce Job History ServerHADOOP_JOB_HISTORYSERVER_HEAPSIZE

hadoop守护进程非安全模式配置:

此章节是比较重要的参数配置,涉及信息如下:

conf/core-site.xml

ParameterValueNotes
fs.defaultFSNameNode URIhdfs://host:port/
io.file.buffer.size131072 SequenceFiles的读/写缓冲区大小

conf/hdfs-site.xml

NameNode的配置:

ParameterValueNotes
dfs.namenode.name.dirPath on the local filesystem where the NameNode stores the namespace and transactions logs persistently.If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.
dfs.namenode.hosts /dfs.namenode.hosts.exclude List of permitted/excluded DataNodes.If necessary, use these files to control the list of allowable datanodes.
dfs.blocksize268435456HDFS blocksize of 256MB for large file-systems.
dfs.namenode.handler.count100More NameNode server threads to handle RPCs from large number of DataNodes.

DataNode配置:

ParameterValueNotes
dfs.datanode.data.dirComma separated list of paths on the local filesystem of a DataNode where it should store its blocks.If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.

 

conf/yarn-site.xml

ResourceManager和NodeManager配置:

ParameterValueNotes
yarn.acl.enable true /false Enable ACLs? Defaults to false.
yarn.admin.aclAdmin ACLACL to set admins on the cluster. ACLs are of for comma-separated-usersspacecomma-separated-groups. Defaults to special value of * which means anyone. Special value of just space means no one has access.
yarn.log-aggregation-enablefalseConfiguration to enable or disable log aggregation

ResourceManager配置:

ParameterValueNotes
yarn.resourcemanager.address ResourceManager host:port for clients to submit jobs.host:port
yarn.resourcemanager.scheduler.address ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources.host:port
yarn.resourcemanager.resource-tracker.address ResourceManager host:port for NodeManagers.host:port
yarn.resourcemanager.admin.address ResourceManager host:port for administrative commands.host:port
yarn.resourcemanager.webapp.address ResourceManager web-ui host:port.host:port
yarn.resourcemanager.scheduler.class ResourceManager Scheduler class. CapacityScheduler (recommended), FairScheduler(also recommended), or FifoScheduler
yarn.scheduler.minimum-allocation-mbMinimum limit of memory to allocate to each container request at the Resource Manager.In MBs
yarn.scheduler.maximum-allocation-mbMaximum limit of memory to allocate to each container request at the Resource Manager.In MBs
yarn.resourcemanager.nodes.include-path /yarn.resourcemanager.nodes.exclude-path List of permitted/excluded NodeManagers.If necessary, use these files to control the list of allowable NodeManagers.

NodeManager配置:

 

ParameterValueNotes
yarn.nodemanager.resource.memory-mbResource i.e. available physical memory, in MB, for givenNodeManager Defines total available resources on the NodeManager to be made available to running containers
yarn.nodemanager.vmem-pmem-ratioMaximum ratio by which virtual memory usage of tasks may exceed physical memoryThe virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
yarn.nodemanager.local-dirsComma-separated list of paths on the local filesystem where intermediate data is written.Multiple paths help spread disk i/o.
yarn.nodemanager.log-dirsComma-separated list of paths on the local filesystem where logs are written.Multiple paths help spread disk i/o.
yarn.nodemanager.log.retain-seconds10800Default time (in seconds) to retain log files on the NodeManager Only applicable if log-aggregation is disabled.
yarn.nodemanager.remote-app-log-dir/logsHDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffixlogsSuffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam} Only applicable if log-aggregation is enabled.
yarn.nodemanager.aux-servicesmapreduce_shuffleShuffle service that needs to be set for Map Reduce applications.

运行历史配置:

ParameterValueNotes
yarn.log-aggregation.retain-seconds-1How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node.
yarn.log-aggregation.retain-check-interval-seconds-1Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node.

 

conf/mapred-site.xml

MapReduce应用配置:

ParameterValueNotes
mapreduce.framework.nameyarnExecution framework set to Hadoop YARN.
mapreduce.map.memory.mb1536Larger resource limit for maps.
mapreduce.map.java.opts-Xmx1024MLarger heap-size for child jvms of maps.
mapreduce.reduce.memory.mb3072Larger resource limit for reduces.
mapreduce.reduce.java.opts-Xmx2560MLarger heap-size for child jvms of reduces.
mapreduce.task.io.sort.mb512Higher memory-limit while sorting data for efficiency.
mapreduce.task.io.sort.factor100More streams merged at once while sorting files.
mapreduce.reduce.shuffle.parallelcopies50Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.

MapReduce 执行历史服务配置:

ParameterValueNotes
mapreduce.jobhistory.addressMapReduce JobHistory Server host:port Default port is 10020.
mapreduce.jobhistory.webapp.addressMapReduce JobHistory Server Web UIhost:port Default port is 19888.
mapreduce.jobhistory.intermediate-done-dir/mr-history/tmpDirectory where history files are written by MapReduce jobs.
mapreduce.jobhistory.done-dir/mr-history/doneDirectory where history files are managed by the MR JobHistory Server.

 

Hadoop机架感知

HDFS和YARN服务可机架感知的

NameNode 和ResourceManager通过调用api来获取集群中每个从节点的机架信息。

api以dns名称(或ip)作为一个机架id

这个模块也是可配置的,通过topology.node.switch.mapping.impl来配置,可以通过命令行参数topology.script.file.name来配置,如果topology.script.file.name没有配置那么默认其ip为机架id。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值