Installing a Hadoop pseudo-cluster, and the basic concepts.

Overview

A pseudo-cluster means we install Hadoop across several machines but without high availability or fault tolerance. This is suitable for a development environment.

First, download the Hadoop package. I am using the CDH release, version 5.14.0, which you can find on the CDH download page.
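
Getting it onto the machine looks something like this (a sketch: the tarball name follows Cloudera's archive naming for CDH 5.14.0, which packages Hadoop 2.6.0, and /opt is just an example install path):

tar -xzf hadoop-2.6.0-cdh5.14.0.tar.gz -C /opt          # unpack the CDH tarball
export HADOOP_HOME=/opt/hadoop-2.6.0-cdh5.14.0          # point HADOOP_HOME at the unpacked tree
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin    # expose hdfs/yarn and the start-*.sh scripts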

First, a word about how Hadoop's configuration files are organized:

Hadoop has two types of configuration files.

One type is the read-only defaults: core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml.

The other is the site-specific configuration we supply ourselves: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml, and conf/mapred-site.xml.

In addition, two scripts can be configured: conf/hadoop-env.sh and conf/yarn-env.sh.

Detailed configuration

Let's start with conf/hadoop-env.sh and conf/yarn-env.sh.

Step 1: set JAVA_HOME.

Step 2 (optional): administrators can tune individual daemons through per-daemon environment variables:

Daemon                         | Environment Variable
NameNode                       | HADOOP_NAMENODE_OPTS
DataNode                       | HADOOP_DATANODE_OPTS
Secondary NameNode             | HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager                | YARN_RESOURCEMANAGER_OPTS
NodeManager                    | YARN_NODEMANAGER_OPTS
WebAppProxy                    | YARN_PROXYSERVER_OPTS
Map Reduce Job History Server  | HADOOP_JOB_HISTORYSERVER_OPTS

For example, to have the NameNode use the parallel garbage collector:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Some other useful settings:

HADOOP_LOG_DIR / YARN_LOG_DIR - the directories where the daemons' log files are stored. They are created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size to use, in MB.

Daemon                         | Environment Variable
ResourceManager                | YARN_RESOURCEMANAGER_HEAPSIZE
NodeManager                    | YARN_NODEMANAGER_HEAPSIZE
WebAppProxy                    | YARN_PROXYSERVER_HEAPSIZE
Map Reduce Job History Server  | HADOOP_JOB_HISTORYSERVER_HEAPSIZE
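
For instance, hadoop-env.sh and yarn-env.sh might end up with lines like these (the paths and sizes are placeholders, not recommendations):

# hadoop-env.sh
export HADOOP_LOG_DIR=/var/log/hadoop       # daemon log directory; created automatically if missing
export HADOOP_HEAPSIZE=1000                 # maximum daemon heap, in MB

# yarn-env.sh
export YARN_RESOURCEMANAGER_HEAPSIZE=1000   # per-daemon override for the ResourceManager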

 

Now we can move on to configuring the daemons themselves:

conf/core-site.xml

Parameter            | Value        | Notes
fs.defaultFS         | NameNode URI | hdfs://host:port/
io.file.buffer.size  | 131072       | Size of read/write buffer used in SequenceFiles.

fs.defaultFS: the URI through which clients reach the distributed filesystem.
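
Once core-site.xml is in place, you can check what value the client actually resolves (hdfs getconf is part of the stock HDFS CLI):

hdfs getconf -confKey fs.defaultFS    # prints e.g. hdfs://node-master:9000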

conf/hdfs-site.xml

  • Configurations for NameNode:

    Parameter              | Value | Notes
    dfs.namenode.name.dir  | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.

  • Configurations for DataNode:

    Parameter              | Value | Notes
    dfs.datanode.data.dir  | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
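
To illustrate the comma-delimited form, a DataNode with two data disks could be configured like this (the /data/... mount points are hypothetical):

<property>
    <name>dfs.datanode.data.dir</name>
    <!-- two hypothetical mount points; blocks are spread across both -->
    <value>/data/1/dfs/dn,/data/2/dfs/dn</value>
</property>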

 

conf/yarn-site.xml

  • Configurations for ResourceManager and NodeManager:

    Parameter                    | Value        | Notes
    yarn.acl.enable              | true / false | Enable ACLs? Defaults to false.
    yarn.admin.acl               | Admin ACL    | ACL to set admins on the cluster. ACLs take the form comma-separated-users space comma-separated-groups. Defaults to the special value of *, which means anyone. The special value of just a space means no one has access.
    yarn.log-aggregation-enable  | false        | Configuration to enable or disable log aggregation.

    • Configurations for ResourceManager:

      Parameter | Value | Notes
      yarn.resourcemanager.address | ResourceManager host:port for clients to submit jobs. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.scheduler.address | ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.resource-tracker.address | ResourceManager host:port for NodeManagers. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.admin.address | ResourceManager host:port for administrative commands. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.webapp.address | ResourceManager web-ui host:port. | host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.hostname | ResourceManager host. | host. A single hostname that can be set in place of setting all the yarn.resourcemanager*address resources. Results in default ports for the ResourceManager components.
      yarn.resourcemanager.scheduler.class | ResourceManager Scheduler class. | CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler.
      yarn.scheduler.minimum-allocation-mb | Minimum limit of memory to allocate to each container request at the ResourceManager. | In MBs.
      yarn.scheduler.maximum-allocation-mb | Maximum limit of memory to allocate to each container request at the ResourceManager. | In MBs.
      yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path | List of permitted/excluded NodeManagers. | If necessary, use these files to control the list of allowable NodeManagers.
    • Configurations for NodeManager:

      Parameter | Value | Notes
      yarn.nodemanager.resource.memory-mb | Resource, i.e. available physical memory in MB, for a given NodeManager. | Defines the total resources on the NodeManager made available to running containers.
      yarn.nodemanager.vmem-pmem-ratio | Maximum ratio by which the virtual memory usage of tasks may exceed physical memory. | The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
      yarn.nodemanager.local-dirs | Comma-separated list of paths on the local filesystem where intermediate data is written. | Multiple paths help spread disk i/o.
      yarn.nodemanager.log-dirs | Comma-separated list of paths on the local filesystem where logs are written. | Multiple paths help spread disk i/o.
      yarn.nodemanager.log.retain-seconds | 10800 | Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled.
      yarn.nodemanager.remote-app-log-dir | /logs | HDFS directory where application logs are moved on application completion. Appropriate permissions need to be set. Only applicable if log-aggregation is enabled.
      yarn.nodemanager.remote-app-log-dir-suffix | logs | Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. Only applicable if log-aggregation is enabled.
      yarn.nodemanager.aux-services | mapreduce_shuffle | Shuffle service that needs to be set for MapReduce applications.
    • Configurations for the History Server (log aggregation):

      Parameter | Value | Notes
      yarn.log-aggregation.retain-seconds | -1 | How long to keep aggregated logs before deleting them. -1 disables deletion. Be careful: set this too small and you will spam the name node.
      yarn.log-aggregation.retain-check-interval-seconds | -1 | Time between checks for aggregated log retention. If set to 0 or a negative value, the value is computed as one-tenth of the aggregated log retention time. Be careful: set this too small and you will spam the name node.
conf/mapred-site.xml

    • Configurations for MapReduce Applications:

      Parameter | Value | Notes
      mapreduce.framework.name | yarn | Execution framework set to Hadoop YARN.
      mapreduce.map.memory.mb | 1536 | Larger resource limit for maps.
      mapreduce.map.java.opts | -Xmx1024M | Larger heap size for child JVMs of maps.
      mapreduce.reduce.memory.mb | 3072 | Larger resource limit for reduces.
      mapreduce.reduce.java.opts | -Xmx2560M | Larger heap size for child JVMs of reduces.
      mapreduce.task.io.sort.mb | 512 | Higher memory limit while sorting data, for efficiency.
      mapreduce.task.io.sort.factor | 100 | More streams merged at once while sorting files.
      mapreduce.reduce.shuffle.parallelcopies | 50 | Higher number of parallel copies run by reduces to fetch outputs from a very large number of maps.

    • Configurations for the MapReduce JobHistory Server:

      Parameter | Value | Notes
      mapreduce.jobhistory.address | MapReduce JobHistory Server host:port | Default port is 10020.
      mapreduce.jobhistory.webapp.address | MapReduce JobHistory Server Web UI host:port | Default port is 19888.
      mapreduce.jobhistory.intermediate-done-dir | /mr-history/tmp | Directory where history files are written by MapReduce jobs.
      mapreduce.jobhistory.done-dir | /mr-history/done | Directory where history files are managed by the MR JobHistory Server.
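
If you run the JobHistory Server, it has its own daemon script in Hadoop 2.x; run it on the node that should host it:

mr-jobhistory-daemon.sh start historyserver    # web UI listens on port 19888 by default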

Basic configuration

My job here is to present the options; how you set them is up to you. Below is a basic configuration that produces a working environment:

# hadoop-env.sh: configure JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <!-- fs.defaultFS is the current name for the deprecated fs.default.name -->
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/hadoop/data/nameNode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/hadoop/data/dataNode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>1</value>
    </property>
</configuration>
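
It does no harm to pre-create the storage directories (paths taken from the config above; this assumes you run the daemons as a hadoop user who can write there):

mkdir -p /home/hadoop/data/nameNode /home/hadoop/data/dataNode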

 mapred-site.xml

<configuration>
    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>

    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>

    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
            <name>yarn.acl.enable</name>
            <value>false</value>
    </property>

    
    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
    </property>

    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
    </property>

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
    </property>

    <property>
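        <!-- note: the virtual-memory check often kills otherwise healthy containers
             on small development machines, which is why it is disabled here -->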
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>

slaves 

node01
node02
node03

Explanation

slaves

A quick note on what the slaves configuration file does:

Typically you pick one machine in the cluster to act as the NameNode and one to act as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are referred to as slaves. This file lists the hostnames of the nodes that will run a DataNode and a NodeManager; the start scripts log in to each of them over SSH, as shown below.
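
Because start-dfs.sh and start-yarn.sh SSH into every host listed in slaves, the master needs passwordless SSH to each of them (a sketch; the hadoop user and the node01-node03 hostnames come from this article's examples):

ssh-keygen -t rsa -b 4096     # on the master, if no key exists yet
ssh-copy-id hadoop@node01     # repeat for node02 and node03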

Memory

There are also the memory-related settings in yarn-site.xml and in mapred-site.xml:

Two types of task run on a YARN cluster:

One is the ApplicationMaster (AM), which is responsible for monitoring the application and coordinating the distributed executors across the cluster.

The other is the executors, which are created by the AM and run the actual job; for MapReduce jobs, they carry out the map and reduce operations in parallel. Note that YARN can run far more than MapReduce: at bottom, a MapReduce program is just one kind of application submitted to the YARN cluster.

Both run inside containers on the slave nodes. Each slave node runs a NodeManager daemon, which is responsible for creating containers on that node. The whole cluster is managed by a ResourceManager, which schedules container allocation across all the slave nodes according to capacity requirements and current demand.

All of that is easier to see in a picture:

[Figure: Hadoop YARN memory allocation]

The official explanation is as follows:

  1. How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.

    This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.

  2. How much memory a single container can consume, and the minimum memory allocation allowed. A container will never be bigger than the maximum, or else allocation will fail; memory is always allocated as a multiple of the minimum amount of RAM.

    Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.

  3. How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size.

    This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.

  4. How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.

    This is configured in mapred-site.xml with properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
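
Plugging in the sample values from the configs above: a node offers 1536 MB to containers (yarn.nodemanager.resource.memory-mb), the AM takes one 512 MB container (yarn.app.mapreduce.am.resource.mb), and each map or reduce task takes a 256 MB container, so a single node can host the AM plus four tasks: 512 + 4 × 256 = 1536 MB. Both 512 and 256 are multiples of the 128 MB minimum allocation and below the 1536 MB maximum, so every request is valid.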

Running the cluster

Now we can start the cluster. Just like a disk under an operating system, HDFS must be formatted before its first use:

hdfs namenode -format

Start HDFS:

start-dfs.sh

Start YARN:

start-yarn.sh
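
Once both scripts return, it is worth checking that the daemons actually came up (jps ships with the JDK; the other two commands are part of the standard Hadoop CLI):

jps                      # should list NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager
hdfs dfsadmin -report    # shows the DataNodes registered with the NameNode
yarn node -list          # shows the NodeManagers registered with the ResourceManager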

Common problems

See: https://blog.csdn.net/baidu_16757561/article/details/53698746
