2024年大数据最全大数据集群搭建之Linux安装hadoop3，细节决定成败

2401_84591746

于 2024-05-09 19:01:03 发布

阅读量350

点赞数 5

分类专栏：程序员文章标签：大数据面试学习

本文链接：https://blog.csdn.net/2401_84591746/article/details/138626873

版权

程序员专栏收录该内容

60 篇文章 0 订阅

订阅专栏

既有适合小白学习的零基础资料，也有适合3年以上经验的小伙伴深入学习提升的进阶课程，涵盖了95%以上大数据知识点，真正体系化！

由于文件比较多，这里只是将部分目录截图出来，全套包含大厂面经、学习笔记、源码讲义、实战项目、大纲路线、讲解视频，并且后续会持续更新

需要这份系统化资料的朋友，可以戳这里获取

dfs.blocksize

HDFS blocksize of 128MB for large file-systems

dfs.namenode.handler.count

100

More NameNode server threads to handle RPCs from large number of DataNodes.

dfs.namenode.shared.edits.dir

qjournal://hadoop001:8485;hadoop002:8485;hadoop003:8485/ns1

dfs.ha.fencing.methods

sshfence

dfs.ha.fencing.ssh.private-key-files

/root/.ssh/id_rsa

mapred-site.xml

<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

mapreduce.framework.name

yarn

Execution framework set to Hadoop YARN.

mapreduce.map.memory.mb

4096

Larger resource limit for maps.

mapreduce.map.java.opts

-Xmx4096M

Larger heap-size for child jvms of maps.

mapreduce.reduce.memory.mb

4096

Larger resource limit for reduces.

mapreduce.reduce.java.opts

-Xmx4096M

Larger heap-size for child jvms of reduces.

mapreduce.task.io.sort.mb

2040

Higher memory-limit while sorting data for efficiency.

mapreduce.task.io.sort.factor

400

More streams merged at once while sorting files.

mapreduce.reduce.shuffle.parallelcopies

200

Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.

mapreduce.jobhistory.address

hadoop001:10020

MapReduce JobHistory Server host:port.Default port is 10020

mapreduce.jobhistory.webapp.address

hadoop001:19888

MapReduce JobHistory Server Web UI host:port.Default port is 19888.

mapreduce.jobhistory.intermediate-done-dir

/tmp/mr-history/tmp

Directory where history files are written by MapReduce jobs.

mapreduce.jobhistory.done-dir

/tmp/mr-history/done

Directory where history files are managed by the MR JobHistory Server.

yarn-site.xml

<?xml version="1.0"?>

yarn.resourcemanager.ha.enabled

true

yarn.resourcemanager.ha.automatic-failover.enabled

true

yarn.resourcemanager.ha.automatic-failover.embedded

true

yarn.resourcemanager.cluster-id

yarn-rm-cluster

yarn.resourcemanager.ha.rm-ids

rm1,rm2

yarn.resourcemanager.hostname.rm1

hadoop001

yarn.resourcemanager.hostname.rm2

hadoop002

yarn.resourcemanager.recovery.enabled

true

yarn.resourcemanager.zk.state-store.address

hadoop001:2181,hadoop002:2181,hadoop003:2181

yarn.resourcemanager.zk-address

hadoop001:2181,hadoop002:2181,hadoop003:2181

yarn.resourcemanager.address.rm1

hadoop001:8032

yarn.resourcemanager.address.rm2

hadoop002:8032

yarn.resourcemanager.scheduler.address.rm1

hadoop001:8034

yarn.resourcemanager.webapp.address.rm1

hadoop001:8088

yarn.resourcemanager.scheduler.address.rm2

hadoop002:8034

yarn.resourcemanager.webapp.address.rm2

hadoop002:8088

yarn.acl.enable

true

Enable ACLs? Defaults to false.

yarn.admin.acl

yarn.log-aggregation-enable

false

Configuration to enable or disable log aggregation

yarn.resourcemanager.hostname

hadoop001

host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components.

yarn.scheduler.minimum-allocation-mb

1024

saprk调度时一个container能够申请的最小资源，默认值为1024MB

yarn.scheduler.maximum-allocation-mb

28672

saprk调度时一个container能够申请的最大资源，默认值为8192MB

yarn.nodemanager.resource.memory-mb

28672

nodemanager能够申请的最大内存，默认值为8192MB

yarn.app.mapreduce.am.resource.mb

28672

AM能够申请的最大内存，默认值为1536MB

yarn.nodemanager.log.retain-seconds

10800

yarn.nodemanager.log-dirs

/home/cluster/yarn/log/1,/home/cluster/yarn/log/2,/home/cluster/yarn/log/3

yarn.nodemanager.aux-services

mapreduce_shuffle

Shuffle service that needs to be set for Map Reduce applications.

yarn.log-aggregation.retain-seconds

-1

yarn.log-aggregation.retain-check-interval-seconds

-1

yarn.app.mapreduce.am.staging-dir

hdfs://ns1/tmp/hadoop-yarn/staging

The staging dir used while submitting jobs.

yarn.application.classpath

/usr/local/hadoop/hadoop/etc/hadoop:/usr/local/hadoop/hadoop/share/hadoop/common/lib/:/usr/local/hadoop/hadoop/share/hadoop/common/:/usr/local/hadoop/hadoop/share/hadoop/hdfs:/usr/local/hadoop/hadoop/share/hadoop/hdfs/lib/:/usr/local/hadoop/hadoop/share/hadoop/hdfs/:/usr/local/hadoop/hadoop/share/hadoop/mapreduce/:/usr/local/hadoop/hadoop/share/hadoop/yarn:/usr/local/hadoop/hadoop/share/hadoop/yarn/lib/:/usr/local/hadoop/hadoop/share/hadoop/yarn/*

Linux上打 hadoop classpath 找到的所有路径

五、初始化集群

1、启动zookeeper

由于hadoop的HA机制依赖于zookeeper，因此先启动zookeeper集群

如果zookeeper集群没有搭建参考：大数据高可用技术之zookeeper3.4.5安装配置_qq262593421的博客-CSDN博客

zkServer.sh start

zkServer.sh status

2、在zookeeper中初始化元数据

hdfs zkfc -formatZK

3、启动zkfc

hdfs --daemon start zkfc

4、启动JournalNode

格式化NameNode前必须先格式化JournalNode，否则格式化失败

这里配置了3个JournalNode节点，hadoop001、hadoop002、hadoop003

hdfs --daemon start journalnode

5、格式化NameNode

在第一台NameNode节点上执行

hdfs namenode -format

6、启动hdfs

start-all.sh

7、同步备份NameNode

等hdfs初始化完成之后（20秒），在另一台NameNode上执行

hdfs namenode -bootstrapStandby

如果格式化失败或者出现以下错误，把对应节点上的目录删掉再重新格式化

Directory is in an inconsistent state: Can’t format the storage directory because the current directory is not empty.

rm -rf /home/cluster/hadoop/data/jn/ns1/*

hdfs namenode -format

8、启动备份NameNode

同步之后，需要在另一台NameNode节点上启动NameNode进程

hdfs --daemon start namenode

9、查看集群状态

hadoop dfsadmin -report

10、访问集群

http://hadoop001:50070/

http://hadoop002:50070/

六、集群高可用测试

1、停止Active状态的NameNode

在active状态上的NameNode执行（hadoop1）

hdfs --daemon stop namenode

2、查看standby状态的NameNode

http://hadoop002:50070/ 可以看到，hadoop2从standby变成了active状态

3、重启启动停止的NameNode

停止之后，浏览器无法访问，重启恢复

既有适合小白学习的零基础资料，也有适合3年以上经验的小伙伴深入学习提升的进阶课程，涵盖了95%以上大数据知识点，真正体系化！

由于文件比较多，这里只是将部分目录截图出来，全套包含大厂面经、学习笔记、源码讲义、实战项目、大纲路线、讲解视频，并且后续会持续更新

需要这份系统化资料的朋友，可以戳这里获取

tive状态上的NameNode执行（hadoop1）

hdfs --daemon stop namenode

2、查看standby状态的NameNode

http://hadoop002:50070/ 可以看到，hadoop2从standby变成了active状态

3、重启启动停止的NameNode

停止之后，浏览器无法访问，重启恢复

[外链图片转存中…(img-5qLqbcmg-1715252432505)]
[外链图片转存中…(img-8Wc1nJlS-1715252432505)]
[外链图片转存中…(img-IESR6U9O-1715252432506)]

既有适合小白学习的零基础资料，也有适合3年以上经验的小伙伴深入学习提升的进阶课程，涵盖了95%以上大数据知识点，真正体系化！

由于文件比较多，这里只是将部分目录截图出来，全套包含大厂面经、学习笔记、源码讲义、实战项目、大纲路线、讲解视频，并且后续会持续更新

需要这份系统化资料的朋友，可以戳这里获取

2401_84591746

关注

5
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
2024年大数据最全大数据集群搭建之Linux安装hadoop3，细节决定成败

100这是配置提供共享编辑存储的journalnode地址的地方，这些地址由活动nameNode写入，由备用nameNode读取，以便与活动nameNode所做的所有文件系统更改保持最新。虽然必须指定几个JournalNode地址，但是应该只配置其中一个uri。URI的形式应该是:qjournal://*host1:port1*;
复制链接

扫一扫