Setting Up a Hadoop Pseudo-Distributed Cluster with Docker Containers
1. Write the docker-compose.yaml file that defines the cluster
version: "3"
services:
  namenode:
    image: apache/hadoop:3.3.6
    hostname: namenode
    command: ["hdfs", "namenode"]
    ports:
      - 9870:9870
    env_file:
      - ./config
    environment:
      ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
  datanode:
    image: apache/hadoop:3.3.6
    command: ["hdfs", "datanode"]
    env_file:
      - ./config
  resourcemanager:
    image: apache/hadoop:3.3.6
    hostname: resourcemanager
    command: ["yarn", "resourcemanager"]
    ports:
      - 8088:8088
    env_file:
      - ./config
    volumes:
      - ./test.sh:/opt/test.sh
  nodemanager:
    image: apache/hadoop:3.3.6
    command: ["yarn", "nodemanager"]
    env_file:
      - ./config
Notes on the compose file
hdfs namenode - starts the HDFS NameNode. The NameNode is a key HDFS component that manages the file system namespace and metadata.
hdfs datanode - starts the HDFS DataNode. DataNodes store and serve the actual data blocks.
yarn resourcemanager - starts the YARN ResourceManager, the core YARN component responsible for cluster resource allocation and job scheduling.
yarn nodemanager - starts the YARN NodeManager. A NodeManager runs on every worker node and handles launching, stopping, and reporting the status of containers.
ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name" points the image at the directory where the NameNode keeps its metadata, and lets the container make sure that directory is initialized before the NameNode starts, as sketched below.
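Without an initialized metadata directory the NameNode will not start, so the first run has to format it. The shell sketch below illustrates the idea behind ENSURE_NAMENODE_DIR; it is an assumption about what the image's entrypoint does based on the variable name, not the image's actual entrypoint code:

# Format the NameNode once if the metadata directory named by
# ENSURE_NAMENODE_DIR does not exist yet; later starts skip this step.
if [ -n "$ENSURE_NAMENODE_DIR" ] && [ ! -d "$ENSURE_NAMENODE_DIR" ]; then
  hdfs namenode -format -nonInteractive
fi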
2. The config env file
HADOOP_HOME=/opt/hadoop
CORE-SITE.XML_fs.default.name=hdfs://namenode
CORE-SITE.XML_fs.defaultFS=hdfs://namenode
HDFS-SITE.XML_dfs.namenode.rpc-address=namenode:8020
HDFS-SITE.XML_dfs.replication=1
MAPRED-SITE.XML_mapreduce.framework.name=yarn
MAPRED-SITE.XML_yarn.app.mapreduce.am.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.map.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
MAPRED-SITE.XML_mapreduce.reduce.env=HADOOP_MAPRED_HOME=$HADOOP_HOME
YARN-SITE.XML_yarn.resourcemanager.hostname=resourcemanager
YARN-SITE.XML_yarn.nodemanager.pmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.delete.debug-delay-sec=600
YARN-SITE.XML_yarn.nodemanager.vmem-check-enabled=false
YARN-SITE.XML_yarn.nodemanager.aux-services=mapreduce_shuffle
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-applications=10000
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.maximum-am-resource-percent=0.1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.queues=default
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.user-limit-factor=1
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.maximum-capacity=100
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.state=RUNNING
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_submit_applications=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.root.default.acl_administer_queue=*
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.node-locality-delay=40
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings=
CAPACITY-SCHEDULER.XML_yarn.scheduler.capacity.queue-mappings-override.enable=false
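Each KEY-FILE.XML_property=value line in this file is rendered by the apache/hadoop image's startup scripts into a <property> entry in the matching XML file under /opt/hadoop/etc/hadoop. Once the cluster is running (next section), that can be spot-checked from the host; the command below is only an illustrative check of one property:

# Confirm one env entry was written into core-site.xml inside the namenode container
docker-compose exec namenode grep -A 1 "fs.defaultFS" /opt/hadoop/etc/hadoop/core-site.xml
# The output should include: <value>hdfs://namenode</value>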
3. Start the containers
Hadoop-workspace % docker-compose up -d
[+] Running 5/5
⠿ Network hadoop-workspace_default Created 0.0s
⠿ Container hadoop-workspace-namenode-1 Started 0.9s
⠿ Container hadoop-workspace-nodemanager-1 Started 0.9s
⠿ Container hadoop-workspace-resourcemanager-1 Started 0.9s
⠿ Container hadoop-workspace-datanode-1 Started 0.7s
Once the containers have finished initializing, open a shell in the namenode container (the container name comes from the docker-compose output above):
docker exec -it hadoop-workspace-namenode-1 /bin/bash
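Before running a job, a few standard Hadoop CLI commands from inside the container can confirm that the DataNode and NodeManager registered correctly; this is an optional sanity check:

# One live DataNode should be reported
hdfs dfsadmin -report | grep "Live datanodes"
# The single NodeManager should be listed as RUNNING
yarn node -list
# HDFS should accept writes
hdfs dfs -mkdir -p /tmp/smoke && hdfs dfs -ls /tmp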
4. Run a test MapReduce job
Write a mapred-site.xml for the namenode. It repeats the MAPRED-SITE.XML_* entries from the config env file, but with HADOOP_MAPRED_HOME written as the literal path /opt/hadoop rather than the $HADOOP_HOME reference:
<configuration>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
  </property>
</configuration>
Copy the configuration file into the namenode container (the long string is the container ID reported by docker ps):
Hadoop-workspace % docker cp ./mapred-site.xml 1dbbd393fac19275547ba4d810cd7e7952bf594bb581c594f31e38300e795fcf:/opt/hadoop/etc/hadoop
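To confirm the file landed where the Hadoop client reads its configuration, a quick check such as the following can be run, using the container name from the docker-compose output above:

# Print the copied file from inside the namenode container
docker exec hadoop-workspace-namenode-1 cat /opt/hadoop/etc/hadoop/mapred-site.xml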
Submit the bundled pi example to test MapReduce (the examples jar ships under $HADOOP_HOME/share/hadoop/mapreduce):
bash-4.2$ yarn jar hadoop-mapreduce-examples-3.3.6.jar pi 10 15
Number of Maps = 10
Samples per Map = 15
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
2023-07-10 01:43:15 INFO DefaultNoHARMFailoverProxyProvider:64 - Connecting to ResourceManager at resourcemanager/172.19.0.5:8032
2023-07-10 01:43:15 INFO JobResourceUploader:907 - Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/hadoop/.staging/job_1688952567349_0001
2023-07-10 01:43:15 INFO FileInputFormat:300 - Total input files to process : 10
2023-07-10 01:43:16 INFO JobSubmitter:202 - number of splits:10
2023-07-10 01:43:16 INFO JobSubmitter:298 - Submitting tokens for job: job_1688952567349_0001
2023-07-10 01:43:16 INFO JobSubmitter:299 - Executing with tokens: []
2023-07-10 01:43:16 INFO Configuration:2854 - resource-types.xml not found
2023-07-10 01:43:16 INFO ResourceUtils:476 - Unable to find 'resource-types.xml'.
2023-07-10 01:43:17 INFO YarnClientImpl:338 - Submitted application application_1688952567349_0001
2023-07-10 01:43:17 INFO Job:1682 - The url to track the job: http://resourcemanager:8088/proxy/application_1688952567349_0001/
2023-07-10 01:43:17 INFO Job:1727 - Running job: job_1688952567349_0001
2023-07-10 01:43:25 INFO Job:1748 - Job job_1688952567349_0001 running in uber mode : false
2023-07-10 01:43:25 INFO Job:1755 - map 0% reduce 0%
2023-07-10 01:43:31 INFO Job:1755 - map 10% reduce 0%
2023-07-10 01:43:32 INFO Job:1755 - map 20% reduce 0%
2023-07-10 01:43:34 INFO Job:1755 - map 30% reduce 0%
2023-07-10 01:43:36 INFO Job:1755 - map 40% reduce 0%
2023-07-10 01:43:38 INFO Job:1755 - map 50% reduce 0%
2023-07-10 01:43:40 INFO Job:1755 - map 80% reduce 0%
2023-07-10 01:43:43 INFO Job:1755 - map 100% reduce 0%
2023-07-10 01:43:44 INFO Job:1755 - map 100% reduce 100%
2023-07-10 01:43:45 INFO Job:1766 - Job job_1688952567349_0001 completed successfully
2023-07-10 01:43:45 INFO Job:1773 - Counters: 54
File System Counters
FILE: Number of bytes read=226
FILE: Number of bytes written=3045185
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2600
HDFS: Number of bytes written=215
HDFS: Number of read operations=45
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=10
Launched reduce tasks=1
Rack-local map tasks=10
Total time spent by all maps in occupied slots (ms)=50140
Total time spent by all reduces in occupied slots (ms)=9631
Total time spent by all map tasks (ms)=50140
Total time spent by all reduce tasks (ms)=9631
Total vcore-milliseconds taken by all map tasks=50140
Total vcore-milliseconds taken by all reduce tasks=9631
Total megabyte-milliseconds taken by all map tasks=51343360
Total megabyte-milliseconds taken by all reduce tasks=9862144
Map-Reduce Framework
Map input records=10
Map output records=20
Map output bytes=180
Map output materialized bytes=280
Input split bytes=1420
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=280
Reduce input records=20
Reduce output records=0
Spilled Records=40
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=1374
CPU time spent (ms)=5180
Physical memory (bytes) snapshot=3030937600
Virtual memory (bytes) snapshot=29260795904
Total committed heap usage (bytes)=2622488576
Peak Map Physical memory (bytes)=297189376
Peak Map Virtual memory (bytes)=2661572608
Peak Reduce Physical memory (bytes)=209162240
Peak Reduce Virtual memory (bytes)=2667085824
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 30.425 seconds
Estimated value of Pi is 3.17333333333333333333
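With the pi job succeeding, the other bundled examples run the same way. Below is a short wordcount smoke test from inside the namenode container; the /user/hadoop paths match the hadoop user seen in the staging directory above, and the jar location assumes the stock layout under /opt/hadoop:

# Stage a small input file in HDFS
hdfs dfs -mkdir -p /user/hadoop/wc-in
hdfs dfs -put /opt/hadoop/etc/hadoop/core-site.xml /user/hadoop/wc-in/
# Run wordcount from the same examples jar
yarn jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
  wordcount /user/hadoop/wc-in /user/hadoop/wc-out
# Inspect the output
hdfs dfs -cat /user/hadoop/wc-out/part-r-00000 | head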