Basic Hadoop Cluster Operations
Viewing Basic Cluster Information
Viewing cluster storage information
Log in to the HDFS web monitoring UI to check the running status and storage details. The default port is 50070; the actual address is determined by the configuration in hdfs-site.xml:
<!-- Address of the NameNode web UI -->
<property>
<name>dfs.http.address</name>
<value>node1:50070</value>
</property>
The same information can also be checked from the command line on the server:
Usage: hdfs dfsadmin
Note: Administrative commands can only be run as the HDFS superuser.
[-report [-live] [-dead] [-decommissioning] [-enteringmaintenance] [-inmaintenance]]
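The `-report` output lists the cluster's capacity and usage figures as plain `Label: value` text. As a minimal sketch (the sample text below is illustrative, not taken from a real cluster, though the field names follow the real report format), a few lines of Python can pull the headline numbers out of that output:

```python
import re

# Illustrative excerpt of `hdfs dfsadmin -report` output; the byte
# counts here are made up for the example.
sample_report = """\
Configured Capacity: 107321753600 (99.95 GB)
Present Capacity: 84883843072 (79.06 GB)
DFS Remaining: 84883365888 (79.05 GB)
DFS Used: 477184 (466 KB)
DFS Used%: 0.00%
"""

def parse_report(text):
    """Extract 'Label: <integer>' pairs from the report header."""
    stats = {}
    for label, value in re.findall(r"^([A-Za-z %]+): (\d+)", text, re.M):
        stats[label] = int(value)  # raw byte counts
    return stats

stats = parse_report(sample_report)
print(stats["DFS Used"])  # 477184
```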
Viewing cluster compute resources
Log in on port 8088 (the default) to view the cluster's compute resource information; the actual address is determined by the configuration in yarn-site.xml:
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<!-- Address of the YARN web UI -->
<property>
<description>
The http address of the RM web application.
If only a host is provided as the value,
the webapp will be served on a random port.
</description>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
Visit port 8042 (the default) to view each node's resource information; the actual address is determined by the configuration in yarn-site.xml:
<property>
<description>NM Webapp address.</description>
<name>yarn.nodemanager.webapp.address</name>
<value>${yarn.nodemanager.hostname}:8042</value>
</property>
HDFS Filesystem Operations
Browsing the HDFS filesystem
You can log in on port 50070 and browse basic HDFS filesystem information through the web UI; the directory structure works much like that of an ordinary Linux system:
Basic HDFS operations
Most HDFS filesystem management tasks can be performed with the hdfs command; the available subcommands are listed below:
[hadoop@node1 ~]$ hdfs dfs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] <path> ...]
[-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] [-x] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] [-skip-empty-file] <src> <localdst>]
[-help [cmd ...]]
[-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]
Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> define a value for a given property
-fs <file:///|hdfs://namenode:port> specify default filesystem URL to use, overrides 'fs.defaultFS' property from configurations.
-jt <local|resourcemanager:port> specify a ResourceManager
-files <file1,...> specify a comma-separated list of files to be copied to the map reduce cluster
-libjars <jar1,...> specify a comma-separated list of jar files to be included in the classpath
-archives <archive1,...> specify a comma-separated list of archives to be unarchived on the compute machines
The general command line syntax is:
command [genericOptions] [commandOptions]
Commonly used commands:
# Create a directory
hdfs dfs -mkdir /tmp             # create a tmp directory under the HDFS root (/)
hdfs dfs -mkdir -p /test1/test2  # add "-p" to create nested directories in one step
# List file information
hdfs dfs -ls /                   # list the files and directories under the HDFS root
# Display file contents
hdfs dfs -cat /tmp/test.txt      # print the contents of the file
# Upload files to HDFS
hdfs dfs -put /home/hadoop/test.txt /tmp                    # upload the local file /home/hadoop/test.txt to /tmp on HDFS
hdfs dfs -appendToFile /home/hadoop/test.txt /tmp/test.txt  # append the local file to the end of the existing HDFS file
hdfs dfs -copyFromLocal -f /home/hadoop/test.txt /tmp       # with "-f", overwrite the HDFS file if it already exists
# Download files from HDFS
hdfs dfs -get /tmp/test.txt /home/hadoop                    # download test.txt from HDFS into the local /home/hadoop directory
hdfs dfs -copyToLocal /tmp/test.txt /home/hadoop/test1.txt  # download and save under a different local name
# Move files within HDFS
hdfs dfs -mv /tmp/test.txt /test/  # move test.txt into the /test directory
# Delete a file from HDFS
hdfs dfs -rm /tmp/test.txt
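The commands above can also be scripted. A minimal sketch, assuming the `hdfs` binary is on the PATH (the helper names `hdfs_dfs` and `run` are hypothetical, not part of Hadoop):

```python
import subprocess

def hdfs_dfs(*args):
    """Build the argv list for an `hdfs dfs` subcommand."""
    return ["hdfs", "dfs", *args]

def run(*args):
    """Run the subcommand and fail loudly on a non-zero exit code."""
    subprocess.run(hdfs_dfs(*args), check=True)

# Example: mirror the upload-then-list sequence above
# (uncomment on a machine with a running cluster).
# run("-mkdir", "-p", "/tmp")
# run("-put", "/home/hadoop/test.txt", "/tmp")
# run("-ls", "/tmp")
print(hdfs_dfs("-mkdir", "-p", "/test1/test2"))
```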
The same file information can also be viewed in the web UI:
Running MapReduce Jobs
The official example program package
The directory $HADOOP_HOME/share/hadoop/mapreduce/ contains the official example package hadoop-mapreduce-examples-2.10.1.jar, which bundles a number of commonly used test programs:
Program | Purpose |
---|---|
aggregatewordcount | An Aggregate-based map/reduce program that counts the words in the input files. |
aggregatewordhist | An Aggregate-based map/reduce program that computes the histogram of the words in the input files. |
bbp | A map/reduce program that uses the Bailey-Borwein-Plouffe formula to compute exact digits of pi. |
dbcount | An example job that counts pageviews from a database. |
distbbp | A map/reduce program that uses a BBP-type formula to compute exact bits of pi. |
grep | A map/reduce program that counts the matches of a regular expression in the input. |
join | A job that effects a join over sorted, equally partitioned datasets. |
multifilewc | A job that counts words from several files. |
pentomino | A map/reduce tile-laying program that finds solutions to pentomino problems. |
pi | A map/reduce program that estimates pi using a quasi-Monte Carlo method. |
randomtextwriter | A map/reduce program that writes 10GB of random textual data per node. |
randomwriter | A map/reduce program that writes 10GB of random data per node. |
secondarysort | An example defining a secondary sort to the reduce. |
sort | A map/reduce program that sorts the data written by the random writer. |
sudoku | A sudoku solver. |
teragen | Generates data for terasort. |
terasort | Runs terasort. |
teravalidate | Checks the results of terasort. |
wordcount | A map/reduce program that counts the words in the input files. |
wordmean | A map/reduce program that computes the average length of the words in the input files. |
wordmedian | A map/reduce program that computes the median length of the words in the input files. |
Submitting a MapReduce job
Example 1: wordcount
The command and its log output are as follows:
[hadoop@node1 ~]$ hadoop jar /app/hadoop-2.10.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar wordcount /tmp/test.txt /tmp/output
22/05/11 23:19:17 INFO client.RMProxy: Connecting to ResourceManager at node1/199.188.166.111:8032
22/05/11 23:19:19 INFO input.FileInputFormat: Total input files to process : 1
22/05/11 23:19:19 INFO mapreduce.JobSubmitter: number of splits:1
22/05/11 23:19:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1652322858586_0001
22/05/11 23:19:20 INFO conf.Configuration: resource-types.xml not found
22/05/11 23:19:20 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/05/11 23:19:20 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
22/05/11 23:19:20 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
22/05/11 23:19:21 INFO impl.YarnClientImpl: Submitted application application_1652322858586_0001
22/05/11 23:19:21 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1652322858586_0001/
22/05/11 23:19:21 INFO mapreduce.Job: Running job: job_1652322858586_0001
22/05/11 23:19:38 INFO mapreduce.Job: Job job_1652322858586_0001 running in uber mode : false
22/05/11 23:19:38 INFO mapreduce.Job: map 0% reduce 0%
22/05/11 23:19:50 INFO mapreduce.Job: map 100% reduce 0%
22/05/11 23:20:02 INFO mapreduce.Job: map 100% reduce 100%
22/05/11 23:20:03 INFO mapreduce.Job: Job job_1652322858586_0001 completed successfully
22/05/11 23:20:03 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=2274
FILE: Number of bytes written=421473
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3213
HDFS: Number of bytes written=1928
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8053
Total time spent by all reduces in occupied slots (ms)=8758
Total time spent by all map tasks (ms)=8053
Total time spent by all reduce tasks (ms)=8758
Total vcore-milliseconds taken by all map tasks=8053
Total vcore-milliseconds taken by all reduce tasks=8758
Total megabyte-milliseconds taken by all map tasks=8246272
Total megabyte-milliseconds taken by all reduce tasks=8968192
Map-Reduce Framework
Map input records=38
Map output records=335
Map output bytes=4379
Map output materialized bytes=2274
Input split bytes=95
Combine input records=335
Combine output records=87
Reduce input groups=87
Reduce shuffle bytes=2274
Reduce input records=87
Reduce output records=87
Spilled Records=174
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=222
CPU time spent (ms)=2630
Physical memory (bytes) snapshot=396931072
Virtual memory (bytes) snapshot=3804737536
Total committed heap usage (bytes)=194383872
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=3118
File Output Format Counters
Bytes Written=1928
After the job completes, the results can be viewed in the HDFS filesystem:
Two new files appear in the output directory: _SUCCESS, a marker file indicating that the job finished successfully, and part-r-00000, the result file produced by the job:
[hadoop@node1 ~]$ hdfs dfs -cat /tmp/output/part-r-00000
-rw-rw-r--. 4
00:15 1
00:20 1
00:26 1
00:29 1
00:39 1
00:47 1
00:51 1
00:54 1
01:00 1
01:04 1
01:07 1
01:09 1
01:16 1
07:55 1
08:04 1
08:15 1
09:32 1
1 4
11 9
16 1
17 5
18:59 6
19:33 4
19:34 4
2 8
20:24 1
20:33 1
20:46 1
21 18
23:05 1
23:06 2
28 1
29 1
3 23
30 2
35 2
4 7
5 17
54 1
6 7
9 6
Apr 4
Jetty_0_0_0_0_50070_hdfs____w2cu08 1
Jetty_0_0_0_0_50090_secondary____y6aanv 1
Jetty_0_0_0_0_8042_node____19tj0x 1
Jetty_localhost_32873_datanode____t7p7lo 1
Jetty_localhost_33735_datanode____jksu74 1
Jetty_localhost_34961_datanode____.fpendy 1
Jetty_localhost_36015_datanode____.lhrbt4 1
Jetty_localhost_38151_datanode____.rhd829 1
Jetty_localhost_39677_datanode____.s4r2y1 1
Jetty_localhost_40461_datanode____.d6iqau 1
Jetty_localhost_40969_datanode____1moe5j 1
Jetty_localhost_41457_datanode____snit9c 1
Jetty_localhost_42109_datanode____.mhhtgd 1
Jetty_localhost_42315_datanode____.wlr1a8 1
Jetty_localhost_42845_datanode____.422dr2 1
Jetty_localhost_43529_datanode____.iybvi4 1
Jetty_localhost_43811_datanode____vzpazk 1
Jetty_localhost_44775_datanode____2kxto 1
Jetty_node1_50070_hdfs____.8fa0c 1
Jetty_node1_8088_cluster____uqk9cr 1
May 33
drwx------. 11
drwxr-xr-x. 2
drwxrwxr-x. 20
hadoop 52
hadoop-hadoop-datanode.pid 1
hadoop-hadoop-namenode.pid 1
hsperfdata_hadoop 1
hsperfdata_root 1
root 22
systemd-private-0abe12489c264785bd8088f6e33eeb83-ModemManager.service-aaM0Jf 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-bluetooth.service-7X05Qi 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-chronyd.service-HmOY5i 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-colord.service-VyCTLg 1
systemd-private-0abe12489c264785bd8088f6e33eeb83-rtkit-daemon.service-3U58pj 1
total 1
tracker-extract-files.1000 1
vmware-root_916-2689078442 1
vmware-root_918-2697532712 1
vmware-root_921-3980298495 1
vmware-root_925-3988621690 1
vmware-root_927-3980167416 1
yarn-hadoop-nodemanager.pid 1
yarn-hadoop-resourcemanager.pid 1
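Counters such as "Combine input records=335" and "Reduce input groups=87" in the log above map directly onto the wordcount data flow: the map phase emits a (word, 1) pair per token, and the combine/reduce phase sums the counts per distinct word. A minimal single-machine sketch of the same logic (plain Python, no Hadoop involved):

```python
from collections import Counter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Combine/reduce: sum the counts for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

lines = ["hadoop hdfs yarn", "hdfs hdfs"]
counts = reduce_phase(map_phase(lines))
# part-r-00000 is written as "word<TAB>count", sorted by key:
for word in sorted(counts):
    print(f"{word}\t{counts[word]}")
```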
Example 2: estimating the value of π
The command and its log output are as follows:
[hadoop@node1 ~]$ hadoop jar /app/hadoop-2.10.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.1.jar pi 10 100
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
22/05/11 23:35:11 INFO client.RMProxy: Connecting to ResourceManager at node1/199.188.166.111:8032
22/05/11 23:35:12 INFO input.FileInputFormat: Total input files to process : 10
22/05/11 23:35:12 INFO mapreduce.JobSubmitter: number of splits:10
22/05/11 23:35:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1652322858586_0002
22/05/11 23:35:13 INFO conf.Configuration: resource-types.xml not found
22/05/11 23:35:13 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/05/11 23:35:13 INFO resource.ResourceUtils: Adding resource type - name = memory-mb, units = Mi, type = COUNTABLE
22/05/11 23:35:13 INFO resource.ResourceUtils: Adding resource type - name = vcores, units = , type = COUNTABLE
22/05/11 23:35:13 INFO impl.YarnClientImpl: Submitted application application_1652322858586_0002
22/05/11 23:35:13 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1652322858586_0002/
22/05/11 23:35:13 INFO mapreduce.Job: Running job: job_1652322858586_0002
22/05/11 23:35:24 INFO mapreduce.Job: Job job_1652322858586_0002 running in uber mode : false
22/05/11 23:35:24 INFO mapreduce.Job: map 0% reduce 0%
22/05/11 23:35:42 INFO mapreduce.Job: map 20% reduce 0%
22/05/11 23:36:00 INFO mapreduce.Job: map 20% reduce 7%
22/05/11 23:36:28 INFO mapreduce.Job: map 30% reduce 7%
22/05/11 23:36:29 INFO mapreduce.Job: map 50% reduce 7%
22/05/11 23:36:30 INFO mapreduce.Job: map 70% reduce 7%
22/05/11 23:36:31 INFO mapreduce.Job: map 100% reduce 7%
22/05/11 23:36:32 INFO mapreduce.Job: map 100% reduce 100%
22/05/11 23:36:33 INFO mapreduce.Job: Job job_1652322858586_0002 completed successfully
22/05/11 23:36:33 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=226
FILE: Number of bytes written=2297625
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2620
HDFS: Number of bytes written=215
HDFS: Number of read operations=43
HDFS: Number of large read operations=0
HDFS: Number of write operations=3
Job Counters
Launched map tasks=10
Launched reduce tasks=1
Data-local map tasks=10
Total time spent by all maps in occupied slots (ms)=528176
Total time spent by all reduces in occupied slots (ms)=48013
Total time spent by all map tasks (ms)=528176
Total time spent by all reduce tasks (ms)=48013
Total vcore-milliseconds taken by all map tasks=528176
Total vcore-milliseconds taken by all reduce tasks=48013
Total megabyte-milliseconds taken by all map tasks=540852224
Total megabyte-milliseconds taken by all reduce tasks=49165312
Map-Reduce Framework
Map input records=10
Map output records=20
Map output bytes=180
Map output materialized bytes=280
Input split bytes=1440
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=280
Reduce input records=20
Reduce output records=0
Spilled Records=40
Shuffled Maps =10
Failed Shuffles=0
Merged Map outputs=10
GC time elapsed (ms)=9894
CPU time spent (ms)=13350
Physical memory (bytes) snapshot=2232963072
Virtual memory (bytes) snapshot=20908384256
Total committed heap usage (bytes)=1540988928
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1180
File Output Format Counters
Bytes Written=97
Job Finished in 82.602 seconds
Estimated value of Pi is 3.14800000000000000000
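The pi example estimates π by sampling points in the unit square and counting the fraction that fall inside the quarter circle. Hadoop's implementation distributes the sampling across map tasks and uses a Halton sequence; the sketch below keeps the same (maps × samples) structure but uses plain pseudo-random sampling with a fixed seed instead:

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=0):
    """Quasi-Monte-Carlo-style pi estimate: draw points in the unit
    square and count how many land inside the quarter circle."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    total = num_maps * samples_per_map
    inside = 0
    for _ in range(total):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point is inside the quarter circle
            inside += 1
    return 4.0 * inside / total

# Same parameters as the job above: 10 maps, 100 samples each.
print(estimate_pi(10, 100))
```

With only 1,000 samples the estimate is rough, which is why the job above reports 3.148 rather than something closer to 3.14159; increasing the sample count tightens the result.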
Viewing the compute resources used by MapReduce jobs
- The following page shows cluster resource usage in real time (since the jobs above have finished, the metrics have returned to their initial values)
- The list of MapReduce jobs
- Detailed information for a single job