Hadoop Command Operations Reference

HDFS Commands

Apache Hadoop 3.3.4 – Overview

01.appendToFile

hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
hdfs dfs -appendToFile /root/tmp/202302/02/1.txt hdfs://192.168.88.161:8020/tmp/test20230202/1.txt

02.cat

-ignoreCrc: ignore checksum verification
hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -cat file:///file3 /user/hadoop/file4

03.checksum

-v: display the block size of the file
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///etc/hosts

04.chgrp

Change the group association of files. The user must be the owner of the file or a superuser. Additional information is in the Permissions Guide.

-R: change the group association recursively
hdfs dfs -chgrp -R <group> /tmp/tmp

05.chmod

-R: change permissions recursively
hdfs dfs -chmod -R 777 /tmp/tmp

06.chown

-R: change the owner (and optionally the group) recursively
hdfs dfs -chown -R <owner>[:<group>] /tmp/tmp

07.copyFromLocal

Upload files from the local file system to HDFS; equivalent to -put.
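
A minimal usage sketch (the local and HDFS paths are illustrative, mirroring the -put examples later in this document):

hadoop fs -copyFromLocal localfile /user/hadoop/hadoopfile
hadoop fs -copyFromLocal -f localfile1 localfile2 /user/hadoop/hadoopdir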

08.copyToLocal

Download files from HDFS to the local file system; equivalent to -get.
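
A minimal usage sketch (the paths are illustrative):

hadoop fs -copyToLocal /user/hadoop/hadoopfile localfile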

09.count

Count the number of directories, files and bytes under the paths that match the specified file pattern, and optionally report quotas and usage. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME.

  • -q / -u: control which columns the output contains. -q shows quotas and usage; -u limits the output to quotas and usage only.
  • -v: display a header line.
  • -x: exclude snapshots from the result calculation. Without -x (the default), the result is always calculated from all INodes, including all snapshots under the given path. The -x option is ignored if -u or -q is given.
  • -h: show sizes in a human-readable form (B, K, M, G).
  • -e: show the erasure coding policy.
  • -s: show the snapshot count for each directory.
hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
hadoop fs -count -q hdfs://nn1.example.com/file1
hadoop fs -count -q -h hdfs://nn1.example.com/file1
hadoop fs -count -q -h -v hdfs://nn1.example.com/file1
hadoop fs -count -u hdfs://nn1.example.com/file1
hadoop fs -count -u -h hdfs://nn1.example.com/file1
hadoop fs -count -u -h -v hdfs://nn1.example.com/file1
hadoop fs -count -e hdfs://nn1.example.com/file1
hadoop fs -count -s hdfs://nn1.example.com/file1

10.test

Check whether a file or directory exists in HDFS; see the example after the option list.

Options:

  • -d: return 0 if the path is a directory, otherwise return 1.
  • -e: return 0 if the path exists, otherwise return 1.
  • -f: return 0 if the path is a file, otherwise return 1.
  • -s: return 0 if the path has a size greater than zero, otherwise return 1.
  • -z: return 0 if the file has zero length, otherwise return 1.
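
A usage sketch (the HDFS path is illustrative; the result is carried in the exit code):

# Prints "exists" if the path exists (exit code 0), otherwise "missing"
hdfs dfs -test -e /tmp/test20230202/1.txt && echo "exists" || echo "missing"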

11.getmerge

# Merge the files in an HDFS directory and download the result as a single local file
hdfs dfs -getmerge hdfs://ip:port/tmp/tmp ./value.txt

12.expunge

Empty the HDFS trash.
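
For example:

hdfs dfs -expunge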

13.skipTrash

-skipTrash is an option of -rm: delete files immediately instead of moving them to the trash.
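
A usage sketch (the path is illustrative):

hdfs dfs -rm -r -skipTrash /tmp/test20230202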

14.df

Show the total capacity and usage of HDFS.

  • -h: display sizes in a human-readable form (KB/MB/GB)

Note: summarizing the total usage under a directory is done with `hadoop fs -du -s`, not with -df; see the du sketch below.

hdfs dfs -df 
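
To summarize the usage of a specific directory (the path is illustrative):

hdfs dfs -du -s -h /tmp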

15.distcp

Options:

  • -append: Reuse existing data in target files and append new data to them where possible, instead of overwriting.
  • -async: Do not block; run the DistCp job asynchronously.
  • -atomic: Commit all changes or none.
  • -bandwidth: Specify the bandwidth per map, in MB/second.
  • -delete: Delete files that exist in the target but not in the source; deleted files go to the HDFS trash if it is enabled.
  • -diff: Use a snapshot diff report to identify the differences between source and target.
  • -f: A file containing the list of files to copy.
  • -filelimit: (Deprecated!) Limit the number of files copied to <= n.
  • -filters: A file containing a list of patterns to exclude from the copy.
  • -i: Ignore failures during the copy.
  • -log: Directory on HDFS where DistCp execution logs are saved.
  • -m: Limit the maximum number of simultaneous maps; DistCp uses at most one map per file, and the default maximum is 20 maps.
  • -mapredSslConf: SSL configuration file to use with hftps://.
  • -numListstatusThreads: Number of threads used to build the file listing (at most 40); increase this when the directory structure is large and complex.
  • -overwrite: Unconditionally overwrite target files, even if they exist.
  • -p[rbugpcaxt]: Preserve source file attributes (replication, block size, user, group, permission, checksum type, ACL, XATTR, timestamps).
  • -sizelimit: (Deprecated!) Limit the number of bytes copied to <= n.
  • -skipcrccheck: Skip the CRC check between source and target paths.
  • -strategy: Copy strategy. The default is uniformsize, which balances the total size copied by each map; dynamic lets faster maps copy more files, which can improve performance.
  • -tmp: Intermediate work path to be used for the atomic commit.
  • -update: Overwrite the target file if its size or checksum differs from the source; skip files that are already identical.

Examples:

hadoop distcp -i  -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/

hadoop distcp -i -update -delete -p hdfs://192.168.40.100:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp hdfs://192.168.40.200:8020/user/hive/warehouse/iot.db/dwd_pollution_distcp

16.find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern
    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print
    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression

    expression -and expression

    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

hadoop fs -find / -name test -print

17.ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

  • -C: Display the paths of files and directories only.
  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -q: Print ? instead of non-printable characters.
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
  • -e: Display the erasure coding policy of files and directories only.

For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

hadoop fs -ls /user/hadoop/file1
hadoop fs -ls -e /ecdir

18.mkdir

Usage: hadoop fs -mkdir [-p] <paths>

The -p flag creates parent directories recursively.

Takes path uri’s as argument and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

19.mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

20.put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [-t <thread count>] [-q <thread pool queue size>] [ - | <localsrc> ...] <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

  • -p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.
  • -t <thread count> : Number of threads to be used, default is 1. Useful when uploading directories containing more than 1 file.
  • -q <thread pool queue size> : Thread pool queue size to be used, default is 1024. It takes effect only when thread count greater than 1.

Examples:

hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
hadoop fs -put -t 5 localdir hdfs://nn.example.com/hadoop/hadoopdir
hadoop fs -put -t 10 -q 2048 localdir1 localdir2 hdfs://nn.example.com/hadoop/hadoopdir

21.rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
  • The -safely option will require safety confirmation before deleting directory with total number of files greater than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories. Delay is expected when walking over large directory recursively to count the number of files to be deleted before the confirmation.

Example:

hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

22.rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

hadoop fs -rmdir /user/hadoop/emptydir

23.tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

hadoop fs -tail pathname

24.touch

Usage: hadoop fs -touch [-a] [-m] [-t TIMESTAMP] [-c] URI [URI ...]

Updates the access and modification times of the file specified by the URI to the current time. If the file does not exist, then a zero length file is created at URI with current time as the timestamp of that URI.

  • Use -a option to change only the access time
  • Use -m option to change only the modification time
  • Use -t option to specify timestamp (in format yyyyMMdd:HHmmss) instead of current time
  • Use -c option to not create file if it does not exist

The timestamp format is as follows:

  • yyyy: four digit year (e.g. 2018)
  • MM: two digit month of the year (e.g. 08 for the month of August)
  • dd: two digit day of the month (e.g. 01 for the first day of the month)
  • HH: two digit hour of the day using 24 hour notation (e.g. 23 stands for 11 pm, 11 stands for 11 am)
  • mm: two digit minutes of the hour
  • ss: two digit seconds of the minute

e.g. 20180809:230000 represents August 9th 2018, 11pm

Example:

hadoop fs -touch pathname
hadoop fs -touch -m -t 20180809:230000 pathname
hadoop fs -touch -t 20180809:230000 pathname
hadoop fs -touch -a pathname

25.touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

hadoop fs -touchz pathname

26.help

# Show the help documentation for the ls command
hadoop fs -help ls

27.Convert an fsimage file to XML (oiv)

hdfs oiv -p XML -i <fsimage file> -o <output path>
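
An illustrative invocation (the fsimage filename and output path are hypothetical):

hdfs oiv -p XML -i fsimage_0000000000000000025 -o /tmp/fsimage.xml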

28.Convert an edits file to XML (oev)

hdfs oev -p xml -i <edits file> -o <output path>
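
An illustrative invocation (the edits filename and output path are hypothetical):

hdfs oev -p xml -i edits_0000000000000000001-0000000000000000025 -o /tmp/edits.xml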

29.Check which native libraries and compression codecs are available

hadoop checknative

30.View the currently configured HDFS block size

# Displayed in bytes
hdfs getconf -confKey dfs.blocksize 

31.Start/stop HDFS daemons

hdfs --daemon start/stop namenode/datanode/secondarynamenode

32.Inspect NameNode heap memory

# Use jps to print the NameNode process id, then:
jmap -heap <pid>
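
Note (depends on your JDK): jmap -heap was removed in JDK 9 and later; on newer JDKs the equivalent is:

jhsdb jmap --heap --pid <pid>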

33.Refresh HDFS nodes

hdfs dfsadmin -refreshNodes

34.List the available storage policies

hdfs storagepolicies -listPolicies

35.Set a storage policy on a path (data storage directory)

hdfs storagepolicies -setStoragePolicy -path xxx -policy xxx
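
A concrete sketch (the path is illustrative; COLD is one of the built-in policies shown by -listPolicies):

hdfs storagepolicies -setStoragePolicy -path /tmp/cold_data -policy COLD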

36.Get the storage policy of a path (directory or file)

hdfs storagepolicies -getStoragePolicy -path xxx

37.Unset the storage policy. Afterwards, the directory or file follows the policy of its parent directory; if it is the root directory, the policy is HOT.

hdfs storagepolicies -unsetStoragePolicy -path xxx

38.View the block distribution of files

hdfs fsck /tmp -files -blocks -locations

YARN Commands

1.Start/stop YARN daemons

yarn --daemon start/stop resourcemanager/nodemanager

2.List all applications

yarn application -list

3.Filter applications by state

  • ALL
  • NEW
  • NEW_SAVING
  • SUBMITTED
  • ACCEPTED
  • RUNNING
  • FINISHED
  • FAILED
  • KILLED
yarn application -list -appStates FINISHED

4.Kill an application

yarn application -kill <application id>
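
For example (the application id is illustrative, reusing the id from the container-log example below):

yarn application -kill application_1612577921195_0001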

5.Check the status of an application

yarn application -status <application id>

6.View application logs

yarn logs -applicationId <application id>

7.View container logs

yarn logs -applicationId application_1612577921195_0001 -containerId container_1612577921195_0001_01_000001

8.List all application attempts

yarn applicationattempt -list <application id>

9.Print ApplicationAttempt status

yarn applicationattempt -status appattempt_1612599921195_0001_000001

10.List all nodes

yarn node -list -all

11.Refresh the YARN queue configuration

yarn rmadmin -refreshQueues

12.Print queue information

yarn queue -status default

13.Update application priority

yarn application -appId <application id> -updatePriority 5

Other Commands

1.Start the data balancer

# A threshold of 10 means each node's disk usage may differ from the cluster average by at most 10%; adjust to fit the cluster.
sbin/start-balancer.sh -threshold 10

2.Stop the data balancer

# Note: HDFS starts a separate Rebalance Server to perform balancing, so avoid running start-balancer.sh on the NameNode; run it on a relatively idle machine instead.
sbin/stop-balancer.sh

3.View the cluster node report

hdfs dfsadmin -report