1. HDFS commands
- Check the Hadoop version:
hadoop version
- View a file's contents, piped through more:
hadoop fs -cat /in/hadoop-hadoop-namenode-h71.log | more
- Count the number of lines in an HDFS file:
hadoop fs -cat /in/hadoop-hadoop-namenode-h71.log | wc -l
Output: 16509
- View the first n lines of an HDFS file:
hadoop fs -text file | head -n 100
- View the last n lines of an HDFS file:
hadoop fs -text file | tail -n 100
- List the first n files in an HDFS directory:
hadoop fs -du -h /hbase/oldWALs | head -10
- List the last n files in an HDFS directory:
hadoop fs -du -h /hbase/oldWALs | tail -10
- List the last n files in an HDFS directory (with a filter):
hadoop fs -du -h /hbase/oldWALs | grep 16978910 | tail -10
- Check the configured fs.default.name:
hdfs getconf -confKey fs.default.name
Output: hdfs://avicnamespace
- Check the storage used by each subdirectory:
hadoop fs -du -h /
Note: with the -s flag the command reports a single total for the directory instead.
Column 1 is the total size of the files under the directory.
Column 2 is the total space those files occupy across the cluster, which depends on the replication factor; my replication factor is 3, so column 2 is three times column 1 (column 2 = file size * replication factor).
Column 3 is the directory being queried.
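A quick illustration of the -s variant (the path here is only an example):
# per-entry usage under the directory
hadoop fs -du -h /hbase
# one rolled-up total: logical size and space including replicas
hadoop fs -du -s -h /hbase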
- Check overall HDFS space usage:
$ hdfs dfs -df -h /
Filesystem Size Used Available Use%
hdfs://bigdata1 78.3 T 61.5 T 11.1 T 78%
- Change the owner:
hadoop fs -chown -R root:root /tmp
- Change permissions:
hadoop fs -chmod 777 /work/user
- Get the HA state of a NameNode:
hdfs haadmin -getServiceState nn1
Note: the serviceId passed to getServiceState above can be looked up in Cloudera Manager.
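If you are not sure which NameNode is active, a small sketch like the following checks both (nn1/nn2 are assumed serviceIds; newer Hadoop releases also offer hdfs haadmin -getAllServiceState):
# query each NameNode serviceId in turn
for sid in nn1 nn2; do
  echo -n "$sid: "
  hdfs haadmin -getServiceState $sid
done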
- Count files and directories:
hadoop fs -count /hbase/oldWALs
- Delete a directory:
hadoop fs -rm -r /path/to/your/directory
- Delete a file:
hadoop fs -rm /path/to/your/file
- Delete a file or directory, bypassing the trash:
hadoop fs -rm -r -f -skipTrash /input
- Empty the trash:
hadoop fs -expunge
- Batch-delete files (note that head -1 here only selects the first entry of the listing; adjust the head/grep stages to pick the files you actually want):
hadoop fs -du -h /hbase/oldWALs | head -1 | awk '{print $5}' | xargs hadoop fs -rm
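A slightly fuller sketch of the same idea: select paths by pattern, review them, then delete in batches. The path and SOME_PATTERN are placeholders, and the column layout assumes Hadoop 3, where plain -du prints size, space consumed, and path; always check the preview file before running the rm.
# collect candidate paths (3rd column of -du output on Hadoop 3)
hadoop fs -du /hbase/oldWALs | awk '{print $3}' | grep 'SOME_PATTERN' > /tmp/to_delete.txt
# review /tmp/to_delete.txt, then delete 100 files per hadoop fs -rm invocation
xargs -n 100 hadoop fs -rm < /tmp/to_delete.txt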
- Find files under a directory:
hadoop fs -find /xiaoqiang/ -name "*.parquet"
- Merge and export (concatenate all files under a directory into a single local file):
hadoop fs -getmerge /user/hadoop/output/ local_file
Note: suppose your HDFS cluster has a /user/hadoop/output directory holding a job's results split across several files (part-000000, part-000001, part-000002, ...); this command pulls them down merged into one file so you can read them together.
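A small sketch of the round trip, using the output directory above; the optional -nl flag inserts a newline between merged files:
# merge all part files into one local file and view it
hadoop fs -getmerge -nl /user/hadoop/output/ ./output_merged.txt
less ./output_merged.txt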
- Check whether HDFS is in safe mode (normally OFF):
hdfs dfsadmin -safemode get
Output: Safe mode is OFF
- Leave safe mode:
hdfs dfsadmin -safemode leave
- Enter safe mode:
hdfs dfsadmin -safemode enter
- Check whether the DataNodes are healthy and how much disk space is free:
hdfs dfsadmin -report
Configured Capacity: 26792229863424 (24.37 TB)
Present Capacity: 13825143267805 (12.57 TB)
DFS Remaining: 7957572810313 (7.24 TB)
DFS Used: 5867570457492 (5.34 TB)
DFS Used%: 42.44%
Replicated Blocks:
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
Erasure Coded Block Groups:
Low redundancy block groups: 0
Block groups with corrupt internal blocks: 0
Missing block groups: 0
Low redundancy blocks with highest priority to recover: 0
Pending deletion blocks: 0
-------------------------------------------------
Note: without HDFS superuser privileges the same command is rejected:
report: Access denied for user root. Superuser privilege is required
- Check the current state of HDFS blocks:
hdfs fsck /
Status: HEALTHY
Number of data-nodes: 3
Number of racks: 1
Total dirs: 10894
Total symlinks: 0
Replicated Blocks:
Total size: 1931171688283 B (Total open files size: 8187303934 B)
Total files: 45350 (Files currently being written: 13)
Total blocks (validated): 46718 (avg. block size 41336780 B) (Total open file blocks (not validated): 72)
Minimally replicated blocks: 46718 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Missing blocks: 0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Blocks queued for replication: 0
Erasure Coded Block Groups:
Total size: 0 B
Total files: 0
Total block groups (validated): 0
Minimally erasure-coded block groups: 0
Over-erasure-coded block groups: 0
Under-erasure-coded block groups: 0
Unsatisfactory placement block groups: 0
Average block group size: 0.0
Missing block groups: 0
Corrupt block groups: 0
Missing internal blocks: 0
Blocks queued for replication: 0
FSCK ended at Wed Feb 14 17:52:45 CST 2024 in 797 milliseconds
The filesystem under path '/' is HEALTHY
Field descriptions for the hadoop fsck output:
Status: the overall result of this block check
Number of data-nodes: number of DataNodes
Number of racks: number of racks
Total dirs: number of directories under the checked path
Total symlinks: number of symbolic links under the checked path
Total size: amount of data stored in HDFS, excluding replicas, e.g. 75423236058649 B (bytes -> KB -> MB -> GB -> TB: 75423236058649/1024/1024/1024/1024 = 68.59703358591014 TB)
Total files: number of files under the checked path
Total blocks (validated): total number of blocks, excluding replicas; e.g. 5363690 blocks with an average block size of 14061818 B gives 14061818 * 5363690 = 75423232588420 B, roughly the data size above (excluding replicas)
Minimally replicated blocks: blocks that have at least the minimum required number of replicas (dfs.namenode.replication.min)
Over-replicated blocks: blocks with more replicas than the target replication factor
Under-replicated blocks: blocks with fewer replicas than the target replication factor (queued for re-replication)
Mis-replicated blocks: blocks whose replicas violate the block placement policy
Default replication factor: the default replication factor; 3 means three copies in total (the original plus two additional replicas)
Average block replication: average number of replicas per block; if it is below the default replication factor, some replicas are missing
Missing replicas: number of missing replicas; on a healthy cluster Under-replicated blocks, Mis-replicated blocks and Missing replicas are all 0, and a non-zero value means replicas have been lost
Corrupt blocks: number of corrupt blocks; a non-zero value means some blocks have no healthy replica left, i.e. data has been lost
Missing blocks: number of blocks for which no replica exists anywhere
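To drill into a specific path rather than the whole namespace, fsck also takes per-file options; a quick sketch (the paths are only examples):
# show every file under the path, its blocks, and the DataNodes holding each replica
hdfs fsck /hbase/oldWALs -files -blocks -locations
# list only the files that have corrupt blocks
hdfs fsck / -list-corruptfileblocks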
2. YARN commands
List applications:
[root@node01 ~]# yarn application -list
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
21/07/12 15:05:56 INFO client.AHSProxy: Connecting to Application History server at node02/110.110.110.110:10200
21/07/12 15:05:56 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
Total number of applications (application-types: [], states: [SUBMITTED, ACCEPTED, RUNNING] and tags: []):2
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1615543578058_0002 wormhole_1_mysql-hbase_test_stream SPARK root default RUNNING UNDEFINED 10 http://node02:45335
application_1617255690277_0001 ats-hbase yarn-service yarn-ats default RUNNING UNDEFINED 100% N/A
View an application's logs: yarn logs -applicationId application_1625729683563_0015
Kill an application: yarn application -kill application_1625729683563_0015
Kill a MapReduce job: hadoop job -kill <jobId>
Batch-kill unwanted YARN applications (ACCEPTED is the state being matched; change it as needed):
for i in `yarn application -list | grep -w ACCEPTED | awk '{print $1}' | grep application_`; do yarn application -kill $i; done
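A variant of the same loop, assuming a Hadoop version whose yarn application -list supports the -appStates filter, which avoids grepping the state column:
# kill every application currently in the ACCEPTED state
for app in $(yarn application -list -appStates ACCEPTED 2>/dev/null | awk '/application_/{print $1}'); do
  yarn application -kill "$app"
done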
3. Kafka commands
Start a console producer on the 10.3.2.24 server:
kafka-console-producer.sh --broker-list 10.3.2.24:6667 --topic djt_db.test_schema1.result
To produce keyed messages, add the property --property parse.key=true; by default the key and value are separated by a Tab, so avoid the escape character \t inside the key or value. Reference: "[Kafka operations] kafka-console-producer.sh explained (3)".
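A minimal sketch of producing and consuming keyed messages against the same broker and topic as above; key.separator switches the delimiter from Tab to ':':
# produce: type lines of the form  mykey:myvalue
kafka-console-producer.sh --broker-list 10.3.2.24:6667 --topic djt_db.test_schema1.result \
  --property parse.key=true --property key.separator=:
# consume, printing the key next to the value
kafka-console-consumer.sh --bootstrap-server 10.3.2.24:6667 --topic djt_db.test_schema1.result \
  --from-beginning --property print.key=true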
Consume data from a topic:
kafka-console-consumer.sh --bootstrap-server 10.3.2.24:6667 --topic djt_db.test_schema1.result --from-beginning
List existing topics:
kafka-topics.sh --list --zookeeper 10.3.2.24:2181
Delete a topic (the broker config has not been changed, so for now the topic is only marked for deletion):
kafka-topics.sh --delete --topic huiq_test2_ctrl --zookeeper 10.3.2.24:2181
Topic huiq_test2_ctrl is marked for deletion.
Note: This will have no impact if delete.topic.enable is not set to true.
Describe a topic:
kafka-topics.sh --describe --zookeeper 10.3.2.24:2181 --topic wormhole_feedback
Topic:wormhole_feedback PartitionCount:1 ReplicationFactor:3 Configs:
Topic: wormhole_feedback Partition: 0 Leader: 1003 Replicas: 1003,1001,1002 Isr: 1003,1002,1001
Create a topic:
kafka-topics.sh --zookeeper node01:2181 --create --topic wormhole_heartbeat --replication-factor 1 --partitions 1
Alter a topic (increase the partition count):
kafka-topics.sh --zookeeper node01:2181 --alter --topic wormhole_feedback --partitions 4
WARNING: If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected
Adding partitions succeeded!
Note: the partition count can only be increased, never decreased; otherwise you get:
Error while executing topic command : The number of partitions for a topic can only be increased. Topic huiq_warm_test currently has 3 partitions, 1 would not be an increase.
[2024-04-22 15:32:06,848] ERROR org.apache.kafka.common.errors.InvalidPartitionsException: The number of partitions for a topic can only be increased. Topic huiq_warm_test currently has 3 partitions, 1 would not be an increase.
4. ZooKeeper commands
Log in to the client on the 10.2.3.24 server:
[root@bigdatanode01 zookeeper]# zookeeper-client
Or run zkCli.sh directly from the installation directory:
cd /usr/hdp/3.1.4.0-315/zookeeper/
bin/zkCli.sh
[zk: localhost:2181(CONNECTED) 2] ls /consumers
[Some(group-02)]
[zk: localhost:2181(CONNECTED) 3] ls /consumers/Some(group-02)
[offsets]
[zk: localhost:2181(CONNECTED) 4] ls /consumers/Some(group-02)/offsets
[djt_db.test_schema1.result]
[zk: localhost:2181(CONNECTED) 5] ls /consumers/Some(group-02)/offsets/djt_db.test_schema1.result
[]
[zk: localhost:2181(CONNECTED) 33] get /consumers/Some(group-02)/offsets/djt_db.test_schema1.result
0:161074
cZxid = 0x1300228ae1
ctime = Wed Jul 21 19:50:31 CST 2021
mZxid = 0x130025886f
mtime = Thu Jul 22 09:21:30 CST 2021
pZxid = 0x1300258868
cversion = 2
dataVersion = 96
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 8
numChildren = 0
[zk: localhost:2181(CONNECTED) 4] set /consumers/Some(group-02)/offsets/djt_db.test_schema1.result 0:161075
cZxid = 0x1300228ae1
ctime = Wed Jul 21 19:50:31 CST 2021
mZxid = 0x13002c189c
mtime = Fri Jul 23 15:12:01 CST 2021
pZxid = 0x1300258868
cversion = 2
dataVersion = 152
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 8
numChildren = 0
[zk: localhost:2181(CONNECTED) 6] rmr /hbase
Note: zkCli has no rm command; use rmr (deleteall on ZooKeeper 3.5+) to remove a znode together with its children, as with /hbase here.
5. YARN parameter tuning
Problem 1: a Hive job hangs at "Tez session hasn't been created yet. Opening session".
Solution (reference: "Kerberos in practice"):
Many big-data services rely on YARN for resource scheduling, so before using the platform, check the YARN configuration to make sure jobs will not get stuck waiting for resources.
Suppose the cluster consists of three machines with 8 GB of memory each; two settings need adjusting:
- the memory allocated to YARN containers on each node;
- the scheduler's maximum ApplicationMaster resource percentage, which defaults to 0.2.
Web UI --> YARN configuration --> Basic settings --> "Memory allocated for all YARN containers on a node"; increase this value (on my cluster I later raised it to 500 GB).
Web UI --> YARN configuration --> Advanced settings --> Scheduler --> change yarn.scheduler.capacity.maximum-am-resource-percent; raise the percentage, for example to 0.8 (the maximum is 1).
If YARN is given too few resources, cluster jobs will hang. Save the modified configuration and restart the YARN service.
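To confirm the scheduler actually picked up the new values, one option (a sketch; substitute your own ResourceManager host and port) is to query the RM REST API:
# capacity-scheduler settings currently in effect
curl -s http://<rm-host>:8088/ws/v1/cluster/scheduler
# total memory/vcores as seen by YARN
curl -s http://<rm-host>:8088/ws/v1/cluster/metrics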
Problem 2: when resources run short, both ResourceManagers end up in Standby (normally one is Active and one is Standby), and sometimes jobs cannot be killed.
Solution: increase the yarn.resourcemanager.zk-timeout-ms parameter; I raised it from 10000 to 60000.
References:
- "Analysis of Hadoop YARN ResourceManager crashes caused by the ZooKeeper znode data size limit (part 2)"
- "ResourceManager keeps failing over between active and standby"
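To check which ResourceManager is currently active (rm1/rm2 are the usual serviceIds, but confirm them against yarn.resourcemanager.ha.rm-ids), a quick sketch:
# query each ResourceManager's HA state
for rid in rm1 rm2; do
  echo -n "$rid: "
  yarn rmadmin -getServiceState $rid
done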
6. CDH file permission issues
Right after installing a CDH cluster, running a Spark Streaming program failed with: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x
Solution:
[root@localhost ~]# hadoop fs -ls /
Found 2 items
drwxrwxrwt - hdfs supergroup 0 2020-11-15 12:22 /tmp
drwxr-xr-x - hdfs supergroup 0 2020-11-15 12:21 /user
[root@localhost ~]#
[root@localhost ~]# hadoop fs -chmod 777 /user
chmod: changing permissions of '/user': Permission denied. user=root is not the owner of inode=/user
[root@localhost ~]# sudo -u hdfs hadoop fs -chmod 777 /user
[root@localhost ~]# hadoop fs -ls /
Found 2 items
drwxrwxrwt - hdfs supergroup 0 2020-11-15 12:22 /tmp
drwxrwxrwx - hdfs supergroup 0 2020-11-15 12:21 /user
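Opening /user up to 777 works, but a narrower alternative (a sketch, assuming the job runs as root) is to create a proper home directory for that user as the hdfs superuser:
# create /user/root and hand it to root instead of relaxing /user itself
sudo -u hdfs hadoop fs -mkdir -p /user/root
sudo -u hdfs hadoop fs -chown root:root /user/root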