1. Copying Data Between Clusters
Use scp to copy files between two remote hosts.
Copy a local file to a remote host:
[root@hadoop001 hadoop-2.6.5]# scp flow.jar root@hadoop002:/opt/module/hadoop-2.6.5
flow.jar 100% 38MB 28.7MB/s 00:01
Copy a file from a remote host to the local machine:
[root@hadoop001 hadoop-2.6.5]# rm -r flow.jar
rm: remove regular file ‘flow.jar’? y
[root@hadoop001 hadoop-2.6.5]# scp root@hadoop002:/opt/module/hadoop-2.6.5/flow.jar flow.jar
flow.jar 100% 38MB 44.8MB/s 00:00
Transfer a file between two remote hosts:
[root@hadoop003 hadoop-2.6.5]# scp -r root@hadoop001:/opt/module/hadoop-2.6.5/flow.jar root@hadoop002:/opt/module/hadoop-2.6.5/flow.jar
Copying data between two Hadoop clusters
Use the distcp command to copy data recursively between two Hadoop clusters:
[root@hadoop002 hadoop-2.6.5]# bin/hadoop distcp hdfs://hadoop002:9000/user/data/ hdfs://hadoop105:9000/user/data/
2. Hadoop Archives (HAR)
Motivation:
If HDFS holds a huge number of small files (for example, ten years of weather-bureau records stored as individual text files), each file occupies only its actual size on disk rather than a full 128 MB block, but every file is still stored as its own block, and the NameNode must keep a record of each block in memory. So although no disk space is wasted, the memory cost on the NameNode is enormous.
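To make the overhead concrete, here is a back-of-the-envelope sketch. The ~150 bytes per NameNode namespace object is a commonly cited approximation, not an exact figure, and the block count for the archived case is made up for illustration:

```python
# Rough sketch: the NameNode keeps an in-memory object for every file and
# every block. ~150 bytes per object is a commonly cited approximation
# (an assumption here, not an exact number).
BYTES_PER_OBJECT = 150

def namenode_bytes(num_files, blocks_per_file=1):
    """Approximate NameNode heap used by file objects plus block objects."""
    file_objects = num_files * BYTES_PER_OBJECT
    block_objects = num_files * blocks_per_file * BYTES_PER_OBJECT
    return file_objects + block_objects

# 10 million small files, one block each: ~3 GB of NameNode heap.
small = namenode_bytes(10_000_000)
# The same data packed into a single archive file of, say, 80 blocks.
archived = namenode_bytes(1, blocks_per_file=80)

print(small // 2**20, "MiB vs", archived, "bytes")
```

The point is that the cost scales with the number of files and blocks, not with the bytes stored, which is exactly what archiving attacks.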
A Hadoop archive (HAR file) is a more efficient archiving tool: it packs files into HDFS blocks, reducing NameNode memory usage while still allowing transparent access to the individual files. In particular, a Hadoop archive can be used directly as MapReduce input.
Hands-on:
- Start HDFS first, then start YARN (the archiving process runs a MapReduce job, which needs YARN):
[root@hadoop001 sbin]# ./start-dfs.sh
Starting namenodes on [hadoop001]
hadoop001: starting namenode, logging to /opt/module/hadoop-2.6.5/logs/hadoop-root-namenode-hadoop001.out
hadoop001: starting datanode, logging to /opt/module/hadoop-2.6.5/logs/hadoop-root-datanode-hadoop001.out
hadoop002: starting datanode, logging to /opt/module/hadoop-2.6.5/logs/hadoop-root-datanode-hadoop002.out
hadoop003: starting datanode, logging to /opt/module/hadoop-2.6.5/logs/hadoop-root-datanode-hadoop003.out
Starting secondary namenodes [hadoop003]
hadoop003: starting secondarynamenode, logging to /opt/module/hadoop-2.6.5/logs/hadoop-root-secondarynamenode-hadoop003.out
[root@hadoop001 sbin]# jps
3984 DataNode
3883 NameNode
4220 Jps
[root@hadoop002 sbin]# ./start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /opt/module/hadoop-2.6.5/logs/yarn-root-resourcemanager-hadoop002.out
hadoop001: starting nodemanager, logging to /opt/module/hadoop-2.6.5/logs/yarn-root-nodemanager-hadoop001.out
hadoop002: nodemanager running as process 3169. Stop it first.
hadoop003: starting nodemanager, logging to /opt/module/hadoop-2.6.5/logs/yarn-root-nodemanager-hadoop003.out
[root@hadoop002 sbin]# jps
3108 ResourceManager
2907 DataNode
3387 Jps
- Archiving produces a directory named xxx.har containing the archive's data files. The xxx.har directory is handled as a unit; treat it as a single archive file.
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/data/input
drwxr-xr-x - root supergroup 0 2018-08-15 19:50 /user/data/input/input
-rw-r--r-- 3 root supergroup 6397 2018-08-15 19:50 /user/data/input/input/combiner.txt
-rw-r--r-- 3 root supergroup 104 2018-08-15 19:50 /user/data/input/input/filter.txt
-rw-r--r-- 3 root supergroup 39654 2018-08-15 19:50 /user/data/input/input/log.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 19:50 /user/data/input/input/oneindex1.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 19:50 /user/data/input/input/oneindex2.txt
-rw-r--r-- 3 root supergroup 64 2018-08-15 19:50 /user/data/input/input/order.txt
-rw-r--r-- 3 root supergroup 116 2018-08-15 19:50 /user/data/input/input/part-r-00000
-rw-r--r-- 3 root supergroup 31 2018-08-15 19:50 /user/data/input/input/pd.txt
-rw-r--r-- 3 root supergroup 34 2018-08-15 19:50 /user/data/input/input/pf.txt
-rw-r--r-- 3 root supergroup 1429 2018-08-15 19:50 /user/data/input/input/phone_data.txt
-rw-r--r-- 3 root supergroup 53 2018-08-15 19:50 /user/data/input/input/xiaoxiao.txt
[root@hadoop001 hadoop-2.6.5]# hadoop archive -archiveName input.har -p /user/data/input /user/my
# As the log below shows, the archive operation runs a MapReduce job
18/08/15 19:51:41 INFO client.RMProxy: Connecting to ResourceManager at hadoop002/192.168.170.132:8032
18/08/15 19:51:42 INFO client.RMProxy: Connecting to ResourceManager at hadoop002/192.168.170.132:8032
18/08/15 19:51:42 INFO client.RMProxy: Connecting to ResourceManager at hadoop002/192.168.170.132:8032
18/08/15 19:51:43 INFO mapreduce.JobSubmitter: number of splits:1
18/08/15 19:51:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1534387622052_0001
18/08/15 19:51:44 INFO impl.YarnClientImpl: Submitted application application_1534387622052_0001
18/08/15 19:51:44 INFO mapreduce.Job: The url to track the job: http://hadoop002:8088/proxy/application_1534387622052_0001/
18/08/15 19:51:44 INFO mapreduce.Job: Running job: job_1534387622052_0001
18/08/15 19:51:55 INFO mapreduce.Job: Job job_1534387622052_0001 running in uber mode : false
18/08/15 19:51:55 INFO mapreduce.Job: map 0% reduce 0%
18/08/15 19:52:06 INFO mapreduce.Job: map 100% reduce 0%
18/08/15 19:52:15 INFO mapreduce.Job: map 100% reduce 100%
18/08/15 19:52:16 INFO mapreduce.Job: Job job_1534387622052_0001 completed successfully
18/08/15 19:52:16 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=1181
FILE: Number of bytes written=219789
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=49101
HDFS: Number of bytes written=49145
HDFS: Number of read operations=36
HDFS: Number of large read operations=0
HDFS: Number of write operations=7
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=7977
Total time spent by all reduces in occupied slots (ms)=6514
Total time spent by all map tasks (ms)=7977
Total time spent by all reduce tasks (ms)=6514
Total vcore-milliseconds taken by all map tasks=7977
Total vcore-milliseconds taken by all reduce tasks=6514
Total megabyte-milliseconds taken by all map tasks=8168448
Total megabyte-milliseconds taken by all reduce tasks=6670336
Map-Reduce Framework
Map input records=13
Map output records=13
Map output bytes=1148
Map output materialized bytes=1181
Input split bytes=116
Combine input records=0
Combine output records=0
Reduce input groups=13
Reduce shuffle bytes=1181
Reduce input records=13
Reduce output records=0
Spilled Records=26
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=177
CPU time spent (ms)=2250
Physical memory (bytes) snapshot=396795904
Virtual memory (bytes) snapshot=4203311104
Total committed heap usage (bytes)=271581184
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=959
File Output Format Counters
Bytes Written=0
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/my
# Directory structure after archiving
drwxr-xr-x - root supergroup 0 2018-08-15 19:52 /user/my/input.har
-rw-r--r-- 3 root supergroup 0 2018-08-15 19:52 /user/my/input.har/_SUCCESS
-rw-r--r-- 5 root supergroup 1095 2018-08-15 19:52 /user/my/input.har/_index
-rw-r--r-- 5 root supergroup 24 2018-08-15 19:52 /user/my/input.har/_masterindex
-rw-r--r-- 3 root supergroup 48026 2018-08-15 19:52 /user/my/input.har/part-0
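The listing shows the essence of a HAR: the small files' contents live in a single part-0 data file, and _index/_masterindex map each original path to its location inside it. This toy sketch illustrates the concept only; the real HAR on-disk format is more involved:

```python
# Toy illustration of the HAR idea: concatenate many small files into one
# "part" file and keep a small index of (offset, length) per file name.
# This is NOT the real HAR on-disk format, just the concept behind it.

def archive(files):
    """files: dict of name -> bytes. Returns (part_data, index)."""
    part, index, offset = bytearray(), {}, 0
    for name, data in files.items():
        index[name] = (offset, len(data))   # where the file lives in part
        part.extend(data)
        offset += len(data)
    return bytes(part), index

def read(part, index, name):
    """Transparent access: look up the index, slice the part file."""
    offset, length = index[name]
    return part[offset:offset + length]

part, index = archive({"pd.txt": b"hello", "pf.txt": b"world!"})
assert read(part, index, "pf.txt") == b"world!"
```

However many small files go in, HDFS only has to track the blocks of the one part file plus the tiny index files.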
- Inspect the archive
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R har:///user/my/input.har
drwxr-xr-x - root supergroup 0 2018-08-15 19:50 har:///user/my/input.har/input
-rw-r--r-- 3 root supergroup 6397 2018-08-15 19:50 har:///user/my/input.har/input/combiner.txt
-rw-r--r-- 3 root supergroup 104 2018-08-15 19:50 har:///user/my/input.har/input/filter.txt
-rw-r--r-- 3 root supergroup 39654 2018-08-15 19:50 har:///user/my/input.har/input/log.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 19:50 har:///user/my/input.har/input/oneindex1.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 19:50 har:///user/my/input.har/input/oneindex2.txt
-rw-r--r-- 3 root supergroup 64 2018-08-15 19:50 har:///user/my/input.har/input/order.txt
-rw-r--r-- 3 root supergroup 116 2018-08-15 19:50 har:///user/my/input.har/input/part-r-00000
-rw-r--r-- 3 root supergroup 31 2018-08-15 19:50 har:///user/my/input.har/input/pd.txt
-rw-r--r-- 3 root supergroup 34 2018-08-15 19:50 har:///user/my/input.har/input/pf.txt
-rw-r--r-- 3 root supergroup 1429 2018-08-15 19:50 har:///user/my/input.har/input/phone_data.txt
-rw-r--r-- 3 root supergroup 53 2018-08-15 19:50 har:///user/my/input.har/input/xiaoxiao.txt
- Unpack the archive
[root@hadoop001 hadoop-2.6.5]# hadoop fs -cp har:///user/my/input.har /user/data/har/
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/data/har
drwxr-xr-x - root supergroup 0 2018-08-15 20:01 /user/data/har/input.har
drwxr-xr-x - root supergroup 0 2018-08-15 20:01 /user/data/har/input.har/input
-rw-r--r-- 3 root supergroup 6397 2018-08-15 20:01 /user/data/har/input.har/input/combiner.txt
-rw-r--r-- 3 root supergroup 104 2018-08-15 20:01 /user/data/har/input.har/input/filter.txt
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:01 /user/data/har/input.har/input/log.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 20:01 /user/data/har/input.har/input/oneindex1.txt
-rw-r--r-- 3 root supergroup 72 2018-08-15 20:01 /user/data/har/input.har/input/oneindex2.txt
-rw-r--r-- 3 root supergroup 64 2018-08-15 20:01 /user/data/har/input.har/input/order.txt
-rw-r--r-- 3 root supergroup 116 2018-08-15 20:01 /user/data/har/input.har/input/part-r-00000
-rw-r--r-- 3 root supergroup 31 2018-08-15 20:01 /user/data/har/input.har/input/pd.txt
-rw-r--r-- 3 root supergroup 34 2018-08-15 20:01 /user/data/har/input.har/input/pf.txt
-rw-r--r-- 3 root supergroup 1429 2018-08-15 20:01 /user/data/har/input.har/input/phone_data.txt
-rw-r--r-- 3 root supergroup 53 2018-08-15 20:01 /user/data/har/input.har/input/xiaoxiao.txt
3. Snapshot Management
A snapshot is essentially a backup of a directory. Creating one does not immediately copy any files; the snapshot points at the same files, and new copies are made only when writes occur (copy-on-write).
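The copy-on-write behavior can be sketched conceptually as follows (a toy model, not how HDFS implements snapshots internally):

```python
# Conceptual copy-on-write sketch: a snapshot records references to the
# current file contents; nothing is duplicated until a file is rewritten.
class Dir:
    def __init__(self):
        self.files = {}          # name -> bytes (current contents)
        self.snapshots = {}      # snapshot name -> {name: bytes}

    def create_snapshot(self, snap):
        # Shallow copy: shares the existing (immutable) contents objects,
        # so creating a snapshot costs almost nothing.
        self.snapshots[snap] = dict(self.files)

    def write(self, name, data):
        # The "copy on write": old contents stay referenced by snapshots;
        # the current directory simply points at a new object.
        self.files[name] = data

d = Dir()
d.write("log.txt", b"v1")
d.create_snapshot("mylog")
d.write("log.txt", b"v2")        # the snapshot still sees v1
assert d.snapshots["mylog"]["log.txt"] == b"v1"
assert d.files["log.txt"] == b"v2"
```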
Hands-on:
Files in the original HDFS path:
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/data/input
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:13 /user/data/input/log.txt
- Enable snapshots on the directory:
[root@hadoop001 hadoop-2.6.5]# hdfs dfsadmin -allowSnapshot /user/data/input
Allowing snaphot on /user/data/input succeeded
- Create a snapshot of the directory
[root@hadoop001 hadoop-2.6.5]# hdfs dfs -createSnapshot /user/data/input
Created snapshot /user/data/input/.snapshot/s20180815-200627.868
- List snapshots
[root@hadoop001 hadoop-2.6.5]# hdfs dfs -ls -R /user/data/input/.snapshot
drwxr-xr-x - root supergroup 0 2018-08-15 20:15 /user/data/input/.snapshot/log
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:13 /user/data/input/.snapshot/log/log.txt
- Rename a snapshot
[root@hadoop001 hadoop-2.6.5]# hdfs dfs -renameSnapshot /user/data/input log mylog
[root@hadoop001 hadoop-2.6.5]# hdfs dfs -ls -R /user/data/input/.snapshot/mylog
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:13 /user/data/input/.snapshot/mylog/log.txt
- List all snapshottable directories for the current user
[root@hadoop001 hadoop-2.6.5]# hdfs lsSnapshottableDir
drwxr-xr-x 0 root supergroup 0 2018-08-15 20:15 1 65536 /user/data/input
- Now upload a new file and compare against the earlier snapshot:
[root@hadoop001 hadoop-2.6.5]# hadoop fs -input input/order.txt /user/data/input
-input: Unknown command
[root@hadoop001 hadoop-2.6.5]# hadoop fs -put input/order.txt /user/data/input
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/data/input
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:13 /user/data/input/log.txt
-rw-r--r-- 3 root supergroup 64 2018-08-15 20:23 /user/data/input/order.txt
[root@hadoop001 hadoop-2.6.5]# hdfs snapshotDiff /user/data/input . .snapshot/mylog
Difference between current directory and snapshot mylog under directory /user/data/input:
M .
- ./order.txt
# The '-' means order.txt exists in the current directory ('.') but not in the snapshot mylog: it was added after the snapshot was taken
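The diff can be read as a set comparison between the "from" listing (here the current directory) and the "to" listing (the snapshot). A minimal sketch of that semantics:

```python
# Sketch of snapshotDiff semantics: the output describes the change from
# the "from" listing to the "to" listing.
#   '-' = present in from, absent in to
#   '+' = absent in from, present in to
#   'M' = the directory itself was modified
def snapshot_diff(from_files, to_files):
    lines = []
    if from_files != to_files:
        lines.append("M\t.")
    for f in sorted(from_files - to_files):
        lines.append(f"-\t./{f}")
    for f in sorted(to_files - from_files):
        lines.append(f"+\t./{f}")
    return lines

current = {"log.txt", "order.txt"}
snapshot = {"log.txt"}
print(snapshot_diff(current, snapshot))   # ['M\t.', '-\t./order.txt']
```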
- Restore from a snapshot
[root@hadoop001 hadoop-2.6.5]# hadoop fs -mkdir -p /user/data/input2
[root@hadoop001 hadoop-2.6.5]# hdfs dfs -cp /user/data/input/.snapshot/mylog /user/data/input2
[root@hadoop001 hadoop-2.6.5]# hadoop fs -ls -R /user/data/input2
drwxr-xr-x - root supergroup 0 2018-08-15 20:26 /user/data/input2/mylog
-rw-r--r-- 3 root supergroup 39654 2018-08-15 20:26 /user/data/input2/mylog/log.txt
- Delete a snapshot
[root@hadoop001 sbin]# hdfs dfs -deleteSnapshot /user/data/input mylog
4. The Trash
Default trash behavior
With the trash feature enabled, a deleted file is moved into the trash together with its deletion time. Every fs.trash.checkpoint.interval minutes the trash is scanned, files that have been there for fs.trash.interval minutes are permanently deleted, and the cycle repeats.
- Default: fs.trash.interval=0; 0 disables the trash. Set it to the number of minutes a deleted file should survive.
- Default: fs.trash.checkpoint.interval=0; the number of minutes between trash scans.
- Constraint: fs.trash.checkpoint.interval <= fs.trash.interval.
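A small sketch of how the two settings interact (a simplification of the real TrashPolicyDefault, which keeps timestamped per-user checkpoint directories):

```python
# Simplified model of the trash emptier: it wakes up every
# fs.trash.checkpoint.interval minutes, checkpoints Current, and deletes
# checkpoints older than fs.trash.interval minutes.

def validate(trash_interval, checkpoint_interval):
    """fs.trash.checkpoint.interval must be <= fs.trash.interval;
    0 means 'use fs.trash.interval'."""
    if checkpoint_interval == 0:
        checkpoint_interval = trash_interval
    assert checkpoint_interval <= trash_interval
    return trash_interval, checkpoint_interval

def expired(checkpoint_age_minutes, trash_interval):
    """A checkpoint is permanently deleted once it reaches fs.trash.interval."""
    return checkpoint_age_minutes >= trash_interval

assert validate(2, 0) == (2, 2)          # checkpoint interval defaults to 2
assert expired(3, trash_interval=2)      # old enough: gone for good
assert not expired(1, trash_interval=2)  # still recoverable
```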
Changing the parameters
Edit core-site.xml, setting the trash retention time to 1 minute and the scan interval to 1 minute:
<property>
<name>fs.trash.interval</name>
<value>1</value>
<description>Number of minutes after which the checkpoint
gets deleted. If zero, the trash feature is disabled.
This option may be configured both on the server and the
client. If trash is disabled server side then the client
side configuration is checked. If trash is enabled on the
server side then the value configured on the server is
used and the client configuration value is ignored.
</description>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1</value>
<description>Number of minutes between trash checkpoints.
Should be smaller or equal to fs.trash.interval. If zero,
the value is set to the value of fs.trash.interval.
Every time the checkpointer runs it creates a new checkpoint
out of current and removes checkpoints created more than
fs.trash.interval minutes ago.
</description>
</property>
- After restarting the cluster, delete a file:
[root@hadoop001 sbin]# hadoop fs -rmr /user/data/input/log.txt
rmr: DEPRECATED: Please use 'rm -r' instead.
18/08/15 20:48:43 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 2 minutes, Emptier interval = 1 minutes.
Moved: 'hdfs://hadoop001:8020/user/data/input/log.txt' to trash at: hdfs://hadoop001:8020/user/root/.Trash/Current
# The file has been moved to the trash
- Trying to browse the trash in the web UI fails: the static web user does not have access. Change it:
<property>
<description>
The user name to filter as, on static web filters
while rendering content. An example use is the HDFS
web UI (user to be used for browsing files).
</description>
<name>hadoop.http.staticuser.user</name>
<value>root</value>
</property>
Restart the cluster, delete a file, and check the trash:
Two minutes later, the file is gone.
Files deleted through the Java API do not pass through the trash; call moveToTrash() explicitly to send them there:
Trash trash = new Trash(conf);
trash.moveToTrash(path);
- Restore data from the trash
[root@hadoop001 sbin]# hadoop fs -mv /user/root/.Trash/Current/user/data/input/log.txt /user/data/input
- Empty the trash
[root@hadoop001 sbin]# hdfs dfs -expunge
18/08/15 20:59:10 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 2 minutes, Emptier interval = 1 minutes.