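As our data volume keeps growing, we have run into a number of problems while using MFS. I have analyzed and summarized them here, and share the results below:
First, the two kinds of messages that show up most often when MFS has problems:
- Connection interruptions
- Bad chunks
When a connection is interrupted, the Master logs errors like the following:
- mfsmaster[15861]: connection with client(ip:10.11.18.175) has been closed by peer
  - The connection between a Client and the Master was interrupted.
- mfsmaster[15861]: connection with ML(10.11.19.76) has been closed by peer
  - The connection between the Metalogger and the Master was interrupted.
- mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer
  - The connection between a ChunkServer and the Master was interrupted.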
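To see which kind of peer is dropping most often, the three message forms above can be tallied with a small script. This is only a sketch: the regular expression is derived from the sample lines quoted here, and the names `PEER_CLOSED`, `ROLE_NAMES`, and `count_peer_closed` are made up for illustration; other MFS versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern derived from the sample master log lines quoted above (an assumption,
# not an official format), e.g.:
#   mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer
PEER_CLOSED = re.compile(
    r"connection with (client|ML|CS)\((?:ip:)?([\d.]+)\) has been closed by peer"
)

# Map the short tag in the log line to the component it stands for.
ROLE_NAMES = {"client": "Client", "ML": "Metalogger", "CS": "ChunkServer"}


def count_peer_closed(lines):
    """Count 'closed by peer' events per (role, ip) pair."""
    counts = Counter()
    for line in lines:
        match = PEER_CLOSED.search(line)
        if match:
            role, ip = match.groups()
            counts[(ROLE_NAMES[role], ip)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "mfsmaster[15861]: connection with client(ip:10.11.18.175) has been closed by peer",
        "mfsmaster[15861]: connection with ML(10.11.19.76) has been closed by peer",
        "mfsmaster[15861]: connection with CS(10.11.18.199) has been closed by peer",
    ]
    for (role, ip), n in sorted(count_peer_closed(sample).items()):
        print(f"{role:<12} {ip:<15} {n}")
```

Any iterable of lines works as input, so an open log file can be passed to `count_peer_closed` directly.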
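Possible causes:
- Brief network blips. This is normal; MFS reconnects automatically and no harm is done.
- The Client or ChunkServer closes the connection itself, for example when its process is killed. This produces the same error.
- The connection from a ChunkServer or Client to the Master times out and is dropped. A timeout can have two causes:
  - (a) The Client sends too many requests, the Master's request queue fills up, and the connection times out.
  - (b) The network responds slowly (as distinct from a brief blip).
Solutions:
- Interruptions from the first two causes (network blips and deliberate disconnects) can be ignored. The ones to watch are the timeouts:
  - For (a): throttle requests on the Client side, for example very highly concurrent reads, writes, and deletes. Another operation to watch is ls. On Linux, a command that has to expand a very large number of file names from a single directory (say 100,000) can fail with an "Argument list too long" error. Traversing a directory in MFS needs the same care: if the directory holds too many files, the traversal can time out and the connection gets dropped.
  - For (b): allocate bandwidth sensibly and improve the network environment.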
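One way to keep directory traversal and request rate under control on the Client side is to stream directory entries and work through them in small batches. The sketch below is an illustration only: `iter_files`, `process_in_batches`, the batch size, and the pause length are all hypothetical, and whether this pacing is enough to avoid Master timeouts on a given cluster is an assumption that needs testing.

```python
import os
import time
from itertools import islice


def iter_files(directory):
    """Yield regular-file paths one at a time instead of building a full list."""
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                yield entry.path


def process_in_batches(directory, handle, batch_size=1000, pause_seconds=0.1):
    """Call handle(path) for every file, pausing after each batch.

    batch_size and pause_seconds are illustrative defaults, not tuned values.
    Returns the number of files handled.
    """
    paths = iter_files(directory)
    total = 0
    while True:
        batch = list(islice(paths, batch_size))
        if not batch:
            break
        for path in batch:
            handle(path)
        total += len(batch)
        time.sleep(pause_seconds)  # crude rate limit between batches
    return total
```

For example, `process_in_batches("/mnt/mfs/some_dir", os.remove, batch_size=500, pause_seconds=0.5)` would delete files in paced batches; the mount path here is a placeholder.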
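Note:
After a Client or ChunkServer loses its connection to the Master, it automatically reconnects (Reconnection) and registers (Register) again.
Bad chunks produce errors like the following on the Master: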
- mfsmaster[3250]: chunkserver has nonexistent chunk (000000000002139F_00000001), so create it for future deletion
- mfsmaster[3250]: (10.11.18.199:9422) chunk: 000000000002139F creation status: 20
- mfsmaster[3250]: chunk 000000000002139F has only invalid copies (1) – please repair it manually
- mfsmaster[3250]: chunk 000000000002139F_00000001 – invalid copy on (10.11.18.199 – ver:00000000)
- mfsmaster[3250]: currently unavailable chunk 000000000002139F (inode: 135845 ; index: 23)
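These log entries mean that the Master holds metadata for a chunk that the ChunkServer does not have. The system automatically creates the chunk on the ChunkServer so that it can be deleted later. Because the newly created chunk has no content, it is an invalid copy, and the chunk cannot be accessed.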
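When many such messages appear, it helps to collect the affected chunk IDs together with the inode and chunk index they belong to. The following is a minimal sketch that assumes the message formats are exactly as in the sample lines above; `UNAVAILABLE`, `INVALID_ONLY`, and `find_damaged_chunks` are names invented for this example.

```python
import re

# Patterns derived from the sample master log lines above (an assumption).
UNAVAILABLE = re.compile(
    r"currently unavailable chunk (?P<chunk>[0-9A-Fa-f]{16}) "
    r"\(inode: (?P<inode>\d+) ; index: (?P<index>\d+)\)"
)
INVALID_ONLY = re.compile(
    r"chunk (?P<chunk>[0-9A-Fa-f]{16}) has only invalid copies \((?P<copies>\d+)\)"
)


def find_damaged_chunks(lines):
    """Return {chunk_id: {"invalid_copies": int, "files": [(inode, index), ...]}}."""
    damaged = {}
    for line in lines:
        match = INVALID_ONLY.search(line)
        if match:
            info = damaged.setdefault(
                match.group("chunk"), {"invalid_copies": 0, "files": []}
            )
            info["invalid_copies"] = int(match.group("copies"))
            continue
        match = UNAVAILABLE.search(line)
        if match:
            info = damaged.setdefault(
                match.group("chunk"), {"invalid_copies": 0, "files": []}
            )
            location = (int(match.group("inode")), int(match.group("index")))
            if location not in info["files"]:
                info["files"].append(location)
    return damaged


if __name__ == "__main__":
    sample = [
        "mfsmaster[3250]: chunk 000000000002139F has only invalid copies (1)",
        "mfsmaster[3250]: currently unavailable chunk 000000000002139F (inode: 135845 ; index: 23)",
    ]
    for chunk, info in find_damaged_chunks(sample).items():
        print(chunk, info["invalid_copies"], info["files"])
```

The resulting inode numbers identify which files need attention before any manual repair is attempted.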
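There are many possible causes, for example:
- During a large file transfer from a Client, the master host's power was pulled, so the master shut down uncleanly. After repair with mfsmetarestore -a, the master log reports bad chunks.
- The location that holds the ChunkServer's csstats.mfs runs out of space, so chunks cannot be written. This also causes chunk errors.
- Chunk files on a ChunkServer are deleted by hand.
- After files are deleted, the Master terminates abnormally and is restarted without using changelog.mfs for recovery. This also causes bad chunks.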
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 24' |wc -l
181
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 23' |wc -l
235
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 19' |wc -l
2
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 18' |wc -l
4
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 10' |wc -l
13
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.45'|grep 'Jan 22' |wc -l
375
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.42'|grep 'Jan 22' |wc -l
394
[root@c1 log]# cat messages|grep 'chunkserver disconnected - ip: 192.168.2.42'|grep 'Jan 21' |wc -l
120
So many reconnections every day? Why? An unstable network? Increased I/O? Try connecting with a direct cable and see!
Jan 24 11:12:14 c1 mfsmaster[28849]: connection with CS(192.168.2.45) has been closed by peer
Jan 24 11:12:14 c1 mfsmaster[28849]: chunkserver disconnected - ip: 192.168.2.45, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Jan 24 11:12:16 c1 mfsmaster[28849]: chunk-server already connected !!!
Jan 24 11:12:16 c1 mfsmaster[28849]: connection with CS(192.168.2.42) has been closed by peer
Jan 24 11:12:16 c1 mfsmaster[28849]: chunkserver disconnected - ip: 192.168.2.42, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Jan 24 11:12:18 c1 mfsmaster[28849]: chunkserver disconnected - ip: 192.168.2.44, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Jan 24 11:12:21 c1 mfsmaster[28849]: chunkserver register begin (packet version: 5) - ip: 192.168.2.44, port: 9422
Jan 24 11:12:21 c1 mfsmaster[28849]: chunk-server already connected !!!
Jan 24 11:12:21 c1 mfsmaster[28849]: connection with CS(192.168.2.42) has been closed by peer
Jan 24 11:12:21 c1 mfsmaster[28849]: chunkserver disconnected - ip: 192.168.2.42, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Jan 24 11:12:23 c1 mfsmaster[28849]: chunk-server already connected !!!
Jan 24 11:12:23 c1 mfsmaster[28849]: connection with CS(192.168.2.45) has been closed by peer
Jan 24 11:12:28 c1 mfsmaster[28849]: chunkserver disconnected - ip: 192.168.2.45, port: 9422, usedspace: 0 (0.00 GiB), totalspace: 0 (0.00 GiB)
Jan 24 11:12:30 c1 mfsmaster[28849]: chunk-server already connected !!!
Jan 24 11:12:30 c1 mfsmaster[28849]: connection with CS(192.168.2.42) has been closed by peer
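To judge how quickly each ChunkServer is flapping, the time between consecutive disconnects can be computed from the same log lines. This is a rough sketch under stated assumptions: syslog timestamps carry no year, so a placeholder year is supplied (the `year` parameter, defaulting to the leap year 2000, is an arbitrary choice), month names are parsed in an English locale, and `EVENT` and `disconnect_gaps` are hypothetical names.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumes the syslog line format shown in the excerpts above.
EVENT = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) .*"
    r"chunkserver disconnected - ip: (?P<ip>[\d.]+)"
)


def disconnect_gaps(lines, year=2000):
    """Return {ip: [seconds between consecutive disconnects]}.

    year is a placeholder because syslog timestamps omit it; gaps that
    cross a year boundary would be wrong.
    """
    times = defaultdict(list)
    for line in lines:
        match = EVENT.match(line)
        if match:
            stamp = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            times[match.group("ip")].append(stamp)
    gaps = {}
    for ip, stamps in times.items():
        stamps.sort()
        gaps[ip] = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
    return gaps
```

Gaps of only a few seconds, repeated across several ChunkServers at once, would point toward a shared cause such as the network or the Master rather than a single faulty node.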