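Recently we found that a number of customer devices stopped working after rebooting from a power loss.
We examined the returned units. The visible symptom was a network failure: DNS names could not be resolved. Looking further, /etc/resolv.conf turned out to be empty, so applications had no DNS server to query. Other devices behind the same router, however, obtained their DNS server information normally. The dhcpc log showed that the client had received the DNS information but failed to write /etc/resolv.conf. As a result, applications could not read a DNS server from it.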
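We did not trace exactly how this dhcpc writes the file. One plausible way to end up with an empty /etc/resolv.conf is a hook that truncates the file first and then writes the new entries. When the filesystem is full, the truncation succeeds but the write does not. The Python sketch below shows the usual alternative. The function name `write_resolv_conf` is hypothetical and is not part of dhcpc. It writes a temporary file in the same directory and renames it over the target. A failed write then leaves the previous contents in place, and ENOSPC is reported instead of silently producing an empty file.

```python
import contextlib
import errno
import os
import sys
import tempfile


def write_resolv_conf(path, nameservers):
    """Atomically replace `path`; return False (old file kept) if the disk is full."""
    data = "".join(f"nameserver {ns}\n" for ns in nameservers)
    directory = os.path.dirname(os.path.abspath(path))
    tmp = None
    try:
        # The temp file must live on the same filesystem so the rename is atomic.
        fd, tmp = tempfile.mkstemp(dir=directory, prefix=".resolv.conf.")
        with os.fdopen(fd, "w") as f:
            os.fchmod(f.fileno(), 0o644)   # mkstemp creates files with mode 0600
            f.write(data)
            f.flush()                      # surface ENOSPC here rather than at close
            os.fsync(f.fileno())
        os.replace(tmp, path)              # readers see either the old or the new file
        return True
    except OSError as e:
        if tmp is not None:
            with contextlib.suppress(FileNotFoundError):
                os.unlink(tmp)
        if e.errno == errno.ENOSPC:
            print(f"{path}: no space left on device, keeping previous contents",
                  file=sys.stderr)
            return False
        raise
```

This does not fix a full disk. It only keeps a failed update from wiping out the last working configuration. The addresses you pass in are whatever the DHCP lease provides.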
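Why did the write fail? We checked the root filesystem with df and found that it was completely full. The output below shows all of the roughly 16M root filesystem in use: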
/ # df
Filesystem           1K-blocks      Used Available Use% Mounted on
overlay                  15863     15863         0 100% /
none                       492         0       492   0% /dev
run                     454124        56    454068   0% /run
shm                     454124         0    454124   0% /dev/shm
ubi2:exa_data_          780736     76516    699384  10% /config
ubi2:exa_data_          780736     76516    699384  10% /log
ubi2:exa_data_          780736     76516    699384  10% /tokens
tmpfs                      512         0       512   0% /dev/snd
tmpfs                      512         0       512   0% /dev/input/event0
tmpfs                      512         0       512   0% /dev/hbi
tmpfs                   454124      1652    452472   0% /tmp/ubus.sock
tmpfs                   454124      1652    452472   0% /run/dbus/system_bus_socket
cgroup_root              10240         0     10240   0% /sys/fs/cgroup
/ # lsof | grep deleted
318     /sbin/rc        /run/openrc/exclusive/bootmisc (deleted)
318     /sbin/rc        /run/openrc/exclusive/networking (deleted)
318     /sbin/rc        /run/openrc/exclusive/syslog (deleted)
318     /sbin/rc        /run/openrc/exclusive/avs-server (deleted)
/ # exit
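df does not look at files at all. It reports the block counters that the filesystem keeps in its superblock, as returned by statvfs(2). For an overlay mount, the kernel reports the statistics of the upper layer's filesystem, which here is the ext4 image. The minimal Python sketch below shows how df derives its columns. The helper name `df_kib` is ours. Use% rounding differs slightly between GNU and BusyBox df, and this version rounds up.

```python
import os


def df_kib(path):
    """Return (1K-blocks, Used, Available, Use%) for the filesystem holding `path`.

    Every number comes from the superblock counters exposed by statvfs(2);
    nothing here walks the directory tree.
    """
    st = os.statvfs(path)
    frsize = st.f_frsize                                  # fundamental block size in bytes
    total = st.f_blocks * frsize // 1024                  # "1K-blocks"
    used = (st.f_blocks - st.f_bfree) * frsize // 1024    # every block not marked free
    avail = st.f_bavail * frsize // 1024                  # free blocks usable by non-root
    # Use% is taken against used + available, so root-reserved blocks are excluded.
    denom = used + avail
    use_pct = (used * 100 + denom - 1) // denom if denom else 0   # integer ceiling
    return total, used, avail, use_pct


if __name__ == "__main__":
    import sys
    for mnt in sys.argv[1:] or ["/"]:
        total, used, avail, pct = df_kib(mnt)
        print(f"{mnt}: {total} {used} {avail} {pct}%")
```

If those counters are wrong, as they are in a corrupted image, df reports wrong numbers even when the files themselves are intact.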
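However, when we ran du in the directory where that filesystem is mounted, the files actually present took up far less space. The output below shows that upperdir occupies only about 7M: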
[router] /overlay # du -d 1
2 ./workdir
7043 ./upperdir
12 ./lost+found
7058 .
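du works the other way around. It walks the directory tree and adds up the blocks allocated to each file it finds (st_blocks, in 512-byte units). Some blocks are marked as in use by the filesystem even though no reachable file owns them. Those blocks never show up in du, yet they still count against df. Below is a rough Python equivalent of `du -sx`. The helper `du_kib` is hypothetical, and it skips some corner cases, such as symlinks to directories.

```python
import os


def _lstat(path):
    """lstat() that tolerates entries vanishing while we walk."""
    try:
        return os.lstat(path)
    except FileNotFoundError:
        return None


def du_kib(top):
    """Return KiB allocated under `top`, roughly like `du -sx`.

    Adds st_blocks * 512 for every directory and file reachable through
    directory entries, stays on one filesystem, does not follow symlinks,
    and counts each hard-linked inode only once.
    """
    root_dev = os.lstat(top).st_dev
    seen = set()
    total_bytes = 0
    for dirpath, dirnames, filenames in os.walk(top):
        # Do not descend into other filesystems (like du -x).
        kept = []
        for d in dirnames:
            s = _lstat(os.path.join(dirpath, d))
            if s is not None and s.st_dev == root_dev:
                kept.append(d)
        dirnames[:] = kept
        # "." counts the directory itself; then every non-directory entry.
        for name in [".", *filenames]:
            st = _lstat(os.path.join(dirpath, name))
            if st is None or (st.st_dev, st.st_ino) in seen:
                continue
            seen.add((st.st_dev, st.st_ino))
            total_bytes += st.st_blocks * 512   # st_blocks is in 512-byte units
    return total_bytes // 1024
```

Putting the two sketches side by side shows the gap: du only sees space owned by files, while df trusts the allocation counters, however they got that way.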
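So where did the rest of the space go? Next we checked the filesystem image itself, rootfs_overlay.img, with fsck.ext4. We answered "no" to every fix, so nothing was changed. The image turned out to be corrupted, and that corruption is why df reported incorrect numbers.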
~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2378, i_blocks is 2, should be 0. Fix? no
Deleted inode 2381 has zero dtime. Fix? no
Deleted inode 2386 has zero dtime. Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +2612 -(8525--8526) -(8708--8720) -8725 -8731 -(8769--8772) -9235 -(9239--9248) -10319 -(10380--10396) -(10781--10785) -10923 -(12353--12356) -(12649--12656) -(13513--13764)
Fix? no
Free blocks count wrong for group #1 (7063, counted=7062).
Fix? no
Free blocks count wrong (8361, counted=8356).
Fix? no
Inode bitmap differences: -2381 -2386
Fix? no
rootfs_overlay.img: ********** WARNING: Filesystem still has errors **********
rootfs_overlay.img: 351/4096 files (2.3% non-contiguous), 8023/16384 blocks
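The Pass 5 output uses a compact notation. `-N` means block N is marked as used in the on-disk bitmap, although no inode references it. `+N` means the opposite: the block is in use but not marked. `-(A--B)` is an inclusive range. Summing the ranges shows how many blocks the bitmap is holding on to for no reason. The small Python parser below is our own helper, and it assumes the `+N` / `-N` / `-(A--B)` token format shown above.

```python
import re

# One token: a sign followed by either "(A--B)" (inclusive range) or "N".
_TOKEN = re.compile(r"([+-])(?:\((\d+)--(\d+)\)|(\d+))")


def bitmap_diff_counts(line):
    """Count blocks in an e2fsck 'Block bitmap differences' line.

    Returns (plus, minus):
      plus  - blocks in use but not marked in the on-disk bitmap ("+N")
      minus - blocks marked in the bitmap but used by no inode ("-N")
    """
    plus = minus = 0
    for sign, lo, hi, _single in _TOKEN.findall(line):
        count = int(hi) - int(lo) + 1 if lo else 1
        if sign == "+":
            plus += count
        else:
            minus += count
    return plus, minus
```

For the line above, the parser finds 1 block used but unmarked and 320 blocks marked but unused, a net of 319 blocks. That matches the repair run that follows. There, the free-block count computed from the bitmaps rises from 8356 (the "counted" value in this first run) to 8675.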
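We then ran fsck on the image again, this time accepting the fixes, and a follow-up check reports it as clean. After we remounted the filesystem, the system worked normally again.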
~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 2378, i_blocks is 2, should be 0. Fix? yes
Deleted inode 2381 has zero dtime. Fix? yes
Deleted inode 2386 has zero dtime. Fix? yes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +2612 -(8525--8526) -(8708--8720) -8725 -8731 -(8769--8772) -9235 -(9239--9248) -10319 -(10380--10396) -(10781--10785) -10923 -(12353--12356) -(12649--12656) -(13513--13764)
Fix? yes
Free blocks count wrong for group #0 (1294, counted=1293).
Fix? yes
Free blocks count wrong for group #1 (7063, counted=7382).
Fix? yes
Free blocks count wrong (8361, counted=8675).
Fix? yes
Inode bitmap differences: -2381 -2386
Fix? yes
Free inodes count wrong for group #1 (1711, counted=1713).
Fix? yes
Free inodes count wrong (3745, counted=3747).
Fix ('a' enables 'yes' to all) ? yes to all
rootfs_overlay.img: ***** FILE SYSTEM WAS MODIFIED *****
rootfs_overlay.img: 349/4096 files (2.3% non-contiguous), 7709/16384 blocks
~$ fsck.ext4 rootfs_overlay.img
e2fsck 1.44.1 (24-Mar-2018)
rootfs_overlay.img: clean, 349/4096 files, 7709/16384 blocks
~$
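Filesystem corruption is a serious risk for embedded systems. It can easily brick a device and force a return for repair. Fortunately, ext4 is a journaling filesystem, so the filesystem can be recovered with the help of its journal. Therefore, when the device boots, always check the filesystem with fsck before mounting it, and repair any error it finds immediately.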
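Below is a minimal boot-time sketch of that policy in Python. It assumes the overlay image is checked with `e2fsck` before it is mounted. It runs e2fsck in preen mode (`-p`, which fixes problems that are safe to fix automatically). It adds `-f` to force a full check even when the superblock claims the filesystem is clean. It then interprets the exit status using the bit values documented in e2fsck(8). The helper names are our own. What to do on failure is a product decision: you could reformat, restore a factory image, or stay in recovery mode.

```python
import subprocess
import sys

# Exit-status bits documented in e2fsck(8); several may be OR-ed together.
ERRORS_CORRECTED = 1
REBOOT_NEEDED = 2
ERRORS_UNCORRECTED = 4
OPERATIONAL_ERROR = 8
USAGE_ERROR = 16
CANCELED = 32
LIBRARY_ERROR = 128

FATAL = ERRORS_UNCORRECTED | OPERATIONAL_ERROR | USAGE_ERROR | CANCELED | LIBRARY_ERROR


def decide(status):
    """Map an e2fsck exit status to 'mount', 'reboot' or 'fail'."""
    if status < 0 or status & FATAL:    # negative: e2fsck was killed by a signal
        return "fail"
    if status & REBOOT_NEEDED:
        return "reboot"
    return "mount"                      # 0: clean, 1: errors were corrected


def check_before_mount(device):
    """Run a forced preen-mode check on `device` (block device or image file)."""
    try:
        # -f: check even if the superblock says "clean"
        # -p: automatically fix problems that are safe to fix without asking
        result = subprocess.run(["e2fsck", "-f", "-p", device])
    except OSError:                     # e2fsck missing or not executable
        return "fail"
    return decide(result.returncode)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} DEVICE_OR_IMAGE")
    action = check_before_mount(sys.argv[1])
    print(action)
    sys.exit(0 if action == "mount" else 1)
```

A "fail" result means preen mode hit something it will not fix on its own. A recovery path could rerun e2fsck with `-y`, as was done by hand above, before attempting the mount.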