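First, the operating system: Ubuntu 14.04.1.
Some time ago the server failed and had to be shut down once. After it came back up, the database would not start, which led to a chain of further problems. This post summarizes them.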
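The database is PostgreSQL (hereafter "Postgres"). Every time it starts, it first looks in its data directory for the file postmaster.pid. Postgres creates this lock file when it starts and deletes it automatically on a clean shutdown. So on each startup it checks whether the file exists; if it does and Postgres believes the server it names may still be running, the startup fails. Deleting the file (after confirming that no postgres process is actually running) fixes this. Because my server was hard-rebooted, the file was left behind.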
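As a rough illustration of the stale-lock-file check described above, here is a minimal Python sketch. It relies on the first line of postmaster.pid being the postmaster's PID; the function names are made up for this example. It only tells you whether some process with that PID exists. After a reboot an unrelated process may have reused the PID, so treat the result as a hint rather than proof.

```python
import os


def read_postmaster_pid(pid_file):
    """Return the PID stored on the first line of postmaster.pid, or None."""
    try:
        with open(pid_file) as fh:
            first_line = fh.readline().strip()
    except FileNotFoundError:
        return None
    if first_line.isdigit() and int(first_line) > 0:
        return int(first_line)
    return None


def process_exists(pid):
    """Signal 0 only performs the existence/permission check; nothing is sent."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pid_file_is_stale(pid_file):
    """True if the file names a PID that no longer belongs to any process."""
    pid = read_postmaster_pid(pid_file)
    return pid is not None and not process_exists(pid)
```

Calling pid_file_is_stale() on the postmaster.pid in your own data directory before removing the file is a cheap sanity check; the exact path depends on the installation.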
That led to the next problem. When I tried to delete the file, I got:
rm: cannot remove ‘postmaster.pid’: Read-only file system
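At first I thought it was a file permission problem, but switching to the root user made no difference, and in any case the message is not a permission error.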
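The difference between "read-only file system" and a permission error is visible in the errno the kernel returns (EROFS versus EACCES or EPERM). The sketch below, with hypothetical helper names, shows one way to tell them apart in Python.

```python
import errno
import os


def explain_remove_failure(exc):
    """Classify an OSError raised while deleting a file."""
    if exc.errno == errno.EROFS:
        return "read-only file system: changing user or file mode will not help"
    if exc.errno in (errno.EACCES, errno.EPERM):
        return "permission problem: check ownership and directory mode"
    return "other error: " + os.strerror(exc.errno)


def try_remove(path):
    """Delete path, returning 'removed' or a short diagnosis of the failure."""
    try:
        os.remove(path)
    except OSError as exc:
        return explain_remove_failure(exc)
    return "removed"
```

An EROFS result means the next thing to look at is the mount state of the file system, not the file's owner or mode.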
First, use the df command to look at the file systems:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 37G 3.2G 32G 10% /
none 4.0K 0 4.0K 0% /sys/fs/cgroup
udev 3.9G 4.0K 3.9G 1% /dev
tmpfs 799M 524K 798M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 3.9G 0 3.9G 0% /run/shm
none 100M 0 100M 0% /run/user
/dev/sda5 55G 625M 52G 2% /home
/dev/sda6 98G 919M 92G 1% /var
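On my server postmaster.pid lives under /var, so the file system involved is /dev/sda6. Next, compare the output of cat /etc/mtab and cat /proc/mounts, which also shows the difference between the two files. /etc/mtab records the state at the time of the mount (usually at boot), whereas /proc/mounts is the kernel's live state and always reflects "right now".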
root@GameMobileGS:/var# cat /etc/mtab
/dev/sda1 / ext4 rw,errors=remount-ro 0 0
proc /proc proc rw,noexec,nosuid,nodev 0 0
sysfs /sys sysfs rw,noexec,nosuid,nodev 0 0
none /sys/fs/cgroup tmpfs rw 0 0
none /sys/fs/fuse/connections fusectl rw 0 0
none /sys/kernel/debug debugfs rw 0 0
none /sys/kernel/security securityfs rw 0 0
udev /dev devtmpfs rw,mode=0755 0 0
devpts /dev/pts devpts rw,noexec,nosuid,gid=5,mode=0620 0 0
tmpfs /run tmpfs rw,noexec,nosuid,size=10%,mode=0755 0 0
none /run/lock tmpfs rw,noexec,nosuid,nodev,size=5242880 0 0
none /run/shm tmpfs rw,nosuid,nodev 0 0
none /run/user tmpfs rw,noexec,nosuid,nodev,size=104857600,mode=0755 0 0
none /sys/fs/pstore pstore rw 0 0
/dev/sda5 /home ext4 rw 0 0
/dev/sda6 /var ext4 rw 0 0
systemd /sys/fs/cgroup/systemd cgroup rw,noexec,nosuid,nodev,none,name=systemd 0 0
root@GameMobileGS:/var# cat /proc/mounts
rootfs / rootfs rw 0 0
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,relatime,size=4077276k,nr_inodes=1019319,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,noexec,relatime,size=817628k,mode=755 0 0
/dev/disk/by-uuid/f20497d6-7654-4ab5-a68b-c63a09cb5f1c / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
none /sys/fs/cgroup tmpfs rw,relatime,size=4k,mode=755 0 0
none /sys/fs/fuse/connections fusectl rw,relatime 0 0
none /sys/kernel/debug debugfs rw,relatime 0 0
none /sys/kernel/security securityfs rw,relatime 0 0
none /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
none /run/shm tmpfs rw,nosuid,nodev,relatime 0 0
none /run/user tmpfs rw,nosuid,nodev,noexec,relatime,size=102400k,mode=755 0 0
none /sys/fs/pstore pstore rw,relatime 0 0
/dev/sda5 /home ext4 rw,relatime,data=ordered 0 0
/dev/sda6 /var ext4 ro,relatime,data=ordered 0 0
systemd /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,name=systemd 0 0
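This shows that the sda6 file system went from rw to ro after boot: /etc/mtab still lists /var as rw, while /proc/mounts reports ro. That is why the file could not be deleted.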
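The comparison above can also be automated. The following Python sketch (function names are illustrative) parses two mount tables in the /proc/mounts format and reports mount points that were recorded as rw but are now ro. It assumes /etc/mtab is a regular file, as on this Ubuntu 14.04 system. On many newer distributions it is a symlink into /proc, in which case the two tables are identical and nothing will be reported.

```python
def mount_modes(mounts_text):
    """Map each mount point to 'ro' or 'rw' from /proc/mounts-style text."""
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a mount entry
        mount_point, options = fields[1], fields[3].split(",")
        if "ro" in options:
            modes[mount_point] = "ro"
        elif "rw" in options:
            modes[mount_point] = "rw"
    return modes


def changed_to_read_only(mtab_text, proc_mounts_text):
    """Mount points recorded as rw in mtab that the kernel now reports as ro."""
    recorded = mount_modes(mtab_text)
    live = mount_modes(proc_mounts_text)
    return sorted(
        mp for mp, mode in live.items()
        if mode == "ro" and recorded.get(mp) == "rw"
    )


if __name__ == "__main__":
    with open("/etc/mtab") as a, open("/proc/mounts") as b:
        print(changed_to_read_only(a.read(), b.read()))
```

Fed the two listings shown above, this reports /var as the only mount point that changed.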
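A file system that switches to read-only like this has very likely been corrupted, or the disk itself is damaged. Use dmesg to see the details: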
...
[894160.786037] 00
[894160.786038] end_request: I/O error, dev sda, sector 236571328
[894160.786039] end_request: I/O error, dev sda, sector 311863296
[894160.786042] EXT4-fs warning (device sda6): ext4_end_bio:317: I/O error -5 writing to inode 1966160 (offset 0 size 4096 starting block 29571417)
[894160.786042] Buffer I/O error on device sda6, logical block 12615680
[894160.786043] lost page write due to I/O error on sda6
[894160.786044] Buffer I/O error on device sda6, logical block 3204184
[894160.786064] JBD2: Error -5 detected when updating journal superblock for sda6-8.
[894160.786069] JBD2: Spotted dirty metadata buffer (dev = sda6, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
[894160.790274] sd 2:0:0:0: [sda] Unhandled error code
[894160.790276] sd 2:0:0:0: [sda]
[894160.790277] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[894160.790278] sd 2:0:0:0: [sda] CDB:
[894160.790278] Write(10): 2a 00 0c 92 a8 00 00 00 08 00
[894160.790283] end_request: I/O error, dev sda, sector 210937856
[894160.790781] Buffer I/O error on device sda6, logical block 0
[894160.791322] lost page write due to I/O error on sda6
[894160.791373] EXT4-fs (sda6): I/O error while writing superblock
[894160.791390] EXT4-fs (sda6): I/O error while writing superblock
[894160.791392] EXT4-fs error (device sda6): ext4_journal_check_start:56: Detected aborted journal
[894160.791393] EXT4-fs (sda6): Remounting filesystem read-only
[894160.793417] EXT4-fs error (device sda6): ext4_journal_check_start:56: Detected aborted journal
[894341.061979] sd 2:0:0:0: timing out command, waited 180s
[894341.062604] sd 2:0:0:0: [sda] Unhandled error code
[894341.062606] sd 2:0:0:0: [sda]
[894341.062611] Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[894341.062613] sd 2:0:0:0: [sda] CDB:
[894341.062614] Write(10): 2a 00 0c 92 a8 00 00 00 08 00
[894341.062619] end_request: I/O error, dev sda, sector 210937856
[894341.063143] Buffer I/O error on device sda6, logical block 0
[894341.063658] lost page write due to I/O error on sda6
[894341.063681] EXT4-fs (sda6): ext4_writepages: jbd2_start: 13312 pages, ino 1966118; err -30
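This shows clearly that sda6 is no longer writable: the kernel remounted it read-only after the journal was aborted. I repaired the disk with fsck -y (which should be run while the file system is unmounted), rebooted, and everything was OK.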
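When the kernel log is long, the lines that matter can be pulled out mechanically. This is a small Python sketch that scans dmesg text for the two message shapes seen in the log above ("Remounting filesystem read-only" and "Buffer I/O error on device"). The regular expressions are written against those exact messages and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this post.
REMOUNT_RE = re.compile(r"EXT4-fs \((\w+)\): Remounting filesystem read-only")
BUFFER_IO_RE = re.compile(r"Buffer I/O error on device (\w+),")


def summarize_dmesg(text):
    """Return (devices remounted read-only, buffer I/O error count per device)."""
    remounted = sorted(set(REMOUNT_RE.findall(text)))
    io_errors = dict(Counter(BUFFER_IO_RE.findall(text)))
    return remounted, io_errors
```

One way to use it is to save the output of dmesg to a file and pass the file's contents to summarize_dmesg().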
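PS: Before repairing, back up any important files. If creating the archive fails (tar cannot write its archive anywhere under /var, because it needs a writable location), copy the files off with scp instead.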
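Reading from a read-only file system still works, so an archive can be created as long as it is written somewhere writable. Here is a minimal Python sketch using the standard tarfile module. The function name and the paths in the usage note are hypothetical, and reads from a failing disk may still raise I/O errors.

```python
import os
import tarfile


def archive_directory(source_dir, archive_path):
    """Write a gzip-compressed tar of source_dir to archive_path.

    source_dir is only read; archive_path must be on a writable file system.
    """
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top_level)
    return archive_path
```

For example, archive_directory("/var/lib/postgresql", "/home/backup/postgresql.tar.gz") would read from the read-only /var and write to /home, which was still mounted rw in the listings above. Both paths are placeholders.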