网上可以看到不少有关oracle运行时误删数据文件的讨论,对于使用rm操作系统命令删除的文件,可以通过copy文件描述符来恢复文件,当然这种方式的前提还是数据库还在运行。

    但在DB还在运行时直接copy文件描述符来恢复文件,这样操作其实有几个问题:

  1. copy过程中需要额外的存储空间

  2. 在数据库checkpoint发起之前,对误删数据文件的所有copy操作必须完成,否则来不及copy的数据文件将会因为实例终止而完全丢失,还在copy中的数据文件无效

  3. 如果误删的文件包括online redo或undo,这种方式很可能无效

  4. 恢复数据文件的操作必须将对应的表空间下线或关闭数据库,这势必影响线上业务



    深入思考一下,数据库进程读写数据文件,是通过文件描述符(以下简称FD)来操作硬盘上对应的数据文件的。既然这些误删的数据文件当前还可以在内存中读写,通过FD操作文件,而且这些误删的数据文件本身占用的空间也并没有释放出来,那是否可以通过某种类似undelete的方式将FD的内容保留会原先的硬盘上呢?

    其实github上有多个项目就是实现这个的,包括fdlink、vfs-undelete等。但这些方案也有问题:

  1. 需要编译内核模块来实现。但是这一点,这个要求就比较高,毕竟服务器上一般是不会安装内核源代码的,而且即使要下载也需要时间。

  2. 需要root权限

  3. 实际效果难说。我实测下来,centos5.6的系统,默认内核2.6.18-238.el5, 删除单个数据文件,用fdlink和vfs-undelete都不能成功恢复。

    所以这个方法基本不可行。


    那是否还有其他方式呢?其实再深入思考下,既然FD还能完全操作这些误删的数据文件,完全可以把FD直接当做数据文件来处理。为了保持原来的数据文件路径不变,我们可以建立一个soft link,把数据文件链接回FD,这样所有的oracle进程也就可以找到数据文件,这样误删操作也不会影响到正常业务进行。

    但这样做的问题又来了,毕竟FD不是硬盘上的文件,DB或系统重启、断电都会使FD失效,误删的数据文件还是会可能丢失,而直接copy FD生成的数据文件又有本文开头说的那些问题,怎么操作才能兼顾呢?

    解决这个问题其实也并不麻烦。只要了解FD的特性,只要还有进程持有该FD,对应的数据文件内容就就可以读写。我们可以使用tail命令来始终占用这个FD,那么即使DB关闭或者数据文件offline了,这个FD对应的数据文件内容还是可以读取的,而因为DB已经关闭或数据文件已经offline了,自然这些数据文件也就没有写操作了,此时copy出来的数据文件才是干净安全的。


    下面是对这个过程一个演示(test表在表空间users01.dbf上,重要操作加亮显示):


$ ls
bbed.par    control01.ctl  filelist.txt  redo01.log  redo03.log    system01.dbf  undotbs01.dbf
bifile.bbd  example01.dbf  log.bbd       redo02.log  sysaux01.dbf  temp01.dbf    users01.dbf
$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Tue Nov 4 10:48:09 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 1870647296 bytes
Fixed Size                  2229424 bytes
Variable Size             452987728 bytes
Database Buffers         1409286144 bytes
Redo Buffers                6144000 bytes
Database mounted.
Database opened.
SQL> select count(*) from test;

  COUNT(*)
----------
    618237

$ ll
total 2218432
-rw-r--r-- 1 oracle oinstall        74 Oct 30 12:04 bbed.par
-rw-r--r-- 1 oracle oinstall     18432 Oct 30 13:48 bifile.bbd
-rw-r----- 1 oracle oinstall   9814016 Nov  4 10:48 control01.ctl
-rw-r----- 1 oracle oinstall 362422272 Nov  4 10:48 example01.dbf
-rw-r--r-- 1 oracle oinstall       259 Oct 30 12:03 filelist.txt
-rw-r--r-- 1 oracle oinstall      1725 Oct 30 13:48 log.bbd
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo01.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo02.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo03.log
-rw-r----- 1 oracle oinstall 639639552 Nov  4 10:48 sysaux01.dbf
-rw-r----- 1 oracle oinstall 943726592 Nov  4 10:48 system01.dbf
-rw-r----- 1 oracle oinstall  20979712 Nov  4 10:48 temp01.dbf
-rw-r----- 1 oracle oinstall  57679872 Nov  4 10:48 undotbs01.dbf
-rw-r----- 1 oracle oinstall  83894272 Nov  4 10:48 users01.dbf
$ rm -f users01.dbf 

$ ps -ef|grep ora_dbw
oracle     628     1  0 10:48 ?        00:00:00 ora_dbw0_KV
oracle     773  5117  0 10:48 pts/1    00:00:00 grep ora_dbw$ ll /proc/628/fd   
total 0
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 0 -> /dev/null
l-wx------ 1 oracle oinstall 64 Nov  4 10:48 1 -> /dev/null
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 10 -> /dev/zero
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 11 -> /dev/zero
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 12 -> /u01/app/oracle/product/11.2.0/dbhome_1/dbs/hc_KV.dat
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 13 -> /u01/app/oracle/product/11.2.0/dbhome_1/rdbms/mesg/oraus.msb
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 14 -> /proc/628/fd
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 15 -> /dev/zero
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 16 -> /u01/app/oracle/product/11.2.0/dbhome_1/dbs/hc_KV.dat
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 17 -> /u01/app/oracle/product/11.2.0/dbhome_1/dbs/lkKV
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 18 -> /u01/app/oracle/product/11.2.0/dbhome_1/rdbms/mesg/oraus.msb
l-wx------ 1 oracle oinstall 64 Nov  4 10:48 2 -> /dev/null
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 256 -> /u01/app/oracle/oradata/KV/control01.ctl
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 257 -> /u01/app/oracle/fast_recovery_area/KV/control02.ctl
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 258 -> /u01/app/oracle/oradata/KV/system01.dbf
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 259 -> /u01/app/oracle/oradata/KV/sysaux01.dbf
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 260 -> /u01/app/oracle/oradata/KV/undotbs01.dbf
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 261 -> /u01/app/oracle/oradata/KV/users01.dbf (deleted)
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 262 -> /u01/app/oracle/oradata/KV/example01.dbf
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 263 -> /u01/app/oracle/oradata/KV/temp01.dbf
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 3 -> /dev/null
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 4 -> /dev/null
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 5 -> /dev/null
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 6 -> /dev/null
lrwx------ 1 oracle oinstall 64 Nov  4 10:48 7 -> /u01/app/oracle/product/11.2.0/dbhome_1/dbs/hc_KV.dat
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 8 -> /dev/null
lr-x------ 1 oracle oinstall 64 Nov  4 10:48 9 -> /dev/null
$ tail -n +0 -f /proc/628/fd/261 >/dev/null  &
[1] 785  (备注:此处785是tail进程执行后的pid号,后面需要该pid号)
$ !sqlplus

sqlplus / as sysdba  

SQL*Plus: Release 11.2.0.3.0 Production on Tue Nov 4 10:49:36 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> alter system checkpoint;
alter system checkpoint
*
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 792
Session ID: 142 Serial number: 5


SQL> Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
                                                                   
$ tac /u01/app/oracle/diag/rdbms/kv/KV/trace/alert_KV.log |less
Instance terminated by CKPT, pid = 636
Dumping diagnostic data in directory=[cdmp_20141104104959], requested by (instance=1, osid=636 (CKPT)), summary=[abnormal instance termination].
CKPT (ospid: 636): terminating the instance due to error 63999
System State dumped to trace file /u01/app/oracle/diag/rdbms/kv/KV/trace/KV_diag_611.trc
System state dump requested by (instance=1, osid=636 (CKPT)), summary=[abnormal instance termination].
Tue Nov 04 10:49:59 2014
Additional information: 3
Linux-x86_64 Error: 2: No such file or directory
ORA-27041: unable to open file
ORA-01110: data file 4: '/u01/app/oracle/oradata/KV/users01.dbf'
ORA-01116: error in opening database file 4
ORA-63999: data file suffered media failure
Errors in file /u01/app/oracle/diag/rdbms/kv/KV/trace/KV_ckpt_636.trc:
Additional information: 3
Linux-x86_64 Error: 2: No such file or directory
ORA-27041: unable to open file
ORA-01110: data file 4: '/u01/app/oracle/oradata/KV/users01.dbf'
ORA-01116: error in opening database file 4
ORA-63999: data file suffered media failure
Errors in file /u01/app/oracle/diag/rdbms/kv/KV/trace/KV_ckpt_636.trc:
Tue Nov 04 10:49:58 2014
CJQ0 started with pid=27, OS id=726
Tue Nov 04 10:48:23 2014
Starting background process CJQ0
space available in the underlying filesystem or ASM diskgroup.
database for recovery-related files, and does not reflect the amount of
user-specified limit on the amount of space that will be used by this
db_recovery_file_dest_size of 20000 MB is 8.36% used. This is a
Tue Nov 04 10:48:22 2014
ARC0: Becoming the heartbeat ARCH
ARC0: STARTING ARCH PROCESSES COMPLETE
ARC3: Archival started
ARC2: Archival started
Completed: ALTER DATABASE OPEN
QMNC started with pid=24, OS id=697
Tue Nov 04 10:48:21 2014
Starting background process QMNC
replication_dependency_tracking turned off (no async multimaster replication found)
No Resource Manager plan active
Database Characterset is WE8MSWIN1252
SMON: enabling tx recovery
Verifying 11g file header compatibility for tablespace encryption completed
Verifying file header compatibility for 11g tablespace encryption..
Undo initialization finished serial:0 start:429462324 end:429462404 diff:80 (0 seconds)
[676] Successfully onlined Undo Tablespace 2.
ARC3 started with pid=23, OS id=693
Tue Nov 04 10:48:20 2014

$ ll
total 2136420
-rw-r--r-- 1 oracle oinstall        74 Oct 30 12:04 bbed.par
-rw-r--r-- 1 oracle oinstall     18432 Oct 30 13:48 bifile.bbd
-rw-r----- 1 oracle oinstall   9814016 Nov  4 10:49 control01.ctl
-rw-r----- 1 oracle oinstall 362422272 Nov  4 10:48 example01.dbf
-rw-r--r-- 1 oracle oinstall       259 Oct 30 12:03 filelist.txt
-rw-r--r-- 1 oracle oinstall      1725 Oct 30 13:48 log.bbd
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo01.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo02.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo03.log
-rw-r----- 1 oracle oinstall 639639552 Nov  4 10:49 sysaux01.dbf
-rw-r----- 1 oracle oinstall 943726592 Nov  4 10:49 system01.dbf
-rw-r----- 1 oracle oinstall  20979712 Nov  4 10:48 temp01.dbf
-rw-r----- 1 oracle oinstall  57679872 Nov  4 10:49 undotbs01.dbf
$ ps -ef|grep tail
oracle     785  5117  0 10:49 pts/1    00:00:00 tail -n +0 -f /proc/628/fd/261
oracle     844  5117  0 10:51 pts/1    00:00:00 grep tail
$ ll /proc/785/fd
total 0
lrwx------ 1 oracle oinstall 64 Nov  4 10:52 0 -> /dev/pts/1
l-wx------ 1 oracle oinstall 64 Nov  4 10:52 1 -> /dev/null
lrwx------ 1 oracle oinstall 64 Nov  4 10:50 2 -> /dev/pts/1
lr-x------ 1 oracle oinstall 64 Nov  4 10:52 3 -> /u01/app/oracle/oradata/KV/users01.dbf (deleted)
$ cp  /proc/785/fd/3 users01.dbf
$ ll
total 2218432
-rw-r--r-- 1 oracle oinstall        74 Oct 30 12:04 bbed.par
-rw-r--r-- 1 oracle oinstall     18432 Oct 30 13:48 bifile.bbd
-rw-r----- 1 oracle oinstall   9814016 Nov  4 10:49 control01.ctl
-rw-r----- 1 oracle oinstall 362422272 Nov  4 10:48 example01.dbf
-rw-r--r-- 1 oracle oinstall       259 Oct 30 12:03 filelist.txt
-rw-r--r-- 1 oracle oinstall      1725 Oct 30 13:48 log.bbd
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo01.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo02.log
-rw-r----- 1 oracle oinstall  52429312 Nov  4 10:48 redo03.log
-rw-r----- 1 oracle oinstall 639639552 Nov  4 10:49 sysaux01.dbf
-rw-r----- 1 oracle oinstall 943726592 Nov  4 10:49 system01.dbf
-rw-r----- 1 oracle oinstall  20979712 Nov  4 10:48 temp01.dbf
-rw-r----- 1 oracle oinstall  57679872 Nov  4 10:49 undotbs01.dbf
-rw-r----- 1 oracle oinstall  83894272 Nov  4 10:52 users01.dbf
$ sqlplus / as sysdba

SQL*Plus: Release 11.2.0.3.0 Production on Tue Nov 4 10:52:39 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORACLE instance started.

Total System Global Area 1870647296 bytes
Fixed Size                  2229424 bytes
Variable Size             452987728 bytes
Database Buffers         1409286144 bytes
Redo Buffers                6144000 bytes
Database mounted.
Database opened.
SQL> select count(*) from test;

  COUNT(*)
----------
    618237

SQL> quit
Disconnected from Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options