7.DNS解析失败,oracle 数据库无法连接,listener重启hang住

pro_db数据库无法连接,listener重启hang住问题分析
解决: 1、移走 /etc/resolv.conf 配置文件让数据库无法走DNS解析后恢复
mv /etc/resolv.conf /etc/resolv.conf.bak
2、/etc/resolv.conf 中添加 options timeout:1

查看nsswitch.conf配置文件,配置的的确是先通过/etc/hosts文件解析主机名,再通过/etc/resolv.conf寻找DNS解析
cat /etc/nsswitch.conf
hosts: files dns

问题:为何配置了先通过/etc/hosts解析主机名再寻找DNS,但无法解析(或解析很慢)主机名导致数据监听hang住?到底是不是因为配置了DNS解析导致监听hang住无法连接?

最终结论:
oracle 11.2.0.1版本的listener解析主机名优先选择从DNS解析,然后才会尝试从hosts文件解析。当/etc/resolv.conf文件不存在时会从/etc/hosts解析,当DNS服务器地址无法访问
时,会出现数据库建立连接缓慢的问题,一旦DNS服务器本身解析的问题导致找不到ip时,listener就会响应延迟或者hang住。可以通过移除/etc/resolv.conf文件或者在文件中设置超时时间
echo “options timeout:1” >> /etc/resolv.conf 来解决。

1.在 备库 重现故障现象

1.1) 查看dns和host配置(/etc/host.conf /etc/resolv.conf /etc/hosts)
修改/etc/resolv.conf 中的DNS服务器为一个无法访问的地址,使DNS解析失败

[root@pro_db1-DG ~]# cat /etc/host.conf
multi on

[root@pro_db1-DG ~]# cat /etc/resolv.conf 
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
#nameserver 172.20.113.27
#nameserver 172.20.113.28
nameserver 172.20.113.29

[root@pro_db1-DG ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1   pro_db1-DG
172.20.124.139 laomatest


1.2) 查看listener配置
[oracle@pro_db1-DG ~]$ cat /mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/network/admin/listener.ora 
SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (ORACLE_HOME = /mnt/data_1t/app/oracle/product/11.2.0/VMANAGE)
      (SID_NAME = VMANAGE)
    )
  )

LISTENER =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = pro_db1-DG)(PORT = 1521))
  )

ADR_BASE_LISTENER = /mnt/data_1t/app/oracle
#TRACE_LEVEL_listener=admin


2.打开客户端sqlnet.ora的trace
$ vi $ORACLE_HOME/network/admin/sqlnet.ora
TRACE_LEVEL_CLIENT = 16
TRACE_FILE_CLIENT = client
TRACE_DIRECTORY_CLIENT = /tmp
TRACE_TIMESTAMP_CLIENT = ON
DIAG_ADR_ENABLED=off    

打开listener的trace (未重启listener故而未实施)
listener.ora打开trace
DIAG_ADR_ENABLED_LISTENER =off 
TRACE_LEVEL_LISTENER = 16
TRACE_TIMESTAMP_LISTENER = ON 
TRACE_DIRECTORY_LISTENER = /tmp
TRACE_FILELEN_LISTENER = 1024
TRACE_FILENO_LISTENER = 1

3.sqlplus通过监听访问数据库,出现连接缓慢的情况 (耗时 12秒)

[oracle@pro_db1-DG ~]$ time sqlplus vmanage/y4yhl9t@vmanages

Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
SQL> 

real    0m12.592s
user    0m0.045s
sys     0m0.016s


4.通过日志分析缓慢原因
>>> 使用strace跟踪listener进程及其子进程调用  ps -ef | grep lsnr; strace -frT -o /tmp/listener.strc -p 106787
>>> stracelsof 跟踪sqlplus进程的调用  strace -frT -o /tmp/sqlplus.trc sqlplus vmanage/y4yhl9t@vmanages
>>> 查看客户端sqlnet的trace   vi /tmp/client_1830.trc
>>> 查看listener的trace(未打开)

日志分析
4.1)客户端sqlnet的trace反映第一次connect的包跟第二次nttrd的包之间有 10秒延迟
client_1830.trc显示 the delay between 1st Connect Packet and Resend packet on nttrd function

(714936352) [31-JAN-2018 14:56:26:271] nscon: entry
(714936352) [31-JAN-2018 14:56:26:271] nscon: recving a packet
(714936352) [31-JAN-2018 14:56:26:271] nsprecv: entry
(714936352) [31-JAN-2018 14:56:26:271] nsprecv: reading from transport...
(714936352) [31-JAN-2018 14:56:26:271] nttrd: entry
(714936352) [31-JAN-2018 14:56:36:301] nttrd: socket 8 had bytes read=8
(714936352) [31-JAN-2018 14:56:36:301] nttrd: exit


4.2.1)strace 跟踪sqlplus调用情况,/tmp/sqlplus.trc 中文件描述符fd=9 写入花了12s,耗时最久,其中打开了 inet_addr("172.20.114.30"),这一步正好是地址解析

27011      0.000144 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9 <0.000036>
27011      0.000094 fcntl(9, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000027>
27011      0.000086 connect(9, {sa_family=AF_INET, sin_port=htons(1521), sin_addr=inet_addr("172.20.114.30")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000093>
27011      0.000177 times({tms_utime=0, tms_stime=2, tms_cutime=0, tms_cstime=0}) = 2406186220 <0.000026>
27011      0.000085 mmap(NULL, 802816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0362452000 <0.000029>
27011      0.000085 poll([{fd=9, events=POLLOUT}], 1, 60000) = 1 ([{fd=9, revents=POLLOUT}]) <0.000028>
27011      0.000083 getsockopt(9, SOL_SOCKET, SO_ERROR, [-3925634571361583104], [4]) = 0 <0.000026>
27011      0.000071 fcntl(9, F_GETFL)   = 0x802 (flags O_RDWR|O_NONBLOCK) <0.000025>
27011      0.000098 fcntl(9, F_SETFL, O_RDWR) = 0 <0.000019>
27011      0.000064 getsockname(9, {sa_family=AF_INET, sin_port=htons(50243), sin_addr=inet_addr("172.20.114.30")}, [16]) = 0 <0.000025>
27011      0.000079 getsockopt(9, SOL_SOCKET, SO_SNDBUF, [-3925628283529411012], [4]) = 0 <0.000025>
27011      0.000067 getsockopt(9, SOL_SOCKET, SO_RCVBUF, [-3925628283529374176], [4]) = 0 <0.000026>
27011      0.000070 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0 <0.000027>
27011      0.000075 fcntl(9, F_SETFD, FD_CLOEXEC) = 0 <0.000025>
27011      0.000179 rt_sigaction(SIGPIPE, {SIG_IGN, ~[ILL ABRT BUS FPE SEGV USR2 XCPU XFSZ SYS RTMIN RT_1], SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x35bf40f4a0}, {SIG_DFL, [], 0}, 8) = 0 <0.000029>
27011      0.000135 write(9, "\0\352\0\0\1\0\0\0\1:\1,\fA \0\177\377\177\10\0\0\1\0\0\260\0:\0\0\10\0"..., 234) = 234 <0.000046>
27011      0.000117 read(9, "\0\10\0\0\v\0\0\0", 8208) = 8 <12.048260>
27011     12.048380 write(9, "\0\352\0\0\1\0\0\0\1:\1,\fA \0\177\377\177\10\0\0\1\0\0\260\0:\0\0\10\0"..., 234) = 234 <0.000058>
27011      0.000156 read(9, "\0 \0\0\2\0\0\0\1:\fA \0\177\377\1\0\0\0\0 AA\0\0\0\0\0\0\0\0", 8208) = 32 <0.000122>
27011      0.000384 write(9, "\0\234\0\0\6\0\0\0\0\0\336\255\276\357\0\222\0\0\0\0\0\4\0\0\4\0\3\0\0\0\0\0"..., 156) = 156 <0.000091


4.2.2)lsof 查看文件描述符 fd=9 的打开调用情况,文件描述符 9 是个管道
[oracle@pro_db1-DG tmp]$ ps -ef | grep sqlplus | grep -v grep               
oracle    2049 31214  0 12:46 pts/4    00:00:00 sqlplus  

[oracle@pro_db1-DG tmp]$ lsof -p 2049
COMMAND  PID   USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
tnslsnr 3489 oracle    8u  IPv6          282128771      0t0       TCP *:ncube-lm (LISTEN)
tnslsnr 3489 oracle    9w  FIFO                0,8      0t0 282128761 pipe
tnslsnr 3489 oracle   10u  unix 0xffff880e3f283080      0t0 282128774 /var/tmp/.oracle/s#3489.1
tnslsnr 3489 oracle   11u  unix 0xffff880e3f283680      0t0 282128776 /var/tmp/.oracle/s#3489.2
tnslsnr 3489 oracle   12u  IPv4          282128793      0t0       TCP localhost:15237->localhost:smux (CLOSE_WAIT)
tnslsnr 3489 oracle   13u  IPv6          282165915      0t0       TCP localhost:ncube-lm->localhost:45928 (ESTABLISHED)
tnslsnr 3489 oracle   14u  IPv6          453571765      0t0       TCP 172.20.114.30:ncube-lm->172.20.114.30:9881 (ESTABLISHED)
tnslsnr 3489 oracle   16w  FIFO                0,8      0t0 453571767 pipe
tnslsnr 3489 oracle   17r  FIFO                0,8      0t0 453571768 pipe

-------- FD 文件描述符,u 表示该文件被打开并处于读取/写入模式,而不是只读 (r) 或只写 (w) 模式。
  

4.3) 查看listener进程的strace日志
4.3.1) 通过strace查看listener进程的主机名解析调用顺序如下,主机名解析调用正好在管道 fd=9:
/etc/resolv.conf   >   /etc/host.conf  >  /etc/hosts  >  libnss_dns.so.2  >  libresolv.so.2

18880      0.000173 close(9)            = 0 <0.000037>
18880      0.000098 open("/etc/resolv.conf", O_RDONLY) = 9 <0.000035>
18880      0.000081 fstat(9, {st_mode=S_IFREG|0644, st_size=305, ...}) = 0 <0.000027>
18880      0.000070 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000029>
18880      0.000064 read(9, "# Generated by NetworkManager\n\n\n"..., 4096) = 305 <0.000030>
18880      0.000076 read(9, "", 4096)   = 0 <0.000025>
18880      0.000064 close(9)            = 0 <0.000039>
18880      0.000099 munmap(0x7f3a1ea34000, 4096) = 0 <0.000033>
18880      0.000086 uname({sys="Linux", node="pro_db1-DG", ...}) = 0 <0.000019>
18880      0.000108 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9 <0.000033>
18880      0.000080 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000032>
18880      0.000085 close(9)            = 0 <0.000030>
18880      0.000071 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 9 <0.000028>
18880      0.000060 connect(9, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory) <0.000029>
18880      0.000078 close(9)            = 0 <0.000027>
18880      0.000073 open("/etc/host.conf", O_RDONLY) = 9 <0.000031>
18880      0.000076 fstat(9, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0 <0.000027>
18880      0.000071 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000028>
18880      0.000069 read(9, "multi on\n", 4096) = 9 <0.000029>
18880      0.000073 read(9, "", 4096)   = 0 <0.000025>
18880      0.000064 close(9)            = 0 <0.000026>
18880      0.000064 munmap(0x7f3a1ea34000, 4096) = 0 <0.000025>
18880      0.000077 futex(0x35bf39f384, FUTEX_WAKE_PRIVATE, 2147483647) = 0 <0.000026>
18880      0.000071 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 9 <0.000031>
18880      0.000081 fstat(9, {st_mode=S_IFREG|0644, st_size=205, ...}) = 0 <0.000026>
18880      0.000068 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3a1ea34000 <0.000026>
18880      0.000057 read(9, "127.0.0.1   localhost localhost."..., 4096) = 205 <0.000027>
18880      0.000088 read(9, "", 4096)   = 0 <0.000048>
18880      0.000092 close(9)            = 0 <0.000026>

18880      0.000078 open("/mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000035>
18880      0.000087 open("/mnt/data_1t/app/oracle/product/11.2.0/VMANAGE/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000030>
18880      0.000079 open("/lib/libnss_dns.so.2", O_RDONLY) = 9 <0.000034>

18880      0.000068 open("/usr/lib/libnss_dns.so.2", O_RDONLY) = -1 ENOENT (No such file or directory) <0.000039>
18880      0.000086 open("/etc/ld.so.cache", O_RDONLY) = 9 <0.000029>
18880      0.000073 fstat(9, {st_mode=S_IFREG|0644, st_size=123514, ...}) = 0 <0.000027>
18880      0.000070 mmap(NULL, 123514, PROT_READ, MAP_PRIVATE, 9, 0) = 0x7f3a1d064000 <0.000030>
18880      0.000061 close(9)            = 0 <0.000026>
18880      0.000074 open("/lib64/libnss_dns.so.2", O_RDONLY) = 9 <0.000033>

18880      0.000088 open("/lib/libresolv.so.2", O_RDONLY) = 9 <0.000033>
18880      0.000077 read(9, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240&\0\0004\0\0\0"..., 832) = 832 <0.000026>
18880      0.000065 close(9)            = 0 <0.000026>


 
4.3.2) 再看一下,strace的listener进程中延迟操作 在FD 9跟15 之间 ,而 FD =9 中连接的地址就是/etc/resolv.conf中配置的DNS服务器地址 172.20.113.29
即fd=9这步调用需要去DNS服务器解析主机名,因为 172.20.113.29无法访问导致解析延迟,然后在去fd=15从本地解析,最后建立与客户端的连接

18880      0.000148 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 9 <0.000039>
18880      0.000093 connect(9, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("172.20.113.29")}, 16) = 0 <0.000035>
18880      0.000091 poll([{fd=9, events=POLLOUT}], 1, 0) = 1 ([{fd=9, revents=POLLOUT}]) <0.000028>
18880      0.000079 sendto(9, "\256>\1\0\0\1\0\0\0\0\0\0\tpro_db1-DG\0\0\34\0\1", 27, MSG_NOSIGNAL, NULL, 0) = 27 <0.000038>
18880      0.000086 poll([{fd=9, events=POLLIN}], 1, 5000 <unfinished ...>
3490       4.394932 <... nanosleep resumed> 0x7f29fc234e10) = 0 <5.000157>
3490       0.000082 socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 15 <0.000042>
3490       0.000089 fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000026>
3490       0.000070 setsockopt(15, SOL_SOCKET, SO_LINGER, {onoff=1, linger=900}, 8) = 0 <0.000029>
3490       0.000076 connect(15, {sa_family=AF_INET6, sin6_port=htons(6150), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ECONNREFUSED (Connection refused) <0.000108>
3490       0.000172 shutdown(15, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) <0.000029>
3490       0.000076 close(15)           = 0 <0.000037>
3490       0.000076 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 15 <0.000032>
3490       0.000069 fcntl(15, F_SETFD, FD_CLOEXEC) = 0 <0.000023>
3490       0.000061 setsockopt(15, SOL_SOCKET, SO_LINGER, {onoff=1, linger=900}, 8) = 0 <0.000023>
3490       0.000064 connect(15, {sa_family=AF_INET, sin_port=htons(6150), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused) <0.000053>
3490       0.000111 shutdown(15, 2 /* send and receive */) = -1 ENOTCONN (Transport endpoint is not connected) <0.000030>
3490       0.000082 close(15)           = 0 <0.000041>


5. 改回DNS服务器地址为正常地址后,listener恢复正常,sqlplus登录数据库正常
#vi /etc/resolv.conf 
# DNS1=xxx.xxx.xxx.xxx
# DNS2=xxx.xxx.xxx.xxx
# DOMAIN=lab.foo.com bar.foo.com
nameserver 172.20.113.27
nameserver 172.20.113.28

[oracle@dgvxl2249 ~]$ time sqlplus vmanage/y4yhl9t@vmanagess

real 0m0.952s >>>>>>>>登录只需要0.9秒,恢复正常
user 0m0.038s
sys 0m0.022s

其他测试对比:
pro_db的备机 172.20.221.137 数据库版本也是 11.2.0.1,但是单网卡 ------无法复现问题
plm的主机 172.20.220.164 数据库版本是11.2.0.4 ,单网卡 ------无法复现问题

参考文档:
DNS Issue: Connections To Oracle 11g are Slow or Delayed (Doc ID 561429.1)
Bug 9593134 - DNS or NIS mis-configuration can cause slow database connects (Doc ID 9593134.8)

附:
查看listener的版本,11.2.0.1

[oracle@pro_db1-DG ~]$ lsnrctl version
LSNRCTL for Linux: Version 11.2.0.1.0 - Production on 31-JAN-2018 15:03:48
  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 打赏
    打赏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

dba任意

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值