java getLocalHost 速度慢的问题分析

在跑es压测的时候发现,有很多线程卡在java.net.InetAddress$NameServiceAddresses.get的一个锁上面,本文记录一下整个问题的分析过程和根因

现象

用户在跑es的时候使用jstack能看到很多线程BLOCKED

"pool-3-thread-1" #456 prio=5 os_prio=0 cpu=40493.55ms elapsed=4118.31s tid=0x00007f1b3017d9f0 nid=0x48a83 waiting for monitor entry  [0x00007f1b21bda000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:835)
	- waiting to lock <0x000000079780efb0> (a java.net.InetAddress$NameServiceAddresses)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:886)
	at java.net.InetAddress.getAllByName0(java.base@15.0.1/InetAddress.java:1507)
	at java.net.InetAddress.getAllByName0(java.base@15.0.1/InetAddress.java:1427)
	at java.net.SocketPermission.getIP(java.base@15.0.1/SocketPermission.java:814)
	at java.net.SocketPermission.impliesIgnoreMask(java.base@15.0.1/SocketPermission.java:976)
	at java.net.SocketPermissionCollection.implies(java.base@15.0.1/SocketPermission.java:1446)
	at java.security.Permissions.implies(java.base@15.0.1/Permissions.java:178)
	at sun.security.provider.PolicyFile.implies(java.base@15.0.1/PolicyFile.java:994)
	at sun.security.provider.PolicySpiFile.engineImplies(java.base@15.0.1/PolicySpiFile.java:75)
	at java.security.Policy$PolicyDelegate.implies(java.base@15.0.1/Policy.java:796)
	at org.elasticsearch.bootstrap.ESPolicy.implies(ESPolicy.java:86)
	at java.security.ProtectionDomain.implies(java.base@15.0.1/ProtectionDomain.java:321)
	at java.security.ProtectionDomain.impliesWithAltFilePerm(java.base@15.0.1/ProtectionDomain.java:353)
	at java.security.AccessControlContext.checkPermission(java.base@15.0.1/AccessControlContext.java:450)
	at java.security.AccessController.checkPermission(java.base@15.0.1/AccessController.java:1036)
	at java.lang.SecurityManager.checkPermission(java.base@15.0.1/SecurityManager.java:408)
	at java.lang.SecurityManager.checkConnect(java.base@15.0.1/SecurityManager.java:907)
	at java.net.InetAddress.getLocalHost(java.base@15.0.1/InetAddress.java:1628)

分析是被下面的线程占用了锁

"pool-3-thread-64" #596 prio=5 os_prio=0 cpu=42789.62ms elapsed=4107.41s tid=0x00007f1b1c28a360 nid=0x48b33 runnable  [0x00007f0ce31f0000]
   java.lang.Thread.State: RUNNABLE
        at java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@15.0.1/Native Method)
        at java.net.InetAddress$PlatformNameService.lookupAllHostAddr(java.base@15.0.1/InetAddress.java:932)
        at java.net.InetAddress.getAddressesFromNameService(java.base@15.0.1/InetAddress.java:1517)
        at java.net.InetAddress$NameServiceAddresses.get(java.base@15.0.1/InetAddress.java:851)
        - locked <0x000000079780efb0> (a java.net.InetAddress$NameServiceAddresses)
        at java.net.InetAddress.getAllByName0(java.base@15.0.1/InetAddress.java:1507)
        at java.net.InetAddress.getLocalHost(java.base@15.0.1/InetAddress.java:1639)

嗯,看起来是在通过本机主机名拿本机ip的时候慢了

这个时候执行一下hostname,发现hostname的输出并没有在/etc/hosts中配置
尝试把主机名配置到/etc/hosts中,解析为127.0.0.1,问题就能得到恢复

那么疑问来了,在解析主机名的时候,配置hosts和不配置的速度差别有多少呢?为什么会block呢?

查看jdk 1.5.0 源码 看一下NameServiceAddresses.get的实现
https://github.com/openjdk/jdk/blob/jdk-15-ga/src/java.base/share/classes/java/net/InetAddress.java#L835
发现有一个synchronized,ok锁的位置确认,继续向下
观察到runnable的线程正在执行的函数是

java.net.Inet4AddressImpl.lookupAllHostAddr(java.base@15.0.1/Native Method)

是一个native函数

jdk的实现在这里
https://github.com/openjdk/jdk/blob/jdk-15-ga/src/java.base/unix/native/libnet/Inet4AddressImpl.c#L86

在109行会调用到getaddrinfo函数,是glibc提供的

Given node and service, which identify an Internet host and a
       service, getaddrinfo() returns one or more addrinfo structures,
       each of which contains an Internet address that can be specified
       in a call to bind(2) or connect(2).

ok, 我们写一个demo,来测试一下配置hosts的主机名和不配置hosts的主机名解析速度的差异,我们就将jdk中的实现抠出来

#include <ctype.h>
#include <errno.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <arpa/inet.h>

int main(int argc, char *argv[]) {

    if (argc < 2) {
        printf("arg err\n");
        return -1;
    }
    struct addrinfo hints, *rp = NULL, *result = NULL, *resNew = NULL, *last = NULL,
        *iterator;

    int error;
    char str[INET_ADDRSTRLEN] = {0};
  

    // try once, with our static buffer
    memset(&amp;hints, 0, sizeof(hints));
    hints.ai_flags = AI_CANONNAME;
    hints.ai_family = AF_INET;

    for (int i = 0; i < 100; i ++ )
        error = getaddrinfo(argv[1], NULL, &amp;hints, &amp;result);
    return 0;
}

我们看执行100次getaddrinfo的性能区别

看一下当前/etc/hosts的配置

[root@VM-0-13-centos c]# cat /etc/hosts
127.0.0.1 VM-0-13-centos VM-0-13-centos
127.0.0.1 localhost.localdomain localhost
127.0.0.1 localhost4.localdomain4 localhost4
127.0.0.1 f11.maxxipoint.com

::1 VM-0-13-centos VM-0-13-centos
::1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6

将当前hostname修改一下,使之不存在于/etc/hosts中

[root@VM-0-13-centos c]# hostname wanglihost
[root@VM-0-13-centos c]# hostname
wanglihost

由于用户机器上开启了nscd,我们也开启
nscd 全称是 “Name Service Cache Daemon”,开启之后glibc会通过unixsocket /var/run/nscd/socket与之交互

看一下性能对比

[root@VM-0-13-centos c]# time ./getaddrinfo VM-0-13-centos

real    0m0.001s
user    0m0.001s
sys     0m0.000s
[root@VM-0-13-centos c]# time ./getaddrinfo wanglihost

real    0m0.463s
user    0m0.001s
sys     0m0.003s

看起来差了400倍左右的速度,难怪不配置hosts会block

但是不配置hosts对glibc的行为影响到底是什么呢

抓一下strace,发现有这样的区别

共同点

connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(3, "\2\0\0\0\r\0\0\0\6\0\0\0hosts\0", 18, MSG_NOSIGNAL, NULL, 0) = 18
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="hosts\0", iov_len=6}, {iov_base="\310O\3\0\0\0\0\0", iov_len=8}], msg_iovlen=2, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=20, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 14
mmap(NULL, 217032, PROT_READ, MAP_SHARED, 4, 0) = 0x7f2741251000

无论主机名称是否在/etc/hosts中,glibc都会连接/var/run/nscd/socket
并发送一个hosts相关命令
之后mmap一段内存

不同点

不在/etc/hosts中的主机名称,会在之后不断和nscd通过socket交互

sendto(3, "\2\0\0\0\16\0\0\0\f\0\0\0wanglihost1\0", 24, MSG_NOSIGNAL, NULL, 0) = 24
poll([{fd=3, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=3, revents=POLLIN|POLLHUP}])
read(3, "\2\0\0\0\1\0\0\0\2\0\0\0\24\0\0\0\f\0\0\0\0\0\0\0", 24) = 24
read(3, "\254\25\0\r\376\200\0\0\0\0\0\0PT\0\377\3762\327i\2\nwanglihost"..., 34) = 34

为了弄清楚实现
我们查找getaddrinfo在glibc中的实现
https://codebrowser.dev/glibc/glibc/sysdeps/posix/getaddrinfo.c.html#getaddrinfo
发现一段跟nscd相关的代码
https://codebrowser.dev/glibc/glibc/sysdeps/posix/getaddrinfo.c.html#610

#ifdef USE_NSCD

...

int err = __nscd_getai (name, &air, &h_errno);
	      

继续看__nscd_getai的实现
https://codebrowser.dev/glibc/glibc/nscd/nscd_getai.c.html#62

 mapped = __nscd_get_map_ref (GETFDHST, "hosts", &__hst_map_handle,
			       &gc_cycle);

继续看__nscd_get_map_ref的实现

__nscd_get_map_ref (request_type type, const char *name,
		    volatile struct locked_map_ptr *mapptr, int *gc_cyclep)
{
  struct mapped_database *cur = mapptr->mapped;
  if (cur == NO_MAPPING)
    return cur;
  if (!__nscd_acquire_maplock (mapptr))
    return NO_MAPPING;
  cur = mapptr->mapped;
  if (__glibc_likely (cur != NO_MAPPING))
    {
      /* If not mapped or timestamp not updated, request new map.  */
      if (cur == NULL
	  || (cur->head->nscd_certainly_running == 0
	      && cur->head->timestamp + MAPPING_TIMEOUT < time_now ())
	  || cur->head->data_size > cur->datasize)
	cur = __nscd_get_mapping (type, name,
				  (struct mapped_database **) &mapptr->mapped);
      if (__glibc_likely (cur != NO_MAPPING))
	{
	  if (__builtin_expect (((*gc_cyclep = cur->head->gc_cycle) & 1) != 0,
				0))
	    cur = NO_MAPPING;
	  else
	    atomic_increment (&cur->counter);
	}
    }
  mapptr->lock = 0;
  return cur;
}

可以看到这里会去访问一个缓存的mapptr,这个缓存对象是根据传入的name(这里是"hosts")去向nscd要的,
具体怎么要的呢,看__nscd_get_mapping的实现
https://codebrowser.dev/glibc/glibc/nscd/nscd_helper.c.html#__nscd_get_mapping

int sock = open_socket (type, key, keylen);

继续看open_socket实现
https://codebrowser.dev/glibc/glibc/nscd/nscd_helper.c.html#open_socket

  strcpy (sun.sun_path, _PATH_NSCDSOCKET);
  if (__connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0
      && errno != EINPROGRESS)

这里_PATH_NSCDSOCKET就是"/var/run/nscd/socket"

这就和strace对应上了

总结

调用逻辑如下
java.net.InetAddress$NameServiceAddresses.get

->

lookupAllHostAddr

->

getaddrinfo

->

创建socket连接nscd

->

发送hosts命令,nscd会创建共享内存将hosts信息写入,客户端(glibc)这边mmap读入

->

如果没有读到,继续sendto 主机名信息给nscd

却别就在如果hosts中有主机名,那么这个信息就在共享内存中

这个行为的区别会造成400倍的性能差异

分析完毕

  • 1
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值