Fixing DNS Timeouts in Docker with a DNS Cache [Tutorial]

Latest recommended article published 2023-11-16 09:17:03

dfsgwe1231

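This article describes flaky CI tests in the AdaptJS project that were caused by DNS timeouts in Docker. Using dnsmasq as a DNS cache for Docker reduced the number of queries sent to the AWS DNS servers, which resolved the timeouts and made CI more stable.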

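Flaky tests in CI are a nightmare. You can't tell whether your new code broke something or whether those tests are just being flaky again. So whenever we see strange, random failures in CI for our open source project, Adapt, we try to track down the culprit as quickly as possible. This is the story of how we discovered that we were (accidentally) flooding our DNS server with traffic, and how we used a DNS cache in Docker to solve the problem.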

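Background

AdaptJS is one of the open source projects I work on. It deploys applications to a variety of clouds and technologies, so it has a large number of system tests and end-to-end tests that use Docker, Kubernetes, AWS, Google Cloud, and other similar technologies.

We use Docker heavily in our tests, so we end up creating many short-lived containers that start, do some work such as building or installing an app, and then get deleted. As we added more and more of these tests, we started seeing previously stable system tests fail randomly in CI.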

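Symptom: Test Timeouts

The first symptom we saw was tests timing out. Many of our end-to-end tests have fairly short timeouts so that we can detect when new code suddenly makes things take longer for the end user. But now, a test that normally takes half a second would sometimes take 5.5 seconds.

The extra 5 seconds was a great clue: 5 seconds sounded like it could be some kind of timeout. With that hunch, we looked back at all the seemingly random test failures and found the common thread: they were all tests that initiated a network request. We also noticed that some tests took even longer to fail, always in increments of 5 seconds.

There are not many network protocols in play here, so a quick search pointed us in the right direction. On Linux, the default timeout for a query to a DNS server is 5 seconds.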
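To check which timeout a given host actually uses, the resolver configuration can be inspected. The following is a minimal sketch, not something from the original investigation: it assumes a glibc-style resolv.conf where the timeout can be overridden with an `options timeout:N` line, and the function name `resolver_timeout` is hypothetical. When no override is present it falls back to the 5-second default mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the per-query DNS timeout (in seconds)
# configured in a resolv.conf-style file.
# Assumption: an "options timeout:N" line overrides the 5-second default;
# if several are present, the last one wins in this sketch.
resolver_timeout() {
    file="${1:-/etc/resolv.conf}"
    awk '
        $1 == "options" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^timeout:[0-9]+$/) { split($i, kv, ":"); t = kv[2] }
        }
        END { print (t == "" ? 5 : t) }
    ' "$file"
}
```

Running `resolver_timeout` with no argument reads /etc/resolv.conf on the current host; passing a path reads that file instead, which is handy for checking the file a container would see.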

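To see what was happening with DNS, we turned to what is probably the most important tool for debugging network problems on Linux: tcpdump. (Or, if you prefer a GUI, wireshark is great too.) We ran tcpdump on the host system (an Amazon Workspaces Linux instance) with a filter to look at DNS traffic: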

$  tcpdump -n -i eth1 port 53
11:35:59.474735 IP 172.16.0.131.54264 > 172.16.0.119.domain: 64859+ AAAA? registry-1.docker.io. (38)
11:35:59.474854 IP 172.16.0.131.49631 > 172.16.0.119.domain: 43524+ A? registry-1.docker.io. (38)
11:35:59.476871 IP 172.16.0.119.domain > 172.16.0.131.49631: 43524 8/0/1 A 34.197.189.129, A 34.199.40.84, A 34.199.77.19, A 34.201.196.144, A 34.228.211.243, A 34.232.31.24, A 52.2.186.244, A 52.55.198.220 (177)
11:35:59.476957 IP 172.16.0.119.domain > 172.16.0.131.54264: 64859 0/1/1 (133)

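The first thing we noticed was that we were generating a large volume of DNS queries to the default AWS DNS server for our VPC. It looked like all those short-lived containers tend to do a bunch of DNS lookups when they start up, for various reasons. Next, we noticed that some of those DNS queries were going unanswered.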
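Spotting unanswered queries by eye gets tedious in a long capture. As a rough sketch (not the method used in the original investigation), the tcpdump text output can be piped through awk to pair each query with its response by client address and query ID. The function name `unanswered_queries` is hypothetical, and the field positions assume the one-line output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `tcpdump -n port 53` text output on stdin and
# print the DNS queries that never received a response.
# Assumption: lines follow the format shown in the capture above.
unanswered_queries() {
    awk '
        # Query: "TIME IP CLIENT.PORT > SERVER.domain: ID+ TYPE? NAME (LEN)"
        $4 == ">" && $5 ~ /\.(domain|53):$/ {
            id = $6; sub(/[^0-9].*$/, "", id)   # strip flags such as "+"
            pending[$3 " " id] = $0
            next
        }
        # Response: "TIME IP SERVER.domain > CLIENT.PORT: ID ANSWERS..."
        $4 == ">" && $3 ~ /\.(domain|53)$/ {
            id = $6; sub(/[^0-9].*$/, "", id)
            client = $5; sub(/:$/, "", client)
            delete pending[client " " id]
        }
        END { for (key in pending) print pending[key] }
    '
}
```

For example, `tcpdump -n -i eth1 port 53 | unanswered_queries` prints the leftover queries once the capture is stopped. Queries still in flight at that moment will also show up, so the last few lines deserve some skepticism.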

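It is common for shared DNS servers to implement rate limiting so that one user cannot degrade performance for everyone else. We suspected the AWS DNS servers were doing exactly that. We could not find a way to confirm whether we were actually hitting an AWS rate limit, but it seemed wise not to DoS our own DNS server.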

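Solution: A Docker DNS Cache Using dnsmasq

To keep the DNS traffic local to the host, we needed a local DNS server to act as a cache. dnsmasq is a great choice for this kind of cache. It is reliable, widely used, and super simple to set up. And since all of our tests run in Docker containers, it made sense to run the DNS server in Docker too.

The basic idea is pretty simple: run a dnsmasq container as a DNS cache on the Docker host network, then run the test containers with the --dns option pointing at the cache container's IP address.

Here is the dns_cache script that starts the DNS cache container: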

#!/usr/bin/env bash

: "${IMAGE:=andyshinn/dnsmasq:2.76}"
: "${NAME:=dnsmasq}"
: "${ADAPT_DNS_IP_FILE:=/tmp/adapt_dns_ip}"

# Get IP address for an interface, as visible from inside a container
# connected to the host network
interfaceIP() {
    # Run a container and get ifconfig output from inside.
    # We need the ifconfig that will be visible from inside the dnsmasq
    # container. Note: gensub() and \s require GNU awk (gawk) on the host.
    docker run --rm --net=host busybox ifconfig "$1" 2>/dev/null | \
        awk '/inet /{print(gensub(/^.*inet (addr:)?([0-9.]+)\s.*$/, "\\2", 1))}'
}

if docker inspect --type container "${NAME}" > /dev/null 2>&1; then
    if [ -f "${ADAPT_DNS_IP_FILE}" ]; then
        # dnsmasq is already started
        cat "${ADAPT_DNS_IP_FILE}"
        exit 0
    else
        echo "DNS cache container running but file ${ADAPT_DNS_IP_FILE} does not exist." >&2
        exit 1
    fi
fi

# We only support attaching to the default (host) bridge named "bridge".
DOCKER_HOST_NETWORK=bridge

# Confirm that "bridge" is the default bridge
IS_DEFAULT=$(docker network inspect "${DOCKER_HOST_NETWORK}" --format '{{(index .Options "com.docker.network.bridge.default_bridge")}}')
if [ "${IS_DEFAULT}" != "true" ]; then
    echo "Cannot start DNS cache. The Docker network named \"${DOCKER_HOST_NETWORK}\" does not exist or is not the default bridge." >&2
    exit 1
fi

# Get the Linux interface name for the bridge, typically "docker0"
INTF_NAME=$(docker network inspect "${DOCKER_HOST_NETWORK}" --format '{{(index .Options "com.docker.network.bridge.name")}}')
if [ -z "${INTF_NAME}" ]; then
    echo "Cannot start DNS cache. Unable to determine default bridge interface name." >&2
    exit 1
fi

# Get the IP address of the bridge interface. This is the address that
# dnsmasq will listen on and other containers will send DNS requests to.
IP_ADDR=$(interfaceIP "${INTF_NAME}")
if [ -z "${IP_ADDR}" ]; then
    echo "Cannot start DNS cache. Docker bridge interface ${INTF_NAME} does not exist." >&2
    exit 1
fi

# Run the dnsmasq container. The host's /etc/resolv.conf configuration will
# be used by dnsmasq to resolve requests.
docker run --rm -d --cap-add=NET_ADMIN --name "${NAME}" --net=host \
    -v/etc/resolv.conf:/etc/resolv.conf "${IMAGE}" \
    --bind-interfaces --listen-address="${IP_ADDR}" --log-facility=- > /dev/null
if [ $? -ne 0 ]; then
    echo "Cannot start DNS cache. Docker run failed." >&2
    exit 1
fi

# Remember what IP address to use as DNS server, then output it.
echo "${IP_ADDR}" > "${ADAPT_DNS_IP_FILE}"
echo "${IP_ADDR}"

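Besides starting the container (if it is not already running), the script outputs the IP address of the cache container. We will use that address on the command line of any other containers we start. The script also makes sure dnsmasq only listens for DNS requests from inside Docker (on the Docker bridge interface), which is why it does some extra work to figure out which IP address to listen on.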
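One portability note on that extra work: the address parsing in interfaceIP uses gensub, which is a GNU awk extension and not part of POSIX awk. If the host awk is not gawk, a variant along these lines may help. This is a sketch only: the name `extract_ipv4` is hypothetical, and unlike the original it prints just the first IPv4 address it finds.

```shell
#!/bin/sh
# Hypothetical POSIX-awk alternative to the gensub-based parsing.
# Reads ifconfig output on stdin and prints the first IPv4 address.
# Handles both "inet addr:172.17.0.1" (busybox/net-tools 1.x style)
# and "inet 172.17.0.1" (newer net-tools style).
extract_ipv4() {
    awk '$1 == "inet" { ip = $2; sub(/^addr:/, "", ip); print ip; exit }'
}
```

Inside interfaceIP this would replace the awk stage, for example `docker run --rm --net=host busybox ifconfig "$1" 2>/dev/null | extract_ipv4`.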

Here is an example of how to start the DNS cache, store its IP address in the variable DNS_IP, and then run another container that uses the cache:

$  DNS_IP=$(dns_cache)
$  docker run --dns ${DNS_IP} --rm busybox ping -c1 adaptjs.org
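Test harnesses that launch many containers may prefer to reuse the saved address instead of invoking dns_cache each time. The sketch below assumes the cache has already been started, so that the file named by ADAPT_DNS_IP_FILE exists. The helper name `dns_args` is hypothetical and not part of Adapt.

```shell
#!/bin/sh
# Hypothetical helper: print "--dns <ip>" if the dns_cache script has
# already recorded the cache address, otherwise print nothing so that
# containers fall back to Docker's default DNS configuration.
dns_args() {
    ip_file="${ADAPT_DNS_IP_FILE:-/tmp/adapt_dns_ip}"
    if [ -s "$ip_file" ]; then
        printf '%s %s\n' --dns "$(cat "$ip_file")"
    fi
}
```

A test could then run `docker run $(dns_args) --rm busybox ping -c1 adaptjs.org`. The command substitution is left unquoted on purpose so that the option and the address become two separate arguments.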

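Verifying That the Cache Works

Once we started using the cache in our tests, the number of DNS queries the host system sent to the AWS DNS servers dropped. We also confirmed that the cache was working by checking the dnsmasq statistics. Sending SIGUSR1 to dnsmasq makes it print statistics to its log: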

$  docker kill -s USR1 dnsmasq
$  docker logs dnsmasq
dnsmasq[1]: cache size 150, 1085/4664 cache insertions re-used unexpired cache entries.
dnsmasq[1]: queries forwarded 1712, queries answered locally 3940
dnsmasq[1]: queries for authoritative zones 0
dnsmasq[1]: server 172.16.0.119#53: queries sent 1172, retried or failed 0
dnsmasq[1]: server 172.16.1.65#53: queries sent 252, retried or failed 0
dnsmasq[1]: server 172.16.0.2#53: queries sent 608, retried or failed 0
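To turn those counters into a single number, the share of queries answered from the cache can be computed from the "queries forwarded" line. This is a small sketch: the function name `cache_hit_rate` is hypothetical, and it assumes the log line format shown above. If the log contains several statistics dumps, the last one is used.

```shell
#!/bin/sh
# Hypothetical helper: read dnsmasq log output on stdin and print the
# percentage of queries that were answered locally (from the cache).
cache_hit_rate() {
    awk '
        /queries forwarded [0-9]+, queries answered locally [0-9]+/ {
            for (i = 1; i <= NF; i++) {
                # "+ 0" converts "1712," to the number 1712
                if ($i == "forwarded") forwarded = $(i + 1) + 0
                if ($i == "locally")   local_hits = $(i + 1) + 0
            }
        }
        END {
            total = forwarded + local_hits
            if (total > 0) printf "%.1f\n", 100 * local_hits / total
        }
    '
}
```

With the log above, `docker logs dnsmasq 2>&1 | cache_hit_rate` gives 3940 / (1712 + 3940), or roughly 69.7 percent of queries served without contacting an upstream server.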

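Most importantly, we saw a significant decrease in system test timeouts, and our CI runs became stable.

This problem took us a while to figure out. But keeping CI healthy is really important. If you have too many sporadic test failures, developers tend to ignore the CI results and push code that may be broken.

So even though tracking down these failures was time-consuming, it was definitely worth the investment, especially given how simple the fix turned out to be.

Originally published on the Adapt blog.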

From: https://hackernoon.com/fixing-dns-timeouts-in-docker-hbr32ej