Debugging notes: an etcd clientv3 lease keepalive bug that renews leases far too often

Background:

Go 1.9.1

CentOS 6

An etcd cluster of 7 member nodes; members talk to each other over HTTPS, and clients talk to the cluster over HTTPS as well.

Test environment of 20,000 hosts; every host runs our agent, each agent registers one key attached to a lease, and the lease TTL is set to 10s.

KeepAlive is enabled for the lease.

The product was waiting to ship, we kept running into cluster performance problems, and for a long time I couldn't find the cause. I felt like I was falling apart.

If the cluster performance problem isn't solved, we can't roll out to more production servers; the boss doesn't approve and isn't happy, and I'm going to get fired, or quit...

I opened an issue on the etcd GitHub repo, but the maintainers didn't have time to look at it (understandable: there are far too many issues, and most turn out to be user error)...

var err error
ctx, cancel := context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
leaseResp, err := this.GetEtcdClient().Grant(ctx, 10) // lease TTL set to 10 seconds
cancel()
if err != nil {
	return err
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
kvc := etcdclient.NewKV(this.kapi)
var txnResp *etcdclient.TxnResponse
txnResp, err = kvc.Txn(ctx).
	If(etcdclient.Compare(etcdclient.CreateRevision(keyName), "=", 0)).
	Then(etcdclient.OpPut(keyName, value.GetRegisterInfo(), etcdclient.WithLease(etcdclient.LeaseID(leaseResp.ID)))).
	Commit()
_, err = this.kapi.Put(ctx, keyName, value.GetRegisterInfo())
cancel()
if err != nil {
	return err
}
if !txnResp.Succeeded {
	return fmt.Errorf("failed to register myself: %s, the key already exists", keyName)
}

ctx, cancel = context.WithTimeout(context.Background(), ETCD_TRANSPORT_TIMEOUT)
_, err = this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
cancel()
if err != nil {
	return err
}

Symptoms:

Keys carrying a lease id disappear for no apparent reason.

The leader node has an extremely high TIME-WAIT count most of the time.

The leader's CLOSE-WAIT count is sometimes very high as well.

The leader's CPU usage sits at around 80-90%.

The leader's memory usage keeps climbing, which looks like a memory leak.

Monitoring graphs (omitted here).

Starting with the etcd process on the leader:

First, the suspected memory leak. etcd already exposes pprof endpoints, and the Go toolchain can read an HTTP pprof endpoint directly, but it doesn't support HTTPS, so the profile has to be downloaded first and then read locally:

curl --cacert /xxx/xxx/etcd/config/ssl/ca.pem --cert /xxx/xxx/etcd/config/ssl/client.pem --key /xxx/xxx/etcd/config/ssl/client-key.pem -L https://leader_ip:2379/debug/pprof/profile > leader_ip_cpu.txt

The result:

cfc4n@cnxct:~/Downloads$go tool pprof leader_ip_cpu.txt

File: etcd

Type: cpu

Time: Jul 3, 2018 at 4:13pm (CST)

Duration: 2.64mins, Total samples = 34.54mins (1306.43%)

Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top 20

Showing nodes accounting for 23.98mins, 69.42% of 34.54mins total

Dropped 1702 nodes (cum <= 0.17mins)

Showing top 20 nodes out of 211

flat flat% sum% cum cum%

3.62mins 10.48% 10.48% 3.62mins 10.48% runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s

2.99mins 8.66% 19.13% 2.99mins 8.66% math/big.divWVW /usr/local/go/src/math/big/arith_amd64.s

2.34mins 6.78% 25.91% 2.34mins 6.78% math/big.bitLen /usr/local/go/src/math/big/arith_amd64.s

1.90mins 5.51% 31.42% 10.44mins 30.22% math/big.nat.montgomery /usr/local/go/src/math/big/nat.go

1.67mins 4.82% 36.24% 1.67mins 4.82% runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s

1.46mins 4.22% 40.47% 3.57mins 10.33% runtime.scanobject /usr/local/go/src/runtime/mgcmark.go

1.36mins 3.93% 44.40% 1.44mins 4.17% syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s

1.30mins 3.78% 48.17% 1.30mins 3.78% runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go

1.04mins 3.01% 51.19% 1.04mins 3.02% runtime.greyobject /usr/local/go/src/runtime/mgcmark.go

0.94mins 2.71% 53.90% 0.94mins 2.71% math/big.divWVW

0.80mins 2.33% 56.23% 0.83mins 2.41% runtime.siftdownTimer /usr/local/go/src/runtime/time.go

0.80mins 2.32% 58.55% 3.17mins 9.17% math/big.nat.divLarge /usr/local/go/src/math/big/nat.go

0.78mins 2.27% 60.82% 0.78mins 2.27% math/big.addMulVVW /usr/local/go/src/math/big/arith_amd64.s

0.57mins 1.66% 62.48% 0.57mins 1.66% runtime._ExternalCode /usr/local/go/src/runtime/proc.go

0.57mins 1.64% 64.12% 0.57mins 1.64% math/big.(*byteReader).Read

0.42mins 1.22% 65.33% 3.22mins 9.32% runtime.mallocgc /usr/local/go/src/runtime/malloc.go

0.40mins 1.17% 66.50% 0.40mins 1.17% math/big.mulAddVWW /usr/local/go/src/math/big/arith_amd64.s

0.34mins 0.99% 67.49% 0.34mins 0.99% vendor/golang_org/x/crypto/curve25519.ladderstep /usr/local/go/src/vendor/golang_org/x/crypto/curve25519/ladderstep_amd64.s

0.34mins 0.99% 68.49% 0.34mins 0.99% math/big.bitLen

0.32mins 0.93% 69.42% 0.41mins 1.18% sync.(*Pool).pin /usr/local/go/src/sync/pool.go

A large share of CPU is spent in math package functions, yet our business logic involves no mathematical computation at all and never imports the math package. The only plausible source is the RSA arithmetic done while verifying TLS certificates for HTTPS. I exported the pprof data to an SVG graph to confirm: it is indeed the TLS certificate computation that eats most of the CPU.

But shouldn't an HTTPS connection be established once and then reused? Where does all this computation come from? Could the huge TIME-WAIT count be related? Back to the TIME-WAIT question: at this point the leader is xxxx-etcdserver-xxx02, and its TIME_WAIT count is extremely high.

[user@xxxx-etcdserver-xxx02 etcd]$ ss |grep 2380|awk '{print $5}'|awk -F ":" '{print $1}'|sort |uniq -c|sort -rn

53882 192.168.198.234

53667 192.168.227.202

53539 192.168.255.210

53197 192.168.58.204

53135 192.168.137.210

51966 192.168.128.205

The TIME-WAIT connections on the leader all come from the other member nodes, tens of thousands per member on average; query again a few seconds later and it is still 50,000+. On each member node, meanwhile, there are over a thousand ESTABLISHED connections to the leader's port 2380, and new connections really are being created all the time. Why do the members keep creating connections? Time to capture some TCP traffic with tcpdump.

Here is what the capture shows (a few representative frames):

No.     Time                     Source         Destination    Len  Info
120140  2018-06-22 20:37:30.219  10.72.198.234  10.20.127.136  74   12902 → etcd-server(2380) [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=3008487334 TSecr=0 WS=128
120141  2018-06-22 20:37:30.219  10.20.127.136  10.72.198.234  74   etcd-server(2380) → 12902 [SYN, ACK] Seq=0 Ack=1 Win=14480 Len=0 MSS=1460 SACK_PERM=1 TSval=2796204056 TSecr=3008487334 WS=128
120362  2018-06-22 20:37:30.242  10.72.198.234  10.20.127.136  66   12902 → etcd-server(2380) [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=3008487358 TSecr=2796204056
120368  2018-06-22 20:37:30.242  10.72.198.234  10.20.127.136  204  12902 → etcd-server(2380) [PSH, ACK] Seq=1 Ack=1 Win=14720 Len=138 TSval=3008487358 TSecr=2796204056
120369  2018-06-22 20:37:30.242  10.20.127.136  10.72.198.234  66   etcd-server(2380) → 12902 [ACK] Seq=1 Ack=139 Win=15616 Len=0 TSval=2796204080 TSecr=3008487358
174076  2018-06-22 20:37:40.242  10.72.198.234  10.20.127.136  66   12902 → etcd-server(2380) [FIN, ACK] Seq=139 Ack=1 Win=14720 Len=0 TSval=3008497358 TSecr=2796204080
174377  2018-06-22 20:37:40.282  10.20.127.136  10.72.198.234  66   etcd-server(2380) → 12902 [ACK] Seq=1 Ack=140 Win=15616 Len=0 TSval=2796214120 TSecr=3008497358

What can you read out of these packets? Think about it for five minutes before reading on...

Frame 120140 is the member node 10.72.198.234, acting as the client, opening a TCP connection to port 2380 on the leader, 10.20.127.136.

Frame 120141 is the leader's SYN-ACK, and 120362 is the member's ACK that completes the handshake. Straightforward so far.

Frame 120368 is a data packet from the member to the leader, and 120369 is the leader's ACK for it. Also easy to follow.

But what is 174076? It is the member sending a FIN to close the TCP connection. Why? Why would the member close the connection?

I sampled a few more conversations, picked from the thousand-plus SYNs listed under Wireshark's "Expert Information"...

Every one of them shows the member actively sending the FIN toward the leader. Why? And is that all there is to see? Did you notice the gap between the member's data packet (ACKed by the leader) and the member's FIN? It is always 10 seconds. Why 10 seconds?

The leader ACKs, but never sends its own FIN,ACK back to the member. Why not? And if it never does, what state is that TCP five-tuple left in on the leader's side? Right: CLOSE-WAIT. Could this be what causes the huge CLOSE-WAIT count on the leader? Hold that thought and keep reading...

As noted above, the member sends the leader a PSH data packet. What is in it, and why does the leader never answer?

The payload is 138 bytes and at first glance tells you nothing. But look at the first three bytes: 0x16, 0x03, 0x01. If you know HTTPS, these are familiar: 0x16 (decimal 22) is the TLS record type for a Handshake (the Client Hello), and the following 0x03, 0x01 is the TLS record-layer version. Since member-to-leader traffic really is HTTPS, this has to be the TLS Client Hello. Combined with the 10-second gap observed earlier, it means the TLS handshake between member and leader never completes: the leader never sends its Server Hello... The rest of the short payload reveals nothing else useful.
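For readers who want to sanity-check a payload like this themselves, here is a small self-contained sketch (mine, not from the original troubleshooting) that looks at the first bytes of a captured TCP payload and reports whether they plausibly form a TLS handshake record such as a Client Hello; the sample bytes are made up for illustration.

package main

import "fmt"

// looksLikeTLSHandshake reports whether a TCP payload starts with a TLS
// handshake record header: content type 0x16 (handshake) followed by a
// plausible record-layer version (0x03 0x01 = TLS 1.0, which is what
// ClientHello record headers typically carry even for newer versions).
func looksLikeTLSHandshake(payload []byte) bool {
	return len(payload) >= 3 &&
		payload[0] == 0x16 &&
		payload[1] == 0x03 &&
		payload[2] <= 0x04
}

func main() {
	// First bytes of the 138-byte payload seen in the capture (illustrative).
	sample := []byte{0x16, 0x03, 0x01, 0x00, 0x85, 0x01}
	fmt.Println(looksLikeTLSHandshake(sample)) // true
}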

Dropping HTTPS to lighten the CPU load:

Going back to the CPU pprof: the high CPU usage comes from TLS certificate computation, but what triggers all that TLS work, and also the memory growth and the message backlog, is the flood of newly created TCP connections. Still, we tried removing TLS verification between members and leader to relieve the leader's CPU. It didn't change much; CPU stayed high. That was expected, since TLS was never the root cause. The pprof after the change:

cfc4n@cnxct:~/Downloads$ go tool pprof 111.111.111.111-0703_cpu.txt

File: etcd

Type: cpu

Time: Jul 3, 2018 at 4:55pm (CST)

Duration: 30.01s, Total samples = 6.16mins (1231.36%)

Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top 20

Showing nodes accounting for 4.29mins, 69.69% of 6.16mins total

Dropped 928 nodes (cum <= 0.03mins)

Showing top 20 nodes out of 181

flat flat% sum% cum cum%

2.13mins 34.62% 34.62% 2.13mins 34.62% runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s

0.59mins 9.51% 44.13% 0.62mins 10.06% syscall.Syscall /usr/local/go/src/syscall/asm_linux_amd64.s

0.33mins 5.42% 49.55% 0.33mins 5.42% runtime._ExternalCode /usr/local/go/src/runtime/proc.go

0.25mins 4.07% 53.62% 1.23mins 20.00% runtime.lock /usr/local/go/src/runtime/lock_futex.go

0.13mins 2.18% 55.80% 0.13mins 2.18% runtime.procyield /usr/local/go/src/runtime/asm_amd64.s

0.11mins 1.80% 57.60% 0.23mins 3.72% runtime.selectgoImpl /usr/local/go/src/runtime/select.go

0.10mins 1.57% 59.17% 0.10mins 1.57% runtime.osyield /usr/local/go/src/runtime/sys_linux_amd64.s

0.09mins 1.39% 60.56% 1.47mins 23.85% runtime.unlock /usr/local/go/src/runtime/lock_futex.go

0.07mins 1.12% 61.69% 0.33mins 5.30% runtime.mallocgc /usr/local/go/src/runtime/malloc.go

0.07mins 1.06% 62.74% 0.07mins 1.06% runtime.heapBitsForObject /usr/local/go/src/runtime/mbitmap.go

0.06mins 1.01% 63.76% 0.15mins 2.36% runtime.scanobject /usr/local/go/src/runtime/mgcmark.go

0.05mins 0.77% 64.53% 0.05mins 0.77% runtime.siftupTimer /usr/local/go/src/runtime/time.go

0.04mins 0.71% 65.24% 0.04mins 0.71% runtime.greyobject /usr/local/go/src/runtime/mgcmark.go

0.04mins 0.71% 65.95% 0.07mins 1.14% runtime.deferreturn /usr/local/go/src/runtime/panic.go

0.04mins 0.7% 66.65% 1.35mins 21.86% runtime.deltimer /usr/local/go/src/runtime/time.go

0.04mins 0.67% 67.32% 0.12mins 1.92% runtime.pcvalue /usr/local/go/src/runtime/symtab.go

0.04mins 0.61% 67.93% 0.04mins 0.64% runtime.siftdownTimer /usr/local/go/src/runtime/time.go

0.04mins 0.6% 68.53% 0.23mins 3.74% runtime.gentraceback /usr/local/go/src/runtime/traceback.go

0.04mins 0.59% 69.11% 0.04mins 0.59% runtime.casgstatus /usr/local/go/src/runtime/proc.go

0.04mins 0.57% 69.69% 0.04mins 0.57% runtime.memmove /usr/local/go/src/runtime/memmove_amd64.s

Now the top CPU consumers are runtime.futex and syscall.Syscall, plus runtime functions such as heapBitsForObject, scanobject and greyobject, which belong to Go's garbage collector. In other words, GC alone is already burning a lot of CPU.

Now cross-check with the memory pprof:

cfc4n@cnxct:~/Desktop$ go tool pprof 10.3.5.52.txt

File: etcd

Type: inuse_space

Time: Jul 3, 2018 at 4:57pm (CST)

Entering interactive mode (type "help" for commands, "o" for options)

(pprof) top 20

Showing nodes accounting for 10665.34MB, 94.72% of 11259.98MB total

Dropped 368 nodes (cum <= 56.30MB)

Showing top 20 nodes out of 57

flat flat% sum% cum cum%

7156.71MB 63.56% 63.56% 7156.71MB 63.56% github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*recvBuffer).put /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/transport.go

2258.28MB 20.06% 83.61% 9417.49MB 83.64% github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport.(*serverHandlerTransport).HandleStreams.func4 /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/transport/handler_server.go

663.28MB 5.89% 89.51% 663.28MB 5.89% net/http.http2putRequestBodyBuf /usr/local/go/src/net/http/h2_bundle.go

277.15MB 2.46% 91.97% 277.15MB 2.46% runtime.makechan /usr/local/go/src/runtime/chan.go

66.11MB 0.59% 92.55% 66.11MB 0.59% crypto/tls.(*block).reserve /usr/local/go/src/crypto/tls/conn.go

65.24MB 0.58% 93.13% 68.74MB 0.61% net/http.(*Transport).dialConn /usr/local/go/src/net/http/transport.go

56.52MB 0.5% 93.64% 56.52MB 0.5% runtime.malg /usr/local/go/src/runtime/proc.go

54.52MB 0.48% 94.12% 85.91MB 0.76% runtime.mapassign /usr/local/go/src/runtime/hashmap.go

23.53MB 0.21% 94.33% 113.68MB 1.01% crypto/tls.(*Conn).readHandshake /usr/local/go/src/crypto/tls/conn.go

15MB 0.13% 94.46% 74.58MB 0.66% runtime.systemstack /usr/local/go/src/runtime/asm_amd64.s

12MB 0.11% 94.57% 795MB 7.06% net/http.(*http2Server).ServeConn /usr/local/go/src/net/http/h2_bundle.go

11MB 0.098% 94.67% 65.23MB 0.58% net/http.(*http2Framer).readMetaFrame /usr/local/go/src/net/http/h2_bundle.go

2MB 0.018% 94.68% 990.71MB 8.80% net/http.(*conn).serve /usr/local/go/src/net/http/server.go

2MB 0.018% 94.70% 77.73MB 0.69% net/http.(*http2serverConn).readFrames /usr/local/go/src/net/http/h2_bundle.go

1MB 0.0089% 94.71% 721.45MB 6.41% net/http.(*http2serverConn).newWriterAndRequest /usr/local/go/src/net/http/h2_bundle.go

1MB 0.0089% 94.72% 755.99MB 6.71% net/http.(*http2serverConn).serve /usr/local/go/src/net/http/h2_bundle.go

0 0% 94.72% 64.32MB 0.57% [etcd]

0 0% 94.72% 190.21MB 1.69% crypto/tls.(*Conn).Handshake /usr/local/go/src/crypto/tls/conn.go

0 0% 94.72% 64.60MB 0.57% crypto/tls.(*Conn).readRecord /usr/local/go/src/crypto/tls/conn.go

0 0% 94.72% 190.21MB 1.69% crypto/tls.(*Conn).serverHandshake /usr/local/go/src/crypto/tls/handshake_server.go

As shown above, roughly 65% of the memory is held by *recvBuffer.put.

// recvBuffer is an unbounded channel of recvMsg structs.
// Note recvBuffer differs from controlBuffer only in that recvBuffer
// holds a channel of only recvMsg structs instead of objects implementing "item" interface.
// recvBuffer is written to much more often than
// controlBuffer and using strict recvMsg structs helps avoid allocation in "recvBuffer.put"
type recvBuffer struct {
	c       chan recvMsg
	mu      sync.Mutex
	backlog []recvMsg
}

func newRecvBuffer() *recvBuffer {
	b := &recvBuffer{
		c: make(chan recvMsg, 1),
	}
	return b
}

func (b *recvBuffer) put(r recvMsg) {
	b.mu.Lock()
	if len(b.backlog) == 0 {
		select {
		case b.c <- r:
			b.mu.Unlock()
			return
		default:
		}
	}
	b.backlog = append(b.backlog, r)
	b.mu.Unlock()
}

The put method here appends each received recvMsg to the b.backlog slice. Both put and its counterpart load read and write b.backlog, so every write and every read takes the mutex, which costs CPU, hurts throughput and raises system load. Worse, if load can't keep up with put, the slice grows without bound, which is exactly what a memory leak would look like...
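For completeness, the consumer side in the same file looks roughly like this (quoted from memory, so treat it as an approximate sketch rather than the exact upstream source): load moves at most one backlogged message into the channel per call, so a producer that calls put faster than the reader drains b.c will keep growing the backlog.

// load tries to advance the buffer: if there is a backlog, move its first
// element into the channel without blocking; otherwise do nothing. The
// reader that drains b.c is expected to call load again afterwards.
func (b *recvBuffer) load() {
	b.mu.Lock()
	if len(b.backlog) > 0 {
		select {
		case b.c <- b.backlog[0]:
			b.backlog = b.backlog[1:]
		default:
		}
	}
	b.mu.Unlock()
}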

So the question becomes: why is put being called so heavily? Why is it writing so frequently... why???

Locating the suspected source of the apparent memory leak:

grpc/server.go, func (s *Server) serveStreams(st transport.ServerTransport)

grpc/transport/handler_server.go, func (ht *serverHandlerTransport) HandleStreams at line 284, specifically around line 334:

s.trReader = &transportReader{
	reader:        &recvBufferReader{ctx: s.ctx, recv: s.buf},
	windowHandler: func(int) {},
}

// readerDone is closed when the Body.Read-ing goroutine exits.
readerDone := make(chan struct{})
go func() {
	defer close(readerDone)
	// TODO: minimize garbage, optimize recvBuffer code/ownership
	const readSize = 8196
	for buf := make([]byte, readSize); ; {
		n, err := req.Body.Read(buf)
		if n > 0 {
			s.buf.put(recvMsg{data: buf[:n:n]})
			buf = buf[n:]
		}
		if err != nil {
			s.buf.put(recvMsg{err: mapRecvMsgError(err)})
			return
		}
		if len(buf) == 0 {
			buf = make([]byte, readSize)
		}
	}
}()

From there the flow reaches func (b *recvBuffer) put(r recvMsg). In other words, the pile-up of messages in the slice is caused by a flood of incoming HTTP requests. The next step is to find out who is sending those HTTP requests.

Reading the etcd source:

Given the huge number of TIME-WAIT connections between members, the investigation moves to the member nodes.

There are many places in the etcd source where TCP/HTTP connections are created, and I couldn't come up with a smarter way to locate the bug quickly, so I worked through the source step by step:

etcdmain/main.go, line 39: startEtcdOrProxyV2()

etcdmain/etcd.go, line 103: startEtcd()

etcdmain/etcd.go, line 186: embed.StartEtcd(cfg)

embed/etcd.go, line 107: startPeerListeners() creates a Listener via rafthttp.NewListener, used for communication between etcd members. It relies on the github.com/coreos/etcd/pkg/transport package.

embed/etcd.go, line 110: startClientListeners() creates a Listener via net.Listen, used for communication between clients and members. Keepalive uses the github.com/coreos/etcd/pkg/transport package.

Line 157: etcdserver.NewServer() reads the local configuration and the data directory, sets up the mvcc multi-version functions, the lease renewal functions, and so on.

Line 166: e.servePeers() starts Serve on the listener to accept requests.

Line 317: etcdhttp.NewPeerHandler(e.Server) registers the HTTP handlers, which are:

As shown above, communication between etcd nodes goes through these handlers:

"/" -> http.NotFound

"/raft", "/raft/" -> raftHandler

"/members" -> mh

"/leases/internal", "/leases" -> raftHandler

"/version" -> versionHandler

That is, the bulk of member-to-member traffic has to be going through one of these handlers, unless I missed something in the source...

Judging from what this product does, what members need to sync in volume should be data, so should the focus go to raftHandler? My head hurts...

raftHandler is returned by s.RaftHandler(), whose value is *EtcdServer.r.transport.Handler().

*EtcdServer.r is assigned at line 414 of etcdserver/server.go:

r: *newRaftNode(
	raftNodeConfig{
		isIDRemoved: func(id uint64) bool { return cl.IsIDRemoved(types.ID(id)) },
		Node:        n,
		heartbeat:   heartbeat,
		raftStorage: s,
		storage:     NewStorage(w, ss),
	},
),

At line 511 of etcdserver/server.go, *EtcdServer.r.transport = tr is assigned; its type is *rafthttp.Transport, and its Handler looks like this:

func (t *Transport) Handler() http.Handler {
	pipelineHandler := newPipelineHandler(t, t.Raft, t.ClusterID)
	streamHandler := newStreamHandler(t, t, t.Raft, t.ID, t.ClusterID)
	snapHandler := newSnapshotHandler(t, t.Raft, t.Snapshotter, t.ClusterID)
	mux := http.NewServeMux()
	mux.Handle(RaftPrefix, pipelineHandler)
	mux.Handle(RaftStreamPrefix+"/", streamHandler)
	mux.Handle(RaftSnapshotPrefix, snapHandler)
	mux.Handle(ProbingPrefix, probing.NewHandler())
	return mux
}

Here are several more handlers, all related to raft processing. The code is heavy going, and the author's intent is hard to follow.

Printing the http.dial call stack:

Finding the code that initiates these TCP connections purely by reading would be very hard, so I changed approach: right where the TCP connection is dialed, check whether addr is the etcd leader's address and, if so, print the call stack using debug.PrintStack() from runtime/debug. I also printed remote.addr after each connection was established, for cross-checking. Note that this modifies Go's own standard library (remember to revert it afterwards), so be careful. Around line 882 of net/http/transport.go I added the following:

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if addr == "leader_ip:2380" {
		debug.PrintStack() // added for debugging; requires importing "runtime/debug"
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

I rebuilt the etcd server, replaced the binary on one member of the cluster, and stress-tested that node to reproduce the TIME-WAIT flood. Once the symptom reappeared, the log contained the call stack:

0. net.(*Dialer).Dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/dial.go:323
1. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_dialer.go:29
2. github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial-fm() /Users/cfc4n/gopath/src/github.com/coreos/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/pkg/transport/timeout_transport.go:51
3. net/http.(*Transport).dial() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:887
4. net/http.(*Transport).dialConn() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:1060
5. net/http.(*Transport).getConn.func4() /usr/local/Cellar/go/1.9.1/libexec/src/net/http/transport.go:943
6. runtime.goexit() /usr/local/Cellar/go/1.9.1/libexec/src/runtime/asm_amd64.s:2337

During the stress test I also used netstat to sample the TCP connections on this member a few times, as data to cross-check against the log output.

In the stress-test logs, every newly established TCP connection printed this stack along with the remote IP and port. Comparing them with the netstat samples, the five-tuples of the connections to the leader's port 2380 matched the remote.addr values printed above.

So this is the function that creates the flood of TCP connections from member to leader. Having found the function and its call stack, isn't the problem as good as solved?

Not quite. The stack above only reaches coreos/etcd/pkg/transport.(*rwTimeoutDialer).Dial; it doesn't show the rest of the calling chain. Why?

// src/net/http/transport.go, lines 886-887 (note the two branches below)
func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
	if t.DialContext != nil {
		return t.DialContext(ctx, network, addr)
	}
	if t.Dial != nil {
		c, err := t.Dial(network, addr)
		if c == nil && err == nil {
			err = errors.New("net/http: Transport.Dial hook returned (nil, nil)")
		}
		return c, err
	}
	return zeroDialer.DialContext(ctx, network, addr)
}

Why the etcd member closes the TCP connection after 10s:

In the standard library's net/http/transport.go, *Transport.dial calls the custom Dial function if the Transport defines one. That custom Dial is defined in github.com/coreos/etcd/pkg/transport/transport.go:

func NewTransport(info TLSInfo, dialtimeoutd time.Duration) (*http.Transport, error) {
	cfg, err := info.ClientConfig()
	if err != nil {
		return nil, err
	}

	t := &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		Dial: (&net.Dialer{
			Timeout: dialtimeoutd,
			// value taken from http.DefaultTransport
			KeepAlive: 30 * time.Second,
		}).Dial,
		// value taken from http.DefaultTransport
		TLSHandshakeTimeout: 10 * time.Second,
		TLSClientConfig:     cfg,
	}

	dialer := (&net.Dialer{
		Timeout:   dialtimeoutd,
		KeepAlive: 30 * time.Second,
	})
	dial := func(net, addr string) (net.Conn, error) {
		return dialer.Dial("unix", addr)
	}

	tu := &http.Transport{
		Proxy:               http.ProxyFromEnvironment,
		Dial:                dial,
		TLSHandshakeTimeout: 10 * time.Second, // note: 10s, which explains the 10-second TLS timeout seen in the capture
		TLSClientConfig:     cfg,
	}
	ut := &unixTransport{tu}

	t.RegisterProtocol("unix", ut)
	t.RegisterProtocol("unixs", ut)
	return t, nil
}

In dial := func(net, addr string) (net.Conn, error)..., a custom Dial function is defined and assigned to http.Transport.Dial, and the TLS handshake timeout is set to TLSHandshakeTimeout: 10 * time.Second. That is why, in the earlier capture, the member closes the connection 10 seconds later: the TLS handshake exceeded the configured timeout.

The only caller of NewTransport I could find is NewTimeoutTransport:

func NewTimeoutTransport(info TLSInfo, dialtimeoutd, rdtimeoutd, wtimeoutd time.Duration) (*http.Transport, error) {
	tr, err := NewTransport(info, dialtimeoutd)
	if err != nil {
		return nil, err
	}

	if rdtimeoutd != 0 || wtimeoutd != 0 {
		// the timed out connection will timeout soon after it is idle.
		// it should not be put back to http transport as an idle connection for future usage.
		tr.MaxIdleConnsPerHost = -1
	} else {
		// allow more idle connections between peers to avoid unnecessary port allocation.
		tr.MaxIdleConnsPerHost = 1024
	}

	tr.Dial = (&rwTimeoutDialer{
		Dialer: net.Dialer{
			Timeout:   dialtimeoutd,
			KeepAlive: 30 * time.Second,
		},
		rdtimeoutd: rdtimeoutd,
		wtimeoutd:  wtimeoutd,
	}).Dial
	return tr, nil
}

So if I can find the callers of NewTimeoutTransport, I can pin down the code that initiates all those TCP connections... but the earlier stack print never reached those callers, so another approach is needed.

NewTimeoutTransport also sets the maximum number of idle HTTP keep-alive connections: tr.MaxIdleConnsPerHost = 1024. This parameter controls how many idle TCP connections Go's http client may keep per host for reuse; etcd raises it to 1024, meaning an etcd member may hold up to 1024 idle connections to the leader, ready to be reused. (As an aside, some people say HTTP is the best RPC; I disagree, if only because of how differently connections get reused. With HTTP/1.1, keep-alive only lets a connection be reused after the previous request has received its response, so requests and responses are serialized. HTTP/2 does not need to wait for the previous request to return and offers true multiplexing; see "Streams and Multiplexing". If you have ever implemented an RPC framework and had to match responses to requests over TCP, you will immediately see the difference between connection reuse in HTTP/1.1 and HTTP/2.)
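As a side note, setting this knob on a plain Go http.Transport looks roughly like the sketch below (my illustration, not etcd code; the URL is a placeholder). With HTTP/1.1 a connection only returns to this idle pool, and thus can only be reused, after the previous response has been fully read and closed, which is the serialization just described.

package main

import (
	"io"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func main() {
	tr := &http.Transport{
		// How many idle (keep-alive) connections to keep per host for reuse.
		// etcd's NewTimeoutTransport sets this to 1024 for the peer transport.
		MaxIdleConnsPerHost: 1024,
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: tr, Timeout: 5 * time.Second}

	// Placeholder URL; the body must be drained and closed for the
	// connection to go back into the idle pool and be reused.
	resp, err := client.Get("http://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	io.Copy(ioutil.Discard, resp.Body)
	resp.Body.Close()
}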

Investigating the traffic between etcd members and the leader:

Back to the main thread. I tried to locate the code by reading it, but the etcd source is complex and I got stuck around the raft data synchronization between nodes. Time for another angle: for the earlier CPU problem we had already removed TLS verification between nodes, so we can simply capture traffic on port 2380 on a member and look at what is actually being sent.

The capture shows the member sending a large number of POST requests for /leases to the leader. Following that URI, the code that issues the HTTP request is at line 252 of etcdserver/v3_server.go:

func (s *EtcdServer) LeaseRenew(ctx context.Context, id lease.LeaseID) (int64, error) {
	ttl, err := s.lessor.Renew(id)
	if err == nil { // already requested to primary lessor(leader)
		return ttl, nil
	}
	if err != lease.ErrNotPrimary {
		return -1, err
	}

	cctx, cancel := context.WithTimeout(ctx, s.Cfg.ReqTimeout())
	defer cancel()

	// renewals don't go through raft; forward to leader manually
	for cctx.Err() == nil && err != nil {
		leader, lerr := s.waitLeader(cctx)
		if lerr != nil {
			return -1, lerr
		}
		for _, url := range leader.PeerURLs {
			lurl := url + leasehttp.LeasePrefix
			ttl, err = leasehttp.RenewHTTP(cctx, id, lurl, s.peerRt)
			if err == nil || err == lease.ErrLeaseNotFound {
				return ttl, err
			}
		}
	}
	return -1, ErrTimeout
}

Reading this code, the function calls leasehttp.RenewHTTP against leader.PeerURLs, i.e. it forwards the renewal to the leader node. Following leasehttp.RenewHTTP:

func RenewHTTP(ctx context.Context, id lease.LeaseID, url string, rt http.RoundTripper) (int64, error) {
	// will post lreq protobuf to leader
	lreq, err := (&pb.LeaseKeepAliveRequest{ID: int64(id)}).Marshal()
	if err != nil {
		return -1, err
	}

	cc := &http.Client{Transport: rt}
	req, err := http.NewRequest("POST", url, bytes.NewReader(lreq))
	if err != nil {
		return -1, err
	}
	req.Header.Set("Content-Type", "application/protobuf")
	req.Cancel = ctx.Done()

	resp, err := cc.Do(req)
	if err != nil {
		return -1, err
	}
	// ... (rest of the function omitted)

As you can see, the request is sent with the standard library's http.Client.Do, and http.Client.Transport is the rt http.RoundTripper passed in as a parameter.

Tying the 10s TLS timeout back to its call chain:

Going one level up:

The s.peerRt argument comes from *EtcdServer.peerRt, which is assigned peerRt: prt at line 429 of etcdserver/server.go.

The prt variable is assigned at line 288 of the same file: prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout()).

rafthttp.NewRoundTripper calls NewTimeoutTransport in pkg/transport.

And NewTimeoutTransport calls NewTransport, the very function the stack print showed creating all those TCP connections, and the place where the 10s TLS handshake timeout is set.

Now everything links up: the call chain responsible for the flood of TCP connections is confirmed, which means the direction is right. What remains is to follow the thread and find out why so many requests are being issued in the first place.

Back to where LeaseRenew is called: func (s *EtcdServer) LeaseRenew is invoked from github.com/coreos/etcd/etcdserver/api/v3rpc/lease.go, around line 88:

func (ls *LeaseServer) leaseKeepAlive(stream pb.Lease_LeaseKeepAliveServer) error {
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to receive lease keepalive request from gRPC stream (%q)", err.Error())
			}
			return err
		}

		// Create header before we sent out the renew request.
		// This can make sure that the revision is strictly smaller or equal to
		// when the keepalive happened at the local server (when the local server is the leader)
		// or remote leader.
		// Without this, a lease might be revoked at rev 3 but client can see the keepalive succeeded
		// at rev 4.
		resp := &pb.LeaseKeepAliveResponse{ID: req.ID, Header: &pb.ResponseHeader{}}
		ls.hdr.fill(resp.Header)

		ttl, err := ls.le.LeaseRenew(stream.Context(), lease.LeaseID(req.ID))
		if err == lease.ErrLeaseNotFound {
			err = nil
			ttl = 0
		}

		if err != nil {
			return togRPCError(err)
		}

		resp.TTL = ttl
		err = stream.Send(resp)
		if err != nil {
			if isClientCtxErr(stream.Context().Err(), err) {
				plog.Debugf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			} else {
				plog.Warningf("failed to send lease keepalive response to gRPC stream (%q)", err.Error())
			}
			return err
		}
	}
}

At line 71, func (ls *LeaseServer) LeaseKeepAlive calls func (ls *LeaseServer) leaseKeepAlive().

At line 3417 of etcdserver/etcdserverpb/rpc.pb.go, func _Lease_LeaseKeepAlive_Handler(srv interface{}, stream grpc.ServerStream) calls LeaseKeepAlive.

At line 3480 of rpc.pb.go, _Lease_LeaseKeepAlive_Handler is registered as the handler for StreamName: "LeaseKeepAlive", and it is hooked into the listener at startup via (*grpc.Server).RegisterService.

leaseKeepAlive is an event loop: whenever stream.Recv() returns data, the renewal is forwarded to the leader node.

stream.Recv() lives in google.golang.org/grpc/stream.go; this is not etcd's own code but another open source project, which meant yet more code to read and digest.

After a long read, it turns out the messages received here come from the clients. You can actually see it in the naming: variables handling client-to-member traffic contain "stream", while member-to-leader and member-to-member traffic carries "peer". Clients talk to etcd nodes over HTTP/2 with gRPC + protobuf encoding, whereas etcd members talk to each other over HTTP/1.1, also with protobuf-encoded messages.

When does the etcd lease get renewed:

So the flood of lease renewal requests comes from the clients. In my case the host lease TTL is set to 10s; is something off there? Who sends the renewal, the client itself? How often? One second before expiry?

In the etcd client library, the renewal code is at line 474 of clientv3/lease.go:

// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}

The nextKeepAlive variable is one third of the lease TTL and marks when the next renewal should fire; sendKeepAliveLoop, further down, loops every 500ms and checks whether a renewal is due. So in our scenario a 10s TTL means a renewal roughly every 3 seconds. With 30,000 clients that is on the order of 10,000 requests per second, while our earlier load tests (and the official benchmarks) show the cluster can sustain 40,000+ qps, so that alone shouldn't bring it down. That remained a puzzle. Still, the analysis so far says that mass renewals cause mass TCP connection creation, which causes mass allocation on the leader and heavy GC CPU, so it was worth optimizing anyway: we raised the TTL to 600s, i.e. 10 minutes, which should mean one renewal every 200 seconds.
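To make the arithmetic concrete, here is a tiny illustrative sketch (my own, not etcd code) of the renewal interval and the resulting cluster-wide request rate under the client count assumed above:

package main

import (
	"fmt"
	"time"
)

func main() {
	clients := 30000 // rough number of agents used in the estimate above

	for _, ttl := range []time.Duration{10 * time.Second, 600 * time.Second} {
		// clientv3 schedules the next keepalive at TTL/3 after each response.
		interval := ttl / 3
		// Each client sends one renewal per interval.
		qps := float64(clients) / interval.Seconds()
		fmt.Printf("TTL=%v -> renew every %v -> ~%.0f renewals/s cluster-wide\n",
			ttl, interval, qps)
	}
}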

Increasing the lease TTL to try to reduce cluster load:

We rolled the adjusted version out to 6,000+ hosts in the test environment and watched the effect. Sure enough, memory dropped clearly after the rollout and CPU did too, but about an hour later both metrics climbed right back up...

The problem persists:

Why? Was that hour of lower load just an illusion created by the rollout itself? Did the lease change help at all? I had tested the change before rolling it out, and there was even a test report ("adjusting the agent's etcd lease TTL"): renewals were happening every 200 seconds. So what is going on? To confirm, I captured packets on one of the members again.

The capture indeed shows a large number of lease packets. So what is the interval between renewals for each lease id? Is it the expected 200 seconds? Or are all these packets for the same lease id? How many distinct lease ids are there?

Analyzing the tcpdump pcapng file:

The earlier source reading showed that messages between etcd nodes are protobuf-encoded, so we can take the HTTP body bytes and decode them with the corresponding protobuf struct. But between 2018-07-05 16:38:31.015981 and 2018-07-05 16:43:20.420401, roughly five minutes, there are 1.03 million packets; copying bodies out by hand is hopeless. So I wrote a small program on top of the pcap package to walk the whole captured pcapng file, filter the messages typed application/protobuf, take the HTTP body, decode it as a pb.LeaseKeepAliveRequest, and count the total number of lease ids versus the number of unique ones.
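The analysis program itself isn't included in the post; the sketch below shows the general idea using the gopacket library (the file name, the simplistic filter on POST /leases requests and the lack of TCP stream reassembly are my own simplifications, so treat it as a starting point rather than the exact tool used):

package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/golang/protobuf/proto"
	"github.com/google/gopacket"
	"github.com/google/gopacket/layers"
	"github.com/google/gopacket/pcap"

	pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
)

func main() {
	handle, err := pcap.OpenOffline("member_2380.pcapng") // captured file (assumed name)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	total := 0
	unique := map[int64]int{}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for packet := range src.Packets() {
		tcpLayer := packet.Layer(layers.LayerTypeTCP)
		if tcpLayer == nil {
			continue
		}
		payload := tcpLayer.(*layers.TCP).LayerPayload()
		// Only look at requests POSTed to the peer /leases endpoint.
		if !bytes.HasPrefix(payload, []byte("POST /leases")) {
			continue
		}
		// Naive split of headers and body; assumes the whole request fits
		// in one segment, which holds for these tiny protobuf bodies.
		idx := bytes.Index(payload, []byte("\r\n\r\n"))
		if idx < 0 {
			continue
		}
		req := &pb.LeaseKeepAliveRequest{}
		if err := proto.Unmarshal(payload[idx+4:], req); err != nil {
			continue
		}
		total++
		unique[req.ID]++
	}
	fmt.Printf("lease id total: %d, unique lease ids: %d\n", total, len(unique))
}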

The result: 77,550 lease ids in total, but only 1,382 unique lease ids.

In other words, the same lease ids keep sending renewal requests over and over. I then wrote another small program to take those lease ids and query their remaining TTL on the etcd cluster, expecting values somewhere between 400 and 600 seconds; but the answer was always 599 seconds. Taking one lease id as an example and fetching its TTL every second, the TTL was 599 every single time, with grantTTL 600, which means the etcd members keep renewing these lease ids non-stop. Since our earlier conclusion was that a member only renews in response to a client's lease keepalive request, the focus has to move to the client.
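Spot-checking the remaining TTL of a lease id can be done with clientv3's TimeToLive call, roughly as below (endpoint and lease id are placeholders, TLS configuration omitted); TTL is the remaining time and GrantedTTL the originally granted one:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // placeholder endpoint; add TLS config for an HTTPS cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	leaseID := clientv3.LeaseID(0x51f2c47d0e2c1a09) // placeholder lease id

	for i := 0; i < 5; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		resp, err := cli.TimeToLive(ctx, leaseID)
		cancel()
		if err != nil {
			log.Fatal(err)
		}
		// A TTL that never drops much below GrantedTTL means something keeps renewing it.
		fmt.Printf("lease %x: TTL=%ds, GrantedTTL=%ds\n", resp.ID, resp.TTL, resp.GrantedTTL)
		time.Sleep(time.Second)
	}
}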

I picked one client at random and captured its traffic: it was indeed sending /etcdserverpb.Lease/LeaseKeepAlive requests constantly. Very strange, and completely different from my earlier test, where renewal requests went out every 200 seconds. Why is it now renewing every second?

Back to the etcd clientv3 code:

With such an odd symptom, the only way forward is the source again, focused on the /etcdserverpb.Lease/LeaseKeepAlive path. The loop that decides whether a lease needs renewing starts in sendKeepAliveLoop:

// sendKeepAliveLoop sends keep alive requests for the lifetime of the given stream.
func (l *lessor) sendKeepAliveLoop(stream pb.Lease_LeaseKeepAliveClient) {
	for {
		var tosend []LeaseID

		now := time.Now()
		l.mu.Lock()
		for id, ka := range l.keepAlives {
			if ka.nextKeepAlive.Before(now) {
				tosend = append(tosend, id)
			}
		}
		l.mu.Unlock()

		for _, id := range tosend {
			r := &pb.LeaseKeepAliveRequest{ID: int64(id)}
			if err := stream.Send(r); err != nil {
				// TODO do something with this error?
				return
			}
		}

		select {
		case <-time.After(500 * time.Millisecond):
		case <-stream.Context().Done():
			return
		case <-l.donec:
			return
		case <-l.stopCtx.Done():
			return
		}
	}
}

As shown above, every 500 milliseconds the loop walks the l.keepAlives map and checks whether each LeaseID's nextKeepAlive has passed; that is how the client decides whether to send a LeaseKeepAliveRequest. nextKeepAlive is assigned inside the recvKeepAlive function of clientv3/lease.go, at line 473:

// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
	}
}

Yes, we looked at this code earlier, but back then only the TTL arithmetic mattered. Look more carefully at ka.nextKeepAlive = nextKeepAlive: the next renewal time is only assigned inside the select, and only when the write into ch succeeds. Otherwise it is not assigned... otherwise, it is not assigned... otherwise, it is not assigned...

So how long is that channel? Could writes be outpacing reads, so that the write fails and the next renewal time never gets updated? Digging further, in the KeepAlive function at line 222 of clientv3/lease.go:

func (l *lessor) KeepAlive(ctx context.Context, id LeaseID) (<-chan *LeaseKeepAliveResponse, error) {
	ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize)

	l.mu.Lock()
	// ensure that recvKeepAliveLoop is still running
	select {
	case <-l.donec:
		err := l.loopErr
		l.mu.Unlock()
		close(ch)
		return ch, ErrKeepAliveHalted{Reason: err}
	default:
	}
	ka, ok := l.keepAlives[id]
	if !ok {
		// create fresh keep alive
		ka = &keepAlive{
			chs:           []chan<- *LeaseKeepAliveResponse{ch},
			ctxs:          []context.Context{ctx},
			deadline:      time.Now().Add(l.firstKeepAliveTimeout),
			nextKeepAlive: time.Now(),
			donec:         make(chan struct{}),
		}
		l.keepAlives[id] = ka
	}
	// ... (rest of the function omitted)

ch := make(chan *LeaseKeepAliveResponse, leaseResponseChSize) allocates the channel; leaseResponseChSize is a constant with value 16. chs is a slice of length 1, holding just this one channel. Next step: find where this ch is read. But after reading the whole of lease.go I could not find any code that reads from this channel, apart from a few close() calls during cleanup. So here is the question: could the ch channel fill up, so that further writes fail, so that ka.nextKeepAlive = nextKeepAlive is never reached, so that ka.nextKeepAlive stays stuck at a time in the past, causing the client to send renewal after renewal?
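Worth noting: the channel that nobody reads here is exactly the one KeepAlive returns to its caller, and the registration code at the top of this post discards that return value (_, err = ...KeepAlive(...)). Independently of the clientv3 bug, a caller-side mitigation is to hold on to the channel and keep draining it, roughly like this (a sketch built on the registration snippet above):

kaCh, err := this.GetEtcdClient().KeepAlive(context.TODO(), leaseResp.ID)
if err != nil {
	return err
}
go func() {
	// Draining the responses keeps the 16-slot channel from filling up;
	// each response carries the renewed TTL and can be logged or ignored.
	for range kaCh {
	}
	// The channel is closed when the keepalive stream stops or the lease expires.
}()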

The suspect: two lease keepalive requests per second:

Time to test it. In the default branch of the select that writes into ch, I logged a "failed to write" message, and after the for loop I printed the keepAlive's TTL details for cross-checking: log.Println(fmt.Sprintf("karesp.TTL:%d, nextKeepAlive:%s, ka.nextKeepAlive:%s, ka.deadline:%s", karesp.TTL, nextKeepAlive.String(), ka.nextKeepAlive.String(), ka.deadline.String()))
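For reference, the instrumentation described above amounts to something like this (my reconstruction of the temporary debug patch, not the exact diff):

for _, ch := range ka.chs {
	select {
	case ch <- karesp:
		ka.nextKeepAlive = nextKeepAlive
	default:
		log.Println("failed to write") // channel is full, the response is dropped
	}
}
log.Println(fmt.Sprintf("karesp.TTL:%d, nextKeepAlive:%s, ka.nextKeepAlive:%s, ka.deadline:%s",
	karesp.TTL, nextKeepAlive.String(), ka.nextKeepAlive.String(), ka.deadline.String()))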

After the change I rebuilt the code, deployed it to one client, turned off DEBUG mode and watched the log output (file /xxxx/xxx/core/nohup.out.1). The program started at 2018/07/05 21:10:42, and after a bit more than an hour, at 2018/07/05 22:04:05, the lease renewal requests did indeed go from one every 200 seconds to two per second. Querying the corresponding lease id on the etcd cluster showed the TTL constantly at 599 with grantTTL 600, confirming that the client keeps re-sending renewal requests. The first version of the patch didn't print ka.nextKeepAlive; after adjusting it, and to shorten the wait, I also changed the 600s TTL to 60s. The resulting log:

2018/07/05 22:40:02 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:40:02 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508,ka.nextKeepAlive:2018-07-05 22:40:22.811626819 +0800 CST m=+40.135091508, ka.deadline:2018-07-05 22:41:02.811627601 +0800 CST m=+80.135092217

2018/07/05 22:40:22 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:40:22 karesp.TTL:60, nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715,ka.nextKeepAlive:2018-07-05 22:40:42.922698966 +0800 CST m=+60.246163715, ka.deadline:2018-07-05 22:41:22.922699917 +0800 CST m=+100.246164470

2018/07/05 22:40:43 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:40:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932,ka.nextKeepAlive:2018-07-05 22:41:03.330494852 +0800 CST m=+80.653959932, ka.deadline:2018-07-05 22:41:43.330496602 +0800 CST m=+120.653961489

2018/07/05 22:41:03 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:41:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164,ka.nextKeepAlive:2018-07-05 22:41:23.341112484 +0800 CST m=+100.664577164, ka.deadline:2018-07-05 22:42:03.341113662 +0800 CST m=+140.664578212

2018/07/05 22:41:23 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:41:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313,ka.nextKeepAlive:2018-07-05 22:41:43.350264635 +0800 CST m=+120.673729313, ka.deadline:2018-07-05 22:42:23.350265498 +0800 CST m=+160.673730050

2018/07/05 22:41:43 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:41:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032,ka.nextKeepAlive:2018-07-05 22:42:03.572888276 +0800 CST m=+140.896353032, ka.deadline:2018-07-05 22:42:43.572889271 +0800 CST m=+180.896353840

2018/07/05 22:42:03 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:42:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488,ka.nextKeepAlive:2018-07-05 22:42:23.86879775 +0800 CST m=+161.192262488, ka.deadline:2018-07-05 22:43:03.868798687 +0800 CST m=+201.192263243

2018/07/05 22:42:23 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:42:23 karesp.TTL:60, nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020,ka.nextKeepAlive:2018-07-05 22:42:43.877613966 +0800 CST m=+181.201079020, ka.deadline:2018-07-05 22:43:23.877615637 +0800 CST m=+221.201080417

2018/07/05 22:42:43 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:42:43 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122,ka.nextKeepAlive:2018-07-05 22:43:03.884576518 +0800 CST m=+201.208041122, ka.deadline:2018-07-05 22:43:43.884577361 +0800 CST m=+241.208041904

2018/07/05 22:43:03 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:43:03 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996,ka.nextKeepAlive:2018-07-05 22:43:23.893274193 +0800 CST m=+221.216738996, ka.deadline:2018-07-05 22:44:03.893275148 +0800 CST m=+261.216739702

2018/07/05 22:43:23 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:43:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998,ka.nextKeepAlive:2018-07-05 22:43:44.058766258 +0800 CST m=+241.382230998, ka.deadline:2018-07-05 22:44:24.058767168 +0800 CST m=+281.382231711

2018/07/05 22:43:44 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:43:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737,ka.nextKeepAlive:2018-07-05 22:44:04.413012991 +0800 CST m=+261.736477737, ka.deadline:2018-07-05 22:44:44.41301386 +0800 CST m=+301.736478391

2018/07/05 22:44:04 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:44:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106,ka.nextKeepAlive:2018-07-05 22:44:24.420796492 +0800 CST m=+281.744261106, ka.deadline:2018-07-05 22:45:04.420797607 +0800 CST m=+321.744262161

2018/07/05 22:44:24 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:44:24 karesp.TTL:60, nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589,ka.nextKeepAlive:2018-07-05 22:44:44.428145652 +0800 CST m=+301.751610589, ka.deadline:2018-07-05 22:45:24.428147239 +0800 CST m=+341.751612018

2018/07/05 22:44:44 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:44:44 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:45:44.436152168 +0800 CST m=+361.759616723

2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:45:04 failed to write

2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.441563525 +0800 CST m=+341.765028110,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.44156426 +0800 CST m=+381.765028789

2018/07/05 22:45:04 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:45:04 failed to write

2018/07/05 22:45:04 karesp.TTL:60, nextKeepAlive:2018-07-05 22:45:24.941897369 +0800 CST m=+342.265361977,ka.nextKeepAlive:2018-07-05 22:45:04.436151288 +0800 CST m=+321.759615961, ka.deadline:2018-07-05 22:46:04.941898136 +0800 CST m=+382.265362694

2018/07/05 22:45:05 lease Id KeepAliveRequest:5904610818195269641

2018/07/05 22:45:05 failed to write

After 16 renewals, i.e. on the 17th, the write fails and from then on the renewal requests come twice per second... which confirms the bug.

In other words: no matter how long the granted lease TTL is, renewals eventually end up being sent every 0.5 seconds!!!

In other words: no matter how long the granted lease TTL is, renewals eventually end up being sent every 0.5 seconds!!!

In other words: no matter how long the granted lease TTL is, renewals eventually end up being sent every 0.5 seconds!!!

So why did CPU drop for roughly an hour after the earlier build 134 was gradually rolled out?

Explaining the one-hour dip in CPU:

Do the math: with a TTL of 600s the renewal interval is 200s; ka.chs has length 1, and that single channel has a capacity of 16 (leaseResponseChSize). Since nothing ever reads from it, the first 16 renewals each manage to write their response into the channel and update ka.nextKeepAlive; the 17th write fails, ka.nextKeepAlive stops advancing, and the loop degrades to one renewal every 500ms. 16 renewals x 200s = 3200s, roughly 53 minutes.

Counting from my instrumented test run gives the same answer: 2018/07/05 22:04:05 minus 2018/07/05 21:10:42 is about 53 minutes. In other words, just under an hour, which matches the Falcon monitoring data...

The real culprit, and the fix:

With the problem identified, the fix: I couldn't find where the ch channel is consumed inside lease.go, and maybe I'm missing the author's intent, but in my view the next renewal time must be advanced regardless of whether the write into ch succeeds. Moving ka.nextKeepAlive = nextKeepAlive outside the select does exactly that.
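A minimal sketch of the change described above, applied to the recvKeepAlive snippet quoted earlier (this is the workaround as I describe it, not necessarily identical to the fix that later landed upstream):

// send update to all channels
nextKeepAlive := time.Now().Add((time.Duration(karesp.TTL) * time.Second) / 3.0)
ka.deadline = time.Now().Add(time.Duration(karesp.TTL) * time.Second)
// Advance the next renewal time unconditionally, even when a receiver's
// channel is full and the response itself has to be dropped.
ka.nextKeepAlive = nextKeepAlive
for _, ch := range ka.chs {
	select {
	case ch <- karesp:
	default:
		// receiver is not draining its channel; drop this response
	}
}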

That said, our product had to ship, so to play it safe we sacrificed the lease-based feature and removed it entirely. After deploying, the load improved dramatically: CPU dropped from 40-50% to roughly 2-5%.

Retrospective:

Looking back at every symptom the etcd cluster showed, they all make sense once this bug is in the picture. In the order in which things happened:

The lease keepalive code in etcd clientv3 has a bug: no matter what the TTL is, after 16 renewals the client ends up renewing every 0.5 seconds.

The agent was deployed to a large number of servers (20,000), each agent holding one lease id.

Each agent (a client, from etcd's point of view) connects to an etcd member and sends two lease renewal requests per second to it.

For lease renewal, etcd members talk to the leader over HTTP/1.1; a connection cannot be reused until the previous request has been answered, so nearly every forwarded renewal creates a brand new TCP connection.

The etcd leader therefore receives roughly 20,000 x 2 TCP connection requests per second.

Early on, the etcd nodes talked HTTPS, so every TCP handshake is followed by a TLS handshake; the RSA math behind certificate verification burns a lot of CPU, which is why the leader's CPU spiked.

The leader's gRPC-based HTTP server allocates objects, structs and channels for every incoming connection, consuming ever more memory.

The lease keepalive requests forwarded by the members are put into a slice on the leader (this part is actually done by gRPC).

The leader writes recvMsg into b.backlog faster than it consumes it, so the slice keeps growing, memory keeps rising, and it looks like a memory leak.

The leader processes requests more slowly than they arrive; the backlog exceeds the lease TTL, so leases are judged expired and revoked, which explains why the hids server's keys kept disappearing mysteriously.

The members keep opening new TCP connections to the leader and closing them after each exchange, leaving a huge number of TIME-WAIT connections on the leader's side.

The leader handles the members' lease requests too slowly; after the 10s TLS handshake timeout the member gives up and closes the connection, but the leader still cannot keep up, so its side of the connection lingers in CLOSE-WAIT.

A bug that haunted me for over a month; I won't forget this moment.

Some thoughts on etcd overload protection:

Would it be better to schedule the lease keepalive at the 2/3 point of the TTL instead?

Should gRPC have some overload protection of its own, for example replacing the unbounded backlog slice with a fixed-length queue and dropping requests beyond the limit, to keep the cluster alive?

Would HTTP/2 be a better fit for communication between etcd nodes?

Disclosure

Does this affect your etcd setup:

You use etcd clientv3 (the Go client library); all v3 client versions are affected (3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x).

You use leases and enable keepalive.

The effect only really shows with a large number of clients; with a few hundred, the performance impact is small.

Fixed upstream

The etcd maintainers fixed the bug on July 24, 2018, across three etcd client v3 release lines, and they were impressively quick about it. From the changelog:

Fix lease keepalive interval updates when response queue is full.

If the <-chan *clientv3.LeaseKeepAliveResponse from clientv3.Lease.KeepAlive was never consumed or the channel is full, the client was sending a keepalive request every 500ms instead of the expected rate of every "TTL / 3" duration.

CoreOS etcd QQ discussion group

QQ group "Coreos Etcd分布式集", group number 177508087. Everyone is welcome to join and learn etcd together.
