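Problem description
In a Debian 11 virtual machine, I used a script that calls etcdctl to continuously update the value of a specific key, in order to test the stability of a data-exchange program. After about 3 hours of updates, the program reported the following errors when accessing etcd: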
2022/10/18 14:19:13 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2022/10/18 14:19:14 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
2022/10/18 14:19:16 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
2022/10/18 14:19:16 grpc: Conn.transportMonitor exits due to: grpc: timed out trying to connect
2022/10/18 14:19:17 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
2022/10/18 14:19:19 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
2022/10/18 14:19:20 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
2022/10/18 14:19:21 grpc: Conn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:2379: connect: connection refused"; Reconnecting to "127.0.0.1:2379"
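The errors show that the connection to etcd failed; finding the specific cause requires further investigation.
Troubleshooting
Check the etcd service status
Running systemctl status etcd showed the following service status and log output: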
● etcd.service - etcd - highly-available key value store
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Tue 2022-10-18 14:19:13 EDT; 7h ago
Docs: https://etcd.io/docs
man:etcd
Process: 612 ExecStart=/usr/bin/etcd $DAEMON_ARGS (code=exited, status=1/FAILURE)
Main PID: 612 (code=exited, status=1/FAILURE)
CPU: 50min 14.795s
Oct 18 14:18:51 debian etcd[612]: segmented wal file /var/lib/etcd/default/member/wal/0000000000000009-00000000000ef088.wal is crea>
Oct 18 14:19:01 debian etcd[612]: purged file /var/lib/etcd/default/member/wal/0000000000000004-000000000006a724.wal successfully
Oct 18 14:19:06 debian etcd[612]: read-only range request "key:\"/dynamic_response/>
Oct 18 14:19:06 debian etcd[612]: WARNING: 2022/10/18 14:19:06 grpc: Server.processUnaryRPC failed to write status: connection erro>
Oct 18 14:19:11 debian etcd[612]: read-only range request "key:\"/register>
Oct 18 14:19:11 debian etcd[612]: WARNING: 2022/10/18 14:19:11 grpc: Server.processUnaryRPC failed to write status: connection erro>
Oct 18 14:19:13 debian etcd[612]: cannot commit tx (write /var/lib/etcd/default/member/snap/db: no space left on device)
Oct 18 14:19:13 debian systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Oct 18 14:19:13 debian systemd[1]: etcd.service: Failed with result 'exit-code'.
Oct 18 14:19:13 debian systemd[1]: etcd.service: Consumed 50min 14.795s CPU time.
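The log line "Oct 18 14:19:13 debian etcd[612]: cannot commit tx (write /var/lib/etcd/default/member/snap/db: no space left on device)" shows that the disk ran out of space and etcd could no longer write to its database file. Checking disk usage confirmed that the partition holding /var/ was completely full.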
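The disk check mentioned above can be sketched roughly as follows. This is a minimal example, not the exact commands used at the time: the helper name usage_pct is made up for illustration, and the data directory /var/lib/etcd is inferred from the path in the log.

```shell
#!/bin/sh
# Print the used percentage (digits only) of the filesystem that holds a path.
# usage_pct is a hypothetical helper name, not part of any tool.
usage_pct() {
    # df -P prints one POSIX-format line per filesystem; field 5 is e.g. "97%".
    # Assumes the filesystem name contains no spaces.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# /var is where the failing write in the log above landed.
echo "/var is $(usage_pct /var)% full"

# Show how much space the etcd data directory itself occupies, if it exists.
if [ -d /var/lib/etcd ]; then
    du -sh /var/lib/etcd
fi
```

A value at or near 100 for /var, together with a large /var/lib/etcd, would match the "no space left on device" error.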
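Options for freeing up disk space
- Expand the disk, then restart the etcd service
- Delete the local etcd database, then restart the etcd service
Expanding the disk is a somewhat involved process, so I chose to delete the local etcd database to restore service. Note that this discards all data stored in etcd. Example commands: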
root@debian:/lib/systemd# rm -rf /var/lib/etcd/default/member/
root@debian:/lib/systemd# systemctl start etcd
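Why did this happen?
The immediate cause is that the local disk is small. The etcd database kept growing until it filled the disk, etcd exited abnormally once it could no longer write data, and from then on it could not serve requests, so clients connecting to etcd got connection refused and timeout errors.
The root cause is that the key used for message passing was never deleted and was updated continuously. etcd records every revision of the key in its database, so sustained, frequent communication made the database grow steadily until the disk filled up and etcd exited abnormally.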
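One way to observe this growth directly is to sample the size of the database file while the update script is running. The sketch below is an illustration only: the file path is the one shown in the etcd log, while the helper names file_size_bytes and growth_bytes and the 60-second interval are assumptions chosen for the example.

```shell
#!/bin/sh
# Path taken from the etcd log above; adjust for other installations.
DB_FILE=${DB_FILE:-/var/lib/etcd/default/member/snap/db}

# Print the size of a file in bytes.
file_size_bytes() {
    wc -c < "$1" | tr -d ' '
}

# Print how many bytes a file grew during an interval given in seconds.
growth_bytes() {
    before=$(file_size_bytes "$1")
    sleep "$2"
    after=$(file_size_bytes "$1")
    echo $((after - before))
}

# Only sample when the database file is readable (this usually requires root).
if [ -r "$DB_FILE" ]; then
    echo "db grew by $(growth_bytes "$DB_FILE" 60) bytes in 60s"
fi
```

The file tends to grow in steps rather than smoothly, so a single sample may print 0; repeating the measurement over a longer period gives a clearer picture.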
Root-cause fix
Delete the etcd key used for communication as soon as the exchange completes, so that the etcd database does not keep growing.
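The delete-after-use pattern could look roughly like this with etcdctl. It is a sketch under assumptions, not the actual program: the function names send_message and consume_message and the key /demo/message are hypothetical, and ETCDCTL_API=3 is set because the etcdctl shipped with older etcd releases defaults to the v2 API. etcd also retains past revisions until they are compacted, which this article does not cover, so treat this only as an illustration of the approach described above.

```shell
#!/bin/sh
# Use the v3 API explicitly; harmless on newer releases where it is the default.
export ETCDCTL_API=3
# Allow the etcdctl binary to be overridden (assumption, handy for testing).
ETCDCTL=${ETCDCTL:-etcdctl}

# Sender side: write the message under the given key.
send_message() {
    "$ETCDCTL" put "$1" "$2" >/dev/null
}

# Receiver side: read the value, delete the key, then print the value.
consume_message() {
    value=$("$ETCDCTL" get "$1" --print-value-only) || return 1
    "$ETCDCTL" del "$1" >/dev/null || return 1
    printf '%s\n' "$value"
}

# Example usage with a hypothetical key (needs a running etcd):
#   send_message /demo/message hello
#   consume_message /demo/message
```

Deleting in the receiver, right after a successful read, keeps the key from lingering if the sender exits early.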