Kudu fails to start after an abnormal server disconnection (Data length checksum does not match: Incorrect checksum in file ... : Checksum does not match)

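One day a server disconnected abnormally and could not be logged in to. After it recovered, the agent service restarted automatically and the roles on the server began to start and recover, but Kudu did not come back. A manual restart also failed. The error showed that Kudu had been writing data when the server disconnected, which left Kudu data files corrupted. The error output was: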

++ date
+ timestamp='Wed Oct 13 10:57:02 CST 2021'
+ echo 'Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13'
+ echo 'Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13'
Wed Oct 13 10:57:02 CST 2021: Found master(s) on hadoop11,hadoop12,hadoop13
+ '[' false == true ']'
+ KUDU_ARGS=
+ '[' false == true ']'
+ '[' tserver = master ']'
+ '[' tserver = tserver ']'
+ KUDU_ARGS=' --tserver_master_addrs=hadoop11,hadoop12,hadoop13'
+ exec /opt/cloudera/parcels/CDH-5.15.0-1.cdh5.15.0.p0.21/lib/kudu/sbin/kudu-tserver --tserver_master_addrs=hadoop11,hadoop12,hadoop13 --flagfile=/run/cloudera-scm-agent/process/18986-kudu-KUDU_TSERVER/gflagfile
F1013 10:57:03.360226 62788 tablet_server_main.cc:80] Check failed: _s.ok() Bad status: Corruption: Failed to load FS layout: Could not process records in container /mnt/sdf/kudu/tserver/data/e68f1a45d9f144d9a5242a2067cb8d37: Data length checksum does not match: Incorrect checksum in file /mnt/sdf/kudu/tserver/data/e68f1a45d9f144d9a5242a2067cb8d37.metadata at offset 2697092: Checksum does not match. Expected: 0. Actual: 1214729159
*** Check failure stack trace: ***
Wrote minidump to /var/log/kudu/minidumps/kudu-tserver/498615d9-40ee-493b-14fc78f5-777a0a19.dmp
*** Aborted at 1634093823 (unix time) try "date -d @1634093823" if you are using GNU date ***
PC: @     0x7f7760f4e1d7 __GI_raise
*** SIGABRT (@0x3cf0000f544) received by PID 62788 (TID 0x7f776350b9c0) from PID 62788; stack trace: ***
    @     0x7f7762ed5370 (unknown)
    @     0x7f7760f4e1d7 __GI_raise
    @     0x7f7760f4f8c8 __GI_abort
    @          0x1b49fe9 (unknown)
    @           0x8e27ad google::LogMessage::Fail()
    @           0x8e4703 google::LogMessage::SendToLog()
    @           0x8e2309 google::LogMessage::Flush()
    @           0x8e508f google::LogMessageFatal::~LogMessageFatal()
    @           0x883d86 (unknown)
    @     0x7f7760f3ab35 __libc_start_main
    @           0x883755 (unknown)
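The fatal line above names the metadata file that failed the checksum. As a minimal sketch, the path can be pulled out of the log with sed. The function name `corrupt_metadata_file` is hypothetical, and the pattern assumes the message format shown in this log (CDH 5.15); other Kudu versions may word it differently.

```shell
# Read tserver log text on stdin and print the path of each file
# reported with "Incorrect checksum in file <path> at offset <n>:".
# Assumes the path contains no spaces, as in the log above.
corrupt_metadata_file() {
    sed -n 's/.*Incorrect checksum in file \([^ ]*\) at offset [0-9]*:.*/\1/p'
}

# Usage (the log file path is a placeholder, not taken from the article):
#   corrupt_metadata_file < /path/to/kudu-tserver.log
```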

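A Baidu search suggested this approach: find the modification time of the data file named in the error, delete the last line of every file modified at that moment, restart, and repeat for each file that fails next. In practice there were too many files to handle this way and it took too long.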

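# Print the modification time (HH:MM:SS) of the metadata file named in the error
sudo ls -l --full-time /mnt/sdh/kudu/tserver/data/29676e68406a4421b04d368798607062.metadata | awk '{print $7}' | cut -c 1-8

# Delete the last line of every .metadata file modified at that second
for i in $(sudo ls -l --full-time /mnt/sdh/kudu/tserver/data/ | grep "2021-10-13 10:37:19" | grep '\.metadata$' | awk '{print $9}'); do sudo sed -i '$d' "/mnt/sdh/kudu/tserver/data/$i"; done
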
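Because the loop above edits files in place, it can help to first list which files would be touched. The following is a read-only sketch of that selection step: it prints the `.metadata` files whose modification time matches a reference file to the second. The function name `list_same_mtime_metadata` is hypothetical, and the code assumes GNU `stat` (coreutils).

```shell
# Print every *.metadata file in directory $1 whose modification time
# (whole seconds since the epoch) equals that of reference file $2.
# Nothing is modified. Assumes GNU stat, which supports "-c %Y".
list_same_mtime_metadata() {
    dir=$1
    ref=$2
    ref_epoch=$(stat -c %Y "$ref") || return 1
    for f in "$dir"/*.metadata; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if [ "$(stat -c %Y "$f")" = "$ref_epoch" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage, with the paths from the commands above:
#   list_same_mtime_metadata /mnt/sdh/kudu/tserver/data \
#       /mnt/sdh/kudu/tserver/data/29676e68406a4421b04d368798607062.metadata
```

Comparing epoch seconds avoids parsing `ls` output, which breaks on unusual file names.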

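Final fix: only one tablet server's data files were affected, and the data is stored with three replicas, so I stopped the Kudu service on that server, deleted the server's Kudu role, discarded the old Kudu data files, and added the tablet server again.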

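From this server's Kudu configuration, record the data directory and WAL directory paths; they will be renamed as a backup in a later step: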

Kudu Tablet Server WAL Directory

Kudu Tablet Server Data Directories

In the CM (Cloudera Manager) UI, stop the Kudu Tablet Server on the faulty server, then delete that server's Kudu role.

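Log in to the faulty server and rename the configured data and WAL directories to keep them as a backup (without the rename, re-adding the tablet server on this host later fails with an error).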
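The rename step could look like the sketch below. The function name `backup_kudu_dir` and the `.bak.<suffix>` naming are illustrative choices, not something prescribed by Kudu or Cloudera Manager. Run it as a user with write permission on the parent directory.

```shell
# Rename directory $1 to "$1.bak.<suffix>" and print the new path.
# The suffix defaults to the current timestamp; pass $2 to override it.
# Refuses to run if the source is missing or the target already exists.
backup_kudu_dir() {
    src=${1%/}                                  # drop a trailing slash
    suffix=${2:-$(date +%Y%m%d%H%M%S)}
    if [ ! -d "$src" ]; then
        echo "skip: $src is not a directory" >&2
        return 1
    fi
    dest="$src.bak.$suffix"
    if [ -e "$dest" ]; then
        echo "refusing to overwrite $dest" >&2
        return 1
    fi
    mv -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage: call it once for each configured data directory and once for
# the WAL directory, for example the data directory from the error log:
#   backup_kudu_dir /mnt/sdf/kudu/tserver/data
```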

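Running ksck from the command line showed that the cluster still listed the removed tablet server and kept trying to connect to it.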

sudo -u kudu kudu cluster ksck hadoopap11  | head -n 10

Connected to the Master
WARNING: Unable to connect to Tablet Server a8c0534fc01d4c3bae02faec3d3fddd4 (hadoop 32:7050): Network error: could not send Ping RPC to server: Client connection negotiation failed: client connection to 172.18.8.52:7050: connect: Connection refused (error 111)
WARNING: Fetched info from 18 Tablet Servers, 1 weren't reachable

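It turned out that the Kudu masters must be restarted, otherwise the stale connection information persists (at the time I simply did a rolling restart of all Kudu roles).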

Running ksck again showed everything was normal, so I then re-added the Kudu Tablet Server role on this server.
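To compare ksck results before and after the master restart, the unreachable count can be extracted from its output. This is a sketch that matches only the warning line format shown earlier in this post; the function name `ksck_unreachable_count` is hypothetical, and other Kudu versions may word the line differently. When the warning line is absent, the function prints nothing, so an empty result should be checked against the full ksck output rather than read as proof of health.

```shell
# Read ksck output on stdin and print N from the line
# "Fetched info from M Tablet Servers, N weren't reachable".
# Prints nothing if no such line is present.
ksck_unreachable_count() {
    sed -n "s/.*Fetched info from [0-9][0-9]* Tablet Servers, \([0-9][0-9]*\) weren't reachable.*/\1/p" | head -n 1
}

# Usage (replace <master-address> with your own master host):
#   sudo -u kudu kudu cluster ksck <master-address> 2>&1 | ksck_unreachable_count
```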
