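I recently ran into a tricky problem in a customer's test environment. The customer's CDH test cluster had filled its disks because they were originally provisioned too small. After the logical volumes were extended, CDH still did not recover, and rebooting the test VMs did not help either.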
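1. At first, the EventServer role in the Cloudera Management Service failed to start, with the error shown in the screenshot below.
Fix:
Delete /var/lib/cloudera-scm-eventserver/* and restart the EventServer. Solved!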
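A more cautious variant of the fix above is to move the EventServer data aside instead of deleting it, so it can be restored if something else breaks. The following is only a sketch: the helper name `archive_dir_contents` and the backup location are hypothetical, and it assumes the EventServer role has already been stopped in Cloudera Manager and that the script runs with sufficient privileges.

```shell
#!/usr/bin/env bash
# Sketch: move the contents of a data directory aside rather than deleting them.
# Assumption: the EventServer role is already stopped in Cloudera Manager.

# archive_dir_contents SRC_DIR BACKUP_ROOT   (hypothetical helper)
# Moves everything inside SRC_DIR into a timestamped directory under
# BACKUP_ROOT and prints the path of that directory.
archive_dir_contents() {
    local src="$1" backup_root="$2"
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    local dest
    dest="$backup_root/$(basename "$src").$(date +%Y%m%d%H%M%S)"
    mkdir -p "$dest" || return 1
    # -mindepth 1 skips SRC_DIR itself; dotfiles are moved as well.
    find "$src" -mindepth 1 -maxdepth 1 -exec mv {} "$dest"/ \;
    echo "$dest"
}

# Example (path as in the post; /var/tmp is an arbitrary backup location):
#   archive_dir_contents /var/lib/cloudera-scm-eventserver /var/tmp
```

If the backup location is on the same filesystem as the source, the move is a rename and needs no extra space, which matters on a disk that has just been full.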
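2. With issue 1 fixed, we found that the ZooKeeper service was stuck in a "stopping" state that never completed, as shown in the screenshot below.
Troubleshooting steps:
- Restarted the Cloudera Manager Server and Agent services: no effect.
- Rebooted the VM: no effect.
- Deleted /var/lib/zookeeper/* and restarted ZooKeeper: no effect. The CM UI still showed ZooKeeper as stopping, so the service could not be restarted.
- Checked the Cloudera Manager Agent log, which showed the following error: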
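The log reports that /run/cloudera-scm-agent/process/192-zookeeper-server/proc.json cannot be found. At this point the ZooKeeper process was confirmed to be gone, so it was unclear where the CM UI was still picking up stale ZooKeeper state.
- Went into the metadata database (PostgreSQL in this case) to look for leftover ZooKeeper records.
Log in to PostgreSQL with psql --user=scm --port=7432 --host=localhost, then inspect and clean up the ZooKeeper-related records. The main commands were: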
select process_id from processes where name='zookeeper-server';
delete from process_active_releases where process_id in (select process_id from processes where name='zookeeper-server');
delete from processes_detail where process_id in (select process_id from processes where name='zookeeper-server');
delete from processes where name='zookeeper-server';
select service_id from services where name='zookeeper';
delete from commands_detail where command_id in (select command_id from commands where service_id in (select service_id from services where name='zookeeper'));
delete from commands where service_id in (select service_id from services where name='zookeeper');
delete from configs where service_id in (select service_id from services where name='zookeeper');
delete from role_staleness_status where role_id in (select role_id from roles where service_id in (select service_id from services where name='zookeeper'));
delete from roles where service_id in (select service_id from services where name='zookeeper');
delete from role_config_groups where service_id in (select service_id from services where name='zookeeper');
delete from services where name='zookeeper';
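The goal of the commands above is simply to delete the leftover ZooKeeper records. Since ZooKeeper no longer existed at this point, its rows could be removed from the metadata tables (brute force, admittedly).
- The CM UI no longer showed the ZooKeeper component, and re-adding ZooKeeper succeeded. Problem fixed.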
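Because the manual deletes above cannot be undone, it may be worth backing up the database and rehearsing the cleanup inside a transaction first. The sketch below is one way to do that, not the procedure used in this incident: `wrap_in_transaction` and the file `cleanup.sql` (holding the delete statements) are hypothetical, and the database name `scm` is an assumption. It is probably also wise to stop the Cloudera Manager server while editing its database.

```shell
#!/usr/bin/env bash
# Sketch: run the cleanup statements inside one transaction, rolling back by default.

# wrap_in_transaction SQL_FILE [commit|rollback]   (hypothetical helper)
# Prints the statements from SQL_FILE wrapped in BEGIN ... COMMIT/ROLLBACK.
# Defaults to ROLLBACK so that a first run only rehearses the deletes.
wrap_in_transaction() {
    local sql_file="$1" mode="${2:-rollback}"
    [ -r "$sql_file" ] || { echo "cannot read: $sql_file" >&2; return 1; }
    echo "BEGIN;"
    cat "$sql_file"
    # Guard against a file that does not end with a newline.
    printf '\n'
    case "$mode" in
        commit) echo "COMMIT;" ;;
        *)      echo "ROLLBACK;" ;;
    esac
}

# Example usage (connection settings as in the post; database name "scm" is assumed):
#   pg_dump --host=localhost --port=7432 --username=scm --file=scm-backup.sql scm
#   wrap_in_transaction cleanup.sql rollback | psql --username=scm --port=7432 --host=localhost -v ON_ERROR_STOP=1 scm
#   wrap_in_transaction cleanup.sql commit   | psql --username=scm --port=7432 --host=localhost -v ON_ERROR_STOP=1 scm
```

With `ON_ERROR_STOP` set, psql exits at the first failing statement, and the open transaction is discarded when the connection closes, so a half-applied cleanup is avoided.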
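Although the problem is solved this time, the details are still not entirely clear. My suspicion is that the disk expansion left the PostgreSQL metadata in an inconsistent state that had to be cleaned up by hand.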