Error:
FATAL: the database system is in recovery mode
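Troubleshooting steps:
On the hawq master node:
1. Run hawq state; it reports that the database is down.
2. Check for the hawq master process: ps aux | grep postgresql. The master process is not running.
3. Check the current day's log under pg_log: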
2018-02-11 16:34:32.089297 CST,,,p636599,th589375776,,,,0,,,seg-10000,,,,,"LOG","00000","seqserver process (PID 499050) exited with exit code 2",,,,,,,0,,"postmaster.c",4726,
2018-02-11 16:34:32.089388 CST,,,p636599,th589375776,,,,0,,,seg-10000,,,,,"LOG","00000","walsendserver process (PID 499051) exited with exit code
From these entries we concluded that the master process had been killed by someone.
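Scanning the log by hand works, but the same check can be scripted. The following is a minimal sketch, not part of the original procedure: find_abnormal_exits is a hypothetical helper name, and the second pattern ("was terminated by signal") is the PostgreSQL-style message for a signalled child, assumed here to apply to hawq logs as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a hawq command): print log lines that record
# a child process ending abnormally.
# Usage: find_abnormal_exits <log file> [<log file> ...]
find_abnormal_exits() {
    # First pattern matches lines such as
    #   "seqserver process (PID 499050) exited with exit code 2"
    # Second pattern is the PostgreSQL-style "terminated by signal" message
    # (assumption: hawq emits it in the same form).
    grep -E \
        -e 'process \(PID [0-9]+\) exited with exit code [1-9]' \
        -e 'process \(PID [0-9]+\) was terminated by signal' \
        "$@"
}
```

It would be pointed at the current day's file under pg_log; the exact file name depends on the installation.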
4. Start the master manually:
source /usr/local/hawq/greenplum.sh
su gpadmin
hawq start master
This fails: a master pid is still recorded, so the system assumes the master process is already running. Force-stop the master manually:
hawq stop master -M immediate
hawq start master
This time the master starts successfully.
However, the segments have not registered with the master.
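The leftover master pid described in step 4 can be detected before attempting a start. Below is a minimal sketch under stated assumptions: pid_file_status is a hypothetical helper, and it assumes the master data directory contains a PostgreSQL-style postmaster.pid file whose first line is the PID. The location of that directory is installation-specific and is not given in this note.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a postmaster.pid-style file.
# Assumption: the first line of the file is the PID, as in PostgreSQL.
# Prints one of: no-pid-file, invalid, running, stale
pid_file_status() {
    local pid_file="$1"
    local pid

    if [ ! -f "$pid_file" ]; then
        echo "no-pid-file"
        return 0
    fi

    # Read the first line and strip whitespace.
    pid="$(head -n 1 "$pid_file" | tr -d '[:space:]')"
    case "$pid" in
        ''|*[!0-9]*)
            echo "invalid"
            return 0
            ;;
    esac

    # kill -0 sends no signal, it only tests whether the process exists.
    # The /proc check covers the case where the process exists but belongs
    # to another user (Linux only).
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo "running"
    else
        echo "stale"
    fi
}
```

A result of "stale" corresponds to the situation above, where the recorded pid no longer belongs to a live process and hawq stop master -M immediate was needed before the master could be started.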
5. Restart the whole cluster:
hawq restart cluster
Run hawq state again: everything is back to normal.
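For reference, the recovery commands from steps 4 and 5 can be collected into one script. This is only a sketch: run and recover_hawq_master are hypothetical names, the DRY_RUN switch is an addition for rehearsing the sequence safely, and it assumes the hawq environment has already been sourced and the script runs as gpadmin. Only the hawq commands themselves come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: with DRY_RUN=1 the command is printed, not executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Replays the recovery sequence from this note; stops at the first failure.
recover_hawq_master() {
    run hawq stop master -M immediate &&   # clear the leftover master state
    run hawq start master &&               # bring the master back
    run hawq restart cluster &&            # let the segments register again
    run hawq state                         # confirm the cluster is healthy
}
```

Setting DRY_RUN=1 before calling recover_hawq_master prints the four commands in order without touching the cluster. In a real run, hawq may ask for confirmation interactively.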
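Root cause of the whole problem:
The hawq master runs on the same node as the NameNode. A colleague from operations failed to start the NameNode and, without determining the real cause, forcibly killed the hawq master process.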