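On an Ambari-managed cluster, alert records accumulate over time. They are stored in the database, and if they are not cleaned up regularly the row count keeps growing and seriously degrades Ambari performance.
The two tables with the most impact are alert_current and alert_history. An incomplete cleanup of these tables can leave the Ambari alerts page unable to display or refresh alerts. The page looks like this:
[Screenshot: Ambari alerts page failing to load]
The ambari-server.log log shows a "java.lang.NullPointerException":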
2023-07-11 18:40:33.601 WARN [ambari-client-thread-142467] HttpChannel:776 - /api/v1/clusters/mycluster/alerts java.lang.NullPointerException
    at org.apache.ambari.server.controller.internal.AlertResourceProvider.toResource(AlertResourceProvider.java:260)
    at org.apache.ambari.server.controller.internal.AlertResourceProvider.getResources(AlertResourceProvider.java:240)
Locate the corresponding code:
private Resource toResource(boolean isCollection, String clusterName, AlertCurrentEntity entity, Set<String> requestedIds) {
    // Resolved from alert_current.history_id
    AlertHistoryEntity history = entity.getAlertHistory();
    // Throws NullPointerException if history is null
    AlertDefinitionEntity definition = history.getAlertDefinition();
    ...
}
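Walking through the code (a quicker way is to enable debug mode and observe directly) gives the relationships shown in the following figure:
[Figure: relationships among alert_current, alert_history and alert_definition]
In short, AlertCurrentEntity uses alert_current (history_id) to match alert_history (alert_id) and obtains an AlertHistoryEntity. It then uses alert_history (alert_definition_id) to match alert_definition (definition_id) and obtains an AlertDefinitionEntity.
The NullPointerException comes from AlertHistoryEntity history, which points to dirty data in the database. Checking the database showed that some alert_current (history_id) values had no matching alert_history (alert_id), so history was null and the subsequent lookup of the AlertDefinitionEntity failed.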
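To make the failure mode concrete, here is a minimal sketch of the same two-step lookup. The classes DefinitionRow, HistoryRow, CurrentRow and AlertLookupSketch are hypothetical stand-ins, not Ambari's real entities, and the map-based load method only imitates the join on alert_current (history_id) = alert_history (alert_id). The guarded variant is an illustration of null handling, not a patch taken from Ambari.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for a row of alert_definition.
class DefinitionRow {
    final long definitionId;
    DefinitionRow(long definitionId) { this.definitionId = definitionId; }
}

// Hypothetical stand-in for a row of alert_history.
class HistoryRow {
    final long alertId;
    final DefinitionRow definition; // resolved via alert_history.alert_definition_id
    HistoryRow(long alertId, DefinitionRow definition) {
        this.alertId = alertId;
        this.definition = definition;
    }
    DefinitionRow getAlertDefinition() { return definition; }
}

// Hypothetical stand-in for a row of alert_current.
class CurrentRow {
    final long historyId;
    final HistoryRow history; // null when no alert_history row matches historyId
    CurrentRow(long historyId, HistoryRow history) {
        this.historyId = historyId;
        this.history = history;
    }
    HistoryRow getAlertHistory() { return history; }
}

class AlertLookupSketch {
    // Imitates the join: look up the history row by alert_current.history_id.
    static CurrentRow load(long historyId, Map<Long, HistoryRow> historyTable) {
        return new CurrentRow(historyId, historyTable.get(historyId));
    }

    // Same shape as the two chained calls in toResource: fails on a dangling history_id.
    static DefinitionRow unguarded(CurrentRow current) {
        HistoryRow history = current.getAlertHistory();
        return history.getAlertDefinition(); // NullPointerException if history is null
    }

    // Null-safe variant: an empty Optional signals a dangling reference.
    static Optional<DefinitionRow> guarded(CurrentRow current) {
        return Optional.ofNullable(current.getAlertHistory())
                       .map(HistoryRow::getAlertDefinition);
    }

    public static void main(String[] args) {
        Map<Long, HistoryRow> historyTable = new HashMap<>();
        historyTable.put(1L, new HistoryRow(1L, new DefinitionRow(10L)));

        CurrentRow ok = load(1L, historyTable);
        CurrentRow dangling = load(99L, historyTable); // no history row with alert_id 99

        System.out.println("matched row resolves: " + guarded(ok).isPresent());
        System.out.println("dangling row resolves: " + guarded(dangling).isPresent());
        try {
            unguarded(dangling);
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup failed with NullPointerException");
        }
    }
}
```

The point of the sketch is only that a single dangling history_id is enough to break the whole chain, which is why the fix below targets the data rather than the code.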
Solution
Log in to the backend database and check whether there are unmatched records:
select count(*) from alert_current where history_id not in (select alert_id from alert_history);
select count(*) from alert_history where alert_definition_id not in (select definition_id from alert_definition);
If any records are returned, back up the database (to guard against mistakes), then delete the unmatched records:
delete from alert_current where history_id not in (select alert_id from alert_history);
delete from alert_history where alert_definition_id not in (select definition_id from alert_definition);
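If you would rather script the check and the cleanup than type the statements by hand, the following is a rough JDBC sketch. It is built on assumptions: the class OrphanAlertCleanup and its helpers are hypothetical, the JDBC URL, user and password passed on the command line are placeholders for your own environment, a matching JDBC driver must be on the classpath, and the rollback only helps if the database engine supports transactions. It runs the same statements as above inside one transaction.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class OrphanAlertCleanup {
    // Each entry: child table, foreign-key column, parent table, parent key column.
    static final String[][] RELATIONS = {
        {"alert_current", "history_id", "alert_history", "alert_id"},
        {"alert_history", "alert_definition_id", "alert_definition", "definition_id"},
    };

    // Shared "from ... where ... not in (...)" part of both statements.
    static String orphanPredicate(String child, String fk, String parent, String pk) {
        return " from " + child + " where " + fk
             + " not in (select " + pk + " from " + parent + ")";
    }

    static String countSql(String child, String fk, String parent, String pk) {
        return "select count(*)" + orphanPredicate(child, fk, parent, pk);
    }

    static String deleteSql(String child, String fk, String parent, String pk) {
        return "delete" + orphanPredicate(child, fk, parent, pk);
    }

    static long count(Connection conn, String sql) throws SQLException {
        try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Counts and deletes unmatched rows; rolls everything back if any statement fails.
    static void cleanup(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            for (String[] r : RELATIONS) {
                long orphans = count(conn, countSql(r[0], r[1], r[2], r[3]));
                System.out.println(r[0] + ": " + orphans + " unmatched rows");
                if (orphans > 0) {
                    st.executeUpdate(deleteSql(r[0], r[1], r[2], r[3]));
                }
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            // Without connection details, just print the statements that would run.
            for (String[] r : RELATIONS) {
                System.out.println(countSql(r[0], r[1], r[2], r[3]) + ";");
                System.out.println(deleteSql(r[0], r[1], r[2], r[3]) + ";");
            }
            return;
        }
        // args: JDBC URL, user, password (placeholders for your own database).
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            cleanup(conn);
        }
    }
}
```

One thing to watch: deleting rows from alert_history in the second step can leave additional alert_current rows without a match, so it may be worth re-running the first check afterwards.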
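Restart ambari-server, refresh the alerts page, and check that it displays normally and that the backend log no longer shows the error.
Recommendation: stop ambari-server before making any changes to the Ambari database.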