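Aodh source code analysis
1 Introduction
Aodh is the alarming component that was split out of Ceilometer. It currently consists mainly of three services:
the evaluator service, the notifier service, and the listener service.
The evaluator service checks alarms at a fixed interval. When a check finds that an alarm has fired, it messages the notifier service so that a notification is sent.
The notifier service receives those messages from the evaluator service and dispatches the triggered alarm, for example as a log entry or an HTTP request.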
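The listener service receives event notifications and evaluates event-type alarms against them; the periodic evaluator skips those alarms, as the NOTE comment in the code of section 2.1 states.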
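The hand-off between the evaluator and the notifier can be pictured with a toy sketch. Everything in it is hypothetical (the class names, the `notify` signature, the direct in-process call); none of these names come from Aodh, where the two services exchange messages instead of calling each other.

```python
class ToyNotifier:
    """Stand-in for the notifier service: records what it would deliver."""

    def __init__(self):
        self.delivered = []

    def notify(self, alarm_id, state, action):
        # A real notifier would write a log entry or send an HTTP request here.
        self.delivered.append((action, alarm_id, state))


class ToyEvaluator:
    """Stand-in for the evaluator service: checks a value, notifies on alarm."""

    def __init__(self, notifier, threshold):
        self.notifier = notifier
        self.threshold = threshold

    def check(self, alarm_id, value):
        state = "alarm" if value > self.threshold else "ok"
        if state == "alarm":
            # Assumed direct call; the real services communicate by messaging.
            self.notifier.notify(alarm_id, state, action="log://")
        return state


notifier = ToyNotifier()
evaluator = ToyEvaluator(notifier, threshold=80)
evaluator.check("cpu-high", 95)  # above the threshold, one notification
evaluator.check("cpu-high", 40)  # below the threshold, nothing is sent
```

The point of the split is that the evaluator only decides whether an alarm fired, while delivery details stay inside the notifier.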
2 The Aodh evaluator service
2.1 The service startup code is the start function of the AlarmEvaluationService class in aodh/evaluator/__init__.py:
    def start(self):
        super(AlarmEvaluationService, self).start()
        self.partition_coordinator.start()
        self.partition_coordinator.join_group(self.PARTITIONING_GROUP_NAME)
        # allow time for coordination if necessary
        delay_start = self.partition_coordinator.is_active()
        if self.evaluators:
            interval = self.conf.evaluation_interval
            # Register _evaluate_assigned_alarms as a timer task that runs
            # once per interval
            self.tg.add_timer(
                interval,
                self._evaluate_assigned_alarms,
                initial_delay=interval if delay_start else None)
        if self.partition_coordinator.is_active():
            heartbeat_interval = min(self.conf.coordination.heartbeat,
                                     self.conf.evaluation_interval / 4)
            self.tg.add_timer(heartbeat_interval,
                              self.partition_coordinator.heartbeat)
        # Add a dummy thread to have wait() working
        self.tg.add_timer(604800, lambda: None)
    def _assigned_alarms(self):
        # NOTE(r-mibu): The 'event' type alarms will be evaluated by the
        # event-driven alarm evaluator, so this periodical evaluator skips
        # those alarms.
        all_alarms = self._storage_conn.get_alarms(enabled=True,
                                                   exclude=dict(type='event'))
        all_alarms = list(all_alarms)
        all_alarm_ids = [a.alarm_id for a in all_alarms]
        selected = self.partition_coordinator.extract_my_subset(
            self.PARTITIONING_GROUP_NAME, all_alarm_ids)
        return list(filter(lambda a: a.alarm_id in selected, all_alarms))
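Analysis:
_evaluate_assigned_alarms is registered as a timer task and runs once per evaluation_interval. When the partition coordinator is active, the first run is postponed by one interval to leave time for coordination:
    self.tg.add_timer(
        interval,
        self._evaluate_assigned_alarms,
        initial_delay=interval if delay_start else None)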
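The timing implied by this call can be worked through without any threads. The helper names below (`fire_times`, `heartbeat_interval`) are made up for illustration, the numbers are example values rather than Aodh defaults, and the callback is assumed to take no time to run. Only the `min(heartbeat, evaluation_interval / 4)` formula is taken from the start function.

```python
def fire_times(interval, initial_delay=None, horizon=0):
    """Times (in seconds) at which a fixed-interval timer fires up to horizon.

    Assumption: with no initial delay the first run happens immediately.
    """
    t = initial_delay or 0
    times = []
    while t <= horizon:
        times.append(t)
        t += interval
    return times


def heartbeat_interval(heartbeat, evaluation_interval):
    """Same formula as in start(): never slower than a quarter of the cycle."""
    return min(heartbeat, evaluation_interval / 4)


interval = 60
# Coordination inactive: delay_start is False, so initial_delay is None.
print(fire_times(interval, None, horizon=180))      # [0, 60, 120, 180]
# Coordination active: the first evaluation waits one full interval.
print(fire_times(interval, interval, horizon=180))  # [60, 120, 180]
# The heartbeat is capped at a quarter of the evaluation interval.
print(heartbeat_interval(30, interval))             # 15.0
```

Capping the heartbeat at a quarter of the evaluation interval means a worker sends several heartbeats between two evaluation cycles.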
2.2 The _evaluate_assigned_alarms method:
    def _evaluate_assigned_alarms(self):
        try:
            alarms = self._assigned_alarms()
            LOG.info(_('initiating evaluation cycle on %d alarms') %
                     len(alarms))
            for alarm in alarms:
                self._evaluate_alarm(alarm)
        except Exception:
            LOG.exception(_('alarm evaluation cycle failed'))
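Analysis:
This method obtains the alarms assigned to this evaluator by calling _assigned_alarms, which reads the enabled, non-event alarms from storage and keeps only the subset that the partition coordinator selects for this worker.
It then iterates over those alarms and calls _evaluate_alarm on each one to check it.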
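How the coordinator decides which alarm ids belong to which worker is not shown in this article. One common way to split ids across a group is to hash each id and map it to a member, sketched below. The function name `my_subset`, its parameters, and the hashing scheme are all assumptions for illustration, not the behaviour of extract_my_subset.

```python
import hashlib


def my_subset(my_id, members, alarm_ids):
    """Return the alarm ids owned by my_id (hypothetical hash partitioning)."""
    ordered = sorted(members)  # every worker must see the same member order
    mine = []
    for alarm_id in alarm_ids:
        digest = hashlib.sha256(alarm_id.encode("utf-8")).hexdigest()
        owner = ordered[int(digest, 16) % len(ordered)]
        if owner == my_id:
            mine.append(alarm_id)
    return mine


members = ["evaluator-a", "evaluator-b", "evaluator-c"]
alarm_ids = ["alarm-%d" % i for i in range(20)]
subsets = {m: my_subset(m, members, alarm_ids) for m in members}

# Each alarm id ends up with exactly one worker.
assert sorted(sum(subsets.values(), [])) == sorted(alarm_ids)
```

Whatever scheme is used, the property that matters for the evaluation cycle is the one asserted at the end: the subsets are disjoint and together cover every alarm, so no alarm is evaluated twice or skipped.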
2.3 The _evaluate_alarm method:
    def _evaluate_alarm(self, alarm):
        """Evaluate the alarms assigned to this evaluator."""
        if alarm.type not in self.evaluators:
            LOG.debug('skipping alarm %s: type unsupported', alarm.alarm_id)
            return
        LOG.debug('evaluating alarm %s', alarm.alarm_id)
        try:
            self.evaluators[alarm.type].obj.evaluate(alarm)
        except Exception:
            LOG.exception(_('Failed to evaluate alarm %s'), alarm.alarm_id)
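Analysis:
The method reads the type of the alarm being checked (alarm.type) and looks up the evaluator object registered for that type
(currently ThresholdEvaluator, GnocchiResourceThresholdEvaluator, and others), then calls that object's evaluate method.
Alarms whose type has no registered evaluator are skipped, and an exception raised while evaluating one alarm is logged without stopping the cycle.
For a threshold alarm, the call ends up in the evaluate method of the ThresholdEvaluator class in aodh/evaluator/threshold.py.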
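The lookup-by-type pattern can be shown with a minimal sketch. The source's `self.evaluators[alarm.type].obj` suggests a registry whose entries wrap the evaluator objects; a plain dict is used here as a simplification, and the class names and dict-shaped alarms are hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


class ToyThresholdEvaluator:
    def evaluate(self, alarm):
        return "evaluated %s" % alarm["alarm_id"]


class BrokenEvaluator:
    def evaluate(self, alarm):
        raise RuntimeError("backend unavailable")


def evaluate_alarm(alarm, evaluators):
    """Dispatch on alarm type; skip unknown types and isolate failures."""
    evaluator = evaluators.get(alarm["type"])
    if evaluator is None:
        LOG.debug("skipping alarm %s: type unsupported", alarm["alarm_id"])
        return None
    try:
        return evaluator.evaluate(alarm)
    except Exception:
        # One failing alarm must not abort the rest of the cycle.
        LOG.exception("Failed to evaluate alarm %s", alarm["alarm_id"])
        return None


evaluators = {"threshold": ToyThresholdEvaluator(), "broken": BrokenEvaluator()}
alarms = [
    {"alarm_id": "a1", "type": "threshold"},
    {"alarm_id": "a2", "type": "unknown"},
    {"alarm_id": "a3", "type": "broken"},
    {"alarm_id": "a4", "type": "threshold"},
]
results = [evaluate_alarm(a, evaluators) for a in alarms]
```

Here `a2` is skipped because its type is unregistered, `a3` fails and is logged, and `a4` is still evaluated afterwards.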
2.4 The evaluate method of the ThresholdEvaluator class in aodh/evaluator/threshold.py:
    def evaluate(self, alarm):
        if not self.within_time_constraint(alarm):
            LOG.debug('Attempted to evaluate alarm %s, but it is not '
                      'within its time constraint.', alarm.alarm_id)
            return
        state, trending_state, statistics, outside_count = self.evaluate_rule(
            alarm.rule, alarm_id=alarm.alarm_id)
        self._transition_alarm(alarm, state, trending_state, statistics,
                               outside_count)
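The body of evaluate_rule is not shown in this excerpt, so the following is only one plausible reading of the values it returns (state, trending_state, outside_count), not the Aodh implementation. The function name `evaluate_rule_sketch`, the state strings, and the comparison keys are assumptions made for the example.

```python
import operator

COMPARATORS = {
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}


def evaluate_rule_sketch(values, threshold, comparison="gt"):
    """Return (state, trending_state, outside_count) for a list of samples.

    Assumed semantics: a definite state is reported only when every sample
    agrees; otherwise only a trend based on the latest sample is reported.
    """
    if not values:
        return "insufficient data", None, 0
    op = COMPARATORS[comparison]
    outside = [op(v, threshold) for v in values]
    outside_count = sum(outside)
    if all(outside):
        return "alarm", None, outside_count
    if not any(outside):
        return "ok", None, outside_count
    trending_state = "alarm" if outside[-1] else "ok"
    return None, trending_state, outside_count


print(evaluate_rule_sketch([91, 95, 99], 90))  # ('alarm', None, 3)
print(evaluate_rule_sketch([10, 20], 90))      # ('ok', None, 0)
print(evaluate_rule_sketch([10, 95], 90))      # (None, 'alarm', 1)
```

Under this reading, the transition step would receive either a definite state or a trend, together with the number of samples that fell outside the threshold.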