Abstract
Apache Atlas is an open-source component in the data governance space that has attracted wide attention for features such as its excellent table- and column-level lineage handling and its ability to build a data dictionary. In production it is commonly paired with Apache Hive: it automatically syncs Hive's metadata and data changes, builds data lineage, and provides metadata search.
Atlas uses its bundled Hive Hook to capture metadata changes during Hive SQL execution. This article walks through the relevant Hive Hook code to explore how it works.
Main Text
To improve its own extensibility, Apache Hive exposes a number of hook functions at predefined points of its execution. Through these hooks, an external program can obtain the metadata available at a specific execution step.
A Hive SQL statement goes through parsing, compilation, optimization, and execution. The hook functions involved in these steps are as follows.
Atlas relies on hive.exec.post.hooks to parse Hive's HookContext and extract metadata. It then sends this information to Atlas over Kafka, where it is converted into the corresponding entities and entity relationships. The overall flow is roughly as follows.
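Registering the hook is a Hive configuration change. A typical hive-site.xml entry looks like the following (the class name is the one shipped in Atlas's hive-bridge module; the hook jar must also be on Hive's classpath, e.g. via HIVE_AUX_JARS_PATH):

```xml
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
```

hive.exec.post.hooks accepts a comma-separated list, so the Atlas hook can coexist with other post-execution hooks.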
Looking at Atlas's Hive hook code, the main class HiveHook and the other classes involved in the flow above are shown in the diagram below.
Atlas's HiveHook class implements Hive's ExecuteWithHookContext interface and overrides its run() method. The overridden run() method reads the operation (oper, e.g. creating a table or a database) from Hive's HookContext, converts the operation into the corresponding event, and finally sends the resulting messages to Kafka via the parent class AtlasHook's notifyEntities() method.
The core logic of HiveHook's run() method is as follows:
public void run(HookContext hookContext) throws Exception {
    HiveOperation       oper    = OPERATION_MAP.get(hookContext.getOperationName());
    AtlasHiveHookContext context = new AtlasHiveHookContext(this, oper, hookContext, getKnownObjects(), isSkipTempTables());
    BaseHiveEvent       event   = null;

    switch (oper) { // convert the operation into the corresponding event
        case CREATEDATABASE:
            event = new CreateDatabase(context);
            break;
        case DROPDATABASE:
            event = new DropDatabase(context);
            break;
        case CREATETABLE:
            event = new CreateTable(context);
            break;
        ...
    }

    if (event != null) {
        final UserGroupInformation ugi = hookContext.getUgi() == null ? Utils.getUGI() : hookContext.getUgi();

        // send the event's messages to Kafka
        super.notifyEntities(ActiveEntityFilter.apply(event.getNotificationMessages()), ugi);
    }
    ...
}
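The switch in run() is a plain operation-to-event dispatch. A minimal, self-contained sketch of the same pattern, with hypothetical stand-ins for the Hive/Atlas types (Operation, Event, and the concrete event classes below are illustrative, not the real API), might look like this:

```java
// Hypothetical stand-ins for HiveOperation and BaseHiveEvent,
// illustrating the operation-to-event dispatch used in HiveHook.run().
enum Operation { CREATEDATABASE, DROPDATABASE, CREATETABLE }

abstract class Event {
    abstract String describe();
}

class CreateDatabaseEvent extends Event {
    String describe() { return "create database"; }
}

class DropDatabaseEvent extends Event {
    String describe() { return "drop database"; }
}

class CreateTableEvent extends Event {
    String describe() { return "create table"; }
}

public class HookDispatch {
    // Mirrors the switch in HiveHook.run(): map each operation to its event.
    static Event toEvent(Operation oper) {
        switch (oper) {
            case CREATEDATABASE: return new CreateDatabaseEvent();
            case DROPDATABASE:   return new DropDatabaseEvent();
            case CREATETABLE:    return new CreateTableEvent();
            default:             return null; // unsupported operations produce no event
        }
    }

    public static void main(String[] args) {
        for (Operation oper : Operation.values()) {
            System.out.println(oper + " -> " + toEvent(oper).describe());
        }
    }
}
```

In the real hook, each event class knows how to turn its HookContext into Atlas notification messages; here describe() stands in for that step.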
The AtlasHook class initializes a KafkaNotification via NotificationProvider.get(), and its notifyEntities() method delegates to notifyEntitiesInternal() to deliver the messages to Kafka.
public abstract class AtlasHook {
    ...
    protected static NotificationInterface notificationInterface;

    static {
        notificationInterface = NotificationProvider.get();
    }

    public static void notifyEntities(List<HookNotification> messages, UserGroupInformation ugi, int maxRetries, MessageSource source) {
        if (executor == null) { // send synchronously
            notifyEntitiesInternal(messages, maxRetries, ugi, notificationInterface, logFailedMessages, failedMessagesLogger, source);
        } else {
            executor.submit(new Runnable() {
                @Override
                public void run() {
                    // notificationInterface is the KafkaNotification instance; it serializes
                    // the messages to JSON and sends them to Kafka
                    notifyEntitiesInternal(messages, maxRetries, ugi, notificationInterface, logFailedMessages, failedMessagesLogger, source);
                }
            });
        }
    }
    ...
}
static void notifyEntitiesInternal(List<HookNotification> messages, int maxRetries, UserGroupInformation ugi,
                                   NotificationInterface notificationInterface,
                                   boolean shouldLogFailedMessages, FailedMessagesLogger logger, MessageSource source) {
    if (ugi == null) {
        /*
         * NotificationInterface notificationInterface = new KafkaNotification()
         *
         * 1. AbstractNotification implements NotificationInterface
         *    (1) AbstractNotification overrides NotificationInterface's send() method;
         *        send() calls createNotificationMessages() to serialize the messages to JSON
         *    (2) send() then calls the abstract method sendInternal()
         *
         * 2. KafkaNotification extends AbstractNotification
         *    (1) KafkaNotification implements the abstract sendInternal() method,
         *        sending the JSON messages to Kafka
         *
         * Combining 1 & 2, the notificationInterface instance can serialize messages
         * to JSON and send them to Kafka.
         */
        notificationInterface.send(NotificationInterface.NotificationType.HOOK, messages, source);
    } else {
        PrivilegedExceptionAction<Object> privilegedNotify = new PrivilegedExceptionAction<Object>() {
            @Override
            public Object run() throws Exception {
                notificationInterface.send(NotificationInterface.NotificationType.HOOK, messages, source);
                return messages;
            }
        };

        ugi.doAs(privilegedNotify);
    }
    ...
}
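The comment above describes a classic template-method split: the abstract parent fixes the send() sequence (serialize, then transmit) and leaves the transport step to subclasses. A self-contained sketch of that structure (simplified names, no real Kafka client; the *Sketch classes are illustrative, not Atlas's API) could be:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the AbstractNotification / KafkaNotification split:
// send() is the template method; sendInternal() is the transport-specific step.
abstract class AbstractNotificationSketch {
    // Template method: serialize first, then hand off to the concrete transport.
    public final void send(List<String> messages) {
        List<String> json = createNotificationMessages(messages);
        sendInternal(json);
    }

    // Stand-in for the JSON serialization done by createNotificationMessages().
    protected List<String> createNotificationMessages(List<String> messages) {
        List<String> out = new ArrayList<>();
        for (String m : messages) {
            out.add("{\"message\":\"" + m + "\"}");
        }
        return out;
    }

    // Transport-specific step, implemented by subclasses.
    protected abstract void sendInternal(List<String> jsonMessages);
}

// "Kafka" subclass: here it just records what would be produced to the topic.
class KafkaNotificationSketch extends AbstractNotificationSketch {
    final List<String> produced = new ArrayList<>();

    @Override
    protected void sendInternal(List<String> jsonMessages) {
        // a real implementation would hand these to a Kafka producer
        produced.addAll(jsonMessages);
    }
}

public class NotificationDemo {
    public static void main(String[] args) {
        KafkaNotificationSketch kafka = new KafkaNotificationSketch();
        kafka.send(List.of("CREATE TABLE t1"));
        System.out.println(kafka.produced);
    }
}
```

Because send() is final in this sketch, every transport subclass inherits the same serialize-then-send sequence and can only vary the delivery step, which is exactly the guarantee the Atlas code relies on.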