mysql 钩子函数_你想了解的Hive Query生命周期--钩子函数篇！

最新推荐文章于 2023-02-25 16:42:20 发布

Derek 程勇

最新推荐文章于 2023-02-25 16:42:20 发布

阅读量282

点赞数

文章标签： mysql 钩子函数

本文链接：https://blog.csdn.net/weixin_35572363/article/details/114338098

版权

前言

无论Hive Cli还是HiveServer2，一个HQl语句都要经过Driver进行解析和执行，粗略如下图：

Driver处理的流程如下：

HQL解析(生成AST语法树) => 语法分析(得到QueryBlock) => 生成逻辑执行计划(Operator) => 逻辑优化(Logical Optimizer Operator) => 生成物理执行计划(Task Plan) => 物理优化(Task Tree) => 构建执行计划(QueryPlan) => 表以及操作鉴权 => 执行引擎执行

流程涉及HQL解析，HQL编译(语法分析，逻辑计划和物理计划，鉴权)，执行器执行三个大的方面，在整个生命周期中，按执行顺序会有如下钩子函数：

Driver.run()之前的preDriverRun

该钩子函数由配置 hive.exec.driver.run.hooks 控制，多个钩子实现类以逗号间隔，钩子需实现 org.apache.hadoop.hive.ql.HiveDriverRunHook 接口，该接口描述如下：

public interface HiveDriverRunHook extends Hook {

/**

* Invoked before Hive begins any processing of a command in the Driver,

* notably before compilation and any customizable performance logging.

public void preDriverRun(

HiveDriverRunHookContext hookContext) throws Exception;

/**

* Invoked after Hive performs any processing of a command, just before a

* response is returned to the entity calling the Driver.

public void postDriverRun(

HiveDriverRunHookContext hookContext) throws Exception;

}

可以看出钩子还提供了 postDriverRun 方法供HQL执行完，数据返回前调用，这个在后面会说到

其参数在Hive里使用的是 HiveDriverRunHookContext 的默认实现类 org.apache.hadoop.hive.ql.HiveDriverRunHookContextImpl，里面提供了两个有用的参数，分别是HiveConf和要执行的Command，其调用信息如下：

HiveDriverRunHookContext hookContext = new HiveDriverRunHookContextImpl(conf, command);

// Get all the driver run hooks and pre-execute them.

List driverRunHooks;

try {

driverRunHooks = getHooks(HiveConf.ConfVars.HIVE_DRIVER_RUN_HOOKS,

HiveDriverRunHook.class);

for (HiveDriverRunHook driverRunHook : driverRunHooks) {

driverRunHook.preDriverRun(hookContext);

}

} catch (Exception e) {

errorMessage = "FAILED: Hive Internal Error: " + Utilities.getNameMessage(e);

SQLState = ErrorMsg.findSQLState(e.getMessage());

downstreamError = e;

console.printError(errorMessage + "\n"

+ org.apache.hadoop.util.StringUtils.stringifyException(e));

return createProcessorResponse(12);

}

语法分析之前的preAnalyze

在Driver开始run之后，HQL经过解析会进入编译阶段的语法分析，而在语法分析前会经过钩子 HiveSemanticAnalyzerHook 的 preAnalyze 方法，该钩子函数由 hive.semantic.analyzer.hook 配置，钩子需实现 org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHook 接口，接口描述如下：

public interface HiveSemanticAnalyzerHook extends Hook {

public ASTNode preAnalyze(

HiveSemanticAnalyzerHookContext context,

ASTNode ast) throws SemanticException;

public void postAnalyze(

HiveSemanticAnalyzerHookContext context,

List> rootTasks) throws SemanticException;

}

可以看出钩子类还提供了 postAnalyze 方法供语法分析完后调用，这个在后面会提到

其参数在Hive里使用的是 HiveSemanticAnalyzerHookContext 的默认实现类 org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContextImpl，里面提供了HQL对应的输入，输出，提交用户，HiveConf和客户端IP等信息，输入和输出的表及分区信息需要做完语法分析后才能得到，在 preAnalyze 里获取不到，其调用信息如下：

List saHooks =

getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,

HiveSemanticAnalyzerHook.class);

// Do semantic analysis and plan generation

if (saHooks != null) {

HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();

hookCtx.setConf(conf);

hookCtx.setUserName(userName);

hookCtx.setIpAddress(SessionState.get().getUserIpAddress());

hookCtx.setCommand(command);

for (HiveSemanticAnalyzerHook hook : saHooks) {

tree = hook.preAnalyze(hookCtx, tree);

}

// 此处开始进行语法分析，会涉及逻辑执行计划和物理执行计划的生成和优化

sem.analyze(tree, ctx);

// 更新分析器以便后续的postAnalyzer钩子执行

hookCtx.update(sem);

for (HiveSemanticAnalyzerHook hook : saHooks) {

hook.postAnalyze(hookCtx, sem.getRootTasks());

}

} else {

sem.analyze(tree, ctx);

}

语法分析之后的postAnalyze

从 preAnalyze 的分析可以看出，postAnalyze 与其属于同一个钩子类，因此配置也相同，不同的是它位于Hive的语法分析之后，因此可以获取到HQL的输入和输出表及分区信息，以及语法分析得到的Task信息，由此可以判断是否是需要分布式执行的任务，以及执行引擎是什么，具体的代码和配置可见上面的 preAnalyze 分析

生成执行计划之前的redactor钩子

这个钩子函数是在语法分析之后，生成QueryPlan之前，所以执行它的时候语法分析已完成，具体要跑的任务已定，这个钩子的目的在于完成QueryString的替换，比如QueryString中包含敏感的表或字段信息，在这里都可以完成替换，从而在Yarn的RM界面或其他方式查询该任务的时候，会显示经过替换后的HQL

该钩子由 hive.exec.query.redactor.hooks 配置，多个实现类以逗号间隔，钩子需继承 org.apache.hadoop.hive.ql.hooks.Redactor 抽象类，并替换 redactQuery 方法，接口描述如下：

public abstract class Redactor implements Hook, Configurable {

private Configuration conf;

public void setConf(Configuration conf) {

this.conf = conf;

}

public Configuration getConf() {

return conf;

}

/**

* Implementations may modify the query so that when placed in the job.xml

* and thus potenially exposed to admin users, the query does not expose

* sensitive information.

public String redactQuery(String query) {

return query;

}

其调用信息如下：

public static String redactLogString(HiveConf conf, String logString)

throws InstantiationException, IllegalAccessException, ClassNotFoundException {

String redactedString = logString;

if (conf != null && logString != null) {

List queryRedactors = getHooks(conf, ConfVars.QUERYREDACTORHOOKS, Redactor.class);

for (Redactor redactor : queryRedactors) {

redactor.setConf(conf);

redactedString = redactor.redactQuery(redactedString);

}

return redactedString;

}

Task执行之前的preExecutionHook

在执行计划QueryPlan生成完，并通过鉴权后，就会进行具体Task的执行，而Task执行之前会经过一个钩子函数，钩子函数由 hive.exec.pre.hooks 配置，多个钩子实现类以逗号间隔，该钩子的实现方式有两个，分别是：

一、实现 org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext 接口

该接口会传入 org.apache.hadoop.hive.ql.hooks.HookContext 的实例作为参数，而参数类里带有执行计划，HiveConf，Lineage信息，UGI，提交用户名，输入输出表及分区信息等私有变量，为我们实现自己的功能提供了很多帮助

接口描述如下：

public interface ExecuteWithHookContext extends Hook {

void run(HookContext hookContext) throws Exception;

}

二、实现 org.apache.hadoop.hive.ql.hooks.PreExecute 接口

该接口传入参数包括SessionState，UGI和HQL的输入输出表及分区信息，目前该接口被标为已过时的接口，相比上面的ExecuteWithHookContext，该接口提供的信息可能不完全能满足我们的需求

其接口描述如下：

public interface PreExecute extends Hook {

/**

* The run command that is called just before the execution of the query.

* @param sess

* The session state.

* @param inputs

* The set of input tables and partitions.

* @param outputs

* The set of output tables, partitions, local and hdfs directories.

* @param ugi

* The user group security information.

@Deprecated

public void run(SessionState sess, Set inputs,

Set outputs, UserGroupInformation ugi)

throws Exception;

}

该钩子的调用信息如下：

SessionState ss = SessionState.get();

HookContext hookContext = new HookContext(plan, conf, ctx.getPathToCS(), ss.getUserName(), ss.getUserIpAddress());

hookContext.setHookType(HookContext.HookType.PRE_EXEC_HOOK);

for (Hook peh : getHooks(HiveConf.ConfVars.PREEXECHOOKS)) {

if (peh instanceof ExecuteWithHookContext) {

perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

((ExecuteWithHookContext) peh).run(hookContext);

perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

} else if (peh instanceof PreExecute) {

perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

((PreExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),

Utils.getUGI());

perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.PRE_HOOK + peh.getClass().getName());

}

Task执行失败时的ON_FAILURE_HOOKS

Task执行完后，如果执行失败了，那么Hive会调用这个失败的Hook。该钩子由参数 hive.exec.failure.hooks 配置，多个钩子实现类以逗号间隔，钩子需实现 org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext 接口，该接口在上文已有描述。该钩子主要用于在任务执行失败时执行一些措施，比如统计等等

该钩子的调用信息如下：

hookContext.setHookType(HookContext.HookType.ON_FAILURE_HOOK);

// Get all the failure execution hooks and execute them.

for (Hook ofh : getHooks(HiveConf.ConfVars.ONFAILUREHOOKS)) {

perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());

((ExecuteWithHookContext) ofh).run(hookContext);

perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.FAILURE_HOOK + ofh.getClass().getName());

}

Task执行完毕的postExecutionHook

这个钩子是在Task任务执行完毕后执行，如果Task失败，会先执行ON_FAILURE_HOOKS这个钩子，之后执行postExecutionHook，该钩子由参数 hive.exec.post.hooks 配置，多个钩子实现类以逗号间隔，该钩子的实现方式也有两个

一、实现 org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext 接口

这个与preExecutionHook一致

二、实现 org.apache.hadoop.hive.ql.hooks.PostExecute 接口

该接口传入参数包括SessionState，UGI，列级的LineageInfo和HQL的输入输出表及分区信息，目前该接口被标为已过时的接口，相比上面的ExecuteWithHookContext，该接口提供的信息可能不完全能满足我们的需求

其接口描述如下：

public interface PostExecute extends Hook {

/**

* The run command that is called just before the execution of the query.

* @param sess

* The session state.

* @param inputs

* The set of input tables and partitions.

* @param outputs

* The set of output tables, partitions, local and hdfs directories.

* @param lInfo

* The column level lineage information.

* @param ugi

* The user group security information.

@Deprecated

void run(SessionState sess, Set inputs,

Set outputs, LineageInfo lInfo,

UserGroupInformation ugi) throws Exception;

}

该钩子的调用信息如下：

hookContext.setHookType(HookContext.HookType.POST_EXEC_HOOK);

// Get all the post execution hooks and execute them.

for (Hook peh : getHooks(HiveConf.ConfVars.POSTEXECHOOKS)) {

if (peh instanceof ExecuteWithHookContext) {

perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

((ExecuteWithHookContext) peh).run(hookContext);

perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

} else if (peh instanceof PostExecute) {

perfLogger.PerfLogBegin(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

((PostExecute) peh).run(SessionState.get(), plan.getInputs(), plan.getOutputs(),

(SessionState.get() != null ? SessionState.get().getLineageState().getLineageInfo()

: null), Utils.getUGI());

perfLogger.PerfLogEnd(CLASS_NAME, PerfLogger.POST_HOOK + peh.getClass().getName());

}

Task执行完毕结果返回之前的postDriverRun

该钩子在Task执行完毕，而结果尚未返回之前执行，与preDriverRun相对应，由于是同一个接口，这里不做详细描述

最后

至此，整个HQL执行生命周期中的钩子函数都讲完了，执行顺序和流程可梳理如下：

Driver.run()

=> HiveDriverRunHook.preDriverRun()(hive.exec.driver.run.hooks)

=> Driver.compile()

=> HiveSemanticAnalyzerHook.preAnalyze()(hive.semantic.analyzer.hook)

=> SemanticAnalyze(QueryBlock, LogicalPlan, PhyPlan, TaskTree)

=> HiveSemanticAnalyzerHook.postAnalyze()(hive.semantic.analyzer.hook)

=> QueryString redactor(hive.exec.query.redactor.hooks)

=> QueryPlan Generation

=> Authorization

=> Driver.execute()

=> ExecuteWithHookContext.run() || PreExecute.run() (hive.exec.pre.hooks)

=> TaskRunner

=> if failed, ExecuteWithHookContext.run()(hive.exec.failure.hooks)

=> ExecuteWithHookContext.run() || PostExecute.run() (hive.exec.post.hooks)

=> HiveDriverRunHook.postDriverRun()(hive.exec.driver.run.hooks)

Derek 程勇

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
mysql 钩子函数_你想了解的Hive Query生命周期--钩子函数篇！

前言无论Hive Cli还是HiveServer2，一个HQl语句都要经过Driver进行解析和执行，粗略如下图： Driver处理的流程如下：HQL解析(生成AST语法树) => 语法分析(得到QueryBlock) => 生成逻辑执行计划(Operator) => 逻辑优化(Logical Optimizer Operator) => 生成物理执行计划(Task Pla...
复制链接

扫一扫

mysql 钩子函数_你想了解的Hive Query生命周期--钩子函数篇！

“相关推荐”对你有帮助么？