[Learning Hive Together] Part 19 – Using the Hive API to Analyze an HQL Statement's Execution Plan, Job Count, and Table Lineage

 

Keywords: Hive HQL job count, Hive execution plan, Hive LineageInfo

This article shows how to use Hive's API to obtain the final execution plan of an HQL statement, and from it the number of jobs the statement will launch. It also shows how to use the API to extract the input and output tables of an HQL statement. Both pieces of information are very useful for metadata management and lineage analysis of Hive tables.

When Hive executes an HQL statement, it goes through the following steps:

  1. Syntactic parsing: Antlr defines the SQL grammar; lexical and syntactic analysis turn the SQL into an abstract syntax tree (AST Tree);
  2. Semantic analysis: the AST Tree is traversed and abstracted into QueryBlocks, the basic building units of a query;
  3. Logical plan generation: the QueryBlocks are traversed and translated into an operator tree (OperatorTree);
  4. Logical plan optimization: logical-layer optimizers transform the OperatorTree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data;
  5. Physical plan generation: the OperatorTree is traversed and translated into MapReduce tasks;
  6. Physical plan optimization: physical-layer optimizers transform the MapReduce tasks to produce the final execution plan.

These steps are explained very well in a post on Meituan's tech blog: http://tech.meituan.com/hive-sql-to-mapreduce.html

In general, every table or subquery in an HQL statement produces one job in the logical plan, but Hive then applies further optimizations, such as MapJoin, so the number of jobs a statement will finally launch is hard to tell just by reading the HQL.
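Counting jobs boils down to walking the plan's tree of tasks and counting the MapReduce ones, which is what Utilities.getMRTasks does for the real plan. The idea can be sketched without any Hive dependencies; the Task class below is a simplified stand-in, not Hive's API:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TaskCountSketch {
    // A stand-in for Hive's task node: some tasks are MapReduce jobs,
    // others (e.g. file moves, stats gathering) are not.
    static class Task {
        boolean isMapReduce;
        List<Task> children = new ArrayList<>();
        Task(boolean mr) { isMapReduce = mr; }
    }

    // Walk the task DAG from its roots and count the MapReduce tasks.
    static int countMRTasks(List<Task> roots) {
        Set<Task> seen = new HashSet<>();
        Deque<Task> stack = new ArrayDeque<>(roots);
        int n = 0;
        while (!stack.isEmpty()) {
            Task t = stack.pop();
            if (!seen.add(t)) continue;   // a task may be reachable twice in a DAG
            if (t.isMapReduce) n++;
            stack.addAll(t.children);
        }
        return n;
    }

    public static void main(String[] args) {
        Task mr1 = new Task(true);
        Task mr2 = new Task(true);
        mr1.children.add(mr2);
        mr2.children.add(new Task(false)); // e.g. a trailing file-move task
        System.out.println(countMRTasks(Arrays.asList(mr1))); // prints 2
    }
}
```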

Getting an HQL statement's execution plan and job count

Let's go straight to the code:


 
package com.lxw1234.test;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.Context;
import org.apache.hadoop.hive.ql.QueryPlan;
import org.apache.hadoop.hive.ql.exec.Utilities;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseUtils;
import org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory;
import org.apache.hadoop.hive.ql.session.SessionState;

/**
 * lxw的大数据田地 -- lxw1234.com
 * @author lxw1234
 */
public class HiveQueryPlan {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-site.xml"));
        conf.addResource(new Path("file:///usr/local/apache-hive-0.13.1-bin/conf/hive-default.xml.template"));
        conf.set("javax.jdo.option.ConnectionURL",
                "jdbc:mysql://127.0.0.1:3306/hive?createDatabaseIfNotExist=true&useUnicode=true&characterEncoding=utf-8");
        conf.set("hive.metastore.local", "true");
        conf.set("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver");
        conf.set("javax.jdo.option.ConnectionUserName", "hive");
        conf.set("javax.jdo.option.ConnectionPassword", "hive");
        conf.set("hive.stats.dbclass", "jdbc:mysql");
        conf.set("hive.stats.jdbcdriver", "com.mysql.jdbc.Driver");
        conf.set("hive.exec.dynamic.partition.mode", "nonstrict");

        String command = args[0];
        SessionState.start(conf);
        Context ctx = new Context(conf);
        // Parse the HQL into an AST
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(command, ctx);
        tree = ParseUtils.findRootNonNullToken(tree);
        // Semantic analysis: build and validate the operator tree
        BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(conf, tree);
        sem.analyze(tree, ctx);
        sem.validate();
        // Build the final query plan and count its MapReduce tasks
        QueryPlan queryPlan = new QueryPlan(command, sem, 0L);
        int jobs = Utilities.getMRTasks(queryPlan.getRootTasks()).size();
        System.out.println("Total jobs = " + jobs);
    }
}

Package the code above as testhive.jar. Running the class requires Hive's dependency jars; on a machine with both the Hadoop and Hive clients installed, run:


 
for f in /usr/local/apache-hive-0.13.1-bin/lib/*.jar; do
  HADOOP_CLASSPATH=${HADOOP_CLASSPATH}:$f;
done
export HADOOP_CLASSPATH

Let's parse the following three HQL statements:


 
HQL1: SELECT COUNT(1) FROM liuxiaowen.lxw1;
HQL2: SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x;
HQL3: SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);

Parsing HQL1:


 
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1"

The result:

[Screenshot: HiveQueryPlan output]

Parsing HQL2:


 
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM (SELECT url FROM liuxiaowen.lxw1 GROUP BY url) x"

The result:

[Screenshot: HiveQueryPlan output]

Parsing HQL3:


 
hadoop jar testhive.jar com.lxw1234.test.HiveQueryPlan "SELECT COUNT(1) FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain)"

The result:

[Screenshot: HiveQueryPlan output]

For HQL3, Hive automatically optimized the join into a MapJoin, so joining the two tables ends up needing only one job. Let's verify this by running the statement in Hive:

[Screenshot: verification in the Hive CLI]

Parsing table lineage from HQL

For metadata management, you may need to know which tables exist in Hive and how they relate to each other, for example that table A is aggregated from tables B and C.
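Those relationships can be stored as a simple map from each output table to the set of input tables it was derived from. A minimal sketch (the table names here are hypothetical placeholders):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class LineageGraphSketch {
    // output table -> the input tables it was derived from
    static Map<String, Set<String>> lineage = new LinkedHashMap<>();

    // Record one "output derived from inputs" relationship;
    // repeated inputs collapse into the set.
    static void record(String output, String... inputs) {
        lineage.computeIfAbsent(output, k -> new TreeSet<>())
               .addAll(Arrays.asList(inputs));
    }

    public static void main(String[] args) {
        // e.g. table A is aggregated from tables B and C
        record("A", "B", "C");
        record("A", "C"); // duplicate, no effect
        System.out.println(lineage); // prints {A=[B, C]}
    }
}
```

Feeding such a map from the input/output tables extracted per statement gives a queryable lineage graph across the warehouse.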

Hive ships with a tool for extracting the source and target tables of an HQL statement: org.apache.hadoop.hive.ql.tools.LineageInfo.

However, that class only recognizes target tables populated via INSERT statements; it cannot detect tables created with CREATE TABLE AS.

The code below makes only a small modification to org.apache.hadoop.hive.ql.tools.LineageInfo:


 
package com.lxw1234.test;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Stack;
import java.util.TreeSet;

import org.apache.hadoop.hive.ql.lib.DefaultGraphWalker;
import org.apache.hadoop.hive.ql.lib.DefaultRuleDispatcher;
import org.apache.hadoop.hive.ql.lib.Dispatcher;
import org.apache.hadoop.hive.ql.lib.GraphWalker;
import org.apache.hadoop.hive.ql.lib.Node;
import org.apache.hadoop.hive.ql.lib.NodeProcessor;
import org.apache.hadoop.hive.ql.lib.NodeProcessorCtx;
import org.apache.hadoop.hive.ql.lib.Rule;
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer;
import org.apache.hadoop.hive.ql.parse.HiveParser;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;
import org.apache.hadoop.hive.ql.parse.SemanticException;

/**
 * lxw的大数据田地 -- lxw1234.com
 * @author lxw1234
 */
public class HiveLineageInfo implements NodeProcessor {

    /**
     * Stores the input tables of the SQL.
     */
    TreeSet inputTableList = new TreeSet();

    /**
     * Stores the output tables of the SQL.
     */
    TreeSet OutputTableList = new TreeSet();

    /**
     * @return java.util.TreeSet
     */
    public TreeSet getInputTableList() {
        return inputTableList;
    }

    /**
     * @return java.util.TreeSet
     */
    public TreeSet getOutputTableList() {
        return OutputTableList;
    }

    /**
     * Implements the process method for the NodeProcessor interface.
     */
    public Object process(Node nd, Stack stack, NodeProcessorCtx procCtx,
            Object... nodeOutputs) throws SemanticException {
        ASTNode pt = (ASTNode) nd;

        switch (pt.getToken().getType()) {

        // CREATE TABLE [AS SELECT ...]: the created table is an output table.
        // This is the case the stock LineageInfo does not handle.
        case HiveParser.TOK_CREATETABLE:
            OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode) pt.getChild(0)));
            break;

        // INSERT target table
        case HiveParser.TOK_TAB:
            OutputTableList.add(BaseSemanticAnalyzer.getUnescapedName((ASTNode) pt.getChild(0)));
            break;

        // Table reference in FROM/JOIN: an input table
        case HiveParser.TOK_TABREF:
            ASTNode tabTree = (ASTNode) pt.getChild(0);
            String table_name = (tabTree.getChildCount() == 1) ?
                BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0)) :
                BaseSemanticAnalyzer.getUnescapedName((ASTNode) tabTree.getChild(0)) + "." + tabTree.getChild(1);
            inputTableList.add(table_name);
            break;
        }
        return null;
    }

    /**
     * Parses the given query and gets the lineage info.
     *
     * @param query
     * @throws ParseException
     */
    public void getLineageInfo(String query) throws ParseException,
            SemanticException {

        /*
         * Get the AST tree
         */
        ParseDriver pd = new ParseDriver();
        ASTNode tree = pd.parse(query);

        while ((tree.getToken() == null) && (tree.getChildCount() > 0)) {
            tree = (ASTNode) tree.getChild(0);
        }

        /*
         * Initialize the event processor and dispatcher.
         */
        inputTableList.clear();
        OutputTableList.clear();

        // Create a walker which walks the tree in a DFS manner while maintaining
        // the operator stack.
        Map<Rule, NodeProcessor> rules = new LinkedHashMap<Rule, NodeProcessor>();

        // The dispatcher fires the processor corresponding to the closest matching
        // rule and passes the context along
        Dispatcher disp = new DefaultRuleDispatcher(this, rules, null);
        GraphWalker ogw = new DefaultGraphWalker(disp);

        // Create a list of top-level nodes and start walking
        ArrayList topNodes = new ArrayList();
        topNodes.add(tree);
        ogw.startWalking(topNodes, null);
    }

    public static void main(String[] args) throws IOException, ParseException,
            SemanticException {
        String query = args[0];
        HiveLineageInfo lep = new HiveLineageInfo();
        lep.getLineageInfo(query);
        System.out.println("Input tables = " + lep.getInputTableList());
        System.out.println("Output tables = " + lep.getOutputTableList());
    }
}

Package the program above as testhive.jar and, as before, put Hive's dependency jars on the classpath when running it.

Let's analyze the following two HQL statements:


 
HQL1: CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1;
HQL2: INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain);

Run the commands:


 
hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "CREATE TABLE liuxiaowen.lxw1234 AS SELECT * FROM liuxiaowen.lxw1"
hadoop jar testhive.jar com.lxw1234.test.HiveLineageInfo "INSERT OVERWRITE TABLE liuxiaowen.lxw3 SELECT a.url FROM liuxiaowen.lxw1 a join liuxiaowen.lxw2 b ON (a.url = b.domain)"

The analysis results:

[Screenshot: HiveLineageInfo output]

The input tables and output tables of both HQL statements are parsed out correctly.
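The core mechanism in HiveLineageInfo is that DefaultGraphWalker performs a depth-first walk of the AST and fires the registered processor on every node, which then reacts only to the token types it cares about. That pattern can be sketched without any Hive dependencies; the Node class below and its traversal are simplified stand-ins, not Hive's classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TreeWalkSketch {
    // A stand-in for Hive's ASTNode: a token plus children.
    static class Node {
        String token;
        List<Node> children = new ArrayList<>();
        Node(String token) { this.token = token; }
        Node add(Node c) { children.add(c); return this; }
    }

    // DFS walk; the "processor" here just collects tokens of interest,
    // the way HiveLineageInfo collects TOK_TAB / TOK_TABREF names.
    static List<String> collect(Node root, Set<String> interesting) {
        List<String> hits = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node n = stack.pop();
            if (interesting.contains(n.token)) hits.add(n.token);
            // push children in reverse so they are visited left-to-right
            for (int i = n.children.size() - 1; i >= 0; i--) {
                stack.push(n.children.get(i));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Node root = new Node("TOK_QUERY")
            .add(new Node("TOK_FROM").add(new Node("TOK_TABREF")))
            .add(new Node("TOK_INSERT").add(new Node("TOK_TAB")));
        System.out.println(collect(root,
            new HashSet<>(Arrays.asList("TOK_TABREF", "TOK_TAB"))));
        // prints [TOK_TABREF, TOK_TAB]
    }
}
```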
