Apache Hive on Apache Tez

Tez is Apache's newest open-source computing framework with support for DAG jobs. It derives directly from the MapReduce framework; the core idea is to break the Map and Reduce operations down further: Map is split into Input, Processor, Sort, Merge and Output, while Reduce is split into Input, Shuffle, Sort, Merge, Processor and Output. These elemental operations can then be recombined freely into new operations, and once a controlling program assembles them, they form one large DAG job. In summary, Tez has the following characteristics:
(1) It is an Apache incubator-level open-source project (its source code had only just been released at the time of writing).
(2) It runs on top of YARN.
(3) It targets DAG (directed acyclic graph) applications (like Impala, Dremel and Drill, it can serve as a replacement execution layer for Hive, Pig and the like).
The third point deserves a brief explanation. Apache already has a top-level project, Oozie, for designing DAG workflows, but Oozie works at a fairly high level (the job level): it only provides a way to express dependencies between jobs of different types (MapReduce programs, Hive, Pig, and so on) and submits those jobs according to that dependency graph. Tez is different: it exposes a DAG programming interface at a much lower level, and users write their programs directly against this interface. This lower-level programming model yields higher efficiency.
A detailed introduction: http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/

Tez offers the following features:
(1) A rich dataflow (dataflow, not streaming!) programming interface;
(2) A highly extensible "Input-Processor-Output" runtime model;
(3) Simplified deployment (it makes full use of the YARN framework; Tez itself is just a client-side programming library, so no services have to be deployed in advance);
(4) Better performance than MapReduce;
(5) Better resource management (it runs directly on top of the YARN resource-management system);
(6) Dynamic generation of the physical dataflow.

Operation log:
http://zh.hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
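
The walkthrough below was run in the Hive CLI on the Hortonworks Sandbox, against the building and hvac tables loaded in that tutorial. The execution engine is a per-session setting, so the same queries can be run under MapReduce and under Tez simply by toggling it; a minimal sketch of the setup:

-- fall back to classic MapReduce for the current session
set hive.execution.engine=mr;
-- or run the current session on Tez (under Tez, EXPLAIN shows a DAG of
-- vertices and edges instead of a chain of MapReduce stages)
set hive.execution.engine=tez;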


[root@sandbox ~]# hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> set hive.execution.engine=mr;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr 
    > from building b join hvac h 
    > on b.buildingid = h.buildingid;
Query ID = root_20141030000808_c3d3772d-3aa0-4673-a28e-515043452d7a
Total jobs = 1
14/10/30 00:08:46 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/10/30 00:08:46 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
Execution log at: /tmp/root/root_20141030000808_c3d3772d-3aa0-4673-a28e-515043452d7a.log
2014-10-30 12:08:48     Starting to launch local task to process map join;      maximum memory = 260177920
2014-10-30 12:08:52     Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2014-10-30 12:08:52     Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (1044 bytes)
2014-10-30 12:08:52     End of local task; Time Taken: 3.833 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0012, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0012/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414630785017_0012
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2014-10-30 00:09:13,234 Stage-3 map = 0%,  reduce = 0%
2014-10-30 00:09:21,143 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.91 sec
MapReduce Total cumulative CPU time: 1 seconds 910 msec
Ended Job = job_1414630785017_0012
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 1.91 sec   HDFS Read: 240754 HDFS Write: 416262 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 910 msec
OK
6/1/13  0:00:01 66      58      13      20      4       Finland GG1919  17      M4
......
6/17/13 2:33:07 68      72      17      27      12      Finland FN39TG  26      M12
6/18/13 3:33:07 68      69      10      4       3       Brazil  JDNS77  28      M3
6/19/13 4:33:07 65      63      7       23      20      Argentina       ACMAX22 19      M20
6/20/13 5:33:07 66      66      9       21      3       Brazil  JDNS77  28      M3
Time taken: 51.333 seconds, Fetched: 8000 row(s)
hive> set hive.execution.engine=tez;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr 
    > from building b join hvac h 
    > on b.buildingid = h.buildingid;
Query ID = root_20141030001010_9b9e76a3-5da9-48b6-9bd6-302c6560aa79
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0013)

Map 1: -/-      Map 2: -/-
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 1/1      Map 2: 0/1
Map 1: 1/1      Map 2: 1/1
Status: Finished successfully
OK
6/1/13  0:00:01 66      58      13      20      4       Finland GG1919  17      M4
......
6/17/13 2:33:07 68      72      17      27      12      Finland FN39TG  26      M12
6/18/13 3:33:07 68      69      10      4       3       Brazil  JDNS77  28      M3
6/19/13 4:33:07 65      63      7       23      20      Argentina       ACMAX22 19      M20
6/20/13 5:33:07 66      66      9       21      3       Brazil  JDNS77  28      M3
Time taken: 52.979 seconds, Fetched: 8000 row(s)

hive> select a.buildingid, b.buildingmgr, max(a.targettemp-a.actualtemp)
    > from hvac a join building b
    > on a.buildingid = b.buildingid
    > group by a.buildingid, b.buildingmgr limit 10;
Query ID = root_20141030001616_e16c39f3-8cc6-4809-96f1-c76a55543a72
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (application id: application_1414630785017_0014)

Map 1: -/-      Map 2: -/-      Reducer 3: 0/1
Map 1: 0/1      Map 2: 0/1      Reducer 3: 0/1
Map 1: 0/1      Map 2: 0/1      Reducer 3: 0/1
Map 1: 0/1      Map 2: 0/1      Reducer 3: 0/1
Map 1: 0/1      Map 2: 0/1      Reducer 3: 0/1
Map 1: 1/1      Map 2: 0/1      Reducer 3: 0/1
Map 1: 1/1      Map 2: 1/1      Reducer 3: 0/1
Map 1: 1/1      Map 2: 1/1      Reducer 3: 1/1
Status: Finished successfully
OK
1       M1      14
2       M2      15
3       M3      15
4       M4      15
5       M5      15
6       M6      15
7       M7      14
8       M8      15
9       M9      14
10      M10     15
Time taken: 37.644 seconds, Fetched: 10 row(s)
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr 
    > from building b join hvac h 
    > on b.buildingid = h.buildingid limit 10;
Query ID = root_20141030001919_7ef87e96-9848-404b-9f7c-2e847efbda72
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0014)

Map 1: -/-      Map 2: -/-
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 0/1      Map 2: 0/1
Map 1: 1/1      Map 2: 0/1
Map 1: 1/1      Map 2: 1/1
Status: Finished successfully
OK
6/1/13  0:00:01 66      58      13      20      4       Finland GG1919  17      M4
6/2/13  1:00:01 69      68      3       20      17      Egypt   FN39TG  11      M17
6/3/13  2:00:01 70      73      17      20      18      Indonesia       JDNS77  25      M18
6/4/13  3:00:01 67      63      2       23      15      Israel  ACMAX22 19      M15
6/5/13  4:00:01 68      74      16      9       3       Brazil  JDNS77  28      M3
6/6/13  5:00:01 67      56      13      28      4       Finland GG1919  17      M4
6/7/13  6:00:01 70      58      12      24      2       France  FN39TG  27      M2
6/8/13  7:00:01 70      73      20      26      16      Turkey  AC1000  23      M16
6/9/13  8:00:01 66      69      16      9       9       Mexico  GG1919  11      M9
6/10/13 9:00:01 65      57      6       5       12      Finland FN39TG  26      M12
Time taken: 30.373 seconds, Fetched: 10 row(s)

hive> set hive.execution.engine=mr;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr 
    > from building b join hvac h 
    > on b.buildingid = h.buildingid limit 10;
Query ID = root_20141030002121_82b6c25d-3efe-4edb-97f0-6ee104a44793
Total jobs = 1
14/10/30 00:21:17 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/10/30 00:21:17 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
Execution log at: /tmp/root/root_20141030002121_82b6c25d-3efe-4edb-97f0-6ee104a44793.log
2014-10-30 12:21:20     Starting to launch local task to process map join;      maximum memory = 260177920
2014-10-30 12:21:26     Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
2014-10-30 12:21:26     Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (1044 bytes)
2014-10-30 12:21:26     End of local task; Time Taken: 6.094 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0015, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0015/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414630785017_0015
Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
2014-10-30 00:21:40,956 Stage-3 map = 0%,  reduce = 0%
2014-10-30 00:21:51,576 Stage-3 map = 100%,  reduce = 0%
Ended Job = job_1414630785017_0015
MapReduce Jobs Launched: 
Job 0:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
6/1/13  0:00:01 66      58      13      20      4       Finland GG1919  17      M4
6/2/13  1:00:01 69      68      3       20      17      Egypt   FN39TG  11      M17
6/3/13  2:00:01 70      73      17      20      18      Indonesia       JDNS77  25      M18
6/4/13  3:00:01 67      63      2       23      15      Israel  ACMAX22 19      M15
6/5/13  4:00:01 68      74      16      9       3       Brazil  JDNS77  28      M3
6/6/13  5:00:01 67      56      13      28      4       Finland GG1919  17      M4
6/7/13  6:00:01 70      58      12      24      2       France  FN39TG  27      M2
6/8/13  7:00:01 70      73      20      26      16      Turkey  AC1000  23      M16
6/9/13  8:00:01 66      69      16      9       9       Mexico  GG1919  11      M9
6/10/13 9:00:01 65      57      6       5       12      Finland FN39TG  26      M12
Time taken: 45.935 seconds, Fetched: 10 row(s)
hive> select a.buildingid, b.buildingmgr, max(a.targettemp-a.actualtemp)
    > from hvac a join building b
    > on a.buildingid = b.buildingid
    > group by a.buildingid, b.buildingmgr limit 10;
Query ID = root_20141030002323_041fbe30-9942-4a05-b307-425a8865fe30
Total jobs = 1
14/10/30 00:23:16 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10007/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
14/10/30 00:23:16 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10007/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
Execution log at: /tmp/root/root_20141030002323_041fbe30-9942-4a05-b307-425a8865fe30.log
2014-10-30 12:23:17     Starting to launch local task to process map join;      maximum memory = 260177920
2014-10-30 12:23:19     Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile51--.hashtable
2014-10-30 12:23:19     Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile51--.hashtable (714 bytes)
2014-10-30 12:23:19     End of local task; Time Taken: 1.254 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1414630785017_0016, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414630785017_0016
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2014-10-30 00:23:27,599 Stage-2 map = 0%,  reduce = 0%
2014-10-30 00:23:37,062 Stage-2 map = 100%,  reduce = 0%
2014-10-30 00:23:44,578 Stage-2 map = 100%,  reduce = 100%
Ended Job = job_1414630785017_0016
MapReduce Jobs Launched: 
Job 0:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1       M1      14
2       M2      15
3       M3      15
4       M4      15
5       M5      15
6       M6      15
7       M7      14
8       M8      15
9       M9      14
10      M10     15
Time taken: 32.685 seconds, Fetched: 10 row(s)

Query Vectorization

Vectorized query execution processes rows in batches of 1024 instead of one at a time, and it requires the data to be stored in the ORC format, so the first step converts the hvac table to ORC.

Step 1:

hive> create table hvac_orc stored as orc as select * from hvac;
Query ID = root_20141030002727_5b56b417-f2d7-405c-bb2b-48a74971fa14
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0017, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0017/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1414630785017_0017
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2014-10-30 00:27:56,573 Stage-1 map = 0%,  reduce = 0%
2014-10-30 00:28:08,218 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1414630785017_0017
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sandbox.hortonworks.com:8020/tmp/hive-root/hive_2014-10-30_00-27-32_394_2850138279224601231-1/-ext-10001
Moving data to: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/hvac_orc
Table default.hvac_orc stats: [numFiles=1, numRows=8000, totalSize=26593, rawDataSize=1768000]
MapReduce Jobs Launched: 
Job 0:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 37.248 seconds
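
The CTAS above takes ORC's default settings. Table properties such as the compression codec can also be specified at creation time; a hedged variant (the table name hvac_orc_snappy is purely illustrative, and ZLIB, not SNAPPY, is in fact the ORC default):

-- same conversion, but with an explicit ORC compression codec
create table hvac_orc_snappy
stored as orc
tblproperties ("orc.compress"="SNAPPY")
as select * from hvac;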

Step 2:

Run the following statement to enable Tez.

hive> set hive.execution.engine=tez;

Step 3:

hive> select date, count(buildingid) from hvac group by date;
Query ID = root_20141030003131_d3ff96e0-a7d7-42b2-ab4a-bcfa6507a5dd
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0019)

Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13  267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13  267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13  267
6/30/13 266
6/4/13  267
6/5/13  267
6/6/13  267
6/7/13  267
6/8/13  267
6/9/13  267
Time taken: 25.633 seconds, Fetched: 30 row(s)

Step 4:

hive> select date, count(buildingid) from hvac_orc group by date;
Query ID = root_20141030003131_fcf22ec7-ff7f-4a01-8316-ad07929929d8
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0019)

Map 1: -/-      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13  267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13  267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13  267
6/30/13 266
6/4/13  267
6/5/13  267
6/6/13  267
6/7/13  267
6/8/13  267
6/9/13  267
Time taken: 10.683 seconds, Fetched: 30 row(s)

Step 5:

Now let's make sure vectorization is enabled (querying the setting without a value prints its current state):

set hive.vectorized.execution.enabled;
hive> set hive.vectorized.execution.enabled;
hive.vectorized.execution.enabled=true
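
The output above shows that the flag is already true on this sandbox. If it were not, it can be enabled explicitly for the session:

set hive.vectorized.execution.enabled = true;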

and then run the same SQL query as in the previous step:

select date, count(buildingid) from hvac_orc group by date;

hive> select date, count(buildingid) from hvac_orc group by date;
Query ID = root_20141030003232_da026977-69c2-4dd8-8648-56c73434b4d0
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0019)

Map 1: -/-      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13  267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13  267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13  267
6/30/13 266
6/4/13  267
6/5/13  267
6/6/13  267
6/7/13  267
6/8/13  267
6/9/13  267
Time taken: 6.904 seconds, Fetched: 30 row(s)

Step 6:

Let's look at the 'explain' plan to confirm that a vectorized query plan is indeed being used (note the "Execution mode: vectorized" line under Map 1):

explain select date, count(buildingid) from hvac_orc group by date;
hive> explain select date, count(buildingid) from hvac_orc group by date;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20141030003333_eef5ffbf-8427-4d9b-b4be-4a8994926a77:8
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: hvac_orc
                  Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: date (type: string), buildingid (type: bigint)
                    outputColumnNames: date, buildingid
                    Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: count(buildingid)
                      keys: date (type: string)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: string)
                        Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
            Execution mode: vectorized
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                keys: KEY._col0 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: bigint)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.936 seconds, Fetched: 57 row(s)

Stats & Cost Based Optimization (CBO)

The Cost-Based Optimization (CBO) engine uses the statistics gathered for Hive tables to produce more efficient query plans.
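
Concretely, the walkthrough below gathers table-level and column-level statistics for the hvac table and then switches on the stats/CBO flags. Collected in one place, these are the same statements that appear in the log that follows:

analyze table hvac compute statistics;
analyze table hvac compute statistics for columns targettemp, actualtemp, buildingid;

set hive.compute.query.using.stats = true;
set hive.stats.fetch.column.stats = true;
set hive.stats.fetch.partition.stats = true;
set hive.cbo.enable = true;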

hive> desc hvac;
OK
date                    string                                      
time                    string                                      
targettemp              bigint                                      
actualtemp              bigint                                      
system                  bigint                                      
systemage               bigint                                      
buildingid              bigint                                      
Time taken: 1.951 seconds, Fetched: 7 row(s)
hive> explain select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20141030003838_da1cb621-bd61-4533-b131-fe88a004fb9b:9
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: hvac
                  Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: buildingid (type: bigint), targettemp (type: bigint), actualtemp (type: bigint)
                    outputColumnNames: buildingid, targettemp, actualtemp
                    Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator
                      aggregations: max((targettemp - actualtemp))
                      keys: buildingid (type: bigint)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
                        value expressions: _col1 (type: bigint)
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: max(VALUE._col0)
                keys: KEY._col0 (type: bigint)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 0.374 seconds, Fetched: 56 row(s)
hive> analyze table hvac compute STATISTICS;
Query ID = root_20141030003939_a34fc478-076b-4563-9f6a-f1c4c8703acf
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.


Status: Running (application id: application_1414630785017_0020)

Map 1: -/-
Map 1: -/-
Map 1: -/-
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
Table default.hvac stats: [numFiles=1, numRows=8000, totalSize=240531, rawDataSize=232532]
OK
Time taken: 89.063 seconds
hive> ANALYZE TABLE hvac COMPUTE STATISTICS FOR COLUMNS targettemp,actualtemp,buildingid;
Query ID = root_20141030004343_cff30a2c-aa1c-48ff-af7c-ba92a4718693
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0020)

Map 1: -/-      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK
Time taken: 26.089 seconds
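
To double-check that the statistics landed in the metastore, the table-level numbers can be inspected; a small sketch (the values should match the "Table default.hvac stats" line printed above):

-- numFiles, numRows, totalSize and rawDataSize appear under Table Parameters
describe formatted hvac;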
hive> hive.compute.query.using.stats = true;
NoViableAltException(26@[])
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:999)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:408)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
        at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:976)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1041)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
        at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
        at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
        at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:0 cannot recognize input near 'hive' '.' 'compute'
(The "set" keyword was omitted above, hence the parse error; the corrected statements follow.)
hive> set hive.compute.query.using.stats = true;
hive> set hive.stats.fetch.column.stats = true;
hive> set hive.stats.fetch.partition.stats = true;
hive> set hive.cbo.enable = true;
hive> explain select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
      DagName: root_20141030004545_b1fff813-64cb-4254-a526-9d492c1f0cfd:12
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: hvac
                  Statistics: Num rows: 8000 Data size: 232532 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: buildingid (type: bigint), targettemp (type: bigint), actualtemp (type: bigint)
                    outputColumnNames: buildingid, targettemp, actualtemp
                    Statistics: Num rows: 8000 Data size: 232532 Basic stats: COMPLETE Column stats: COMPLETE
                    Group By Operator
                      aggregations: max((targettemp - actualtemp))
                      keys: buildingid (type: bigint)
                      mode: hash
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 8000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 8000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col1 (type: bigint)
        Reducer 2 
            Reduce Operator Tree:
              Group By Operator
                aggregations: max(VALUE._col0)
                keys: KEY._col0 (type: bigint)
                mode: mergepartial
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 4000 Data size: 32000 Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col0 (type: bigint), _col1 (type: bigint)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 4000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 4000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

Time taken: 3.874 seconds, Fetched: 56 row(s)
hive> select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
Query ID = root_20141030004545_c7a5f1a2-4f62-4a72-b36e-7e3d7f9c0673
Total jobs = 1
Launching Job 1 out of 1


Status: Running (application id: application_1414630785017_0020)

Map 1: -/-      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 0/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 0/1
Map 1: 1/1      Reducer 2: 1/1
Status: Finished successfully
OK
1       14
2       15
3       15
4       15
5       15
6       15
7       14
8       15
9       14
10      15
11      14
12      15
13      15
14      15
15      15
16      15
17      15
18      15
19      15
20      15
Time taken: 56.247 seconds, Fetched: 20 row(s)

