Tez是Apache最新开源的支持DAG作业的计算框架,它直接源于MapReduce框架,核心思想是将Map和Reduce两个操作进一步拆分,即Map被拆分成Input、Processor、Sort、Merge和Output, Reduce被拆分成Input、Shuffle、Sort、Merge、Processor和Output等,这样,这些分解后的元操作可以任意灵活组合,产生新的操作,这些操作经过一些控制程序组装后,可形成一个大的DAG作业。总结起来,Tez有以下特点:
(1)Apache二级开源项目(源代码今天发布的)
(2)运行在YARN之上
(3) 适用于DAG(有向图)应用(同Impala、Dremel和Drill一样,可用于替换Hive/Pig等)
其中,第三点需要做一些简单的说明,Apache当前有顶级项目Oozie用于DAG作业设计,但Oozie是比较高层(作业层面)的,它只是提供了一种多类型作业(比如MR程序、Hive、Pig等)依赖关系表达方式,并按照这种依赖关系提交这些作业,而Tez则不同,它在更底层提供了DAG编程接口,用户编写程序时直接采用这些接口进行程序设计,这种更底层的编程方式会带来更高的效率.
详细介绍:http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/
Tez有以下几个特色:
(1) 丰富的数据流(dataflow,NOT Streaming!)编程接口;
(2) 扩展性良好的“Input-Processor-Output”运行模型;
(3) 简化数据部署(充分利用了YARN框架,Tez本身仅是一个客户端编程库,无需事先部署相关服务)
(4) 性能优于MapReduce
(5) 优化的资源管理(直接运行在资源管理系统YARN之上)
(1)Apache二级开源项目(源代码今天发布的)
(2)运行在YARN之上
(3) 适用于DAG(有向图)应用(同Impala、Dremel和Drill一样,可用于替换Hive/Pig等)
其中,第三点需要做一些简单的说明,Apache当前有顶级项目Oozie用于DAG作业设计,但Oozie是比较高层(作业层面)的,它只是提供了一种多类型作业(比如MR程序、Hive、Pig等)依赖关系表达方式,并按照这种依赖关系提交这些作业,而Tez则不同,它在更底层提供了DAG编程接口,用户编写程序时直接采用这些接口进行程序设计,这种更底层的编程方式会带来更高的效率.
详细介绍:http://dongxicheng.org/mapreduce-nextgen/apache-tez-newest-progress/
Tez有以下几个特色:
(1) 丰富的数据流(dataflow,NOT Streaming!)编程接口;
(2) 扩展性良好的“Input-Processor-Output”运行模型;
(3) 简化数据部署(充分利用了YARN框架,Tez本身仅是一个客户端编程库,无需事先部署相关服务)
(4) 性能优于MapReduce
(5) 优化的资源管理(直接运行在资源管理系统YARN之上)
(6) 动态生成物理数据流(dataflow)
操作日志:
http://zh.hortonworks.com/hadoop-tutorial/supercharging-interactive-queries-hive-tez/
[root@sandbox ~]# hive
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
hive> set hive.execution.engine=mr;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
> from building b join hvac h
> on b.buildingid = h.buildingid;
Query ID = root_20141030000808_c3d3772d-3aa0-4673-a28e-515043452d7a
Total jobs = 1
14/10/30 00:08:46 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/10/30 00:08:46 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/root/root_20141030000808_c3d3772d-3aa0-4673-a28e-515043452d7a.log
2014-10-30 12:08:48 Starting to launch local task to process map join; maximum memory = 260177920
2014-10-30 12:08:52 Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable
2014-10-30 12:08:52 Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-08-32_847_2342871984031221118-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile00--.hashtable (1044 bytes)
2014-10-30 12:08:52 End of local task; Time Taken: 3.833 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0012, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0012/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414630785017_0012
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2014-10-30 00:09:13,234 Stage-3 map = 0%, reduce = 0%
2014-10-30 00:09:21,143 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.91 sec
MapReduce Total cumulative CPU time: 1 seconds 910 msec
Ended Job = job_1414630785017_0012
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.91 sec HDFS Read: 240754 HDFS Write: 416262 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 910 msec
OK
6/1/13 0:00:01 66 58 13 20 4 Finland GG1919 17 M4
......
6/17/13 2:33:07 68 72 17 27 12 Finland FN39TG 26 M12
6/18/13 3:33:07 68 69 10 4 3 Brazil JDNS77 28 M3
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 51.333 seconds, Fetched: 8000 row(s)
hive> set hive.execution.engine=tez;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
> from building b join hvac h
> on b.buildingid = h.buildingid;
Query ID = root_20141030001010_9b9e76a3-5da9-48b6-9bd6-302c6560aa79
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0013)
Map 1: -/- Map 2: -/-
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 1/1 Map 2: 0/1
Map 1: 1/1 Map 2: 1/1
Status: Finished successfully
OK
6/1/13 0:00:01 66 58 13 20 4 Finland GG1919 17 M4
......
6/17/13 2:33:07 68 72 17 27 12 Finland FN39TG 26 M12
6/18/13 3:33:07 68 69 10 4 3 Brazil JDNS77 28 M3
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 52.979 seconds, Fetched: 8000 row(s)
hive> select a.buildingid, b.buildingmgr, max(a.targettemp-a.actualtemp)
> from hvac a join building b
> on a.buildingid = b.buildingid
> group by a.buildingid, b.buildingmgr limit 10;
Query ID = root_20141030001616_e16c39f3-8cc6-4809-96f1-c76a55543a72
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (application id: application_1414630785017_0014)
Map 1: -/- Map 2: -/- Reducer 3: 0/1
Map 1: 0/1 Map 2: 0/1 Reducer 3: 0/1
Map 1: 0/1 Map 2: 0/1 Reducer 3: 0/1
Map 1: 0/1 Map 2: 0/1 Reducer 3: 0/1
Map 1: 0/1 Map 2: 0/1 Reducer 3: 0/1
Map 1: 1/1 Map 2: 0/1 Reducer 3: 0/1
Map 1: 1/1 Map 2: 1/1 Reducer 3: 0/1
Map 1: 1/1 Map 2: 1/1 Reducer 3: 1/1
Status: Finished successfully
OK
1 M1 14
2 M2 15
3 M3 15
4 M4 15
5 M5 15
6 M6 15
7 M7 14
8 M8 15
9 M9 14
10 M10 15
Time taken: 37.644 seconds, Fetched: 10 row(s)
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
> from building b join hvac h
> on b.buildingid = h.buildingid limit 10;
Query ID = root_20141030001919_7ef87e96-9848-404b-9f7c-2e847efbda72
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0014)
Map 1: -/- Map 2: -/-
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 0/1 Map 2: 0/1
Map 1: 1/1 Map 2: 0/1
Map 1: 1/1 Map 2: 1/1
Status: Finished successfully
OK
6/1/13 0:00:01 66 58 13 20 4 Finland GG1919 17 M4
6/2/13 1:00:01 69 68 3 20 17 Egypt FN39TG 11 M17
6/3/13 2:00:01 70 73 17 20 18 Indonesia JDNS77 25 M18
6/4/13 3:00:01 67 63 2 23 15 Israel ACMAX22 19 M15
6/5/13 4:00:01 68 74 16 9 3 Brazil JDNS77 28 M3
6/6/13 5:00:01 67 56 13 28 4 Finland GG1919 17 M4
6/7/13 6:00:01 70 58 12 24 2 France FN39TG 27 M2
6/8/13 7:00:01 70 73 20 26 16 Turkey AC1000 23 M16
6/9/13 8:00:01 66 69 16 9 9 Mexico GG1919 11 M9
6/10/13 9:00:01 65 57 6 5 12 Finland FN39TG 26 M12
Time taken: 30.373 seconds, Fetched: 10 row(s)
hive> set hive.execution.engine=mr;
hive> select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
> from building b join hvac h
> on b.buildingid = h.buildingid limit 10;
Query ID = root_20141030002121_82b6c25d-3efe-4edb-97f0-6ee104a44793
Total jobs = 1
14/10/30 00:21:17 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/10/30 00:21:17 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10006/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/root/root_20141030002121_82b6c25d-3efe-4edb-97f0-6ee104a44793.log
2014-10-30 12:21:20 Starting to launch local task to process map join; maximum memory = 260177920
2014-10-30 12:21:26 Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable
2014-10-30 12:21:26 Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-21-05_917_4670481632857759726-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile40--.hashtable (1044 bytes)
2014-10-30 12:21:26 End of local task; Time Taken: 6.094 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0015, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0015/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414630785017_0015
Hadoop job information for Stage-3: number of mappers: 0; number of reducers: 0
2014-10-30 00:21:40,956 Stage-3 map = 0%, reduce = 0%
2014-10-30 00:21:51,576 Stage-3 map = 100%, reduce = 0%
Ended Job = job_1414630785017_0015
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
6/1/13 0:00:01 66 58 13 20 4 Finland GG1919 17 M4
6/2/13 1:00:01 69 68 3 20 17 Egypt FN39TG 11 M17
6/3/13 2:00:01 70 73 17 20 18 Indonesia JDNS77 25 M18
6/4/13 3:00:01 67 63 2 23 15 Israel ACMAX22 19 M15
6/5/13 4:00:01 68 74 16 9 3 Brazil JDNS77 28 M3
6/6/13 5:00:01 67 56 13 28 4 Finland GG1919 17 M4
6/7/13 6:00:01 70 58 12 24 2 France FN39TG 27 M2
6/8/13 7:00:01 70 73 20 26 16 Turkey AC1000 23 M16
6/9/13 8:00:01 66 69 16 9 9 Mexico GG1919 11 M9
6/10/13 9:00:01 65 57 6 5 12 Finland FN39TG 26 M12
Time taken: 45.935 seconds, Fetched: 10 row(s)
hive> select a.buildingid, b.buildingmgr, max(a.targettemp-a.actualtemp)
> from hvac a join building b
> on a.buildingid = b.buildingid
> group by a.buildingid, b.buildingmgr limit 10;
Query ID = root_20141030002323_041fbe30-9942-4a05-b307-425a8865fe30
Total jobs = 1
14/10/30 00:23:16 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10007/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
14/10/30 00:23:16 WARN conf.Configuration: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10007/jobconf.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Execution log at: /tmp/root/root_20141030002323_041fbe30-9942-4a05-b307-425a8865fe30.log
2014-10-30 12:23:17 Starting to launch local task to process map join; maximum memory = 260177920
2014-10-30 12:23:19 Dump the side-table into file: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile51--.hashtable
2014-10-30 12:23:19 Uploaded 1 File to: file:/tmp/root/hive_2014-10-30_00-23-12_351_424724314401221979-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile51--.hashtable (714 bytes)
2014-10-30 12:23:19 End of local task; Time Taken: 1.254 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1414630785017_0016, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414630785017_0016
Hadoop job information for Stage-2: number of mappers: 0; number of reducers: 0
2014-10-30 00:23:27,599 Stage-2 map = 0%, reduce = 0%
2014-10-30 00:23:37,062 Stage-2 map = 100%, reduce = 0%
2014-10-30 00:23:44,578 Stage-2 map = 100%, reduce = 100%
Ended Job = job_1414630785017_0016
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 M1 14
2 M2 15
3 M3 15
4 M4 15
5 M5 15
6 M6 15
7 M7 14
8 M8 15
9 M9 14
10 M10 15
Time taken: 32.685 seconds, Fetched: 10 row(s)
Query Vectorization
Step 1:
hive> create table hvac_orc stored as orc as select * from hvac;
Query ID = root_20141030002727_5b56b417-f2d7-405c-bb2b-48a74971fa14
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1414630785017_0017, Tracking URL = http://sandbox.hortonworks.com:8088/proxy/application_1414630785017_0017/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1414630785017_0017
Hadoop job information for Stage-1: number of mappers: 0; number of reducers: 0
2014-10-30 00:27:56,573 Stage-1 map = 0%, reduce = 0%
2014-10-30 00:28:08,218 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1414630785017_0017
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://sandbox.hortonworks.com:8020/tmp/hive-root/hive_2014-10-30_00-27-32_394_2850138279224601231-1/-ext-10001
Moving data to: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/hvac_orc
Table default.hvac_orc stats: [numFiles=1, numRows=8000, totalSize=26593, rawDataSize=1768000]
MapReduce Jobs Launched:
Job 0: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 37.248 seconds
Step 2:
Run the following statement to enable Tez.
hive> set hive.execution.engine=tez;
Step 3:
hive> select date, count(buildingid) from hvac group by date;
Query ID = root_20141030003131_d3ff96e0-a7d7-42b2-ab4a-bcfa6507a5dd
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0019)
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13 267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13 267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13 267
6/30/13 266
6/4/13 267
6/5/13 267
6/6/13 267
6/7/13 267
6/8/13 267
6/9/13 267
Time taken: 25.633 seconds, Fetched: 30 row(s)
Step 4:
hive> select date, count(buildingid) from hvac_orc group by date;
Query ID = root_20141030003131_fcf22ec7-ff7f-4a01-8316-ad07929929d8
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0019)
Map 1: -/- Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13 267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13 267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13 267
6/30/13 266
6/4/13 267
6/5/13 267
6/6/13 267
6/7/13 267
6/8/13 267
6/9/13 267
Time taken: 10.683 seconds, Fetched: 30 row(s)
Step 5:
Now let’s run the following steps to enable vectorization:
set hive.vectorized.execution.enabled;
hive> set hive.vectorized.execution.enabled;
hive.vectorized.execution.enabled=true
and then run the sql query from previous step
select date, count(buildingid) from hvac_orc group by date;
hive> select date, count(buildingid) from hvac_orc group by date;
Query ID = root_20141030003232_da026977-69c2-4dd8-8648-56c73434b4d0
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0019)
Map 1: -/- Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 1/1
Status: Finished successfully
OK
6/1/13 267
6/10/13 267
6/11/13 267
6/12/13 267
6/13/13 267
6/14/13 267
6/15/13 267
6/16/13 267
6/17/13 267
6/18/13 267
6/19/13 267
6/2/13 267
6/20/13 267
6/21/13 266
6/22/13 266
6/23/13 266
6/24/13 266
6/25/13 266
6/26/13 266
6/27/13 266
6/28/13 266
6/29/13 266
6/3/13 267
6/30/13 266
6/4/13 267
6/5/13 267
6/6/13 267
6/7/13 267
6/8/13 267
6/9/13 267
Time taken: 6.904 seconds, Fetched: 30 row(s)
Step 6:
Let’s look at the ‘explain’ plan to confirm that it is indeed using a vectorized query plan:
explain select date, count(buildingid) from hvac_orc group by date;
hive> explain select date, count(buildingid) from hvac_orc group by date;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: root_20141030003333_eef5ffbf-8427-4d9b-b4be-4a8994926a77:8
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: hvac_orc
Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: date (type: string), buildingid (type: bigint)
outputColumnNames: date, buildingid
Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(buildingid)
keys: date (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 8000 Data size: 1768000 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Execution mode: vectorized
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 4000 Data size: 884000 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 0.936 seconds, Fetched: 57 row(s)
Stats & Cost Based Optimization (CBO)
Cost Based Optimization(CBO) engine uses statistics within Hive tables to produce optimal query plans.
hive> desc hvac;
OK
date string
time string
targettemp bigint
actualtemp bigint
system bigint
systemage bigint
buildingid bigint
Time taken: 1.951 seconds, Fetched: 7 row(s)
hive> explain select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: root_20141030003838_da1cb621-bd61-4533-b131-fe88a004fb9b:9
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: hvac
Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: buildingid (type: bigint), targettemp (type: bigint), actualtemp (type: bigint)
outputColumnNames: buildingid, targettemp, actualtemp
Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: max((targettemp - actualtemp))
keys: buildingid (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 10022 Data size: 240531 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: max(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 5011 Data size: 120265 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 0.374 seconds, Fetched: 56 row(s)
hive> analyze table hvac compute STATISTICS;
Query ID = root_20141030003939_a34fc478-076b-4563-9f6a-f1c4c8703acf
Total jobs = 1
Launching Job 1 out of 1
Tez session was closed. Reopening...
Session re-established.
Status: Running (application id: application_1414630785017_0020)
Map 1: -/-
Map 1: -/-
Map 1: -/-
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 0/1
Map 1: 1/1
Status: Finished successfully
Table default.hvac stats: [numFiles=1, numRows=8000, totalSize=240531, rawDataSize=232532]
OK
Time taken: 89.063 seconds
hive> ANALYZE TABLE hvac COMPUTE STATISTICS FOR COLUMNS targettemp,actualtemp,buildingid;
Query ID = root_20141030004343_cff30a2c-aa1c-48ff-af7c-ba92a4718693
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0020)
Map 1: -/- Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 1/1
Status: Finished successfully
OK
Time taken: 26.089 seconds
hive> hive.compute.query.using.stats = true;
NoViableAltException(26@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:999)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:408)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:976)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1041)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:912)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:902)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:793)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
FAILED: ParseException line 1:0 cannot recognize input near 'hive' '.' 'compute'
hive> set hive.compute.query.using.stats = true;
hive> set hive.stats.fetch.column.stats = true;
hive> set hive.stats.fetch.partition.stats = true;
hive> set hive.cbo.enable = true;
hive> explain select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Reducer 2 <- Map 1 (SIMPLE_EDGE)
DagName: root_20141030004545_b1fff813-64cb-4254-a526-9d492c1f0cfd:12
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: hvac
Statistics: Num rows: 8000 Data size: 232532 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: buildingid (type: bigint), targettemp (type: bigint), actualtemp (type: bigint)
outputColumnNames: buildingid, targettemp, actualtemp
Statistics: Num rows: 8000 Data size: 232532 Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
aggregations: max((targettemp - actualtemp))
keys: buildingid (type: bigint)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 8000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: bigint)
sort order: +
Map-reduce partition columns: _col0 (type: bigint)
Statistics: Num rows: 8000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: bigint)
Reducer 2
Reduce Operator Tree:
Group By Operator
aggregations: max(VALUE._col0)
keys: KEY._col0 (type: bigint)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 4000 Data size: 32000 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: bigint), _col1 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 4000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 4000 Data size: 64000 Basic stats: COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 3.874 seconds, Fetched: 56 row(s)
hive> select buildingid, max(targettemp-actualtemp) from hvac group by buildingid;
Query ID = root_20141030004545_c7a5f1a2-4f62-4a72-b36e-7e3d7f9c0673
Total jobs = 1
Launching Job 1 out of 1
Status: Running (application id: application_1414630785017_0020)
Map 1: -/- Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 0/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 0/1
Map 1: 1/1 Reducer 2: 1/1
Status: Finished successfully
OK
1 14
2 15
3 15
4 15
5 15
6 15
7 14
8 15
9 14
10 15
11 14
12 15
13 15
14 15
15 15
16 15
17 15
18 15
19 15
20 15
Time taken: 56.247 seconds, Fetched: 20 row(s)