Fetch抓取是指hive在某些情况的查询可以不必使用MapReduce计算,例如在执行一个简单的select * from XX
时,我们只需要简单的进行抓取对应目录下的数据即可。
在hive-default.xml.template
中,hive.fetch.task.conversion
默认是morn,老版本中默认是minimal。
该属性为morn时,在全局查找,字段查找,limit查找等都不走MapReduce。
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>
我们来做个比较:
hive (default)> set hive.fetch.task.conversion=none;
hive (default)> select * from emp ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710194508_8561a541-492c-48f4-a381-95f041e276c8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1594379941865_0001, Tracking URL = http://master:18088/proxy/application_1594379941865_0001/
Kill Command = /usr/hdk/hadoop/bin/hadoop job -kill job_1594379941865_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-07-10 19:45:29,124 Stage-1 map = 0%, reduce = 0%
2020-07-10 19:45:44,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.05 sec
MapReduce Total cumulative CPU time: 3 seconds 50 msec
Ended Job = job_1594379941865_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 3.05 sec HDFS Read: 5489 HDFS Write: 916 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 50 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 37.739 seconds, Fetched: 14 row(s)
这是使用mr去查询的运行结果,运行了37s。
hive (default)> set hive.fetch.task.conversion=more;
hive (default)> select * from emp ;
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
Time taken: 0.289 seconds, Fetched: 14 row(s)
这样只需要0.2s。
本地模式
我们在处理小数据集的时候可以使用本地模式来处理所有任务,这样可以明显的缩短执行时间。
本地模式要求数据量小于128MB(一个标准块大小),同时文件数小于4个(可调整)。
hive (default)> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive (default)> set hive.exec.mode.local.auto=true;
hive (default)> set hive.exec.mode.local.auto.inputbytes.max;
hive.exec.mode.local.auto.inputbytes.max=134217728
hive (default)> set hive.exec.mode.local.auto.input.files.max;
hive.exec.mode.local.auto.input.files.max=4
hive (default)> select * from emp cluster by deptno;
Automatically selecting local only mode for query
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710210745_10f9202c-678f-4ddb-aa2b-555551d11e1d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2020-07-10 21:07:46,989 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local897334446_0004
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 10848 HDFS Write: 6412 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 1.559 seconds, Fetched: 14 row(s)
hive (default)> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive (default)> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more
hive (default)> select * from emp cluster by deptno;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20200710210943_088e87fe-968d-4c11-868f-267a89bb5fef
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1594379941865_0003, Tracking URL = http://master:18088/proxy/application_1594379941865_0003/
Kill Command = /usr/hdk/hadoop/bin/hadoop job -kill job_1594379941865_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-10 21:09:53,706 Stage-1 map = 0%, reduce = 0%
2020-07-10 21:10:03,663 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.74 sec
2020-07-10 21:10:13,125 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 5.11 sec
MapReduce Total cumulative CPU time: 5 seconds 110 msec
Ended Job = job_1594379941865_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.11 sec HDFS Read: 10306 HDFS Write: 916 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 110 msec
OK
emp.empno emp.ename emp.job emp.mgr emp.hiredate emp.sal emp.comm emp.deptno
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7876 ADAMS CLERK 7788 1987-5-23 1100.0 NULL 20
7788 SCOTT ANALYST 7566 1987-4-19 3000.0 NULL 20
7369 SMITH CLERK 7902 1980-12-17 800.0 NULL 20
7566 JONES MANAGER 7839 1981-4-2 2975.0 NULL 20
7902 FORD ANALYST 7566 1981-12-3 3000.0 NULL 20
7844 TURNER SALESMAN 7698 1981-9-8 1500.0 0.0 30
7499 ALLEN SALESMAN 7698 1981-2-20 1600.0 300.0 30
7698 BLAKE MANAGER 7839 1981-5-1 2850.0 NULL 30
7654 MARTIN SALESMAN 7698 1981-9-28 1250.0 1400.0 30
7521 WARD SALESMAN 7698 1981-2-22 1250.0 500.0 30
7900 JAMES CLERK 7698 1981-12-3 950.0 NULL 30
Time taken: 30.785 seconds, Fetched: 14 row(s)
如果不开启本地模式,就要30s。
本地模式是搭配着hive.exec.mode.local.auto.inputbytes.max
和hive.exec.mode.local.auto.input.files.max
使用的。