Hive Data Manipulation
Inserting (Loading) Data
Method 1: load data with LOAD
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1,partcol2=val2...)]
- LOAD DATA LOCAL
- loads a local file
- copies the file into the table directory
- LOAD DATA
- loads an HDFS file
- moves the file into the table directory
Demo 1:
## Load a local file into a Hive table
load data local inpath '/home/hadoop/data/t_load.txt' into table t_load;
## Load an HDFS file into a Hive table
load data inpath '/t_load01.txt' into table t_load;
## Load a local file, overwriting the data in the Hive table
load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load;
## Load local data into a partition
create table t_load_part(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part partition (time='2020-12-12');
Demo 2:
create table t_load_part_orc(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ','
stored as orc;
load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part_orc partition (time='2020-12-12');
The t_load_part_orc table created above is stored in ORC format, but the data being loaded is plain text, so querying the table throws an exception:
Failed with exception java.io.IOException:java.io.IOException:
Malformed ORC file hdfs://ns1/user/hive/warehouse/t_load_part_orc/time=2020-12-12/t_load.txt. Invalid postscript.
Notice that executing the load data command starts no MR job. load data does only two things:
1. Place the file directly into the table's directory.
2. Record the corresponding metadata in MySQL (the metastore).
No file-format conversion is involved, so the load cannot succeed. So how do we load data into a table when the table's file format differs from the data file's format? In that case we cannot use the load command; we use the insert command instead.
Method 2: insert data with INSERT
-- runs MR and requires an intermediate table
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Demo:
1. Create the raw data (TextFile)
1,zhangsan,18
2,lisi,19
3,wangwu,21
2. Create the raw data table
create table t_load_part_txt(
id int,
name string,
age int
)
row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part_txt;
3. Create an ORC-format table
create table t_load_part_orc(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ','
stored as orc;
4. Use insert to store the text data as ORC
insert into table t_load_part_orc partition (time='2020-12-12') select * from t_load_part_txt;
Inserting values into tables from SQL
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [,values_row ...]
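Small amounts of literal data can be inserted this way; a minimal sketch against the t_load_part_orc table defined above (the row values are made up, and INSERT ... VALUES requires Hive 0.14+):

```sql
-- Each VALUES insert runs as an MR job behind the scenes.
insert into table t_load_part_orc partition (time='2020-12-13')
values (4, 'zhaoliu', 22), (5, 'tianqi', 25);
```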
Method 3: use LOCATION to attach existing data
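The idea of this method is to declare a table over data that is already in place rather than copying it. A sketch, assuming /data/t_ext is an HDFS directory that already holds comma-delimited text files matching the schema (the table name and path are illustrative):

```sql
create external table t_ext(
  id int,
  name string,
  age int
)
row format delimited fields terminated by ','
location '/data/t_ext';

-- No load/insert needed: the files under /data/t_ext are read in place.
select * from t_ext;
```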
Exporting Data
Exporting data with insert
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format]
SELECT ... FROM ...
Export the table's data to the local filesystem as Parquet-format files:
insert overwrite local directory '/home/hadoop/data/load001/'
row format delimited fields terminated by '\t'
stored as parquet
select * from t_load;
Import/Export
Data imported with import must have been exported with export.
The EXPORT command exports a table's or partition's data, together with the metadata, to the specified output location. This output location can then be moved to another Hadoop or Hive instance, and the data imported from there with the IMPORT command.
Export Syntax
EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])] TO 'export_target_path';
export table t_load to "/export";
Import Syntax
IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]]
FROM 'source_path' [LOCATION 'import_target_path']
## t_import must be an empty table or not exist yet
import table t_import from "/export";
Querying Data
SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]
Group by
select age,count(1) as num from t_student group by age having num > 3;
## Note the following parameters:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
## Without a number, set shows the current value
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces;
+---------------------------+--+
| set |
+---------------------------+--+
| mapreduce.job.reduces=-1 |
+---------------------------+--+
## With a number, set assigns the value
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces=3;
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces;
+---------------------------+--+
| set |
+---------------------------+--+
| mapreduce.job.reduces=3 |
+---------------------------+--+
order by
Global sort; uses a single reducer.
insert overwrite local directory '/home/hadoop/data/orderby/'
row format delimited fields terminated by '\t'
select * from t_student order by age desc;
Limit
Applies globally.
insert overwrite local directory '/home/hadoop/data/orderby01/'
row format delimited fields terminated by '\t'
select * from t_student order by age desc limit 3;
Strict Mode
The mode in which Hive executes operations. Default: nonstrict.
set hive.mapred.mode = strict;
In strict mode, certain risky queries are not allowed to run. The disallowed operations are:
- Cartesian products
- Queries that pick up no partition (no partition filter)
- Comparing bigints and strings
- Comparing bigints and doubles
- ORDER BY without LIMIT
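A quick way to see the last restriction, using the t_student table from above (the exact error message varies by Hive version):

```sql
set hive.mapred.mode=strict;
-- Rejected in strict mode: ORDER BY without LIMIT
select * from t_student order by age desc;
-- Accepted: LIMIT bounds the work of the single reducer
select * from t_student order by age desc limit 10;
```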
Sort by
sort by sorts within each reducer (a per-partition sort) and is not restricted by strict mode.
insert overwrite local directory '/home/hadoop/data/orderby04/'
row format delimited fields terminated by '\t'
select * from t_student sort by age desc;
If limit is used after sort by, the limit is global: a new MR job is started to apply the global limit.
Distribute by
distribute by specifies the column used to distribute rows across reducers, and is used together with sort by.
insert overwrite local directory '/home/hadoop/data/orderby07/'
row format delimited fields terminated by '\t'
select * from t_stu distribute by id sort by id;
If the distribution column and the sort column are the same, and the sort is ascending, distribute by + sort by can be replaced with cluster by.
cluster by
insert overwrite local directory '/home/hadoop/data/orderby08/'
row format delimited fields terminated by '\t'
select * from t_stu cluster by id;
Hive Operators
Operator precedence
Example | Operators | Description |
---|---|---|
A[B] , A.identifier | bracket_op([]), dot(.) | element selector, dot |
-A | unary(+), unary(-), unary(~) | unary prefix operators |
A IS [NOT] (NULL|TRUE|FALSE) | IS NULL,IS NOT NULL, … | unary suffix |
A ^ B | bitwise xor(^) | bitwise xor |
A * B | star(*), divide(/), mod(%), div(DIV) | multiplicative operators |
A + B | plus(+), minus(-) | additive operators |
A \|\| B | string concatenate(\|\|) | string concatenate |
A & B | bitwise and(&) | bitwise and |
A \| B | bitwise or(\|) | bitwise or |
Relational operators
Operator | Operand types | Description |
---|---|---|
A = B | All primitive types | TRUE if expression A is equal to expression B otherwise FALSE. |
A == B | All primitive types | Synonym for the = operator. |
A <=> B | All primitive types | Returns same result with EQUAL(=) operator for non-null operands, but returns TRUE if both are NULL, FALSE if one of them is NULL. (As of version 0.9.0.) |
A <> B | All primitive types | NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B, otherwise FALSE. |
A != B | All primitive types | Synonym for the <> operator. |
A < B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than expression B, otherwise FALSE. |
A <= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than or equal to expression B, otherwise FALSE. |
A > B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than expression B, otherwise FALSE. |
A >= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than or equal to expression B, otherwise FALSE. |
A [NOT] BETWEEN B AND C | All primitive types | NULL if A, B or C is NULL, TRUE if A is greater than or equal to B AND A less than or equal to C, otherwise FALSE. This can be inverted by using the NOT keyword. (As of version 0.9.0.) |
A IS NULL | All types | TRUE if expression A evaluates to NULL, otherwise FALSE. |
A IS NOT NULL | All types | FALSE if expression A evaluates to NULL, otherwise TRUE. |
A IS [NOT] (TRUE|FALSE) | Boolean types | Evaluates to TRUE only if A meets the condition. (since: 3.0.0) Note: NULL is UNKNOWN, and because of that (UNKNOWN IS TRUE) and (UNKNOWN IS FALSE) both evaluate to FALSE. |
A [NOT] LIKE B | strings | NULL if A or B is NULL, TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions) while the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, ‘foobar’ like ‘foo’ evaluates to FALSE whereas ‘foobar’ like ‘foo_ _ _’ evaluates to TRUE and so does ‘foobar’ like ‘foo%’. |
A RLIKE B | strings | NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B, otherwise FALSE. For example, ‘foobar’ RLIKE ‘foo’ evaluates to TRUE and so does ‘foobar’ RLIKE ‘^f.*r$’. |
A REGEXP B | strings | Same as RLIKE. |
Arithmetic operators
+, -, *, /, %, &, |
Operator | Operand types | Description |
---|---|---|
A + B | All number types | Gives the result of adding A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. For example since every integer is a float, therefore float is a containing type of integer so the + operator on a float and an int will result in a float. |
A - B | All number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A * B | All number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
A / B | All number types | Gives the result of dividing A by B. The result is a double type in most cases. When A and B are both integers, the result is a double type except when the hive.compat configuration parameter is set to “0.13” or “latest” in which case the result is a decimal type. |
A DIV B | Integer types | Gives the integer part resulting from dividing A by B. E.g 17 div 3 results in 5. |
A % B | All number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A & B | All number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A \| B | All number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
A ^ B | All number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. |
~A | All number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
Logical operators
Operator | Operand types | Description |
---|---|---|
A AND B | boolean | TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL. |
A OR B | boolean | TRUE if either A or B or both are TRUE, FALSE OR NULL is NULL, otherwise FALSE. |
NOT A | boolean | TRUE if A is FALSE or NULL if A is NULL. Otherwise FALSE. |
! A | boolean | Same as NOT A. |
A IN (val1, val2, …) | boolean | TRUE if A is equal to any of the values. As of Hive 0.13 subqueries are supported in IN statements. |
A NOT IN (val1, val2, …) | boolean | TRUE if A is not equal to any of the values. As of Hive 0.13 subqueries are supported in NOT IN statements. |
[NOT] EXISTS (subquery) | boolean | TRUE if the subquery returns at least one row. Supported as of Hive 0.13. |
Complex type operators
Constructor Function | Operands | Description |
---|---|---|
map | (key1, value1, key2, value2, …) | Creates a map with the given key/value pairs. |
struct | (val1, val2, val3, …) | Creates a struct with the given field values. Struct field names will be col1, col2, … |
named_struct | (name1, val1, name2, val2, …) | Creates a struct with the given field names and values. (As of Hive 0.8.0.) |
array | (val1, val2, …) | Creates an array with the given elements. |
create_union | (tag, val1, val2, …) | Creates a union type with the value that is being pointed to by the tag parameter. |
The following operators provide mechanisms to access elements in complex types:
Operator | Operand types | Description |
---|---|---|
A[n] | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of [‘foo’, ‘bar’] then A[0] returns ‘foo’ and A[1] returns ‘bar’. |
M[key] | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map. For example, if M is a map comprising of {‘f’ -> ‘foo’, ‘b’ -> ‘bar’, ‘all’ -> ‘foobar’} then M[‘all’] returns ‘foobar’. |
S.x | S is a struct | Returns the x field of S. For example for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
Built-in UDFs
- Show Hive's functions:
show functions;
- Show a function's description:
desc function <function_name>;
- Show a function's extended description (prints usage examples when available):
desc function extended <function_name>;
UDF (User-Defined Function)
Characteristic: one-to-one. One input row produces one output row.
Mathematical functions
- round: round half up
- floor: round down to the nearest integer
- ceil: round up to the nearest integer
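A quick sketch of the three on literal values:

```sql
select round(3.5);  -- 4
select floor(3.9);  -- 3
select ceil(3.1);   -- 4
```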
Collection functions
- size(Map<K.V>)
- size(Array)
- map_keys(Map<K.V>)
- map_values(Map<K.V>)
- array_contains(Array, value)
- sort_array(Array)
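A sketch of these functions on inline literals:

```sql
select size(array(1, 2, 3));              -- 3
select map_keys(map('a', 1, 'b', 2));     -- ["a","b"]
select array_contains(array(1, 2, 3), 2); -- true
select sort_array(array(3, 1, 2));        -- [1,2,3]
```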
Type conversion functions
- cast(expr as <type>): for example, cast('1' as BIGINT)
select cast('1' as int) + 1;
Date functions
- current_date
- current_timestamp
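A few related date built-ins, sketched on literals (datediff and date_add work in days):

```sql
select current_date;
select datediff('2020-12-12', '2020-12-10'); -- 2
select date_add('2020-12-12', 5);            -- 2020-12-17
```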
Conditional functions
- if(boolean testCondition, T valueTrue, T valueFalseOrNull)
- CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Equivalent to Java's switch:
select id,name,age,
case sex
when 0 then '男生'
when 1 then '女生'
else '未知'
end as sex
from t_stu01;
- CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Equivalent to Java's if ... else if:
select id,name,age,
case
when sex=0 then '男生'
when sex=1 then '女生'
else '未知'
end as sex
from t_stu01;
String functions
- concat(string|binary A, string|binary B...): concatenates the strings
- concat_ws(string SEP, string A, string B...): concatenates the strings with the given separator
- length(string A)
- split(string str, string pat)
- substr(string|binary A, int start)
- substring(string|binary A, int start)
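A sketch of the string functions on literals (substr/substring use 1-based start positions):

```sql
select concat('foo', 'bar');          -- foobar
select concat_ws('-', 'a', 'b', 'c'); -- a-b-c
select length('hello');               -- 5
select split('a,b,c', ',');           -- ["a","b","c"]
select substr('foobar', 4);           -- bar
```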
Miscellaneous functions
- current_user()
- current_database()
Custom UDFs
Example: write a UDF that returns the quarter of the year for a given date.
Add the dependency:
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.2</version>
</dependency>
- Create a class that extends UDF, with one or more methods named evaluate.
package com.bigdata.hive;

import org.apache.hadoop.hive.ql.exec.UDF;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

/**
 * Requirement: a UDF that returns the quarter for a given date.
 * 1. Create a class that extends UDF.
 * 2. Implement one or more methods named evaluate.
 */
public class MyQuarter extends UDF {
    // Return the quarter for the given date.
    public int evaluate(String date) {
        int quarter = 1;
        SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd");
        try {
            Date dt = sf.parse(date);
            Calendar calendar = Calendar.getInstance();
            calendar.setTime(dt);
            int m = calendar.get(Calendar.MONTH) + 1;
            if (m <= 3) {
                quarter = 1;
            } else if (m <= 6) {
                quarter = 2;
            } else if (m <= 9) {
                quarter = 3;
            } else {
                quarter = 4;
            }
        } catch (ParseException e) {
            throw new RuntimeException("Expected date format: yyyy-MM-dd");
        }
        return quarter;
    }
}
- Build the jar and add it to Hive's classpath
hive (default)> add;
Usage: add [FILE|JAR|ARCHIVE] <value> [<value>]*
hive (default)> add jar /home/hadoop/jars/hive.jar;
Added [/home/hadoop/jars/hive.jar] to class path
Added resources:[/home/hadoop/jars/hive.jar]
- Register the function
create temporary function my_qua as 'com.bigdata.hive.MyQuarter';
- Use the function
hive (default)> select my_qua('2020-10-11');
OK
_c0
4
UDAF (User-Defined Aggregate Function)
Characteristic: many-to-one.
- collect_set: returns a set of objects with duplicate elements eliminated.
- collect_list: returns a list of objects with duplicates.
zhangsan,100001
wangwu,200001
zhangsan,100002
wangwu,200002
zhangsan,100003
wangwu,200003
zhangsan,100004
## Group each user's order ids together
select name,concat_ws(' ',collect_list(cast(oid as string))) as oids from t_order group by name;
Table-Generating Functions (UDTF)
Normal user-defined functions, such as concat(), take a single input row and produce a single output row. In contrast, table-generating functions transform a single input row into multiple output rows.
- explode(ARRAY a): expands the array into one row per element
- json_tuple(string jsonStr, string k1, ..., string kn)
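json_tuple extracts several keys from a JSON string in a single call. A sketch, assuming a hypothetical table t_json(line string) that holds one JSON document per row:

```sql
-- e.g. the line {"name":"zhangsan","age":"18"} yields: zhangsan  18
select json_tuple(line, 'name', 'age') as (name, age)
from t_json;
```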
Lateral View
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
- A lateral view is used together with a user-defined table-generating function such as explode().
- The lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.
hive (default)> select * from t_array;
t_array.name t_array.oids
wangwu [200001,200001,200001]
zhangsan [100001,100002,100003,100004]
select name,t.*
from t_array
LATERAL VIEW explode(oids) t as oid;
Result:
name t.oid
wangwu 200001
wangwu 200001
wangwu 200001
zhangsan 100001
zhangsan 100002
zhangsan 100003
zhangsan 100004
Window and Analytics Functions
over(): the windowing clause
- Open a window with PARTITION BY
- Window specifications. Windows can also be defined separately in a WINDOW clause. The following window-specification formats are supported:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED |[num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
OVER can be used with aggregate functions:
- COUNT
- SUM
- MIN
- MAX
- AVG
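For instance, an aggregate over an empty OVER() window attaches a table-level value to every row; a sketch against the t_student01 table used elsewhere in this section:

```sql
-- Each row keeps its own score plus the overall max and average.
select id, name, score,
  max(score) over() as max_score,
  avg(score) over() as avg_score
from t_student01;
```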
Analytics functions
- ROW_NUMBER
- RANK
- DENSE_RANK

Case 1: the per-partition Top-N template (here, the two highest-priced items in each order):
1018318373,华为P8,3876
1982734213,iPhoneX,8765
1298373232,小米,2765
2018273771,充电宝,128
2134875573,扫地机器人,2319
1018318373,手机壳,87
2018273771,U盘,128
1298373232,移动硬盘,675
1982734213,iPhone8,6745
1018318373,华为笔记本,6547
2134875573,智能音响,4541
1298373232,体感游戏机,786
2018273771,Oppo手机,4312

select oid,name,price
from (
  select *,
    row_number() over(partition by oid order by price desc) as num
  from t_goods
) t
where num < 3;
Result:
+-------------+----------+--------+--+
| oid         | name     | price  |
+-------------+----------+--------+--+
| 1018318373  | 华为笔记本 | 6547   |
| 1018318373  | 华为P8    | 3876   |
| 1298373232  | 小米      | 2765   |
| 1298373232  | 体感游戏机 | 786    |
| 1982734213  | iPhoneX   | 8765   |
| 1982734213  | iPhone8   | 6745   |
| 2018273771  | Oppo手机  | 4312   |
| 2018273771  | 充电宝    | 128    |
| 2134875573  | 智能音响   | 4541   |
| 2134875573  | 扫地机器人 | 2319   |
+-------------+----------+--------+--+
Case 2: compute the running total of sales through each day:
1,2020-12-10,1000
2,2020-12-11,2000
3,2020-12-12,5000
4,2020-12-13,3000
5,2020-12-14,4000
6,2020-12-15,6000

select id,time,money,
  sum(money) over(order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total
from t_sale;
Result:
+-----+-------------+--------+--------+--+
| id  | time        | money  | sum    |
+-----+-------------+--------+--------+--+
| 1   | 2020-12-10  | 1000   | 1000   |
| 2   | 2020-12-11  | 2000   | 3000   |
| 3   | 2020-12-12  | 5000   | 8000   |
| 4   | 2020-12-13  | 3000   | 11000  |
| 5   | 2020-12-14  | 4000   | 15000  |
| 6   | 2020-12-15  | 6000   | 21000  |
+-----+-------------+--------+--------+--+
Case 3: list the students holding the top two score ranks:

1,zhangsan,90
2,lisi,89
3,wangwu,90
4,zhaoliu,78
5,tianqi,60
6,wangba,55

select id,name,score,level
from (
  select *,
    rank() over(order by score desc) as level
  from t_student01
) t
where level <= 2;
Bucketing
Partitioning: splits data into directories.
Bucketing: splits data into files (used when the data volume is very large).
- Enable bucketing:

set hive.enforce.bucketing=true;

- Create a bucketed table:

create table t_bucket(
  id int,
  name string,
  score int
)clustered by(id) into 3 buckets
row format delimited fields terminated by ',';

- Insert the data with insert into:

insert into table t_bucket select * from t_student01;

- Three files are then generated under /user/hive/warehouse/t_bucket/ on HDFS.
Join
join_table:
table_reference [INNER] JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
| table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON expression
map join
Small table joined to a large table.
How it works: a MapredLocal task is started first to upload the small table's data into the distributed cache (as a hashtable data structure). When the local task finishes, the MapReduce job starts, and each MapTask joins the large-table rows it reads against the small-table data in the distributed cache; matched results are emitted directly. There is no reduce phase and therefore no shuffle, so performance is higher.
<property>
<name>hive.mapjoin.smalltable.filesize</name>
<value>25000000</value>
<description>
Threshold on the input file size of the small table; if the file size is smaller than this threshold, Hive will try to convert the common join into a map join
</description>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
<description>Whether Hive enables the optimization that converts common joins into map joins based on the input file size</description>
</property>
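When automatic conversion is off, a map join can still be requested explicitly with the MAPJOIN hint; a sketch using the t_emp/t_dept tables that appear in this section (t_dept is assumed to be the small table):

```sql
-- Hint: build a hashtable from t_dept and perform the join map-side.
select /*+ MAPJOIN(d) */ e.*
from t_emp e
join t_dept d on e.did = d.did;
```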
reduce join / common join / shuffle join
Large table joined to a large table.
LEFT SEMI JOIN
The restriction of left semi join is that the right-hand table can only be referenced in the join condition (the ON clause), never in WHERE or SELECT clauses.
select e.*
from t_emp e
left semi join t_dept d
on e.did = d.did;
A left semi join returns only the left-table rows that find a match; no columns from the right table are available.
Join optimization
Predicate Pushdown in Outer Joins
Predicate Pushdown (PPD)
Predicate: a boolean expression / filter condition
Pushdown rules:
1. A predicate on the preserved (main) table placed in the WHERE clause is pushed down.
2. A predicate on the null-supplying (secondary) table placed in the ON clause is pushed down.
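A sketch of the two rules on a left outer join over the t_emp/t_dept tables used in this section (the literal did = 1 is illustrative):

```sql
-- Rule 1: a WHERE predicate on the preserved table (t_emp) is pushed down.
select e.*
from t_emp e
left outer join t_dept d on e.did = d.did
where e.did = 1;

-- Rule 2: an ON predicate on the null-supplying table (t_dept) is pushed down.
select e.*
from t_emp e
left outer join t_dept d on e.did = d.did and d.did = 1;
```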
Dynamic Partitioning
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
In dynamic partition inserts, users can give a partial partitioning specification: only the list of partition column names needs to appear in the PARTITION clause, and the column values are optional. If a partition column value is given, we call it a static partition; otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement, which means the creation of dynamic partitions is determined by the values of those input columns. The dynamic partition columns must be specified last among the columns in the SELECT statement, in the same order in which they appear in the PARTITION() clause.
<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
<description>Whether to allow dynamic partitions in DML/DDL.</description>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>strict</value>
<description>
In strict mode, the user must specify at least one static partition, in case the user accidentally overwrites all partitions. In nonstrict mode, all partitions are allowed to be dynamic.
</description>
</property>
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
create table t1(
id int,
web string,
time string
)row format delimited fields terminated by ',';
load data local inpath '/home/hadoop/data/t1.txt' into table t1;
create table t_part(
id int,
web string,
time string
)partitioned by (access_time string)
row format delimited fields terminated by ',';
insert into table t_part partition (access_time) select id,web,time,time from t1;
Hive Optimization
- select queries

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Expects one of [none, minimal, more].
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and
    should not have any aggregations or distincts (which incurs RS), lateral
    views and joins.
    0. none    : disable hive.fetch.task.conversion
    1. minimal : SELECT, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
  </description>
</property>
- Enable local mode

<property>
  <name>hive.exec.mode.local.auto</name>
  <value>false</value>
  <description>Let Hive decide whether to run in local mode automatically</description>
</property>
<property>
  <name>hive.exec.mode.local.auto.inputbytes.max</name>
  <value>134217728</value>
  <description>For local mode, the input bytes should be less than this value.</description>
</property>
<property>
  <name>hive.exec.mode.local.auto.input.files.max</name>
  <value>4</value>
  <description>When hive.exec.mode.local.auto is true, the number of tasks should be less than this value for local mode</description>
</property>
- JVM reuse

MapReduce is process-based: every MapTask and ReduceTask is a JVM process, and starting and tearing down JVM processes takes time.

mapred-site.xml

hive (default)> set mapred.job.reuse.jvm.num.tasks;
mapred.job.reuse.jvm.num.tasks=1
hive (default)> set mapred.job.reuse.jvm.num.tasks=10;
- Speculative execution

<property>
  <name>hive.mapred.reduce.tasks.speculative.execution</name>
  <value>true</value>
</property>

If one MapTask runs long because the data is unevenly distributed (skewed), speculative execution cannot fix it and should not be enabled for that case.