Hive Data Operations

Inserting (Loading) Data

Method 1: Loading data with LOAD

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1,partcol2=val2...)]
  • LOAD DATA LOCAL
    • loads a local file
    • copies the file
  • LOAD DATA
    • loads an HDFS file
    • moves the file

Demo 1:

## Load a local file into a Hive table
load data local inpath '/home/hadoop/data/t_load.txt' into table t_load;
## Load an HDFS file into a Hive table
load data inpath '/t_load01.txt' into table t_load;
## Load a local file, overwriting the data in the Hive table
load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load;

## Load local data into a partition
create table t_load_part(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part partition (time='2020-12-12');

Demo 2:

create table t_load_part_orc(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ','
stored as orc;

load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part_orc partition (time='2020-12-12');

The t_load_part_orc table created above uses the ORC file format, but the loaded data is plain text, so querying the table throws an exception:

Failed with exception java.io.IOException:java.io.IOException:
Malformed ORC file hdfs://ns1/user/hive/warehouse/t_load_part_orc/time=2020-12-12/t_load.txt. Invalid postscript.

Notice that executing LOAD DATA does not launch a MapReduce job. LOAD DATA can be understood as doing two things:

1. It copies or moves the file directly into the table's directory.

2. It adds the corresponding metadata to the metastore (MySQL).

No file-format conversion takes place, which is why the load above fails. So how do we load data into a table when the table's file format differs from the format of the data file? LOAD cannot be used here; use INSERT instead.

Method 2: Inserting data with INSERT

-- Runs MapReduce and requires an intermediate (source) table

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

Demo:

1. Create the source data (TextFile)

1,zhangsan,18
2,lisi,19
3,wangwu,21

2. Create the source data table

create table t_load_part_txt(
id int,
name string,
age int
)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/data/t_load.txt' overwrite into table t_load_part_txt;

3. Create an ORC-format table

create table t_load_part_orc(
id int,
name string,
age int
)partitioned by (time string)
row format delimited fields terminated by ','
stored as orc;

4. Use INSERT to store the text data as ORC

insert into table t_load_part_orc partition (time='2020-12-12') select * from t_load_part_txt;

Inserting values into tables from SQL

INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [,values_row ...]
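
A minimal sketch of this syntax; it assumes t_load is a non-partitioned table with columns (id, name, age), as in the demos above, and the literal rows are only illustrative:

## Insert literal rows (supported since Hive 0.14); Hive runs a small job to write them into the table directory
insert into table t_load values (4,'zhaoliu',22),(5,'tianqi',25);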

Method 3: Associating existing data with LOCATION
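
A minimal sketch for this method, assuming comma-delimited data files already exist under a hypothetical HDFS directory /data/t_location/:

## The table's LOCATION points at the existing directory, so nothing needs to be loaded or copied
create external table t_location(
id int,
name string,
age int
)
row format delimited fields terminated by ','
location '/data/t_location/';

select * from t_location;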

Exporting Data

Exporting data with INSERT

INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] 
SELECT ... FROM ...

Export the data in a table to the local filesystem as Parquet files:

insert overwrite local directory '/home/hadoop/data/load001/' 
row format delimited fields terminated by '\t'
stored as parquet 
select * from t_load;

Import/Export

Data imported with IMPORT must have been exported with EXPORT.

The EXPORT command exports a table's or partition's data, together with its metadata, to the specified output location. The output can then be moved to another Hadoop or Hive instance and imported there with the IMPORT command.

Export Syntax

EXPORT TABLE tablename [PARTITION (part_column="value"[, ...])] TO 'export_target_path';
export table t_load to "/export";

Import Syntax

IMPORT [[EXTERNAL] TABLE new_or_original_tablename [PARTITION (part_column="value"[, ...])]] 
FROM 'source_path' [LOCATION 'import_target_path']
## t_import must be an empty table or must not exist
import table t_import from "/export";

Querying Data

SELECT [ALL | DISTINCT] select_expr, select_expr, ... FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT [offset,] rows]

Group by

select age,count(1) as num from t_student group by age having num > 3;
## Note the following parameters:
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
## Without a value, SET returns the current value
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces;
+---------------------------+--+
|            set            |
+---------------------------+--+
| mapreduce.job.reduces=-1  |
+---------------------------+--+

## With a value, SET assigns it
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces=3;
0: jdbc:hive2://hadoop101:10000> set mapreduce.job.reduces;
+---------------------------+--+
|            set            |
+---------------------------+--+
| mapreduce.job.reduces=3   |
+---------------------------+--+

order by

Global sort. Uses only one reducer.

insert overwrite local directory '/home/hadoop/data/orderby/' 
row format delimited fields terminated by '\t'
select * from t_student order by age desc;

Limit

LIMIT is global.

insert overwrite local directory '/home/hadoop/data/orderby01/' 
row format delimited fields terminated by '\t'
select * from t_student order by age desc limit 3;

Strict Mode

The mode in which Hive executes operations. The default is nonstrict.

set hive.mapred.mode = strict;

In strict mode, certain risky queries are not allowed to run. The prohibited operations are listed below, followed by a short illustration:

  • Cartesian products
  • Queries on partitioned tables that do not pick any partition
  • Comparing bigints and strings
  • Comparing bigints and doubles
  • ORDER BY without LIMIT
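
A quick illustration, reusing the t_student table from the Group by example above (the exact error message depends on the Hive version):

set hive.mapred.mode=strict;
## Rejected in strict mode: ORDER BY without LIMIT
select * from t_student order by age desc;
## Accepted once a LIMIT is added
select * from t_student order by age desc limit 3;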

Sort by

SORT BY sorts within each reducer (per-partition sort). It is not restricted by strict mode.

insert overwrite local directory '/home/hadoop/data/orderby04/' 
row format delimited fields terminated by '\t'
select * from t_student sort by age desc;

If LIMIT is used after SORT BY, the LIMIT is still global: a new MapReduce job is started to apply the global limit.

Distribute by

DISTRIBUTE BY specifies the column used to distribute rows to reducers; it is used together with SORT BY.

insert overwrite local directory '/home/hadoop/data/orderby07/' 
row format delimited fields terminated by '\t'
select * from t_stu distribute by id sort by id;

If the distribution column and the sort column are the same and the sort order is ascending, DISTRIBUTE BY + SORT BY can be replaced with CLUSTER BY.

cluster by

insert overwrite local directory '/home/hadoop/data/orderby08/' 
row format delimited fields terminated by '\t'
select * from t_stu cluster by id;

Hive Operators

Operator precedence

| Example | Operators | Description |
| --- | --- | --- |
| A[B], A.identifier | bracket_op([]), dot(.) | element selector, dot |
| -A | unary(+), unary(-), unary(~) | unary prefix operators |
| A IS [NOT] (NULL\|TRUE\|FALSE) | IS NULL, IS NOT NULL, ... | unary suffix |
| A ^ B | bitwise xor(^) | bitwise xor |
| A * B | star(*), divide(/), mod(%), div(DIV) | multiplicative operators |
| A + B | plus(+), minus(-) | additive operators |
| A \|\| B | string concatenate(\|\|) | string concatenate |
| A & B | bitwise and(&) | bitwise and |
| A \| B | bitwise or(\|) | bitwise or |

Relational operators

| Operator | Operand types | Description |
| --- | --- | --- |
| A = B | All primitive types | TRUE if expression A is equal to expression B, otherwise FALSE. |
| A == B | All primitive types | Synonym for the = operator. |
| A <=> B | All primitive types | Returns the same result as the = operator for non-null operands, but returns TRUE if both are NULL and FALSE if one of them is NULL. (As of version 0.9.0.) |
| A <> B | All primitive types | NULL if A or B is NULL, TRUE if expression A is NOT equal to expression B, otherwise FALSE. |
| A != B | All primitive types | Synonym for the <> operator. |
| A < B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than expression B, otherwise FALSE. |
| A <= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is less than or equal to expression B, otherwise FALSE. |
| A > B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than expression B, otherwise FALSE. |
| A >= B | All primitive types | NULL if A or B is NULL, TRUE if expression A is greater than or equal to expression B, otherwise FALSE. |
| A [NOT] BETWEEN B AND C | All primitive types | NULL if A, B or C is NULL, TRUE if A is greater than or equal to B AND A is less than or equal to C, otherwise FALSE. This can be inverted by using the NOT keyword. (As of version 0.9.0.) |
| A IS NULL | All types | TRUE if expression A evaluates to NULL, otherwise FALSE. |
| A IS NOT NULL | All types | FALSE if expression A evaluates to NULL, otherwise TRUE. |
| A IS [NOT] (TRUE\|FALSE) | Boolean types | Evaluates to TRUE only if A meets the condition. (Since 3.0.0.) Note: NULL is UNKNOWN, and because of that (UNKNOWN IS TRUE) and (UNKNOWN IS FALSE) both evaluate to FALSE. |
| A [NOT] LIKE B | strings | NULL if A or B is NULL, TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions) while the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. |
| A RLIKE B | strings | NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B, otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE and so does 'foobar' RLIKE '^f.*r$'. |
| A REGEXP B | strings | Same as RLIKE. |

Arithmetic operators

+, -, *, /, %, &, |

| Operator | Operand types | Description |
| --- | --- | --- |
| A + B | All number types | Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float. |
| A - B | All number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A * B | All number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes an overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
| A / B | All number types | Gives the result of dividing A by B. The result is a double type in most cases. When A and B are both integers, the result is a double type except when the hive.compat configuration parameter is set to "0.13" or "latest", in which case the result is a decimal type. |
| A DIV B | Integer types | Gives the integer part resulting from dividing A by B. E.g. 17 div 3 results in 5. |
| A % B | All number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A & B | All number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A \| B | All number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A ^ B | All number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| ~A | All number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |

Logical operators

| Operator | Operand types | Description |
| --- | --- | --- |
| A AND B | boolean | TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL. |
| A OR B | boolean | TRUE if either A or B or both are TRUE; FALSE OR NULL is NULL; otherwise FALSE. |
| NOT A | boolean | TRUE if A is FALSE, NULL if A is NULL, otherwise FALSE. |
| ! A | boolean | Same as NOT A. |
| A IN (val1, val2, ...) | boolean | TRUE if A is equal to any of the values. As of Hive 0.13 subqueries are supported in IN statements. |
| A NOT IN (val1, val2, ...) | boolean | TRUE if A is not equal to any of the values. As of Hive 0.13 subqueries are supported in NOT IN statements. |
| [NOT] EXISTS (subquery) |  | TRUE if the subquery returns at least one row. Supported as of Hive 0.13. |

Complex type constructors and operators

| Constructor Function | Operands | Description |
| --- | --- | --- |
| map | (key1, value1, key2, value2, ...) | Creates a map with the given key/value pairs. |
| struct | (val1, val2, val3, ...) | Creates a struct with the given field values. Struct field names will be col1, col2, ... |
| named_struct | (name1, val1, name2, val2, ...) | Creates a struct with the given field names and values. (As of Hive 0.8.0.) |
| array | (val1, val2, ...) | Creates an array with the given elements. |
| create_union | (tag, val1, val2, ...) | Creates a union type with the value that is being pointed to by the tag parameter. |

The following operators provide mechanisms to access elements in complex types:

| Operator | Operand types | Description |
| --- | --- | --- |
| A[n] | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'. |
| M[key] | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map. For example, if M is a map comprising of {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all'] returns 'foobar'. |
| S.x | S is a struct | Returns the x field of S. For example, for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |

Built-in UDFs

  • Show Hive's functions

    show functions;
    
  • Show a function's description

    desc function <function_name>;
    
  • Show a function's extended information; if the function has examples, they are displayed as well

    desc function extended <function_name>;
    

    UDF (User-Defined Function)
    Characteristics: one-to-one; one row of data in, one value out

Mathematical functions

  • round: rounds to the nearest value (half up)
  • floor: rounds down
  • ceil: rounds up
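
    A quick check of these functions; the literal values are only illustrative:

    select round(3.456, 2), floor(3.9), ceil(3.1);
    ## 3.46    3    4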

Collection functions

  • size(Map<K.V>)
  • size(Array)
  • map_keys(Map<K.V>)
  • map_values(Map<K.V>)
  • array_contains(Array, value)
  • sort_array(Array)
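
    A quick check on literal values (illustrative only; the exact output formatting varies by Hive version):

    select size(array(1,2,3)), array_contains(array(1,2,3), 2), sort_array(array(3,1,2)), map_keys(map('a',1,'b',2));
    ## 3    true    [1,2,3]    ["a","b"]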

Type conversion functions

  • cast(expr as <type>): for example, cast('1' as BIGINT)

    select cast('1' as int) + 1;
    

Date functions

  • current_date
  • current_timestamp
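
    A quick check (neither function takes parentheses; the output values are illustrative):

    select current_date, current_timestamp;
    ## 2020-12-12    2020-12-12 10:30:00.123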

Conditional functions

  • if(boolean testCondition, T valueTrue, T valueFalseOrNull)
  • CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
    Equivalent to Java's switch statement
select id,name,age,
case sex
when 0 then '男生'
when 1 then '女生'
else '未知'
end as sex
from t_stu01;
  • CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END

    Equivalent to Java's if ... else if

select id,name,age,
case
when sex=0 then '男生'
when sex=1 then '女生'
else '未知'
end as sex
from t_stu01;

String functions

  • concat(string|binary A, string|binary B...): concatenates strings
  • concat_ws(string SEP, string A, string B...): concatenates strings using the given separator
  • length(string A)
  • split(string str, string pat)
  • substr(string|binary A, int start)
  • substring(string|binary A, int start)
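
    A quick check on literal values (illustrative only):

    select concat('foo','bar'), concat_ws('-','2020','12','12'), length('foobar'), split('1,2,3',','), substr('foobar',4);
    ## foobar    2020-12-12    6    ["1","2","3"]    bar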

Miscellaneous functions

  • current_user()
  • current_database()

Custom UDF

Example: write a custom UDF that returns the quarter for a given date.
Add the dependency:

<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.2</version>
</dependency>
  1. Create a class that extends UDF and implement one or more methods named evaluate.

    package com.bigdata.hive;
    
    import org.apache.hadoop.hive.ql.exec.UDF;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.Date;
    /**
    * Requirement: implement a UDF that returns the quarter for a given date.
    * 1. Create a class that extends UDF.
    * 2. Implement one or more methods named evaluate.
    */
    public class MyQuarter extends UDF {
        // Return the quarter for the given date.
        public int evaluate(String date){
            int quarter = 1;
            SimpleDateFormat sf = new SimpleDateFormat("yyyy-MM-dd");
            try {
                Date dt = sf.parse(date);
                Calendar calendar = Calendar.getInstance();
                calendar.setTime(dt);
                int m = calendar.get(Calendar.MONTH) + 1;
                
                if(m <= 3){
                    quarter = 1;
                }else if(m <= 6){
                    quarter = 2;
                }else if(m <= 9){
                    quarter = 3;
                }else{
                    quarter = 4;
                }
            } catch (ParseException e) {
                throw new RuntimeException("Expected date format: yyyy-MM-dd");
            }
            return quarter;
        }
    }
    
  2. Build the jar and add it to Hive's classpath

    hive (default)> add;
    Usage: add [FILE|JAR|ARCHIVE] <value> [<value>]*
    
    hive (default)> add jar /home/hadoop/jars/hive.jar;
    Added [/home/hadoop/jars/hive.jar] to class path
    Added resources:[/home/hadoop/jars/hive.jar]
    
  3. Register the function

    create temporary function my_qua as 'com.bigdata.hive.MyQuarter';
    
  4. Use the function

    hive (default)>
    > select my_qua('2020-10-11');
    OK
    _c0
    4
    

UDAF (Aggregate Functions)

Characteristics: many-to-one

  • collect_set: returns a set of objects with duplicate elements removed.
  • collect_list: returns a list of objects with duplicates kept.
zhangsan,100001
wangwu,200001
zhangsan,100002
wangwu,200002
zhangsan,100003
wangwu,200003
zhangsan,100004

## Group the order ids of the same user together
select name,concat_ws(' ',collect_list(cast(oid as string))) as oids from t_order group by name;
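
For comparison, a sketch with collect_set on the same table; duplicate order ids, if any, are kept only once, and the result is returned as an array:

select name,collect_set(oid) as oids from t_order group by name;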

Table-Generating Functions (UDTF)

Normal user-defined functions, such as concat(), take a single input row and produce a single output row. In contrast, table-generating functions transform a single input row into multiple output rows.

  • explode(ARRAY a): explodes an array into one row per element
  • json_tuple(string jsonStr, string k1, ..., string kn): extracts multiple keys from a JSON string
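
A minimal json_tuple sketch, assuming a hypothetical table t_json with a single string column named line that holds JSON such as {"name":"zhangsan","age":18}:

select json_tuple(line,'name','age') as (name,age) from t_json;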

Lateral View

lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
  • A lateral view is used together with a user-defined table-generating function such as explode().
  • A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.
hive (default)> select * from t_array;
t_array.name   t_array.oids
wangwu         [200001,200001,200001]
zhangsan       [100001,100002,100003,100004]

select name,t.*
from t_array
LATERAL VIEW explode(oids) t as oid;

Query result:

name     t.oid
wangwu   200001
wangwu   200001
wangwu   200001
zhangsan 100001
zhangsan 100002
zhangsan 100003
zhangsan 100004

Windowing and Analytics Functions

over(): the windowing function
  • Open a window with PARTITION BY
  • Window specification: windows can also be defined separately in a WINDOW clause. The following window-frame formats are supported:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED |[num]) FOLLOWING)

(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

OVER can be used with aggregate functions:

  • COUNT
  • SUM
  • MIN
  • MAX
  • AVG
Analytics functions
  • ROW_NUMBER

  • RANK

  • DENSE_RANK

  • Example 1: a per-partition Top-N template (here, the two most expensive items in each order):

    1018318373,华为P8,3876
    1982734213,iPhoneX,8765
    1298373232,小米,2765
    2018273771,充电宝,128
    2134875573,扫地机器人,2319
    1018318373,手机壳,87
    2018273771,U盘,128
    1298373232,移动硬盘,675
    1982734213,iPhone8,6745
    1018318373,华为笔记本,6547
    2134875573,智能音响,4541
    1298373232,体感游戏机,786
    2018273771,Oppo手机,4312
    
    select oid,name,price
    from (
    select *,
    row_number() over(partition by oid order by price desc) as num from t_goods) t 
    where num < 3;
    

    Query result:

    +-------------+----------+--------+--+
    |     oid     |   name   | price  |
    +-------------+----------+--------+--+
    | 1018318373  | 华为笔记本    | 6547   |
    | 1018318373  | 华为P8     | 3876   |
    | 1298373232  | 小米       | 2765   |
    | 1298373232  | 体感游戏机    | 786    |
    | 1982734213  | iPhoneX  | 8765   |
    | 1982734213  | iPhone8  | 6745   |
    | 2018273771  | Oppo手机   | 4312   |
    | 2018273771  | 充电宝      | 128    |
    | 2134875573  | 智能音响     | 4541   |
    | 2134875573  | 扫地机器人    | 2319   |
    +-------------+----------+--------+--+
    
  • Example 2: compute the cumulative sales total up to each day:

    1,2020-12-10,1000
    2,2020-12-11,2000
    3,2020-12-12,5000
    4,2020-12-13,3000
    5,2020-12-14,4000
    6,2020-12-15,6000
    
    select id,time,money,
    sum(money) over(order by id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as total 
    from t_sale;
    

    Query result:

    +-----+-------------+--------+--------+--+
    | id  |    time     | money  |  sum   |
    +-----+-------------+--------+--------+--+
    | 1   | 2020-12-10  | 1000   | 1000   |
    | 2   | 2020-12-11  | 2000   | 3000   |
    | 3   | 2020-12-12  | 5000   | 8000   |
    | 4   | 2020-12-13  | 3000   | 11000  |
    | 5   | 2020-12-14  | 4000   | 15000  |
    | 6   | 2020-12-15  | 6000   | 21000  |
    +-----+-------------+--------+--------+--+
    
  • Example 3: find the students in the top two ranks by score:

    1,zhangsan,90
    2,lisi,89
    3,wangwu,90
    4,zhaoliu,78
    5,tianqi,60
    6,wangba,55
    
    select id,name,score,
    RANK() over(order by score desc ROWS
    BETWEEN UNBOUNDED PRECEDING AND
    UNBOUNDED FOLLOWING) as level 
    from t_student01;
    

Bucketing

Partitioning: splits data into separate directories.
Bucketing: splits data into separate files; used when the data volume is large.

  1. Enable bucketing:

    set hive.enforce.bucketing=true;
    
  2. Create a bucketed table

    create table t_bucket(
    id int,
    name string,
    score int
    )clustered by(id) into 3 buckets
    row format delimited fields terminated by ',';
    
  3. Insert data with INSERT INTO

    insert into table t_bucket select * from t_student01;
    
  4. Three files are then generated under /user/hive/warehouse/t_bucket/ on HDFS
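
    One common use of a bucketed table is sampling; a sketch (bucket numbering starts at 1):

    ## Read only the first of the three buckets, pruned by the clustering column id
    select * from t_bucket tablesample(bucket 1 out of 3 on id);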

Join

join_table:
    table_reference [INNER] JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition
  | table_reference CROSS JOIN table_reference [join_condition] (as of Hive 0.10)
 
table_reference:
    table_factor
  | join_table
 
table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )
 
join_condition:
    ON expression

map join

Small table JOIN large table

How it works: a MapredLocal task is started first to load the small table's data into the distributed cache (as a hashtable). Once the local task finishes, the MapReduce job starts, and each MapTask reads the large table and joins it against the small-table data in the distributed cache, emitting matched rows directly. There is no reduce phase and therefore no shuffle, so performance is higher. A usage sketch follows the configuration below.

<property>
<name>hive.mapjoin.smalltable.filesize</name>
<value>25000000</value>
<description>
	The threshold for the input file size of the small table; if the file size is smaller than this threshold, Hive will try to convert the common join into a map join.
</description>
</property>

<property>
<name>hive.auto.convert.join</name>
<value>true</value>
<description>Whether Hive enables the optimization of converting common joins into map joins based on the input file size</description>
</property>
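
A usage sketch, reusing t_emp and t_dept from the LEFT SEMI JOIN example below; with hive.auto.convert.join=true the hint is usually unnecessary, but the MAPJOIN hint makes the intent explicit:

set hive.auto.convert.join=true;
select /*+ MAPJOIN(d) */ e.*
from t_emp e join t_dept d on e.did = d.did;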

reduce join / common join / shuffle join

Large table JOIN large table


LEFT SEMI JOIN

The restriction when using a left semi join is that the right-hand table may only be referenced in the join condition (the ON clause), not in the WHERE clause, the SELECT clause, etc.

select e.*
from t_emp e
left semi join t_dept d
on e.did = d.did;

A left semi join returns only the left-table rows that have a match; no columns from the right table are included.

Join optimization

Predicate Pushdown in Outer Joins

Predicate Pushdown (PPD)
A predicate is a boolean expression, i.e. a filter condition.


Predicate pushdown rules (illustrated below):

1. A predicate on the main (row-preserving) table placed in the WHERE clause is pushed down.
2. A predicate on the secondary (null-supplying) table placed in the ON clause is pushed down.
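
An illustration of the two rules, assuming a left outer join in which t_emp is the main (row-preserving) table and t_dept is the secondary table; the filter value 100 is hypothetical:

## Rule 1: a predicate on the main table in the WHERE clause is pushed down
select e.*,d.* from t_emp e left join t_dept d on e.did = d.did where e.did > 100;
## Rule 2: a predicate on the secondary table in the ON clause is pushed down
select e.*,d.* from t_emp e left join t_dept d on e.did = d.did and d.did > 100;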

Dynamic Partitioning

Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts

With dynamic partition inserts, users can give a partial partition specification, which means just listing the partition column names in the PARTITION clause; the column values are optional. If a partition column value is given, we call it a static partition; otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the SELECT statement, so the creation of dynamic partitions is determined by the values of those input columns. The dynamic partition columns must be specified last in the SELECT statement, in the same order in which they appear in the PARTITION() clause.

<property>
<name>hive.exec.dynamic.partition</name>
<value>true</value>
<description>Whether or not to allow dynamic partitions in DML/DDL.</description>
</property>

<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>strict</value>
<description>
	In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions. In nonstrict mode, all partitions are allowed to be dynamic.
</description>
</property>
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
1,www.baidu.com,2020-10-11
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12
2,www.baidu.com,2020-10-12

create table t1(
id int,
web string,
time string
)row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/data/t1.txt' into table t1;

create table t_part(
id int,
web string,
time string
)partitioned by (access_time string)
row format delimited fields terminated by ',';

insert into table t_part partition (access_time) select id,web,time,time from t1;
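
Note: with hive.exec.dynamic.partition.mode left at its strict default (as shown in the property above), the fully dynamic insert above is rejected because no static partition is specified. A session-level sketch:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table t_part partition (access_time) select id,web,time,time from t1;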

Hive Optimization

  • SELECT queries (fetch-task conversion)

    <property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
        Expects one of [none, minimal, more].
        Some select queries can be
        converted to single FETCH task
        minimizing latency.
        Currently the query should be
        single sourced not having any subquery
        and should not have
        any aggregations or distincts
        (which incurs RS), lateral views and joins.
        0. none : disable hive.fetch.task.conversion
        1. minimal : SELECT, FILTER on partition columns, LIMIT only
        2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
    </property>
    
  • Enable local mode

    <property>
    <name>hive.exec.mode.local.auto</name>
    <value>false</value>
    <description>
    	Let Hive decide whether to run in local mode automatically
    </description>
    </property>
    
    <property>
    <name>hive.exec.mode.local.auto.inputbytes.max</name>
    <value>134217728</value>
    <description>
    For local mode, the total input bytes should be smaller than this value.
    </description>
    </property>
    
    <property>
    <name>hive.exec.mode.local.auto.input.files.max</name>
    <value>4</value>
    <description>
    When hive.exec.mode.local.auto is true, the number of tasks should be less than this value for local mode.
    </description>
    </property>
    
  • JVM reuse

    MapReduce is process-based: each MapTask and ReduceTask runs in its own JVM process, and starting and tearing down JVM processes takes time.

    mapred-site.xml

    hive (default)> set mapred.job.reuse.jvm.num.tasks;
    mapred.job.reuse.jvm.num.tasks=1
    hive (default)> set mapred.job.reuse.jvm.num.tasks=10;
    
  • Speculative execution

    <property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    </property>
    

    If one MapTask runs for a long time because the data is skewed, speculative execution should not be enabled in that case.
