Advanced Hive Tutorial

Latest recommended article published on 2021-03-15 22:26:11

手撕机

Original article. Do not repost without authorization.

Article link: https://blog.csdn.net/guolindonggld/article/details/82380573

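Dynamic Partitions

Create a partitioned table:

create table student(id string, name string)
partitioned by (inc_day string);

When inserting data into a partitioned table, you normally have to name the target partition explicitly: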

INSERT INTO TABLE student PARTITION(inc_day='20180929')
SELECT id,name FROM another_table;

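With many partitions, inserting into them one at a time quickly becomes unbearable. Dynamic partitioning lets Hive take the partition value from the data instead:

-- Dynamic partition switch (default is false before Hive 0.9.0, true since)
set hive.exec.dynamic.partition=true;
-- Default is strict, which does not allow all partition columns to be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;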

INSERT OVERWRITE TABLE student PARTITION(inc_day)
SELECT id, name, inc_day FROM another_table;

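Note that Hive infers the partition value from the position of the column, not from its name: the dynamic partition columns are taken from the end of the SELECT list. For example, if the query is changed to SELECT id, inc_day, name FROM another_table;, the partitions of student are created from the values of name.
In addition, if a column's type does not match the target column, Hive fills in NULL rather than raising an error.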
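As a minimal sketch of the positional rule, assume another_table keeps the date in a column named dt (a hypothetical name, not from the original). The alias is optional and does not decide which column becomes the partition; only being last in the SELECT list does. SHOW PARTITIONS can then be used to check what was created:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- dt is a hypothetical source column. It supplies the partition value
-- only because it is the last column in the SELECT list.
INSERT OVERWRITE TABLE student PARTITION(inc_day)
SELECT id, name, dt AS inc_day FROM another_table;

-- List the partitions that the insert created.
SHOW PARTITIONS student;
```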

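If the insert fails with Error: GC overhead limit exceeded, have Hive sort the rows by the dynamic partition columns, so that each reducer keeps only one record writer open at a time: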

set hive.optimize.sort.dynamic.partition=true;

Hive Insert to Dynamic Partition Query Generating Too Many Small Files
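One commonly used way to limit the number of small files, sketched here on the assumption that each partition is small enough for a single reducer to handle, is to add DISTRIBUTE BY on the partition column. All rows of a partition then go to the same reducer, so each partition is written by one task:

```sql
-- Route every row of a given inc_day to the same reducer,
-- so each partition directory is written by a single task.
INSERT OVERWRITE TABLE student PARTITION(inc_day)
SELECT id, name, inc_day
FROM another_table
DISTRIBUTE BY inc_day;
```

The trade-off is parallelism: a very large partition ends up on a single reducer.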

Caused by: org.apache.hadoop.hive.ql.metadata.HiveFatalException: [Error 20004]: Fatal error occurred when node tried to create too many dynamic partitions. The maximum number of dynamic partitions is controlled by hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode. Maximum was set to 100 partitions per node, number of dynamic partitions on this node: 101

-- Default is 1000
set hive.exec.max.dynamic.partitions=1000000;
-- Default is 100
set hive.exec.max.dynamic.partitions.pernode=10000;

[Fatal Error] total number of created files now is 100057, which exceeds 100000. Killing the job.
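This limit comes from hive.exec.max.created.files, whose default of 100000 matches the number in the message. A sketch of raising it follows; the value is an arbitrary illustration, not a recommendation:

```sql
-- Maximum number of HDFS files one job may create (default 100000).
-- 500000 is an illustrative value only.
set hive.exec.max.created.files=500000;
```

Raising the cap only hides the symptom, so it is usually worth checking first whether the job can write fewer files.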

A Brief Introduction to Partitioned Tables in Hive

To sum up, put the following settings at the top of the script:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=1000000;
set hive.exec.max.dynamic.partitions.pernode=10000;

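Pitfalls of Join Operations

-- Note: this redefines student from the previous section;
-- drop the earlier table first if it still exists.
create table student(
id string comment 'student ID',
name string comment 'name',
sex string comment 'sex',
age string comment 'age',
dept_id string comment 'department ID'
) comment 'student table';

create table department(
id string comment 'department ID',
name string comment 'department name'
) comment 'department table';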

insert into student values('1','孙悟空','男','18','01');
insert into student values('2','明世隐','男','19','01');
insert into student values('3','高渐离','男','20','02');
insert into student values('4','孙尚香','女','21','02');
insert into student values('5','安琪拉','女','22','03');

insert into department values('01','信息学院');
insert into department values('02','法学院');
insert into department values('03','艺术学院');
-- Intentional duplicate of department 01, used to demonstrate Pitfall 1 below
insert into department values('01','信息学院');

JOIN (i.e. INNER JOIN)

LEFT JOIN

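Pitfall 1: the join key in the right-hand table is not unique, so every matching row from the left table is duplicated in the result.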

select * from student 
left join department on student.dept_id=department.id
order by student.id;

1	孙悟空	男	18	01	01	信息学院
1	孙悟空	男	18	01	01	信息学院
2	明世隐	男	19	01	01	信息学院
2	明世隐	男	19	01	01	信息学院
3	高渐离	男	20	02	02	法学院
4	孙尚香	女	21	02	02	法学院
5	安琪拉	女	22	03	03	艺术学院
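One way to guard against this, shown as a sketch rather than the author's own fix, is to check the right-hand table for duplicate keys first and, if any exist, collapse it to one row per key before joining. Using max(name) to pick a single name per ID is an assumption that suits this data, where the duplicates are identical:

```sql
-- Find join keys that occur more than once in department.
select id, count(*) as cnt
from department
group by id
having count(*) > 1;

-- Collapse department to one row per id, then join.
select s.*, d.name as dept_name
from student s
left join (
  select id, max(name) as name
  from department
  group by id
) d on s.dept_id = d.id
order by s.id;
```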

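Setting the Execution Engine and Queue

A queue setting in Hive only takes effect when it is the one that matches the execution engine in use; otherwise it is ignored.

Set the engine

-- Pick one of the following
set hive.execution.engine=mr;
set hive.execution.engine=spark;
set hive.execution.engine=tez;

Set the queue
Here etl is the queue name; the default queue is default.

-- If the engine is mr (native MapReduce)
set mapreduce.job.queuename=etl;

-- If the engine is tez
set tez.queue.name=etl;

-- If the engine is spark
set spark.yarn.queue=etl;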

Setting the Number of Reducers

-- In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>;

-- In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>;

-- In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>;
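As an illustration with arbitrary values: when the reducer count is not fixed, Hive estimates it roughly as the input size divided by the bytes-per-reducer setting, capped by the maximum. With the settings below, about 10 GB of input would be estimated at around 40 reducers:

```sql
-- Target about 256 MB of input per reducer (illustrative value).
set hive.exec.reducers.bytes.per.reducer=268435456;

-- Never use more than 200 reducers (illustrative value).
set hive.exec.reducers.max=200;

-- -1 leaves the count to Hive's estimate instead of fixing it.
set mapreduce.job.reduces=-1;
```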

Print Options

-- Print column names
set hive.cli.print.header=true;

-- Do not prefix column names with the table name
set hive.resultset.use.unique.column.names=false;

Data Skew

Pitfall: the reduce phase gets stuck at 99% and never finishes.
2020-05-20 23:48:22,965 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 697184.92 sec
2020-05-20 23:49:23,115 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 698600.58 sec
2020-05-20 23:50:23,468 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 699852.83 sec
2020-05-20 23:51:23,760 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 700922.2 sec
2020-05-20 23:52:23,799 Stage-1 map = 100%, reduce = 99%, Cumulative CPU 701941.37 sec

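When an HQL statement contains GROUP BY, Hive by default sends all rows with the same grouping key to the same reducer. If the values of the grouping key are skewed, one reducer may receive most of the data and take far longer to finish than the others.

With set hive.groupby.skewindata=true;, Hive runs an additional MR job in which the mapper output is distributed randomly across reducers for partial aggregation; a second job then groups the partial results by key to produce the final result.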

set hive.groupby.skewindata=true;
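The same two-stage idea can also be written by hand with a random salt. The sketch below assumes a hypothetical table events with a skewed column user_id; neither name comes from the original, and the bucket count of 10 is arbitrary:

```sql
-- Stage 1 spreads each user_id over 10 random salt buckets and counts per bucket.
-- Stage 2 adds the partial counts back together per user_id.
select user_id, sum(cnt) as cnt
from (
  select user_id, salt, count(*) as cnt
  from (
    select user_id, cast(floor(rand() * 10) as int) as salt
    from events  -- hypothetical table
  ) salted
  group by user_id, salt
) partial_counts
group by user_id;
```

This works for aggregates that can be merged from partial results, such as count and sum.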

More on data skew: https://www.cnblogs.com/kongcong/p/7777092.html

Debugging

# Start Hive with this command instead
hive --hiveconf hive.root.logger=DEBUG,console

Sampling

Reference: https://blog.csdn.net/baidu_20183817/article/details/84099049
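The linked article is not reproduced here. For orientation, the following sketch shows three TABLESAMPLE forms in Hive, run against the student table from the join section. Block and row-count sampling work on input blocks or splits, so the size of the result is approximate, and block sampling may not be available for every input format:

```sql
-- Bucket sampling on a random value: roughly 1/10 of the rows.
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s;

-- Block sampling: at least about 10 percent of the input data.
SELECT * FROM student TABLESAMPLE(10 PERCENT) s;

-- Row-count sampling: 100 rows taken from each input split.
SELECT * FROM student TABLESAMPLE(100 ROWS) s;
```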
