Hive Practice (Part 3)

谁说大象不能跳舞

Article link: https://blog.csdn.net/jiahonhyu0609/article/details/88806820

1. Create an internal (managed) table:

 create table if not exists inner_test(
 aisle_id string,
 aisle_name string
 )
 row format delimited fields terminated by ',' lines terminated by '\n'
 stored as textfile location '/data/inner';

LOAD DATA LOCAL INPATH '/home/wl/hive/data/data/aisles.csv'  OVERWRITE INTO TABLE inner_test;

Then check the files on HDFS:
 hadoop fs -ls /data/inner

After running drop table inner_test; the table's files no longer exist on HDFS: dropping an internal table deletes its data along with its metadata.

2. Create an external table:

create external table if not exists ext_test(
 aisle_id string,
 aisle_name string
 )
 row format delimited fields terminated by ',' lines terminated by '\n' 
 stored as textfile location '/data/ext';
 
LOAD DATA LOCAL INPATH '/home/wl/hive/data/data/aisles.csv'  OVERWRITE INTO TABLE ext_test;

After running drop table ext_test; the file is still present on HDFS: dropping an external table removes only its metadata.
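
A quick way to confirm which kind of table is about to be dropped is to inspect its metadata first. The sketch below assumes the ext_test table from above still exists. The output of describe formatted includes a Table Type field, which reads EXTERNAL_TABLE for an external table and MANAGED_TABLE for an internal one.

```sql
-- Show the full table metadata, including the Table Type and Location fields.
describe formatted ext_test;

-- Dropping an external table removes only the metastore entry, not the data.
drop table ext_test;
```

Afterwards, hadoop fs -ls /data/ext should still list the loaded file.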

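3. Partitioned tables (dynamic partitioning)

With dynamic partitioning you do not have to write a separate insert statement for each partition: the partition values are not known in advance and are taken from the data itself.
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- nonstrict mode: every partition column may be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;
If the mode is strict, at least one static partition must be specified, and it must come first.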
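
As an illustration of the strict-mode rule, here is a minimal sketch that mixes one static and one dynamic partition. The table orders_by_set_dow is hypothetical (it is not part of the original exercise), and the sketch assumes an orders table with the columns used below and an eval_set value of 'prior'.

```sql
-- Hypothetical table with two partition columns.
create table if not exists orders_by_set_dow(
order_id string,
user_id string,
order_number string,
order_hour_of_day string,
days_since_prior_order string
)partitioned by(eval_set string, order_dow string)
row format delimited fields terminated by '\t';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=strict;

-- eval_set is static (fixed value, listed first); order_dow is dynamic
-- and is filled from the last column of the select.
insert overwrite table orders_by_set_dow partition (eval_set='prior', order_dow)
select order_id,user_id,order_number,order_hour_of_day,days_since_prior_order,order_dow
from orders
where eval_set='prior';
```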

-- Create the partitioned table
create table partition_test(
order_id string,
user_id string,
eval_set string,
order_number string,
order_hour_of_day string,
days_since_prior_order string
)partitioned by(order_dow string)
row format delimited fields terminated by '\t';

-- Insert into the partitioned table with dynamic partitioning enabled
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

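The following two lines form a single statement. Because it names the partition value explicitly (order_dow='1'), it loads only that one partition: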
insert overwrite table partition_test partition (order_dow='1')
select order_id,user_id,eval_set,order_number,order_hour_of_day,days_since_prior_order from orders where order_dow='1';
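
For comparison, a fully dynamic insert leaves the partition value out of the partition clause and selects the partition column last, so Hive creates one partition per distinct order_dow value found in the data. This is a sketch under the assumption that orders contains the order_dow column used in the where clause above; it relies on the two set statements shown earlier.

```sql
-- order_dow has no fixed value here, so Hive derives the partitions
-- from the last column of the select.
insert overwrite table partition_test partition (order_dow)
select order_id,user_id,eval_set,order_number,order_hour_of_day,days_since_prior_order,order_dow
from orders;
```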

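-- Query the partitioned table; always filter on the partition column in the where clause
-- (strict mode, hive.mapred.mode=strict, rejects queries on partitioned tables that lack such a filter)
select * from partition_test where order_dow='0' limit 10;

-- List the table's partitions
show partitions partition_test;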

4. Hive optimization
4.1 Reduce optimization

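hive.exec.reducers.bytes.per.reducer: the amount of input each reducer should handle. Hive estimates how many reducers a job needs by dividing the total input size by this value. The default is 1 GB (lowered to 256 MB starting with Hive 0.14).
hive.exec.reducers.max: the upper bound on the number of reducers. If input size / bytes per reducer > max, the job starts the number of reducers given by this parameter. It does not affect an explicit mapreduce.job.reduces setting. Default: 999 (1009 starting with Hive 0.14).
mapreduce.job.reduces: if this is set, Hive does not use its estimation function to compute the number of reducers and starts exactly this many instead. Default: -1.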
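
The three parameters can be combined as in the sketch below. The values are illustrative assumptions, not recommendations. With 256,000,000 bytes per reducer, an input of 10,000,000,000 bytes gives 10,000,000,000 / 256,000,000 = 39.06, which rounds up to 40 reducers, below the cap of 100.

```sql
-- Each reducer should handle roughly 256 MB of input.
set hive.exec.reducers.bytes.per.reducer=256000000;
-- Never start more than 100 reducers, whatever the estimate says.
set hive.exec.reducers.max=100;

-- Alternative: bypass the estimate and force an exact reducer count.
-- set mapreduce.job.reduces=20;
```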

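Why are there so many parameters around reduce?

Too few reduces: with a large data volume each reduce becomes extremely slow, so the job runs for a long time and delays the tasks that depend on it; it may also run out of memory (OOM).
Too many reduces: too many small files are produced, merging them is expensive, and NameNode memory usage grows.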

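When is there only one reduce?
No matter how large the data is, and regardless of any settings that adjust the reduce count, the job always runs with a single reduce task.
Besides the case where the data volume is below hive.exec.reducers.bytes.per.reducer, the possible causes are:
1. An aggregation written without group by. This is a query-writing issue: a global aggregate always runs in one reduce.
2. order by is used; as above, a global sort runs in one reduce.
3. There is a Cartesian product (without an on clause, Hive can only use one reduce to compute it).
What is a Cartesian product? A join with no on condition linking the two sides.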

select * from tmp_d join (select * from tmp_d)t;
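
The first two cases can be sketched against the orders table, assuming it has the user_id and order_number columns used earlier. The last query shows one possible alternative when a total order is not actually needed. It is not equivalent to order by, because rows are sorted only within each reduce.

```sql
-- Global aggregate: no group by, so a single reduce produces the one result row.
select count(1) from orders;

-- Total ordering: order by sends every row through one reduce.
select user_id,order_number from orders order by order_number desc limit 10;

-- distribute by + sort by spreads the work over several reduces;
-- the output is ordered per reduce, not globally.
select user_id,order_number from orders distribute by user_id sort by user_id,order_number desc;
```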

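union all / distinct
Doing the union all first and then the group by (or similar operation) effectively reduces the number of MR stages: although there are several selects, in the end there is only one MR.
union all + distinct == union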

-- Run time: 74.712 seconds, 2 jobs
select count(distinct *) 
from (
select order_id,user_id,order_dow from orders where order_dow='0' union all
select order_id,user_id,order_dow from orders where order_dow='0' union all 
select order_id,user_id,order_dow from orders where order_dow='1'
)t;

-- Run time: 122.996 seconds, 3 jobs
select count(*) 
from(
select order_id,user_id,order_dow from orders where order_dow='0' union 
select order_id,user_id,order_dow from orders where order_dow='0' union 
select order_id,user_id,order_dow from orders where order_dow='1')t;

5. Data skew

hive.groupby.skewindata
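- When this option is set to true, two MR jobs are generated.
- In the first MR job, the map output is distributed randomly across the reduces. Each reduce performs a partial aggregation and writes out its result. Rows with the same group by key may therefore land in different reduces, which balances the load.
- The second MR job takes the pre-aggregated results and distributes them to the reduces by group by key (this guarantees that rows with the same group by key reach the same reduce) and completes the final aggregation.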

select add_to_cart_order,count(1) as cnt 
from priors 
group by add_to_cart_order
limit 10;
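
To apply the option to the query above, set it before running the statement. The second query is a hand-written sketch of the same two-stage idea, not Hive's internal plan: a random salt (a hypothetical helper column with 10 assumed buckets) splits each key across several groups, and the outer query merges the partial counts.

```sql
-- Let Hive plan the two-job aggregation itself.
set hive.groupby.skewindata=true;

-- Manual equivalent: stage 1 counts per (key, salt), stage 2 sums per key.
select add_to_cart_order,sum(cnt) as cnt
from (
select add_to_cart_order,salt,count(1) as cnt
from (
select add_to_cart_order,cast(floor(rand() * 10) as int) as salt
from priors
) s
group by add_to_cart_order,salt
) t
group by add_to_cart_order;
```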

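set hive.exec.parallel=true;
With this setting, stages that do not depend on each other can run at the same time:
1: the map finishes, then the reduce runs; 1 + 2 => 3: reduce
2: map and reduce, executed in parallel with 1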

Left join:

select od.order_id,tr.product_id,od.user_id
from
(select order_id,user_id,order_dow from orders limit 100) od
left outer join
(select order_id,product_id,reordered from trains) tr 
on (od.order_id=tr.order_id and od.order_dow='0' and tr.reordered=1)
limit 30;

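How do you find out exactly which keys are skewed?
Take a sample of the data.
Long-tail data (which describes almost all internet data): the distribution plot looks roughly like a -log curve.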
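
One way to take such a sample is to keep a small random fraction of the rows and count rows per join key. The sketch below assumes the trains table and its order_id join key from the left join above; the 1% sampling rate and the top-20 cutoff are arbitrary choices.

```sql
-- Keep about 1% of the rows, then list the keys that occur most often.
select order_id,count(1) as cnt
from trains
where rand() < 0.01
group by order_id
order by cnt desc
limit 20;
```

Keys whose counts sit far above the rest of the list are the likely skew candidates.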