Summary of Hive Usage Tips

LightsUpW

Article link: https://blog.csdn.net/lightsupw/article/details/80916740

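This article summarizes some practical Hive tips, including the use of regular expressions, displaying column names, sorting optimization (order by vs sort by vs cluster by), join optimization, partition inserts, sampling methods, and timestamp format handling. It also compares Hive and Presto in terms of data types and time functions, and discusses strategies for splitting data into multiple rows.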

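-- Treat backquoted names in the select list as regular expressions
set hive.support.quoted.identifiers=none;
-- Return pin first, followed by every column except pin
select a.pin, `(pin)?+.+` from table_name a;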

-- Print column names as a header row in the CLI output
set hive.cli.print.header=true;

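order by: global sort performed by a single reducer; it cannot run in parallel, so it is slow on large inputs;
sort by: sorts rows within each reducer only, so the overall output is partially ordered; usually combined with distribute by;
cluster by col1 is equivalent to distribute by col1 sort by col1, but the sort direction cannot be specified (ascending only);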
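
The three clauses are easiest to compare side by side. The sketch below assumes a hypothetical table user_orders(user_id, amount); the table name, the column names, and the reducer count are illustrative and do not come from the original post.

```sql
-- Hypothetical table: user_orders(user_id string, amount double)

-- Global ordering: every row passes through a single reducer
select user_id, amount from user_orders order by amount desc;

-- Rows with the same user_id are sent to the same reducer,
-- and each reducer sorts only its own rows
set mapreduce.job.reduces=4;
select user_id, amount from user_orders
distribute by user_id
sort by user_id, amount desc;

-- Shorthand for: distribute by user_id sort by user_id (ascending only)
select user_id, amount from user_orders cluster by user_id;
```

With distribute by plus sort by, the distribution key and the sort key can differ and the sort direction is free; cluster by gives up that flexibility for brevity.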

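When every join in a multi-table join uses the same key, Hive can merge the joins into a single MapReduce job;
Filter first, then join;
Put the small table first and the large table last: the earlier tables are buffered in memory, while the last table is streamed;
Use left semi join in place of in; it is more efficient;
Data-skew optimization when joining a small table to a large table: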

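-- Ordinary (reduce-side) join
select t1.a, t1.b from table1 t1 join table2 t2 on (t1.a = t2.a);
-- Map join: the small table t1 is loaded into memory, so the join is done on the map side
select /*+ mapjoin(t1) */ t1.a, t1.b from table1 t1 join table2 t2 on (t1.a = t2.a);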
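
The left semi join and filter-first tips from the list above can be sketched as follows. The tables orders(pin, amount, dt) and vip_users(pin, level) are hypothetical and used only for illustration.

```sql
-- Hypothetical tables: orders(pin, amount, dt), vip_users(pin, level)

-- IN with a subquery
select o.pin, o.amount
from orders o
where o.pin in (select v.pin from vip_users v);

-- The same filter as a left semi join; columns of the right-hand table
-- may be referenced only in the on clause, not in the select list
select o.pin, o.amount
from orders o
left semi join vip_users v on (o.pin = v.pin);

-- Filter before joining: restrict orders to one day in a subquery,
-- so fewer rows reach the join
select o.pin, o.amount, v.level
from (select pin, amount from orders where dt = '2018-10-17') o
join vip_users v on (o.pin = v.pin);
```

Because a left semi join only tests for the existence of a match, it returns each left-hand row at most once, even when the right-hand table contains duplicate keys.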

Static insert: the values of the target partition columns dt and name must be specified explicitly;

insert overwrite table test partition (dt='2018-10-17', name='a')
select col1, col2 from data_table where dt='2018-10-17';
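
The static insert presupposes a target table partitioned by dt and name. A minimal sketch of such a table and a check afterwards might look like this; the column types are assumptions, since the original post does not show the DDL.

```sql
-- Assumed DDL: partition columns are declared in partitioned by,
-- separately from the data columns col1 and col2
create table if not exists test (
  col1 string,
  col2 string
)
partitioned by (dt string, name string);

-- (run the static insert shown above here)

-- List the partitions to confirm that the target partition was written
show partitions test;

-- Read the partition back; the partition columns act as ordinary filters
select col1, col2 from test where dt = '2018-10-17' and name = 'a';
```

Note that the select list of the insert contains only the data columns; the partition values come from the partition clause.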
