hive的udtf函数：inline+实现动态分区+小文件处理+杂记

最新推荐文章于 2023-11-28 21:08:04 发布

Anla Likes Sunshine

最新推荐文章于 2023-11-28 21:08:04 发布

阅读量885

点赞数

分类专栏：大数据

本文链接：https://blog.csdn.net/AnlaGodness/article/details/111872216

版权

大数据专栏收录该内容

48 篇文章 1 订阅

订阅专栏

1、inline 函数说明：
在这里插入图片描述
翻译：将结构数组分解为多行。返回一个包含N列的行集（N=结构中顶级元素的数目），数组中每个结构一行。
场景：源表粒度为班级，目标表粒度为班级+老师。

with t as (
    select 'class1' as class_id, '物联网工程' as class_name,'Anla' as chinese_teacher,200 as chinese_teacher_donate_amount
        ,'Satellite' as math_teacher,500 as math_teacher_donate_amount,'Patty' as english_teacher,288 as english_teacher_donate_amount
    union all
    select 'class2' as class_id, '软件工程' as class_name,'Aron' as chinese_teacher,500 as chinese_teacher_donate_amount
        ,'Lisa' as math_teacher,300 as math_teacher_donate_amount,'Yoga' as english_teacher,168 as english_teacher_donate_amount
)
select
    t.class_id
    ,t.class_name
    ,t_inline.teacher
    ,t_inline.donate_amount
from t 
lateral view inline(array(struct(chinese_teacher,chinese_teacher_donate_amount),
                        struct(math_teacher,math_teacher_donate_amount),
                        struct(english_teacher,english_teacher_donate_amount)
)) t_inline as teacher,donate_amount;

在这里插入图片描述
2、动态分区如何实现：

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict; 
set hive.exec.max.dynamic.partitions=100000; --可选
set hive.exec.max.dynamic.partitions.pernode=100000; --可选
insert overwrite table table_name partition(rand_num)
select field1,...,fieldn,cast(cast(rand() * 10 as int) as string) as rand_num
from t

这样就实现动态分区了。
注意：
1、distribute by解决的是小文件的问题，而不是实现动态分区，这点很容易混淆。
2、如果分区数超过限定值，会报错：Fatal error occured when node tried to create too many dynamic partitions.
报错则考虑是否设置参数： hive.exec.max.dynamic.partitions、hive.exec.max.dynamic.partitions.pernode

3、小文件如何处理：
3.1、distribute by

insert overwrite table table_name partition(statdate)
distribute by cast(rand() * 10 as int)；

这种方法会在分区文件夹statdate=‘xxxx’ 下生成至多10个文件夹。
3.2、repartition，常使用在给一个分区插入数据上

select /*+ repartition(1) */

4、杂记
4.1、sqoop导出新经验：指定分区导出，分区字段得是 string 类型，不然会报错。
4.2、查询字段小数位长度： length(substring_index(cast(field as string),'.',-1))
效果：1.614->3; 12.8292->4

Anla Likes Sunshine

关注

0
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
hive的udtf函数：inline+实现动态分区+小文件处理+杂记

1、inline 函数说明：翻译：将结构数组分解为多行。返回一个包含N列的行集（N=结构中顶级元素的数目），数组中每个结构一行。场景：源表粒度为班级，目标表粒度为班级+老师。with t as ( select 'class1' as class_id, '物联网工程' as class_name,'Anla' as chinese_teacher,200 as chinese_teacher_donate_amount ,'Satellite' as math_tea
复制链接

扫一扫