Background
We need to write device IDs into partitions keyed by the ID type (cuid, imei, oaid, idfa) and by the value of mumuhash over the ID (the salt). The table is partitioned by type and salt, and we write with dynamic partitioning; both type and salt are computed inside the query. The SQL is as follows:
insert overwrite table ugc_test_new.dwd_cpa_act_user_df_txt partition (dt = '{@date}', app = 'zuoyebang', type, salt)
select
    device_id,
    type,
    salt
from
    (
        select
            'cuid' as type,
            cuid as device_id,
            cast(mumuhash(cuid, 512) as string) as salt
        from
            device_tb
        where
            cuid not in (
                '0',
                '',
                '00000000-0000-0000-0000-000000000000',
                '00000000000000000000000000000000'
            )
        group by
            cuid
        union all
        select
            'oaid' as type,
            oaid as device_id,
            cast(mumuhash(oaid, 512) as string) as salt
        from
            device_tb
        where
            oaid not in (
                '0',
                '',
                '00000000-0000-0000-0000-000000000000',
                '00000000000000000000000000000000'
            )
        group by
            oaid
        union all
        select
            'imei' as type,
            imei as device_id,
            cast(mumuhash(imei, 512) as string) as salt
        from
            device_tb
        where
            imei not in (
                '0',
                '',
                '00000000-0000-0000-0000-000000000000',
                '00000000000000000000000000000000'
            )
        group by
            imei
        union all
        select
            'idfa' as type,
            idfa as device_id,
            cast(mumuhash(idfa, 512) as string) as salt
        from
            device_tb
        where
            idfa not in (
                '0',
                '',
                '00000000-0000-0000-0000-000000000000',
                '00000000000000000000000000000000'
            )
        group by
            idfa
    ) t3
group by
    device_id,
    type,
    salt
Problem
To guarantee the uniqueness of device_id, we ended the query with a group by on device_id + type + salt. That final aggregation ran as 512 tasks (the exact number varies with data volume).
Because this is the last step, Spark writes the data to disk in this stage, and each task writes its own file under every dynamic partition directory it receives rows for. Since the shuffle hash-partitions rows by the full group-by key (device_id, type, salt), almost every task ends up with rows for every (type, salt) combination. So if the task count is t, salt has 512 values, and there are 4 types, the number of files generated is t * 512 * 4. Here t is 512, so the job produced 512 * 512 * 4 = 1,048,576 small files. With that many files to create, the job stalled, never finished, and ultimately failed.
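For reference, t here is the shuffle partition count of the final aggregation, which in Spark SQL is governed by spark.sql.shuffle.partitions (default 200). The setting below is a hypothetical sketch that would reproduce t = 512; note that with adaptive query execution enabled, Spark may coalesce shuffle partitions, which is one reason t can vary with data volume:

-- Hypothetical session setting; a shuffle partition count of 512
-- matches the task count observed in this job.
set spark.sql.shuffle.partitions = 512;
-- With AQE on, Spark can merge small shuffle partitions, so the
-- actual task count may come out lower than this setting:
-- set spark.sql.adaptive.enabled = true;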
The figure below shows files with the same type and salt emitted by three different tasks; as you can see, the three tasks produced three separate files:
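In terms of the on-disk layout, the duplication looks like the following hypothetical listing (the table location and part-file names are illustrative only; the type=/salt= path segments follow the standard Hive dynamic partition convention, and the part-NNNNN prefix reflects Spark's usual task-indexed output file naming):

…/dwd_cpa_act_user_df_txt/dt={@date}/app=zuoyebang/type=cuid/salt=7/part-00000-….txt
…/dwd_cpa_act_user_df_txt/dt={@date}/app=zuoyebang/type=cuid/salt=7/part-00173-….txt
…/dwd_cpa_act_user_df_txt/dt={@date}/app=zuoyebang/type=cuid/salt=7/part-00509-….txt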
Solution
At the very end of the query, add a distribute by on the dynamic partition columns; in this case, distribute by type, salt. This guarantees that all rows with the same type and salt are processed by the same task, so each partition directory is written by exactly one writer, and the output file count stays within an acceptable range: number_of_types * number_of_salts * ceil(partition_data_size / 256 MB). In our case each partition's data is smaller than 256 MB, so the job produces 4 * 512 * 1 = 2048 files.
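Concretely, the fix only appends one clause to the end of the query; a minimal sketch of the rewritten tail (the inner union all subquery t3 is unchanged from the SQL above, elided here for brevity):

insert overwrite table ugc_test_new.dwd_cpa_act_user_df_txt partition (dt = '{@date}', app = 'zuoyebang', type, salt)
select
    device_id,
    type,
    salt
from
    ( ... the same four union all branches as above ... ) t3
group by
    device_id,
    type,
    salt
-- New: repartition the aggregated rows by the dynamic partition columns,
-- so each (type, salt) pair is written by exactly one task.
distribute by
    type,
    salt

One trade-off worth noting: distribute by type, salt funnels an entire partition's data through a single task, so a heavily skewed (type, salt) pair can become a straggler. A common refinement in that case is to distribute by the partition columns plus a small bounded random key, trading a few extra files per partition for more write parallelism.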