Spark SQL multidimensional analysis optimization: increasing file-read parallelism

This post walks through a Spark SQL multidimensional-analysis tuning problem: the base table is small, yet the job takes a long time. Reading the logs and the code shows the culprit is too few tasks, so processing is slow. The fix is to adjust `spark.sql.files.maxPartitionBytes` and `parquet.block.size` to increase task parallelism and spread the data evenly. After tuning, the job's run time dropped from 60 minutes to 2 minutes 40 seconds.
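The split-size math behind `spark.sql.files.maxPartitionBytes` is not shown in the article, but in Spark 2.x the read-side split size is computed from that setting together with `spark.sql.files.openCostInBytes` and the default parallelism (see `FilePartition.maxSplitBytes` in Spark's source). A minimal Python sketch of that logic; the 2 GB input, 16 files, and parallelism of 10 are illustrative assumptions, not figures from the article:

```python
# Sketch of how Spark SQL sizes file splits on read (mirrors the
# FilePartition.maxSplitBytes logic in Spark 2.x, simplified).
def max_split_bytes(total_file_bytes, num_files,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024,     # spark.sql.files.openCostInBytes
                    default_parallelism=10):
    # Each file is charged an "open cost" on top of its real size.
    total_bytes = total_file_bytes + num_files * open_cost_in_bytes
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Illustrative numbers: 2 GB of input spread over 16 files.
default_split = max_split_bytes(2 * 1024**3, 16)                                  # 128 MB cap binds
tuned_split = max_split_bytes(2 * 1024**3, 16, max_partition_bytes=32 * 1024**2)  # 32 MB cap binds
```

The number of read tasks is roughly `total_bytes / split_size`, so lowering the cap from 128 MB to 32 MB yields about 4x as many tasks in this scenario.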

This post covers another scenario in multidimensional analysis optimization.

Outline

1. Problem background

2. Approach to the solution

3. The fix (how Spark SQL handles parquet row groups and partitioning, parameter tests, solution)

4. Results

1. Problem background

The query is as follows:

select
    netease_user,
    if(campaign_id is null, 'all', campaign_id) as campaign_id,
    if(spec_id is null, 'all', spec_id) as spec_id,
    if(app_bundle is null, 'all', app_bundle) as app_bundle,
    if(render_name is null, 'all', render_name) as render_name,
    platform,
    sum(bidfloor) as success_bidfloor,
    count(distinct clk_request_id) as click_pv,
    count(distinct exp_deviceid) as exp_uv,
    count(distinct exp_request_id) as exp_pv,
    count(distinct clk_deviceid) as click_uv,
    round(sum(case when winprice < 0 then 0 else winprice end) / 1000, 4) as cost
from
(
    select distinct
        nvl(netease_user, 'true') as netease_user,
        nvl(render_name, 'null') as render_name,
        platform,
        nvl(campaign_id, 'null') as campaign_id,
        nvl(spec_id, 'null') as spec_id,
        nvl(app_bundle, 'null') as app_bundle,
        clk_request_id, exp_deviceid, exp_request_id, clk_deviceid, winprice, bidfloor
    from table_a where day = '20190815' and platform is not null
) tmp
group by netease_user, campaign_id, spec_id, app_bundle, render_name, platform
grouping sets(
    (netease_user, platform),

    (netease_user, platform, campaign_id),
    (netease_user, platform, spec_id),
    (netease_user, platform, app_bundle),
    (netease_user, platform, render_name),

    (netease_user, platform, campaign_id, spec_id),
    (netease_user, platform, campaign_id, app_bundle),
    (netease_user, platform, campaign_id, render_name),
    (netease_user, platform, spec_id, app_bundle),
    (netease_user, platform, spec_id, render_name),
    (netease_user, platform, app_bundle, render_name),

    (netease_user, platform, campaign_id, spec_id, app_bundle),
    (netease_user, platform, spec_id, app_bundle, render_name),
    (netease_user, platform, campaign_id, app_bundle, render_name),
    (netease_user, platform, campaign_id, spec_id, render_name),

    (netease_user, campaign_id, spec_id, app_bundle, render_name, platform)
)
;

The overall logic is much the same as in the previous post, 【spark sql多维分析优化——细节是魔鬼】 ("Spark SQL multidimensional analysis optimization: the devil is in the details").

The difference is that in the previous post the base table table_a was very large, whereas here, as noted in the summary, the table is small yet the job still ran slowly.
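Per the summary above, the fix was to lower `spark.sql.files.maxPartitionBytes` (smaller splits mean more read tasks) and `parquet.block.size` (smaller row groups when writing parquet, so downstream reads can split more finely). A hedged sketch of what that could look like in a Spark SQL session; the 32 MB values are illustrative, not the article's exact settings:

```sql
-- Smaller split size => more read tasks per file (value is illustrative)
SET spark.sql.files.maxPartitionBytes=33554432;  -- 32 MB
-- Smaller row groups when writing parquet output
SET parquet.block.size=33554432;                 -- 32 MB
```

Note that `parquet.block.size` is a Hadoop/Parquet writer setting; it only affects files written after the change, not existing ones.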
