A problem caused by Spark broadcast joins

1. Background

The feature went live in 2022 Q2 as a large spark-sql job; in Q3 it was extended on both the business and technical sides:

  • The business side added functionality in Q3, making the SQL more complex
  • The technical side also made some optimizations in Q3, making the SQL more complex still

2. Analysis steps

2.1 Job YARN log

Application application_1664181807056_6528008 failed 1 times (global limit =8; local limit is =1) due to AM Container for appattempt_1664181807056_6528008_000001 exited with exitCode: 134

Judging from the log, this was an OOM: exit code 134 is 128 + signal 6 (SIGABRT), which usually means the process aborted, for example after the JVM hit an out-of-memory condition.

2.2 Job execution plan

Looking at the stage execution plan next: the job scans table_xx3, starts broadcasting it, and then dies.

2.3 Spark SQL

SELECT t3.*
FROM (
    SELECT *
    FROM (
       SELECT *
       FROM (
           SELECT *, ROW_NUMBER() OVER (PARTITION BY region, version_id, shop_id, item_id, model_id ORDER BY enqueue_time DESC) AS rk
           FROM table_xx1
           WHERE version_id IN (1489, 1485, 1486)
              AND region = 'ID'
              AND rule = 0
              AND enqueue_time >= 1666832400
              AND enqueue_time < 1666834200
              AND dt IN (current_date, date_sub(current_date, 1))
       ) event
       WHERE event.rk = 1
    ) t1
       INNER JOIN (
           SELECT *
           FROM table_xx2
           WHERE time_range = '01-07'
              AND dt = '2022-10-26'
              AND region IN ('ID')
              AND view_count >= 5
       ) t2
       ON t1.shop_id = t2.shop_id
           AND t1.item_id = t2.item_id
           AND t1.region = t2.region
) t3
LEFT JOIN (
    SELECT *
    FROM table_xx3
    WHERE expiration_time > unix_timestamp(current_timestamp)
       AND expiration_date >= current_date
       AND region IN ('ID')
) t4 ON t3.user_id = t4.user_id
    AND t3.region = t4.region
    AND t3.journey_id = t4.journey_id
WHERE t4.user_id IS NULL
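
Reading the query from the inside out: t1 keeps only the most recent table_xx1 event per (region, version_id, shop_id, item_id, model_id) within a half-hour window, t2 filters table_xx2 down to rows with at least 5 views, t3 is their inner join on shop_id/item_id/region, and the final LEFT JOIN with WHERE t4.user_id IS NULL is an anti-join that drops every row whose (user_id, region, journey_id) already exists in table_xx3.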

2.4 Verification with spark-sql

Log on to the Spark driver machine, start spark-sql, and re-run the problematic SQL; it fails with exit code 143:

22/10/27 10:23:52 ERROR YarnScheduler: Lost executor 182 on ip-10-130-81-171.idata-server.shopee.io: Container from a bad node: container_e257_1664260775537_6335017_01_000294 on host: ip-10-130-81-171.idata-server.shopee.io. Exit status: 143. Diagnostics: [2022-10-27 10:23:48.385]Container killed on request. Exit code is 143
[2022-10-27 10:23:48.463]Container exited with a non-zero exit code 143.
[2022-10-27 10:23:48.491]Killed by external signal
.
Interrupting... Be patient, this might take some time.
Press Ctrl+C again to kill JVM

This again points to an OOM: exit code 143 is 128 + signal 15 (SIGTERM), i.e. the container was killed by an external signal, which is consistent with YARN killing a container for exceeding its memory limits.

3. Root cause analysis

3.1 table_xx1 has too much data

table_xx1 holds dynamic data that is joined against the static data in table_xx2 (regenerated daily as T-1 data; the 1-30 day window holds roughly 8 billion rows). Because table_xx2 is fixed and very large, any fluctuation in the table_xx1 volume directly drives up the cost of the join.

Counting the most recent half hour of table_xx1 data gives nearly 4 million rows, which is far too much!

SELECT *
    FROM (
        SELECT *
        FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY region, version_id, shop_id, item_id, model_id ORDER BY enqueue_time DESC) AS rk
            FROM table_xx1
            WHERE version_id IN (1489, 1485, 1486)
                AND region = 'ID'
                AND rule = 0
                AND enqueue_time >= 1666832400
                AND enqueue_time < 1666834200
                AND dt IN (current_date, date_sub(current_date, 1))
        ) event
        WHERE event.rk = 1
    ) t1
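
This can be confirmed directly from spark-sql by wrapping the deduplicated subquery above in a count (a quick check reusing the same filters):

-- Count the deduplicated half-hour slice of table_xx1 that feeds the join
SELECT COUNT(*)
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY region, version_id, shop_id, item_id, model_id ORDER BY enqueue_time DESC) AS rk
    FROM table_xx1
    WHERE version_id IN (1489, 1485, 1486)
        AND region = 'ID'
        AND rule = 0
        AND enqueue_time >= 1666832400
        AND enqueue_time < 1666834200
        AND dt IN (current_date, date_sub(current_date, 1))
) event
WHERE event.rk = 1;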

3.2 Auto-broadcast join is enabled

Looking at the execution plan and the log, the last part of the query (the table_xx3 subquery) is processed first and broadcast: in the plan below it is the build side (BuildRight) of the BroadcastHashJoin, i.e. the relation that gets collected and shipped to every executor.

22/10/27 10:19:15 WARN InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Project [user_id#20L, view_most#21, view_count#22, region#25, item_id#6L, model_id#8L, shop_id#7L, event_type#11, related_database_type#12, event_time#13L, version_id#9L, cur_element_id#1L, journey_task_id#2L, journey_id#3L, flash_sale_id#4L]
+- Filter isnull(user_id#26L)
   +- BroadcastHashJoin [user_id#20L, region#25, journey_id#3L], [user_id#26L, region#34, journey_id#30L], LeftOuter, BuildRight, false
      :- Project [user_id#20L, view_most#21, view_count#22, region#25, item_id#6L, model_id#8L, shop_id#7L, event_type#11, related_database_type#12, event_time#13L, version_id#9L, cast(get_json_object(get_json_object(extra#15, $.get_matched_version_biz_info), $.cur_element_id) as bigint) AS cur_element_id#1L, cast(get_json_object(get_json_object(extra#15, $.get_matched_version_biz_info), $.journey_task_id) as bigint) AS journey_task_id#2L, cast(get_json_object(get_json_object(extra#15, $.get_matched_version_biz_info), $.journey_id) as bigint) AS journey_id#3L, cast(get_json_object(extra#15, $.flash_sale_id) as bigint) AS flash_sale_id#4L]
      :  +- SortMergeJoin [shop_id#7L, item_id#6L, region#10], [shop_id#19L, item_id#18L, region#25], Inner
      :     :- Project [item_id#6L, model_id#8L, shop_id#7L, version_id#9L, region#10, event_type#11, related_database_type#12, event_time#13L, extra#15]
      :     :  +- Filter (isnotnull(rk#0) AND (rk#0 = 1))
      :     :     +- Window [row_number() windowspecdefinition(region#10, version_id#9L, shop_id#7L, item_id#6L, model_id#8L, enqueue_time#14L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#0], [region#10, version_id#9L, shop_id#7L, item_id#6L, model_id#8L], [enqueue_time#14L DESC NULLS LAST]
      :     :        +- Project [item_id#6L, model_id#8L, shop_id#7L, version_id#9L, region#10, event_type#11, related_database_type#12, event_time#13L, extra#15, enqueue_time#14L]
      :     :           +- Filter (((((((isnotnull(region#10) AND isnotnull(enqueue_time#14L)) AND version_id#9L IN (1489,1485,1486)) AND (region#10 = ID)) AND (enqueue_time#14L >= 1666832400)) AND (enqueue_time#14L < 1666834200)) AND isnotnull(shop_id#7L)) AND isnotnull(item_id#6L))
      :     :              +- FileScan parquet opa_crm_datasend.dwd_event_trigger_rt_live[item_id#6L,shop_id#7L,model_id#8L,version_id#9L,region#10,event_type#11,related_database_type#12,event_time#13L,enqueue_time#14L,extra#15,dt#16,rule#17] Batched: true, DataFilters: [isnotnull(region#10), isnotnull(enqueue_time#14L), version_id#9L IN (1489,1485,1486), (region#10..., Format: Parquet, Location: InMemoryFileIndex[hdfs://R2/projects/opa_crm_datasend/hive/opa_crm_datasend/dwd_event_trigger_rt_..., PartitionFilters: [isnotnull(rule#17), (rule#17 = 0), dt#16 IN (2022-10-27,2022-10-26)], PushedFilters: [IsNotNull(region), IsNotNull(enqueue_time), In(version_id, [1489,1485,1486]), EqualTo(region,ID)..., ReadSchema: struct<item_id:bigint,shop_id:bigint,model_id:bigint,version_id:bigint,region:string,event_type:i...
      :     +- Project [item_id#18L, shop_id#19L, user_id#20L, view_most#21, view_count#22, region#25]
      :        +- Filter (((isnotnull(view_count#22) AND (view_count#22 >= 5)) AND isnotnull(shop_id#19L)) AND isnotnull(item_id#18L))
      :           +- FileScan parquet opa_crm_datasend.ads_view_item_agg_live[item_id#18L,shop_id#19L,user_id#20L,view_most#21,view_count#22,dt#23,time_range#24,region#25] Batched: true, DataFilters: [isnotnull(view_count#22), (view_count#22 >= 5), isnotnull(shop_id#19L), isnotnull(item_id#18L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://R2/projects/opa_crm_datasend/hive/opa_crm_datasend/ads_view_item_agg_liv..., PartitionFilters: [isnotnull(time_range#24), isnotnull(dt#23), isnotnull(region#25), (time_range#24 = 01-07), (dt#2..., PushedFilters: [IsNotNull(view_count), GreaterThanOrEqual(view_count,5), IsNotNull(shop_id), IsNotNull(item_id)], ReadSchema: struct<item_id:bigint,shop_id:bigint,user_id:bigint,view_most:int,view_count:int>
      :                 +- Project [item_id#6L, model_id#8L, shop_id#7L, version_id#9L, region#10, event_type#11, related_database_type#12, event_time#13L, extra#15]
      :                    +- Filter (isnotnull(rk#0) AND (rk#0 = 1))
      :                       +- Window [row_number() windowspecdefinition(region#10, version_id#9L, shop_id#7L, item_id#6L, model_id#8L, enqueue_time#14L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#0], [region#10, version_id#9L, shop_id#7L, item_id#6L, model_id#8L], [enqueue_time#14L DESC NULLS LAST]
      :                          +- Project [item_id#6L, model_id#8L, shop_id#7L, version_id#9L, region#10, event_type#11, related_database_type#12, event_time#13L, extra#15, enqueue_time#14L]
      :                             +- Filter ((((((((((isnotnull(region#10) AND isnotnull(rule#17)) AND isnotnull(enqueue_time#14L)) AND version_id#9L IN (1489,1485,1486)) AND (region#10 = ID)) AND (rule#17 = 0)) AND (enqueue_time#14L >= 1666832400)) AND (enqueue_time#14L < 1666834200)) AND dt#16 IN (2022-10-27,2022-10-26)) AND isnotnull(shop_id#7L)) AND isnotnull(item_id#6L))
      :                                +- Relation[item_id#6L,shop_id#7L,model_id#8L,version_id#9L,region#10,event_type#11,related_database_type#12,event_time#13L,enqueue_time#14L,extra#15,dt#16,rule#17] parquet
      +- Project [user_id#26L, region#34, journey_id#30L]
         +- Filter (((isnotnull(expiration_time#27L) AND (expiration_time#27L > 1666837153)) AND isnotnull(user_id#26L)) AND isnotnull(journey_id#30L))
            +- Scan hive opa_crm_datasend.dwd_crm_user_reentry_rt_live [expiration_time#27L, journey_id#30L, region#34, user_id#26L], HiveTableRelation [`opa_crm_datasend`.`dwd_crm_user_reentry_rt_live`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [user_id#26L, expiration_time#27L, re_entry_type#28, re_entry_key#29L, journey_id#30L, version_id..., Partition Cols: [expiration_date#33, region#34], Pruned Partitions: [(expiration_date=2022-10-27, region=ID), (expiration_date=2022-10-28, region=ID), (expiration_da...], [isnotnull(expiration_date#33), isnotnull(region#34), (cast(expiration_date#33 as date) >= 19292), (region#34 = ID)]
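
Spark plans a BroadcastHashJoin automatically whenever the estimated size of one join side falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default). The value currently in effect can be checked from the same spark-sql session:

-- Show the current threshold; -1 means auto-broadcast is disabled
SET spark.sql.autoBroadcastJoinThreshold;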

4. Solutions

4.1 Limit the table_xx1 data

Reduce the amount of table_xx1 data entering the join, for example by tightening the filters or processing smaller time windows.

4.2 Disable the broadcast join

Option 1: set a parameter

SET spark.sql.autoBroadcastJoinThreshold = -1;
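
The same setting can also be supplied when the application is submitted, e.g. --conf spark.sql.autoBroadcastJoinThreshold=-1 on spark-submit, so it takes effect for the whole job rather than a single session.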

Re-running the SQL, the BroadcastHashJoin has been replaced by a SortMergeJoin:

22/10/27 10:30:04 WARN InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Project [user_id#106L, view_most#107, view_count#108, region#111, item_id#92L, model_id#94L, shop_id#93L, event_type#97, related_database_type#98, event_time#99L, version_id#95L, cur_element_id#87L, journey_task_id#88L, journey_id#89L, flash_sale_id#90L]
+- Filter isnull(user_id#112L)
   +- SortMergeJoin [user_id#106L, region#111, journey_id#89L], [user_id#112L, region#120, journey_id#116L], LeftOuter
      :- Project [user_id#106L, view_most#107, view_count#108, region#111, item_id#92L, model_id#94L, shop_id#93L, event_type#97, related_database_type#98, event_time#99L, version_id#95L, cast(get_json_object(get_json_object(extra#101, $.get_matched_version_biz_info), $.cur_element_id) as bigint) AS cur_element_id#87L, cast(get_json_object(get_json_object(extra#101, $.get_matched_version_biz_info), $.journey_task_id) as bigint) AS journey_task_id#88L, cast(get_json_object(get_json_object(extra#101, $.get_matched_version_biz_info), $.journey_id) as bigint) AS journey_id#89L, cast(get_json_object(extra#101, $.flash_sale_id) as bigint) AS flash_sale_id#90L]
      :  +- SortMergeJoin [shop_id#93L, item_id#92L, region#96], [shop_id#105L, item_id#104L, region#111], Inner
      :     :- Project [item_id#92L, model_id#94L, shop_id#93L, version_id#95L, region#96, event_type#97, related_database_type#98, event_time#99L, extra#101]
      :     :  +- Filter (isnotnull(rk#86) AND (rk#86 = 1))
      :     :     +- Window [row_number() windowspecdefinition(region#96, version_id#95L, shop_id#93L, item_id#92L, model_id#94L, enqueue_time#100L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#86], [region#96, version_id#95L, shop_id#93L, item_id#92L, model_id#94L], [enqueue_time#100L DESC NULLS LAST]
      :     :        +- Project [item_id#92L, model_id#94L, shop_id#93L, version_id#95L, region#96, event_type#97, related_database_type#98, event_time#99L, extra#101, enqueue_time#100L]
      :     :           +- Filter (((((((isnotnull(region#96) AND isnotnull(enqueue_time#100L)) AND version_id#95L IN (1489,1485,1486)) AND (region#96 = ID)) AND (enqueue_time#100L >= 1666832400)) AND (enqueue_time#100L < 1666834200)) AND isnotnull(shop_id#93L)) AND isnotnull(item_id#92L))
      :     :              +- FileScan parquet opa_crm_datasend.dwd_event_trigger_rt_live[item_id#92L,shop_id#93L,model_id#94L,version_id#95L,region#96,event_type#97,related_database_type#98,event_time#99L,enqueue_time#100L,extra#101,dt#102,rule#103] Batched: true, DataFilters: [isnotnull(region#96), isnotnull(enqueue_time#100L), version_id#95L IN (1489,1485,1486), (region#..., Format: Parquet, Location: InMemoryFileIndex[hdfs://R2/projects/opa_crm_datasend/hive/opa_crm_datasend/dwd_event_trigger_rt_..., PartitionFilters: [isnotnull(rule#103), (rule#103 = 0), dt#102 IN (2022-10-27,2022-10-26)], PushedFilters: [IsNotNull(region), IsNotNull(enqueue_time), In(version_id, [1489,1485,1486]), EqualTo(region,ID)..., ReadSchema: struct<item_id:bigint,shop_id:bigint,model_id:bigint,version_id:bigint,region:string,event_type:i...
      :     +- Project [item_id#104L, shop_id#105L, user_id#106L, view_most#107, view_count#108, region#111]
      :        +- Filter (((isnotnull(view_count#108) AND (view_count#108 >= 5)) AND isnotnull(shop_id#105L)) AND isnotnull(item_id#104L))
      :           +- FileScan parquet opa_crm_datasend.ads_view_item_agg_live[item_id#104L,shop_id#105L,user_id#106L,view_most#107,view_count#108,dt#109,time_range#110,region#111] Batched: true, DataFilters: [isnotnull(view_count#108), (view_count#108 >= 5), isnotnull(shop_id#105L), isnotnull(item_id#104L)], Format: Parquet, Location: InMemoryFileIndex[hdfs://R2/projects/opa_crm_datasend/hive/opa_crm_datasend/ads_view_item_agg_liv..., PartitionFilters: [isnotnull(time_range#110), isnotnull(dt#109), isnotnull(region#111), (time_range#110 = 01-07), (..., PushedFilters: [IsNotNull(view_count), GreaterThanOrEqual(view_count,5), IsNotNull(shop_id), IsNotNull(item_id)], ReadSchema: struct<item_id:bigint,shop_id:bigint,user_id:bigint,view_most:int,view_count:int>
      :                 +- Project [item_id#92L, model_id#94L, shop_id#93L, version_id#95L, region#96, event_type#97, related_database_type#98, event_time#99L, extra#101]
      :                    +- Filter (isnotnull(rk#86) AND (rk#86 = 1))
      :                       +- Window [row_number() windowspecdefinition(region#96, version_id#95L, shop_id#93L, item_id#92L, model_id#94L, enqueue_time#100L DESC NULLS LAST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rk#86], [region#96, version_id#95L, shop_id#93L, item_id#92L, model_id#94L], [enqueue_time#100L DESC NULLS LAST]
      :                          +- Project [item_id#92L, model_id#94L, shop_id#93L, version_id#95L, region#96, event_type#97, related_database_type#98, event_time#99L, extra#101, enqueue_time#100L]
      :                             +- Filter ((((((((((isnotnull(region#96) AND isnotnull(rule#103)) AND isnotnull(enqueue_time#100L)) AND version_id#95L IN (1489,1485,1486)) AND (region#96 = ID)) AND (rule#103 = 0)) AND (enqueue_time#100L >= 1666832400)) AND (enqueue_time#100L < 1666834200)) AND dt#102 IN (2022-10-27,2022-10-26)) AND isnotnull(shop_id#93L)) AND isnotnull(item_id#92L))
      :                                +- Relation[item_id#92L,shop_id#93L,model_id#94L,version_id#95L,region#96,event_type#97,related_database_type#98,event_time#99L,enqueue_time#100L,extra#101,dt#102,rule#103] parquet
      +- Project [user_id#112L, region#120, journey_id#116L]
         +- Filter (((isnotnull(expiration_time#113L) AND (expiration_time#113L > 1666837803)) AND isnotnull(user_id#112L)) AND isnotnull(journey_id#116L))
            +- Scan hive opa_crm_datasend.dwd_crm_user_reentry_rt_live [expiration_time#113L, journey_id#116L, region#120, user_id#112L], HiveTableRelation [`opa_crm_datasend`.`dwd_crm_user_reentry_rt_live`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [user_id#112L, expiration_time#113L, re_entry_type#114, re_entry_key#115L, journey_id#116L, versi..., Partition Cols: [expiration_date#119, region#120], Pruned Partitions: [(expiration_date=2022-10-27, region=ID), (expiration_date=2022-10-28, region=ID), (expiration_da...], [isnotnull(expiration_date#119), isnotnull(region#120), (cast(expiration_date#119 as date) >= 19292), (region#120 = ID)]

This is the solution currently used in production.
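
If the cluster runs Spark 3.0 or later, the same effect can also be scoped to just this join with a join strategy hint, instead of turning off auto-broadcast for the whole session. A minimal sketch with hypothetical table names (events and blocked_users stand in for the real t3/t4 inputs):

-- The MERGE hint forces a sort-merge join for the hinted side, even if it is
-- small enough to broadcast; other queries keep the default broadcast behaviour.
SELECT /*+ MERGE(b) */ a.*
FROM events a
LEFT JOIN blocked_users b
    ON a.user_id = b.user_id AND a.region = b.region AND a.journey_id = b.journey_id
WHERE b.user_id IS NULL;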

Option 2: restructure the job

Materialize the two sides of the final join as intermediate tables, then join them:

CREATE TABLE view_item_test AS
SELECT *
FROM (
    SELECT *
    FROM (
        SELECT item_id, model_id, shop_id, version_id, region
            , event_type, related_database_type, event_time, extra, ROW_NUMBER() OVER (PARTITION BY region, version_id, shop_id, item_id, model_id ORDER BY enqueue_time DESC) AS rk
        FROM table_xx1
        WHERE version_id IN (1489, 1485, 1486)
            AND region = 'ID'
            AND rule = 0
            AND enqueue_time >= 1666832400
            AND enqueue_time < 1666834200
            AND dt IN (current_date, date_sub(current_date, 1))
    ) event
    WHERE event.rk = 1
) t1
INNER JOIN (
    SELECT item_id, shop_id, user_id, view_most, view_count, region
    FROM table_xx2
    WHERE time_range = '01-07'
        AND dt = '2022-10-26'
        AND region IN ('ID')
        AND view_count >= 5
) t2
ON t1.shop_id = t2.shop_id
    AND t1.item_id = t2.item_id
    AND t1.region = t2.region
;

CREATE TABLE user_reentry_test AS
SELECT user_id, region, journey_id
FROM table_xx3
WHERE expiration_time > unix_timestamp(current_timestamp)
    AND expiration_date >= current_date
    AND region IN ('ID')
;

CREATE TABLE view_item_result AS
SELECT aa.*
FROM (
    SELECT * FROM view_item_test
) AS aa
LEFT JOIN (SELECT * FROM user_reentry_test) AS bb
    ON aa.user_id = bb.user_id AND aa.region = bb.region AND aa.journey_id = bb.journey_id
WHERE bb.user_id IS NULL;
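
If this route is taken, the two intermediate tables are only needed for the duration of the run, so a cleanup step at the end keeps things tidy (housekeeping added here for completeness, not part of the original job):

-- Remove the intermediate tables once view_item_result has been written
DROP TABLE IF EXISTS view_item_test;
DROP TABLE IF EXISTS user_reentry_test;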
