Workflows keep failing with seemingly random SQL errors


Workflows on the production cluster kept failing at random, always with roughly the same error: a Spark SQL plan failure (a TreeNodeException) that comes down to insufficient memory. The error message:

21/01/29 04:02:54 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange RoundRobinPartitioning(1)
+- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, cast(city_id#276 as int) AS city_id#175, city_name#278, cast(province_id#282 as int) AS province_id#176, province_name#283, regexp_replace(REPORT_DATE#160, -, ) AS REPORT_DATE#177, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8]
   +- *BroadcastHashJoin [province_id#181], [province_id#282], LeftOuter, BuildRight
      :- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8, province_id#181, city_id#276, city_name#278]
      :  +- *BroadcastHashJoin [city_id#234], [city_id#276], LeftOuter, BuildRight
      :     :- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8, province_id#181, city_id#234]
      :     :  +- *BroadcastHashJoin [sst_code#16], [sst_code#185], LeftOuter, BuildRight
      :     :     :- *HashAggregate(keys=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160], functions=[sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 汇总) && (REPAIR_TYPE_NAME#21 = 汇总)) THEN coalesce(SUM_COME_COUNT#103, 0) ELSE 0 END), sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 质量担保保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END), sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 常规保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END)], output=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8])
      :     :     :  +- Exchange hashpartitioning(RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, 200)
      :     :     :     +- *HashAggregate(keys=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160], functions=[partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 汇总) && (REPAIR_TYPE_NAME#21 = 汇总)) THEN coalesce(SUM_COME_COUNT#103, 0) ELSE 0 END), partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 质量担保保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END), partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 常规保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END)], output=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, sum#328, sum#329, sum#330])
      :     :     :        +- *Project [rssc_name#11, rssc_code#12, sst_name#13, sst_code#16, group_repair_type_name#19, repair_type_name#21, sum_come_count#103, sum_repair_type_count#105, report_date#160]
      :     :     :           +- *Filter ((((isnotnull(biz_group#18) && isnotnull(brand_code#15)) && (biz_group#18 = 维修业务)) && (substring(sst_code#16, -1, 2147483647) = 0)) && (brand_code#15 = VW))
      :     :     :              +- HiveTableScan [sum_repair_type_count#105, group_repair_type_name#19, rssc_name#11, sum_come_count#103, sst_name#13, brand_code#15, rssc_code#12, biz_group#18, repair_type_name#21, report_date#160, sst_code#16], MetastoreRelation asmp, wd_tt_manage_business_date, [((partition_date#9 = 202012) || (partition_date#9 = 202101))]
      :     :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]))
      :     :        +- HiveTableScan [province_id#181, sst_code#185, city_id#234], MetastoreRelation ods___asmp2___sbpopt, tm_sst
      :     +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
      :        +- HiveTableScan [city_id#276, city_name#278], MetastoreRelation ods___asmp2___sbpopt, tm_city
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
         +- HiveTableScan [province_id#282, province_name#283], MetastoreRelation ods___asmp2___sbpopt, tm_province
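
The TreeNodeException above is only a wrapper around the real failure; the underlying cause (for this class of error, typically a broadcast timeout or an executor running out of memory) sits further down in the YARN application log. One way to dig it out with the standard YARN CLI, assuming log aggregation is enabled (the application ID below is a placeholder):

# Fetch the aggregated application log and search for the root cause.
# Replace the application ID with the one from the ResourceManager UI
# or the workflow launcher's output.
yarn logs -applicationId application_1611878400000_12345 \
  | grep -A 5 -iE 'TimeoutException|OutOfMemoryError|Container killed by YARN'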

My first guess was that a recent configuration change on the production cluster had reset some settings back to their defaults, since this error had never shown up before. After some searching online I adjusted the following Spark parameters, which worked around the problem for the time being:

--conf spark.yarn.driver.memoryOverhead=4096m \
--conf spark.yarn.executor.memoryOverhead=4096m \
--conf spark.sql.broadcastTimeout=1000 \
--conf spark.network.timeout=10000000 \
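
For context, here is what a full submission might look like with these settings in place. This is a sketch only: the --master/--deploy-mode pair matches the yarn.ApplicationMaster in the stack trace above, but the class name, jar, and resource sizes are hypothetical.

# Sketch of a full submission with the fix applied. The class name, jar,
# and memory/executor sizes are placeholders, not values from this job;
# only the four --conf lines come from the actual fix.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ReportJob \
  --driver-memory 4g \
  --executor-memory 8g \
  --num-executors 10 \
  --conf spark.yarn.driver.memoryOverhead=4096m \
  --conf spark.yarn.executor.memoryOverhead=4096m \
  --conf spark.sql.broadcastTimeout=1000 \
  --conf spark.network.timeout=10000000 \
  report-job.jar

Both timeout values are interpreted as seconds when given without a unit, so this raises the broadcast timeout from its 300-second default to 1000 seconds. Note that on Spark 2.3+ the spark.yarn.*.memoryOverhead keys are deprecated in favor of spark.driver.memoryOverhead and spark.executor.memoryOverhead.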

In hindsight: with limited cluster resources, submitting two workflows at the same time easily leads to a resource shortage, and if a stage waits too long for resources (a broadcast exchange, for example, gives up after spark.sql.broadcastTimeout), the whole job fails. Hence the parameter changes above, which mainly give the job more memory headroom and more time to wait.
