Workflows on the production cluster have been failing randomly and frequently, and the error messages are all similar: SQL plan execution failures caused by insufficient memory. A typical error looks like this:
21/01/29 04:02:54 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange RoundRobinPartitioning(1)
+- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, cast(city_id#276 as int) AS city_id#175, city_name#278, cast(province_id#282 as int) AS province_id#176, province_name#283, regexp_replace(REPORT_DATE#160, -, ) AS REPORT_DATE#177, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8]
+- *BroadcastHashJoin [province_id#181], [province_id#282], LeftOuter, BuildRight
:- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8, province_id#181, city_id#276, city_name#278]
: +- *BroadcastHashJoin [city_id#234], [city_id#276], LeftOuter, BuildRight
: :- *Project [RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8, province_id#181, city_id#234]
: : +- *BroadcastHashJoin [sst_code#16], [sst_code#185], LeftOuter, BuildRight
: : :- *HashAggregate(keys=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160], functions=[sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 汇总) && (REPAIR_TYPE_NAME#21 = 汇总)) THEN coalesce(SUM_COME_COUNT#103, 0) ELSE 0 END), sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 质量担保保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END), sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 常规保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END)], output=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, SUM_COME_COUNT#6, WARRANTY_SUM_REPAIR_COUNT#7, MAINTENANCE_SUM_REPAIR_COUNT#8])
: : : +- Exchange hashpartitioning(RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, 200)
: : : +- *HashAggregate(keys=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160], functions=[partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 汇总) && (REPAIR_TYPE_NAME#21 = 汇总)) THEN coalesce(SUM_COME_COUNT#103, 0) ELSE 0 END), partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 质量担保保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END), partial_sum(CASE WHEN ((GROUP_REPAIR_TYPE_NAME#19 = 保养) && (REPAIR_TYPE_NAME#21 = 常规保养)) THEN coalesce(SUM_REPAIR_TYPE_COUNT#105, 0) ELSE 0 END)], output=[RSSC_NAME#11, RSSC_CODE#12, SST_NAME#13, SST_CODE#16, REPORT_DATE#160, sum#328, sum#329, sum#330])
: : : +- *Project [rssc_name#11, rssc_code#12, sst_name#13, sst_code#16, group_repair_type_name#19, repair_type_name#21, sum_come_count#103, sum_repair_type_count#105, report_date#160]
: : : +- *Filter ((((isnotnull(biz_group#18) && isnotnull(brand_code#15)) && (biz_group#18 = 维修业务)) && (substring(sst_code#16, -1, 2147483647) = 0)) && (brand_code#15 = VW))
: : : +- HiveTableScan [sum_repair_type_count#105, group_repair_type_name#19, rssc_name#11, sum_come_count#103, sst_name#13, brand_code#15, rssc_code#12, biz_group#18, repair_type_name#21, report_date#160, sst_code#16], MetastoreRelation asmp, wd_tt_manage_business_date, [((partition_date#9 = 202012) || (partition_date#9 = 202101))]
: : +- BroadcastExchange HashedRelationBroadcastMode(List(input[1, string, true]))
: : +- HiveTableScan [province_id#181, sst_code#185, city_id#234], MetastoreRelation ods___asmp2___sbpopt, tm_sst
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- HiveTableScan [city_id#276, city_name#278], MetastoreRelation ods___asmp2___sbpopt, tm_city
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
+- HiveTableScan [province_id#282, province_name#283], MetastoreRelation ods___asmp2___sbpopt, tm_province
My initial suspicion was that a recent configuration change on the production cluster had reset some settings to their defaults, since this error had never occurred before.
So I searched online and adjusted the Spark configuration parameters below, which temporarily resolved the problem:
--conf spark.yarn.driver.memoryOverhead=4096m \
--conf spark.yarn.executor.memoryOverhead=4096m \
--conf spark.sql.broadcastTimeout=1000 \
--conf spark.network.timeout=10000000 \
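For context, these flags are passed to the workflow's `spark-submit` invocation. A sketch is below; the master class name, JAR name, and resource sizes are placeholders, not the actual job:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.ReportJob \
  --conf spark.yarn.driver.memoryOverhead=4096m \
  --conf spark.yarn.executor.memoryOverhead=4096m \
  --conf spark.sql.broadcastTimeout=1000 \
  --conf spark.network.timeout=10000000 \
  report-job.jar
```

Note that `spark.sql.broadcastTimeout` is in seconds (default 300), and a bare number for `spark.network.timeout` is also interpreted as seconds. On Spark 2.3+ the `spark.yarn.*.memoryOverhead` keys are deprecated in favor of `spark.driver.memoryOverhead` and `spark.executor.memoryOverhead`.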
In hindsight: with limited cluster resources, submitting two workflows at the same time easily leads to resource shortage, and a task that waits too long for resources will fail. Hence the configuration changes above, which increase the timeouts (and add memory-overhead headroom).
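If the error recurs, one way to confirm the memory-pressure theory is to pull the aggregated YARN container logs for the failed application and look for container-kill messages; a sketch (the application id is a placeholder to fill in from the ResourceManager UI):

```shell
# Fetch aggregated logs for the failed application and search for
# YARN container kills caused by exceeding memory limits
yarn logs -applicationId <application_id> \
  | grep -iE "killed by yarn|exceeding memory limits"
```

If a line like "Container killed by YARN for exceeding memory limits" appears, raising the executor memory overhead (as done above) is the standard remedy.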