Notes on a Spark error: Failed to allocate a page (67108864 bytes), try again.

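Project scenario:

The business side handed us a requirement that needs a join between two tables: a small table (48 million rows) and a large table (2.6 billion rows). This is a typical small-table/large-table join, so the first thing that came to mind was a Broadcast Join.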


Problem description

1. Get started.
-- sc is the small table
select /*+ BROADCASTJOIN(sc) */
  sc.courseid,
  csc.courseid
from sale_course sc join course_shopping_cart csc
on sc.courseid=csc.courseid
2. Package it, run it on the cluster, and the errors start.
2022-06-22 19:36:56 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:57 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:59 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:04 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 139818 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 178273 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 7 with no recent heartbeats: 162256 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 154289 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 2
3. This looks like insufficient memory, so print the GC log and take a closer look.
2022-06-22T19:32:04.731+0800: [GC (Allocation Failure) [PSYoungGen: 994157K->47291K(1377280K)] 1061069K->240591K(4076032K), 0.2125657 secs] [Times: user=4.51 sys=0.35, real=0.21 secs] 
2022-06-22T19:32:12.667+0800: [GC (Allocation Failure) [PSYoungGen: 1298524K->69107K(1380352K)] 1491823K->776885K(4079104K), 0.4118997 secs] [Times: user=12.93 sys=1.20, real=0.41 secs] 
2022-06-22T19:32:30.661+0800: [GC (Allocation Failure) [PSYoungGen: 1363073K->305779K(1643520K)] 2070852K->1248436K(4342272K), 0.2067380 secs] [Times: user=6.53 sys=0.68, real=0.21 secs] 
2022-06-22T19:32:49.327+0800: [GC (Allocation Failure) [PSYoungGen: 1583420K->380843K(1685504K)] 2526077K->1558689K(4384256K), 0.2134726 secs] [Times: user=6.50 sys=1.14, real=0.21 secs] 
2022-06-22T19:32:57.628+0800: [GC (Allocation Failure) [PSYoungGen: 1677943K->386985K(1469440K)] 2855790K->1938110K(4168192K), 0.1938505 secs] [Times: user=6.17 sys=0.87, real=0.19 secs] 
2022-06-22T19:33:10.943+0800: [GC (Allocation Failure) [PSYoungGen: 1424669K->489773K(1547776K)] 2975793K->2158027K(4246528K), 0.1824065 secs] [Times: user=6.34 sys=0.27, real=0.19 secs] 
2022-06-22T19:33:18.556+0800: [GC (Allocation Failure) [PSYoungGen: 1523628K->501866K(1313280K)] 4240457K->3578994K(5061120K), 0.1838270 secs] [Times: user=5.74 sys=0.84, real=0.18 secs] 
2022-06-22T19:33:19.956+0800: [GC (Allocation Failure) [PSYoungGen: 1214502K->632842K(1397248K)] 4291630K->3972122K(5145088K), 0.2161871 secs] [Times: user=7.20 sys=0.64, real=0.21 secs] 
2022-06-22T19:33:20.172+0800: [Full GC (Ergonomics) [PSYoungGen: 632842K->0K(1397248K)] [ParOldGen: 3339280K->3514303K(4194304K)] 3972122K->3514303K(5591552K), [Metaspace: 136487K->136476K(1177600K)], 0.6284626 secs] [Times: user=6.74 sys=3.98, real=0.63 secs] 
2022-06-22T19:33:22.153+0800: [GC (Allocation Failure) [PSYoungGen: 726892K->459232K(1398272K)] 4241195K->3973535K(5592576K), 0.0348947 secs] [Times: user=0.96 sys=0.00, real=0.04 secs] 
2022-06-22T19:33:23.347+0800: [GC (Allocation Failure) [PSYoungGen: 1158624K->656153K(1398272K)] 4672927K->4367065K(5592576K), 0.1967581 secs] [Times: user=6.70 sys=0.44, real=0.19 secs] 
2022-06-22T19:33:23.544+0800: [Full GC (Ergonomics) [PSYoungGen: 656153K->131072K(1398272K)] [ParOldGen: 3710911K->4169346K(4194304K)] 4367065K->4300418K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 1.7445365 secs] [Times: user=46.91 sys=10.81, real=1.75 secs] 
2022-06-22T19:33:26.442+0800: [Full GC (Ergonomics) [PSYoungGen: 830464K->524355K(1398272K)] [ParOldGen: 4169346K->4169283K(4194304K)] 4999810K->4693638K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.5643075 secs] [Times: user=14.75 sys=0.14, real=0.57 secs] 
2022-06-22T19:33:27.323+0800: [Full GC (Ergonomics) [PSYoungGen: 664059K->589892K(1398272K)] [ParOldGen: 4169283K->4169282K(4194304K)] 4833342K->4759175K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.3743719 secs] [Times: user=10.16 sys=0.05, real=0.38 secs] 
2022-06-22T19:33:27.909+0800: [Full GC (Ergonomics) [PSYoungGen: 699392K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4868674K->4824713K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.4272478 secs] [Times: user=11.16 sys=0.05, real=0.43 secs] 
2022-06-22T19:33:28.382+0800: [Full GC (Ergonomics) [PSYoungGen: 668779K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4838062K->4824713K(5592576K), [Metaspace: 136486K->136486K(1177600K)], 0.2751700 secs] [Times: user=6.67 sys=0.03, real=0.28 secs] 
2022-06-22T19:33:28.657+0800: [Full GC (Allocation Failure) [PSYoungGen: 655430K->655430K(1398272K)] [ParOldGen: 4169282K->4162677K(4194304K)] 4824713K->4818107K(5592576K), [Metaspace: 136486K->135746K(1177600K)], 0.6008903 secs] [Times: user=17.76 sys=0.08, real=0.60 secs] 
2022-06-22T19:33:29.260+0800: [Full GC (Ergonomics) [PSYoungGen: 659800K->655438K(1398272K)] [ParOldGen: 4162677K->4162674K(4194304K)] 4822477K->4818112K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.4037111 secs] [Times: user=46.99 sys=0.27, real=1.40 secs] 
2022-06-22T19:33:30.664+0800: [Full GC (Allocation Failure) [PSYoungGen: 655438K->655431K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818112K->4818105K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.1268273 secs] [Times: user=1.35 sys=0.02, real=0.13 secs] 
2022-06-22T19:33:30.792+0800: [Full GC (Ergonomics) [PSYoungGen: 658317K->655447K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4820992K->4818121K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.2769239 secs] [Times: user=42.48 sys=0.27, real=1.28 secs] 
2022-06-22T19:33:32.069+0800: [Full GC (Allocation Failure) [PSYoungGen: 655447K->655440K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818121K->4818114K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.2098295 secs] [Times: user=2.81 sys=0.02, real=0.21 secs] 
2022-06-22T19:33:32.282+0800: [Full GC (Ergonomics) [PSYoungGen: 657391K->655457K(1398272K)] [ParOldGen: 4162674K->4162673
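To make the GC log above easier to read, here is a minimal sketch that pulls the old-generation occupancy out of a Full GC line. The regular expression and the function name `old_gen_usage` are assumptions for illustration; they match only the ParallelGC line format shown above.

```python
import re

# Matches the ParOldGen section of a ParallelGC "Full GC" line, e.g.
# [ParOldGen: 4169282K->4169282K(4194304K)]
OLD_GEN = re.compile(r"\[ParOldGen: (\d+)K->(\d+)K\((\d+)K\)\]")


def old_gen_usage(line):
    """Return (after_kb, capacity_kb, fraction_used), or None if there is no ParOldGen section."""
    m = OLD_GEN.search(line)
    if m is None:
        return None
    _before, after, capacity = (int(g) for g in m.groups())
    return after, capacity, after / capacity


# One Full GC line copied from the log above.
sample = (
    "2022-06-22T19:33:27.909+0800: [Full GC (Ergonomics) "
    "[PSYoungGen: 699392K->655430K(1398272K)] "
    "[ParOldGen: 4169282K->4169282K(4194304K)] 4868674K->4824713K(5592576K), "
    "[Metaspace: 136485K->136485K(1177600K)], 0.4272478 secs]"
)

after, capacity, fraction = old_gen_usage(sample)
print(f"old gen after Full GC: {after}K of {capacity}K ({fraction:.1%})")
```

On that line the old generation is still more than 99% full after a Full GC, and nothing was reclaimed. That is consistent with executors spending their time in back-to-back Full GCs and then missing heartbeats.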

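Root cause analysis:

At this point I already knew where the problem was: not enough memory. Increasing executor memory and driver memory usually fixes it.
Still, it is worth reviewing how broadcast join works.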

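1. How broadcast join works

Among Spark's join strategies, a Broadcast Hash Join can be used when one table is small enough to be held in memory. The small table is first collected to the driver and then broadcast to every executor. Each partition of the large table then joins locally against its copy of the small table, which avoids a shuffle.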
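The idea can be sketched in plain Python, without Spark. This is a toy model, not Spark's implementation: the helper names `build_hash_table` and `probe_partition` are hypothetical, and the in-memory lists stand in for the `sale_course` and `course_shopping_cart` tables.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Build side: index the small table by join key (this is what gets broadcast)."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def probe_partition(partition, broadcast_table, key):
    """Probe side: each large-table partition joins locally, with no shuffle."""
    for row in partition:
        for match in broadcast_table.get(row[key], []):
            yield (match, row)


# Toy stand-ins: sale_course is the small table,
# course_shopping_cart is the large table split into two partitions.
sale_course = [{"courseid": 1}, {"courseid": 2}]
cart_partitions = [
    [{"courseid": 1}, {"courseid": 3}],
    [{"courseid": 2}, {"courseid": 2}],
]

broadcast_table = build_hash_table(sale_course, "courseid")
joined = [
    pair
    for part in cart_partitions
    for pair in probe_partition(part, broadcast_table, "courseid")
]
print(len(joined))
```

The important point is that the whole hash table must fit in memory on the driver and on every executor, which is exactly the constraint that matters in the rest of this post.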

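#1 Automatic broadcast via a configuration parameter
The automatic broadcast threshold defaults to 10MB and is controlled by spark.sql.autoBroadcastJoinThreshold.
new SparkConf().set("spark.sql.autoBroadcastJoinThreshold", "10m")  // enable, with a 10 MB threshold
new SparkConf().set("spark.sql.autoBroadcastJoinThreshold", "-1")   // disable automatic broadcast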

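#2 Force a broadcast join
# SQL hint approach
# sc must be the small table in the join
# Any one of these three equivalent hints works:
select /*+ BROADCASTJOIN(sc) */ ...
select /*+ BROADCAST(sc) */ ...
select /*+ MAPJOIN(sc) */ ...

2. My problem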

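As mentioned above, a broadcast join pulls the small table's data to the driver, so the driver cannot be given too little memory, or the job fails.
However, increasing the driver memory still did not fix it.
My "small" table was simply too large, and our cluster could not spare much more memory.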
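A rough back-of-the-envelope estimate shows why 48 million rows is far from "small" for a broadcast. The per-row size below is an assumed figure for illustration only; it was not measured on the real table.

```python
ROWS = 48_000_000            # row count of the small table, from the scenario above
ASSUMED_BYTES_PER_ROW = 16   # assumption for illustration only; measure the real table
AUTO_BROADCAST_DEFAULT = 10 * 1024 * 1024  # default spark.sql.autoBroadcastJoinThreshold (10 MB)

estimated_bytes = ROWS * ASSUMED_BYTES_PER_ROW
print(f"estimated size: {estimated_bytes / 1024**2:.0f} MiB")
print(f"relative to the default threshold: {estimated_bytes / AUTO_BROADCAST_DEFAULT:.0f}x")
```

Even under this optimistic assumption the table is dozens of times larger than the default automatic threshold. The hash table built from it also carries its own overhead, so the real footprint on the driver and on each executor is likely higher.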


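Solution:

So what now?
Give up on the broadcast join and use an ordinary join. It is slower, but the hardware resources are what they are, so there is no way around it.
In the end, joining the two tables took two hours QAQ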
