In Spark SQL, a MAPJOIN hint was specified explicitly, which caused the broadcast side of the join to be far too large and its serialized size to exceed the configured buffer limit. The fix is to remove the explicit MAPJOIN hint.
The SQL is as follows:
with einfo as
(
select
E6.EMP_NO ,
E6.TEAM_ID ,
E6.TEAM_NAME
from mids.sys_org_cnl_emp_info E6
where E6.dt = '20190220' AND E6.EMP_NO is not null
)
SELECT /*+ MAPJOIN(E2) */ C.CUST_NO AS CUST_NO,
C.CUST_MANAGER AS CUST_MANAGER,
C.CUST_COMMENDER AS CUST_COMMENDER,
E2.TEAM_ID AS CUST_COMMENDER_TEAM,
E2.TEAM_NAME AS CUST_COMMENDER_TEAM_NAME,
C.PROMOTION_ROLE AS PROMOTION_ROLE,
'' AS PROMOTION_ROLE_QY
FROM ODS.CUST_ADVALLINFO21 C
LEFT JOIN einfo E2
ON C.CUST_COMMENDER = E2.EMP_NO
AND E2.EMP_NO in ('V99','VM2101')
WHERE C.dt='20190220'
and c.CUST_COMMENDER in ('V99','V101')
union all
SELECT /*+ MAPJOIN(E2) */ C.CUST_NO AS CUST_NO,
C.CUST_MANAGER AS CUST_MANAGER,
C.CUST_COMMENDER AS CUST_COMMENDER,
E2.TEAM_ID AS CUST_COMMENDER_TEAM,
E2.TEAM_NAME AS CUST_COMMENDER_TEAM_NAME,
C.PROMOTION_ROLE AS PROMOTION_ROLE,
'' AS PROMOTION_ROLE_QY
FROM ODS.CUST_ADVALLINFO21 C
LEFT JOIN einfo E2
ON C.CUST_COMMENDER = E2.EMP_NO
WHERE C.dt='20190220'
AND C.CUST_COMMENDER <>'V99'
The exception is as follows:
19/02/21 21:24:29 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, CHDD183, executor 1): org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 134217728. To avoid this, increase spark.kryoserializer.buffer.max value.
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:350)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:393)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0, required: 134217728
at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:246)
at com.esotericsoftware.kryo.io.Output.writeBytes(Output.java:232)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:54)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.write(DefaultArraySerializers.java:43)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:366)
at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:307)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:347)
... 4 more
Problem diagnosis:
The MAPJOIN in the second branch of the UNION ALL has to broadcast 15,824,019 rows (about 3,856 MB). That volume of data is far too large for a map join. With the MAPJOIN hint removed from this second branch, the job completed successfully without changing any serializer settings.
The MAPJOIN in the first branch only covers 2 rows, so it is unaffected.
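As a quick sanity check (a sketch that reuses the table, partition and filter from the einfo CTE above), the row count of the broadcast side can be confirmed with a plain count before deciding whether a map join is appropriate:

-- Sketch: count the rows the second MAPJOIN would have to broadcast
-- (table, partition and filter copied from the einfo CTE above)
SELECT COUNT(*)
FROM mids.sys_org_cnl_emp_info
WHERE dt = '20190220'
  AND EMP_NO IS NOT NULL;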
Solutions:
1. Increase the serializer buffer size (not a real fix: what if the broadcast data grows to 10 GB or even more?); see the SET sketch below.
2. Remove the MAPJOIN hint from the second branch (the real fix); see the rewritten branch below.
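For option 1, the limit named in the error message can be raised in the Spark SQL session itself; the 512m below is only an illustrative value (the default is 64m), and note that this setting must stay under 2048m, so it can never cover the 10 GB scenario mentioned above:

-- Workaround only: raise the Kryo serialization buffer ceiling
SET spark.kryoserializer.buffer.max=512m;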
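For option 2, the second branch of the UNION ALL simply loses its hint; everything else stays exactly as written above. Without the hint, Spark only broadcasts a join side automatically when its estimated size falls under spark.sql.autoBroadcastJoinThreshold, so the 3,856 MB build side falls back to a shuffle-based join:

-- Root fix: same second branch, MAPJOIN hint removed
SELECT C.CUST_NO AS CUST_NO,
       C.CUST_MANAGER AS CUST_MANAGER,
       C.CUST_COMMENDER AS CUST_COMMENDER,
       E2.TEAM_ID AS CUST_COMMENDER_TEAM,
       E2.TEAM_NAME AS CUST_COMMENDER_TEAM_NAME,
       C.PROMOTION_ROLE AS PROMOTION_ROLE,
       '' AS PROMOTION_ROLE_QY
FROM ODS.CUST_ADVALLINFO21 C
LEFT JOIN einfo E2
ON C.CUST_COMMENDER = E2.EMP_NO
WHERE C.dt = '20190220'
  AND C.CUST_COMMENDER <> 'V99'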