spark shuffle数据倾斜

1.背景

autocomplete业务每天的定期刷新连续几天都失败了,发现在业务方改了joinsource之后开始失败的。autocomplete一共有22个provider,Main和Movie也是每天定期刷新的,main有MultiDataReader。业务方把Main的最后一个子datareader和Movie这个provider的datareader的sql都换了一张表,并且新表没有cityid和shopid字段,业务方把cityid和shopid都写死为-1,而改joinsource之前那张老表是有cityid和shopid字段的,字段值都是有效的值,不是都写死成-1。

  • Main的子datareader修改前
<DataReader>
  <Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
  <Database>xxx</Database>
  <Sql>
    <![CDATA[
                    SELECT
                        A.CityID,
                        A.Keyword,
                        A.shopID,
                        A.datatype,
                        1 as ResultCount,
                        50 as SearchCount
                    FROM aaa A
                    WHERE A.VersionID = (Select VersionID from bbb order by ID ASC limit 1)
                    AND {KeyConditions A.CityID=$Value1 AND A.Keyword="$Value2" AND A.shopID=$Value3}
                    ]]>
  </Sql>
  <BatchTableName>aaa</BatchTableName>
  <BatchTableAlias>A</BatchTableAlias>
  <BatchTableKey>ID</BatchTableKey>
  <BatchTableCount>10000</BatchTableCount>
</DataReader>
  • Main的子datareader修改后
<DataReader>
  <Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
  <Database>xxx</Database>
  <Sql>
    <![CDATA[
                    SELECT
                        -1 as CityID,
                        MovieName as Keyword,
                        -1 as shopID,
                        2 as datatype,
                        1 as ResultCount,
                        50 as SearchCount
                        from aaa A
                    WHERE A.VersionID = (Select VersionID from bbb order by ID ASC limit 1)
                    AND {KeyConditions -1=$Value1 AND A.MovieName="$Value2" AND -1=$Value3}
                    ]]>
  </Sql>
  <BatchTableName>aaa</BatchTableName>
  <BatchTableAlias>A</BatchTableAlias>
  <BatchTableKey>ID</BatchTableKey>
  <BatchTableCount>10000</BatchTableCount>
</DataReader>
  • Movie的datareader修改前

<ExtraDataProvider description="电影信息">
  <Class>com.dp.arts.joiner.core.join.provider.impl.ExtraSimpleDataProvider</Class>
  <Name>Movie</Name>
  <keyFieldNames>CityID,Keyword,shopID,datatype</keyFieldNames>
  <DependOnProviderName>Main</DependOnProviderName>
  <CacheType>INTERVAL</CacheType>
  <DataReader>
    <Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
    <Database>xxx</Database>
    <Sql>
      <![CDATA[
     SELECT
         D.ID,
         D.CityID,
         D.Keyword,
         D.shopID,
         D.MovieID,
         D.DisplayInfo,
         D.datatype,
         50 as Weight
                    FROM xxx D
                    Where D.VersionID = (Select VersionID from DP_Search_Suggestshop_Movie order by ID ASC limit 1)
                    AND {KeyConditions D.CityID=$Value1 AND D.Keyword="$Value2" AND D.shopID=$Value3 AND D.datatype=$Value4}
                ]]>
    </Sql>
    <BatchTableName>DP_Search_Suggestshop_Movie</BatchTableName>
    <BatchTableAlias>D</BatchTableAlias>
    <BatchTableKey>ID</BatchTableKey>
    <BatchTableCount>10000</BatchTableCount>
  </DataReader>
  <List name="dataFields" element_class="com.dp.arts.joiner.producer.data.provider.ProviderDataField">
    <Field type="long" column="ID"/>
    <Field type="int" column="CityID"/>
    <Field type="string" column="Keyword"/>
    <Field type="long" column="shopID"/>
    <Field type="long" column="MovieID"/>
    <Field type="string" column="DisplayInfo"/>
    <Field type="int" column="datatype"/>
    <Field type="int" column="Weight"/>
  </List>
</ExtraDataProvider>
  • Movie的datareader修改后
<ExtraDataProvider description="电影信息">
  <Class>com.dp.arts.joiner.core.join.provider.impl.ExtraSimpleDataProvider</Class>
  <Name>Movie</Name>
  <keyFieldNames>CityID,Keyword,shopID,datatype</keyFieldNames>
  <DependOnProviderName>Main</DependOnProviderName>
  <CacheType>INTERVAL</CacheType>
  <DataReader>
    <Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
    <Database>LuceneDB</Database>
    <Sql>
      <![CDATA[
     SELECT
         ID,
         -1 as CityID,
         MovieName as Keyword,
         -1 as shopID,
         MovieId,
         '热映影片' as DisplayInfo,
         2 as datatype,
         50 as Weight
                    FROM DP_Search_Knowledge_Graph_Movie D
                    Where D.VersionID = (Select VersionID from DP_Search_Knowledge_Graph_Movie order by ID ASC limit 1)
                    AND {KeyConditions -1=$Value1 AND D.MovieName="$Value2" AND -1=$Value3 AND 2=$Value4}
                ]]>
    </Sql>
    <BatchTableName>DP_Search_Knowledge_Graph_Movie</BatchTableName>
    <BatchTableAlias>D</BatchTableAlias>
    <BatchTableKey>ID</BatchTableKey>
    <BatchTableCount>10000</BatchTableCount>
  </DataReader>
  <List name="dataFields" element_class="com.dp.arts.joiner.producer.data.provider.ProviderDataField">
    <Field type="long" column="ID"/>
    <Field type="int" column="CityID"/>
    <Field type="string" column="Keyword"/>
    <Field type="long" column="shopID"/>
    <Field type="long" column="MovieId"/>
    <Field type="string" column="DisplayInfo"/>
    <Field type="int" column="datatype"/>
    <Field type="int" column="Weight"/>
  </List>
</ExtraDataProvider>

2.原因

Main和Movie join的key由4个字段构成的CityID/Keyword/shopID/datatype,main provider的3个子datareader加起来共有4800w条记录,Movie只有11120条记录,main的最后一个datareader和Movie其实是差不多的,是从同一张表取的,也是11120条记录。Movie的CityID固定为-1,shopID固定为-1,datatype固定为2,那么怀疑某个值的Keyword有太多条记录了。

 

上面看出确实将近一半条数的记录的Keyword都是 '星球大战:命运的力量 第三辑 Star Wars Forces of Destiny: Volume 3'

main provider的最后1个子datareader和Movie的datareader中有4773条记录的key是一样的,意味着left join的时候关于这条记录会生成4773*4773=22781529条记录,而每条记录大概是450B,关于这条记录就大概消耗9776MB内存。

所以会OOM,1800个task,前面1799个task几秒钟就完成了,最后1个task重试4次,几十分钟后oom失败。

3.解决方案

provider的key按道理都是唯一的,对于simple的provider可以UniqueIndex或者在sql里用group by等使得写进mongo的provider key是唯一的。或者在spark join里先对provider按key去重之后再join。

定位数据倾斜:对各表都按join的key做group by count,看数据分布

附:https://www.jianshu.com/p/06b67a3c61a9

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值