1.背景
autocomplete业务每天的定期刷新连续几天都失败了,发现在业务方改了joinsource之后开始失败的。autocomplete一共有22个provider,Main和Movie也是每天定期刷新的,main有MultiDataReader。业务方把Main的最后一个子datareader和Movie这个provider的datareader的sql都换了一张表,并且新表没有cityid和shopid字段,业务方把cityid和shopid都写死为-1,而改joinsource之前那张老表是有cityid和shopid字段的,字段值都是有效的值,不是都写死成-1。
- Main的子datareader修改前
<DataReader>
<Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
<Database>xxx</Database>
<Sql>
<![CDATA[
SELECT
A.CityID,
A.Keyword,
A.shopID,
A.datatype,
1 as ResultCount,
50 as SearchCount
FROM aaa A
WHERE A.VersionID = (Select VersionID from bbb order by ID ASC limit 1)
AND {KeyConditions A.CityID=$Value1 AND A.Keyword="$Value2" AND A.shopID=$Value3}
]]>
</Sql>
<BatchTableName>aaa</BatchTableName>
<BatchTableAlias>A</BatchTableAlias>
<BatchTableKey>ID</BatchTableKey>
<BatchTableCount>10000</BatchTableCount>
</DataReader>
- Main的子datareader修改后
<DataReader>
<Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
<Database>xxx</Database>
<Sql>
<![CDATA[
SELECT
-1 as CityID,
MovieName as Keyword,
-1 as shopID,
2 as datatype,
1 as ResultCount,
50 as SearchCount
from aaa A
WHERE A.VersionID = (Select VersionID from bbb order by ID ASC limit 1)
AND {KeyConditions -1=$Value1 AND A.MovieName="$Value2" AND -1=$Value3}
]]>
</Sql>
<BatchTableName>aaa</BatchTableName>
<BatchTableAlias>A</BatchTableAlias>
<BatchTableKey>ID</BatchTableKey>
<BatchTableCount>10000</BatchTableCount>
</DataReader>
-
Movie的datareader修改前
<ExtraDataProvider description="电影信息">
<Class>com.dp.arts.joiner.core.join.provider.impl.ExtraSimpleDataProvider</Class>
<Name>Movie</Name>
<keyFieldNames>CityID,Keyword,shopID,datatype</keyFieldNames>
<DependOnProviderName>Main</DependOnProviderName>
<CacheType>INTERVAL</CacheType>
<DataReader>
<Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
<Database>xxx</Database>
<Sql>
<![CDATA[
SELECT
D.ID,
D.CityID,
D.Keyword,
D.shopID,
D.MovieID,
D.DisplayInfo,
D.datatype,
50 as Weight
FROM xxx D
Where D.VersionID = (Select VersionID from DP_Search_Suggestshop_Movie order by ID ASC limit 1)
AND {KeyConditions D.CityID=$Value1 AND D.Keyword="$Value2" AND D.shopID=$Value3 AND D.datatype=$Value4}
]]>
</Sql>
<BatchTableName>DP_Search_Suggestshop_Movie</BatchTableName>
<BatchTableAlias>D</BatchTableAlias>
<BatchTableKey>ID</BatchTableKey>
<BatchTableCount>10000</BatchTableCount>
</DataReader>
<List name="dataFields" element_class="com.dp.arts.joiner.producer.data.provider.ProviderDataField">
<Field type="long" column="ID"/>
<Field type="int" column="CityID"/>
<Field type="string" column="Keyword"/>
<Field type="long" column="shopID"/>
<Field type="long" column="MovieID"/>
<Field type="string" column="DisplayInfo"/>
<Field type="int" column="datatype"/>
<Field type="int" column="Weight"/>
</List>
</ExtraDataProvider>
- Movie的datareader修改后
<ExtraDataProvider description="电影信息">
<Class>com.dp.arts.joiner.core.join.provider.impl.ExtraSimpleDataProvider</Class>
<Name>Movie</Name>
<keyFieldNames>CityID,Keyword,shopID,datatype</keyFieldNames>
<DependOnProviderName>Main</DependOnProviderName>
<CacheType>INTERVAL</CacheType>
<DataReader>
<Class>com.dp.arts.joiner.core.join.reader.impl.DbDataReader</Class>
<Database>LuceneDB</Database>
<Sql>
<![CDATA[
SELECT
ID,
-1 as CityID,
MovieName as Keyword,
-1 as shopID,
MovieId,
'热映影片' as DisplayInfo,
2 as datatype,
50 as Weight
FROM DP_Search_Knowledge_Graph_Movie D
Where D.VersionID = (Select VersionID from DP_Search_Knowledge_Graph_Movie order by ID ASC limit 1)
AND {KeyConditions -1=$Value1 AND D.MovieName="$Value2" AND -1=$Value3 AND 2=$Value4}
]]>
</Sql>
<BatchTableName>DP_Search_Knowledge_Graph_Movie</BatchTableName>
<BatchTableAlias>D</BatchTableAlias>
<BatchTableKey>ID</BatchTableKey>
<BatchTableCount>10000</BatchTableCount>
</DataReader>
<List name="dataFields" element_class="com.dp.arts.joiner.producer.data.provider.ProviderDataField">
<Field type="long" column="ID"/>
<Field type="int" column="CityID"/>
<Field type="string" column="Keyword"/>
<Field type="long" column="shopID"/>
<Field type="long" column="MovieId"/>
<Field type="string" column="DisplayInfo"/>
<Field type="int" column="datatype"/>
<Field type="int" column="Weight"/>
</List>
</ExtraDataProvider>
2.原因
Main和Movie join的key由4个字段构成的CityID/Keyword/shopID/datatype,main provider的3个子datareader加起来共有4800w条记录,Movie只有11120条记录,main的最后一个datareader和Movie其实是差不多的,是从同一张表取的,也是11120条记录。Movie的CityID固定为-1,shopID固定为-1,datatype固定为2,那么怀疑某个值的Keyword有太多条记录了。
上面看出确实将近一半条数的记录的Keyword都是 '星球大战:命运的力量 第三辑 Star Wars Forces of Destiny: Volume 3'
main provider的最后1个子datareader和Movie的datareader中有4773条记录的key是一样的,意味着left join的时候关于这条记录会生成4773*4773=22781529条记录,而每条记录大概是450B,关于这条记录就大概消耗9776MB内存。
所以会OOM,1800个task,前面1799个task几秒钟就完成了,最后1个task重试4次,几十分钟后oom失败。
3.解决方案
provider的key按道理都是唯一的,对于simple的provider可以UniqueIndex或者在sql里用group by等使得写进mongo的provider key是唯一的。或者在spark join里先对provider按key去重之后再join。
定位数据倾斜:对各表都按join的key做group by count,看数据分布