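The HBase table uses the VIN (vehicle identification number, a 17-character code unique to each vehicle)
as its rowkey. VINs are not evenly distributed, however, so when the data volume is large you need to consider defining custom split points for the pre-split regions: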
use asmp;
-- Count rows per VIN prefix (lengths 5, 7, 9, and 11), largest groups first
select sub, count(1) n from (select substring(vin,1,5) as sub from tt_repair_deed_tmp where partition_brand='vw') a
group by sub order by n desc;
select sub, count(1) n from (select substring(vin,1,7) as sub from tt_repair_deed_tmp where partition_brand='vw') a
group by sub order by n desc;
select sub, count(1) n from (select substring(vin,1,9) as sub from tt_repair_deed_tmp where partition_brand='vw') a
group by sub order by n desc;
select sub, count(1) n from (select substring(vin,1,11) as sub from tt_repair_deed_tmp where partition_brand='vw') a
group by sub order by n desc;
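Regions split by the first 4 characters
Regions split by the first 5 characters
Regions split by the first 7 characters
Regions split by the first 9 characters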
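
To compare the candidate prefix lengths above, two numbers are useful for each length: how many distinct prefixes there are (an upper bound on the number of regions) and what share of the rows falls under the single largest prefix (a rough measure of skew). The following is a minimal Python sketch of that comparison. The function name `prefix_stats` and the sample VINs are made up for illustration; in practice the counts would come from the Hive queries.

```python
from collections import Counter


def prefix_stats(vins, length):
    """Return (distinct prefix count, share of rows under the largest prefix)."""
    counts = Counter(v[:length] for v in vins)
    if not counts:
        return 0, 0.0
    total = sum(counts.values())
    return len(counts), max(counts.values()) / total


# Made-up sample VINs, for illustration only.
sample_vins = [
    "LSVAA49J132000001",
    "LSVAA49J132000002",
    "LSVAB49J132000003",
    "LSVCC49J132000004",
    "LFV2A21K6A3000005",
    "LFV3A21K6A3000006",
]

for n in (4, 5, 7, 9):
    distinct, top_share = prefix_stats(sample_vins, n)
    print(f"prefix length {n}: {distinct} distinct prefixes, "
          f"largest holds {top_share:.0%} of rows")
```

A longer prefix generally yields more distinct values and less skew per value, at the cost of more candidate regions.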
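Normally, the number of regions determines the number of reduce tasks, and too many reducers consume a lot of resources, so we chose to split regions by the first 5 characters.
Next, define a custom vin_split file.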
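
The original does not show the contents of the vin_split file, so the following is only a sketch of one way to produce it. It assumes the file holds one split point per line (the layout the HBase shell accepts for `SPLITS_FILE`), and that the 5-character prefix counts have been exported from Hive into a dictionary. The names `build_split_points`, `write_split_file`, and the demo counts are hypothetical.

```python
def build_split_points(prefix_counts, num_regions):
    """Pick up to num_regions - 1 prefixes as split points so that each
    region holds roughly the same number of rows."""
    if num_regions < 2 or not prefix_counts:
        return []
    total = sum(prefix_counts.values())
    target = total / num_regions
    splits = []
    running = 0
    # VINs are ASCII, so Python's string order matches HBase's byte order.
    for prefix in sorted(prefix_counts):
        # A split point is the first rowkey of the next region, so check
        # the quota before adding this prefix's rows.
        if running >= target * (len(splits) + 1) and len(splits) < num_regions - 1:
            splits.append(prefix)
        running += prefix_counts[prefix]
    return splits


def write_split_file(splits, path):
    """Write one split point per line (assumed file layout)."""
    with open(path, "w", encoding="ascii") as f:
        for s in splits:
            f.write(s + "\n")


# Made-up prefix counts, for illustration only.
demo_counts = {"LFV2A": 100, "LFV3A": 100, "LSVAA": 300, "LSVAB": 100, "LSVCC": 200}
print(build_split_points(demo_counts, 4))
```

Calling `write_split_file(build_split_points(demo_counts, 4), "vin_split")` would then produce the file. A single very large prefix cannot be divided by this approach; balancing it further would need a longer prefix for that range.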