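Syntax: rand(), rand(int seed)
Return type: double
Description: Returns a random number between 0 and 1. If a seed is specified, the function produces a stable, reproducible sequence of random numbers.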
> select rand();
0.9629742951434543
> select rand(0);
0.8446490682263027
> select rand(null);
0.8446490682263027
PS: To get a random integer in the range 0-9 or 1-10, multiply by 10 and then round down (floor) or round up (ceiling), respectively:
select cast(floor(rand() * 10) as int)
select cast(ceiling(rand() * 10) as int)
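The two properties above (a fixed seed gives a repeatable sequence, and floor/ceiling map the value to 0-9 or 1-10) can be checked outside Hive with a minimal Scala sketch. It uses scala.util.Random as a stand-in for rand(), so the actual numbers will differ from Hive's; the object and method names are hypothetical, and the edge case assumes the generator can return exactly 0.

```scala
import scala.util.Random

object RandSketch {
  // floor(r * 10) for r in [0, 1) always falls in 0..9.
  def zeroToNine(r: Double): Int = math.floor(r * 10).toInt

  // ceil(r * 10) falls in 1..10 for r in (0, 1);
  // if r is exactly 0.0 the result is 0, not 1.
  def oneToTen(r: Double): Int = math.ceil(r * 10).toInt

  // Draw n values from a generator created with the given seed.
  // Calling this twice with the same seed yields the same values.
  def draws(seed: Long, n: Int): Vector[Double] = {
    val rng = new Random(seed)
    Vector.fill(n)(rng.nextDouble())
  }

  def main(args: Array[String]): Unit = {
    println(draws(0L, 3) == draws(0L, 3)) // same seed, same sequence
    println(draws(1L, 5).map(zeroToNine))
    println(draws(1L, 5).map(oneToTen))
  }
}
```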
-- Match geographic information separately
drop table if exists wedw_tmp.t_wy_zh_user_df_${DATA_DATE}_03;
create table wedw_tmp.t_wy_zh_user_df_${DATA_DATE}_03 as
select
profile.*
,case when area.area_id is not null then area.area_name else '-99' end as write_province_name
,case when area1.area_id is not null then area1.area_name else '-99' end as write_city_name
from wedw_tmp.t_wy_zh_user_df_${DATA_DATE}_03_01 profile
left join wedw_dw.wy_zh_area_df area
on profile.new_province_id = area.area_id
left join wedw_dw.wy_zh_area_df area1
on profile.new_city_id = area1.area_id
distribute by rand(1);
Spark SQL: controlling the number of output files and keeping their sizes even (distribute by rand())
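Q: How do you control the number of output files in Spark?
A: That one is easy: use coalesce or repartition, with num=(1.0*(df.count())/7000000).ceil.toInt (that is, about 7,000,000 rows per output file).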
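The formula in the answer can be wrapped in a small helper. This is a sketch only: the object name, method name and the rowsPerFile parameter are hypothetical, and the lower bound of 1 is an added assumption, since an empty DataFrame would otherwise give 0, which repartition does not accept.

```scala
object OutputFileCount {
  // Hypothetical helper: number of output files for a given row count,
  // aiming at roughly rowsPerFile rows per file (7,000,000 in the answer above).
  def numFiles(rowCount: Long, rowsPerFile: Long = 7000000L): Int = {
    require(rowsPerFile > 0, "rowsPerFile must be positive")
    // ceil(rowCount / rowsPerFile), clamped to at least 1 (assumption added here).
    math.max(1, math.ceil(rowCount.toDouble / rowsPerFile).toInt)
  }

  def main(args: Array[String]): Unit = {
    println(numFiles(21000000L))
    println(numFiles(7000001L))
  }
}
```

Assuming df is a Spark DataFrame, the result would be passed on as df.repartition(OutputFileCount.numFiles(df.count())). Note that df.count() itself triggers a job.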
Q: How do you make Spark's output files evenly sized?
A: Add distribute by rand() at the end of the Spark SQL query.
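Key point: the distribute by keyword controls how map output is distributed. Map output rows with the same value of the distribute by field are sent to the same reduce node. If the field is rand(), a random number, the rows are spread so that every partition receives roughly the same number of rows.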
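The effect can be illustrated without a cluster. The sketch below is a plain Scala simulation, not Spark code: it assigns rows to partitions by the key's hashCode modulo the partition count, which imitates hash distribution in spirit but is not Spark's actual hash function. The object name, the sample keys and the 90% skew are invented for the illustration.

```scala
import scala.util.Random

object DistributeSketch {
  // Count how many rows land in each partition when rows are routed by
  // floorMod(key.hashCode, numPartitions). Equal keys always share a partition.
  def partitionSizes[K](keys: Seq[K], numPartitions: Int): Vector[Int] = {
    val sizes = Array.fill(numPartitions)(0)
    keys.foreach { k =>
      sizes(Math.floorMod(k.hashCode, numPartitions)) += 1
    }
    sizes.toVector
  }

  // Returns (sizes with a skewed key, sizes with a random key).
  def demo(n: Int, numPartitions: Int, seed: Long): (Vector[Int], Vector[Int]) = {
    val rng = new Random(seed)
    // Skewed key: about 90% of the rows share a single value.
    val skewed = Seq.fill(n)(
      if (rng.nextDouble() < 0.9) "beijing" else s"city_${rng.nextInt(50)}")
    // Random key, analogous to distribute by rand().
    val random = Seq.fill(n)(rng.nextDouble())
    (partitionSizes(skewed, numPartitions), partitionSizes(random, numPartitions))
  }

  def main(args: Array[String]): Unit = {
    val (skewed, random) = demo(100000, 8, 42L)
    println(s"skewed key: $skewed")
    println(s"random key: $random")
  }
}
```

With the skewed key one partition holds most of the rows, while with the random key the counts come out close to one another.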