1. How to make a large data volume use an appropriate number of maps.
2. How to make a single map process an appropriate number of records.
Merging small files does consume resources, so think carefully about when to merge small files and which data really needs merging. There are three options (a configuration sketch follows this list):
1. Set Hive parameters so that Hive merges small files automatically when the data is generated; for details see http://hugh-wangp.iteye.com/blog/1508506
2. Set Hive parameters so that Hive first merges small files when reading the data: set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
3. Write your own small-file merge tool; this gives the most freedom.
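For reference, here is a minimal configuration sketch covering options 1 and 2. The parameter names are standard Hive settings, but the sizes are illustrative assumptions rather than recommended values, and defaults vary by Hive version:
--Option 1: merge small output files when the data is generated
set hive.merge.mapfiles = true;               -- merge small files produced by map-only jobs
set hive.merge.mapredfiles = true;            -- merge small files produced at the reduce side
set hive.merge.size.per.task = 256000000;     -- target size of each merged file (illustrative)
set hive.merge.smallfiles.avgsize = 16000000; -- trigger a merge job when the average output file is below this (illustrative)
--Option 2: combine small files into larger splits when reading
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;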
To control how many records a single map processes:
1. set mapred.reduce.tasks (estimate how many records one file should hold and, from that, decide how many files to write out; a rough calculation follows below)
2. Use distribute by rand(123) to spread the data evenly across the reducers.
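A back-of-the-envelope for point 1, using the shaka01 numbers that appear later in this post; the target of roughly 3 million records per file is an assumption made for illustration, not a rule:
--shaka01 holds 29,022,319 records in about 675MB, i.e. roughly 23 bytes per record
--aiming for about 3 million records per file: 29,022,319 / 3,000,000 ≈ 10 output files
set mapred.reduce.tasks = 10;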
With the current data distribution, a SQL as complex as the one below is really a struggle: a single job did not finish even after half an hour.
hive> select count(1) from shaka01;
OK
29022319
Time taken: 92.263 seconds
--The HDFS block.size
hive> set dfs.block.size;
dfs.block.size=268435456
--The files behind shaka01: each file is larger than block.size, so reading this table launches 4 maps, each handling about 7.5 million records on average (a quick check of this arithmetic follows the listing)
336495295 /group/hive/shaka01/attempt_201206121523_2432943_m_000000_0
338594924 /group/hive/shaka01/attempt_201206121523_2432943_m_000001_0
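A quick sanity check on those numbers, assuming the usual behavior of one map per HDFS block for plain text input:
--each file is about 336MB, larger than the 256MB block.size, so each is stored as 2 blocks -> 2 splits
--2 files x 2 splits = 4 maps
--29,022,319 records / 4 maps ≈ 7.25 million records per map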
--Running the complex SQL below, the first job alone (number of mappers: 4; number of reducers: 1) had been going for nearly 30 minutes and was still at map = 71%, reduce = 0%
hive> create table shaka02 as
> select '2012-06-26' col1
> ,a.col3
> ,a.col4
> ,a.col5
> ,count(distinct case when date_add(to_date(col2),7) > '2012-06-26' then a.col7 else null END) as col6
> ,count(distinct case when date_add(to_date(col2),30)> '2012-06-26' then a.col7 else null END ) as col7
> ,count(distinct case when date_add(to_date(col2),90)> '2012-06-26' then a.col7 else null END ) as col8
> ,count(distinct case when date_add(to_date(col2),7) >'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col9
> ,count(distinct case when date_add(to_date(col2),30)>'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col10
> ,count(distinct case when date_add(to_date(col2),90)>'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col11
> from shaka01 a
> group by a.col3
> ,a.col4
> ,a.col5;
Change the number of files the data is stored in, so that each map has fewer records to process.
hive> set mapred.reduce.tasks = 10;
hive> set mapred.reduce.tasks;
mapred.reduce.tasks=10
--This way shaka01's data gets scattered into 10 files
hive> create table shaka01
> as
> select stat_date col1
> ,mc_sent_time col2
> ,receiver_admin_member_id col3
> ,receiver_company_id col4
> ,category_id col5
> ,sender_country_id col6
> ,sender_email col7
> ,first_reply_time col8
> ,dw_ins_date col9
> from test
> distribute by rand(123);
--Every file is under 256MB, so reading this table starts 10 maps
88469251 /group/hive/shaka01/attempt_201206121523_2440189_r_000000_0
89634660 /group/hive/shaka01/attempt_201206121523_2440189_r_000001_0
88117390 /group/hive/shaka01/attempt_201206121523_2440189_r_000002_0
87820171 /group/hive/shaka01/attempt_201206121523_2440189_r_000003_0
89219537 /group/hive/shaka01/attempt_201206121523_2440189_r_000004_0
90928398 /group/hive/shaka01/attempt_201206121523_2440189_r_000005_0
86772252 /group/hive/shaka01/attempt_201206121523_2440189_r_000006_0
87524942 /group/hive/shaka01/attempt_201206121523_2440189_r_000007_0
88125909 /group/hive/shaka01/attempt_201206121523_2440189_r_000008_0
86613799 /group/hive/shaka01/attempt_201206121523_2440189_r_000009_0
Now execute the complex SQL from above again.
--The first job gets 6 more maps, reaching 10 maps' worth of resources, and the entire SQL finishes in 26 minutes
hive> create table shaka02 as
> select '2012-06-26' col1
> ,a.col3
> ,a.col4
> ,a.col5
> ,count(distinct case when date_add(to_date(col2),7) > '2012-06-26' then a.col7 else null END) as col6
> ,count(distinct case when date_add(to_date(col2),30)> '2012-06-26' then a.col7 else null END ) as col7
> ,count(distinct case when date_add(to_date(col2),90)> '2012-06-26' then a.col7 else null END ) as col8
> ,count(distinct case when date_add(to_date(col2),7) >'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col9
> ,count(distinct case when date_add(to_date(col2),30)>'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col10
> ,count(distinct case when date_add(to_date(col2),90)>'2012-06-26' and date_add(to_date(col2),7)>to_date(a.col8) and to_date(a.col8)<='2012-06-26' then a.col7 else null END) col11
> from shaka01 a
> group by a.col3
> ,a.col4
> ,a.col5;
OK
Time taken: 1580.314 seconds
PS:
One more approach, which came up in real work and has since become our final solution:
set dfs.block.size
Whether it is 32MB, 64MB, 128MB or 256MB does not matter much; as long as one block holds a number of records that suits your own environment, you are fine. A sketch follows.
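A minimal sketch of that final approach, again using the shaka01 numbers above purely for illustration; the 64MB target and the table name shaka03 are assumptions made up for this example:
--about 23 bytes per record, so one 64MB block holds roughly 64*1024*1024 / 23 ≈ 2.9 million records
set dfs.block.size = 67108864;
--dfs.block.size is a client-side setting in the Hadoop versions of that era,
--so tables written from this session onwards are stored in 64MB blocks,
--and a downstream job should get roughly one map per 2.9 million records
create table shaka03 as  -- shaka03 is a hypothetical name for this example
select * from shaka01;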