Hadoop MapReduce on Massive Numbers of Small Files: Compressing the Files

Storing large numbers of small files on HDFS consumes a great deal of NameNode memory: the NameNode keeps an in-memory metadata entry for every file, and it must load all of this file metadata at startup, so the more files there are, the higher the NameNode's overhead.
One option is to compress the small files before uploading them to HDFS; then only a single file's metadata is needed, which greatly reduces the NameNode's memory overhead. For MapReduce computation, Hadoop ships with built-in support for the following compression formats (a short codec sketch follows the list):

  • DEFLATE
  • gzip
  • bzip2
  • LZO
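
Hadoop resolves a codec for an input file from its extension, which is why TextInputFormat can transparently decompress .gz input in the job below. As a minimal sketch (not part of the original program; the input path is whatever you pass as the first argument), the following probes a file's codec with CompressionCodecFactory and reads one decompressed line:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecProbe {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]); // e.g. a .gz file already on HDFS
        // The factory maps file extensions (.gz, .bz2, ...) to codec classes.
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
        if (codec == null) {
            System.err.println("No codec found for " + file);
            return;
        }
        // Wrap the raw HDFS stream with the codec's decompression stream.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(codec.createInputStream(fs.open(file))));
        System.out.println("codec = " + codec.getClass().getSimpleName());
        System.out.println("first line = " + reader.readLine());
        reader.close();
    }
}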

Running MapReduce over compressed files does cost the time spent decompressing them, which is worth weighing in particular applications. In the massive-small-files scenario, however, compressing the small files buys us data locality.
Suppose hundreds or thousands of small files compress down to a single block: that block necessarily lives on one DataNode, the computation receives a single InputSplit (gzip is not splittable, so each .gz file is processed by exactly one map task), and the work runs locally with no data shipped across the network. If instead the small files were uploaded to HDFS directly, hundreds or thousands of small blocks would be scattered across different DataNodes, and the data might have to be "moved" before it could be processed. With only a few files, the NameNode memory overhead aside, the network transfer cost may be imperceptible, but once the number of small files reaches a certain scale it becomes very noticeable.
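
To check this locality claim on a real cluster, you can ask the NameNode where a file's blocks actually live. The following is a small sketch against the standard HDFS API (not from the original article; the HDFS path is passed as the first argument):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationProbe {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // Ask the NameNode which DataNodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i + " offset=" + blocks[i].getOffset()
                    + " hosts=" + java.util.Arrays.toString(blocks[i].getHosts()));
        }
    }
}
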
Below, we compress the small files with gzip, upload the result to HDFS, and run a MapReduce job over it.
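
The article does not show how the small files were packed; one simple way, sketched below under the assumption that the inputs are newline-terminated text files in a local directory (directory passed as the first argument; the output path here is just for illustration), is to concatenate them into a single gzip stream written directly to HDFS:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        GzipCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // One compressed output file on HDFS for all the local small files.
        OutputStream out = codec.createOutputStream(
                fs.create(new Path("/user/xiaoxiang/datasets/gzipfiles/packed.gz")));
        for (File f : new File(args[0]).listFiles()) { // local dir of small text files
            // Assumes each small file ends with a newline so records don't merge.
            InputStream in = new FileInputStream(f);
            IOUtils.copyBytes(in, out, 4096, false); // append contents; keep out open
            in.close();
        }
        out.close();
    }
}
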
A single class implements the basic Map and Reduce tasks, as shown here:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.compression;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class GzipFilesMaxCostComputation {

    public static class GzipFilesMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final static LongWritable costValue = new LongWritable(0);
        private Text code = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // a line, such as 'SG 253654006139495 253654006164392 619850464'
            String line = value.toString();
            String[] array = line.split("\\s");
            if (array.length == 4) {
                String countryCode = array[0];
                String strCost = array[3];
                long cost = 0L;
                try {
                    cost = Long.parseLong(strCost);
                } catch (NumberFormatException e) {
                    cost = 0L; // skip malformed cost fields
                }
                if (cost != 0) {
                    code.set(countryCode);
                    costValue.set(cost);
                    context.write(code, costValue);
                }
            }
        }
    }

    public static class GzipFilesReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Emit the maximum cost seen for each country code.
            long max = 0L;
            Iterator<LongWritable> iter = values.iterator();
            while (iter.hasNext()) {
                LongWritable current = iter.next();
                if (current.get() > max) {
                    max = current.get();
                }
            }
            context.write(key, new LongWritable(max));
        }

    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: gzipmaxcost <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "gzip maxcost");

        // Compress the final job output with gzip as well.
        job.getConfiguration().setBoolean("mapred.output.compress", true);
        job.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        job.setJarByClass(GzipFilesMaxCostComputation.class);
        job.setMapperClass(GzipFilesMapper.class);
        job.setCombinerClass(GzipFilesReducer.class);
        job.setReducerClass(GzipFilesReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);

    }
}

The program above solves a simple maximum-value problem, reading gzip-compressed input. Note that the two mapred.output.* settings in the code compress the final job output; separately, if a large volume of map output must be copied from the Mappers to the Reducers, you can also compress the intermediate map output when configuring the Job, as in the sketch below.
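
A minimal sketch of that map-output compression, using the Hadoop 1.x property names that match the code above (newer releases use the mapreduce.map.output.* names):

// Sketch, not part of the original program: enable compression of the
// intermediate map output that is shuffled to the reducers. These lines
// would sit next to the mapred.output.* settings in main() above.
job.getConfiguration().setBoolean("mapred.compress.map.output", true);
job.getConfiguration().setClass("mapred.map.output.compression.codec",
        GzipCodec.class, CompressionCodec.class);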

The following walks through running the program:

  • Prepare the data
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ du -sh ../dataset/gzipfiles/*
147M     ../dataset/gzipfiles/data_10m.gz
43M     ../dataset/gzipfiles/data_50000_1.gz
16M     ../dataset/gzipfiles/data_50000_2.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal ../dataset/gzipfiles/* /user/xiaoxiang/datasets/gzipfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/datasets/gzipfiles
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup  153719349 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_10m.gz
-rw-r--r--   3 xiaoxiang supergroup   44476101 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_1.gz
-rw-r--r--   3 xiaoxiang supergroup   15935178 2013-03-24 12:56 /user/xiaoxiang/datasets/gzipfiles/data_50000_2.gz
  • Run the program
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar gzip-compression.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.compression.GzipFilesMaxCostComputation /user/xiaoxiang/datasets/gzipfiles /user/xiaoxiang/output/smallfiles/gzip
13/03/24 13:06:28 INFO input.FileInputFormat: Total input paths to process : 3
13/03/24 13:06:28 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/24 13:06:28 WARN snappy.LoadSnappy: Snappy native library not loaded
13/03/24 13:06:28 INFO mapred.JobClient: Running job: job_201303111631_0039
13/03/24 13:06:29 INFO mapred.JobClient:  map 0% reduce 0%
13/03/24 13:06:55 INFO mapred.JobClient:  map 33% reduce 0%
13/03/24 13:07:04 INFO mapred.JobClient:  map 66% reduce 11%
13/03/24 13:07:13 INFO mapred.JobClient:  map 66% reduce 22%
13/03/24 13:07:25 INFO mapred.JobClient:  map 100% reduce 22%
13/03/24 13:07:31 INFO mapred.JobClient:  map 100% reduce 100%
13/03/24 13:07:36 INFO mapred.JobClient: Job complete: job_201303111631_0039
13/03/24 13:07:36 INFO mapred.JobClient: Counters: 29
13/03/24 13:07:36 INFO mapred.JobClient:   Job Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Launched reduce tasks=1
13/03/24 13:07:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=78231
13/03/24 13:07:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/24 13:07:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/03/24 13:07:36 INFO mapred.JobClient:     Launched map tasks=3
13/03/24 13:07:36 INFO mapred.JobClient:     Data-local map tasks=3
13/03/24 13:07:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=34413
13/03/24 13:07:36 INFO mapred.JobClient:   File Output Format Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Bytes Written=1337
13/03/24 13:07:36 INFO mapred.JobClient:   FileSystemCounters
13/03/24 13:07:36 INFO mapred.JobClient:     FILE_BYTES_READ=288127
13/03/24 13:07:36 INFO mapred.JobClient:     HDFS_BYTES_READ=214131026
13/03/24 13:07:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=385721
13/03/24 13:07:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1337
13/03/24 13:07:36 INFO mapred.JobClient:   File Input Format Counters
13/03/24 13:07:36 INFO mapred.JobClient:     Bytes Read=214130628
13/03/24 13:07:36 INFO mapred.JobClient:   Map-Reduce Framework
13/03/24 13:07:36 INFO mapred.JobClient:     Map output materialized bytes=9105
13/03/24 13:07:36 INFO mapred.JobClient:     Map input records=14080003
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce shuffle bytes=6070
13/03/24 13:07:36 INFO mapred.JobClient:     Spilled Records=22834
13/03/24 13:07:36 INFO mapred.JobClient:     Map output bytes=154878493
13/03/24 13:07:36 INFO mapred.JobClient:     CPU time spent (ms)=90200
13/03/24 13:07:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=688193536
13/03/24 13:07:36 INFO mapred.JobClient:     Combine input records=14092911
13/03/24 13:07:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=398
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce input records=699
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce input groups=233
13/03/24 13:07:36 INFO mapred.JobClient:     Combine output records=13747
13/03/24 13:07:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=765448192
13/03/24 13:07:36 INFO mapred.JobClient:     Reduce output records=233
13/03/24 13:07:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2211237888
13/03/24 13:07:36 INFO mapred.JobClient:     Map output records=14079863
  • Results
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/output/smallfiles/gzip
Found 3 items
-rw-r--r--   3 xiaoxiang supergroup          0 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/_SUCCESS
drwxr-xr-x   - xiaoxiang supergroup          0 2013-03-24 13:06 /user/xiaoxiang/output/smallfiles/gzip/_logs
-rw-r--r--   3 xiaoxiang supergroup       1337 2013-03-24 13:07 /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyToLocal /user/xiaoxiang/output/smallfiles/gzip/part-r-00000.gz ./
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ gunzip -c ./part-r-00000.gz
AD     999974516
AE     999938630
AF     999996180
AG     999991085
AI     999989595
AL     999998489
AM     999978568
AO     999989628
AQ     999995031
AR     999999563
AS     999935982
AT     999999909
AU     999937089
AW     999965784
AZ     999996557
BA     999994828
BB     999992177
BD     999992272
BE     999925057
BF     999999220
BG     999971528
BH     999994900
BI     999982573
BJ     999977886
BM     999991925
BN     999986630
BO     999995482
BR     999989947
BS     999983475
BT     999992685
BW     999984222
BY     999998496
BZ     999997173
CA     999991096
CC     999969761
CD     999978139
CF     999995342
CG     999957938
CH     999997524
CI     999998864
CK     999968719
CL     999967083
CM     999998369
CN     999975367
CO     999999167
CR     999980097
CU     999976352
CV     999990543
CW     999996327
CX     999987579
CY     999982925
CZ     999993908
DE     999985416
DJ     999997438
DK     999963312
DM     999941706
DO     999992176
DZ     999973610
EC     999971018
EE     999960984
EG     999980522
ER     999980425
ES     999949155
ET     999987033
FI     999989788
FJ     999990686
FK     999977799
FM     999994183
FO     999988472
FR     999988342
GA     999982099
GB     999970658
GD     999996318
GE     999991970
GF     999982024
GH     999941039
GI     999995295
GL     999948726
GM     999984872
GN     999992209
GP     999996090
GQ     999988635
GR     999999672
GT     999981025
GU     999975956
GW     999962551
GY     999999881
HK     999970084
HN     999972628
HR     999986688
HT     999970913
HU     999997568
ID     999994762
IE     999996686
IL     999982184
IM     999987831
IN     999973935
IO     999984611
IQ     999990126
IR     999986780
IS     999973585
IT     999997239
JM     999986629
JO     999982595
JP     999985598
KE     999996012
KG     999991556
KH     999975644
KI     999994328
KM     999989895
KN     999991068
KP     999967939
KR     999992162
KW     999924295
KY     999985907
KZ     999992835
LA     999989151
LB     999989233
LC     999994793
LI     999986863
LK     999989876
LR     999984906
LS     999957706
LT     999999688
LU     999999823
LV     999981633
LY     999992365
MA     999993880
MC     999978886
MD     999997483
MG     999996602
MH     999989668
MK     999983468
ML     999990079
MM     999989010
MN     999969051
MO     999978283
MP     999995848
MQ     999913110
MR     999982303
MS     999997548
MT     999982604
MU     999988632
MV     999975914
MW     999991903
MX     999978066
MY     999995010
MZ     999981189
NA     999976735
NC     999961053
NE     999990091
NF     999989399
NG     999985037
NI     999965733
NL     999988890
NO     999993122
NP     999972410
NR     999956464
NU     999987046
NZ     999998214
OM     999967428
PA     999944775
PE     999998598
PF     999959978
PG     999987347
PH     999981534
PK     999954268
PL     999996619
PM     999998975
PR     999978127
PT     999993404
PW     999991278
PY     999993590
QA     999995061
RE     999998518
RO     999994148
RS     999999923
RU     999995809
RW     999980184
SA     999973822
SB     999972832
SC     999991021
SD     999963744
SE     999972256
SG     999977637
SH     999999068
SI     999980580
SK     999998152
SL     999999269
SM     999941188
SN     999990278
SO     999978960
SR     999997483
ST     999980447
SV     999999945
SX     999938671
SY     999990666
SZ     999992537
TC     999969904
TD     999999303
TG     999977640
TH     999979255
TJ     999983666
TK     999971131
TM     999958998
TN     999979170
TO     999959971
TP     999986796
TR     999996679
TT     999984435
TV     999974536
TW     999975092
TZ     999992734
UA     999972948
UG     999980070
UM     999998377
US     999918442
UY     999989662
UZ     999982762
VA     999987372
VC     999991495
VE     999997971
VG     999954576
VI     999990063
VN     999974393
VU     999976113
WF     999961299
WS     999970242
YE     999984650
YT     999994707
ZA     999998692
ZM     999993331
ZW     999943540