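Hadoop 0.20.2
1. Using the streaming command (excerpted from the Hadoop developer documentation):
- Besides plain-text output, you can also generate gzip-compressed output. You only need to set these options on the streaming job: '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec'.
2. Using a program:
Input files: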
- $ bin/hadoop fs -ls /temp/in
- Found 2 items
- -rw-r--r-- 1 Administrator supergroup 52 2012-02-09 10:02 /temp/in/t1.txt
- -rw-r--r-- 1 Administrator supergroup 35 2012-02-09 10:02 /temp/in/t2.txt
Test code:
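- package com.hadoop.test;
-
- import java.io.IOException;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.mapred.FileInputFormat;
- import org.apache.hadoop.mapred.FileOutputFormat;
- import org.apache.hadoop.mapred.JobClient;
- import org.apache.hadoop.mapred.JobConf;
- import org.apache.hadoop.mapred.MapReduceBase;
- import org.apache.hadoop.mapred.Mapper;
- import org.apache.hadoop.mapred.OutputCollector;
- import org.apache.hadoop.mapred.Reporter;
-
- // Map-only job that copies its input lines into gzip-compressed output files.
- public class ZipFile {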
- public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
- public void map(LongWritable key, Text value,
- OutputCollector<Text, IntWritable> output, Reporter reporter)
- throws IOException {
- // Emit each input line as the key with a null value, so the text output contains only the line itself.
- output.collect(value, null);
- }
- }
- public static void main(String[] args) {
- JobConf conf = new JobConf(com.hadoop.test.ZipFile.class);
- // Input and output are directories, not files; the timestamp keeps each run's output path unique.
- FileInputFormat.setInputPaths(conf, new Path("/temp/in"));
- FileOutputFormat.setOutputPath(conf, new Path("/temp/out-" + System.currentTimeMillis()));
- conf.setMapperClass(Map.class);
- // Compress the job output with gzip.
- FileOutputFormat.setCompressOutput(conf, true);
- FileOutputFormat.setOutputCompressorClass(conf, org.apache.hadoop.io.compress.GzipCodec.class);
- // Map-only job: with zero reducers, each map task writes its own part file.
- conf.setNumReduceTasks(0);
- try {
- JobClient.runJob(conf);
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- }
Output files:
- $ bin/hadoop fs -ls /temp/out-1328857284203
- Found 2 items
- -rw-r--r-- 3 Administrator supergroup 67 2012-02-10 15:01 /temp/out-1328857284203/part-00000.gz
- -rw-r--r-- 3 Administrator supergroup 53 2012-02-10 15:01 /temp/out-1328857284203/part-00001.gz
Download one of the output files with:
$ bin/hadoop fs -get /temp/out-1328857284203/part-00000.gz out1.gz
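The file downloaded to the local machine is still gzip-compressed (gzip, not zip). After decompressing it, the contents match the original input file.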
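
The "decompress and compare" check can also be done in code rather than by hand. Below is a minimal sketch that uses only the JDK's java.util.zip classes, not Hadoop. The class name GzipCheck and its helper methods are made up for this illustration, and the two file paths are whatever you pass on the command line (for example the downloaded out1.gz and a local copy of the matching input file).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Hadoop: checks a downloaded part file locally.
public class GzipCheck {

    // True if the data starts with the gzip magic bytes 0x1f 0x8b.
    public static boolean isGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Decompresses a complete gzip stream held in memory.
    public static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Compresses bytes with gzip; only used here to exercise gunzip without a cluster.
    public static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java GzipCheck <file.gz> <original file>");
            return;
        }
        byte[] compressed = Files.readAllBytes(Paths.get(args[0]));
        byte[] original = Files.readAllBytes(Paths.get(args[1]));
        System.out.println("gzip header present: " + isGzip(compressed));
        System.out.println("matches original: " + Arrays.equals(gunzip(compressed), original));
    }
}
```

One caveat when comparing byte for byte: the job writes a newline after every record, so if the original input file does not end with a newline, the decompressed output will be one byte longer even though the lines are the same.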