Hadoop 0.20.2
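1. Using the streaming command (excerpted from the Hadoop development documentation):
Besides plain-text output, you can also generate gzip-compressed output. You only need to set the following options on the streaming job: ‘-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec’. Note that the codec class name is GzipCodec, not GzipCode.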
2. Using a program:
Input files:
$ bin/hadoop fs -ls /temp/in
Found 2 items
-rw-r--r-- 1 Administrator supergroup 52 2012-02-09 10:02 /temp/in/t1.txt
-rw-r--r-- 1 Administrator supergroup 35 2012-02-09 10:02 /temp/in/t2.txt
Test code:
package com.hadoop.test;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ZipFile {

    // Map-only pass-through: each input line becomes the output key and the value is empty,
    // so TextOutputFormat writes the line unchanged.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            output.collect(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(ZipFile.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(NullWritable.class);

        // Input and output are DIRECTORIES, not files; the timestamp keeps the output path unique.
        FileInputFormat.setInputPaths(conf, new Path("/temp/in"));
        FileOutputFormat.setOutputPath(conf, new Path("/temp/out-" + System.currentTimeMillis()));

        conf.setMapperClass(Map.class);
        conf.setNumReduceTasks(0); // no reducer: map output is written directly

        // Enable gzip compression of the job output.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        JobClient.runJob(conf);
    }
}
Output files:
$ bin/hadoop fs -ls /temp/out-1328857284203
Found 2 items
-rw-r--r-- 3 Administrator supergroup 67 2012-02-10 15:01 /temp/out-1328857284203/part-00000.gz
-rw-r--r-- 3 Administrator supergroup 53 2012-02-10 15:01 /temp/out-1328857284203/part-00001.gz
Download one of the output files with:
$ bin/hadoop fs -get /temp/out-1328857284203/part-00000.gz out1.gz
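The file downloaded to the local machine is still gzip-compressed (gzip, not zip). After decompressing it, the content matches the original input file.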
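The comparison with the original file can also be done in code rather than by opening the file by hand. The following is only a minimal sketch using the JDK's java.util.zip classes. The class name GzipCheck and its methods are hypothetical names introduced for illustration, and the two command-line arguments are assumed to be the downloaded file (for example out1.gz) and a local copy of the matching input file. Which part file corresponds to which input file is not stated above, so that pairing has to be checked by the reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not part of Hadoop): checks that a gzip file decompresses to the expected bytes.
public class GzipCheck {

    /** Compresses raw bytes into gzip format; only used here to build sample data. */
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip bytes back into the raw content. */
    public static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    /** Returns true when the decompressed content equals the expected bytes. */
    public static boolean matches(byte[] compressed, byte[] expected) throws IOException {
        return Arrays.equals(gunzip(compressed), expected);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java GzipCheck <file.gz> <original file>");
            return;
        }
        byte[] compressed = Files.readAllBytes(Paths.get(args[0]));
        byte[] original = Files.readAllBytes(Paths.get(args[1]));
        System.out.println(matches(compressed, original) ? "identical" : "different");
    }
}
```

The job writes every record followed by a newline, so an input file whose last line has no trailing newline would be reported as different by one byte even though the lines themselves match.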