compress files with Hadoop Streaming

Compress the files in one directory into another directory using Hadoop Streaming.
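If the input directory does not exist yet, it can be created and populated with hadoop fs. A minimal sketch; the local file names are placeholders, and the paths match the examples below:

hadoop fs -mkdir -p /home/houzhizhen/defaultfs/test/input
hadoop fs -put local1.txt local2.txt /home/houzhizhen/defaultfs/test/input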

use ‘cut -f 2’ as the mapper

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/outputcut \
  -mapper "cut -f 2"

This produces one file in the output directory for each file in the input directory. After decompressing a part file with ‘gunzip’, its length does not equal the source file's length: it shrinks by one byte per line, probably because the job replaces ‘\r\n’ line endings with ‘\n’. (The test input apparently contains no TAB characters; by default ‘cut -f 2’ prints a line in full when it contains no delimiter, so each line passes through whole.)

[houzhizhen@localhost outputcut]$ ll
total 12
-rw-r--r--. 1 houzhizhen root 2938 May 16 10:07 part-00000.gz
-rw-r--r--. 1 houzhizhen root  325 May 16 10:07 part-00001.gz
-rw-r--r--. 1 houzhizhen root  128 May 16 10:07 part-00002.gz
-rw-r--r--. 1 houzhizhen root    0 May 16 10:07 _SUCCESS
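To confirm the one-byte-per-line difference, decompress a part file and compare byte and line counts. A sketch; ‘source.txt’ is a placeholder for the corresponding input file:

gunzip -c part-00000.gz > part-00000
wc -c source.txt part-00000    # byte counts should differ by the number of lines
wc -l part-00000               # line count to compare against the byte difference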

use ‘/bin/cat’ as the mapper

The output is identical to that of the previous test: since ‘cut -f 2’ passed each line through whole, ‘/bin/cat’ produces the same result, one compressed file per input file.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input /home/houzhizhen/defaultfs/test/input \
            -output /home/houzhizhen/defaultfs/test/output-gz \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
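To verify the round trip, decompress and compare against the source. A sketch, assuming bash and a source file with ‘\r\n’ line endings; ‘source.txt’ is again a placeholder:

gunzip -c /home/houzhizhen/defaultfs/test/output-gz/part-00000.gz > restored.txt
diff <(tr -d '\r' < source.txt) restored.txt    # no output means the contents match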

reduce into one compressed file directly

Notice: this sends all of the data to a single reduce task, so it runs very slowly if the input is large.
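To judge whether a single reducer is acceptable, the total input size can be checked first, for example:

hadoop fs -du -s -h /home/houzhizhen/defaultfs/test/input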

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -input /home/houzhizhen/defaultfs/test/input \
        -output /home/houzhizhen/defaultfs/test/archive \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat
decompress

In /home/houzhizhen/defaultfs/test/archive:

bunzip2 part-00000.bz2
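Alternatively, inspect the contents without expanding the archive:

bzcat part-00000.bz2 | head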