compress files with Hadoop Streaming

Compress the files in one directory into another directory using Hadoop Streaming.
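If the input directory does not exist yet, it can be created and populated with hadoop fs. A minimal sketch; the local file names are placeholders, and the paths match the examples below:

hadoop fs -mkdir -p /home/houzhizhen/defaultfs/test/input
hadoop fs -put local1.txt local2.txt /home/houzhizhen/defaultfs/test/input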

use ‘cut -f 2’ as the mapper

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input /home/houzhizhen/defaultfs/test/input \
  -output /home/houzhizhen/defaultfs/test/outputcut \
  -mapper "cut -f 2"

This produces one file in the output directory for each file in the input directory. After decompressing a part file with ‘gunzip’, its length does not equal the source file's length: it shrinks by one byte per line, probably because the job replaces ‘\r\n’ line endings with ‘\n’. (The test input apparently contains no TAB characters; by default ‘cut -f 2’ prints a line in full when it contains no delimiter, so each line passes through whole.)

[houzhizhen@localhost outputcut]$ ll
total 12
-rw-r--r--. 1 houzhizhen root 2938 May 16 10:07 part-00000.gz
-rw-r--r--. 1 houzhizhen root  325 May 16 10:07 part-00001.gz
-rw-r--r--. 1 houzhizhen root  128 May 16 10:07 part-00002.gz
-rw-r--r--. 1 houzhizhen root    0 May 16 10:07 _SUCCESS
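To confirm the one-byte-per-line difference, decompress a part file and compare byte and line counts. A sketch; ‘source.txt’ is a placeholder for the corresponding input file:

gunzip -c part-00000.gz > part-00000
wc -c source.txt part-00000    # byte counts should differ by the number of lines
wc -l part-00000               # line count to compare against the byte difference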

use ‘/bin/cat’ as the mapper

The output is identical to that of the previous test: since ‘cut -f 2’ passed each line through whole, ‘/bin/cat’ produces the same result, one compressed file per input file.

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
            -Dmapred.reduce.tasks=0 \
            -Dmapred.output.compress=true \
            -Dmapred.compress.map.output=true \
            -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
            -input /home/houzhizhen/defaultfs/test/input \
            -output /home/houzhizhen/defaultfs/test/output-gz \
            -mapper /bin/cat \
            -inputformat org.apache.hadoop.mapred.TextInputFormat \
            -outputformat org.apache.hadoop.mapred.TextOutputFormat
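To verify the round trip, decompress and compare against the source. A sketch, assuming bash and a source file with ‘\r\n’ line endings; ‘source.txt’ is again a placeholder:

gunzip -c /home/houzhizhen/defaultfs/test/output-gz/part-00000.gz > restored.txt
diff <(tr -d '\r' < source.txt) restored.txt    # no output means the contents match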

reduce into one compressed file directly

Notice: this sends all of the data to a single reduce task, so it runs very slowly if the input is large.
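To judge whether a single reducer is acceptable, the total input size can be checked first, for example:

hadoop fs -du -s -h /home/houzhizhen/defaultfs/test/input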

hadoop jar ./share/hadoop/tools/lib/hadoop-streaming-2.7.5.jar \
        -Dmapred.reduce.tasks=1 \
        -Dmapred.output.compress=true \
        -Dmapred.compress.map.output=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
        -input /home/houzhizhen/defaultfs/test/input \
        -output /home/houzhizhen/defaultfs/test/archive \
        -mapper /bin/cat \
        -reducer /bin/cat \
        -inputformat org.apache.hadoop.mapred.TextInputFormat \
        -outputformat org.apache.hadoop.mapred.TextOutputFormat
decompress

In /home/houzhizhen/defaultfs/test/archive:

bunzip2 part-00000.bz2
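Alternatively, inspect the contents without expanding the archive:

bzcat part-00000.bz2 | head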