Decompressing Snappy, LZO, bzip2, gzip, and deflate files

Original post, 2013-12-02 14:26:13


Snappy, LZO, bzip2, gzip, and deflate are all file compression formats commonly used with Hive, each with its own strengths. Here we focus only on decompressing the actual files.

I. First, the code:

package compress;

import java.io.FileInputStream;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

	public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

	public static void main(String[] args) throws Exception {

		Configuration conf = new Configuration();
		// Register the codecs the factory should know about. The LZO/LZOP codecs
		// are not part of stock Hadoop, so they must be listed explicitly here.
		String name = "io.compression.codecs";
		String value = "org.apache.hadoop.io.compress.GzipCodec,"
				+ "org.apache.hadoop.io.compress.DefaultCodec,"
				+ "com.hadoop.compression.lzo.LzoCodec,"
				+ "com.hadoop.compression.lzo.LzopCodec,"
				+ "org.apache.hadoop.io.compress.BZip2Codec";
		conf.set(name, value);
		CompressionCodecFactory factory = new CompressionCodecFactory(conf);
		for (int i = 0; i < args.length; ++i) {
			// The codec is selected from the file name extension (.snappy, .gz, ...).
			CompressionCodec codec = factory.getCodec(new Path(args[i]));
			if (codec == null) {
				System.out.println("Codec for " + args[i] + " not found.");
			} else {
				CompressionInputStream in = null;
				try {
					// Wrap the raw file stream with the codec's decompressor
					// and copy the decompressed bytes to stdout.
					in = codec.createInputStream(new FileInputStream(args[i]));
					byte[] buffer = new byte[100];
					int len = in.read(buffer);
					while (len > 0) {
						System.out.write(buffer, 0, len);
						len = in.read(buffer);
					}
				} finally {
					if (in != null) {
						in.close();
					}
				}
			}
		}
	}
}
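As a small variant (my own sketch, not in the original post; the class name DecompressToFile and the 4 KB buffer are my choices), the same factory can be used to write the decompressed data back to a file instead of stdout, deriving the output name by stripping the codec's default extension with CompressionCodecFactory.removeSuffix:

package compress;

import java.io.FileInputStream;
import java.io.FileOutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class DecompressToFile {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		// As in Decompress above, extra codecs such as LZO would need to be
		// registered via io.compression.codecs before they can be resolved.
		CompressionCodecFactory factory = new CompressionCodecFactory(conf);
		for (String file : args) {
			CompressionCodec codec = factory.getCodec(new Path(file));
			if (codec == null) {
				System.err.println("Codec for " + file + " not found.");
				continue;
			}
			// Strip the codec's extension to get the output file name,
			// e.g. 000000_0.snappy -> 000000_0.
			String output = CompressionCodecFactory.removeSuffix(file,
					codec.getDefaultExtension());
			// copyBytes closes both streams when the last argument is true.
			IOUtils.copyBytes(codec.createInputStream(new FileInputStream(file)),
					new FileOutputStream(output), 4096, true);
			System.out.println(file + " -> " + output);
		}
	}
}

It is run exactly like Decompress; /tmp/snappy/000000_0.snappy, for example, would be written back out as /tmp/snappy/000000_0.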

II. Preparation

1. Dependencies

Briefly, the core codec classes for these compression formats are:

org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec

First we need the dependency jars; I put everything needed for decompression under /home/apache/test/lib/.

We also need the native libraries that the compression codecs rely on. Find a machine with Hadoop installed and copy its $HADOOP_HOME/lib/native directory over; I put it under /tmp/decompress.
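As a quick sanity check (my own addition, not from the original post; the class name NativeCheck is hypothetical), a tiny class run with the same -Djava.library.path can confirm that the copied native directory is actually picked up:

package compress;

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
	public static void main(String[] args) {
		// Prints true only if libhadoop.so was found via -Djava.library.path.
		System.out.println("hadoop native library loaded: "
				+ NativeCodeLoader.isNativeCodeLoaded());
	}
}

For example: java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.NativeCheck. If it prints false, Snappy and LZO decompression will fail, although gzip and bzip2 may still work through the pure-Java implementations.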

2. Preparing the compressed files

2.1 Snappy file

Since I don't have the Snappy tooling installed locally, I use Hive to create the Snappy-compressed file.

Only two parameters are needed:

hive.exec.compress.output is set to true to declare that the result files should be compressed

mapred.output.compression.codec sets the specific codec used for the result files

Check both parameters in the Hive shell, set the codec to the Snappy format we want, then run any SQL that writes its result to a local directory:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/snappy' select * from info900m limit 20;
This gives us the result file /tmp/snappy/000000_0.snappy.

2.2 LZO file

Same as above, but this time we switch the codec to LZO:

hive> set hive.exec.compress.output;
hive.exec.compress.output=true
hive> set mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/lzo' select * from info900m limit 20;
This produces the result file /tmp/lzo/000000_0.lzo.

2.3 Creating the bz2 and gz files

Creating the bz2 file (using /etc/resolv.conf as the sample data):
[apache@indigo bz2]$ cp /etc/resolv.conf .
[apache@indigo bz2]$ cat resolv.conf
# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

Creating the gz file:
[apache@indigo bz2]$ tar zcf resolv.conf.gz resolv.conf

Note that tar zcf produces a gzip-compressed tar archive rather than a plain gzip file (and the bz2 file was presumably created the same way, although the command is not shown), which is why the decompressed output in part III below begins with a tar header rather than directly with the file contents.

2.4 Creating a deflate file

hive> set mapred.output.compression.codec;
mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/deflate' select * from info900m limit 20;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_1385947742139_0006, Tracking URL = http://indigo:8088/proxy/application_1385947742139_0006/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1385947742139_0006
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2013-12-02 13:30:48,522 Stage-1 map = 0%,  reduce = 0%
2013-12-02 13:30:56,271 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 1.2 sec
2013-12-02 13:30:57,330 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
......
2013-12-02 13:31:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.85 sec
2013-12-02 13:31:16,552 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.85 sec
MapReduce Total cumulative CPU time: 4 seconds 850 msec
Ended Job = job_1385947742139_0006 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1385947742139_0006_m_000003 (and more) from job job_1385947742139_0006

Task with the most failures(4): 
-----
Task ID:
  task_1385947742139_0006_r_000000

URL:
  http://indigo:8088/taskdetails.jsp?jobid=job_1385947742139_0006&tipid=task_1385947742139_0006_r_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:270)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:460)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:407)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:157)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:152)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":"20130526","_col1":"20130526","_col2":"SXY","_col3":"4577020","_col4":"20","_col5":"P029124","_col6":"1","_col7":"612423196707110625","_col8":"","_col9":"Y1","_col10":"20130526"},"alias":0}
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:258)
	... 7 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:479)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:543)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.LimitOperator.processOp(LimitOperator.java:51)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:800)
	at org.apache.hadoop.hive.ql.exec.ExtractOperator.processOp(ExtractOperator.java:45)
	at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:474)
	at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:249)
	... 7 more
Caused by: java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCode was not found.
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:94)
	at org.apache.hadoop.hive.ql.exec.Utilities.getFileExtension(Utilities.java:910)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:469)
	... 16 more
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.io.compress.DefaultCode not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
	at org.apache.hadoop.mapred.FileOutputFormat.getOutputCompressorClass(FileOutputFormat.java:91)
	... 18 more


FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
MapReduce Jobs Launched: 
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 4.85 sec   HDFS Read: 460084 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 4 seconds 850 msec
Here Hive apparently isn't picking up Hadoop's classpath, so the workaround is to put the dependency on Hive's classpath, restart Hive, and re-run the query. (Note that the stack trace complains about org.apache.hadoop.io.compress.DefaultCode, with the trailing "c" missing, so a mistyped codec name may also have played a part.)

cp /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.3.0.jar /usr/lib/hive/lib

Everything is now in place. Compile the class above and let's get started.

III. Decompression

1. Snappy file

A note: the program's argument is the name of the file to decompress, and the matching decompressor is created based on the file's extension, so the extension must not be changed arbitrarily. Now let's decompress the Snappy file obtained earlier:

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/snappy/000000_0.snappy
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................
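As an aside, here is a minimal sketch (mine, not from the original post; the class name WhichCodec is hypothetical) showing that the lookup really is driven purely by the file name suffix. Exactly which codecs are returned depends on what is registered, e.g. LZO has to be added via io.compression.codecs as in Decompress above:

package compress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class WhichCodec {
	public static void main(String[] args) {
		CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
		// The factory looks only at the suffix of the name it is given;
		// the files themselves do not need to exist.
		for (String name : new String[] { "x.snappy", "x.gz", "x.bz2", "x.deflate", "x.txt" }) {
			CompressionCodec codec = factory.getCodec(new Path(name));
			System.out.println(name + " -> "
					+ (codec == null ? "no codec found" : codec.getClass().getName()));
		}
	}
}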

2. LZO file

Since I have the LZO tools installed, this file can also be decompressed directly with lzop:

[apache@indigo lzo]$ lzop -d 000000_0.lzo 
[apache@indigo lzo]$ ll
total 8
-rw-r--r--. 1 apache apache 1650 Dec  2 13:12 000000_0
-rwxr-xr-x. 1 apache apache  848 Dec  2 13:12 000000_0.lzo
Alternatively, just pass the .lzo file name to compress.Decompress:
[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/test/lib/*" compress.Decompress /tmp/lzo/000000_0.lzo
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.


3. bzip2 file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.bz2
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

4. gzip file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/bz2/resolv.conf.gz
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
resolv.conf0000644000175000017500000000012412247014427012463 0ustar  apacheapache# Generated by NetworkManager
domain dhcp
search dhcp server
nameserver 192.168.0.1

5. deflate file

[apache@indigo decom]$ java -Djava.library.path=/tmp/decompress -classpath ".:/home/apache/diary/1202/lib/*" compress.Decompress /tmp/deflate/000000_0.deflate
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
.................. file contents omitted ................................


