Introduction to Apache Commons Compress: Zip Compression and Decompression

Article link: https://blog.csdn.net/q1182614883/article/details/120332446

Overview
Why use Apache Commons Compress
- java.lang.IllegalArgumentException: MALFORMED when reading a zip file with Java's built-in ZipFile
Handling zip files with Apache commons-compress
Summary

Overview

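Apache Commons Compress website: http://commons.apache.org/proper/commons-compress/index.html
The Apache Commons Compress library defines an API for working with ar, cpio, Unix dump, tar, zip, gzip, XZ, Pack200, bzip2, 7z, arj, lzma, snappy, DEFLATE, lz4, Brotli, Zstandard, DEFLATE64 and Z files.
At the time of writing, the current Compress release is 1.21, which requires Java 8 or later (releases up to 1.20 run on Java 7).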

Why use Apache Commons Compress

java.lang.IllegalArgumentException: MALFORMED when reading a zip file with Java's built-in ZipFile

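Roughly, what happened was that a zip archive uploaded from the front end failed to unzip.
I first searched online for the cause. Every answer blamed the encoding and said that switching from UTF-8 to GBK would fix it.
I then located the code involved, a method named unzip():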

public static void unzip(File zipFile, String descDir) {
    try {
        File pathFile = new File(descDir);
        if (!pathFile.exists()) {
            pathFile.mkdirs();
        }
        ZipFile zip = getZipFile(zipFile);
        for (Enumeration entries = zip.entries(); entries.hasMoreElements(); ) {
            ZipEntry entry = (ZipEntry) entries.nextElement();
            String zipEntryName = entry.getName();
            // NOTE: `pre` is not defined in the code shown; presumably a path prefix held elsewhere in the original class
            if (StringUtils.isNotBlank(pre)) {
                zipEntryName = zipEntryName.substring(pre.length());
            }
            InputStream in = zip.getInputStream(entry);
            String outPath = (descDir + "/" + zipEntryName).replaceAll("\\*", "/");
            // Create the parent directory if it does not exist yet
            File file = new File(outPath.substring(0, outPath.lastIndexOf('/')));
            if (!file.exists()) {
                file.mkdirs();
            }
            // If the full path is a directory it was already created above, so there is nothing to extract
            if (new File(outPath).isDirectory()) {
                continue;
            }
            // Log the output path
            LOG.info("Extracting to path: {}", outPath);
            OutputStream out = new FileOutputStream(outPath);
            IOUtils.copy(in, out);
            in.close();
            out.close();
        }
        zip.close();
        LOG.info("******************Extraction finished********************");

    } catch (Exception e) {
        LOG.error("[unzip] Failed to extract zip file", e);
    }
}

private static ZipFile getZipFile(File zipFile) throws Exception {
    ZipFile zip = new ZipFile(zipFile, Charset.forName("UTF-8"));
    Enumeration entries = zip.entries();
    while (entries.hasMoreElements()) {
        try {
            entries.nextElement();
            zip.close();
            zip = new ZipFile(zipFile, Charset.forName("UTF-8"));
            return zip;
        } catch (Exception e) {
            zip = new ZipFile(zipFile, Charset.forName("GBK"));
            return zip;
        }
    }
    return zip;
}

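I then took the zip file uploaded from the front end and debugged it locally. The exception is thrown on line 9 of unzip(), at this statement: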

ZipEntry entry = (ZipEntry) entries.nextElement();

Following the original stack trace leads to line 58 of ZipCoder: throw new IllegalArgumentException("MALFORMED")

String toString(byte[] ba, int length) {
    CharsetDecoder cd = decoder().reset();
    int len = (int)(length * cd.maxCharsPerByte());
    char[] ca = new char[len];
    if (len == 0)
        return new String(ca);
    // UTF-8 only for now. Other ArrayDeocder only handles
    // CodingErrorAction.REPLACE mode. ZipCoder uses
    // REPORT mode.
    if (isUTF8 && cd instanceof ArrayDecoder) {
        int clen = ((ArrayDecoder)cd).decode(ba, 0, length, ca);
        if (clen == -1)    // malformed
            throw new IllegalArgumentException("MALFORMED");
        return new String(ca, 0, clen);
    }
    ByteBuffer bb = ByteBuffer.wrap(ba, 0, length);
    CharBuffer cb = CharBuffer.wrap(ca);
    CoderResult cr = cd.decode(bb, cb, true);
    if (!cr.isUnderflow())
        throw new IllegalArgumentException(cr.toString());
    cr = cd.flush(cb);
    if (!cr.isUnderflow())
        throw new IllegalArgumentException(cr.toString());
    return new String(ca, 0, cb.position());
}

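So only the UTF-8 path enters this if branch and throws? Sure enough, as the answers online said, changing the encoding to GBK makes the error go away.
However, ZipCoder is a JDK class (its source is in src.zip), and the check is there for a reason, so simply switching to GBK to make this bug disappear is clearly not a sound fix.

So I needed a different approach. Some zip files in production could still be previewed. I unzipped one of the production archives, re-packed it on my own machine (using 360压缩, i.e. 360 Zip), ran the code above again, and it unzipped successfully. Why?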

ZipFile zf = new ZipFile(file);

public ZipFile(String name) throws IOException {
        this(new File(name), OPEN_READ);
}

 public ZipFile(File file, int mode) throws IOException {
        this(file, mode, StandardCharsets.UTF_8);
}

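ZipFile defaults to UTF_8 for entry names, but the names in the archive being unzipped were not encoded in UTF_8, which caused the problem. If archives in yet other encodings turn up, each of those would need handling too.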
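To make the encoding mismatch concrete, here is a minimal JDK-only sketch. It writes a small archive whose entry name is encoded in GBK, then lists the names by trying a sequence of candidate charsets in order. The class name ZipCharsetFallback, the helpers listNames and demo, and the choice of candidates are all made up for this illustration; they are not part of the original project. Depending on the JDK version, the decoding failure may surface as an IllegalArgumentException or as a ZipException, so the sketch catches both.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper: list entry names, trying several candidate charsets in order. */
class ZipCharsetFallback {

    static List<String> listNames(File zip, Charset... candidates) throws IOException {
        Exception last = null;
        for (Charset cs : candidates) {
            // try-with-resources closes the archive whether or not decoding succeeds
            try (ZipFile zf = new ZipFile(zip, cs)) {
                List<String> names = new ArrayList<>();
                Enumeration<? extends ZipEntry> entries = zf.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
                return names; // every name decoded cleanly with this charset
            } catch (IllegalArgumentException | IOException e) {
                last = e; // remember the failure and try the next candidate
            }
        }
        throw new IOException("No candidate charset could decode the entry names", last);
    }

    /** Builds a temporary archive with a GBK-encoded name and reads it back with fallback. */
    static List<String> demo() {
        try {
            Charset gbk = Charset.forName("GBK");
            File zip = File.createTempFile("gbk-names", ".zip");
            zip.deleteOnExit();
            try (OutputStream out = Files.newOutputStream(zip.toPath());
                 ZipOutputStream zos = new ZipOutputStream(out, gbk)) {
                // "\u4e2d\u6587.txt" is a two-character Chinese file name followed by ".txt"
                zos.putNextEntry(new ZipEntry("\u4e2d\u6587.txt"));
                zos.write("hello".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
            // UTF-8 is tried first and rejected; GBK then decodes the name
            return listNames(zip, StandardCharsets.UTF_8, gbk);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This kind of fallback is only a heuristic: some byte sequences decode without error under more than one charset, so a name can pass the check and still come out garbled. That is one reason a hard-coded switch to GBK is not a general fix.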

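Handling zip files with Apache commons-compress

Unzipping with Apache commons-compress is a pleasant experience: it solved the cross-platform garbled-name problem for zip entries with Chinese file names. Whether the archive was created on Windows, Mac or Linux, the extracted names were no longer garbled.

Adding Apache commons-compress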

<properties>
    <apache.commons.compress.version>1.20</apache.commons.compress.version>
</properties>

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>${apache.commons.compress.version}</version>
</dependency>

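Compress supports two ways of unzipping a zip: sequential and random access. That sounds a bit like the difference between a linked list (sequential) and an array (random access), and the analogy does hold. Both modes follow from the zip format itself. I may explain the details in a later post; in brief, a zip file carries header information, or extraction metadata (for example: file 1 starts at byte 100, is m bytes compressed and n bytes uncompressed; file 2 starts at byte 400, ...). Once this metadata has been read up front, we can jump straight to any entry and extract exactly what we want. This is random-access extraction, which is what ZipFile provides.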
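To illustrate what this metadata looks like on disk, here is a rough JDK-only sketch that parses it by hand. It locates the end-of-central-directory record at the tail of the archive, then walks the central directory and reports each entry's name, local header offset, compressed size and uncompressed size. The class name CentralDirectorySketch and its helpers are invented for this example. It assumes a small, single-file archive with no Zip64 extensions and UTF-8 or ASCII names, and it is not how Commons Compress implements ZipFile internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Illustrative only: reads zip metadata by hand (no Zip64, no split archives). */
class CentralDirectorySketch {
    static final int EOCD_SIGNATURE = 0x06054b50; // end of central directory record
    static final int CEN_SIGNATURE = 0x02014b50;  // central directory file header

    static List<String> describeEntries(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        // The EOCD record is at least 22 bytes long and sits at the very end,
        // so scan backwards from the last position where it could start.
        int eocd = zip.length - 22;
        while (eocd >= 0 && buf.getInt(eocd) != EOCD_SIGNATURE) {
            eocd--;
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("End of central directory record not found");
        }
        int totalEntries = buf.getShort(eocd + 10) & 0xFFFF;
        int pos = buf.getInt(eocd + 16); // offset of the first central directory header

        List<String> rows = new ArrayList<>();
        for (int i = 0; i < totalEntries; i++) {
            if (buf.getInt(pos) != CEN_SIGNATURE) {
                throw new IllegalArgumentException("Bad central directory header at " + pos);
            }
            long compressedSize = buf.getInt(pos + 20) & 0xFFFFFFFFL;
            long size = buf.getInt(pos + 24) & 0xFFFFFFFFL;
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            long localHeaderOffset = buf.getInt(pos + 42) & 0xFFFFFFFFL;
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);
            rows.add(name + " @" + localHeaderOffset
                    + " compressed=" + compressedSize + " size=" + size);
            pos += 46 + nameLen + extraLen + commentLen; // jump to the next header
        }
        return rows;
    }

    /** Builds a two-entry archive in memory so the sketch has something to parse. */
    static byte[] sampleZip() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bytes)) {
                zos.putNextEntry(new ZipEntry("a.txt"));
                zos.write("hello".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
                zos.putNextEntry(new ZipEntry("b.txt"));
                zos.write("hello world".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        describeEntries(sampleZip()).forEach(System.out::println);
    }
}
```

Because every entry's offset and sizes are recorded in one place, a reader that can seek within the file can open any entry directly without touching the others.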

Random access with ZipFile:

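import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipFile;

    /**
     * Random access to a single entry with ZipFile.
     *
     * @param file the zip archive to read
     * @throws Exception if the archive cannot be read or the entry is missing
     */
    public static void zipFileOutputFileTest(File file) throws Exception {
        // try-with-resources closes the archive even if extraction fails
        try (ZipFile zipFile = new ZipFile(file)) {
            // Look the entry up directly by name
            ZipArchiveEntry entry = zipFile.getEntry("targetFile");
            if (entry == null) {
                throw new FileNotFoundException("targetFile not found in " + file);
            }
            File outputFile = new File("/tmp/output/targetFile");
            outputFile.getParentFile().mkdirs();
            // The entry's InputStream is an ordinary IO stream; copy it as usual
            try (InputStream inputStream = zipFile.getInputStream(entry);
                 OutputStream fos = new FileOutputStream(outputFile)) {
                byte[] buffer = new byte[1024];
                int n;
                while ((n = inputStream.read(buffer)) != -1) {
                    fos.write(buffer, 0, n); // write only the bytes actually read
                }
            }
        }
    }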


    /**
     * Random access to all entries with ZipFile.
     *
     * @param file the zip archive to read
     * @throws Exception if the archive cannot be read or a file cannot be written
     */
    public static void zipFileOutputFilesTest(File file) throws Exception {
        byte[] buffer = new byte[4096];
        // try-with-resources closes the archive even if extraction fails
        try (ZipFile zipFile = new ZipFile(file)) {
            Enumeration<ZipArchiveEntry> entries = zipFile.getEntries(); // iterator over every entry
            while (entries.hasMoreElements()) {
                ZipArchiveEntry entry = entries.nextElement();
                if (entry.isDirectory()) {
                    continue;
                }

                // entry.getName() comes from the archive; validate it first if the archive is untrusted
                File outputFile = new File("C:/Users/11826/Desktop/law2/" + entry.getName());

                if (!outputFile.getParentFile().exists()) {
                    outputFile.getParentFile().mkdirs();
                }

                try (InputStream inputStream = zipFile.getInputStream(entry);
                     FileOutputStream fos = new FileOutputStream(outputFile)) {
                    int n;
                    while ((n = inputStream.read(buffer)) != -1) {
                        fos.write(buffer, 0, n); // write only the bytes actually read
                    }
                }
            }
        }
    }

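In addition, the ZipFile constructors accept the zip as either a File or a SeekableByteChannel; the SeekableByteChannel form makes it possible to unzip an archive that has already been read into memory. There are also some less commonly used parameters; see the Javadoc on the ZipFile constructors for details.

Sequential access with ZipArchiveInputStream:

Besides random access, Compress also supports sequential extraction. You may wonder why sequential extraction is needed when random access exists. The reason is that a zip file's header information is actually stored at the end of the file: a reader must first read the last part of the zip and then jump backwards to the entries. When the zip is already on disk or in memory, random access is very efficient.

For some IO scenarios, such as network IO, we could read the whole zip into memory and then use random access. But for large zip files, or in memory-constrained environments (such as phones), the cost may be too high.

ZipArchiveInputStream is designed for exactly this scenario: it reads the archive one entry at a time, and for each entry it encounters you can decide whether to extract it or skip it. Demo code: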

    /**
     * Sequential access with ZipArchiveInputStream.
     *
     * @param file the zip archive to read
     * @throws Exception if the archive cannot be read
     */
    public static void zipArchiveInputStreamTest(File file) throws Exception {
        try (ZipArchiveInputStream zipInputStream = new ZipArchiveInputStream(new FileInputStream(file))) {
            ZipArchiveEntry entry;
            while ((entry = zipInputStream.getNextZipEntry()) != null) {
                if (!entry.isDirectory()) {
                    String name = entry.getName();
                    // May be -1 (unknown) when the archive stores the size only after the entry data
                    long size = entry.getSize();
                    System.out.println(name + "--" + size);
                    // Collect all bytes of the current entry for later business processing
                    ByteArrayOutputStream bos = new ByteArrayOutputStream();
                    byte[] content = new byte[1024];
                    int i;
                    while ((i = zipInputStream.read(content)) != -1) {
                        // Append the bytes just read to the buffer created above;
                        // alternatively, process them directly instead of buffering them
                        bos.write(content, 0, i);
                    }
                    byte[] bytes = bos.toByteArray();
                    // Wrap the collected bytes in a buffered input stream for later business processing
                    BufferedInputStream bufferedInputStream = new BufferedInputStream(new ByteArrayInputStream(bytes));
                    while ((i = bufferedInputStream.read(content)) != -1) {
                        System.out.println(new String(content, 0, i));
                    }
                }
            }
        }
    }

Extracting split archives:

Since version 1.20, Compress can extract split (multi-volume) zip archives. Usage is simple: create the channel with ZipSplitReadOnlySeekableByteChannel, then extract with ZipFile or ZipArchiveInputStream as usual:

// Option 1: build the channel from the last segment. All segments must be in the same
// directory and share the same file name apart from the extension.
File lastSegmentFile = new File("/root/test.zip");
SeekableByteChannel channel = ZipSplitReadOnlySeekableByteChannel.buildFromLastSplitSegment(lastSegmentFile);

// Option 2: build the channel by listing every segment explicitly
File firstSegmentFile = new File("/root/test.z01");
File secondSegmentFile = new File("/root/test.z02");
File thirdSegmentFile = new File("/root/test.zip");
SeekableByteChannel channelFromAllSegments = ZipSplitReadOnlySeekableByteChannel.forFiles(firstSegmentFile, secondSegmentFile, thirdSegmentFile);

Compression

Compress can of course also create zip archives, mainly through ZipArchiveOutputStream. Demo code:

File archive = new File("/root/xx.zip");
try (ZipArchiveOutputStream outputStream = new ZipArchiveOutputStream(archive)) {
    ZipArchiveEntry entry = new ZipArchiveEntry("testdata/test1.xml");
    // Optionally set the compression level
    outputStream.setLevel(5);
    // Optionally set the compression method; ZipEntry.DEFLATED and ZipEntry.STORED are currently supported
    outputStream.setMethod(ZipEntry.DEFLATED);
    // The method can also be set per entry
    entry.setMethod(ZipEntry.DEFLATED);
    // Start a new entry in the zip
    outputStream.putArchiveEntry(entry);
    // Write its content
    outputStream.write("abcd\n".getBytes(StandardCharsets.UTF_8));
    // Finish writing this entry
    outputStream.closeArchiveEntry();

    entry = new ZipArchiveEntry("testdata/test2.xml");
    entry.setMethod(ZipEntry.STORED);
    outputStream.putArchiveEntry(entry);
    outputStream.write("efgh\n".getBytes(StandardCharsets.UTF_8));
    outputStream.closeArchiveEntry();
}

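To create a split zip archive, just pass the desired segment size to the ZipArchiveOutputStream constructor; the rest of the code is identical. Note that a valid zip segment size is between 64 KB and 4 GB, and values outside this range cause an error:

// Create a split zip archive with segments of at most 1 MB
ZipArchiveOutputStream outputStream = new ZipArchiveOutputStream(archive, 1024 * 1024);

Compress also supports creating zip files in parallel. Usage is more involved; see the ParallelScatterZipCreator example on the Compress website, or the test cases in the Compress source code.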

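Summary

Zip extraction in Compress is done with ZipFile or ZipArchiveInputStream, which suit the following scenarios:
1. ZipFile: for zip files on disk or in memory; supports random access
2. ZipArchiveInputStream: for network IO or other cases where the zip can only be read sequentially; sequential access only
Compression is done with ZipArchiveOutputStream, which accepts a parameter to produce split archives.
Split archives are extracted through ZipSplitReadOnlySeekableByteChannel.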