Garbled Chinese when converting docx to HTML with Java POI

Yvonne221

Published 2023-11-16 13:42:42

Article link: https://blog.csdn.net/Yvonne221/article/details/134439682

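Converting docx to HTML with POI worked fine in my local environment. After deploying to production, every Chinese character in the output turned into a question mark.

Specifying the character set explicitly fixed it.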
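
Why question marks rather than some other garbage? A minimal sketch, not taken from the original code: when Java encodes a string with a charset that cannot represent a character, `String.getBytes` substitutes that charset's replacement byte, which is `?` for US-ASCII. The class and method names below are made up for illustration. Whether a different default charset is what actually happened on the production server is an assumption; printing `Charset.defaultCharset()` on both machines is one way to check.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetQuestionMarkDemo {

    // Encode the text with the given charset, then decode the bytes with the
    // same charset. Characters the charset cannot represent are replaced
    // during encoding, so they do not survive the round trip.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for two Chinese characters, so the example does not
        // depend on the encoding of this source file.
        String text = "\u4e2d\u6587";

        // The charset the JVM falls back to when none is specified.
        System.out.println("default charset: " + Charset.defaultCharset());

        // US-ASCII cannot represent Chinese: each character becomes '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII));

        // UTF-8 can represent the text, so it comes back unchanged.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));

        // GBK is usually available in a full JDK, but check before using it.
        if (Charset.isSupported("GBK")) {
            System.out.println(roundTrip(text, Charset.forName("GBK")));
        }
    }
}
```

If the default charset printed on the two machines differs, any code path that encodes or decodes without naming a charset can behave differently between them.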

Before the change, the converter wrote directly to out, the output stream of the HTML file.

OutputStream out = new FileOutputStream(htmlFullPath);
XHTMLConverter.getInstance().convert(doc, out, options);

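After the change, the converter writes to an in-memory out, which is decoded into a string as gbk; the string is then saved to the HTML file. The key is out.toString("gbk").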

// Convert to HTML
ByteArrayOutputStream out = new ByteArrayOutputStream();
XHTMLConverter.getInstance().convert(doc, out, options);

// Save the HTML string to the file
FileUtil.saveFile(out.toString("gbk"), htmlFullPath);
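
FileUtil.saveFile is not shown in this post, so which charset it uses when writing the file is unknown. As a rough sketch of the same idea using only the JDK, the hypothetical helper below (HtmlSaveSketch.save is an illustrative name, not part of POI or of the original project) names the charset at both steps: once when decoding the converter's bytes, and once when writing the file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class HtmlSaveSketch {

    // Decode the buffered bytes with sourceCharset, then write the text to
    // target encoded with targetCharset, so neither step relies on the
    // JVM default charset.
    static void save(ByteArrayOutputStream out, Charset sourceCharset,
                     Path target, Charset targetCharset) throws IOException {
        // toString(String) is used instead of toString(Charset) so the
        // sketch also compiles on Java 8.
        String html = out.toString(sourceCharset.name());

        // Create the parent directory if the path has one.
        Path parent = target.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }

        Files.write(target, html.getBytes(targetCharset));
    }
}
```

With this shape, the call from the snippet above would become something like HtmlSaveSketch.save(out, Charset.forName("GBK"), Paths.get(htmlFullPath), StandardCharsets.UTF_8), assuming the generated HTML should be stored as UTF-8.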

It is worth trying several character sets. In my case utf-8 produced question marks, while gbk and gb2312 both worked.
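
Trying charsets by hand can be narrowed down a little. The sketch below is an illustration only (CharsetProbe and cleanDecodings are made-up names): it uses a strict CharsetDecoder to list which candidate charsets can decode a byte array without hitting a malformed or unmappable sequence. It can only rule charsets out. A clean decode does not prove the charset is the right one, and bytes that were already written as literal '?' cannot be recovered by any charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

class CharsetProbe {

    // Return the names of the candidate charsets that decode the bytes
    // without error. The decoder is configured to throw instead of silently
    // substituting a replacement character.
    static List<String> cleanDecodings(byte[] bytes, String... candidates) {
        List<String> ok = new ArrayList<>();
        for (String name : candidates) {
            // Skip charsets this JVM does not provide.
            if (!Charset.isSupported(name)) {
                continue;
            }
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                ok.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; leave it out.
            }
        }
        return ok;
    }
}
```

For example, calling CharsetProbe.cleanDecodings(out.toByteArray(), "UTF-8", "GBK", "GB2312") on the converter output would show which of the three are even possible, after which the decoded text still needs to be checked by eye.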

Full code before the change:

    /**
     * Convert a docx file to an HTML file.
     * @param docFullPath full path of the docx file
     * @param htmlFullPath full path of the generated HTML file
     */
    public static void docxToHtml(String docFullPath, String htmlFullPath) {

        File docFile = new File(docFullPath);
        if (!docFile.exists()) {
            return;
        }

        XWPFDocument doc = null;
        OutputStream out = null;

        try {

            File htmlFile = new File(htmlFullPath);
            if (!htmlFile.getParentFile().exists()) {
                htmlFile.getParentFile().mkdirs();
            }
            if (!htmlFile.exists()) {
                htmlFile.createNewFile();
            }

            doc = new XWPFDocument(OPCPackage.open(docFullPath));
            out = new FileOutputStream(htmlFullPath);

            // Set the images folder used by the HTML
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new ImageManager(htmlFile.getParentFile(), "images"));
            options.setIgnoreStylesIfUnused(false);
            options.setFragment(false);

            XHTMLConverter.getInstance().convert(doc, out, options);

        } catch (Exception e) {
            e.printStackTrace();

        } finally {
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (out != null) {
                try {
                    out.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

Full code after the change:

    /**
     * Convert a docx file to an HTML file.
     * @param docFullPath full path of the docx file
     * @param htmlFullPath full path of the generated HTML file
     */
    public static void docxToHtml(String docFullPath, String htmlFullPath) {

        File docFile = new File(docFullPath);
        if (!docFile.exists()) {
            return;
        }

        XWPFDocument doc = null;

        try {

            File htmlFile = new File(htmlFullPath);
            if (!htmlFile.getParentFile().exists()) {
                htmlFile.getParentFile().mkdirs();
            }
            if (!htmlFile.exists()) {
                htmlFile.createNewFile();
            }

            doc = new XWPFDocument(OPCPackage.open(docFullPath));

            // Set the images folder used by the HTML
            XHTMLOptions options = XHTMLOptions.create();
            options.setImageManager(new ImageManager(htmlFile.getParentFile(), "images"));
            options.setIgnoreStylesIfUnused(false);
            options.setFragment(false);

            // Convert to HTML
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            XHTMLConverter.getInstance().convert(doc, out, options);

            // Save the HTML string to the file
            FileUtil.saveFile(out.toString("gbk"), htmlFullPath);

        } catch (Exception e) {
            e.printStackTrace();

        } finally {
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

Reference:

Java: how to deal with ByteArrayOutputStream relying on the default encoding (一写代码就开心的博客, CSDN blog)