Converting docx Files to HTML in Java with POI

三文鱼先生

Last modified: 2022-06-01 15:34:46

Category: Common Utility Classes. Tags: java, backend

First published: 2022-04-28 16:04:31

Article link: https://blog.csdn.net/qq_44717657/article/details/124474931

Dependencies Used
Converting a docx File to HTML
Classes Used
Code
Further Reading

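Dependencies Used

Pay attention to the versions here: mismatched versions cause errors. The four POI artifacts below should all use the same version.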

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.15</version>
        </dependency>

        <!-- POI: reads Word doc files -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.15</version>
        </dependency>

        <!-- POI: OOXML support -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.15</version>
        </dependency>

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.15</version>
        </dependency>

        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
            <version>1.0.6</version>
        </dependency>
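
When a version conflict does show up, it often helps to see which jar a class is actually loaded from. The following is only a diagnostic sketch using plain JDK calls; the class name `ClasspathCheck` and its methods are made up for this illustration, and the class names you pass in are simply the ones imported by the code in this article.

```java
import java.security.CodeSource;

/** Diagnostic sketch (hypothetical helper): report where a class is loaded from. */
final class ClasspathCheck {

    /** Returns the jar or directory a class comes from, or a short explanation. */
    static String locate(String className) {
        try {
            // "false" avoids running static initializers; we only need the Class object
            Class<?> type = Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK typically have no code source
            return source == null ? "(no code source, probably a JDK class)"
                                  : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "not found on the classpath";
        }
    }

    /** Builds one "class -> location" line per class name. */
    static String describe(String... classNames) {
        StringBuilder report = new StringBuilder();
        for (String name : classNames) {
            report.append(name).append(" -> ").append(locate(name)).append(System.lineSeparator());
        }
        return report.toString();
    }
}
```

For example, printing `ClasspathCheck.describe("org.apache.poi.xwpf.usermodel.XWPFDocument", "org.apache.poi.xwpf.converter.xhtml.XHTMLConverter")` shows the jar file names, which normally include the version number.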

Converting a docx File to HTML

This section converts a docx document to HTML in Java using POI.

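Classes Used

XWPFDocument: the object used to work with a Word (docx) document
XHTMLOptions: settings applied when converting to HTML
XHTMLConverter: the converter class
For details, see the documentation on the POI website.

Believe it or not, the latter two are not in the official POI documentation. That is because XHTMLOptions and XHTMLConverter come from the xdocreport converter artifact (fr.opensagres.xdocreport) rather than from Apache POI itself, even though their package name starts with org.apache.poi. The only Word-to-HTML classes I could find in the POI documentation are WordToHtmlConverter and WordToHtmlUtils. The former converts doc files (this article converts docx), and the latter is a helper class that works on doc documents alongside that converter, so neither of them handles docx.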
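
Because doc and docx need different converters, it can be useful to check which kind of file you have before choosing one. The sketch below is an illustration only and is not part of the original utility: it looks at the first bytes of the file, relying on the fact that docx is a ZIP container while the older doc format is an OLE2 compound file. The names `WordFormatSniffer`, `Kind`, `sniff` and `sniffFile` are hypothetical. Note that a ZIP signature only says "ZIP-based", so other ZIP-based files (xlsx, plain zip archives) look the same at this level.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch (hypothetical helper): tell ZIP-based docx apart from OLE2-based doc by magic bytes. */
final class WordFormatSniffer {

    enum Kind { OOXML_ZIP, OLE2, UNKNOWN }

    /** Classifies the leading bytes of a file. */
    static Kind sniff(byte[] head) {
        // ZIP local file header "PK\003\004": docx (and any other ZIP-based file)
        if (startsWith(head, 0x50, 0x4B, 0x03, 0x04)) {
            return Kind.OOXML_ZIP;
        }
        // OLE2 compound document header: the binary doc format
        if (startsWith(head, 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1)) {
            return Kind.OLE2;
        }
        return Kind.UNKNOWN;
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static Kind sniffFile(String path) throws IOException {
        byte[] head = new byte[8];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(path))) {
            int read;
            while (filled < head.length && (read = in.read(head, filled, head.length - filled)) > 0) {
                filled += read;
            }
        }
        return sniff(Arrays.copyOf(head, filled));
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could route `Kind.OOXML_ZIP` files to the docx conversion shown in this article and treat `Kind.OLE2` files as candidates for WordToHtmlConverter.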

Code

MyDocUtil.java

package com;

import org.apache.poi.xwpf.converter.core.BasicURIResolver;
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

import java.io.*;
import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * Utility class for converting docx files to HTML.
 *
 * @author 三文鱼
 * @date 2022/4/27
 **/
public class MyDocUtil {

    /**
     * Converts a docx file to HTML.
     *
     * @author 三文鱼
     * @date 11:48 2022/4/28
     * @param filePath path of the docx file to convert
     * @param htmlPath directory that receives the HTML file and the images from the Word document
     **/
    public static void docxToHtml(String filePath, String htmlPath) throws IOException {
        // File name without its extension
        String myFileName = getFileNameInfo(filePath, 0);

        // Folder where the extracted images are stored
        String imagePath = htmlPath + File.separator + myFileName + getDataTime();
        // The HTML file is written into the same folder as the images
        String fileOutName = imagePath + File.separator + myFileName + ".html";

        // Output HTML file; create its parent directories if they do not exist yet
        File outFile = new File(fileOutName);
        outFile.getParentFile().mkdirs();

        // try-with-resources closes the streams and the document even if the conversion fails
        try (InputStream in = new FileInputStream(filePath);
             XWPFDocument document = new XWPFDocument(in);
             OutputStream out = new FileOutputStream(outFile)) {

            // Conversion settings: a fresh options instance with an indent of 4
            XHTMLOptions options = XHTMLOptions.create().indent(4);

            // Image export only needs to be configured when the document contains pictures
            if (!document.getAllPictures().isEmpty()) {
                File imageFolder = new File(imagePath);
                // Destination folder of the image extractor
                options.setExtractor(new FileImageExtractor(imageFolder));
                // URI resolver: base path of the images referenced in the generated HTML
                options.URIResolver(new BasicURIResolver(imagePath));
            }

            // Run the HTML conversion
            XHTMLConverter.getInstance().convert(document, out, options);
        }
    }


    /**
     * Returns the current time as a string, used as a suffix for the output folder.
     *
     * @author 三文鱼
     * @date 10:29 2022/4/28
     * @return java.lang.String
     **/
    public static String getDataTime() {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss");
        Date date = new Date();
        return simpleDateFormat.format(date);
    }

    /**
     * Returns the file name including its extension.
     *
     * @author 三文鱼
     * @date 14:59 2022/4/28
     * @param filePath path of the file
     * @return java.lang.String
     **/
    public static String getFileNameAndExtension(String filePath) {
        // Accept both Windows ("\") and Unix ("/") path separators
        String[] parts = filePath.split("[\\\\/]");
        return parts[parts.length - 1];
    }

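    /**
     * Returns either the file name or the extension.
     *
     * @author 三文鱼
     * @date 14:59 2022/4/28
     * @param str file path, or file name with extension
     * @param position 0 for the file name, 1 for the extension
     * @return java.lang.String
     **/
    public static String getFileNameInfo(String str, int position) {
        String fileNameAndExtension = getFileNameAndExtension(str);
        // Split at the last dot so that a name such as "report.v2.docx" keeps its full base name
        int dot = fileNameAndExtension.lastIndexOf('.');
        if (dot < 0) {
            return position == 0 ? fileNameAndExtension : "";
        }
        return position == 0
                ? fileNameAndExtension.substring(0, dot)
                : fileNameAndExtension.substring(dot + 1);
    }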
}
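
The helper methods above build the output location by hand from strings. As an alternative sketch (not the code used in this article), the same layout, `<htmlPath>/<name><timestamp>/<name>.html`, can be derived with `java.nio.file.Path` and `java.time`. The class name `OutputLayout` and its method names are hypothetical; the timestamp pattern is the one used by `getDataTime()`.

```java
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Sketch (hypothetical helper): derive the output location used by docxToHtml. */
final class OutputLayout {

    // Same pattern as getDataTime(); DateTimeFormatter is immutable and can be shared
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss");

    /** File name without its last extension, e.g. "report.v2.docx" gives "report.v2". */
    static String baseName(Path file) {
        // Assumes the path has a file name component (i.e. it is not a root path)
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot > 0 ? name.substring(0, dot) : name;
    }

    /** Target HTML file: htmlDir/baseName+timestamp/baseName.html */
    static Path htmlTarget(Path docx, Path htmlDir, LocalDateTime now) {
        String base = baseName(docx);
        return htmlDir.resolve(base + STAMP.format(now)).resolve(base + ".html");
    }
}
```

Passing the time in as a parameter keeps the method easy to test; a caller would normally supply `LocalDateTime.now()`.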

MyDocTest.java

import com.MyDocUtil;

import java.io.IOException;

/**
 * Simple test driver for MyDocUtil.docxToHtml.
 *
 * @author 三文鱼
 * @date 2022/4/28
 **/
public class MyDocTest {
    public static void main(String[] args) {
        //String filePath = "F:\\学习记录\\测试数据\\word\\docx\\无图片文档.docx";
        String filePath = "F:\\学习记录\\测试数据\\word\\docx\\测试文档.docx";
        //String filePath = "F:\\学习记录\\测试数据\\word\\doc\\test.doc";

        String htmlPath = "F:\\学习记录\\测试数据\\word\\html\\";

        try {
            MyDocUtil.docxToHtml(filePath , htmlPath);
        } catch (IOException exception) {
            exception.printStackTrace();
        }

    }
}

Result

Word

[Image: the source Word document]

Html

[Image: the generated HTML page]

File structure

[Image: the output file structure]

Further Reading

Creating and reading Office Word documents in Java with POI's XWPFDocument (basics): https://www.cnblogs.com/mh-study/p/9747945.html
Introduction to the Word document structure in POI: https://www.cnblogs.com/Springmoon-venn/p/5494602.html?utm_source=ld246.com
The following one explains things more simply and clearly:
poi-tl: http://deepoove.com/poi-tl/apache-poi-guide.html#_%E7%94%9F%E6%88%90%E6%96%87%E6%A1%A3