Parsing formulas (MathType, LaTeX) in Word 2003 .doc documents and converting them to HTML

Latest recommended article published on 2023-12-01 23:26:51

huibinwei

Category: util. Tags: word, html, formula conversion

Article link: https://blog.csdn.net/whb3299065/article/details/78523175

Parsing formulas in Word 2003 .doc documents

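I know that what many people actually need is formula parsing for the 2007 format rather than the 2003 format, but this is as far as my skills go. I spent a long time looking at the MathML-style formulas of the 2007 format and still have no real approach (not quite no approach: I am not sure whether parsing them directly out of document.xml counts).
Back to the point. First, the tooling: we use Apache POI to parse the Word document.
Uploading the JARs is a hassle, so I will just list the Maven dependencies: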

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-examples</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-excelant</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
    <version>1.0.6</version>
    <exclusions>
        <exclusion>
            <artifactId>ooxml-schemas</artifactId>
            <groupId>org.apache.poi</groupId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>net.sf.jxls</groupId>
    <artifactId>jxls-core</artifactId>
    <version>1.0.6</version>
</dependency>
<dependency>
    <groupId>cn.wanghaomiao</groupId>
    <artifactId>JsoupXpath</artifactId>
    <version>0.3.2</version>
</dependency>
<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.4</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.xdocreport.document</artifactId>
    <version>2.0.1</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlgraphics</groupId>
    <artifactId>batik-parser</artifactId>
    <version>1.9.1</version>
</dependency>
<dependency>
    <groupId>net.arnx</groupId>
    <artifactId>wmf2svg</artifactId>
    <version>0.9.8</version>
</dependency>
<dependency>
    <groupId>org.codeartisans.thirdparties.swing</groupId>
    <artifactId>batik-all</artifactId>
    <version>1.8pre-r1084380</version>
</dependency>

My code is commented in some detail, so I will just paste it here:

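    // Required imports: java.io.*, java.util.UUID, javax.xml.parsers.DocumentBuilderFactory,
    // javax.xml.transform.*, javax.xml.transform.dom.DOMSource, javax.xml.transform.stream.StreamResult,
    // org.w3c.dom.Document, org.apache.poi.hwpf.HWPFDocument,
    // org.apache.poi.hwpf.converter.PicturesManager, org.apache.poi.hwpf.converter.WordToHtmlConverter,
    // org.apache.poi.hwpf.usermodel.PictureType,
    // net.arnx.wmf2svg.gdi.svg.SvgGdi, net.arnx.wmf2svg.gdi.wmf.WmfParser
    // The original post omitted the method signature; the method name here is illustrative.
    public static void convertDocToHtml(String fileName, String outPutFile) throws Exception {
        // Initialize the Word document object
        // (alternative: WordToHtmlUtils.loadDoc(new FileInputStream(inputFile)))
        HWPFDocument wordDocument = new HWPFDocument(new FileInputStream(fileName));
        // Initialize the Word-to-HTML converter
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder()
                        .newDocument());
        // How pictures are handled during conversion: a PicturesManager is passed in
        // (an anonymous class here; a lambda expression would be neater)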
        wordToHtmlConverter.setPicturesManager(new PicturesManager() {
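            /**
             * The converter passes us the details of each picture; we process the picture
             * and return its location.
             * This implementation generates a random UUID as the file name and then checks
             * the picture type. Word stores formula images as WMF vector graphics, which
             * must be converted before they can be displayed, so here they are converted
             * to SVG and saved to disk.
             *
             * @param content       the picture's raw bytes
             * @param pictureType   the picture type
             * @param suggestedName the default (suggested) picture name
             * @param widthInches   the width used when generating the img tag
             * @param heightInches  the height used when generating the img tag
             * @return the value for the src attribute
             */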
            public String savePicture(byte[] content,
                                      PictureType pictureType, String suggestedName,
                                      float widthInches, float heightInches) {
                String uuid = UUID.randomUUID().toString().replaceAll("-", "");
                try {
                    // In this test class WMF is converted to SVG inline; this should be
                    // refactored (Move Method) into ImgUtil
                    if ("wmf".equals(pictureType.getExtension())) {
                        uuid += ".svg";
                        WmfParser parser = new WmfParser();
                        SvgGdi gdi = new SvgGdi(false);
                        InputStream is = new ByteArrayInputStream(content);
                        parser.parse(is, gdi);
                        Document doc = gdi.getDocument();
                        FileOutputStream fos = new FileOutputStream(new File("D:\\date\\file\\textPaper\\img\\" + uuid));
                        // outputSvg flushes and closes the stream
                        ImgUtil.outputSvg(doc, fos);
                    } else {
                        System.out.println(pictureType.getExtension());
                        uuid += "." + pictureType.getExtension();
                        // try-with-resources closes the stream, which the original code never did
                        try (FileOutputStream fos = new FileOutputStream(new File("D:\\date\\file\\textPaper\\img\\" + uuid))) {
                            fos.write(content);
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return "D:\\date\\file\\textPaper\\img\\" + uuid;
            }
        });
        // Set the wordDocument to convert, run the conversion, and get the resulting htmlDocument
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(out);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        // Serialize as UTF-8 explicitly so the bytes can be decoded reliably below
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);
        // Conversion done: get the HTML string. The img tags are replaced with embed so that SVG images display
        String html = new String(out.toByteArray(), java.nio.charset.StandardCharsets.UTF_8).replaceAll("<img", "<embed");
        // Add a meta tag to the head declaring GB2312 (the encoding writeFile uses) to avoid garbled text
        html = html.replace("<head>", "<head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=GB2312\" />");
        System.out.println(html);
        // Save the HTML
        writeFile(html, outPutFile);
        out.close();
       }

    /**
     * Writes a string to the given path. This method should be refactored.
     *
     * @param content the text to write
     * @param path    the destination file path
     */
    public static void writeFile(String content, String path) {
        // try-with-resources closes the writer (and the underlying stream) even on failure
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(new File(path)), "GB2312"))) {
            bw.write(content);
        } catch (IOException ioe) {
            // FileNotFoundException is an IOException, so it is reported here too
            ioe.printStackTrace();
        }
    }

This is a utility class used above:

public class ImgUtil {
    public static void wmfToSvg(byte[] wmf, String savePath, String name) {
        // Make sure the target directory exists and is not shadowed by a regular file
        File dir = new File(savePath);
        if (dir.exists()) {
            if (!dir.isDirectory()) {
                throw new RuntimeException("a file with the same name exists, cannot create directory");
            }
        } else {
            dir.mkdirs();
        }
        try {
            WmfParser parser = new WmfParser();
            SvgGdi gdi = new SvgGdi(false);
            parser.parse(new ByteArrayInputStream(wmf), gdi);
            Document doc = gdi.getDocument();
            // new File(dir, name) adds the path separator if savePath lacks one;
            // outputSvg flushes and closes the stream
            outputSvg(doc, new FileOutputStream(new File(dir, name)));
        } catch (Exception e) {
            // SvgGdiException, WmfParseException, IOException and TransformerException all end up here
            e.printStackTrace();
        }
    }

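    /**
     * Serializes an SVG DOM document to the given stream, then flushes and closes the stream.
     *
     * @param doc the SVG document to write
     * @param out the stream to write to
     * @throws TransformerException if serialization fails
     * @throws IOException if flushing or closing the stream fails
     */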
    public static void outputSvg(Document doc, OutputStream out) throws TransformerException, IOException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, "-//W3C//DTD SVG 1.0//EN");
        transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        out.flush();
        out.close();
    }
}

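Note: many blog posts include an extra step during conversion that fetches the document's picture collection and saves each picture. If the pictures in the Word document are inline, that step is unnecessary. If they are placed behind or in front of the text (floating), my approach will not capture all of them.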
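Limitation: here I simply replaced every image tag, which is not right. There should be a check so that only the tags for SVG images are replaced, but I was too lazy to add one. If you implement it, feel free to leave me a comment.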
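One way to do the selective replacement described above is sketched below. This is not the author's code: the class name `SvgTagRewriter` and the method `embedSvgImages` are hypothetical, and the sketch assumes the serialized HTML writes `src` values in double quotes. It would stand in for the blanket `replaceAll("<img", "<embed")` call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites only those <img> tags whose src ends in ".svg".
class SvgTagRewriter {
    // Group 1 captures everything between "<img" and the closing ">".
    private static final Pattern SVG_IMG = Pattern.compile(
            "<img(\\s[^>]*?\\bsrc\\s*=\\s*\"[^\">]*\\.svg\"[^>]*)>",
            Pattern.CASE_INSENSITIVE);

    static String embedSvgImages(String html) {
        Matcher m = SVG_IMG.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps backslashes in Windows paths from being treated as escapes
            m.appendReplacement(sb, Matcher.quoteReplacement("<embed" + m.group(1) + ">"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"a.svg\" style=\"width:1in\"><img src=\"b.png\"></p>";
        // prints <p><embed src="a.svg" style="width:1in"><img src="b.png"></p>
        System.out.println(embedSvgImages(html));
    }
}
```

Tags for PNG, JPEG and other raster images are left as `<img>`, so only the converted formula images go through `<embed>`.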
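Outlook: for formulas in 2007-format .docx files, an ideal approach would be to extract them in MathML form and then parse that. MathML can be displayed directly in a web page, and a MathML formula pasted into Word is automatically turned into an equation. If you have good ideas on this, you are welcome to discuss them with me at whb3299065@126.com or leave me a comment.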
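As a possible starting point for reading formulas straight out of document.xml, here is a minimal sketch that uses only the JDK. It assumes the formulas were written with Word's built-in equation editor, which stores them as OMML (`m:oMath` elements) inside `word/document.xml`. The class name `DocxMathExtractor` and its methods are hypothetical. The sketch only locates the equations and returns their plain text; converting OMML to MathML is a separate step that is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the text of every m:oMath element in a .docx file.
class DocxMathExtractor {
    static final String OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math";

    static List<String> extractMathText(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"word/document.xml".equals(entry.getName())) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations, since the input file may be untrusted
                factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
                // Buffer the entry first so the parser cannot close the zip stream
                Document doc = factory.newDocumentBuilder()
                        .parse(new ByteArrayInputStream(readAll(zip)));
                NodeList maths = doc.getElementsByTagNameNS(OMML_NS, "oMath");
                for (int i = 0; i < maths.getLength(); i++) {
                    result.add(maths.item(i).getTextContent());
                }
                break;
            }
        }
        return result;
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    // Builds a tiny in-memory .docx containing one equation, for demonstration only.
    static byte[] sampleDocx() throws IOException {
        String xml = "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\""
                + " xmlns:m=\"" + OMML_NS + "\"><w:body><w:p>"
                + "<m:oMath><m:r><m:t>a+b</m:t></m:r></m:oMath>"
                + "</w:p></w:body></w:document>";
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write(xml.getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // prints [a+b]
        System.out.println(extractMathText(new ByteArrayInputStream(sampleDocx())));
    }
}
```

The plain text loses the formula's structure (fractions, superscripts and so on), so a real converter would walk the `m:oMath` subtree instead of calling `getTextContent()`.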