Java: reading equations from Word (Docx) into HTML and saving them to a database using Java...

I have a Word (docx) file that contains equations, as shown in the image below:

uLG65.png

I want to read the data from the Word (docx) file and save it to my database, so that I can later fetch it from the database and show it on my HTML page.

I used Apache POI to read the data from the docx file, but it cannot extract the equations.

Please help me!

Solution

Word *.docx files are ZIP archives containing XML files in the Office Open XML format. The formulas contained in Word *.docx documents are stored as Office Math Markup Language (OMML).
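To make the "ZIP archive of XML files" point concrete, here is a minimal sketch that lists the entries of such an archive using only `java.util.zip`. The class name `DocxZipSketch` and the in-memory sample archive are assumptions made for illustration; a real *.docx contains many more parts, with the document body typically in `word/document.xml`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxZipSketch {

    // Lists the names of all entries in a ZIP stream (a *.docx is such a stream).
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a tiny stand-in archive in memory (hypothetical content),
    // so the sketch runs without a real Word file.
    static byte[] buildSampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("[Content_Types].xml"));
            zip.write("<Types/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass a FileInputStream for "Formula.docx" instead.
        List<String> names = listEntries(new ByteArrayInputStream(buildSampleArchive()));
        names.forEach(System.out::println);
    }
}
```

Apache POI does this unpacking for you; the sketch only shows what is inside the container.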

Unfortunately, this XML format is not well known outside Microsoft Office, so it is not directly usable in HTML, for example. Fortunately, it is XML and can therefore be transformed with XSLT. So we can transform the OMML into MathML, which is usable in a much wider range of use cases.
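As a rough illustration of what an XSLT transformation looks like in Java, the following sketch uses the standard `javax.xml.transform` API with a toy stylesheet. The stylesheet, the `<num>` input element, and the class name `XsltSketch` are hypothetical stand-ins; the toy stylesheet is not OMML2MML.XSL and does not handle OMML at all.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltSketch {

    // Toy stylesheet (hypothetical, NOT OMML2MML.XSL):
    // rewrites a root element <num>..</num> as <mn>..</mn>.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:template match=\"/num\"><mn><xsl:value-of select=\".\"/></mn></xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the toy stylesheet to an XML string and returns the result as a string.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        // Leave out the <?xml ...?> header so the result can be embedded elsewhere.
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<num>42</num>"));
    }
}
```

The same three pieces (a stylesheet source, an input source, and a result) are all that is needed when the stylesheet is a real one loaded from a file.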

An XSLT transformation is driven by an XSL stylesheet that defines the transformation. Unfortunately, writing one is not easy either. Fortunately, Microsoft has already done that: if you have a current Microsoft Office installed, you can find the file OMML2MML.XSL in the Microsoft Office program directory in %ProgramFiles%\. If you don't find it there, search the web for it.

Knowing all this, we can get the OMML from the XWPFDocument, transform it into MathML, and then save that for later use.

My example stores the formulas it finds as MathML in an ArrayList of strings. You should be able to store these strings in your database as well.

The example needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025. This is because it uses CTOMath which is not shipped with the smaller poi-ooxml-schemas jar.

Word document:

X2i2Z.png

Java code:

import java.io.*;
import org.apache.poi.xwpf.usermodel.*;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMath;
import org.openxmlformats.schemas.officeDocument.x2006.math.CTOMathPara;
import org.w3c.dom.Node;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import java.awt.Desktop;
import java.util.List;
import java.util.ArrayList;

/*
needs the full ooxml-schemas-1.3.jar as mentioned in https://poi.apache.org/faq.html#faq-N10025
*/

public class WordReadFormulas {

    static File stylesheet = new File("OMML2MML.XSL");
    static TransformerFactory tFactory = TransformerFactory.newInstance();
    static StreamSource stylesource = new StreamSource(stylesheet);

    // transforms one OMML formula into a MathML string
    static String getMathML(CTOMath ctomath) throws Exception {
        Transformer transformer = tFactory.newTransformer(stylesource);

        Node node = ctomath.getDomNode();

        DOMSource source = new DOMSource(node);
        StringWriter stringwriter = new StringWriter();
        StreamResult result = new StreamResult(stringwriter);
        transformer.setOutputProperty("omit-xml-declaration", "yes");
        transformer.transform(source, result);

        String mathML = stringwriter.toString();
        stringwriter.close();

        // The native OMML2MML.XSL transforms OMML into MathML as XML with special namespaces.
        // We don't need those, since we want to use the MathML in HTML, not in XML.
        // Ideally we would change OMML2MML.XSL so that it does not emit them.
        // But to keep this example as simple as possible, we use replace to get rid of the XML specifics.
        mathML = mathML.replaceAll("xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        mathML = mathML.replaceAll("xmlns:mml", "xmlns");
        mathML = mathML.replaceAll("mml:", "");

        return mathML;
    }

    // adds the MathML of every formula in one paragraph to the list
    static void addFormulas(XWPFParagraph paragraph, List<String> mathMLList) throws Exception {
        CTP ctp = paragraph.getCTP();
        for (CTOMath ctomath : ctp.getOMathList()) {
            mathMLList.add(getMathML(ctomath));
        }
        for (CTOMathPara ctomathpara : ctp.getOMathParaList()) {
            for (CTOMath ctomath : ctomathpara.getOMathList()) {
                mathMLList.add(getMathML(ctomath));
            }
        }
    }

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("Formula.docx"));

        // storing the found MathML in an ArrayList of strings
        List<String> mathMLList = new ArrayList<String>();

        // getting the formulas out of all body elements
        for (IBodyElement ibodyelement : document.getBodyElements()) {
            if (ibodyelement.getElementType().equals(BodyElementType.PARAGRAPH)) {
                addFormulas((XWPFParagraph)ibodyelement, mathMLList);
            } else if (ibodyelement.getElementType().equals(BodyElementType.TABLE)) {
                XWPFTable table = (XWPFTable)ibodyelement;
                for (XWPFTableRow row : table.getRows()) {
                    for (XWPFTableCell cell : row.getTableCells()) {
                        for (XWPFParagraph paragraph : cell.getParagraphs()) {
                            addFormulas(paragraph, mathMLList);
                        }
                    }
                }
            }
        }

        document.close();

        // creating a sample HTML file
        String encoding = "UTF-8";
        FileOutputStream fos = new FileOutputStream("result.html");
        OutputStreamWriter writer = new OutputStreamWriter(fos, encoding);
        writer.write("<!DOCTYPE html>\n");
        writer.write("<html>");
        writer.write("<head>");
        writer.write("<meta charset=\"" + encoding + "\">");

        // The original wrote a MathJax <script> tag here to help all browsers interpret MathML.
        // The tag's attributes are missing from this copy, so add your own MathJax include if you need one.

        writer.write("</head>");
        writer.write("<body>");
        writer.write("<p>The following formulas were found in the Word document:</p>");

        int i = 1;
        for (String mathML : mathMLList) {
            writer.write("<p>Formula " + i++ + ":</p>");
            writer.write(mathML);
            writer.write("<p></p>");
        }

        writer.write("</body>");
        writer.write("</html>");
        writer.close();

        Desktop.getDesktop().browse(new File("result.html").toURI());
    }
}

Result:

dZ78M.png
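The namespace cleanup described in the code comments can be tried in isolation. The sketch below applies the same three replacements to a sample string; the sample is an assumption about the general shape of the stylesheet's output (an `mml:` prefix plus a leftover `xmlns:m` declaration), not captured output, and the class name `MathMlCleanupSketch` is hypothetical. It uses the literal `String.replace` rather than the regex-based `replaceAll`, which gives the same result for these fixed strings.

```java
class MathMlCleanupSketch {

    // Assumed shape of the transformer output: prefixed MathML plus an unused OMML namespace.
    static final String SAMPLE =
        "<mml:math xmlns:mml=\"http://www.w3.org/1998/Math/MathML\" "
      + "xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\">"
      + "<mml:mi>x</mml:mi></mml:math>";

    // Turns prefixed MathML into unprefixed MathML that can be pasted into HTML.
    static String stripPrefixes(String mathML) {
        // 1. Drop the OMML namespace declaration, which the MathML no longer uses.
        String s = mathML.replace(
            "xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\"", "");
        // 2. Make the MathML namespace the default namespace.
        s = s.replace("xmlns:mml", "xmlns");
        // 3. Remove the now undeclared "mml:" prefix from all element names.
        return s.replace("mml:", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPrefixes(SAMPLE));
    }
}
```

Plain string replacement is fragile: it would also rewrite any `mml:` that happened to appear in formula text. Editing the stylesheet, as the answer suggests, is the more robust route.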
