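Our business needs news items (text plus images) that were edited in Word documents to be entered into the CMS admin back end and published as news articles. Images cannot be copied and pasted directly into the UEditor editor, and uploading them one by one is tedious. So I built a feature that uploads a docx file, parses it, and returns the HTML body directly to the front-end editor for further editing.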
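Requirements:
1. Images must be saved to a specified location on the server, and the URL the front end uses to request each image must be set as the src of its img tag.
2. Images and text must stay in their original order.
3. Hyperlinks, other graphics, and similar elements that ordinary news articles do not use must be filtered out.
Implementation: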
1. Minimal Maven dependencies. Version 3.17 supports JDK 1.6 and above; version 4 requires JDK 1.8 or above.
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>3.17</version>
</dependency>
2. Code implementation
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.drawingml.x2006.main.CTGraphicalObject;
import org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture;
import org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDrawing;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText;

public class AnalyzeDocx {

    public static void main(String[] args) throws Exception {
        String content = analyzeDocx("e://abc.docx");
        System.out.println(content);
    }

    public static String analyzeDocx(String path) throws Exception {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new FileInputStream(path); XWPFDocument xwpfDocument = new XWPFDocument(in)) {
            List<XWPFParagraph> paragraphs = xwpfDocument.getParagraphs();
            for (XWPFParagraph xwpfParagraph : paragraphs) {
                List<XWPFRun> runs = xwpfParagraph.getRuns();
                for (XWPFRun xwpfRun : runs) {
                    CTR ctr = xwpfRun.getCTR();
                    if (ctr.xmlText().contains("w:type=\"textWrapping\"")) {
                        sb.append("<br>");// line break inside a paragraph; the rest of this run is skipped
                        continue;
                    }
                    // Walk the run's child elements in document order so text and images keep their sequence
                    XmlCursor newCursor = ctr.newCursor();
                    newCursor.selectPath("./*");
                    while (newCursor.toNextSelection()) {
                        XmlObject object = newCursor.getObject();
                        if (object instanceof CTText) {// text
                            CTText ctText = (CTText) object;
                            if (ctText.isSetSpace()) {
                                // Hyperlinks are not supported for now. Note that this check tests the
                                // xml:space attribute, so any text with preserved whitespace is skipped too.
                                continue;
                            }
                            String text = ctText.getStringValue();
                            if (text != null && text.length() > 0) {
                                sb.append(text);
                            }
                        } else if (object instanceof CTDrawing) {// inline picture
                            CTDrawing drawing = (CTDrawing) object;
                            CTInline[] inlineArray = drawing.getInlineArray();
                            for (CTInline ctInline : inlineArray) {
                                CTGraphicalObject graphic = ctInline.getGraphic();
                                XmlCursor newCursor2 = graphic.getGraphicData().newCursor();
                                newCursor2.selectPath("./*");
                                while (newCursor2.toNextSelection()) {
                                    XmlObject object2 = newCursor2.getObject();
                                    if (object2 instanceof CTPicture) {
                                        CTPicture picture = (CTPicture) object2;
                                        sb.append("<br>").append(
                                                imgHtml(xwpfDocument, picture.getBlipFill().getBlip().getEmbed()))
                                                .append("<br>");
                                    }
                                }
                                newCursor2.dispose();
                            }
                        }
                    }
                    newCursor.dispose();
                }
                sb.append("<br>");// paragraph break
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sb.toString();
    }

    private static String imgHtml(XWPFDocument xwpfDocument, String blipID) {
        XWPFPictureData pictureData = xwpfDocument.getPictureDataByID(blipID);
        String imageName = pictureData.getFileName();
        String newfilename = System.currentTimeMillis() + imageName;
        byte[] bytev = pictureData.getData();
        try (FileOutputStream fos = new FileOutputStream("E:/" + newfilename)) {
            fos.write(bytev);// after saving the image here, expose it over HTTP and wrap it in an <img> tag
        } catch (Exception e) {
            e.printStackTrace();
        }
        // Hard-coded for testing
        return "<img src='/rongmeitiapi/api/picture/find/image/20181107/d66ce5ffc18365a3dab1e46c484dfabb.jpeg'>";
    }
}
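The listing appends run text to the HTML as-is. If the Word text may contain characters such as < or &, escaping it first is probably safer before it reaches the editor. Below is a minimal sketch of such a helper. The class name HtmlText and the method escape are hypothetical and not part of the original code.

```java
// Hypothetical helper (an assumption, not part of the original listing).
final class HtmlText {

    private HtmlText() {
    }

    /** Replaces the characters that are significant in HTML text and attribute values. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                case '\'':
                    out.append("&#39;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

In analyzeDocx this would be used as sb.append(HtmlText.escape(text)) in place of sb.append(text).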
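The imgHtml method needs to rename the image, turn it into a link the front end can access, and then build the img tag from that link. Since this is only a test on my side, I hard-coded the img tag.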
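As a rough sketch of what the non-hard-coded version might look like: the helper below writes the image bytes under a new name into a storage directory and returns an img tag whose src is a URL prefix plus that name. The class ImageStore, its storageDir and urlPrefix parameters, and the use of a random UUID for the new name (instead of the timestamp prefix in the listing) are all assumptions made for illustration. How the directory is actually served over HTTP depends on your server setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.UUID;

// Hypothetical helper (an assumption, not part of the original listing).
final class ImageStore {

    private final Path storageDir;  // assumed: server directory where images are kept
    private final String urlPrefix; // assumed: URL prefix under which that directory is served

    ImageStore(Path storageDir, String urlPrefix) {
        this.storageDir = storageDir;
        this.urlPrefix = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
    }

    /** Saves the bytes under a fresh name and returns an img tag pointing at it. */
    String saveAndBuildImgTag(byte[] data, String originalName) throws IOException {
        Files.createDirectories(storageDir);

        // Keep only a simple lower-case extension such as ".png"; drop anything unusual.
        int dot = originalName.lastIndexOf('.');
        String ext = dot >= 0 ? originalName.substring(dot).toLowerCase(Locale.ROOT) : "";
        if (!ext.matches("\\.[a-z0-9]+")) {
            ext = "";
        }

        // A random name avoids collisions between uploads of files named image1.png etc.
        String newName = UUID.randomUUID().toString().replace("-", "") + ext;
        Files.write(storageDir.resolve(newName), data);

        return "<img src='" + urlPrefix + newName + "'>";
    }
}
```

Inside imgHtml, the call could then be store.saveAndBuildImgTag(pictureData.getData(), pictureData.getFileName()), where store is an ImageStore configured with your own directory and URL prefix.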
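Note: this only handles ordinary viewable images. EMF images are not handled, since news articles do not use them.
If you need to capture everything, see https://www.jb51.net/article/132091.htm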
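If you want the "ordinary images only" rule to be explicit rather than implicit, one option is to check the file name returned by pictureData.getFileName() before saving. The sketch below is an assumption on my part: the class WebImageFilter, the method isWebImage, and the particular list of extensions are hypothetical and should be adjusted to whatever your front end can display.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (an assumption, not part of the original listing).
final class WebImageFilter {

    // Assumed list of formats a browser-based editor can show directly.
    private static final Set<String> WEB_EXTENSIONS =
            new HashSet<String>(Arrays.asList("jpg", "jpeg", "png", "gif", "bmp"));

    private WebImageFilter() {
    }

    /** Returns true when the file name ends in one of the accepted extensions. */
    static boolean isWebImage(String fileName) {
        if (fileName == null) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return WEB_EXTENSIONS.contains(ext);
    }
}
```

In imgHtml, a name such as image2.emf would then fail the check, and the method could return an empty string instead of saving the file.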