Parsing docx with POI in Java to generate HTML

Latest recommended article published 2024-06-13 15:25:13

土豆番茄酱紫

Category: java  Tags: poi, docx parsing

Article link: https://blog.csdn.net/wyyrockking/article/details/83866565

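Our business needed news articles (text plus images) that had been edited in Word documents to be entered into the CMS admin backend and published as news items. Images cannot be copied and pasted directly into the UEditor editor, and uploading them one by one is tedious. So I built an upload feature: a docx file is uploaded and parsed, and the resulting HTML body is returned to the front-end editor for further editing.
Requirements:
1. Images must be saved to a specified location on the server, and the URL the front end uses to request each image must be put into the src attribute of its img tag.
2. Images and text must appear in their original order.
3. Hyperlinks, other shapes, and similar elements that ordinary news articles do not use are filtered out.
Implementation: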

1. Minimal Maven dependencies. Version 3.17 supports JDK 1.6 and later; version 4 requires JDK 1.8 or later.

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>3.17</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml-schemas</artifactId>
        <version>3.17</version>
    </dependency>

2. Code implementation

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.XmlCursor;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.drawingml.x2006.main.CTGraphicalObject;
import org.openxmlformats.schemas.drawingml.x2006.picture.CTPicture;
import org.openxmlformats.schemas.drawingml.x2006.wordprocessingDrawing.CTInline;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDrawing;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTText;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STBrType;

public class AnalyzeDocx {

	public static void main(String[] args) throws Exception {
		String content = analyzeDocx("e://abc.docx");
		System.out.println(content);
	}

	public static String analyzeDocx(String path) throws Exception {

		StringBuilder sb = new StringBuilder();
		try (InputStream in = new FileInputStream(path); XWPFDocument xwpfDocument = new XWPFDocument(in)) {
			List<XWPFParagraph> paragraphs = xwpfDocument.getParagraphs();
			for (XWPFParagraph xwpfParagraph : paragraphs) {
				List<XWPFRun> runs = xwpfParagraph.getRuns();
				for (XWPFRun xwpfRun : runs) {
					CTR ctr = xwpfRun.getCTR();
					// Walk the run's children in document order so text and images keep their sequence
					XmlCursor newCursor = ctr.newCursor();
					newCursor.selectPath("./*");
					while (newCursor.toNextSelection()) {
						XmlObject object = newCursor.getObject();
						if (object instanceof CTBr) {// line break inside a paragraph
							CTBr br = (CTBr) object;
							// A w:br with no type is a text-wrapping break; page and column breaks are ignored
							if (!br.isSetType() || br.getType() == STBrType.TEXT_WRAPPING) {
								sb.append("<br>");
							}
						} else if (object instanceof CTText) {// text
							// CTText is also the type of w:instrText (field codes such as HYPERLINK) and
							// w:delText. Only w:t holds visible text, so hyperlink targets are dropped here.
							if (!"t".equals(newCursor.getName().getLocalPart())) {
								continue;// hyperlinks are not supported for now
							}
							String text = ((CTText) object).getStringValue();
							if (text != null && text.length() > 0) {
								sb.append(text);
							}
						} else if (object instanceof CTDrawing) {// inline picture
							CTDrawing drawing = (CTDrawing) object;
							CTInline[] inlineArray = drawing.getInlineArray();
							for (CTInline ctInline : inlineArray) {
								CTGraphicalObject graphic = ctInline.getGraphic();
								XmlCursor newCursor2 = graphic.getGraphicData().newCursor();
								newCursor2.selectPath("./*");
								while (newCursor2.toNextSelection()) {
									XmlObject object2 = newCursor2.getObject();
									if (object2 instanceof CTPicture) {
										CTPicture picture = (CTPicture) object2;
										sb.append("<br>").append(
												imgHtml(xwpfDocument, picture.getBlipFill().getBlip().getEmbed()))
												.append("<br>");
									}
								}
								newCursor2.dispose();
							}
						}
					}
					newCursor.dispose();
				}
				sb.append("<br>");// paragraph break
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
		return sb.toString();
	}

	private static String imgHtml(XWPFDocument xwpfDocument, String blipID) {
		XWPFPictureData pictureData = xwpfDocument.getPictureDataByID(blipID);
		if (pictureData == null) {
			return "";// linked or missing image: nothing to save
		}
		String imageName = pictureData.getFileName();
		String newfilename = System.currentTimeMillis() + imageName;
		byte[] bytev = pictureData.getData();
		try (FileOutputStream fos = new FileOutputStream("E:/" + newfilename)) {
			fos.write(bytev);// after saving, expose the file over HTTP and wrap that URL in an <img> tag
		} catch (Exception e) {
			e.printStackTrace();
		}
		// Hard-coded for testing; replace with the real URL of the saved file
		return "<img src='/rongmeitiapi/api/picture/find/image/20181107/d66ce5ffc18365a3dab1e46c484dfabb.jpeg'>";
	}

}
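
The listing appends run text to the output as-is, so a `<` or `&` typed in the Word document would reach the editor as markup. Below is a minimal sketch of an escaping step that could be applied before `sb.append(text)`. `HtmlEscaper` and `escapeHtml` are hypothetical names, not part of the original code or of POI, and where to call the helper is an assumption.

```java
public class HtmlEscaper {

    // Hypothetical helper (not part of POI): escapes the characters that are
    // significant in HTML text and in quoted attribute values.
    public static String escapeHtml(String text) {
        if (text == null) {
            return "";
        }
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Example input containing characters that must not reach the editor raw
        System.out.println(escapeHtml("a < b && \"c\""));
    }
}
```

The `<br>` and `<img>` fragments built by the parser itself should not be passed through this helper, only the text taken from the document.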

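The imgHtml method needs to rename the image, turn it into a URL the front end can access, and then build the img tag from that URL. Since this was only a test on my side, I hard-coded the img tag.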
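
The renaming and tag-building step described above could look roughly like the sketch below, written without POI so that it runs on its own. `ImageTagBuilder`, `uniqueName`, `imgTag`, and the `/static/news` prefix are illustrative assumptions, not names from the original project. The sketch uses a random UUID for the stored name rather than the timestamp prefix used in the listing.

```java
import java.util.Locale;
import java.util.UUID;

public class ImageTagBuilder {

    // Hypothetical helper: keeps the original extension (lower-cased) and
    // replaces the base name with a random UUID, so two documents that both
    // contain "image1.jpeg" do not overwrite each other on the server.
    public static String uniqueName(String originalName) {
        int dot = originalName.lastIndexOf('.');
        String ext = dot >= 0 ? originalName.substring(dot).toLowerCase(Locale.ROOT) : "";
        return UUID.randomUUID().toString().replace("-", "") + ext;
    }

    // Hypothetical helper: joins the URL prefix and the stored file name into an <img> tag.
    public static String imgTag(String urlPrefix, String storedName) {
        String prefix = urlPrefix.endsWith("/") ? urlPrefix : urlPrefix + "/";
        return "<img src='" + prefix + storedName + "'>";
    }

    public static void main(String[] args) {
        String stored = uniqueName("image1.JPEG");
        // The prefix is an assumed URL path served by the application
        System.out.println(imgTag("/static/news", stored));
    }
}
```

In imgHtml, the stored name would be used both for the FileOutputStream path and for the returned tag, so the file on disk and the URL stay in step.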
Note: this only handles ordinary viewable images. Images of type emf are not handled, since news articles do not use them.
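
If an explicit check is wanted before saving a picture, one option is to decide from the file name that `pictureData.getFileName()` returns. The sketch below assumes that approach. `WebImageFilter`, `isWebImage`, and the extension whitelist are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class WebImageFilter {

    // Assumed whitelist of formats a browser can display directly;
    // adjust it to whatever the CMS actually accepts.
    private static final Set<String> WEB_EXTENSIONS =
            new HashSet<String>(Arrays.asList("jpg", "jpeg", "png", "gif", "bmp"));

    // Returns true when the file name ends in a whitelisted extension.
    // Names without an extension, and formats such as .emf or .wmf, are rejected.
    public static boolean isWebImage(String fileName) {
        if (fileName == null) {
            return false;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return WEB_EXTENSIONS.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(isWebImage("image1.jpeg"));
        System.out.println(isWebImage("image2.emf"));
    }
}
```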
To capture every kind of image, see https://www.jb51.net/article/132091.htm
