Cleanly extracting text and tables from Word (doc, docx) as structured data in Java

Latest recommended article published 2025-05-18 22:44:32

小游甲鱼

Category: Word structured extraction. Tags: poi java msword excel

Article link: https://blog.csdn.net/sinat_36219677/article/details/106457708

Structured extraction of Word text and tables with Java and POI

Purpose
- Benefits
Overview and dependencies
Getting started
- Extraction

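Purpose

Data that lives in Word documents often needs to be extracted into structured form.

Benefits

Once the data is stored in a database, it is freed from Word's messy, hand-edited format, which makes follow-up work such as large-scale data analysis or preparing AI datasets much easier.
The data can be displayed, edited and saved again in web pages and other media, with a presentation you control, without depending on a Word client component.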

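Overview and dependencies

Word files come in two formats, docx and doc. A doc file is a binary stream (an OLE2 compound file) and is hard to read directly. A docx file is a ZIP package of XML files and is much more readable.
To parse Word with the full POI stack, add the following Maven dependencies: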

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>1.3</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>3.17</version>
</dependency>
<dependency>
    <groupId>org.apache.xmlbeans</groupId>
    <artifactId>xmlbeans</artifactId>
    <version>2.6.0</version>
</dependency>

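Trust me, this set of dependencies works and it is complete; you do not even need to change the version numbers. If you do change them, mind the version relationships between the artifacts. You can look them up here:
Maven Repository: org.apache.poi » poi
On that page, click a version number and scroll down to see the dependency versions it was built against, so that your own combination has no dependency conflicts.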

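Getting started

In POI, doc and docx are handled by completely different classes whose internal logic also differs, so each format needs its own implementation.
Main class for docx: XWPFDocument (the text-only extractor XWPFWordExtractor covers just a subset of what it offers, so it is not needed here); see the related specification
Main class for doc: HWPFDocument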
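
Because the two formats need different classes, the caller first has to know which one it is dealing with. File extensions can lie, so one option is to look at the first bytes of the file. The sketch below is only an illustration: the class WordFormatSniffer and its Format enum are hypothetical names, and the check is deliberately coarse (any OLE2 file looks like DOC, any ZIP file looks like DOCX).

```java
/** Hypothetical helper: guesses the Word format from the leading bytes of a file. */
class WordFormatSniffer {

    enum Format { DOC, DOCX, UNKNOWN }

    // OLE2 compound file signature, used by legacy .doc (and also .xls/.ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature; .docx is a ZIP package (as are .xlsx/.pptx).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies the given leading bytes; only a coarse container check. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Format.DOC;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Format.DOCX;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] head, int[] magic) {
        if (head == null || head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((head[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipHead));
    }
}
```

With a result of DOC the stream would then go to HWPFDocument, and with DOCX to XWPFDocument.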

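Extraction

Core idea:

Scan the whole Word document from top to bottom and handle text and tables separately. Advantages:

The order of text and tables is recorded, whereas extraction methods found elsewhere often lose the order in which text and tables appear on the page.
Other methods often pull the text inside tables out together with the body text; here you can choose freely whether table text is included.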
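
To make the idea concrete without any POI types, here is a minimal sketch of a top-to-bottom scan over a list of body blocks. Block and collectText are hypothetical stand-ins for the body elements and the extraction loop, and the <tb> wrapper simply mirrors the marker used in the commented-out code later in this post.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: an ordered scan that treats paragraphs and tables differently. */
class OrderedScanSketch {

    /** Hypothetical stand-in for a body element: either a paragraph or a table. */
    static final class Block {
        final boolean table;
        final String text;

        Block(boolean table, String text) {
            this.table = table;
            this.text = text;
        }
    }

    /** Walks the blocks in document order; table text is included only on request. */
    static String collectText(List<Block> blocks, boolean includeTableText) {
        StringBuilder sb = new StringBuilder();
        for (Block block : blocks) {
            if (!block.table) {
                sb.append(block.text).append("\n");
            } else if (includeTableText) {
                sb.append("<tb>").append(block.text).append("</tb>\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = Arrays.asList(
                new Block(false, "intro"),
                new Block(true, "a1b1"),
                new Block(false, "outro"));
        System.out.print(collectText(blocks, true));
    }
}
```

Because the scan is a single pass over one ordered list, the position of each table relative to the surrounding paragraphs is preserved whether or not its text is emitted.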

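Constant definitions

/**
 * Default height of a Word table row
 */
private static final int DEFAULT_HEIGHT = 500;

/**
 * Default width of a Word table cell
 */
private static final int DEFAULT_WIDTH = 1000;

/**
 * Scaling divisor for Word table sizes; defaults to 1 and can be adjusted as needed
 */
private static final int DEFAULT_DIV = 1;

/**
 * The font size is not extracted yet, so it defaults to 12
 */
private static final Float DEFAULT_FONT_SIZE = 12.0F;

/**
 * Blank characters used by Word (no-break space, full-width space, plain space, backspace)
 * and the tab character. No '|' inside the character class: there it would match a literal pipe.
 */
private static final String WORD_BLANK = "[\u00a0\u3000\u0020\b\t]";

/**
 * Word's own line-break characters, to be replaced by the ordinary string line break
 */
private static final String WORD_LINE_BREAK = "[\u000B\r]";

/**
 * Line breaks and whitespace inside Word tables
 */
private static final String WORD_TABLE_FILTER = "\\s+";

/**
 * Tolerance used when computing row and column information for a table
 */
private static final Float TABLE_EXCURSION = 5F;

/**
 * Characters that are stripped when extracting text
 */
private static final String splitter = "[\\s\u00a0]";

private static final String regexClearBeginBlank = "^" + splitter + "*|" + splitter + "*$";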

Structured JavaBean classes:

WordTableCell class:

import lombok.Data;

@Data
public class WordTableCell {

    private Float x;

    private Float y;

    private Float width;

    private Float height;

    private String text;

    /**
     * Defaults to 12
     */
    private Float fontSize;

    /**
     * Row index, starting from 0
     */
    private Integer row;

    /**
     * Column index, starting from 0
     */
    private Integer col;

    /**
     * Row span, starting from 1
     */
    private Integer rowspan;

    /**
     * Column span, starting from 1
     */
    private Integer colspan;
}

WordTable class:

import java.util.List;

import lombok.Data;

@Data
public class WordTable {

    private List<WordTableCell> wordTableCellList;

    private Float width;

    private Float height;
}

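WordContent class (holds the text and the table structures extracted from a Word file):

import java.util.List;

import lombok.Data;

@Data
public class WordContent {

    /**
     * Paragraph text (table text is excluded, though including it is an easy change)
     */
    private String text;

    /**
     * The extracted table objects
     */
    private List<WordTable> wordTableList;
}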

docx core method walkthrough:

Overview:
Every docx contains a main XML part (word/document.xml). An excerpt of a table in it looks like this:

					<w:tbl>
						<w:tblPr>
							<w:tblStyle w:val="3"/>
							<w:tblW w:type="auto" w:w="0"/>
							<w:jc w:val="center"/>
							<w:tblBorders>
								<w:top w:color="auto" w:space="0" w:sz="12" w:val="single"/>
								<w:left w:color="auto" w:space="0" w:sz="12" w:val="single"/>
								<w:bottom w:color="auto" w:space="0" w:sz="12" w:val="single"/>
								<w:right w:color="auto" w:space="0" w:sz="12" w:val="single"/>
								<w:insideH w:color="auto" w:space="0" w:sz="4" w:val="single"/>
								<w:insideV w:color="auto" w:space="0" w:sz="4" w:val="single"/>
							</w:tblBorders>
							<w:tblLayout w:type="fixed"/>
							<w:tblCellMar>
								<w:top w:type="dxa" w:w="0"/>
								<w:left w:type="dxa" w:w="108"/>
								<w:bottom w:type="dxa" w:w="0"/>
								<w:right w:type="dxa" w:w="108"/>
							</w:tblCellMar>
						</w:tblPr>
						<w:tblGrid>
							<w:gridCol w:w="1185"/>
							<w:gridCol w:w="1664"/>
							<w:gridCol w:w="1336"/>
							<w:gridCol w:w="1364"/>
							<w:gridCol w:w="1816"/>
							<w:gridCol w:w="1380"/>
						</w:tblGrid>
						<w:tr>
							<w:tblPrEx>
								<w:tblBorders>
									<w:top w:color="auto" w:space="0" w:sz="12" w:val="single"/>
									<w:left w:color="auto" w:space="0" w:sz="12" w:val="single"/>
									<w:bottom w:color="auto" w:space="0" w:sz="12" w:val="single"/>
									<w:right w:color="auto" w:space="0" w:sz="12" w:val="single"/>
									<w:insideH w:color="auto" w:space="0" w:sz="4" w:val="single"/>
									<w:insideV w:color="auto" w:space="0" w:sz="4" w:val="single"/>
								</w:tblBorders>
							</w:tblPrEx>
							<w:trPr>
								<w:trHeight w:hRule="atLeast" w:val="630"/>
								<w:jc w:val="center"/>
							</w:trPr>
							<w:tc>
								<w:tcPr>
									<w:tcW w:type="dxa" w:w="1185"/>
									<w:vMerge w:val="restart"/>
									<w:shd w:color="auto" w:fill="C0C0C0" w:val="clear"/>
									<w:vAlign w:val="center"/>
								</w:tcPr>
								<w:p>
									<w:r>
										<w:t>申请人</w:t>
									</w:r>
									<w:r>
										<w:br w:type="textWrapping"/>
									</w:r>
									<w:bookmarkStart w:id="0" w:name="_GoBack"/>
									<w:r>
										<w:t>信用等级</w:t>
									</w:r>
									<w:bookmarkEnd w:id="0"/>
								</w:p>
							</w:tc>
						</w:tr>
					</w:tbl>

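The tags roughly mean the following:

<w:tbl>: start of a table
<w:tblPr>: table property definitions
<w:tblGrid>: the table's column grid; it lists the width of each smallest grid column from left to right. If a cell spans two grid columns, the cell declares <w:gridSpan w:val="2"/>
<w:trPr>: property definitions for a table row
<w:tc>: a table cell; its properties are defined in the nested <w:tcPr>
<w:vMerge w:val="restart"/>: the cell is vertically merged and is the first cell of the merge
<w:vMerge w:val="continue"/>: the cell is vertically merged but is not the first cell of the merge (an omitted w:val also means continue)
<w:p>: a paragraph in Word
<w:r>: a run, that is, a block of text in a paragraph with uniform formatting
We use the methods in the POI packages to parse these XML elements, extract the structure of the file, and fill in the required structural information from the position data.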
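
To see what these tags carry before bringing POI into the picture, the fragment can also be read with the JDK's own DOM parser. The sketch below is not how the rest of this post works (that uses POI); the class TblXmlSketch and its two methods are hypothetical, and it assumes the fragment declares the standard WordprocessingML namespace for the w prefix.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative only: reads a w:tbl fragment with the JDK DOM parser instead of POI. */
class TblXmlSketch {

    // Main WordprocessingML namespace bound to the "w" prefix.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed table XML", e);
        }
    }

    /** Widths of the grid columns, left to right, from w:tblGrid/w:gridCol. */
    static List<Integer> gridWidths(String tblXml) {
        NodeList cols = parse(tblXml).getElementsByTagNameNS(W_NS, "gridCol");
        List<Integer> widths = new ArrayList<>();
        for (int i = 0; i < cols.getLength(); i++) {
            widths.add(Integer.parseInt(((Element) cols.item(i)).getAttributeNS(W_NS, "w")));
        }
        return widths;
    }

    /** "restart", "continue" or "none" for every w:tc, in document order. */
    static List<String> vMergeStates(String tblXml) {
        NodeList cells = parse(tblXml).getElementsByTagNameNS(W_NS, "tc");
        List<String> states = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            NodeList vMerge = ((Element) cells.item(i)).getElementsByTagNameNS(W_NS, "vMerge");
            if (vMerge.getLength() == 0) {
                states.add("none");
            } else {
                String val = ((Element) vMerge.item(0)).getAttributeNS(W_NS, "val");
                // An omitted w:val means "continue".
                states.add(val.isEmpty() ? "continue" : val);
            }
        }
        return states;
    }

    public static void main(String[] args) {
        String xml = "<w:tbl xmlns:w=\"" + W_NS + "\">"
                + "<w:tblGrid><w:gridCol w:w=\"1185\"/><w:gridCol w:w=\"1664\"/></w:tblGrid>"
                + "<w:tr><w:tc><w:tcPr><w:vMerge w:val=\"restart\"/></w:tcPr></w:tc><w:tc/></w:tr>"
                + "</w:tbl>";
        System.out.println(gridWidths(xml) + " " + vMergeStates(xml));
    }
}
```

This sketch ignores nested tables, so it is only meant for inspecting small fragments like the one above.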

Extracting text elements

// Read the text part of the docx
StringBuilder docxText = new StringBuilder();
Iterator<IBodyElement> iter = docx.getBodyElementsIterator();
int count = 0;
while (iter.hasNext()) {
    IBodyElement element = iter.next();
    if (element instanceof XWPFParagraph) {
        // Paragraph element
        XWPFParagraph paragraph = (XWPFParagraph) element;
        String text = paragraph.getText();
        if (StringUtils.isBlank(text)) {
            continue;
        }
        // Convert Word-specific characters into ordinary line breaks and spaces
        String textWithSameBlankAndBreak = text.replaceAll(WORD_BLANK, " ").replaceAll(WORD_LINE_BREAK, "\n")
                .replaceAll("\n+", "\n");
        // Trim leading and trailing whitespace, including Word's no-break spaces
        String textClearBeginBlank = textWithSameBlankAndBreak.replaceAll(regexClearBeginBlank, "");
        // Append \n after each extracted paragraph as a line-break marker
        docxText.append(textClearBeginBlank).append("\n");
    } else if (element instanceof XWPFTable) {
        try {
            // Raw table text; not added to the text by default, uncomment to include it
            /*String text = originTableTextList.get(count);
            docxText.append(text);*/
            count++;
        } catch (Exception e) {
            log.error("Extracted docx table data does not match the table position");
        }
    }
}
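
The character clean-up in the paragraph branch can be tried out on its own. The sketch below is a standalone restatement under my own assumptions, not the post's exact code: WordTextNormalizer and normalize are hypothetical names, and the patterns express the same intent as the constants without '|' inside the character classes.

```java
/** Illustrative only: the paragraph clean-up steps as one standalone method. */
class WordTextNormalizer {

    // No-break space, full-width space, plain space, backspace and tab.
    private static final String WORD_BLANK = "[\u00a0\u3000\u0020\b\t]";
    // Vertical tab (Word's manual line break) and carriage return.
    private static final String WORD_LINE_BREAK = "[\u000B\r]";
    // Leading or trailing whitespace, including no-break spaces.
    private static final String TRIM_EDGES = "^[\\s\u00a0]*|[\\s\u00a0]*$";

    static String normalize(String raw) {
        return raw.replaceAll(WORD_BLANK, " ")        // unify blanks to a plain space
                .replaceAll(WORD_LINE_BREAK, "\n")    // unify line breaks
                .replaceAll("\n+", "\n")              // collapse repeated line breaks
                .replaceAll(TRIM_EDGES, "");          // trim both ends
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u3000\u3000Title\u000Bsecond\tline\r"));
    }
}
```

The order of the steps matters: blanks are unified first, so a tab or full-width space at the start of a paragraph is already a plain space by the time the edges are trimmed.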

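Extracting table elements
Two methods are used to work out cell widths and heights in docx tables, and the table grid method is preferred:

Table grid method: based on <w:tblGrid> in the [related specification], which is what the code reads through getTblGrid()
Cell method: based on <w:tcPr> in the [related specification]
span: a cell that spans several grid cells; there are two kinds, rowspan and colspan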
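
The arithmetic behind the table grid method is simple enough to show in isolation: a cell's width is the sum of the grid columns it covers, and its x offset is the sum of everything to its left. The sketch below uses plain arrays instead of POI objects; GridSpanLayout and layoutRow are hypothetical names, and the bounds check on the grid index is my own addition.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: cell offsets and widths from grid column widths and gridSpan values. */
class GridSpanLayout {

    /**
     * @param gridWidths widths of the grid columns, left to right
     * @param spans      gridSpan of each cell in one row (1 for an unmerged cell)
     * @return one {x, width} pair per cell
     */
    static List<int[]> layoutRow(int[] gridWidths, int[] spans) {
        List<int[]> cells = new ArrayList<>();
        int x = 0;
        int gridIndex = 0;
        for (int span : spans) {
            int width = 0;
            // Sum the grid columns covered by this cell, without running past the grid.
            for (int k = 0; k < span && gridIndex + k < gridWidths.length; k++) {
                width += gridWidths[gridIndex + k];
            }
            cells.add(new int[] {x, width});
            x += width;
            gridIndex += span;
        }
        return cells;
    }

    public static void main(String[] args) {
        int[] grid = {1185, 1664, 1336, 1364, 1816, 1380};
        for (int[] cell : layoutRow(grid, new int[] {1, 2, 3})) {
            System.out.println("x=" + cell[0] + " width=" + cell[1]);
        }
    }
}
```

The grid widths in main are the six gridCol values from the XML excerpt earlier in this post; the spans are made up for the example.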

List<WordTable> allWordTableCellList = new ArrayList<>();
Iterator<XWPFTable> it = docx.getTablesIterator();
// Collects the raw text of each table
List<String> originTableTextList = new ArrayList<>();
while (it.hasNext()) {
    try {
        XWPFTable table = it.next();
        WordTable wordTable = new WordTable();
        List<WordTableCell> wordTableCellList = new ArrayList<>();
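        // By default the top-left corner of every table is at (0, 0)
        float x = 0.0f;
        float y = 0.0f;
        // tblGrid records the table's column grid; when it exists it gives accurate cell widths,
        // but it may be missing, in which case the else branch falls back to the cell method
        boolean isTblGridExist = true;
        // One way to compute the width: the table grid method
        List<CTTblGridCol> tableGridColList = null;
        try {
            // Try to read the table grid information
            tableGridColList = table.getCTTbl().getTblGrid().getGridColList();
        } catch (Exception e) {
            log.info("This docx table has no tblGrid");
            isTblGridExist = false;
        }
        // Use the table grid method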
        if (isTblGridExist) {
            for (int i = 0; i < table.getNumberOfRows(); i++) {
                int colNums = table.getRow(i).getTableCells().size();
                int currentRowHeight = getDocxRowHeight(table, i) / DEFAULT_DIV;
                for (int j = 0, minCellNums = 0; j < colNums; j++) {
                    XWPFTableCell cell = table.getRow(i).getCell(j);
                    int spanNumber = 1;
                    // gridSpan is the colspan of this cell
                    BigInteger girdSpanBigInteger;
                    try {
                        girdSpanBigInteger = cell.getCTTc().getTcPr().getGridSpan().getVal();
                    } catch (Exception e) {
                        girdSpanBigInteger = null;
                    }
                    if (girdSpanBigInteger != null) {
                        spanNumber = girdSpanBigInteger.intValue();
                    }
                    int widthByGrid = 0;
                    for (int k = 0; k < spanNumber; k++) {
                        widthByGrid += tableGridColList.get(minCellNums + k).getW().intValue();
                    }
                    int width = widthByGrid / DEFAULT_DIV;
                    minCellNums += spanNumber;

                    if (!docxIsContinue(cell)) {
                        int height = this.getDocxCellHeight(table, currentRowHeight, i, j);
                        WordTableCell wordTableCell = this
                                .buildWordCellContent((float) height, (float) width, cell.getText(),
                                        DEFAULT_FONT_SIZE, x, y);
                        wordTableCellList.add(wordTableCell);
                    }
                    x += width;
                }
                if (i + 1 == table.getNumberOfRows()) {
                    // y is still the top of the last row here, so add that row's height
                    wordTable.setHeight(y + currentRowHeight);
                    wordTable.setWidth(x);
                }
                x = 0.0f;
                y += currentRowHeight;
            }
        } else {
            // The other way to get the width: the cell method
            for (int i = 0; i < table.getNumberOfRows(); i++) {
                int colNums = table.getRow(i).getTableCells().size();
                int currentRowHeight = getDocxRowHeight(table, i) / DEFAULT_DIV;
                for (int j = 0; j < colNums; j++) {
                    XWPFTableCell cell = table.getRow(i).getCell(j);
                    int width = getDocxCellWidth(table, i, j) / DEFAULT_DIV;
                    if (width <= 0) {
                        // tableGridMethod = true;
                        width = DEFAULT_WIDTH;
                    }
                    if (!docxIsContinue(cell)) {
                        int height = this.getDocxCellHeight(table, currentRowHeight, i, j);
                        WordTableCell wordTableCell = this
                                .buildWordCellContent((float) height, (float) width, cell.getText(),
                                        DEFAULT_FONT_SIZE, x, y);
                        wordTableCellList.add(wordTableCell);
                    }
                    x += width;
                }
                if (i + 1 == table.getNumberOfRows()) {
                    // y is still the top of the last row here, so add that row's height
                    wordTable.setHeight(y + currentRowHeight);
                    wordTable.setWidth(x);
                }
                x = 0.0f;
                y += currentRowHeight;
            }
        }

        wordTable.setWordTableCellList(wordTableCellList);
        allWordTableCellList.add(wordTable);
        // The code below adds the table text to the extracted text
        /* 
        String originTableText = "<tb>\n" + table.getText().replaceAll(WORD_TABLE_FILTER, "") + "</tb>\n";
        originTableTextList.add(originTableText);
        */
    } catch (Exception e) {
        log.error("Failed to parse docx table", e);
    }
}
// Fill in row and column information for each table
allWordTableCellList.forEach(this::fillSpan);
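
The fillSpan method called above is not shown in this post. As a rough idea of how row, col, rowspan and colspan could be derived from the x, y, width and height values, here is a minimal sketch. Everything in it is hypothetical: the class GeometrySpanSketch, its Cell type, and the tolerance parameter (which plays the role of TABLE_EXCURSION); the real implementation in the repository may work differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative only: derives row/col indices and spans from cell geometry. */
class GeometrySpanSketch {

    /** Hypothetical, simplified stand-in for WordTableCell. */
    static final class Cell {
        final float x, y, width, height;
        int row, col, rowspan, colspan;

        Cell(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    static void fillSpan(List<Cell> cells, float tol) {
        List<Float> lefts = new ArrayList<>();
        List<Float> tops = new ArrayList<>();
        for (Cell c : cells) {
            lefts.add(c.x);
            tops.add(c.y);
        }
        List<Float> colEdges = distinctEdges(lefts, tol);
        List<Float> rowEdges = distinctEdges(tops, tol);
        for (Cell c : cells) {
            c.col = indexOfEdge(colEdges, c.x, tol);
            c.row = indexOfEdge(rowEdges, c.y, tol);
            c.colspan = countEdges(colEdges, c.x, c.width, tol);
            c.rowspan = countEdges(rowEdges, c.y, c.height, tol);
        }
    }

    /** Sorted edges, merging values that are within the tolerance of each other. */
    private static List<Float> distinctEdges(List<Float> values, float tol) {
        List<Float> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Float> edges = new ArrayList<>();
        for (Float v : sorted) {
            if (edges.isEmpty() || v - edges.get(edges.size() - 1) > tol) {
                edges.add(v);
            }
        }
        return edges;
    }

    private static int indexOfEdge(List<Float> edges, float value, float tol) {
        for (int i = 0; i < edges.size(); i++) {
            if (Math.abs(edges.get(i) - value) <= tol) {
                return i;
            }
        }
        return -1;
    }

    /** Number of edges that fall inside [start, start + length), with tolerance. */
    private static int countEdges(List<Float> edges, float start, float length, float tol) {
        int n = 0;
        for (float edge : edges) {
            if (edge >= start - tol && edge < start + length - tol) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        Cell merged = new Cell(0, 0, 100, 100);
        List<Cell> cells = Arrays.asList(merged, new Cell(100, 0, 200, 50),
                new Cell(100, 50, 100, 50), new Cell(200, 50, 100, 50));
        fillSpan(cells, 5f);
        System.out.println("rowspan=" + merged.rowspan + " colspan=" + merged.colspan);
    }
}
```

One limitation of this approach: a span is only counted correctly when some other cell starts at each boundary the merged cell crosses, so a column that no cell starts in cannot be detected from geometry alone.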

Extracting text elements from doc

// Start extracting the text of the doc file
// (range is assumed to be the document range, e.g. from HWPFDocument#getRange())
StringBuilder docText = new StringBuilder();
for (int i = 0; i < range.numParagraphs(); i++) {
    Paragraph paragraph = range.getParagraph(i);
    // Take the paragraph text that is not inside a table
    if (!paragraph.isInTable()) {
        String text = paragraph.text();
        if (StringUtils.isBlank(text)) {
            continue;
        }
        String textWithSameBlankAndBreak = text.replaceAll(WORD_BLANK, " ").replaceAll(WORD_LINE_BREAK, "\n");
        String clearBeginBlank = textWithSameBlankAndBreak.replaceAll(regexClearBeginBlank, "");
        docText.append(clearBeginBlank).append("\n");
    } else {
        try {
            // Find where the table ends; default to the end of the range so the outer loop
            // still advances when the table is the last thing in the document
            int endIndex = range.numParagraphs();
            // Collect the text inside the table (starts empty so paragraph i is not added twice)
            StringBuilder tableOriginText = new StringBuilder();
            for (int index = i; index < range.numParagraphs(); index++) {
                Paragraph tableParagraph = range.getParagraph(index);
                if (!tableParagraph.isInTable() || tableParagraph.getTableLevel() < 1) {
                    endIndex = index;
                    break;
                }
                tableOriginText.append(tableParagraph.text());
            }
            i = endIndex - 1;
            // Strip all invisible characters from the table text
            String tableOriginTextWithoutBlank = tableOriginText.toString().replaceAll(WORD_TABLE_FILTER, "");
            // Table text is not added by default
            // docText.append("<tb>").append(tableOriginTextWithoutBlank).append("</tb>").append("\n");
        } catch (Exception e) {
            log.error("Extracted doc table data does not match the table position");
        }
    }
}

private int getDocCellToLeftWidth(Table table, int row, int col) {
    int leftWidth = 0;
    for (int i = 0; i < col; i++) {
        leftWidth += getDocCellWidth(table, row, i);
    }
    return leftWidth;
}

private int getDocCellWidth(Table table, int row, int col) {
    int width = table.getRow(row).getCell(col).getWidth() / DEFAULT_DIV;
    if (width < 0) {
        width = Math.abs(width);
        log.info("doc cell width is negative, using its absolute value");
    }
    return width == 0 ? DEFAULT_WIDTH : width;
}

private int getDocRowHeight(Table table, int row) {
    int height = table.getRow(row).getRowHeight();
    if (height < 0) {
        log.info("doc row height is negative, using its absolute value");
        height = Math.abs(height);
    }
    return height == 0 ? DEFAULT_HEIGHT : height;
}

/**
 * Only called for a restart cell; checks whether the cells below continue the merge
 * and adds up their row heights
 */
private int getDocContinueRowHeight(Table table, int row, int col, int rowHeight) {
    int nextRow = row + 1;
    if (nextRow >= table.numRows()) {
        return rowHeight;
    }
    int nextRowHeight = getDocRowHeight(table, nextRow) / DEFAULT_DIV;
    int nextColNums = table.getRow(nextRow).numCells();
    for (int j = 0; j < nextColNums; j++) {
        TableCell nextRowCell = table.getRow(nextRow).getCell(j);
        if (docIsContinue(nextRowCell) && getDocCellWidth(table, nextRow, j) == getDocCellWidth(table, row, col)
                && getDocCellToLeftWidth(table, nextRow, j) == getDocCellToLeftWidth(table, row, col)) {
            rowHeight += nextRowHeight;
            return getDocContinueRowHeight(table, nextRow, j, rowHeight);
        }
    }
    return rowHeight;
}

/**
 * Whether the cell is part of a vertical merge but is not its first cell
 */
private boolean docIsContinue(TableCell cell) {
    return cell.isVerticallyMerged() && !cell.isFirstVerticallyMerged();
}

/**
 * Whether the cell is the first cell of a vertical merge
 */
private boolean docIsRestart(TableCell cell) {
    return cell.isFirstVerticallyMerged();
}
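
The height calculation for vertically merged cells in getDocContinueRowHeight can be hard to follow because it recurses over POI objects. The sketch below restates the same matching rule (a cell in the next row continues the merge when it is marked as continue and has the same left offset and width) as a loop over plain data. MergedHeightSketch, Cell and Merge are hypothetical names used only for this illustration.

```java
/** Illustrative only: total height of a vertically merged cell, using plain data. */
class MergedHeightSketch {

    enum Merge { NONE, RESTART, CONTINUE }

    /** Hypothetical, simplified cell: left offset, width and merge state. */
    static final class Cell {
        final int left;
        final int width;
        final Merge merge;

        Cell(int left, int width, Merge merge) {
            this.left = left;
            this.width = width;
            this.merge = merge;
        }
    }

    /** Adds the heights of all following rows whose matching cell continues the merge. */
    static int mergedHeight(Cell[][] rows, int[] rowHeights, int row, int col) {
        Cell start = rows[row][col];
        int height = rowHeights[row];
        for (int next = row + 1; next < rows.length; next++) {
            boolean continued = false;
            for (Cell candidate : rows[next]) {
                // Cell indices can differ between rows, so match on geometry instead.
                if (candidate.merge == Merge.CONTINUE
                        && candidate.left == start.left
                        && candidate.width == start.width) {
                    continued = true;
                    break;
                }
            }
            if (!continued) {
                break;
            }
            height += rowHeights[next];
        }
        return height;
    }

    public static void main(String[] args) {
        Cell[][] rows = {
            {new Cell(0, 100, Merge.RESTART), new Cell(100, 200, Merge.NONE)},
            {new Cell(0, 100, Merge.CONTINUE), new Cell(100, 100, Merge.NONE), new Cell(200, 100, Merge.NONE)},
            {new Cell(0, 100, Merge.NONE), new Cell(100, 200, Merge.NONE)}
        };
        System.out.println(mergedHeight(rows, new int[] {630, 500, 400}, 0, 0));
    }
}
```

Matching on left offset and width rather than on the column index is what lets this work when rows have different numbers of cells.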

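I will stop here for now and decide whether to keep writing depending on whether people actually need structured extraction from Word.
The code has been uploaded to GitHub; if it helps you, a star would be appreciated. Questions are welcome in the comments or by private message, and I will answer when I have time.
https://github.com/boyonger/word-extractor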