Java PDF: Recognizing Table Content and Preserving Spaces

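This article shows how to use the Java PDFBox library, in particular the PDFLayoutTextStripper class, to recognize table content in a PDF file while preserving the spacing between columns. By adding a specific version of this third-party library, you can parse a PDF and print its text content.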
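
To make the idea of "preserving spaces" concrete, here is a minimal sketch of the layout technique in plain Java. It is a simplified illustration, not the library's actual code: the class `LayoutSketch` and the method `layoutLine` are hypothetical names, and the only thing borrowed from the listing in this article is the constant `OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT = 4`. The assumption is that each text fragment's horizontal position (in points) is divided by that width to get an output column, and the gap is filled with spaces.

```java
class LayoutSketch {
    // Same value as the constant in PDFLayoutTextStripper: 4 pt per output character.
    static final int OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT = 4;

    /**
     * Hypothetical helper: places each fragment at column floor(x / 4),
     * padding with spaces so that column gaps survive in the text output.
     */
    static String layoutLine(double[] xs, String[] fragments) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fragments.length; i++) {
            int column = (int) (xs[i] / OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT);
            while (line.length() < column) {
                line.append(' ');
            }
            line.append(fragments[i]);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        // Three cells of a table row at x = 0 pt, 40 pt and 80 pt.
        String line = layoutLine(new double[] {0, 40, 80},
                                 new String[] {"Name", "Qty", "Price"});
        System.out.println("[" + line + "]");
    }
}
```

Because every row uses the same mapping from points to columns, cells that line up in the PDF also line up in the extracted text, which is what makes the table readable afterwards.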

Maven dependency
Only PDFBox 2.0.0 or later is compatible with this version of PDFLayoutTextStripper.java.
<dependency>
    <groupId>io.github.jonathanlink</groupId>
    <artifactId>PDFLayoutTextStripper</artifactId>
    <version>2.2.3</version>
</dependency>

package pdf;

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.text.TextPositionComparator;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class PDFLayoutTextStripper extends PDFTextStripper {

public static final boolean DEBUG = false;
public static final int OUTPUT_SPACE_CHARACTER_WIDTH_IN_PT = 4;

private double currentPageWidth;
private TextPosition previousTextPosition;
private List<TextLine> textLineList;

/**
* Constructor
*/
public PDFLayoutTextStripper() throws IOException {
    super();
    this.previousTextPosition = null;
    this.textLineList = new ArrayList<TextLine>();
}

/**
 * Records the page width, then lets PDFBox process the page.
 *
 * @param page page to parse
 */
@Override
public void processPage(PDPage page) throws IOException {
    PDRectangle pageRectangle = page.getMediaBox();
    if (pageRectangle!= null) {
        this.setCurrentPageWidth(pageRectangle.getWidth());
        super.processPage(page);
        this.previousTextPosition = null;
        this.textLineList = new ArrayList<TextLine>();
    }
}

@Override
protected void writePage() throws IOException {
    List<List<TextPosition>> charactersByArticle = super.getCharactersByArticle();
    for( int i = 0; i < charactersByArticle.size(); i++) {
        List<TextPosition> textList = charactersByArticle.get(i);
        try {
            this.sortTextPositionList(textList);
        } catch ( IllegalArgumentException e) {
            System.err.println(e);
        }
        this.iterateThroughTextList(textList.iterator()) ;
    }
    this.writeToOutputStream(this.getTextLineList());
}

private void writeToOutputStream(final List<TextLine> textLineList) throws IOException {
    for (TextLine textLine : textLineList) {
        char[] line = textLine.getLine().toCharArray();
        super.getOutput().write(line);
        super.getOutput().write('\n');
        super.getOutput().flush();
    }
}

/*
 * In order to get rid of the warning:
 * TextPositionComparator class should implement Comparator<TextPosition> instead of Comparator
 */
@SuppressWarnings("unchecked")
private void sortTextPositionList(final List<TextPosition> textList) {
    TextPositionComparator comparator = new TextPositionComparator();
    Collections.sort(textList, comparator);
}

private void writeLine(final List<TextPosition> textPositionList) {
    if ( textPositionList.size() > 0 ) {
        TextLine textLine = this.addNewLine();
        boolean firstCharacterOfLineFound = false;
        for (TextPosition textPosition : textPositionList ) {
            CharacterFactory characterFactory = new CharacterFactory(firstCharacterOfLineFound);
            Character character = c
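
Once the text has been extracted with its layout preserved, each table row is a line in which cells are separated by runs of spaces. The following is a minimal sketch of one way to turn such a line into cells. It is an illustration, not part of PDFLayoutTextStripper: the class `TableRowSplitter`, the method `splitCells`, and the rule "two or more consecutive spaces separate cells" are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableRowSplitter {
    /**
     * Hypothetical helper: splits one layout-preserved line into cells,
     * treating any run of two or more spaces as a column separator.
     * A single space stays inside the cell (e.g. "Apple pie").
     */
    static List<String> splitCells(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split(" {2,}")));
    }

    public static void main(String[] args) {
        System.out.println(splitCells("Name       Qty    Price"));
        System.out.println(splitCells("Apple pie  3      4.50"));
    }
}
```

This heuristic collapses empty cells, since a blank column just looks like a longer run of spaces. If the table can contain empty cells, splitting on fixed column offsets taken from the header row is likely a better fit.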