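The invisible text in the OP's sample PDF is usually made invisible in one of two ways: by defining a clip path that lies outside the bounds of the text, or by filling a path on top of it, which conceals the text underneath. Thus, we have to take path-related instructions into account during text extraction in order to ignore that invisible text.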
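To make the geometry concrete, here is a minimal sketch of the two checks that needs only java.awt.geom and no PDFBox. The class name VisibilitySketch and its methods are made up for illustration. It approximates a glyph by the two end points of its baseline, which is an assumption and not an exact visibility test.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Rectangle2D;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of PDFBox: models the two ways text can be hidden.
final class VisibilitySketch {

    // Clipped away unless both baseline end points lie inside the clip.
    // A null clip is treated as "no clipping at all".
    static boolean insideClip(Area clip, double x1, double y1, double x2, double y2) {
        return clip == null || (clip.contains(x1, y1) && clip.contains(x2, y2));
    }

    // Concealed if a path filled after the text covers either end point.
    static boolean coveredByFill(GeneralPath fill, double x1, double y1, double x2, double y2) {
        return fill.contains(x1, y1) || fill.contains(x2, y2);
    }

    // Visible only if the text survives the clip and no later fill covers it.
    static boolean isVisible(Area clip, List<GeneralPath> laterFills,
                             double x1, double y1, double x2, double y2) {
        if (!insideClip(clip, x1, y1, x2, y2)) {
            return false;
        }
        for (GeneralPath fill : laterFills) {
            if (coveredByFill(fill, x1, y1, x2, y2)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Area clip = new Area(new Rectangle2D.Double(0, 0, 100, 100));
        GeneralPath cover = new GeneralPath(new Rectangle2D.Double(0, 0, 60, 20));
        // Inside the clip, nothing painted on top.
        System.out.println(isVisible(clip, Collections.<GeneralPath>emptyList(), 10, 10, 50, 10));
        // Inside the clip, but a filled rectangle covers the baseline.
        System.out.println(isVisible(clip, Collections.singletonList(cover), 10, 10, 50, 10));
        // The baseline end point lies outside the clip.
        System.out.println(isVisible(clip, Collections.<GeneralPath>emptyList(), 90, 10, 120, 10));
    }
}
```

The clip check is strict (both end points must be inside) while the fill check is eager (one covered end point is enough), so partially hidden glyphs are dropped rather than kept.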
Unfortunately, the callbacks designed for these instructions are not declared in PDFTextStripper or in its parent classes LegacyPDFStreamEngine and PDFStreamEngine.
They are declared, however, in the other major PDFStreamEngine subclass, PDFGraphicsStreamEngine, and they are sensibly implemented in PageDrawer.
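For orientation, the following PDFBox-free model sketches the state such an implementation has to track. Path construction operators build up a current path. The clip operators (W, W*) only remember a winding rule. The operator that ends the path then intersects the clip with that path and resets it. The class PathStateSketch and its method signatures are hypothetical and only loosely modelled on those callbacks. Only the path-ending case is modelled here, and the real PageDrawer does considerably more.

```java
import java.awt.geom.Area;
import java.awt.geom.GeneralPath;
import java.awt.geom.Path2D;

// Hypothetical, simplified model of path and clip bookkeeping; not PDFBox code.
final class PathStateSketch {
    private final GeneralPath linePath = new GeneralPath();
    private int pendingClipRule = -1; // -1 means no clip operator seen for this path
    private Area clip;                // null means unclipped

    // Path construction: append a rectangle to the current path.
    void appendRectangle(float x, float y, float w, float h) {
        linePath.moveTo(x, y);
        linePath.lineTo(x + w, y);
        linePath.lineTo(x + w, y + h);
        linePath.lineTo(x, y + h);
        linePath.closePath();
    }

    // Clip operator: only remember the winding rule; the clip changes later.
    void clip(int windingRule) {
        pendingClipRule = windingRule;
    }

    // End of path: apply a pending clip, then discard the current path.
    void endPath() {
        if (pendingClipRule != -1) {
            linePath.setWindingRule(pendingClipRule);
            Area pathArea = new Area(linePath);
            if (clip == null) {
                clip = pathArea;
            } else {
                clip.intersect(pathArea);
            }
            pendingClipRule = -1;
        }
        linePath.reset();
    }

    Area getClip() {
        return clip;
    }

    public static void main(String[] args) {
        PathStateSketch state = new PathStateSketch();
        state.appendRectangle(0, 0, 100, 50);
        state.clip(Path2D.WIND_NON_ZERO);
        state.endPath();
        System.out.println(state.getClip().getBounds2D());
    }
}
```

Because clips only ever shrink by intersection, text drawn after a clip has been moved away from it fails the containment test and can be skipped.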
Thus, to make use of this, we can copy, paste, and adapt the PageDrawer implementation into a subclass of PDFTextStripper, e.g. like this:
public class PDFVisibleTextStripper extends PDFTextStripper {
    public PDFVisibleTextStripper() throws IOException {
        // Register the path construction, clipping and painting operators
        // so that their processors below get called during text extraction.
        addOperator(new AppendRectangleToPath());
        addOperator(new ClipEvenOddRule());
        addOperator(new ClipNonZeroRule());
        addOperator(new ClosePath());
        addOperator(new CurveTo());
        addOperator(new CurveToReplicateFinalPoint());
        addOperator(new CurveToReplicateInitialPoint());
        addOperator(new EndPath());
        addOperator(new FillEvenOddAndStrokePath());
        addOperator(new FillEvenOddRule());
        addOperator(new FillNonZeroAndStrokePath());
        addOperator(new FillNonZeroRule());
        addOperator(new LineTo());
        addOperator(new MoveTo());
        addOperator(new StrokePath());
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        // Approximate the glyph by its baseline: origin to origin plus width.
        Matrix textMatrix = text.getTextMatrix();
        Vector start = textMatrix.transform(new Vector(0, 0));
        Vector end = new Vector(start.getX() + text.getWidth(), start.getY());

        // Keep the glyph only if both end points lie inside the current clip.
        PDGraphicsState gs = getGraphicsState();
        Area area = gs.getCurrentClippingPath();
        if (area == null || (area.contains(start.getX(), start.getY()) && area.contains(end.getX(), end.getY()))) {
            super.processTextPosition(text);
        }
    }

    private GeneralPath linePath = new GeneralPath();

    // Removes already collected glyphs whose baseline start or end point
    // lies inside the current path (linePath).
    void deleteCharsInPath() {
        for (List<TextPosition> list : charactersByArticle) {
            List<TextPosition> toRemove = new ArrayList<>();
            for (TextPosition text : list) {
                Matrix textMatrix = text.getTextMatrix();
                Vector start = textMatrix.transform(new Vector(0, 0));
                Vector end = new Vector(start.getX() + text.getWidth(), start.getY());
                if (linePath.contains(start.getX(), start.getY()) || linePath.contains(end.getX(), end.getY())) {
                    toRemove.add(text);
                }
            }
            if (!toRemove.isEmpty()) {
                System.out.println(toRemove.size());
                list.removeAll(toRemove);
            }
        }
    }

    public final class AppendRectangleToPath extends OperatorProcessor {
        @Override
        public void process(Operator operator, List<COSBase> operands) throws IOException {
            // The operator takes four numeric operands.
            if (operands.size() < 4) {
                throw new MissingOperandException(operator, operands);
            }
            if (!checkArrayTypesClass(operands, COSNumber.cl