Java: finding a given position in a PDF_How to search for specific strings or words in a PDF document in Java and get their coordinates...

The last method of PDFBox's PDFTextStripper class that still has the text with its positions (before it is reduced to plain text) is the method

/**
 * Write a Java string to the output stream. The default implementation will ignore the textPositions
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException

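One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if sorting was requested in the first place).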

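(Actually I would have preferred to intercept in the calling method writeLine which, judging by the names of its parameters and local variables, has all TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe that will change before the final 2.0.0 release... nudge, nudge. Update: unfortunately it did not change in the release... sigh)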

Furthermore, it helps to use a helper class that wraps a sequence of TextPosition instances in a String-like class; this makes the code clearer.

With this in mind, one can search for the variables like this

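// Needs java.io.IOException, java.util.ArrayList, java.util.List,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.text.PDFTextStripper
// and org.apache.pdfbox.text.TextPosition (package names as of PDFBox 2.x).
List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            // Build the searched string from the positions so indices map 1:1 to TextPosition entries.
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1; // advance by one so overlapping hits are found, too
            }
            super.writeString(text, textPositions);
        }
    };

    stripper.setSortByPosition(true);
    stripper.setStartPage(page); // page numbers are 1-based
    stripper.setEndPage(page);
    stripper.getText(document);  // the returned text is ignored; the hits are collected as a side effect
    return hits;
}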
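The inner loop of findSubwords can be tried in isolation on plain strings. The following is a minimal sketch using only the standard library; the class name OverlapSearchSketch and the method findAll are made up for this illustration and are not part of PDFBox. It shows why the loop advances by one character rather than by the length of the search term, and it adds a guard for an empty term, which, as far as I can tell, would otherwise keep the loop running forever because indexOf keeps reporting a match.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone demo; not part of PDFBox. */
final class OverlapSearchSketch
{
    /** Returns the start index of every occurrence of term in text, overlapping ones included. */
    static List<Integer> findAll(String text, String term)
    {
        List<Integer> starts = new ArrayList<Integer>();
        if (term.isEmpty())
        {
            return starts; // assumption of this sketch: an empty term counts as "no hit"
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1)
        {
            starts.add(index);
            fromIndex = index + 1; // step by one char, not term.length(), so overlaps are kept
        }
        return starts;
    }

    public static void main(String[] args)
    {
        System.out.println(findAll("aaaa", "aa"));                     // prints [0, 1, 2]
        System.out.println(findAll("${var1} and ${var1}", "${var1}")); // prints [0, 12]
    }
}
```

If overlapping hits are not wanted, stepping by term.length() instead of 1 would be the obvious variation.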

with this helper class

// A view on the entries [start, end) of a list of TextPosition objects; end is exclusive.
public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        // Only the first char of each TextPosition is used, so one entry maps to one char.
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        // The arguments are relative to this view, hence the shift by this.start.
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        TextPosition first = textPositions.get(start);
        // end is exclusive, so the last entry of this view is at end - 1.
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
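The index arithmetic of such a view is easy to get wrong, so here is a minimal sketch of the same idea without PDFBox. GlyphSketch and GlyphRunSketch are hypothetical stand-ins invented for this illustration (a real TextPosition carries much more information), and the coordinates in main are made up. The view covers the half-open range from start to end, which matches length() being end - start; the last element therefore sits at end - 1, and the width is the right edge of the last glyph minus the left edge of the first.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a positioned glyph. */
final class GlyphSketch
{
    final char ch;
    final float x;
    final float width;

    GlyphSketch(char ch, float x, float width)
    {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

/** A view on glyphs[start, end) of a shared list; end is exclusive. */
final class GlyphRunSketch implements CharSequence
{
    private final List<GlyphSketch> glyphs;
    private final int start;
    private final int end;

    GlyphRunSketch(List<GlyphSketch> glyphs, int start, int end)
    {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        return glyphs.get(start + index).ch;
    }

    @Override
    public GlyphRunSketch subSequence(int from, int to)
    {
        // Both indices are relative to this view, so both are shifted by this.start.
        return new GlyphRunSketch(glyphs, start + from, start + to);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    /** Right edge of the last glyph minus left edge of the first one. */
    float width()
    {
        GlyphSketch first = glyphs.get(start);
        GlyphSketch last = glyphs.get(end - 1);
        return last.x + last.width - first.x;
    }
}

final class GlyphRunDemo
{
    public static void main(String[] args)
    {
        // Made-up layout: six glyphs, each 5 units wide, placed side by side.
        List<GlyphSketch> line = Arrays.asList(
                new GlyphSketch('a', 0f, 5f), new GlyphSketch('$', 5f, 5f),
                new GlyphSketch('{', 10f, 5f), new GlyphSketch('x', 15f, 5f),
                new GlyphSketch('}', 20f, 5f), new GlyphSketch('b', 25f, 5f));
        GlyphRunSketch hit = new GlyphRunSketch(line, 0, line.size()).subSequence(1, 5);
        System.out.println(hit + " has width " + hit.width()); // prints ${x} has width 20.0
    }
}
```

Because subSequence shares the underlying list instead of copying it, taking a sub-view of a sub-view stays cheap; the price is that every index has to be translated by start.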

To output their positions, widths, final letters, and final letter positions, you can then use this

void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    // Search page by page so each hit can be reported with its page number.
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf(" Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}

For a test I created a small test file using MS Word:

[Image: cGEhB.png, screenshot of the test file]

The output of this test

@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}

is

Variables.pdf
-------------
* Looking for '${var1}'
 Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
 Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
 Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
 Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
* Looking for '${var 2}'
 Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
 Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
 Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
 Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

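I was a bit surprised that ${var 2} was found when it sits on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives single words. It looks as if it receives longer parts of the line than mere words...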
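Why the chunk size matters: the search only ever looks inside one string handed to writeString at a time, so a term that contains a space can only be found if both of its halves arrive in the same call. The sketch below illustrates this with plain strings. The two chunkings are hypothetical inputs chosen for this illustration, and ChunkSearchSketch and countHits are invented names; how PDFBox actually splits a given line is decided by PDFBox and is not modelled here.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo of searching chunk by chunk; not part of PDFBox. */
final class ChunkSearchSketch
{
    /** Counts occurrences of term, looking inside each chunk separately. */
    static int countHits(List<String> chunks, String term)
    {
        int hits = 0;
        if (term.isEmpty())
        {
            return hits; // assumption of this sketch: an empty term counts as "no hit"
        }
        for (String chunk : chunks)
        {
            int fromIndex = 0;
            int index;
            while ((index = chunk.indexOf(term, fromIndex)) > -1)
            {
                hits++;
                fromIndex = index + 1;
            }
        }
        return hits;
    }

    public static void main(String[] args)
    {
        // Two made-up ways the same line could be delivered to the callback.
        List<String> longChunk = Arrays.asList("Value: ${var 2} end");
        List<String> perWord = Arrays.asList("Value:", "${var", "2}", "end");
        System.out.println(countHits(longChunk, "${var 2}")); // prints 1
        System.out.println(countHits(perWord, "${var 2}"));   // prints 0
    }
}
```

So the hits for ${var 2} in the output above are consistent with writeString receiving more than one word per call for that test file; with strictly per-word chunks this approach would miss the term.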

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
