Extracting Character Coordinates or Positions from a PDF with Java PDFBox

样young

Last modified: 2022-06-23 15:09:44

First published: 2022-06-23 14:53:58

Original link: https://blog.csdn.net/allway2/article/details/124399662

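This article describes how to extract the coordinates and size of characters in a PDF document by extending the PDFTextStripper class. The steps are: create a Java class, call the writeText method, and override the writeString method to print each character's position and size. The example code shows a concrete implementation that outputs each character's Unicode value, X and Y coordinates, height, and width.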

To extract the coordinates (position) and size of the characters in a PDF, we extend the PDFTextStripper class and override its writeString(String string, List<TextPosition> textPositions) method.

The PDFTextStripper class (org.apache.pdfbox.text.PDFTextStripper) strips out all of the text in a document.

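The List<TextPosition> passed to writeString() carries information about each character, such as its Unicode value, X coordinate, Y coordinate, height, width, x scale value, y scale value, font size, and width of space.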

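Steps to Extract Character Coordinates from a PDF
The following is a step-by-step process for extracting the coordinates or positions of characters in a PDF.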

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

public class GetCharLocationAndSize extends PDFTextStripper {
. . .
}

2. Call the writeText method

Set the page range (from the first page to the last page) from which text is stripped, then call writeText().

PDFTextStripper stripper = new GetCharLocationAndSize();
stripper.setSortByPosition( true );
stripper.setStartPage( 1 ); // page numbers are 1-based
stripper.setEndPage( document.getNumberOfPages() );

Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);

3. Override writeString

The writeString method receives the text positions of the characters in the stream. We override it as shown below.

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
. . .
}

4. Print the position and size

For each item in the TextPosition list (one item per character), print its coordinates and size.
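
The four values printed per character (X, Y, height, width) are enough to build a bounding rectangle. The following is a minimal sketch in plain Java with no PDFBox dependency. The names CharBoxSketch, Box, and boxOf are hypothetical, and the sketch assumes that the Y value (as returned by getYDirAdj()) is the character's baseline measured downward from the top of the page, so the box extends upward from Y by the height. Check that assumption against your own documents before relying on it.

```java
public class CharBoxSketch {

    /** Hypothetical value type: a rectangle in top-left-origin page coordinates. */
    static final class Box {
        final float left, top, right, bottom;

        Box(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        /** Smallest box that contains both this box and the other one. */
        Box union(Box other) {
            return new Box(Math.min(left, other.left), Math.min(top, other.top),
                    Math.max(right, other.right), Math.max(bottom, other.bottom));
        }
    }

    /**
     * Builds a box from the values printed for one character.
     * Assumption: y is the baseline, measured downward from the page top.
     */
    static Box boxOf(float x, float y, float height, float width) {
        return new Box(x, y - height, x + width, y);
    }

    public static void main(String[] args) {
        // Values copied from the first two lines of the sample output in Example 1 ("2" and "0").
        Box first = boxOf(26.004425f, 22.003723f, 5.833024f, 5.0907116f);
        Box second = boxOf(31.095137f, 22.003723f, 5.833024f, 5.0907116f);
        Box both = first.union(second);
        System.out.println("left=" + both.left + " top=" + both.top
                + " right=" + both.right + " bottom=" + both.bottom);
    }
}
```

Inside writeString, the same helper could be called with text.getXDirAdj(), text.getYDirAdj(), text.getHeightDir(), and text.getWidthDirAdj() for each TextPosition.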

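Example 1: Extracting the coordinates or positions of characters in a PDF
In this example, we take a PDF that contains text and extract the (X, Y) coordinates of its characters.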

GetCharLocationAndSize.java

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.List;

/**
 * This is an example of how to get the x/y coordinates and size of each character in a PDF.
 */
public class GetCharLocationAndSize extends PDFTextStripper {

    public GetCharLocationAndSize() throws IOException {
    }

    /**
     * @throws IOException If there is an error parsing the document.
     */
    public static void main(String[] args) throws IOException {
        String fileName = "apache.pdf";
        // PDDocument.load(File) is the PDFBox 2.x loading API.
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(new File(fileName))) {
            PDFTextStripper stripper = new GetCharLocationAndSize();
            stripper.setSortByPosition(true);
            stripper.setStartPage(1); // page numbers are 1-based
            stripper.setEndPage(document.getNumberOfPages());

            // The extracted text itself is not needed, so it goes to a throwaway writer.
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);
        }
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     * to print the position and size of every character.
     */
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println(text.getUnicode() + " [(X=" + text.getXDirAdj() + ",Y=" +
                    text.getYDirAdj() + ") height=" + text.getHeightDir() + " width=" +
                    text.getWidthDirAdj() + "]");
        }
    }
}

Output

2 [(X=26.004425,Y=22.003723) height=5.833024 width=5.0907116]
0 [(X=31.095137,Y=22.003723) height=5.833024 width=5.0907116]
1 [(X=36.18585,Y=22.003723) height=5.833024 width=5.0907097]
7 [(X=41.276558,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=46.367268,Y=22.003723) height=5.833024 width=2.8872108]
8 [(X=49.25448,Y=22.003723) height=5.833024 width=5.0907097]
- [(X=54.34519,Y=22.003723) height=5.833024 width=2.8872108]
6 [(X=57.2324,Y=22.003723) height=5.833024 width=5.0907097]
W [(X=226.4448,Y=22.003723) height=5.833024 width=7.911499]
e [(X=233.88747,Y=22.003723) height=5.833024 width=4.922714]
l [(X=238.81018,Y=22.003723) height=5.833024 width=2.2230377]
c [(X=241.03322,Y=22.003723) height=5.833024 width=4.399185]
o [(X=245.4324,Y=22.003723) height=5.833024 width=4.895355]
m [(X=250.32776,Y=22.003723) height=5.833024 width=7.7943115]
e [(X=258.12207,Y=22.003723) height=5.833024 width=4.922699]

If you want to use the same PDF file, download the PDF document apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.
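
In the sample output, "6" ends near X=62.3 while "W" starts at X=226.4, so the horizontal gap between neighbouring characters can be used to split a line into words. The following is a minimal sketch of that idea in plain Java, with no PDFBox dependency. The names WordGroupingSketch, Glyph, and groupIntoWords, as well as the 2.0 gap threshold, are assumptions chosen for illustration. The sketch also assumes the characters belong to one line and are already sorted left to right.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordGroupingSketch {

    /** Hypothetical holder for the values printed per character. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Splits glyphs of one line (sorted left to right) into words.
     * A new word starts when the gap to the previous glyph exceeds maxGap.
     */
    static List<String> groupIntoWords(List<Glyph> glyphs, float maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null && g.x - (previous.x + previous.width) > maxGap) {
                words.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            words.add(current.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        // X and width values copied from the sample output above.
        List<Glyph> line = Arrays.asList(
                new Glyph("2", 26.004425f, 5.0907116f), new Glyph("0", 31.095137f, 5.0907116f),
                new Glyph("1", 36.18585f, 5.0907097f), new Glyph("7", 41.276558f, 5.0907097f),
                new Glyph("-", 46.367268f, 2.8872108f), new Glyph("8", 49.25448f, 5.0907097f),
                new Glyph("-", 54.34519f, 2.8872108f), new Glyph("6", 57.2324f, 5.0907097f),
                new Glyph("W", 226.4448f, 7.911499f), new Glyph("e", 233.88747f, 4.922714f),
                new Glyph("l", 238.81018f, 2.2230377f), new Glyph("c", 241.03322f, 4.399185f),
                new Glyph("o", 245.4324f, 4.895355f), new Glyph("m", 250.32776f, 7.7943115f),
                new Glyph("e", 258.12207f, 4.922699f));
        System.out.println(groupIntoWords(line, 2.0f)); // prints [2017-8-6, Welcome]
    }
}
```

A fixed threshold is the simplest choice. A threshold derived from the font's width of space would probably adapt better to different font sizes, but that is not shown here.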

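Conclusion
In this PDFBox tutorial, we learned how to extract the coordinates or positions of characters in a PDF document, and saw that each character also exposes its Unicode value, X coordinate, Y coordinate, height, width, x scale value, y scale value, font size, and width of space.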

PDFBox: 2.0.25, Spire.PDF: 4.4.8
————————————————
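Copyright notice: This is an original article by CSDN blogger "allway2", licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.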