Java PDFBox: Extracting Words from a PDF Document

allway2

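Tags: java

Copyright notice: This is an original article by the author, released under the CC 4.0 BY-SA license. Reproductions must include a link to the original source and this notice.

Article link: https://blog.csdn.net/allway2/article/details/124399820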

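To extract words from a PDF document (from all of its pages), we call the writeText method of PDFTextStripper and override its writeString method.

The class org.apache.pdfbox.text.PDFTextStripper extracts all of the text from a PDF document.

To get at the individual words, we extend PDFTextStripper and override the writeString(String str, List<TextPosition> textPositions) method.

The first argument of writeString is a run of text, typically a line. That text can be split into words using the word separator.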

Steps to extract words from a PDF document

The step-by-step process for extracting words from a PDF is as follows:

1. Extend PDFTextStripper

Create a Java class that extends PDFTextStripper.

    public class GetWordsFromPDF extends PDFTextStripper {
        . . .
    }

2. Call the writeText method

Set the page range (from the first page to the last page) from which to extract text, then call writeText().

    PDFTextStripper stripper = new GetWordsFromPDF();
    stripper.setSortByPosition(true);
    stripper.setStartPage(1); // page numbers are 1-based
    stripper.setEndPage(document.getNumberOfPages());
    Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
    stripper.writeText(document, dummy);
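The writer passed to writeText() is called "dummy" because this approach collects the words inside writeString(), so whatever reaches the writer is never read. As a minimal sketch, assuming Java 11 or later, the standard library's Writer.nullWriter() is an alternative sink that discards everything written to it. The class and method names below are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.Writer;

public class DiscardingWriterSketch {

    /** Writes the text to a sink that drops it; returns true if the write succeeded. */
    static boolean writeDiscarded(String text) {
        // Writer.nullWriter() (Java 11+) accepts any input and stores nothing
        try (Writer dummy = Writer.nullWriter()) {
            dummy.write(text);
            dummy.flush();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(writeDiscarded("text that nobody reads"));
    }
}
```

Such a writer could be passed to writeText() in place of the OutputStreamWriter above, since the method only requires a java.io.Writer.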

3. Override writeString

The writeString method receives a line of text as its first argument. It is called for each line of text in the PDF document.

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        . . .
    }
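The relationship between writeText() and writeString() is a callback: the base class drives the extraction and hands each line to a method that a subclass can override. The sketch below imitates that flow with plain Java so it runs without PDFBox. MiniStripper and LineCollector are hypothetical stand-ins, not PDFBox classes, and the real PDFTextStripper does considerably more work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CallbackSketch {

    /** Hypothetical stand-in for PDFTextStripper: walks the lines, delegates each one. */
    static class MiniStripper {
        void writeText(List<String> lines) {
            for (String line : lines) {
                writeString(line); // invoked once per line
            }
        }

        // Default behaviour does nothing; subclasses override this hook
        protected void writeString(String line) {
        }
    }

    /** Subclass that intercepts every line instead of writing it out. */
    static class LineCollector extends MiniStripper {
        final List<String> seen = new ArrayList<>();

        @Override
        protected void writeString(String line) {
            seen.add(line);
        }
    }

    static List<String> collect(List<String> lines) {
        LineCollector collector = new LineCollector();
        collector.writeText(lines);
        return collector.seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(Arrays.asList("Welcome to", "The Apache Way")));
    }
}
```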

4. Get the words

Split the string received by the writeString method on the word separator.
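The splitting step can be tried on its own, without a PDF. The sketch below assumes the separator is a plain string such as a single space. Because String.split() interprets its argument as a regular expression, the separator is quoted with Pattern.quote(), and empty tokens produced by consecutive separators are dropped. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class WordSplitSketch {

    /** Splits a line on a literal separator and skips empty tokens. */
    static List<String> splitWords(String line, String separator) {
        List<String> words = new ArrayList<>();
        // Pattern.quote treats the separator literally, even if it is "|" or "."
        for (String token : line.split(Pattern.quote(separator))) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The double space after "Welcome" would otherwise yield an empty word
        System.out.println(splitWords("Welcome  to The Apache Way", " "));
    }
}
```

Inside a PDFTextStripper subclass, the separator would come from getWordSeparator().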

Example 1: Extract words from a PDF

In this example, we take a PDF document and extract all of the words from it.

GetWordsFromPDF.java

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    /**
     * This is an example on how to extract words from a PDF document.
     */
    public class GetWordsFromPDF extends PDFTextStripper {

        // Collects every word seen by writeString(), in the order received
        static List<String> words = new ArrayList<String>();

        public GetWordsFromPDF() throws IOException {
        }

        /**
         * @throws IOException If there is an error parsing the document.
         */
        public static void main(String[] args) throws IOException {
            PDDocument document = null;
            String fileName = "apache.pdf"; // replace with your PDF file name
            try {
                // PDDocument.load(File) is the PDFBox 2.x API
                document = PDDocument.load(new File(fileName));

                PDFTextStripper stripper = new GetWordsFromPDF();
                stripper.setSortByPosition(true);
                stripper.setStartPage(1); // page numbers are 1-based
                stripper.setEndPage(document.getNumberOfPages());

                // writeText() requires a Writer; its output is not used here
                Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
                stripper.writeText(document, dummy);

                // print words
                for (String word : words) {
                    System.out.println(word);
                }
            } finally {
                if (document != null) {
                    document.close();
                }
            }
        }

        /**
         * Override the default functionality of PDFTextStripper.writeString()
         */
        @Override
        protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
            // String.split() takes a regex, so quote the separator to match it literally
            String[] wordsInStream = str.split(Pattern.quote(getWordSeparator()));
            for (String word : wordsInStream) {
                if (!word.isEmpty()) { // skip empty tokens from repeated separators
                    words.add(word);
                }
            }
        }
    }

Output

    2017-8-6
    Welcome
    to
    The
    Apache
    Software
    Foundation!
    Custom
    Search
    The
    Apache
    Way
    (/foundation/governance/)
    (Donating to The Apache Software Foundation)

To reproduce this output, use the same PDF document, apache.pdf. Otherwise, set fileName in the Java program to the path of your own PDF file.
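Some tokens in the output above, such as Foundation! and (/foundation/governance/), still carry punctuation, because the text is split on the word separator only. If cleaner words are needed, one possible post-processing step is to trim punctuation from both ends of each token. This is a sketch of an optional extra step, not part of the original program. The class and method names are hypothetical, and \p{Punct} covers ASCII punctuation only.

```java
public class TokenCleanupSketch {

    /** Removes leading and trailing ASCII punctuation; inner characters are kept. */
    static String stripPunctuation(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    public static void main(String[] args) {
        System.out.println(stripPunctuation("Foundation!"));
        System.out.println(stripPunctuation("(/foundation/governance/)"));
        System.out.println(stripPunctuation("2017-8-6"));
    }
}
```

Tokens that consist only of punctuation become empty strings, so a caller would likely want to skip those.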

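Conclusion

In this Apache PDFBox tutorial, we learned how to extract words from a PDF. You may also want to refer to the related tutorial on extracting the coordinates or position of characters in a PDF.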
