How to extract content from a PDF in Java with Tika's PDFParser

In Java, how do you extract the content of a PDF file?

The project's directory structure is as follows:

[Image: project directory structure]

The Tika packages can be downloaded from http://tika.apache.org/download.html . Download only tika-app-1.16.jar and tika-server-1.16.jar.

The following program extracts content from a PDF using Java:

    import java.io.File;
    import java.io.FileInputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.pdf.PDFParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class ExtractContentFromPDF {

        public static void main(String[] args) throws Exception {
            // Collects the body text; the no-argument constructor caps it at
            // 100,000 characters (new BodyContentHandler(-1) removes the cap)
            BodyContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext pcontext = new ParseContext();

            // parsing the document using PDF parser;
            // try-with-resources closes the stream when parsing is done
            try (FileInputStream inputstream = new FileInputStream(new File("pdfExample.pdf"))) {
                PDFParser pdfparser = new PDFParser();
                pdfparser.parse(inputstream, handler, metadata, pcontext);
            }

            // getting the content of the document
            System.out.println("Contents of the PDF :" + handler.toString());

            // getting metadata of the document
            System.out.println("Metadata of the PDF:");
            String[] metadataNames = metadata.names();
            for (String name : metadataNames) {
                System.out.println(name + " : " + metadata.get(name));
            }
        }
    }
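The metadata loop above prints every key, and in the output shown later many of them (for example `Keywords` and `subject`) have empty values. A minimal sketch of filtering those out, assuming the names and values have first been copied into a plain `Map<String, String>`. The class `MetadataFilter` and the helper `nonEmptySorted` are hypothetical names for illustration and are not part of Tika:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

final class MetadataFilter {

    // Returns only the entries whose value is non-null and non-blank, sorted by key.
    static Map<String, String> nonEmptySorted(Map<String, String> raw) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String value = entry.getValue();
            if (value != null && !value.trim().isEmpty()) {
                result.put(entry.getKey(), value.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for values read with metadata.names() and metadata.get(name).
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("xmp:CreatorTool", "WPS Office");
        raw.put("Keywords", "");
        raw.put("dc:creator", "Administrator");
        raw.put("subject", " ");
        raw.put("Content-Type", "application/pdf");

        // Prints only the three entries that carry a value.
        nonEmptySorted(raw).forEach((name, value) -> System.out.println(name + " : " + value));
    }
}
```

In the real program the map could be filled inside the existing `for (String name : metadataNames)` loop before printing.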

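The original PDF file, pdfExample.pdf, has the following content:

[Image: contents of pdfExample.pdf]

Running the example code above produces the following output. The SLF4J lines at the top are warnings that both Tika jars on the classpath contain an SLF4J binding; SLF4J picks one of them and the extraction still runs.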

    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-app-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/F:/worksp/javaexamples/libs/tika_libs/tika-server-1.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored
    See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.
    TIFFImageWriter not loaded. tiff files will not be processed
    See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.
    J2KImageReader not loaded. JPEG2000 files will not be processed.
    See http://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    for optional dependencies.
    Sep 27, 2017 4:29:50 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    WARNING: org.xerial's sqlite-jdbc is not loaded.
    Please provide the jar on your classpath to parse sqlite files.
    See tika-parsers/pom.xml for the correct version.

    Contents of the PDF :
    Apache Tika is a library that is used for document type detection and
    content extraction from various file formats.
    Internally, Tika uses various existing document parsers and
    document type detection techniques to detect and extract data.
    Using Tika, one can develop a universal type detector and content
    extractor to extract both structured text as well as metadata from
    different types of documents such as spreadsheets, text documents,
    images, PDFs and even multimedia input formats to a certain extent.

    Metadata of the PDF:
    date : 2017-09-26T20:00:44Z
    pdf:PDFVersion : 1.7
    pdf:docinfo:title :
    xmp:CreatorTool : WPS Office
    Company :
    Keywords :
    access_permission:modify_annotations : true
    access_permission:can_print_degraded : true
    subject :
    dc:creator : Administrator
    dcterms:created : 2017-09-26T20:00:44Z
    Last-Modified : 2017-09-26T20:00:44Z
    dcterms:modified : 2017-09-26T20:00:44Z
    dc:format : application/pdf; version=1.7
    Last-Save-Date : 2017-09-26T20:00:44Z
    pdf:docinfo:creator_tool : WPS Office
    access_permission:fill_in_form : true
    pdf:docinfo:keywords :
    pdf:docinfo:modified : 2017-09-26T20:00:44Z
    meta:save-date : 2017-09-26T20:00:44Z
    pdf:encrypted : false
    modified : 2017-09-26T20:00:44Z
    pdf:docinfo:custom:SourceModified : D:20170927041644+08'16'
    cp:subject :
    pdf:docinfo:subject :
    Content-Type : application/pdf
    pdf:docinfo:creator : Administrator
    creator : Administrator
    meta:author : Administrator
    dc:subject :
    meta:creation-date : 2017-09-26T20:00:44Z
    created : Tue Sep 26 16:00:44 BOT 2017
    Comments :
    access_permission:extract_for_accessibility : true
    access_permission:assemble_document : true
    xmpTPg:NPages : 1
    Creation-Date : 2017-09-26T20:00:44Z
    access_permission:extract_content : true
    pdf:docinfo:custom:Company :
    access_permission:can_print : true
    SourceModified : D:20170927041644+08'16'
    pdf:docinfo:custom:Comments :
    meta:keyword :
    Author : Administrator
    producer :
    access_permission:can_modify : true
    pdf:docinfo:producer :
    pdf:docinfo:created : 2017-09-26T20:00:44Z
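Most of the timestamps in the metadata above use ISO-8601 in UTC (`2017-09-26T20:00:44Z`), while `created` shows `Tue Sep 26 16:00:44 BOT 2017`. These appear to be the same instant, with `created` rendered in the local zone of the machine that ran the example (BOT is Bolivia time, UTC-4). A small sketch of checking that with `java.time`. The class `MetadataDates` and its helpers are hypothetical names for illustration, and the fixed UTC-4 offset is an assumption read off the output:

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

final class MetadataDates {

    // Parses an ISO-8601 UTC timestamp such as the value of "dcterms:created".
    static Instant parseIso(String value) {
        return Instant.parse(value);
    }

    // Expresses the same instant at a fixed UTC offset given in whole hours.
    static OffsetDateTime atOffset(Instant instant, int offsetHours) {
        return instant.atOffset(ZoneOffset.ofHours(offsetHours));
    }

    public static void main(String[] args) {
        Instant created = parseIso("2017-09-26T20:00:44Z");
        // Assumption: the "created" value was printed at UTC-4 (BOT).
        System.out.println(atOffset(created, -4)); // prints 2017-09-26T16:00:44-04:00
    }
}
```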

Java can also parse PDF files with the open-source Apache PDFBox library, which can extract text and render pages to images. The following simple example reads the text of a PDF and saves each page as an image:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

import javax.imageio.ImageIO;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFParser {

    public static void main(String[] args) throws IOException {
        // Load the PDF file (PDDocument.load is the PDFBox 2.x API;
        // PDFBox 3.x uses Loader.loadPDF instead).
        // try-with-resources closes the document at the end.
        try (PDDocument document = PDDocument.load(new File("example.pdf"))) {
            // Extract the text
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            System.out.println("Text content:\n" + text);

            // Render each page (0-based index) to an image at 300 DPI
            PDFRenderer renderer = new PDFRenderer(document);
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                BufferedImage image = renderer.renderImageWithDPI(i, 300, ImageType.RGB);
                // Save the image as page1.png, page2.png, ...
                File outputFile = new File("page" + (i + 1) + ".png");
                ImageIO.write(image, "png", outputFile);
            }
        }
    }
}
```

This example extracts the text of the PDF and prints it to the console, then renders each page to an image and saves it to a file. Note that each PNG file contains the visible content of a whole page, not the individual images embedded in it. If you need to extract the PDF's vector graphics, use a different approach.