java tika pdf_TIKA - Extracting PDF

Most recently recommended article published on 2024-07-06 13:29:25

otter_ai


Article tags: java tika pdf

Article link: https://blog.csdn.net/weixin_29039773/article/details/114210239


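This post shows how to use Apache Tika's PDFParser to extract content and metadata from a PDF document. Compiling and running the Java code provided yields the PDF's text content along with its metadata, including the creation date, author, and PDF version.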


The following program extracts the content and metadata from a PDF.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfParse {

    // PDFParser.parse also declares SAXException, so it is listed here as well
    public static void main(final String[] args) throws IOException, TikaException, SAXException {
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();

        // Parse the document with the PDF parser; try-with-resources closes the stream
        try (FileInputStream inputstream = new FileInputStream(new File("Example.pdf"))) {
            PDFParser pdfparser = new PDFParser();
            pdfparser.parse(inputstream, handler, metadata, pcontext);
        }

        // Print the text content of the document
        System.out.println("Contents of the PDF :" + handler.toString());

        // Print every metadata field of the document
        System.out.println("Metadata of the PDF:");
        String[] metadataNames = metadata.names();
        for (String name : metadataNames) {
            System.out.println(name + " : " + metadata.get(name));
        }
    }
}

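Save the code above as PdfParse.java, then compile and run it from the command prompt with the following commands (the Tika JAR files must be on the classpath):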

javac PdfParse.java

java PdfParse

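A snapshot of the Example.pdf document is given below:

The PDF document has the following properties:

Running the program above produces the following output.

Output: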

Contents of the PDF:

Apache Tika is a framework for content type detection and content extraction
which was designed by Apache software foundation. It detects and extracts metadata
and structured text content from different types of documents such as spreadsheets,
text documents, images or PDFs including audio or video input formats to certain extent.

Metadata of the PDF:

dcterms:modified : 2014-09-28T12:31:16Z
meta:creation-date : 2014-09-28T12:31:16Z
meta:save-date : 2014-09-28T12:31:16Z
dc:creator : Krishna Kasyap
pdf:PDFVersion : 1.5
Last-Modified : 2014-09-28T12:31:16Z
Author : Krishna Kasyap
dcterms:created : 2014-09-28T12:31:16Z
date : 2014-09-28T12:31:16Z
modified : 2014-09-28T12:31:16Z
creator : Krishna Kasyap
xmpTPg:NPages : 1
Creation-Date : 2014-09-28T12:31:16Z
pdf:encrypted : false
meta:author : Krishna Kasyap
created : Sun Sep 28 05:31:16 PDT 2014
dc:format : application/pdf; version=1.5
producer : Microsoft® Word 2013
Content-Type : application/pdf
xmp:CreatorTool : Microsoft® Word 2013
Last-Save-Date : 2014-09-28T12:31:16Z
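
Many metadata names in the output above contain a colon themselves (dcterms:created, dc:format), so any code that reads these printed lines back has to split on the " : " separator rather than on a bare colon. The sketch below is one possible way to do that using only the Java standard library. The class name MetadataLines and its parse method are hypothetical helpers introduced for illustration and are not part of Tika; the sample lines are copied from the output above.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: reads lines in the "name : value"
// layout that the PdfParse program prints.
final class MetadataLines {

    // Splits each line at the first " : " (space, colon, space). Splitting on a
    // bare ':' would cut names such as "dcterms:created" in half.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(" : ");
            if (sep < 0) {
                continue; // not a metadata line (for example, a heading)
            }
            String name = line.substring(0, sep).trim();
            String value = line.substring(sep + 3).trim();
            fields.put(name, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> printed = List.of(
                "Metadata of the PDF:",
                "dcterms:created : 2014-09-28T12:31:16Z",
                "Author : Krishna Kasyap",
                "dc:format : application/pdf; version=1.5");

        Map<String, String> fields = parse(printed);

        // ISO-8601 timestamps ending in 'Z' can be read directly by java.time.
        Instant created = Instant.parse(fields.get("dcterms:created"));
        System.out.println(fields.get("Author") + " created the file at " + created);
    }
}
```

Inside the PdfParse program itself this parsing is unnecessary, because metadata.get(name) already returns each value as a string; the helper is only relevant when the printed output is all that is available. Note that the created field in the output uses a different date layout (Sun Sep 28 05:31:16 PDT 2014) and would not be accepted by Instant.parse.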
