Java Tika PDF: exception when extracting PDF information with Tika

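When extracting information from a PDF with Apache Tika, a WriteLimitReachedException is thrown because the document contains more characters than the default limit of 100,000. Passing a writeLimit sized for the PDF document to the BodyContentHandler constructor fixes the problem, and the PDF's metadata, including dc:subject, Creation-Date and the author, is retrieved successfully.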

org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available).
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:85)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:146)
    at org.apache.tika.sax.SafeContentHandler.access$001(SafeContentHandler.java:46)
    at org.apache.tika.sax.SafeContentHandler$1.write(SafeContentHandler.java:82)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:140)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:287)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:278)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:305)
    at org.apache.tika.parser.pdf.PDF2XHTML.writeString(PDF2XHTML.java:398)
    at org.apache.pdfbox.util.PDFTextStripper.writeString(PDFTextStripper.java:866)
    at org.apache.pdfbox.util.PDFTextStripper.writeLine(PDFTextStripper.java:1896)
    at org.apache.pdfbox.util.PDFTextStripper.writePage(PDFTextStripper.java:744)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:461)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:130)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:159)

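The error above is thrown when extracting PDF information with Apache Tika. According to the message, the extracted text probably exceeded the requested limit (100,000 characters).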
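To make the error message more concrete, here is a minimal sketch of how a write limit in a SAX content handler can behave. It is not Tika's actual source: the class name LimitedTextHandler and its fields are hypothetical, and it uses only the SAX classes that ship with the JDK. It keeps the text that still fits and then throws, which matches the message's note that text up to the limit remains available.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for a write-limited handler; not Tika's real implementation. */
public class LimitedTextHandler extends DefaultHandler {
    // Maximum number of characters to keep; -1 means no limit (an assumption of this sketch).
    private final int writeLimit;
    private final StringBuilder text = new StringBuilder();

    public LimitedTextHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then signal that the limit was reached.
        text.append(ch, start, writeLimit - text.length());
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    public static void main(String[] args) {
        LimitedTextHandler handler = new LimitedTextHandler(5);
        try {
            handler.characters("PostgreSQL".toCharArray(), 0, 10);
        } catch (SAXException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println("kept: " + handler); // prints "kept: Postg"
    }
}
```

The parser keeps pushing text into the handler page by page, so for a large PDF the exception surfaces in the middle of parsing, as in the stack trace above.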

My code is as follows:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfMetadata {
    public static void main(String[] args) {
        Parser parser = new PDFParser();
        // No-argument constructor: the default write limit of 100,000 characters applies.
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        // try-with-resources closes the stream even if opening or parsing fails.
        try (InputStream stream = new FileInputStream("1.pdf")) {
            parser.parse(stream, handler, metadata, new ParseContext());
            // Print every metadata field that Tika extracted.
            for (String name : metadata.names()) {
                System.out.println(name + ":\t" + metadata.get(name));
            }
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }
}

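As for the character limit, I probably did not pass a maximum to some constructor, so the default of 100,000 characters was used. Looking over the code above, I noticed this BodyContentHandler constructor: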

org.apache.tika.sax.BodyContentHandler.BodyContentHandler(int writeLimit)

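That looks relevant. Change the constructor argument to 10*1024*1024. The value is a character count, so choose it according to the size of the PDF document.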
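The change described above might look like the following sketch. It assumes Apache Tika and its PDF parser are on the classpath. The class name PdfMetadataWithLimit and the helper extract are hypothetical names used only for illustration. According to the Tika Javadoc for BodyContentHandler(int writeLimit), passing -1 disables the limit altogether, which can be an alternative when the document size is not known in advance.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class PdfMetadataWithLimit {

    /** Hypothetical helper: parses a PDF with an explicit write limit and returns its metadata. */
    public static Metadata extract(Path pdf, int writeLimit)
            throws IOException, SAXException, TikaException {
        Parser parser = new PDFParser();
        // writeLimit is a character count; -1 disables the limit.
        BodyContentHandler handler = new BodyContentHandler(writeLimit);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(pdf)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // 10 * 1024 * 1024 is the value used in this post; adjust it to the document.
        Metadata metadata = extract(Paths.get("1.pdf"), 10 * 1024 * 1024);
        for (String name : metadata.names()) {
            System.out.println(name + ":\t" + metadata.get(name));
        }
    }
}
```

A large limit means the handler may buffer that much extracted text in memory, so it is worth picking a value that fits the documents actually being processed.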

Run the program again and the PDF metadata is printed as follows:

dc:subject:
meta:save-date:2014-07-22T21:02:38Z
subject:PostgreSQL 9.3 Documentation
Author:The PostgreSQL Global Development Group
dcterms:created:2014-07-22T20:55:33Z
date:2014-07-22T21:02:38Z
creator:The PostgreSQL Global Development Group
Creation-Date:2014-07-22T20:55:33Z
title:PostgreSQL 9.3 Documentation
trapped:False
meta:author:The PostgreSQL Global Development Group
created:Wed Jul 23 04:55:33 CST 2014
meta:keyword:
cp:subject:PostgreSQL 9.3 Documentation
dc:format:application/pdf; version=1.4
PTEX.Fullbanner:This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012/Debian) kpathsea version 6.1.0
xmp:CreatorTool:LaTeX with hyperref package
Keywords:
dc:title:PostgreSQL 9.3 Documentation
Last-Save-Date:2014-07-22T21:02:38Z
meta:creation-date:2014-07-22T20:55:33Z
dcterms:modified:2014-07-22T21:02:38Z
dc:creator:The PostgreSQL Global Development Group
pdf:PDFVersion:1.4
Last-Modified:2014-07-22T21:02:38Z
modified:2014-07-22T21:02:38Z
xmpTPg:NPages:2861
pdf:encrypted:false
producer:pdfTeX-1.40.13; modified using iText® 5.1.3 ©2000-2011 1T3XT BVBA
Content-Type:application/pdf
