Advanced PDF parser for Java

I want to extract different content from a PDF file in Java:

The complete visible text

images

links

Is it also possible to get the following?

document meta tags like title, description or author

only headlines

input elements if the document contains a form

I do not need to manipulate or render PDF files. Which library would be the best fit for that kind of purpose?

UPDATE

OK, I tried PDFBox:

    Document luceneDocument = LucenePDFDocument.getDocument(new File(path));
    Field contents = luceneDocument.getField("contents");
    System.out.println(contents.stringValue());

But the output is null. The field "summary" is OK though.

The next snippet works fine.

    PDDocument doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(doc);
    System.out.println(text);
    doc.close();

But then, I have no clue how to extract the images, links, etc.

UPDATE 2

I found an example of how to extract the images, but I still have no answer on how to extract:

links

document meta tags like title, description or author

only headlines

input elements if the document contains a form

Solution

iText is my PDF tool of choice these days.

The complete visible text

"Visible" is a tough one. You can parse out all the parsable text with the com.itextpdf.text.pdf.parse package's classes... but those classes don't know about CLIPPING. You can constrain the parser to the page size easily enough.

    // all text on the page, regardless of position
    PdfTextExtractor.getTextFromPage(reader, pageNum);

To constrain the text to the page, you'd actually need the overload that takes a TextExtractionStrategy, and pass it a filtered strategy. It gets interesting fairly quickly, but I think you can get everything you want here "out of the box".
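The idea behind a filtered strategy can be sketched without iText: keep only the text whose position falls inside the page rectangle. The sketch below is an illustration under assumptions, not iText code. The `Chunk` class, the `insidePage` helper and the sample coordinates are all hypothetical, and only the start point of each chunk is tested rather than its full bounding box.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class PageRegionSketch {

    /** Hypothetical stand-in for a piece of extracted text and its start point. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Keeps only the chunks whose start point lies inside the page rectangle. */
    static List<String> insidePage(List<Chunk> chunks, Rectangle2D page) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (page.contains(c.x, c.y)) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // US Letter page in PDF units (points), origin at the lower-left corner.
        Rectangle2D page = new Rectangle2D.Double(0, 0, 612, 792);
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Body text", 72, 700));
        chunks.add(new Chunk("Off-page note", 700, 100)); // x is past the right edge
        System.out.println(insidePage(chunks, page)); // prints [Body text]
    }
}
```

A real filter would also have to account for clip paths inside the page, which is exactly the part the parser classes do not track.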

images

Yep, via the same package classes. Image listeners aren't as well supported as text listeners, but do exist.

links

Yes. Links are "annotations" to various PDF pages. Finding them is a simple matter of looping through each page's "annotations array" and picking out the link annotations.

    PdfDictionary pageDict = myReader.getPageN(1);
    PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
    ArrayList<String> dests = new ArrayList<String>();
    if (annots != null) {
        for (int i = 0; i < annots.size(); ++i) {
            PdfDictionary annotDict = annots.getAsDict(i);
            PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
            if (subType != null && PdfName.LINK.equals(subType)) {
                PdfDictionary action = annotDict.getAsDict(PdfName.A);
                if (action != null && PdfName.URI.equals(action.getAsName(PdfName.S))) {
                    dests.add(action.getAsString(PdfName.URI).toString());
                } // else { it's an internal link, meh }
            }
        }
    }

You can find the PDF Spec here.

input elements

Definitely. For either XFA (LiveCycle Designer) or the older-tech "AcroForm" forms, iText can find all the fields, and their values.

    AcroFields fields = myReader.getAcroFields();
    Set<String> fieldNames = fields.getFields().keySet();
    for (String fldName : fieldNames) {
        System.out.println( fldName + ": " + fields.getField( fldName ) );
    }

Multi-select lists wouldn't be handled all that well. You'll get a blank space after the colon for empty text fields and for buttons. None too informative... but that'll get you started.

document meta tags like title, description or author

Pretty trivial. Yes.

    Map<String, String> info = myReader.getInfo();
    System.out.println( info );

In addition to the basic author/title/etc, there's a fairly involved XML schema you can access via reader.getMetadata().
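The XML in question is an XMP packet, which can be read with the JDK's own DOM parser. The following is a minimal sketch, assuming the bytes are well-formed XMP and that the title is stored as `dc:title` with an `rdf:li` child, as in the hand-written `SAMPLE` below. The class name, the `title` helper and the sample packet are made up for illustration; real documents may omit the title or carry several language variants.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class XmpTitleSketch {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    // Hand-written sample packet, not taken from a real PDF.
    static final String SAMPLE =
        "<x:xmpmeta xmlns:x=\"adobe:ns:meta/\">"
        + "<rdf:RDF xmlns:rdf=\"" + RDF + "\">"
        + "<rdf:Description rdf:about=\"\" xmlns:dc=\"" + DC + "\">"
        + "<dc:title><rdf:Alt><rdf:li xml:lang=\"x-default\">Sample Title</rdf:li></rdf:Alt></dc:title>"
        + "</rdf:Description>"
        + "</rdf:RDF>"
        + "</x:xmpmeta>";

    /** Returns the first dc:title value in an XMP packet, or null if there is none. */
    static String title(byte[] xmp) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the namespace lookups below
            Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmp));
            NodeList titles = doc.getElementsByTagNameNS(DC, "title");
            if (titles.getLength() == 0) {
                return null;
            }
            NodeList items = ((Element) titles.item(0)).getElementsByTagNameNS(RDF, "li");
            if (items.getLength() == 0) {
                return null;
            }
            return items.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a parsable XMP packet", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(title(SAMPLE.getBytes(StandardCharsets.UTF_8))); // prints Sample Title
    }
}
```

With iText, the byte array returned by reader.getMetadata() would take the place of the sample bytes.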

only headlines

A RenderFilter can ignore text based on whatever criteria you wish. Font size sounds about right based on your comment.
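The font-size criterion itself is easy to sketch in plain Java. The example below is a heuristic under assumptions, not part of iText: it treats the font size that covers the most characters as the body size and reports anything at least `ratio` times larger as a headline. The `Chunk` class, the 1.2 ratio and the sample data are hypothetical, and how the chunks and their sizes are obtained from the PDF is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HeadlineSketch {

    /** Hypothetical stand-in for a run of text and the font size it is drawn in. */
    static final class Chunk {
        final String text;
        final float fontSize;

        Chunk(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Font size covering the most characters, taken here as the body text size. */
    static float bodySize(List<Chunk> chunks) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Chunk c : chunks) {
            charsPerSize.merge(c.fontSize, c.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Chunks whose font size is at least ratio times the body size. */
    static List<String> headlines(List<Chunk> chunks, float ratio) {
        float threshold = bodySize(chunks) * ratio;
        List<String> result = new ArrayList<>();
        for (Chunk c : chunks) {
            if (c.fontSize >= threshold) {
                result.add(c.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("Introduction", 18f));
        chunks.add(new Chunk("PDF files store text as positioned glyphs.", 10f));
        chunks.add(new Chunk("Extracting Links", 14f));
        chunks.add(new Chunk("Links are annotations on a page.", 10f));
        System.out.println(headlines(chunks, 1.2f)); // prints [Introduction, Extracting Links]
    }
}
```

Font size alone will misfire on documents that set headlines in bold at body size, so the ratio is a starting point to tune rather than a rule.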
