java tika pdf: Extracting scanned PDF files with Apache Tika

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, so each page is a single image. My goal is to extract the text from these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):

    public String extractText(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        String text = handler.toString();
        return text;
    }

I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but it didn't change anything.

Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a doc file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

Solution

Tim Allison provided the solution:

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); // need to add this to make sure recursive parsing happens!

    parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)
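Since this approach hands the extracted page images to Tesseract, it can only produce text when the JVM process is able to launch the Tesseract binary. If the output stays empty, a quick sanity check along the following lines may help. This is only a sketch using the JDK: the class name TesseractCheck and the method isCommandAvailable are made up for illustration, and it assumes the binary is called "tesseract", is on the PATH, and accepts the --version flag.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {

    /**
     * Returns true if "<command> --version" can be started and exits with code 0.
     * Hypothetical helper, not part of Tika.
     */
    public static boolean isCommandAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true) // merge stderr into stdout
                    .start();
            // Drain the output so the child process cannot block on a full pipe.
            try (InputStream out = process.getInputStream()) {
                byte[] buffer = new byte[1024];
                while (out.read(buffer) != -1) {
                    // discard
                }
            }
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically means the binary was not found or is not executable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the binary is named "tesseract" and is on the PATH.
        System.out.println("tesseract available: " + isCommandAvailable("tesseract"));
    }
}
```

If this prints false when run under the same user and environment as the Tika application, the PATH seen by the JVM is a likely place to look.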

EDIT:

Here is the complete solution:

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * @since 8/26/16
     */
    public class Sample {

        public static void main(String[] args)
                throws IOException, TikaException, SAXException {
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

            TesseractOCRConfig config = new TesseractOCRConfig();
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);

            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);
            parseContext.set(PDFParserConfig.class, pdfConfig);
            // need to add this to make sure recursive parsing happens!
            parseContext.set(Parser.class, parser);

            Metadata metadata = new Metadata();
            // try-with-resources closes the file even if parsing fails
            try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
                parser.parse(stream, handler, metadata, parseContext);
            }

            System.out.println(metadata);
            String content = handler.toString();
            System.out.println("===============");
            System.out.println(content);
            System.out.println("Done");
        }
    }

Maven Dependencies:

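    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>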
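The answer does not say why levigo-jbig2-imageio is listed. A plausible reason is that scanned PDFs often store their page images as JBIG2, which the JDK cannot decode on its own, so the images never reach Tesseract unless an ImageIO plugin is present. A small check like the one below can show which image readers the JVM actually sees. It is a sketch using only javax.imageio; the class name ImageReaderCheck and the method hasReaderFor are invented for illustration, and the assumption that the plugin registers itself under the format name "jbig2" has not been verified here.

```java
import javax.imageio.ImageIO;

public class ImageReaderCheck {

    /** Returns true if ImageIO has at least one reader registered for the format name. */
    public static boolean hasReaderFor(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Pick up plugins that were added to the classpath (for example via Maven).
        ImageIO.scanForPlugins();
        // "jbig2" is an assumed format name for the JBIG2 plugin.
        for (String format : new String[] {"png", "jpeg", "jbig2"}) {
            System.out.println(format + " reader available: " + hasReaderFor(format));
        }
    }
}
```

Running this with the same classpath as the Tika application shows whether a JBIG2 reader was picked up.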
