java tika pdf: Extracting text from scanned PDF files with Apache Tika

淡庸

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, so each page is just an image. My goal is to extract the text of these PDF files anyway.

Tesseract is set up correctly, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (don't mind the missing exception handling):

    public String extractText(InputStream stream)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // Integer.MAX_VALUE raises the handler's character write limit
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();
        parser.parse(stream, handler, metadata, context);
        return handler.toString();
    }

I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but this didn't change a thing.

Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of a doc file, but not those of my PDF files.

It would be awesome if any of you could provide some help :)

Solution

Tim Allison provided the solution:

    Parser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    TesseractOCRConfig config = new TesseractOCRConfig();
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);
    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(PDFParserConfig.class, pdfConfig);
    parseContext.set(Parser.class, parser); // need to add this to make sure recursive parsing happens!
    parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)
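To illustrate why the parseContext.set(Parser.class, parser) line matters, here is a toy model of a type-keyed context lookup. It is only a sketch of the idea as I understand it: the handler for an embedded resource asks the context for a parser to delegate to, and if none was registered it falls back to one that extracts nothing. ContextLookupSketch, ToyParser, ToyContext, EMPTY and parseEmbedded are hypothetical names invented for this example; none of them are Tika classes, and Tika's real implementation may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Tika classes.
class ContextLookupSketch {

    /** Stand-in for a parser: turns a document name into extracted text. */
    interface ToyParser {
        String parse(String document);
    }

    /** Stand-in for a parse context that stores objects keyed by their type. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Fallback used when no parser was registered: extracts nothing. */
    static final ToyParser EMPTY = document -> "";

    /** Handles an embedded resource by delegating to the parser found in the context. */
    static String parseEmbedded(String embedded, ToyContext context) {
        return context.get(ToyParser.class, EMPTY).parse(embedded);
    }

    public static void main(String[] args) {
        ToyContext without = new ToyContext();
        ToyContext with = new ToyContext();
        with.set(ToyParser.class, document -> "text of " + document);

        // No parser registered: the embedded page image yields no text.
        System.out.println("[" + parseEmbedded("page-1.png", without) + "]");
        // Parser registered: the embedded page image is actually parsed.
        System.out.println("[" + parseEmbedded("page-1.png", with) + "]");
    }
}
```

In this model the first call prints [] and the second prints [text of page-1.png], which mirrors the symptom in the question: the outer PDF was parsed, but the page images inside it were silently skipped.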

EDIT:

Here is the complete solution:

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    import java.io.FileInputStream;
    import java.io.IOException;

    /**
     * @since 8/26/16
     */
    public class Sample {

        public static void main(String[] args)
                throws IOException, TikaException, SAXException {
            Parser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

            TesseractOCRConfig config = new TesseractOCRConfig();
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);

            ParseContext parseContext = new ParseContext();
            parseContext.set(TesseractOCRConfig.class, config);
            parseContext.set(PDFParserConfig.class, pdfConfig);
            // need to add this to make sure recursive parsing happens!
            parseContext.set(Parser.class, parser);

            Metadata metadata = new Metadata();
            // try-with-resources closes the file even if parsing fails
            try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
                parser.parse(stream, handler, metadata, parseContext);
            }

            System.out.println(metadata);
            String content = handler.toString();
            System.out.println("===============");
            System.out.println(content);
            System.out.println("Done");
        }
    }

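Maven Dependencies:

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.13</version>
    </dependency>
    <dependency>
        <groupId>com.levigo.jbig2</groupId>
        <artifactId>levigo-jbig2-imageio</artifactId>
        <version>1.6.5</version>
    </dependency>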
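
The question says Tesseract was already set up correctly. If that is in doubt on another machine, a quick check before running the sample is to look for the tesseract executable on the PATH. The sketch below uses only the JDK; TesseractOnPath and findExecutable are hypothetical names made up for this example, and it assumes the binary is called tesseract (or tesseract.exe on Windows).

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of Tika: looks for an executable in a PATH-style string.
class TesseractOnPath {

    /** Returns the first executable called name (or name.exe) in the given PATH value. */
    static Optional<Path> findExecutable(String name, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String candidate : new String[] {name, name + ".exe"}) {
                try {
                    Path path = Paths.get(dir, candidate);
                    if (Files.isRegularFile(path) && Files.isExecutable(path)) {
                        return Optional.of(path);
                    }
                } catch (InvalidPathException e) {
                    // Skip malformed PATH entries instead of failing the whole check.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> tesseract = findExecutable("tesseract", System.getenv("PATH"));
        if (tesseract.isPresent()) {
            System.out.println("Found tesseract at " + tesseract.get());
        } else {
            System.out.println("tesseract was not found on the PATH");
        }
    }
}
```

If the binary lives somewhere that is not on the PATH, TesseractOCRConfig also has a setTesseractPath method for pointing Tika at its directory.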
