Integrating PDFBox with Lucene

Latest recommended article published on 2021-02-27 00:32:31

caoxu1987728

Category: Lucene. Tags: lucene, url, string, path, document, file

Article link: https://blog.csdn.net/caoxu1987728/article/details/2349372

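PDFBox provides integration with Lucene: it offers a simple way to add PDF documents to a Lucene index. Consider the following code:
Document lucenedocument = LucenePDFDocument.getDocument(…);
Here LucenePDFDocument is a class provided by PDFBox. Its getDocument method has three overloads, which accept a File, an InputStream, or a URL respectively. Each one reads the PDF passed in, extracts its content, and builds a Lucene Document object. Once you have obtained a Lucene Document from a PDF through PDFBox, you can add it to the Lucene index directly with an IndexWriter. LucenePDFDocument automatically extracts the various metadata Fields from the PDF file and adds them to the Document.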
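
As a small illustration of the three argument types that getDocument accepts, the sketch below derives a File, an InputStream, and a URL for the same file using only the JDK. The class name PdfSources and its method names are hypothetical, and the PDFBox call itself is left out so that the snippet runs without PDFBox on the classpath.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

// Hypothetical helper: not part of PDFBox or Lucene.
final class PdfSources {

    // A File can be handed to the File overload as-is.
    static File asFile(String path) {
        return new File(path);
    }

    // Opens a stream on the file; the caller must close it.
    static InputStream asStream(File file) throws IOException {
        return new FileInputStream(file);
    }

    // Builds a file: URL that points at the same file.
    static URL asUrl(File file) throws IOException {
        return file.toURI().toURL();
    }

    public static void main(String[] args) throws IOException {
        File file = asFile(args.length > 0 ? args[0] : "C:/index.pdf");
        System.out.println("File: " + file.getPath());
        System.out.println("URL:  " + asUrl(file));
        if (file.isFile()) {
            try (InputStream in = asStream(file)) {
                System.out.println("First byte: " + in.read());
            }
        }
    }
}
```

Whichever of the three values is most convenient can then be passed to the matching getDocument overload.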

Next we use LucenePDFDocument to index a PDF directly. Create a PdfLuceneTest class in the ch7.pdf package with the following code:

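package ch7.pdf;

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfLuceneTest {
    public static void main(String[] args) {
        try {
            // The IndexWriter stores the index under d:/index
            IndexWriter writer = new IndexWriter("d:/index",
                    new StandardAnalyzer(), true);

            // LucenePDFDocument returns a Lucene Document built from the PDF
            Document d = LucenePDFDocument
                    .getDocument(new File("C:/index.pdf"));
            // Write the Document to the index
            writer.addDocument(d);
            // Close the index writer
            writer.close();

            // Open the index under d:/index with an IndexSearcher
            IndexSearcher searcher = new IndexSearcher("d:/index");

            // Look up the keyword "poi" in the contents Field
            Term t = new Term("contents", "poi");
            // Build a Query from the Term
            Query q = new TermQuery(t);
            // Run the search and get the result set
            Hits hits = searcher.search(q);

            // Print the results
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i));
            }
            searcher.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}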

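The main method uses LucenePDFDocument's getDocument method to obtain a Lucene Document directly from a PDF file. The Document contains Fields such as path, url, modified, contents, and summary, which are written straight into the index. The program then creates an IndexSearcher and searches the contents field for the keyword "poi". Note that the keyword must be lowercase: StandardAnalyzer lowercases tokens at index time, while a TermQuery is matched as-is without analysis.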
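
To make the lowercase requirement concrete, here is a toy model in plain Java. It is not Lucene code: the class ToyIndex and its methods are invented for illustration, and it only mimics the two behaviors that matter here, namely lowercasing tokens when text is indexed and looking up a term exactly as given when searching.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: NOT Lucene, just a model of analyzed indexing
// combined with exact (unanalyzed) term lookup.
final class ToyIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index time: split on non-alphanumerics and lowercase each token,
    // roughly what an analyzer with a lowercase filter does.
    void add(int docId, String text) {
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            String key = token.toLowerCase(Locale.ROOT);
            postings.computeIfAbsent(key, k -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Search time: the term is compared as-is, like a term query.
    Set<Integer> termQuery(String term) {
        Set<Integer> hits = postings.get(term);
        return hits != null ? hits : new TreeSet<Integer>();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(0, "Reading Office files with POI");
        System.out.println(index.termQuery("poi")); // prints [0]
        System.out.println(index.termQuery("POI")); // prints []
    }
}
```

The uppercase lookup finds nothing because only the lowercased token was ever stored, which is the same reason a search for "POI" would miss in the real index.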

But it still would not run. The following error kept appearing:

java.io.IOException: Error decrypting document, details: Error: The supplied password does not match either the owner or user password in the document.
at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:208)
at org.pdfbox.searchengine.lucene.LucenePDFDocument.addContent(LucenePDFDocument.java:427)
at org.pdfbox.searchengine.lucene.LucenePDFDocument.convertDocument(LucenePDFDocument.java:286)
at org.pdfbox.searchengine.lucene.LucenePDFDocument.getDocument(LucenePDFDocument.java:377)
at ch7.pdf.PdfLucene.main(PdfLucene.java:27)

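My first guess is that a jar is missing, the poi jar, though maybe that is not it. Sigh! I have been at this for a long time, and it is giving me a headache...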
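
One thing worth checking: the exception text itself reports a password mismatch while decrypting, which hints that C:/index.pdf may be encrypted rather than that a jar is missing. A rough way to test this with only the JDK is to look for an /Encrypt entry in the file, since an encrypted PDF references its encryption dictionary by that name. The class EncryptCheck below is a hypothetical helper and only a heuristic: it scans raw bytes instead of parsing the PDF, so it can report a false positive if those bytes happen to occur elsewhere in the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: a byte-level heuristic, not a PDF parser.
final class EncryptCheck {

    // Returns true if the raw bytes contain the name "/Encrypt".
    static boolean looksEncrypted(Path pdf) throws IOException {
        byte[] bytes = Files.readAllBytes(pdf);
        // ISO-8859-1 maps every byte to exactly one char, so nothing is lost.
        String raw = new String(bytes, StandardCharsets.ISO_8859_1);
        return raw.contains("/Encrypt");
    }

    public static void main(String[] args) throws IOException {
        Path pdf = Paths.get(args.length > 0 ? args[0] : "C:/index.pdf");
        if (looksEncrypted(pdf)) {
            System.out.println(pdf + " appears to be encrypted.");
        } else {
            System.out.println(pdf + " does not appear to be encrypted.");
        }
    }
}
```

If the file does turn out to be encrypted, trying the same program with an unencrypted PDF would show whether the indexing code itself is fine.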