Preprocessing doc, pdf, and html files before indexing with Lucene

Latest recommended article published on 2021-06-05 02:58:53

allenshi_szl


Category: Nutch & Lucene. Tags: lucene, html, HTML parser, file manager, document, file


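I am new to Lucene, so here are a few notes on what I have learned so far:

The Lucene API makes it easy to index plain-text documents, but non-text documents such as .doc and .pdf files must first be converted to plain text before they can be indexed.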

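Handling .pdf files

lib: PDFBox

PDFBox is an open-source Java library for parsing and processing PDF files, and it comes with a rich set of classes for working with them. For developers who use Lucene, PDFBox provides a dedicated LucenePDFDocument class whose static method getDocument (it has three overloads) returns a Lucene Document directly. To index a PDF file (pdfFile in this example, an instance of File), a single statement is enough: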

document = LucenePDFDocument.getDocument(pdfFile);

This getDocument form is convenient, so the handling of .doc and .html files below follows the same shape.
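Because every converter shares the getDocument(File) shape, the caller can choose one by file extension. The following is only a minimal sketch of that idea, not something from PDFBox or Lucene: the ConverterRegistry class and its methods are hypothetical names, and the result type is left generic so the sketch compiles without any library on the classpath. In a real indexer the type parameter would be Lucene's Document.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: maps a file extension to a converter with the getDocument(File) shape. */
final class ConverterRegistry<D> {
    private final Map<String, Function<File, D>> converters = new HashMap<>();

    /** Registers a converter for an extension such as "pdf" (no dot, case-insensitive). */
    void register(String extension, Function<File, D> converter) {
        converters.put(extension.toLowerCase(Locale.ROOT), converter);
    }

    /** Returns the lower-cased extension of a file name, or "" if there is none. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return (dot < 0 || dot == fileName.length() - 1)
                ? ""
                : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Converts the file with the matching converter, or returns null if none is registered. */
    D convert(File file) {
        Function<File, D> converter = converters.get(extensionOf(file.getName()));
        return converter == null ? null : converter.apply(file);
    }
}
```

With Lucene on the classpath, a method such as LuceneDOCDocument.getDocument could be registered under "doc". A converter that declares a checked exception would need a small wrapping lambda, since Function.apply cannot throw one.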

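Handling .doc files

lib: tm-extractors-0.4

This was originally part of Apache POI, namely HWPF. POI provides a set of methods for working with MS Word, Excel, and similar files; in recent releases HWPF was moved out, so the standalone tm-extractors-0.4 has to be downloaded. The code below implements a static method getDocument(File) that returns a Lucene Document. It mainly calls the extractText method of the WordExtractor class, which returns a String containing the content of the parsed .doc file.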

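import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.textmining.text.extraction.WordExtractor;

public class LuceneDOCDocument {

    public static Document getDocument(File doc) {
        String docPath = doc.getAbsolutePath();
        String title = doc.getName();
        String text = "";

        // Extract the plain text of the .doc file and always close the stream.
        InputStream inputStream = null;
        try {
            inputStream = new FileInputStream(doc);
            text = new WordExtractor().extractText(inputStream);
        } catch (Exception e) {
            // On failure keep empty contents so that title and path are still indexed.
            e.printStackTrace();
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if close fails.
                }
            }
        }
        Reader contents = new StringReader(text);

        Document document = new Document();
        // title: stored and tokenized; contents: tokenized from the reader, not stored;
        // path: stored only, not searchable.
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("contents", contents));
        document.add(new Field("path", docPath, Field.Store.YES, Field.Index.NO));
        return document;
    }
}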

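HTML files are text, but Lucene does not recognize the markup they contain, so the tags end up in the index as well. Users do not want searches to match these tags, so the tags must be stripped from an HTML file before it is indexed.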
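To make the problem concrete, here is a small sketch that tokenizes a snippet before and after tag removal. It is an illustration only: TagLeakDemo, tokens, and stripTags are hypothetical names, the tokenizer is a rough stand-in for a Lucene analyzer, and the regex-based stripTags is a deliberately naive substitute for a real HTML parser that only copes with simple, well-formed snippets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical demo of why markup must be stripped before indexing. Not a real HTML parser. */
final class TagLeakDemo {

    /** Splits text into lower-cased alphanumeric tokens, roughly as a simple analyzer would. */
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Naive tag removal for simple snippets only; real pages need a proper parser. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    public static void main(String[] args) {
        String html = "<html><body><p class=\"intro\">Lucene indexing</p></body></html>";
        // Raw HTML: tag and attribute names are mixed in with the content words.
        System.out.println(tokens(html));
        // After stripping: only the visible text remains.
        System.out.println(tokens(stripTags(html)));
    }
}
```

With the raw input, words such as "body" and "class" become searchable terms even though they never appear in the visible page text.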

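Handling HTML (stripping tags)

lib: htmlparser

The Lucene demo also ships with an HtmlParser, but that parser is fairly limited. In addition, and this may be a mistake in how I used it, when I inspected the index with Luke (the index browser) I found that the character stream returned by the demo HtmlParser's getReader method did not contain the full text of the HTML file, only the title.

In the example below I use the more capable htmlparser library, and again define a static method getDocument(File) that returns a Document.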

import java.io.File;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

public class LuceneHTMLDocument {

    public static Document getDocument(File html) {
        String htmlPath = html.getAbsolutePath();
        String title = "";
        StringBuilder text = new StringBuilder();

        try {
            Parser parser = new Parser(htmlPath);
            parser.setEncoding("UTF-8");

            // HtmlPage collects the page title and the nodes inside <body>.
            HtmlPage visitor = new HtmlPage(parser);
            parser.visitAllNodesWith(visitor);
            title = visitor.getTitle();

            // Concatenate the plain text of every body node, dropping the tags.
            NodeList nodes = visitor.getBody();
            for (int i = 0; i < nodes.size(); i++) {
                text.append(nodes.elementAt(i).toPlainTextString());
            }
        } catch (ParserException e) {
            // On failure keep whatever was extracted so the path is still indexed.
            e.printStackTrace();
        }
        Reader contents = new StringReader(text.toString());

        Document document = new Document();
        document.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        document.add(new Field("contents", contents));
        document.add(new Field("path", htmlPath, Field.Store.YES, Field.Index.NO));
        return document;
    }
}
