Full-Text Search with Lucene

zhanglingsi_521

Tags: lucene Google Apache HTML

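    When search comes up, most students first think of a fuzzy query against the database, such as a SQL LIKE match. Fuzzy queries have two problems: they are slow, and they cannot find static content that lives in HTML pages rather than in the database.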

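    So a tool is needed to provide full-text search over the website. The basic idea is that the tool scans every page of the site, indexes the content and saves the index, and users can then search it by keyword. If you have used Google Desktop Search or MSN desktop search, this process should already be familiar.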
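
    The index-then-search idea described above can be sketched without any library. The following is a minimal illustration in plain Java, not Lucene's actual implementation. The class name TinyInvertedIndex, its methods, and the sample paths are all made up for this example, and the tokenization (lowercasing and splitting on non-word characters) is a simplifying assumption that only handles ASCII words.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
class TinyInvertedIndex {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> path of the document
    private final List<String> paths = new ArrayList<>();

    // Indexing step: split the text into lowercase words and
    // record the document id under every word.
    int addDocument(String path, String text) {
        int docId = paths.size();
        paths.add(path);
        // Assumption: "\\W+" treats only ASCII letters, digits and '_' as word characters.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Search step: a single map lookup instead of a scan over every document.
    List<String> search(String keyword) {
        List<String> result = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                keyword.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int docId : ids) {
            result.add(paths.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument("/site/a.html", "Lucene builds an index");
        index.addDocument("/site/b.html", "Search the index by keyword");
        System.out.println(index.search("index"));   // prints [/site/a.html, /site/b.html]
        System.out.println(index.search("lucene"));  // prints [/site/a.html]
    }
}
```

    The point of this layout is that a lookup touches only the entry for the keyword, whereas a fuzzy database query has to examine every row.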

    A full-text search tool therefore needs at least two parts: index creation and search.

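    Lucene, an open-source tool from Apache, provides these features for free. It can recognize multiple file formats, supports lexical analysis, and more, which makes it one of the more commonly used full-text search tools.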
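
    To make "lexical analysis" concrete: before text is indexed, it is broken into normalized terms. The sketch below shows the general idea in plain Java. It is only an illustration under stated assumptions, not how Lucene's analyzers are implemented. The class TinyAnalyzer, the analyze method, and the five-word stop list are all invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; real analyzers are far more capable.
class TinyAnalyzer {
    // A tiny, made-up stop-word list for the example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // Turns raw text into the list of terms that would be indexed.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            // Skip empty fragments and words too common to be useful in search.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Index of a Web-Site is searchable."));
        // prints [index, web, site, searchable]
    }
}
```

    The same analysis has to be applied to the text at indexing time and to the keyword at search time, otherwise the terms will not match. This is why both programs in this article pass the same analyzer class.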

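    However, there are very few Chinese tutorials for Lucene. There is the book "Lucene in Action", but it covers an older version. The API changed considerably after Lucene 2.0, and good material for the newer API is still lacking.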

    Below are the two pieces of code I wrote today, one that creates an index over files on disk and one that searches it:

    1. Creating the index
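import java.io.File;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class CreateIndex {

    public static void main(String[] args) {
        String indexPath = "g:/indexout";   // directory where the index is stored
        try {
            // true = create a new index, overwriting any existing one
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), true);

            addFile(writer, new File("g:/index"));   // index everything under this directory

            int count = writer.docCount();          // number of documents in the index
            writer.optimize();                      // optimize the index
            writer.close();
            System.out.println("Indexing finished; " + count + " files were indexed.");
        } catch (Exception e) {
            e.printStackTrace();   // report the failure instead of silently swallowing it
        }
    }

    // Recursively walk the directory tree and add every file to the index
    public static void addFile(IndexWriter writer, File file) {
        if (file.isDirectory()) {
            File[] fs = file.listFiles();
            if (fs == null) {   // listFiles() returns null if the directory cannot be read
                return;
            }
            for (int i = 0; i < fs.length; i++) {
                addFile(writer, fs[i]);
            }
        } else {
            try {
                // File content: read from a Reader, so it is tokenized and indexed but not stored
                Field fContent = new Field("content", new FileReader(file));

                // File path: stored, and indexed as a single untokenized term
                Field fPath = new Field("path", file.getAbsolutePath(), Field.Store.YES, Field.Index.UN_TOKENIZED);
                Document doc = new Document();
                doc.add(fContent);
                doc.add(fPath);
                writer.addDocument(doc);
            } catch (Exception e) {
                e.printStackTrace();   // report the file that failed and continue with the rest
            }
        }
    }
}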

    2. Searching

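import java.util.Iterator;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchTest {
    public static void main(String[] args) {
        String key = "keyword";   // placeholder: the search keyword
        try {
            Directory dir = FSDirectory.getDirectory("g:/indexout");
            IndexSearcher searcher = new IndexSearcher(dir);
            Query query = new QueryParser("content", new StandardAnalyzer()).parse(key);    // build the query
            Hits hits = searcher.search(query);    // run the query
            Iterator it = hits.iterator();         // iterate over the results
            while (it.hasNext()) {
                Hit hit = (Hit) it.next();
                Document doc = hit.getDocument();
                System.out.println(doc.get("path"));     // print the path of each matching file
            }
            searcher.close();
        } catch (Exception e) {
            e.printStackTrace();   // report the failure instead of silently swallowing it
        }
    }
}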