A First Look at Lucene

lilp_ndsc

Article link: https://blog.csdn.net/lilp_ndsc/article/details/5496550

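Lucene is a high-performance full-text search library. Its API is designed to be fairly generic:
the input and output structures closely resemble a database's table ==> record ==> field hierarchy,
so the files, databases, and other data of many traditional applications can be mapped onto Lucene's storage structures and interfaces fairly easily.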

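Lucene's data source structure

In Lucene, a Document is the basic unit for building and searching an index. A Document can contain multiple Fields, and each Field has a name and a string value.

Search results are returned as Hits, which is made up of Documents. Hits is removed in Lucene 3.0 and replaced by TopDocs, used together with a collector such as TopDocCollector (itself later superseded by TopScoreDocCollector).

A Document is produced by converting a text file or some other data source.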

Index data source: doc(field1,field2...) doc(field1,field2...)
                  /  indexer /
                 _____________
                | Lucene Index|
                --------------
                 / searcher /
 Result output: Hits(doc(field1,field2) doc(field1...))

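This is quite similar to the structure of a database:

Index data source: record(field1,field2...) record(field1..)
              /  SQL: insert/
               _____________
              | DB  Index   |
               -------------
              / SQL: select /
Result output: results(record(field1,field2))

(The part above is reposted from another author.)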
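To make the doc(field...) => index => hits flow above more concrete, here is a minimal toy sketch in plain Java. It is not Lucene code: the class `ToyIndex` and its methods `add` and `search` are hypothetical names made up for this illustration, and the word splitting is far cruder than a real analyzer. It only shows the idea that documents are bags of named string fields, that the indexer builds a term-to-document lookup, and that the searcher returns the matching documents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: hypothetical class, NOT part of the Lucene API.
class ToyIndex {
    // Each "document" is a set of named string fields, like a database record.
    private final List<Map<String, String>> docs = new ArrayList<>();
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // "Indexer" step: store the document and index the words of every field.
    int add(Map<String, String> doc) {
        int id = docs.size();
        docs.add(doc);
        for (String value : doc.values()) {
            for (String term : value.toLowerCase().split("\\W+")) {
                if (term.isEmpty()) continue;
                List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
                // Record each document at most once per term.
                if (ids.isEmpty() || ids.get(ids.size() - 1) != id) ids.add(id);
            }
        }
        return id;
    }

    // "Searcher" step: look the term up and return the matching documents.
    List<Map<String, String>> search(String term) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(term.toLowerCase(), new ArrayList<>())) {
            hits.add(docs.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("name", "key.txt");
        doc.put("content", "Lucene stores documents made of fields");
        index.add(doc);
        System.out.println(index.search("lucene"));
    }
}
```

Real Lucene adds scoring, persistence, and configurable analysis on top of this idea, so treat the sketch purely as a mental model.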

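To build a tool with Lucene you need to import its jar files. For basic use, the Lucene core jar and the highlighter jar are enough. You also need to prepare the source data files to be searched.

Lucene API: IndexWriter is responsible for creating and maintaining the index.

IndexWriter can add, delete, and update the Documents in an index, using methods such as addDocument, deleteDocuments, and updateDocument.

To build an index, files must first be converted into Documents, which are then added through an IndexWriter.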

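IndexWriter constructor: IndexWriter(String path,
                   Analyzer a,   // the analyzer (tokenizer), e.g. Analyzer analyzer=new StandardAnalyzer();
                   boolean create,   // whether to recreate the index; if true, the existing index is deleted and a new one is created
                   IndexWriter.MaxFieldLength mfl)   // maximum number of terms indexed per field, i.e. how many of the analyzer's terms are kept

IndexWriter indexWriter=new IndexWriter(path,analyzer,true,IndexWriter.MaxFieldLength.LIMITED);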

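How to convert a File into a Document

File file=new File(path);
Document doc=new Document();
doc.add(new Field("name",file.getName(),Store.YES,Index.ANALYZED));
doc.add(new Field("content",getFileContent(file),Store.YES,Index.ANALYZED));
doc.add(new Field("size",String.valueOf(file.length()),Store.YES,Index.NOT_ANALYZED));
doc.add(new Field("path",file.getAbsolutePath(),Store.YES,Index.NO));
return doc;
// Store is a nested class of Field and says whether the field's value is stored in the index.
// Storing and indexing are independent: for example, a web page's URL needs to be stored but does not need to be indexed.
// Store.COMPRESS stores the value in compressed form.
// Index says whether the field is indexed. Indexing comes in two flavors:
//   Index.ANALYZED      tokenize the value first, then index the resulting terms
//   Index.NOT_ANALYZED  index the whole value as one single term
//   Index.NO            do not index the field at all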
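The difference between analyzed and not-analyzed indexing is easy to see with a small plain-Java sketch. This is not Lucene's analyzer: `IndexModeSketch`, `analyzedTerms`, and `notAnalyzedTerms` are hypothetical names, and the "analysis" here is just lower-casing and splitting on non-word characters, which only roughly approximates what a real analyzer such as StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy illustration only: hypothetical helper, not Lucene's Analyzer or Field API.
class IndexModeSketch {
    // "Analyzed": split the value into lower-cased words; each word becomes its own term.
    static List<String> analyzedTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String t : value.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // "Not analyzed": the whole value is kept unchanged as one single term.
    static List<String> notAnalyzedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String value = "Lucene Index Demo";
        System.out.println(analyzedTerms(value));
        System.out.println(notAnalyzedTerms(value));
    }
}
```

With the analyzed form, a search for a single word can match the field. With the not-analyzed form, only a search for the exact full value matches, which is why it suits identifiers and values like the file size above.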

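Reading the file content

BufferedReader br=new BufferedReader(new InputStreamReader(new FileInputStream(file)));
StringBuffer content=new StringBuffer();
for(String con=null;(con=br.readLine())!=null;){
content.append(con).append("\n");   // "\n" is the newline escape (the original "/n" appended a literal slash and n)
}
br.close();   // release the file handle
return content.toString();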
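As a possible alternative on Java 7 or later, the same line-by-line read can be written with try-with-resources so the reader is closed even if an exception is thrown. This is only a sketch: the class name `FileContentSketch`, the method name `readContent`, and the choice of UTF-8 are assumptions, not something the original code specifies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: hypothetical helper; the UTF-8 encoding is an assumption.
class FileContentSketch {
    // Reads the whole file line by line, ending every line with "\n".
    static String readContent(Path file) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader automatically, even on errors.
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (String line; (line = reader.readLine()) != null; ) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file, read it back, then clean up.
        Path tmp = Files.createTempFile("lucene-demo", ".txt");
        Files.write(tmp, "first line\nsecond line".getBytes(StandardCharsets.UTF_8));
        System.out.print(readContent(tmp));
        Files.delete(tmp);
    }
}
```

Stating the charset explicitly avoids depending on the platform default encoding, which matters when the indexed files contain non-ASCII text.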

private String filePath="D://project_dev//LuceneDemo//luceneDatasource//key.txt";
private String indexPath="D://project_dev//LuceneDemo//luceneIndex";
private Analyzer analyzer=new StandardAnalyzer();
@Test
public void createIndex()throws Exception{
  // Convert the source file into a Document
  Document doc=File2DocumentUtils.file2Document(filePath);

  // Creating an index requires the index directory to be specified
  IndexWriter iw=new IndexWriter(FSDirectory.open(new File(indexPath)),analyzer,true,MaxFieldLength.LIMITED);
  iw.addDocument(doc);
  iw.close();    // remember to close the index writer
}

Running the method above creates the index in the specified index directory.

// Searching

@Test
public void search()throws Exception{
  String queryString="key";
  QueryParser queryParser=new MultiFieldQueryParser(new String[]{"name","content"},analyzer);   // query parser that searches several fields
  Query query=queryParser.parse(queryString); // the query to run
  IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexPath)), true); // open the index directory read-only
  IndexSearcher is=new IndexSearcher(reader);   // create the searcher
  TopDocs topdocs=is.search(query, null, 10000);  // the middle argument is a Filter applied to the results (null means no filter); returns TopDocs
  System.out.println("Found "+topdocs.totalHits+" matching record(s)");
  for (ScoreDoc sdoc : topdocs.scoreDocs) {
   int docSn=sdoc.doc;    // the document's internal number
   Document doc=is.doc(docSn);   // fetch the document by its number
   System.out.println(doc.get("name"));   // print the stored field values
   System.out.println(doc.get("content"));
   System.out.println(doc.get("size"));
   System.out.println(doc.get("path"));
  }
  System.out.println(topdocs.totalHits);
}

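Directory represents the location of the index. It has two implementations: FSDirectory (index on the file system) and RAMDirectory (index in memory).

In the following example, the index is loaded into memory at startup and written back to disk when the application shuts down.

@Test
public void test2()throws Exception{
  Directory fsDir=FSDirectory.open(new File(indexPath));
  Directory ramDir=new RAMDirectory(fsDir);   // at startup, copy the on-disk index into memory
  // While running, work against ramDir
  IndexWriter ramIndexWriter=new IndexWriter(ramDir,analyzer,MaxFieldLength.LIMITED);
  // Add a document
  Document doc=File2DocumentUtils.file2Document(filePath);
  ramIndexWriter.addDocument(doc);
  ramIndexWriter.close();   // flush the changes into the in-memory directory
  // On exit, save to disk (create=true replaces the old on-disk index)
  IndexWriter fsIndexWriter=new IndexWriter(fsDir,analyzer,true,MaxFieldLength.LIMITED);
  fsIndexWriter.addIndexesNoOptimize(new Directory[]{ramDir});    // merge in the RAM index without optimizing
  fsIndexWriter.close();
}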

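The error org.apache.lucene.index.CorruptIndexException: Unknown format version: -9, its cause and the fix

That exception means your index was written with a newer version of
Lucene than the version you are using to open the IndexReader.
The fix is to open the index with the same (or a newer) Lucene version as the one that wrote it, or to delete the index and rebuild it with the version you are actually running.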

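As more and more documents are indexed, Lucene creates a large number of small index files, which increases the I/O needed at search time and slows searching down. These index files can therefore be optimized by merging them.

@Test
public void test3()throws Exception{
  Directory fsDir=FSDirectory.open(new File(indexPath));
  IndexWriter fsIndexWriter=new IndexWriter(fsDir,analyzer,MaxFieldLength.LIMITED);
  fsIndexWriter.commit();
  // After committing, optimize the files: merge whatever should be merged
  fsIndexWriter.optimize();   // the optimize operation
  fsIndexWriter.close();
}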
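Why merging helps can be sketched with a toy model in plain Java. This is not how Lucene implements optimize(): `SegmentMergeSketch` and `merge` are hypothetical names, each "segment" is just a term-to-document-ids map, and the sketch assumes document ids are already unique across segments. The point is only that after merging, a term lookup consults one structure instead of one per small segment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: hypothetical class, NOT Lucene's segment merging.
class SegmentMergeSketch {
    // Combines several small term -> doc-id maps into one; inputs are left unchanged.
    static Map<String, TreeSet<Integer>> merge(List<Map<String, TreeSet<Integer>>> segments) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> segment : segments) {
            for (Map.Entry<String, TreeSet<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>()).addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> first = new TreeMap<>();
        first.put("lucene", new TreeSet<Integer>(Arrays.asList(0, 1)));
        Map<String, TreeSet<Integer>> second = new TreeMap<>();
        second.put("lucene", new TreeSet<Integer>(Arrays.asList(2)));
        second.put("index", new TreeSet<Integer>(Arrays.asList(2)));

        List<Map<String, TreeSet<Integer>>> segments = new ArrayList<>();
        segments.add(first);
        segments.add(second);
        // One merged lookup now replaces a lookup in every small segment.
        System.out.println(merge(segments));
    }
}
```

Before the merge, finding every document for a term means visiting each segment in turn. Afterwards a single lookup is enough, which is the I/O saving the paragraph above describes.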
