Building an Index with Lucene

Indexing process:

  • Extract text and create documents
  • Analyze the documents
  • Add the documents to the index (an inverted index)
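
To make the three steps concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy model, not Lucene's actual data structures: the class name ToyInvertedIndex and its methods are made up for illustration, and "analysis" is assumed to be nothing more than splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: shows the idea of an inverted index, not how Lucene stores one.
class ToyInvertedIndex {
    // term -> ids of the documents that contain it, in insertion order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // "Analyze" step: split the text on whitespace (assumed, simplest possible analyzer).
    static String[] analyze(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // "Add to index" step: assign a document id and record it under every token.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : analyze(text)) {
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Do not list the same document twice when a token repeats.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Look up the documents that contain the exact term.
    List<Integer> search(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("Amsterdam has lots of canals");
        index.addDocument("Venice has lots of canals");
        System.out.println(index.search("canals")); // [0, 1]
        System.out.println(index.search("Venice")); // [1]
    }
}
```

The lookup goes from term to documents rather than from document to terms, which is what "inverted" refers to.
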
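Basic index operations

  • Adding documents to the index
addDocument(Document) -- adds the document using the default analyzer

addDocument(Document, Analyzer) -- adds the document using the specified analyzer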

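  • Deleting documents from the index
---deleteDocuments(Term)

---deleteDocuments(Term[])

---deleteDocuments(Query)

---deleteDocuments(Query[])

Difference between maxDoc() and numDocs(): maxDoc() returns the total number of documents in the index, counting both deleted and non-deleted ones, while numDocs() returns only the number of non-deleted documents. The two values differ only while deleted documents are still marked but not yet merged away.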
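
The relationship between the two counters can be sketched with a toy segment that only marks documents as deleted. This is an illustration under simplifying assumptions, not Lucene code: ToySegment and its methods are hypothetical names, and optimize() here simply drops the marked slots.

```java
import java.util.BitSet;

// Toy model only: one "segment" that tracks deletions as marks.
class ToySegment {
    private int maxDoc = 0;                // slots ever allocated, deleted or not
    private BitSet deleted = new BitSet(); // one "deleted" mark per slot

    // Allocate the next slot and return its id.
    int addDocument() {
        return maxDoc++;
    }

    // Deleting only sets a mark; the slot still counts toward maxDoc().
    void deleteDocument(int docId) {
        if (docId >= 0 && docId < maxDoc) {
            deleted.set(docId);
        }
    }

    int maxDoc() {
        return maxDoc;
    }

    int numDocs() {
        return maxDoc - deleted.cardinality();
    }

    boolean hasDeletions() {
        return !deleted.isEmpty();
    }

    // Compaction: physically drop the marked slots, so both counters agree again.
    void optimize() {
        maxDoc = numDocs();
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        ToySegment segment = new ToySegment();
        segment.addDocument();
        segment.addDocument();
        segment.deleteDocument(0);
        System.out.println(segment.maxDoc() + " " + segment.numDocs()); // 2 1
        segment.optimize();
        System.out.println(segment.maxDoc() + " " + segment.numDocs()); // 1 1
    }
}
```

This is the same before/after pattern that the testDeleteBeforeOptimize and testDeleteAfterOptimize tests in the listing further down check against a real IndexWriter.
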

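  • Updating documents in the index
Steps: the old document is deleted first, then the new document is added. Because the whole document is replaced, the new document must contain every field of the old document that should be kept.

---updateDocument(Term, Document)

---updateDocument(Term, Document, Analyzer)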
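
Why the new document has to carry all the old fields is easiest to see in a toy store where update is literally delete-then-add. The sketch below is an assumption-laden illustration, not Lucene's implementation: ToyDocumentStore is a hypothetical class, and documents are plain maps from field name to value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: documents are field->value maps kept in a list.
class ToyDocumentStore {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));
    }

    // Remove every document whose field has exactly this value; return how many were removed.
    int deleteDocuments(String field, String value) {
        int before = docs.size();
        docs.removeIf(d -> value.equals(d.get(field)));
        return before - docs.size();
    }

    // Update = delete the matching documents, then add the replacement as a whole.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    // Return the first document whose field has this value, or null if there is none.
    Map<String, String> findFirst(String field, String value) {
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                return d;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ToyDocumentStore store = new ToyDocumentStore();
        Map<String, String> original = new HashMap<>();
        original.put("id", "1");
        original.put("city", "Amsterdam");
        original.put("country", "Netherlands");
        store.addDocument(original);

        // The replacement leaves out "country", so that field is gone after the update.
        Map<String, String> replacement = new HashMap<>();
        replacement.put("id", "1");
        replacement.put("city", "DenHaag");
        store.updateDocument("id", "1", replacement);

        System.out.println(store.findFirst("id", "1").get("city"));    // DenHaag
        System.out.println(store.findFirst("id", "1").get("country")); // null
    }
}
```

Nothing from the old document is merged into the new one, so any field that is not re-added is lost.
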

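  • Field options
  • Field index options (Field.Index.*)
Index.ANALYZED: runs the field value through the analyzer to break it into a stream of tokens, each of which can be searched on its own. Suitable for ordinary text fields.

Index.NOT_ANALYZED: indexes the field but does not analyze the String value, so the whole value is indexed as a single token. Suitable for values that must not be split, such as URLs, file paths, dates, and personal names, and for "exact match" searches.

Index.ANALYZED_NO_NORMS: a variant of Index.ANALYZED that does not store norms in the index. Norms record the index-time boost information.

Index.NOT_ANALYZED_NO_NORMS: a variant of Index.NOT_ANALYZED that does not store norms.

Index.NO: the field is not indexed, so it cannot be searched.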
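
The practical difference between ANALYZED and NOT_ANALYZED is which terms end up in the index. The following sketch mimics that with plain Java sets. It does not call Lucene: ToyFieldIndexing and both method names are hypothetical, and whitespace splitting stands in for a real analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Toy model only: which terms would be searchable under each option.
class ToyFieldIndexing {
    // ANALYZED-like: the value is split into tokens (whitespace splitting assumed).
    static Set<String> analyzedTerms(String value) {
        String trimmed = value.trim();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // NOT_ANALYZED-like: the whole value is kept as one single term.
    static Set<String> notAnalyzedTerms(String value) {
        Set<String> terms = new HashSet<>();
        terms.add(value);
        return terms;
    }

    public static void main(String[] args) {
        String path = "/docs/lucene in action.pdf";
        // An exact-match lookup on the full path only works when the value was not split.
        System.out.println(analyzedTerms(path).contains(path));    // false
        System.out.println(notAnalyzedTerms(path).contains(path)); // true
    }
}
```

This is why identifiers such as the "id" field in the test listing below are indexed with NOT_ANALYZED, while free text such as "contents" uses ANALYZED.
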

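  • Field store options (Field.Store.*)
Store.YES: stores the field value so that it can be displayed in search results.

Store.NO: does not store the field value.

CompressionTools: provides static methods for compressing and decompressing byte arrays.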

  • Field term vector options

 

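Other ways to construct a Field object

     Field(String name, Reader value, TermVector termVector): uses a Reader to supply the field value

Field sorting options

Boosting documents and fields

  Document boost: Document.setBoost(float)

  Field boost: Field.setBoost(float)

Indexing numbers: new NumericField(name).set<Type>Value(value), for example setIntValue or setDoubleValue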

Near-real-time search

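Index optimization:

IndexWriter optimize methods

    optimize(): merges the index down to a single segment and returns once the operation has completed

    optimize(int maxNumSegments): partial optimization; merges the index down to at most maxNumSegments segments

    optimize(boolean doWait): true blocks until the merge has finished; false returns immediately and lets background threads run the merges

    optimize(int maxNumSegments, boolean doWait): combines the two options above

    Optimizing consumes a large amount of CPU and I/O resources.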

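Directory subclasses

SimpleFSDirectory

NIOFSDirectory: uses java.nio.* for file access

MMapDirectory: uses memory-mapped I/O for file access. A good choice on a 64-bit JRE.

RAMDirectory: keeps the index in memory

FileSwitchDirectory: uses two directories and switches between them based on the file extension.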

package com.bit.section2;

import static org.junit.Assert.*;

import java.io.IOException;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Before;
import org.junit.Test;

import com.bit.util.TestUtil;

public class IndexingTest {
	
	protected String[] ids = {"1","2"};
	protected String[] unindexed = {"Netherlands","Italy"};
	protected String[] unsorted = {"Amsterdam has lots of canals","Venice has lots of canals"};
	protected String[] text = {"Amsterdam","Venice"};

	private Directory directory;
	
	@Before
	public void setUp() throws Exception {
		directory = new RAMDirectory();
		
		IndexWriter writer = getWriter();
		
		for(int i=0;i<ids.length;i++){
			Document doc = new Document();
			doc.add(new Field("id", ids[i],Field.Store.YES,Field.Index.NOT_ANALYZED));
			doc.add(new Field("country",unindexed[i],Field.Store.YES,Field.Index.NO));
			doc.add(new Field("contents",unsorted[i],Field.Store.NO,Field.Index.ANALYZED));
			doc.add(new Field("city",text[i],Field.Store.YES,Field.Index.ANALYZED));
			writer.addDocument(doc);
		}
		writer.close();
	}
	
	/**
	 * Creates the IndexWriter used by the tests.
	 * @return a writer on the shared directory, using WhitespaceAnalyzer
	 * @throws IOException 
	 * @throws LockObtainFailedException 
	 * @throws CorruptIndexException 
	 */
	private IndexWriter getWriter() throws CorruptIndexException, LockObtainFailedException, IOException {
		return new IndexWriter(directory, new IndexWriterConfig(Version.LUCENE_36,new WhitespaceAnalyzer()));
	}
	
	
	protected int getHitCount(String fieldName,String searchString) throws IOException, Exception {
		IndexSearcher searcher = new IndexSearcher(directory);
		Term term = new Term(fieldName, searchString);
		Query query = new TermQuery(term);
		int hitCount = TestUtil.hitCount(searcher,query);
		searcher.close();
		return hitCount;
	}

	@Test
	public void testIndexWriter() throws CorruptIndexException, LockObtainFailedException, IOException{
		IndexWriter writer = getWriter();
		assertEquals(ids.length, writer.numDocs());
		writer.close();
	}
	
	@Test
	public void testIndexReader() throws CorruptIndexException, IOException{
		IndexReader reader = IndexReader.open(directory);
		assertEquals(ids.length, reader.maxDoc());
		assertEquals(ids.length, reader.numDocs());
		reader.close();
	}
	

	/**
	 * Tests deleting a document (before optimizing).
	 */
	@Test
	public void testDeleteBeforeOptimize() throws IOException{
		IndexWriter writer = getWriter();
		assertEquals(2, writer.numDocs());
		
		// Verify that the document has been marked as deleted
		writer.deleteDocuments(new Term("id", "1"));
		writer.commit();
		assertTrue(writer.hasDeletions());
		
		// Verify that one document is deleted and one remains
		assertEquals(2, writer.maxDoc());
		assertEquals(1, writer.numDocs());
		writer.close();
	}
	
	/**
	 * Tests deleting a document and then optimizing.
	 * @throws IOException
	 */
	@Test
	public void testDeleteAfterOptimize() throws IOException{
		IndexWriter writer = getWriter();
		assertEquals(2, writer.numDocs());
		
		writer.deleteDocuments(new Term("id", "1"));
		writer.optimize();
		writer.commit();
		
		assertFalse(writer.hasDeletions());
		assertEquals(1, writer.maxDoc());
		assertEquals(1, writer.numDocs());
		writer.close();
	}
	
	/**
	 * Tests updating a document.
	 * @throws Exception 
	 */
	@Test
	public void testUpdate() throws Exception{
		assertEquals(1, getHitCount("city", "Amsterdam"));
		
		IndexWriter writer = getWriter();
		
		Document doc = new Document();
		doc.add(new Field("id", "1",Field.Store.YES,Field.Index.NOT_ANALYZED));
		doc.add(new Field("country","Netherlands",Field.Store.YES,Field.Index.NO));
		doc.add(new Field("contents","Den Haag has lots of museums",Field.Store.NO,Field.Index.ANALYZED));
		doc.add(new Field("city","DenHaag",Field.Store.YES,Field.Index.ANALYZED));
		
		writer.updateDocument(new Term("id","1"),doc );
		
		writer.commit();
		// Close the writer to release the index write lock
		writer.close();
		
		assertEquals(0, getHitCount("city", "Amsterdam"));
		assertEquals(1, getHitCount("city", "DenHaag"));
	}
	
}



