Indexing and Searching Data with Apache Lucene
Contents
1. Maven dependencies for the Lucene development environment
2. Lucene package structure overview
3. IndexOptions in Lucene
4. Creating an index
5. Updating an index
6. Deleting from an index
7. Term (exact) query
8. Fuzzy query
9. Combined (boolean) queries
Lucene
Lucene is a mature, free, open-source full-text indexing and search library written in Java.
Full-text search
Full-text search means that an indexing program scans every word in a document and builds an index entry for each one, recording how many times and where it occurs. When a user issues a query, the search program looks up this pre-built index and returns the matching results to the user.
Lucene indexes have a layered structure, with the following main levels:
Index:
A Lucene index lives in a single directory; all the files in that directory together make up one index.
Segment:
An index can contain multiple segments. Segments are independent of one another: adding new documents can create new segments, and existing segments can be merged.
Files that share the same name prefix belong to the same segment (for example, segments "_0" and "_1"). The segments.gen and segments_N files are the segments' metadata, i.e. they store the segments' properties.
Document:
A document is the basic unit of indexing. Different documents may be stored in different segments, and one segment can hold many documents.
A newly added document is first written to a freshly created segment; as segments are merged, documents from different segments end up in the same segment.
Field:
A document contains different kinds of information that can be indexed separately, such as title, date, body, and author, each kept in its own field.
Different fields can be indexed in different ways; we will look at this in more detail when examining how fields are stored.
Term:
A term is the smallest unit of indexing: a string produced by tokenization and linguistic processing (see the sketch below).
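A minimal sketch of what a term is in practice: it prints the terms an analyzer produces for a piece of text, and those terms are exactly what ends up in the inverted index. It assumes the SmartCN analyzer from the Maven dependencies listed below; any Analyzer works the same way, and the field name and sample text are only illustrative.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.junit.jupiter.api.Test;
@Test
public void printTerms() throws Exception
{
    Analyzer analyzer = new SmartChineseAnalyzer();
    try (TokenStream ts = analyzer.tokenStream("content", "Lucene是全文检索工具包"))
    {
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // must be called before the first incrementToken()
        while (ts.incrementToken())
        {
            System.out.println(termAtt.toString()); // each printed string is one term
        }
        ts.end();
    }
}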
Maven dependencies for the Lucene development environment
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-core -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        <version>8.11.1</version>
    </dependency>
    <!-- General-purpose analyzers, suitable for English tokenization -->
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-common -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        <version>8.11.1</version>
    </dependency>
    <!-- Chinese analyzer (SmartCN) -->
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-analyzers-smartcn -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-smartcn</artifactId>
        <version>8.11.1</version>
    </dependency>
    <!-- Query parsing for analyzed indexes -->
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-queryparser -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>8.11.1</version>
    </dependency>
    <!-- Highlighting of matched keywords in search results -->
    <!-- https://mvnrepository.com/artifact/org.apache.lucene/lucene-highlighter -->
    <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-highlighter</artifactId>
        <version>8.11.1</version>
    </dependency>
    <!-- IK Analyzer Chinese tokenizer packaged for Lucene -->
    <!-- https://mvnrepository.com/artifact/com.jianggujin/IKAnalyzer-lucene -->
    <dependency>
        <groupId>com.jianggujin</groupId>
        <artifactId>IKAnalyzer-lucene</artifactId>
        <version>8.0.0</version>
    </dependency>
</dependencies>
Note: the IKAnalyzer artifact version must be compatible with the Lucene version in use.
Lucene package structure overview
Package | Purpose |
---|---|
org.apache.lucene.analysis | Analyzers, mainly for tokenization; Chinese support is usually added by extending this package |
org.apache.lucene.document | Document structure for stored/indexed data, similar to a table schema in a relational database |
org.apache.lucene.index | Index management, including creating and deleting index entries |
org.apache.lucene.queryparser | Query parser, combining query keywords with operators such as AND, OR, and NOT |
org.apache.lucene.search | Search management: retrieves results according to the query conditions |
org.apache.lucene.store | Storage management, mainly low-level I/O |
org.apache.lucene.util | Shared utility classes |
Creating an index
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
import java.io.FileReader;
@Test
public void buildIndex() throws Exception
{
    String indexDir = "index directory path"; // location of the Lucene index directory
    File luceneIndexDirectory = new File(indexDir);
    /**
     * Directory has several implementations:
     * FSDirectory works against the file system directly (recommended on SSDs)
     * MMapDirectory memory-maps the index (recommended when plenty of RAM is available)
     * */
    FSDirectory directory = new MMapDirectory(luceneIndexDirectory.toPath()); // open the index directory
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
    indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
    indexWriterConfig.setMaxBufferedDocs(1000); // buffer documents in memory and flush a new segment every n documents
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
    writer.forceMerge(100); // merge the index down to at most 100 segments; a larger value means less merging work while indexing but slower searches
    /**
     * Define the fields
     * */
    /**
     * id: stored, not tokenized, indexed
     * */
    FieldType idFieldType = new FieldType();
    idFieldType.setStored(true);
    idFieldType.setTokenized(false);
    idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    /**
     * title: stored, tokenized, indexed
     **/
    FieldType titleFieldType = new FieldType();
    titleFieldType.setStored(true);
    titleFieldType.setTokenized(true);
    titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    /**
     * content: stored, tokenized, indexed
     * */
    FieldType contentFieldType = new FieldType();
    contentFieldType.setStored(true);
    contentFieldType.setTokenized(true);
    contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
    Document document = new Document();
    document.add(new Field("id", "1", idFieldType));
    document.add(new Field("title", "一往无前", titleFieldType));
    // read the whole file into a string
    StringBuilder contentBuilder = new StringBuilder();
    try (FileReader fileReader = new FileReader("D:\\SourceCodes\\Temp\\example\\一往无前.txt"))
    {
        char[] chbuf = new char[1024 * 8];
        int len;
        while ((len = fileReader.read(chbuf)) != -1)
        {
            contentBuilder.append(chbuf, 0, len);
        }
    }
    document.add(new Field("content", contentBuilder.toString(), contentFieldType));
    writer.addDocument(document);
    writer.flush();
    writer.commit();
    writer.close();
}
Directory has several implementations (see the sketch below):
- FSDirectory works against the file system directly (recommended on SSDs)
- MMapDirectory memory-maps the index (recommended when plenty of RAM is available)
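A brief sketch of both options (the path is a placeholder): FSDirectory.open asks Lucene to pick a suitable implementation for the current platform, while constructing MMapDirectory directly forces memory mapping.
import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
public class DirectoryChoice
{
    public static void main(String[] args) throws Exception
    {
        // let Lucene pick the best FSDirectory implementation for the current platform
        Directory auto = FSDirectory.open(Paths.get("index directory path"));
        // or force memory mapping explicitly
        Directory mmap = new MMapDirectory(Paths.get("index directory path"));
        auto.close();
        mmap.close();
    }
}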
IndexOptions in Lucene (see the sketch below the table):
Option | Meaning |
---|---|
NONE | do not index the field |
DOCS | index documents only |
DOCS_AND_FREQS | index documents and term frequencies |
DOCS_AND_FREQS_AND_POSITIONS | index documents, term frequencies, and term positions |
DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS | index documents, term frequencies, term positions, and character offsets |
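The chosen index options determine which queries a field can answer: for example, a field indexed with DOCS only records no positions, so phrase queries against it will fail at search time. A brief sketch with two assumed field types:
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.junit.jupiter.api.Test;
@Test
public void indexOptionsExample()
{
    // enough for TermQuery, but phrase queries need positions and will not run against such a field
    FieldType docsOnly = new FieldType();
    docsOnly.setTokenized(true);
    docsOnly.setIndexOptions(IndexOptions.DOCS);
    docsOnly.freeze();
    // positions are recorded, so PhraseQuery works against fields of this type
    FieldType withPositions = new FieldType();
    withPositions.setTokenized(true);
    withPositions.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    withPositions.freeze();
}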
Updating an index
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
import java.io.FileReader;
@Test
public void updateIndex()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter indexWriter = new IndexWriter(fsd, indexWriterConfig);
        /**
         * Define the fields
         * */
        // id: stored, not tokenized, indexed
        FieldType idFieldType = new FieldType();
        idFieldType.setStored(true);
        idFieldType.setTokenized(false);
        idFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        // title: stored, tokenized, indexed
        FieldType titleFieldType = new FieldType();
        titleFieldType.setStored(true);
        titleFieldType.setTokenized(true);
        titleFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        // content: stored, tokenized, indexed
        FieldType contentFieldType = new FieldType();
        contentFieldType.setStored(true);
        contentFieldType.setTokenized(true);
        contentFieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        Document document = new Document();
        document.add(new Field("id", "1", idFieldType));
        document.add(new Field("title", "一往无前", titleFieldType));
        // read the whole file into a string
        StringBuilder contentBuilder = new StringBuilder();
        try (FileReader fileReader = new FileReader("D:\\SourceCodes\\Temp\\example\\一往无前.txt"))
        {
            char[] chbuf = new char[1024 * 8];
            int len;
            while ((len = fileReader.read(chbuf)) != -1)
            {
                contentBuilder.append(chbuf, 0, len);
            }
        }
        document.add(new Field("content", contentBuilder.toString(), contentFieldType));
        // updateDocument deletes any document matching the "id" term and then adds the new document
        indexWriter.updateDocument(new Term("id", "1"), document);
        indexWriter.flush();
        indexWriter.commit();
        indexWriter.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Deleting from an index
Example:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
@Test
public void deleteIndex()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    try (Directory directory = new MMapDirectory(luceneIndexDirectory.toPath()))
    {
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new IKAnalyzer());
        indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        indexWriterConfig.setMaxBufferedDocs(1000);
        IndexWriter writer = new IndexWriter(directory, indexWriterConfig);
        writer.forceMerge(100);
        // delete every document whose "id" field contains the term "1"
        writer.deleteDocuments(new Term("id", "1"));
        writer.flush();
        writer.commit();
        writer.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
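Besides deleting by a single Term as above, IndexWriter can also delete every document matching an arbitrary query, or clear the index entirely. A brief sketch, assuming the same index layout as in the examples above:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
@Test
public void deleteByQuery() throws Exception
{
    String indexDir = "index directory path";
    try (FSDirectory directory = FSDirectory.open(new File(indexDir).toPath());
         IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(new IKAnalyzer())))
    {
        // delete every document whose content field contains the term "雷军"
        writer.deleteDocuments(new TermQuery(new Term("content", "雷军")));
        // or drop all documents in the index:
        // writer.deleteAll();
        writer.commit();
    }
}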
Term (exact) query
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import java.io.File;
@Test
public void termQuery()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    String content = "雷军";
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        // TermQuery does not analyze its input: it matches the indexed term exactly
        TopDocs docs = searcher.search(new TermQuery(new Term("content", content)), 10);
        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document.get("content"));
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Fuzzy query
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import java.io.File;
@Test
public void fuzzyQuery()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        // FuzzyQuery matches terms within a small edit distance of "10年"
        FuzzyQuery query = new FuzzyQuery(new Term("content", "10年"));
        TopDocs docs1 = searcher.search(query, 100);
        for (ScoreDoc doc : docs1.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document.get("title"));
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
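By default FuzzyQuery allows an edit distance of up to two between the query term and an indexed term. The maximum number of edits and the length of the exact prefix that must match can also be passed explicitly; a short sketch with illustrative values, reusing the searcher from the example above:
// allow at most 1 edit and require the first character to match exactly
FuzzyQuery strictQuery = new FuzzyQuery(new Term("content", "10年"), 1, 1);
TopDocs strictDocs = searcher.search(strictQuery, 100);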
Combined (boolean) queries
Boolean clause types (see the sketch below the table):
Occur | Meaning |
---|---|
MUST | the clause must match (logical AND) |
MUST_NOT | the clause must not match (logical NOT) |
SHOULD | the clause may match (logical OR) |
FILTER | the clause must match, but does not contribute to the score |
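A brief sketch combining all four clause types in a single BooleanQuery (the query terms are illustrative; when a MUST clause is present, SHOULD clauses are optional and only boost the score):
Query combined = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("title", "一往无前")), BooleanClause.Occur.MUST)     // must match (AND)
        .add(new TermQuery(new Term("content", "失败")), BooleanClause.Occur.MUST_NOT)   // must not match (NOT)
        .add(new TermQuery(new Term("content", "雷军")), BooleanClause.Occur.SHOULD)     // optional, improves the score (OR)
        .add(new TermQuery(new Term("content", "小米")), BooleanClause.Occur.FILTER)     // must match, not scored
        .build();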
Boolean combined query 1
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
@Test
public void screeningQuery()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser titleQueryParser = new QueryParser("title", new IKAnalyzer());
        Query titleQuery = titleQueryParser.parse("一往无前");
        QueryParser contentQueryParser = new QueryParser("content", new IKAnalyzer());
        Query contentQuery = contentQueryParser.parse("雷军");
        // boolean combination: both sub-queries must match
        Query query = new BooleanQuery.Builder()
                .add(titleQuery, BooleanClause.Occur.MUST)
                .add(contentQuery, BooleanClause.Occur.MUST)
                .build();
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document);
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Boolean combined query 2
Example:
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.junit.jupiter.api.Test;
import org.wltea.analyzer.lucene.IKAnalyzer;
import java.io.File;
@Test
public void booleanScreenQuery()
{
    String indexDir = "index directory path";
    File luceneIndexDirectory = new File(indexDir);
    try (FSDirectory fsd = FSDirectory.open(luceneIndexDirectory.toPath()))
    {
        DirectoryReader reader = DirectoryReader.open(fsd);
        IndexSearcher searcher = new IndexSearcher(reader);
        String[] fields = {"title", "content"};
        String[] stringQuery = {"一往无前", "雷军"};
        BooleanClause.Occur[] flags = {BooleanClause.Occur.MUST, BooleanClause.Occur.MUST};
        // parse one query string per field and combine them with the corresponding Occur flag
        Query query = MultiFieldQueryParser.parse(stringQuery, fields, flags, new IKAnalyzer());
        TopDocs docs = searcher.search(query, 10);
        for (ScoreDoc doc : docs.scoreDocs)
        {
            Document document = searcher.doc(doc.doc);
            System.out.println(document);
        }
        reader.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Note: compatibility problems caused by the outdated IKAnalyzer release can be worked around as follows:
Create a class MyIKTokenizer
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import java.io.IOException;
public class MyIKTokenizer extends Tokenizer {
    private IKSegmenter _IKImplement;
    private final CharTermAttribute termAtt = (CharTermAttribute) this.addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = (OffsetAttribute) this.addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = (TypeAttribute) this.addAttribute(TypeAttribute.class);
    private int endPosition;

    public MyIKTokenizer(boolean useSmart) {
        // wrap the IK segmenter around the Reader supplied by Lucene
        this._IKImplement = new IKSegmenter(this.input, useSmart);
    }

    @Override
    public boolean incrementToken() throws IOException {
        this.clearAttributes();
        Lexeme nextLexeme = this._IKImplement.next();
        if (nextLexeme != null) {
            // copy the lexeme produced by IK into Lucene's token attributes
            this.termAtt.append(nextLexeme.getLexemeText());
            this.termAtt.setLength(nextLexeme.getLength());
            this.offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
            this.endPosition = nextLexeme.getEndPosition();
            this.typeAtt.setType(nextLexeme.getLexemeTypeString());
            return true;
        } else {
            return false;
        }
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        this._IKImplement.reset(this.input);
    }

    @Override
    public final void end() throws IOException {
        super.end();
        int finalOffset = this.correctOffset(this.endPosition);
        this.offsetAtt.setOffset(finalOffset, finalOffset);
    }
}
Create a class MyIKAnalyzer
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
public final class MyIKAnalyzer extends Analyzer {
    private boolean useSmart;

    public boolean useSmart() {
        return this.useSmart;
    }

    public void setUseSmart(boolean useSmart) {
        this.useSmart = useSmart;
    }

    public MyIKAnalyzer() {
        this(false);
    }

    public MyIKAnalyzer(boolean useSmart) {
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // hand tokenization over to the patched IK tokenizer defined above
        Tokenizer _MyIKTokenizer = new MyIKTokenizer(this.useSmart());
        return new TokenStreamComponents(_MyIKTokenizer);
    }
}
Wherever the examples above construct an IKAnalyzer, use MyIKAnalyzer instead.
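For example, a short sketch of swapping the analyzer in (the field name and smart-mode flag are illustrative):
// use the wrapper analyzer wherever IKAnalyzer appeared in the earlier examples
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new MyIKAnalyzer(true)); // true enables smart segmentation
QueryParser queryParser = new QueryParser("content", new MyIKAnalyzer(true));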