Document:Documents are the unit of indexing and search. A Document is a set of fields. http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Fieldable)
StringField:indexed but not tokenized; the entire value is treated as a single string (one token).
TextField:A field that is indexed and tokenized.
StoredField:A field whose value is stored so that IndexSearcher.doc(int) and IndexReader.document() will return the field and its value.
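A minimal sketch of the three field types above on one Document (the field names and values here are made up for illustration; Lucene 4.x API assumed, matching the snippets below):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldTypesDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // StringField: indexed as one single token, never tokenized
        doc.add(new StringField("isbn", "978-0321356680", Field.Store.YES));
        // TextField: run through the analyzer and tokenized
        doc.add(new TextField("body", "effective java programming", Field.Store.NO));
        // StoredField: stored only, not indexed, so not searchable
        doc.add(new StoredField("price", 42.0f));
        System.out.println(doc.getFields().size()); // three fields on the document
    }
}
```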
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48); // Analyzer is abstract; use a concrete implementation
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_48, analyzer);
indexWriter = new IndexWriter(directory, indexWriterConfig);
Document doc = new Document();
doc.add(new StoredField(TITLE_FIELD, title));            // stored only, not indexed
doc.add(new TextField(TEXT_FIELD, text, Field.Store.NO)); // indexed, not stored
indexWriter.addDocument(doc);
Query query = queryParser.parse(text);
TopDocs td = searcher.search(query, conceptCount);
for (ScoreDoc scoreDoc : td.scoreDocs) {
    String concept = indexReader.document(scoreDoc.doc).get(FIELD_NAME);
}
Field.Store.YES: Store the original field value in the index. A stored value can be returned at search time; body text usually does not need to be stored. http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/Field.Store.html
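A self-contained sketch of the stored vs. not-stored difference, using an in-memory index (Lucene 4.8 API as in the snippets above; the class name and field names are made up for the example):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoreDemo {
    public static void main(String[] args) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48,
                new StandardAnalyzer(Version.LUCENE_48));
        IndexWriter writer = new IndexWriter(dir, cfg);
        Document doc = new Document();
        doc.add(new StringField("title", "hello", Field.Store.YES)); // stored
        doc.add(new TextField("body", "hello world", Field.Store.NO)); // indexed only
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs td = searcher.search(new TermQuery(new Term("body", "hello")), 10);
        Document hit = searcher.doc(td.scoreDocs[0].doc);
        System.out.println(hit.get("title")); // "hello" - stored, so returned
        System.out.println(hit.get("body"));  // null - indexed but not stored
        reader.close();
    }
}
```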
Chinese word segmentation in Lucene
- StandardAnalyzer: splits Chinese text into individual characters
- SmartChineseAnalyzer: good Chinese support, but poor extensibility; extension dictionaries, stopword lists, and synonym lists are hard to manage.
- mmseg4j:
- IK-analyzer: supports extension dictionaries and stopword dictionaries
  - Official download: https://code.google.com/archive/p/ik-analyzer/downloads IK Analyzer 2012FF_hf1.zip
  - After unzipping, import the jar
  - Put IKAnalyzer.cfg.xml, ext.dic, and stopwords.dic under resources
  - Local build
    - git clone https://github.com/wks/ik-analyzer.git
    - mvn install -Dmaven.test.skip=true
  - Maven pom configuration
- Hanlp:https://github.com/hankcs/hanlp-lucene-plugin
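For reference, the IKAnalyzer.cfg.xml mentioned above typically looks like the sketch below, with the dictionary file names matching the files placed under resources; the exact entry keys should be verified against the template shipped in the IK distribution:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary, path relative to the classpath -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stopword dictionary -->
    <entry key="ext_stopwords">stopwords.dic;</entry>
</properties>
```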
public class FenciTest {

    public static void main(String[] args) throws IOException {
        testAnalyzer();
        Analyzer analyzer = new HanLPIndexAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("", "中华人民共和国");
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
            // Offsets of the token within the text
            OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
            // Position increment relative to the previous token
            PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
            System.out.println(attribute + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset() + " " + positionAttr.getPositionIncrement());
        }
        tokenStream.close();
    }

    public static void testAnalyzer() throws IOException {
        // 1. Create an analyzer
        Analyzer analyzer = new IKAnalyzer(); // smart Chinese analyzer
        // 2. Obtain a TokenStream from the analyzer
        //    arg 1: field name, may be null or ""
        //    arg 2: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("", "中华人民共和国");
        // 3. Add attribute references (like pointers) to the stream; there are
        //    several kinds, e.g. the current term, its offsets, and so on
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class); // the current term
        // Offsets: where the term appears in the text; needed later for
        // highlighting, which must know where each term is located
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // 4. Call reset(); skipping this step throws an exception
        tokenStream.reset();
        // 5. Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.println("start→" + offsetAttribute.startOffset()); // term start offset
            // 6. Print the term
            System.out.println(charTermAttribute);
            System.out.println("end→" + offsetAttribute.endOffset()); // term end offset
        }
        // 7. Close the TokenStream
        tokenStream.close();
    }
}