Document:Documents are the unit of indexing and search. A Document is a set of fields. http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Fieldable)
StringField:indexed but not tokenized; the entire value is treated as a single string (one token).
TextField:A field that is indexed and tokenized.
StoredField:A field whose value is stored so that IndexSearcher.doc(int) and IndexReader.document() will return the field and its value.
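A minimal sketch of the three field types above on one Document (the field names and values here are made up for illustration; Lucene 4.x API assumed, matching the snippets below):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldTypesDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // StringField: indexed as one single token, never tokenized
        doc.add(new StringField("isbn", "978-0321356680", Field.Store.YES));
        // TextField: run through the analyzer and tokenized
        doc.add(new TextField("body", "effective java programming", Field.Store.NO));
        // StoredField: stored only, not indexed, so not searchable
        doc.add(new StoredField("price", 42.0f));
        System.out.println(doc.getFields().size()); // three fields on the document
    }
}
```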
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48); // Analyzer is abstract; use a concrete implementation
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_48, analyzer);
indexWriter = new IndexWriter(directory, indexWriterConfig);
Document doc = new Document();
doc.add(new StoredField(TITLE_FIELD, title));            // stored only, not indexed
doc.add(new TextField(TEXT_FIELD, text, Field.Store.NO)); // indexed, not stored
indexWriter.addDocument(doc);
Query query = queryParser.parse(text);
TopDocs td = searcher.search(query, conceptCount);
for (ScoreDoc scoreDoc : td.scoreDocs) {
    String concept = indexReader.document(scoreDoc.doc).get(FIELD_NAME);
}
Field.Store.YES: Store the original field value in the index. A stored value can be returned at search time; body text usually does not need to be stored. http://lucene.apache.org/core/2_9_4/api/all/org/apache/lucene/document/Field.Store.html
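A self-contained sketch of the stored vs. not-stored difference, using an in-memory index (Lucene 4.8 API as in the snippets above; the class name and field names are made up for the example):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class StoreDemo {
    public static void main(String[] args) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_48,
                new StandardAnalyzer(Version.LUCENE_48));
        IndexWriter writer = new IndexWriter(dir, cfg);
        Document doc = new Document();
        doc.add(new StringField("title", "hello", Field.Store.YES)); // stored
        doc.add(new TextField("body", "hello world", Field.Store.NO)); // indexed only
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs td = searcher.search(new TermQuery(new Term("body", "hello")), 10);
        Document hit = searcher.doc(td.scoreDocs[0].doc);
        System.out.println(hit.get("title")); // "hello" - stored, so returned
        System.out.println(hit.get("body"));  // null - indexed but not stored
        reader.close();
    }
}
```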
Chinese word segmentation in Lucene
- StandardAnalyzer: splits Chinese text into individual characters
- SmartChineseAnalyzer: good Chinese support, but poor extensibility; extension dictionaries, stopword lists, and synonym lists are hard to manage.
- mmseg4j:
- IK-analyzer: supports extension dictionaries and stopword dictionaries
  - Official download: https://code.google.com/archive/p/ik-analyzer/downloads IK Analyzer 2012FF_hf1.zip
  - After unzipping, import the jar
  - Put IKAnalyzer.cfg.xml, ext.dic, and stopwords.dic under resources
  - Local build
    - git clone https://github.com/wks/ik-analyzer.git
    - mvn install -Dmaven.test.skip=true
  - Maven pom configuration
- Hanlp:https://github.com/hankcs/hanlp-lucene-plugin
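For reference, the IKAnalyzer.cfg.xml mentioned above typically looks like the sketch below, with the dictionary file names matching the files placed under resources; the exact entry keys should be verified against the template shipped in the IK distribution:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- extension dictionary, path relative to the classpath -->
    <entry key="ext_dict">ext.dic;</entry>
    <!-- extension stopword dictionary -->
    <entry key="ext_stopwords">stopwords.dic;</entry>
</properties>
```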
public class FenciTest {

    public static void main(String[] args) throws IOException {
        testAnalyzer();
        Analyzer analyzer = new HanLPIndexAnalyzer();
        TokenStream tokenStream = analyzer.tokenStream("", "中华人民共和国");
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
            // Offsets of the token within the text
            OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
            // Position increment relative to the previous token
            PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
            System.out.println(attribute + " " + offsetAtt.startOffset() + " " + offsetAtt.endOffset() + " " + positionAttr.getPositionIncrement());
        }
        tokenStream.close();
    }

    public static void testAnalyzer() throws IOException {
        // 1. Create an analyzer
        Analyzer analyzer = new IKAnalyzer(); // smart Chinese analyzer
        // 2. Obtain a TokenStream from the analyzer
        //    arg 1: field name, may be null or ""
        //    arg 2: the text to analyze
        TokenStream tokenStream = analyzer.tokenStream("", "中华人民共和国");
        // 3. Add attribute references (like pointers) to the stream; there are
        //    several kinds, e.g. the current term, its offsets, and so on
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class); // the current term
        // Offsets: where the term appears in the text; needed later for
        // highlighting, which must know where each term is located
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // 4. Call reset(); skipping this step throws an exception
        tokenStream.reset();
        // 5. Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.println("start→" + offsetAttribute.startOffset()); // term start offset
            // 6. Print the term
            System.out.println(charTermAttribute);
            System.out.println("end→" + offsetAttribute.endOffset()); // term end offset
        }
        // 7. Close the TokenStream
        tokenStream.close();
    }
}