Custom analyzer plugins for Elasticsearch (ES)

MXC抹茶

Published 2024-07-01 02:12:27

Tags: elasticsearch, big data, search engine, full-text search

0. Data preparation

1. Create the index
curl -X PUT -H 'Content-Type:application/json' -d '{"settings":{"index":{"number_of_shards":2,"number_of_replicas":0}},"mappings":{"properties":{"description":{"type":"text"},"name":{"type":"keyword"},"age":{"type":"integer"}}}}' localhost:9200/user

2. View index information
curl 'localhost:9200/_cat/indices?v'
health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   user  uWw_V1ECRbSmZxyLF0TdBg   2   0          0            0       452b           452b

3. Insert a document
curl -X POST -H 'Content-Type:application/json' -d '{"description":"this is a good boy","name":"zhangsan","age":20}' localhost:9200/user/_doc/

4. Query the data
curl -X GET localhost:9200/user/_search


curl -X GET "localhost:9200/user/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "description": "good boy"
    }
  }
}'

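1. When does ES invoke the analyzer?

1) On write (indexing)

ES analyzes each text field synchronously while the document is being written, then stores the resulting terms in the index.

To see this, set a breakpoint at org.apache.lucene.analysis.standard.StandardTokenizer#incrementToken and inspect the call stack. The tokenizer is reached from org.apache.lucene.index.DefaultIndexingChain#processField.

2) On search

The query text is also analyzed synchronously. The resulting terms are then looked up in the term dictionary and the postings lists.

Set a breakpoint at org.apache.lucene.analysis.standard.StandardTokenizer#incrementToken and inspect the call stack. This time the call comes from org.elasticsearch.index.search.MatchQueryParser#parse.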
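
The two call paths above can be sketched with a toy inverted index in plain Java. This is only an illustration under simplifying assumptions, not how Lucene stores data: ToyInvertedIndex and its analyze method are hypothetical names, and the "analyzer" just lowercases and splits on non-alphanumeric characters. What it shows is that the same analysis step runs on both the write path and the search path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of index-time and search-time analysis; not the Lucene implementation.
class ToyInvertedIndex {
    // Sorted term dictionary: term -> postings list of document ids.
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Hypothetical stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Write path: analyze the field, then record the document under every term.
    void index(String docId, String text) {
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new LinkedHashSet<>()).add(docId);
        }
    }

    // Search path: analyze the query the same way, then union the postings of its terms.
    Set<String> match(String query) {
        Set<String> hits = new LinkedHashSet<>();
        for (String term : analyze(query)) {
            hits.addAll(postings.getOrDefault(term, Collections.<String>emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.index("doc1", "this is a good boy");
        idx.index("doc2", "a bad day");
        System.out.println(idx.match("good boy"));
    }
}
```

The union in match mirrors the default OR behaviour of a match query: a document qualifies if it contains any of the analyzed query terms.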

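2. What does the analyzer return to ES?

For each token, the result is an object holding the term text, its start offset, its end offset, and related information. ES uses these to build the postings lists.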
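
As a rough picture of such a per-token object, here is a minimal sketch in plain Java. ToyToken and ToyOffsetTokenizer are hypothetical names, and the splitting rule (a token is a maximal run of letters or digits) is a simplification assumed for this example, not the Unicode segmentation rules StandardTokenizer actually applies.

```java
import java.util.ArrayList;
import java.util.List;

// Toy token: term text plus character offsets and position.
// Not Lucene's attribute classes; the field names are chosen for illustration.
class ToyToken {
    final String term;
    final int startOffset; // index of the first character, inclusive
    final int endOffset;   // index just past the last character, exclusive
    final int position;    // ordinal of the token in the stream

    ToyToken(String term, int startOffset, int endOffset, int position) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.position = position;
    }

    @Override
    public String toString() {
        return term + " [" + startOffset + "," + endOffset + ") pos=" + position;
    }
}

class ToyOffsetTokenizer {
    // Simplified rule (an assumption): a token is a maximal run of letters or digits.
    static List<ToyToken> tokenize(String text) {
        List<ToyToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (!Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators such as punctuation and spaces
                continue;
            }
            int start = i;
            while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            tokens.add(new ToyToken(text.substring(start, i), start, i, tokens.size()));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (ToyToken t : tokenize("Hello, world!")) {
            System.out.println(t);
        }
    }
}
```

The end offset is exclusive, so for "Hello, world!" the first token spans offsets 0 to 5 and the second spans 7 to 12.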

1) Testing with curl

curl -X GET "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "Hello, world! This is a test. 123@example.com! 我是中国人~"
}
'
--- Result
{
	"tokens": [{
		"token": "hello",
		"start_offset": 0,
		"end_offset": 5,
		"type": "<ALPHANUM>",
		"position": 0
	}, {
		"token": "world",
		"start_offset": 7,
		"end_offset": 12,
		"type": "<ALPHANUM>",
		"position": 1
	}, {
		"token": "this",
		"start_offset": 14,
		"end_offset": 18,
		"type": "<ALPHANUM>",
		"position": 2
	}, {
		"token": "is",
		"start_offset": 19,
		"end_offset": 21,
		"type": "<ALPHANUM>",
		"position": 3
	}, {
		"token": "a",
		"start_offset": 22,
		"end_offset": 23,
		"type": "<ALPHANUM>",
		"position": 4
	}, {
		"token": "test",
		"start_offset": 24,
		"end_offset": 28,
		"type": "<ALPHANUM>",
		"position": 5
	}, {
		"token": "123",
		"start_offset": 30,
		"end_offset": 33,
		"type": "<NUM>",
		"position": 6
	}, {
		"token": "example.com",
		"start_offset": 34,
		"end_offset": 45,
		"type": "<ALPHANUM>",
		"position": 7
	}, {
		"token": "我",
		"start_offset": 47,
		"end_offset": 48,
		"type": "<IDEOGRAPHIC>",
		"position": 8
	}, {
		"token": "是",
		"start_offset": 48,
		"end_offset": 49,
		"type": "<IDEOGRAPHIC>",
		"position": 9
	}, {
		"token": "中",
		"start_offset": 49,
		"end_offset": 50,
		"type": "<IDEOGRAPHIC>",
		"position": 10
	}, {
		"token": "国",
		"start_offset": 50,
		"end_offset": 51,
		"type": "<IDEOGRAPHIC>",
		"position": 11
	}, {
		"token": "人",
		"start_offset": 51,
		"end_offset": 52,
		"type": "<IDEOGRAPHIC>",
		"position": 12
	}]
}

2) Testing in code

package qz.es;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import java.io.StringReader;

public class AnalyzerTest {

    public static void main(String[] args) throws Exception {
        String text = "Hello, world! This is a test. 123@example.com! 我是中国人~";
        // Wrap the input text in a StringReader
        StringReader reader = new StringReader(text);

        // Create the StandardTokenizer
        StandardTokenizer tokenizer = new StandardTokenizer();
        // CharTermAttribute exposes the text of the current token
        CharTermAttribute termAtt = tokenizer.addAttribute(CharTermAttribute.class);
        // Start tokenizing
        tokenizer.setReader(reader);
        tokenizer.reset();

        while (tokenizer.incrementToken()) {
            String token = termAtt.toString();
            System.out.println(token);
        }

        tokenizer.end();
        tokenizer.close();
        reader.close();
    }
}
--- Result
Hello
world
This
is
a
test
123
example.com
我
是
中
国
人

Information carried by a single returned token:

[Image: fields of a single token]

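3. Writing your own analyzer plugin

Outside ES, only two components are needed: Analyzer and Tokenizer. Analyzer is the broader concept. It covers the whole text-analysis pipeline, which includes tokenization but also finer-grained processing such as filtering. Tokenizer handles only the tokenization step.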
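
The difference is visible in the two test results above: the standard analyzer returned "hello" because it applies a lowercase filter after tokenizing, while the bare StandardTokenizer returned "Hello". The sketch below models that relationship in plain Java. ToyAnalysisPipeline, tokenize, analyze, and LOWERCASE are hypothetical names for this example, and the filters here only rewrite tokens, whereas real token filters can also drop or add them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy model of "Analyzer = Tokenizer + filters"; not Lucene's classes.
class ToyAnalysisPipeline {
    // A simple filter that lowercases each token.
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);

    // Tokenizer step only: split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Analyzer step: run the tokenizer, then apply every filter in order to each token.
    static List<String> analyze(String text, List<UnaryOperator<String>> filters) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            String current = token;
            for (UnaryOperator<String> filter : filters) {
                current = filter.apply(current);
            }
            out.add(current);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello, world!";
        System.out.println(tokenize(text));
        System.out.println(analyze(text, Collections.singletonList(LOWERCASE)));
    }
}
```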

Integrating with ES requires two more components: AbstractIndexAnalyzerProvider and AnalysisPlugin. (In testing, the plugin also worked without an AbstractTokenizerFactory.)

// Reference project: https://gitee.com/Qiao-Zhi/custom_analyzer_es_plugin

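4. How can a custom analyzer plugin delegate to another analyzer (extended tokenization)?

1) When org.apache.lucene.analysis.Tokenizer#reset runs, the tokenizer can read the full text to be analyzed. It then calls processAnalyzer(sb.toString()) and caches the resulting terms in a field of the object.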

private BufferedReader reader;

// Note: in current Lucene versions Tokenizer#reset() takes no arguments and the
// reader is available as the inherited `input` field; pass that field in here.
public void reset(Reader input) throws IOException {
    reader = (input instanceof BufferedReader)
            ? (BufferedReader) input
            : new BufferedReader(input);

    // Read the whole input into memory
    CharBuffer buffer = CharBuffer.allocate(256);
    StringBuilder sb = new StringBuilder();
    while (reader.read(buffer) != -1) {
        buffer.flip();
        sb.append(buffer);
        buffer.clear();
    }

    // sb now holds the full text to analyze: drop the old terms and analyze again
    terms = null;
    processAnalyzer(sb.toString());
}

2) org.apache.lucene.analysis.TokenStream#incrementToken then walks through the cached terms and returns them one at a time.

@Override
public final boolean incrementToken() throws IOException {
    // A tokenizer must clear the attributes left over from the previous token
    clearAttributes();
    // Fetch the next cached term; null means the input is exhausted
    CustomTom customTerm = tokenizerAdapter.nextTerm();
    if (customTerm == null) {
        return false;
    }

    String word = customTerm.word;
    int offset = customTerm.offset;
    int endOffset = offset + word.length();
    termAtt.setEmpty().append(word);
    offsetAtt.setOffset(correctOffset(offset), correctOffset(endOffset));
    return true;
}
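
Taken together, the two methods form a "buffer, delegate, then iterate" pattern. The sketch below shows that pattern in plain Java so it can run without Lucene. ToyCachingTokenizer is a hypothetical class, the delegate function stands in for whatever third-party analyzer the plugin calls, and offsets are left out to keep it short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.Function;

// Toy version of the reset/incrementToken pattern; not a Lucene Tokenizer.
class ToyCachingTokenizer {
    // Hypothetical delegate: stands in for the other analyzer being wrapped.
    private final Function<String, List<String>> delegate;
    // Terms produced by the delegate and not yet handed out.
    private final Deque<String> pending = new ArrayDeque<>();

    ToyCachingTokenizer(Function<String, List<String>> delegate) {
        this.delegate = delegate;
    }

    // Counterpart of reset: drain the reader, run the delegate once, cache its terms.
    void reset(Reader input) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        pending.clear(); // discard terms left over from a previous input
        pending.addAll(delegate.apply(sb.toString()));
    }

    // Counterpart of incrementToken: one cached term per call, null when exhausted.
    String nextTerm() {
        return pending.poll();
    }

    public static void main(String[] args) {
        // The delegate here simply splits on whitespace.
        ToyCachingTokenizer tokenizer =
                new ToyCachingTokenizer(text -> Arrays.asList(text.split("\\s+")));
        tokenizer.reset(new StringReader("good boy"));
        for (String term = tokenizer.nextTerm(); term != null; term = tokenizer.nextTerm()) {
            System.out.println(term);
        }
    }
}
```

Buffering the whole input is simple, but memory use grows with field size. That trade-off is usually acceptable when the delegate needs the full text to segment it.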