Elasticsearch analyzers and custom analyzers

Introduction to analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-analyzers.html

Viewing tokenization results (the standard analyzer, Elasticsearch's default)

https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-standard-analyzer.html

POST _analyze
{
  "analyzer": "standard",
  "text": "分词的数"
}

Chinese tokenization result

POST _analyze
{
  "analyzer": "standard",
  "text": "我喜欢看小说"
}
{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "喜",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "欢",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "看",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "小",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "说",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        }
    ]
}
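
As the output shows, the standard analyzer splits Chinese text into single characters, which breaks word-level matching. A Chinese analyzer such as IK (used later in this post) produces word tokens instead. A sketch, assuming the analysis-ik plugin is installed:

POST _analyze
{
  "analyzer": "ik_smart",
  "text": "我喜欢看小说"
}

The expected tokens are roughly 我 / 喜欢 / 看 / 小说, though the exact split depends on the IK dictionary.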

English tokenization result

In the output below, note how the standard analyzer lowercases every token, drops the punctuation, splits "Brown-Foxes" at the hyphen, and keeps "dog's" as a single token:

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
{
    "tokens": [
        {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "2",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<NUM>",
            "position": 1
        },
        {
            "token": "quick",
            "start_offset": 6,
            "end_offset": 11,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "brown",
            "start_offset": 12,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "foxes",
            "start_offset": 18,
            "end_offset": 23,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "jumped",
            "start_offset": 24,
            "end_offset": 30,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "over",
            "start_offset": 31,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "the",
            "start_offset": 36,
            "end_offset": 39,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "lazy",
            "start_offset": 40,
            "end_offset": 44,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "dog's",
            "start_offset": 45,
            "end_offset": 50,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "bone",
            "start_offset": 51,
            "end_offset": 55,
            "type": "<ALPHANUM>",
            "position": 10
        }
    ]
}

Custom analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-custom-analyzer.html

A custom analyzer is not global; it belongs to the index whose settings define it.

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom", 
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
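
For reference, this request should return four tokens: html_strip removes the <b> tags before tokenization, lowercase lowercases "Is", and asciifolding folds "déjà" into "deja". An abridged sketch of the response (offsets omitted):

{
    "tokens": [
        { "token": "is", "position": 0 },
        { "token": "this", "position": 1 },
        { "token": "deja", "position": 2 },
        { "token": "vu", "position": 3 }
    ]
}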
type: The analyzer type. Accepts built-in analyzer types; for a custom analyzer, use custom or omit this parameter.

tokenizer: A built-in or custom tokenizer (required). The tokenizer splits the text into terms. https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-tokenizers.html

char_filter: An optional array of built-in or custom character filters. A character filter preprocesses the text before it is tokenized, for example stripping HTML tags. https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-charfilters.html

filter: An optional array of built-in or custom token filters. A token filter post-processes the terms produced by the tokenizer, for example lowercasing them. https://www.elastic.co/guide/en/elasticsearch/reference/8.7/analysis-tokenfilters.html

position_increment_gap: When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next, so that phrase queries do not match terms from different array elements. Defaults to 100; see the sketch after this list.
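
A minimal sketch of the position_increment_gap behavior, using a hypothetical index my-gap-demo: the phrase "Abraham Lincoln" spans two array elements, so the match_phrase query below finds no hit.

PUT my-gap-demo/_doc/1?refresh
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET my-gap-demo/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}

Setting position_increment_gap to 0 in the field mapping would make this query match, because "Abraham" and "Lincoln" would then sit in adjacent positions.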

Order: character filter -> tokenizer -> token filter

Cardinality: char_filter (zero or more) + tokenizer (exactly one) + filter (zero or more)

The ad-hoc _analyze requests below try out different combinations. They assume the analysis-ik and analysis-pinyin plugins are installed; ik_max_word, ik_smart, and pinyin come from those plugins, not from core Elasticsearch.

# ik_max_word (fine-grained IK) tokenizer, then lowercase + pinyin + asciifolding token filters
POST /_analyze
{
    "tokenizer": "ik_max_word",
    "filter": ["lowercase","pinyin","asciifolding"],
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?"
}


# lowercase token filter only
POST /_analyze
{
    "tokenizer": "ik_max_word",
    "filter": ["lowercase"],
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?"
}

# pinyin token filter only
POST /_analyze
{
    "tokenizer": "ik_max_word",
    "filter": ["pinyin"],
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?"
}


# Note: the IK plugin registers ik_smart as an analyzer/tokenizer, not as a token filter,
# so this request is expected to fail with an unknown-token-filter error
POST /_analyze
{
    "tokenizer": "ik_max_word",
    "filter": ["pinyin","ik_smart"],
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?"
}

# pinyin as the tokenizer
POST /_analyze
{
    "tokenizer": "pinyin",
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?"
}

# html_strip removes the stray </html> tag before tokenization
POST /_analyze
{
    "tokenizer": "pinyin",
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?</html>"
}

# ik_smart (coarse-grained IK) tokenizer
POST /_analyze
{
    "tokenizer": "ik_smart",
    "char_filter":["html_strip"],
    "text": "你知道 Elasticsearch吗?</html>"
}
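
Once a filter combination has been verified with ad-hoc _analyze calls, it can be persisted as a custom analyzer in the index settings. A sketch with hypothetical names (my-index-000002, ik_pinyin_analyzer), again assuming the IK and pinyin plugins are installed:

PUT my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "ik_max_word",
          "filter": ["lowercase", "pinyin"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      }
    }
  }
}

POST my-index-000002/_analyze
{
  "analyzer": "ik_pinyin_analyzer",
  "text": "你知道 Elasticsearch吗?"
}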


# Using the mapping char_filter
GET /_analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
        "١ => 1",
        "٢ => 2",
        "٣ => 3",
        "٤ => 4",
        "٥ => 5",
        "٦ => 6",
        "٧ => 7",
        "٨ => 8",
        "٩ => 9"
      ]
    }
  ],
  "text": "My license plate is ٢٥٠١٥"
}
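
The keyword tokenizer emits the whole character-filtered text as a single token, so the response should look roughly like this (offsets may vary by version):

{
    "tokens": [
        {
            "token": "My license plate is 25015",
            "start_offset": 0,
            "end_offset": 25,
            "type": "word",
            "position": 0
        }
    ]
}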


GET /_analyze
{
  "tokenizer": "ik_smart",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        "٠ => 0",
      ]
    }
  ],
  "text": "My license plate is ٠"
}


GET /_analyze
{
  "tokenizer": "ik_smart",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(\\d+)-(?=\\d)",
      "replacement": ""
    }
  ],
  "text": "My credit card is 123-456-789"
}

