【Elasticsearch】Custom Analyzers in Elasticsearch
- char_filter: a character filter preprocesses the character stream before it is passed to the tokenizer. See the official "Character filters reference" documentation; here are some simple examples of mine: char_filter usage.
- filter: a token filter receives the token stream from the tokenizer and can modify tokens (e.g. lowercasing), remove tokens (e.g. stopwords), or add tokens (e.g. synonyms). See the official documentation.
- tokenizer: a tokenizer receives a character stream, breaks it into individual tokens (usually single words), and outputs a token stream. For Chinese text, the ik_max_word tokenizer (from the IK analysis plugin) is the everyday choice. See the official documentation.
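Before building a full custom analyzer, each of the three components can be tried out on its own with the `_analyze` API, which accepts inline definitions and needs no index. A small sketch combining a mapping char filter, a pattern tokenizer, and a stop filter (the sample text is made up for illustration):

```
GET _analyze
{
  "char_filter": [
    { "type": "mapping", "mappings": ["& => and"] }
  ],
  "tokenizer": { "type": "pattern", "pattern": "[ ,.!?]" },
  "filter": [
    { "type": "stop", "stopwords": ["the"] }
  ],
  "text": "Tom & the cat"
}
```

The char filter first rewrites the text to "Tom and the cat", the pattern tokenizer splits it on spaces and punctuation, and the stop filter drops "the", leaving the tokens Tom, and, cat.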
# Custom analyzer
DELETE custom_analysis_index
PUT custom_analysis_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "| => or"
          ]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": [
            "who",
            "the",
            "are",
            "at"
          ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter"],
          "filter": ["my_stopword"],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
GET custom_analysis_index/_analyze
{
"analyzer": "my_analyzer",
"text": ["what is your name? I am kerry","who are you? i am green, i am a student at school"]
}
By combining the pieces above, we have defined a custom analyzer. Running the `_analyze` request shows the result: the stopwords have already been filtered out and the character mappings applied.
{
  "tokens": [
    { "token": "what",    "start_offset": 0,  "end_offset": 4,  "type": "word", "position": 0 },
    { "token": "is",      "start_offset": 5,  "end_offset": 7,  "type": "word", "position": 1 },
    { "token": "your",    "start_offset": 8,  "end_offset": 12, "type": "word", "position": 2 },
    { "token": "name",    "start_offset": 13, "end_offset": 17, "type": "word", "position": 3 },
    { "token": "I",       "start_offset": 19, "end_offset": 20, "type": "word", "position": 4 },
    { "token": "am",      "start_offset": 21, "end_offset": 23, "type": "word", "position": 5 },
    { "token": "kerry",   "start_offset": 24, "end_offset": 29, "type": "word", "position": 6 },
    { "token": "you",     "start_offset": 38, "end_offset": 41, "type": "word", "position": 109 },
    { "token": "i",       "start_offset": 43, "end_offset": 44, "type": "word", "position": 110 },
    { "token": "am",      "start_offset": 45, "end_offset": 47, "type": "word", "position": 111 },
    { "token": "green",   "start_offset": 48, "end_offset": 53, "type": "word", "position": 112 },
    { "token": "i",       "start_offset": 55, "end_offset": 56, "type": "word", "position": 113 },
    { "token": "am",      "start_offset": 57, "end_offset": 59, "type": "word", "position": 114 },
    { "token": "a",       "start_offset": 60, "end_offset": 61, "type": "word", "position": 115 },
    { "token": "student", "start_offset": 62, "end_offset": 69, "type": "word", "position": 116 },
    { "token": "school",  "start_offset": 73, "end_offset": 79, "type": "word", "position": 118 }
  ]
}
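Note the jump in position values: the first string's tokens end at position 6, while the second string's tokens start at 109. When `_analyze` receives an array of strings, it treats them as a multi-valued field and inserts the default `position_increment_gap` of 100 between values ("who" and "are" would sit at positions 107 and 108 but were removed as stopwords, leaving "you" at 109). The gap can be tuned per field in the mapping; a hypothetical sketch (index and field names are made up):

```
PUT gap_example_index
{
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "position_increment_gap": 0
      }
    }
  }
}
```

With a gap of 0, a phrase query could match across the boundary between two values, which is usually unwanted, so the default of 100 is a sensible safeguard.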
One more thing to note: at search time, Elasticsearch decides which analyzer to use by checking the following, in order:
- the analyzer parameter specified in the search request
- the search_analyzer property set on the field in the mapping
- the analysis.analyzer.default_search setting specified at index creation
- the analyzer property set on the field at index creation

For more detail, see this blog post:
https://blog.csdn.net/u011250186/article/details/125704364
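Setting a field's search_analyzer in the mapping can be sketched like this (hypothetical index and field names; assumes the IK plugin is installed):

```
PUT my_search_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

This is a common pattern with the IK plugin: index with the fine-grained ik_max_word so more terms are searchable, but analyze queries with the coarser ik_smart to keep matches precise.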