A Quick Note on Elasticsearch Synonym Search

Latest recommended article published on 2024-05-24 14:32:29

会咏春拳的钢铁侠的爸


Category: Elasticsearch. Tags: Elasticsearch

Article link: https://blog.csdn.net/yzl_66/article/details/87346254


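Under elasticsearch-x.x.x/config, create an analysis folder and, inside it, a text file that holds the synonyms (the index settings in this post reference it as analysis/synonyms.txt; a file extension other than .txt also works). For Chinese synonyms the file must be UTF-8 encoded; otherwise Elasticsearch reports the exception "type": "malformed_input_exception", "reason": "Input length = 1".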
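Since the encoding is the usual cause of that exception, it can help to check the file before starting the node. The following is a minimal sketch, not part of the original post: the temporary directory stands in for elasticsearch-x.x.x/config, the helper name `is_valid_utf8` is made up for illustration, and the sample synonym line is an assumption.

```python
import tempfile
from pathlib import Path


def is_valid_utf8(path: Path) -> bool:
    """Return True if the whole file decodes as strict UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical stand-in for elasticsearch-x.x.x/config/analysis
analysis_dir = Path(tempfile.mkdtemp()) / "analysis"
analysis_dir.mkdir()

# Illustrative synonym line, not taken from the original post
sample = "番茄,西红柿\n"

# Written the way the synonym filter expects it: UTF-8
good = analysis_dir / "synonyms.txt"
good.write_text(sample, encoding="utf-8")

# The same content saved as GBK, a common default in Chinese-locale editors
bad = analysis_dir / "synonyms_gbk.txt"
bad.write_text(sample, encoding="gbk")

print(is_valid_utf8(good))  # True
print(is_valid_utf8(bad))   # False
```

If the check fails, re-saving the file as UTF-8 in the editor is normally enough.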

PUT /test_indexone
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "by_smart": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["by_tfr", "by_sfr"],
            "char_filter": ["by_cfr"]
          },
          "by_max_word": {
            "type": "custom",
            "tokenizer": "ik_max_word",
            "filter": ["by_tfr", "by_sfr"],
            "char_filter": ["by_cfr"]
          }
        },
        "filter": {
          "by_tfr": {
            "type": "stop",
            "stopwords": [" "]
          },
          "by_sfr": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        },
        "char_filter": {
          "by_cfr": {
            "type": "mapping",
            "mappings": ["| => |"]
          }
        }
      }
    }
  }
}
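One way to check that the analyzers behave as intended is the `_analyze` API, which returns the tokens an analyzer produces for a given text. The sketch below uses only the Python standard library; the host `http://localhost:9200` and the helper names `build_analyze_request` and `analyze` are assumptions for illustration, and `analyze` needs a running node with the IK plugin installed.

```python
import json
import urllib.request


def build_analyze_request(text: str,
                          analyzer: str = "by_smart",
                          index: str = "test_indexone",
                          host: str = "http://localhost:9200") -> urllib.request.Request:
    """Build (but do not send) a POST /<index>/_analyze request."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str, analyzer: str = "by_smart") -> list:
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, analyzer)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

Calling `analyze` with a word that appears in synonyms.txt should return that word together with its synonyms if the filter is wired up correctly.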

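char_filter preprocesses the raw text before it is tokenized;

tokenizer splits the text into individual tokens;

filter post-processes the tokens emitted by the tokenizer, for example by removing, modifying, or adding tokens.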

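Synonym search works through a custom filter: as it processes the tokens emitted by the tokenizer, it looks up the synonyms of each token and adds them to the token stream that is searched.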
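The three stages and the synonym expansion can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions: the real work happens inside Elasticsearch, the whitespace tokenizer stands in for ik_smart, the character mapping and stop word are invented for the example, and the synonym table would normally come from synonyms.txt.

```python
from typing import Dict, List, Set


def char_filter(text: str) -> str:
    """Preprocess raw text before tokenization (illustrative: '|' becomes a space)."""
    return text.replace("|", " ")


def tokenizer(text: str) -> List[str]:
    """Split text into tokens (whitespace split; ik_smart would segment Chinese)."""
    return text.split()


def stop_filter(tokens: List[str], stopwords: Set[str]) -> List[str]:
    """Drop tokens that are listed as stop words."""
    return [t for t in tokens if t not in stopwords]


def synonym_filter(tokens: List[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Keep every token and append its synonyms to the token list."""
    out: List[str] = []
    for tok in tokens:
        out.append(tok)
        for syn in synonyms.get(tok, []):
            if syn not in out:
                out.append(syn)
    return out


def analyze(text: str, stopwords: Set[str], synonyms: Dict[str, List[str]]) -> List[str]:
    """Run char_filter, then tokenizer, then the token filters in order."""
    tokens = tokenizer(char_filter(text))
    return synonym_filter(stop_filter(tokens, stopwords), synonyms)


# Hypothetical synonym table and stop word for the demo
synonyms = {"laptop": ["notebook"], "notebook": ["laptop"]}
print(analyze("cheap|laptop the", {"the"}, synonyms))
# ['cheap', 'laptop', 'notebook']
```

Because "notebook" ends up in the token list, a search for "laptop" can also match documents that only mention "notebook".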

Each group of synonyms can be written in one of the following two formats: