A Study of Tokenization in Elasticsearch

墨笙弘一

Article link: https://blog.csdn.net/u012934325/article/details/123552455

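This article looks at tokenization in Elasticsearch (ES): points to consider when defining the index structure, such as choosing a suitable mapping; importing documents through Kibana; and, above all, how analyzers behave, comparing whitespace and default tokenization and examining the two modes of the IK Chinese analyzer (ik_max_word and ik_smart) and where each applies.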

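1. Defining the ES index structure

Considerations:
Data that is rarely updated is a good fit for storing as documents in ES; data that is frequently updated or deleted is not recommended.
Do not create one index per business table the way you would create database tables; understand which scenarios ES is suited to before deciding how many indices to create.
Example:
Index settings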

{
  "fund_product_index" : {
    "settings" : {
      "index" : {
        "routing" : {
          "allocation" : {
            "include" : {
              "_tier_preference" : "data_content"
            }
          }
        },
        "number_of_shards" : "1",
        "provided_name" : "fund_product_index",
        "creation_date" : "1641280331929",
        "number_of_replicas" : "1",
        "uuid" : "99K8gnKzSCqkXoi6cyNYWQ",
        "version" : {
          "created" : "7100299"
        }
      }
    }
  }
}
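
The response above mixes settings chosen by the user with values that ES generates itself (`provided_name`, `creation_date`, `uuid`, `version.created`). As a minimal sketch, assuming the goal is to recreate the index elsewhere, the Python below filters such a response down to a body that could be sent with `PUT /fund_product_index`. The helper name `creatable_settings` and the set of keys treated as read-only are assumptions made for illustration; whether to keep the `routing` block is a judgment call the original does not address.

```python
import json

# Keys that ES fills in itself; assumed here to be rejected on index creation.
READ_ONLY_KEYS = {"provided_name", "creation_date", "uuid", "version"}

def creatable_settings(get_settings_response: dict, index_name: str) -> dict:
    """Build a create-index body from a GET <index>/_settings response."""
    index_settings = get_settings_response[index_name]["settings"]["index"]
    kept = {k: v for k, v in index_settings.items() if k not in READ_ONLY_KEYS}
    return {"settings": {"index": kept}}

# The settings response shown above.
response = {
    "fund_product_index": {
        "settings": {
            "index": {
                "routing": {"allocation": {"include": {"_tier_preference": "data_content"}}},
                "number_of_shards": "1",
                "provided_name": "fund_product_index",
                "creation_date": "1641280331929",
                "number_of_replicas": "1",
                "uuid": "99K8gnKzSCqkXoi6cyNYWQ",
                "version": {"created": "7100299"},
            }
        }
    }
}

body = creatable_settings(response, "fund_product_index")
print(json.dumps(body, indent=2))
```

Only `routing`, `number_of_shards` and `number_of_replicas` survive the filter, which is the part of the settings a user actually decides.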

Define the ES mapping according to the business scenario

{
  "fund_product_index" : {
    "mappings" : {
      "_meta" : {
        "created_by" : "ml-file-data-visualizer"
      },
      "properties" : {
        "firstSpellLetter" : {
          "type" : "text",
          "analyzer" : "pinyin"
        },
        "fullName" : {
          "type" : "text",
          "analyzer" : "ik_max_word",
          "search_analyzer" : "ik_smart"
        },
        "liteName" : {
          "type" : "text",
          "analyzer" : "ik_max_word",
          "search_analyzer" : "ik_smart"
        },
        "productCode" : {
          "type" : "text",
          "analyzer" : "ik_max_word",
          "search_analyzer" : "ik_smart"
        },
        "spell" : {
          "type" : "text",
          "analyzer" : "pinyin"
        }
      }
    }
  }
}
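
The mapping relies on analyzers that are not built into ES: `ik_max_word` and `ik_smart` come from the IK analysis plugin, and `pinyin` from the pinyin analysis plugin. The sketch below uses a hypothetical helper, `analyzers_used`, to collect the analyzer names a mapping references, so they can be compared with the installed plugins before the index is created. The field list is abbreviated to three of the five fields.

```python
def analyzers_used(properties: dict) -> set:
    """Collect every analyzer / search_analyzer name referenced by the fields."""
    names = set()
    for field in properties.values():
        for key in ("analyzer", "search_analyzer"):
            if key in field:
                names.add(field[key])
        # Recurse into object fields and multi-fields, if any.
        for nested_key in ("properties", "fields"):
            if nested_key in field:
                names |= analyzers_used(field[nested_key])
    return names

# Field definitions copied from the mapping above (abbreviated).
properties = {
    "firstSpellLetter": {"type": "text", "analyzer": "pinyin"},
    "fullName": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"},
    "productCode": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_smart"},
}

print(sorted(analyzers_used(properties)))
# ['ik_max_word', 'ik_smart', 'pinyin']
```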

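2. Importing the documents into ES

Kibana's built-in import feature can load the document data into ES. After the import, the mapping has to be updated and the data reindexed.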
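
The reindex is needed because the analyzer of an existing field cannot be changed in place. One common approach, sketched below, is to create a second index with the desired mapping and copy the documents across with the `_reindex` API. Both index names are hypothetical; the original does not say which names were used.

```python
import json

def reindex_body(source_index: str, dest_index: str) -> dict:
    """Build the JSON body for POST _reindex."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}

# Hypothetical names: the index produced by the Kibana import, and a target
# index created beforehand with the ik/pinyin mapping.
body = reindex_body("fund_product_import", "fund_product_index")
print("POST _reindex")
print(json.dumps(body, indent=2))
```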

3. Tokenization in practice

(1) Whitespace tokenization

[Figures: the whitespace tokenization request and its result]
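
For illustration, the built-in `whitespace` analyzer splits text on whitespace only and leaves case and punctuation untouched. The sketch below builds an `_analyze` request body and approximates the resulting tokens locally with Python's `str.split`. The sample text is made up, and the local approximation is an assumption, not output captured from ES.

```python
import json

def analyze_request(analyzer: str, text: str) -> dict:
    """Body for GET /_analyze."""
    return {"analyzer": analyzer, "text": text}

def approx_whitespace_tokens(text: str) -> list:
    """Local approximation: split on runs of whitespace, change nothing else."""
    return text.split()

sample = "Fund Product, 融通 债券"
print(json.dumps(analyze_request("whitespace", sample), ensure_ascii=False))
print(approx_whitespace_tokens(sample))
# ['Fund', 'Product,', '融通', '债券']
```

Note that the comma stays attached to `Product,` and a run of Chinese text without spaces would come out as a single token.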

(2) Default tokenization

[Figures: the default tokenization request and its result]
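
When no analyzer is configured, ES falls back to the `standard` analyzer, which lowercases tokens and emits each Chinese character as a separate token. The sketch below is a rough local approximation of that behaviour for mixed ASCII and Chinese text. The regular expression is an assumption and does not reproduce the full Unicode segmentation rules the real analyzer follows (decimal numbers, for instance, are handled differently).

```python
import re

# One CJK ideograph, or a run of ASCII letters/digits/underscores.
TOKEN_PATTERN = re.compile(r"[\u4e00-\u9fff]|[0-9A-Za-z_]+")

def approx_standard_tokens(text: str) -> list:
    """Approximate the standard analyzer: split, then lowercase."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

print(approx_standard_tokens("Fund Product, 融通债券"))
# ['fund', 'product', '融', '通', '债', '券']
```

The character-by-character split of Chinese text is the main reason a dedicated Chinese analyzer is worth evaluating.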

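4. Analyzer study

(1) Chinese analyzer

We chose the IK Chinese analyzer. Its two modes behave as follows:
ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination; suited to Term Query.
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国,国歌”; suited to Phrase queries.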
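
The difference between the two modes can be illustrated with a toy dictionary-based tokenizer. This is not IK's actual algorithm: `max_word` simply emits every dictionary word found in the text, and `smart` uses greedy longest-match from the left. Both function names and the hand-built dictionary are assumptions chosen so that the example above can be reproduced.

```python
def max_word(text: str, dictionary: set) -> list:
    """Emit every dictionary word in text, ordered by start, longest first."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens

def smart(text: str, dictionary: set) -> list:
    """Greedy longest match from the left; unknown characters become single tokens."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens

# Hand-built dictionary containing the words from the example above.
dictionary = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}
text = "中华人民共和国国歌"

print(",".join(max_word(text, dictionary)))
# 中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌
print(",".join(smart(text, dictionary)))
# 中华人民共和国,国歌
```

Indexing with the fine-grained mode and searching with the coarse one, as the mapping in section 1 does, makes many sub-words findable while keeping the query itself from being over-split.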
Tokenization result:

{
  "tokens" : [
    {
      "token" : "融通",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "通通",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "瑞",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "债券",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "型",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "证券",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "CN_WORD",