Elasticsearch Tokenization

微风薄云

Last modified 2022-04-05 17:10:48


First published 2021-09-14 10:26:50

Article link: https://blog.csdn.net/abaddon_t_mac/article/details/120280019


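The _analyze API shows how a field, an analyzer, or a tokenizer breaks a piece of text into the terms that get indexed.
token: the term as it is stored in the index.
position: the ordinal of the term in the original text, counted from 0.
start_offset and end_offset: the character range the term occupies in the original text. start_offset is inclusive and end_offset is exclusive.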

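An analyzer is made up of three parts

• Character Filters: preprocess the raw text, for example by stripping HTML
First, the string passes through each character filter in order. Their job is to tidy up the string before tokenization. A character filter can be used to strip HTML, or to convert & into and.
• Tokenizer: splits the text into terms according to its rules
Next, the tokenizer splits the string into individual terms. The whitespace tokenizer, for example, splits only on whitespace, while the standard tokenizer also splits on most punctuation.
• Token Filters: post-process the terms, for example by lowercasing them, removing stopwords, or adding synonyms
Finally, the terms pass through each token filter in order. This step can change the terms: the lowercase token filter lowercases them (ES becomes es), the stop token filter removes terms (stopwords such as a, and, the), and the synonym token filter adds terms (synonyms such as jump and leap).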
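The three stages can be pictured with a small toy model in plain Python. This is only a sketch of the idea, not how Elasticsearch implements it: the function names, the regular expressions, the three-word stopword set, and the one-entry synonym table are all assumptions made for illustration, and a real synonym filter places the added term at the same position as the original, which this list-based version does not model.

```python
import re

# Illustrative data taken from the examples in the text, not Elasticsearch defaults.
STOPWORDS = {"a", "and", "the"}
SYNONYMS = {"jump": ["leap"]}


def char_filters(text):
    """Stage 1: tidy the raw string before it is tokenized."""
    text = re.sub(r"<[^>]+>", " ", text)  # crude HTML tag removal
    return text.replace("&", " and ")      # convert & into and


def tokenizer(text):
    """Stage 2: split into runs of ASCII letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def token_filters(tokens):
    """Stage 3: lowercase, drop stopwords, then add synonyms."""
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in STOPWORDS]
    result = []
    for t in tokens:
        result.append(t)
        result.extend(SYNONYMS.get(t, []))
    return result


def analyze(text):
    """Run the three stages in the order described above."""
    return token_filters(tokenizer(char_filters(text)))


print(analyze("<b>Jump</b> & the ES"))  # ['jump', 'leap', 'es']
```

The order matters: the & is turned into and by the character filter, and that and is then removed again by the stop filter, which is why it never reaches the output.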

The standard analyzer

GET /_analyze
{
  "analyzer" : "standard",
  "text" : "Quick Brown Foxes!"
}
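The request above uses console syntax. Outside the console, the same call could be sent as an ordinary HTTP request. The following is a minimal sketch using only the Python standard library; the address http://localhost:9200, the absence of TLS and authentication, and the helper names build_analyze_request and analyze are assumptions, not something the original setup specifies. POST is used here because the _analyze endpoint accepts it as well as GET.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, no TLS or authentication


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) an HTTP request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text):
    """Send the request and return the list of token dicts (needs a running node)."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return json.load(resp)["tokens"]
```

With a node running at the assumed address, analyze("standard", "Quick Brown Foxes!") would be expected to return the same token list as the console request.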

Response

{
  "tokens" : [
    {
      "token" : "quick",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "brown",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "foxes",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
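The offsets in this response can be checked by slicing the original text. The sketch below copies the three tokens from the response (leaving out the type field) and recovers the substring each one came from; the helper name surface_forms is made up for this example.

```python
text = "Quick Brown Foxes!"

# Tokens copied from the response above (type field omitted).
tokens = [
    {"token": "quick", "start_offset": 0, "end_offset": 5, "position": 0},
    {"token": "brown", "start_offset": 6, "end_offset": 11, "position": 1},
    {"token": "foxes", "start_offset": 12, "end_offset": 17, "position": 2},
]


def surface_forms(text, tokens):
    """Return the original substring each token was derived from."""
    return [text[t["start_offset"]:t["end_offset"]] for t in tokens]


for tok, original in zip(tokens, surface_forms(text, tokens)):
    print(tok["position"], tok["token"], "<-", repr(original))
# 0 quick <- 'Quick'
# 1 brown <- 'Brown'
# 2 foxes <- 'Foxes'
```

The slices keep their capital letters while the tokens are lowercase, so the offsets point into the original text and not into the processed terms. The trailing ! at index 17 belongs to no token.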

Chinese analyzer

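The IK Chinese analyzer plugin: https://github.com/medcl/elasticsearch-analysis-ik

optional 2 in the install instructions requires JDK 11, so I installed the plugin with optional 1.

Following the test case from the project page: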

GET /_analyze
{
  "analyzer" : "ik_max_word",
  "text" : "美国留给伊拉克的是个烂摊子吗"
}

It returns the expected result

{
  "tokens" : [
    {
      "token" : "美国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "留给",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "伊拉克",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "的",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "是",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "CN_CHAR",
      "position" : 4
    },
    {
      "token" : "个",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "烂摊子",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "吗",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "CN_CHAR",
      "position" : 7
    }
  ]
}
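As a quick sanity check on this response, the sketch below confirms that the tokens tile the input sentence with no gaps or overlaps, and groups them by the type field. The tuples are copied from the response above; the helper names covers_text and group_by_type are made up for this example. Python string indices line up with the reported offsets for this sentence; that may not hold for text containing characters outside the Basic Multilingual Plane.

```python
from collections import defaultdict

text = "美国留给伊拉克的是个烂摊子吗"

# (token, start_offset, end_offset, type) copied from the response above.
tokens = [
    ("美国", 0, 2, "CN_WORD"),
    ("留给", 2, 4, "CN_WORD"),
    ("伊拉克", 4, 7, "CN_WORD"),
    ("的", 7, 8, "CN_CHAR"),
    ("是", 8, 9, "CN_CHAR"),
    ("个", 9, 10, "CN_CHAR"),
    ("烂摊子", 10, 13, "CN_WORD"),
    ("吗", 13, 14, "CN_CHAR"),
]


def covers_text(text, tokens):
    """True if each token equals its slice and the slices run contiguously from 0 to len(text)."""
    expected_start = 0
    for token, start, end, _type in tokens:
        if start != expected_start or text[start:end] != token:
            return False
        expected_start = end
    return expected_start == len(text)


def group_by_type(tokens):
    """Collect the token strings under their type label."""
    groups = defaultdict(list)
    for token, _start, _end, type_ in tokens:
        groups[type_].append(token)
    return dict(groups)


print(covers_text(text, tokens))  # True
print(group_by_type(tokens))
```

In this response the multi-character dictionary words are labelled CN_WORD and the single characters are labelled CN_CHAR.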