How do you inspect Elasticsearch's tokenization results?

.Passion

Article link: https://blog.csdn.net/qq_43923045/article/details/105921983

GET _analyze
{
  "analyzer":"standard",
  "text":"中国棒棒的"
  
}

Response:

{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "棒",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "棒",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    }
  ]
}

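As the response shows, the default (standard) analyzer treats every Chinese character as a separate token, which is a poor fit for searching Chinese text.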
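To make the per-character split easier to see, the response can be reduced to its token strings. The following is only a minimal Python sketch: `token_texts` is a hypothetical helper name, and the sample data is an abbreviated copy of the response above with the `type` field left out.

```python
import json

# Abbreviated copy of the standard-analyzer response shown above
# (only the fields this sketch needs).
RESPONSE_JSON = """
{
  "tokens": [
    {"token": "中", "start_offset": 0, "end_offset": 1, "position": 0},
    {"token": "国", "start_offset": 1, "end_offset": 2, "position": 1},
    {"token": "棒", "start_offset": 2, "end_offset": 3, "position": 2},
    {"token": "棒", "start_offset": 3, "end_offset": 4, "position": 3},
    {"token": "的", "start_offset": 4, "end_offset": 5, "position": 4}
  ]
}
"""


def token_texts(response):
    """Return the token strings of a parsed _analyze response, ordered by position."""
    ordered = sorted(response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in ordered]


response = json.loads(RESPONSE_JSON)
tokens = token_texts(response)

# One token per character: every token has length 1.
print(" | ".join(tokens))
print(all(len(t) == 1 for t in tokens))
```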

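So let's see what we get with the IK analyzers instead. IK is a third-party analysis plugin, so it has to be installed before these requests will work.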

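Concepts:

ik_smart: coarsest granularity; keeps the text in the largest chunks it can, so it yields the fewest tokens
ik_max_word: finest granularity; yields the most tokens
stop: splits the text at non-letter characters, lowercases it, and removes stop words (the default stop word list is English)
keyword: emits the whole input as a single token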
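The same `_analyze` request can also be sent from a script, which makes it easy to try each analyzer on the same text. The sketch below uses only the Python standard library. It assumes an unsecured node at `http://localhost:9200` with the IK plugin installed; `build_analyze_body` and `analyze` are hypothetical helper names. Only the request bodies are printed here, because `analyze` needs a running node.

```python
import json
import urllib.request

# Assumption: an unsecured local node. Adjust the URL and add
# authentication for a real cluster.
ES_URL = "http://localhost:9200"

# The analyzers discussed in this post.
ANALYZERS = ["standard", "ik_smart", "ik_max_word", "stop", "keyword"]


def build_analyze_body(analyzer, text):
    """Build the JSON body for an _analyze request."""
    return {"analyzer": analyzer, "text": text}


def analyze(analyzer, text, es_url=ES_URL):
    """Send the request and return the token strings. Needs a running node."""
    data = json.dumps(build_analyze_body(analyzer, text)).encode("utf-8")
    request = urllib.request.Request(
        f"{es_url}/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        payload = json.load(resp)
    return [t["token"] for t in payload["tokens"]]


# Show the body that would be sent for each analyzer.
for name in ANALYZERS:
    body = build_analyze_body(name, "中国棒棒的,北京大学毕业生")
    print(json.dumps(body, ensure_ascii=False))
```

With a node available, `analyze("ik_smart", "中国棒棒的,北京大学毕业生")` would return the token strings for that analyzer.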

Examples:

1. ik_smart

GET _analyze
{
  "analyzer":"ik_smart",
  "text":"中国棒棒的,北京大学毕业生"
  
}

Response:

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "棒棒",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "北京大学",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "毕业生",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2. ik_max_word

GET _analyze
{
  "analyzer":"ik_max_word",
  "text":"中国棒棒的,北京大学毕业生"
  
}


Here is the tokenization result:

{
  "tokens" : [
    {
      "token" : "中国",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "棒棒",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "的",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "CN_CHAR",
      "position" : 2
    },
    {
      "token" : "北京大学",
      "start_offset" : 6,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "北京大",
      "start_offset" : 6,
      "end_offset" : 9,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "北京",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 5
    },
    {
      "token" : "大学毕业",
      "start_offset" : 8,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "大学",
      "start_offset" : 8,
      "end_offset" : 10,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "毕业生",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "毕业",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "生",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "CN_CHAR",
      "position" : 10
    }
  ]
}
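One way to see the difference between the two IK modes is to look at the offsets: the ik_smart tokens never overlap, while ik_max_word emits several tokens over the same characters. The sketch below checks this in Python. The tuples are copied from the two responses above, and `overlapping_pairs` is a hypothetical helper name.

```python
# (token, start_offset, end_offset) copied from the two responses above.
IK_SMART = [
    ("中国", 0, 2), ("棒棒", 2, 4), ("的", 4, 5),
    ("北京大学", 6, 10), ("毕业生", 10, 13),
]
IK_MAX_WORD = [
    ("中国", 0, 2), ("棒棒", 2, 4), ("的", 4, 5),
    ("北京大学", 6, 10), ("北京大", 6, 9), ("北京", 6, 8),
    ("大学毕业", 8, 12), ("大学", 8, 10),
    ("毕业生", 10, 13), ("毕业", 10, 12), ("生", 12, 13),
]


def overlapping_pairs(tokens):
    """Return pairs of tokens whose [start, end) character ranges intersect."""
    pairs = []
    for i, (tok_a, start_a, end_a) in enumerate(tokens):
        for tok_b, start_b, end_b in tokens[i + 1:]:
            if start_a < end_b and start_b < end_a:
                pairs.append((tok_a, tok_b))
    return pairs


print("ik_smart:", len(IK_SMART), "tokens,",
      len(overlapping_pairs(IK_SMART)), "overlapping pairs")
print("ik_max_word:", len(IK_MAX_WORD), "tokens,",
      len(overlapping_pairs(IK_MAX_WORD)), "overlapping pairs")
```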

The stop analyzer

GET _analyze
{
  "analyzer":"stop",
  "text":"中国棒棒的,北京大学毕业生"
  
}

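Here is the stop analyzer's result. Nothing is removed, because the default stop word list is English; the text is simply split at the comma: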

{
  "tokens" : [
    {
      "token" : "中国棒棒的",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "北京大学毕业生",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "word",
      "position" : 1
    }
  ]
}

GET _analyze
{
  "analyzer":"stop",
  "text":"汽车制造/维修/零配件"
  
}

{
  "tokens" : [
    {
      "token" : "汽车制造",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "维修",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "零配件",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    }
  ]
}
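Both results follow from how the stop analyzer works: it cuts the text at every non-letter character, lowercases each piece, and then drops stop words. Chinese characters count as letters, so only the comma and the slashes act as boundaries. The Python sketch below is a rough approximation of that behavior, not the actual Lucene implementation. `stop_analyze` is a hypothetical name, `STOP_WORDS` is only an illustrative subset of the English list, and `str.isalpha()` may disagree with the real letter test on unusual characters.

```python
# Illustrative subset of English stop words; not the analyzer's full list.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "to", "in"}


def stop_analyze(text):
    """Approximate the stop analyzer: split on non-letters, lowercase, drop stop words.

    Returns (token, start_offset, end_offset) tuples.
    """
    tokens = []
    start = None
    # The appended space is a sentinel so the final run of letters is flushed.
    for i, ch in enumerate(text + " "):
        if ch.isalpha():
            if start is None:
                start = i  # a new run of letters begins here
        elif start is not None:
            word = text[start:i].lower()
            if word not in STOP_WORDS:
                tokens.append((word, start, i))
            start = None
    return tokens


print(stop_analyze("中国棒棒的,北京大学毕业生"))
print(stop_analyze("汽车制造/维修/零配件"))
print(stop_analyze("The quick fox"))
```

The English input is included only to show stop word removal, which the Chinese inputs never trigger.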

The keyword analyzer

GET _analyze
{
  "analyzer":"keyword",
  "text":"汽车制造/维修/零配件"
  
}

{
  "tokens" : [
    {
      "token" : "汽车制造/维修/零配件",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}
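The keyword analyzer does no splitting at all, so its output is easy to mimic. The Python sketch below reproduces the shape of the response above for the same input. `keyword_analyze` is a hypothetical name, and the offsets are computed with Python's `len`, which matches the response here but is an assumption for text containing characters outside the Basic Multilingual Plane.

```python
def keyword_analyze(text):
    """Emit the whole input as a single token, mirroring the response shape above."""
    return {
        "tokens": [
            {
                "token": text,
                "start_offset": 0,
                "end_offset": len(text),
                "type": "word",
                "position": 0,
            }
        ]
    }


result = keyword_analyze("汽车制造/维修/零配件")
print(result["tokens"][0]["token"], result["tokens"][0]["end_offset"])
```

Because the slashes stay inside the token, a field analyzed this way only matches the full string "汽车制造/维修/零配件", not "维修" on its own.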