Master Elasticsearch in Five Minutes (Part 8): An Exhaustive Summary of the IK Analyzer

Inverted index

Building an inverted index is where tokenization (analysis) comes in.

Analysis syntax

The default analyzer

GET _analyze?pretty
  {
    "text": "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱"
  }
{
  "tokens" : [
    {
      "token" : "haier",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "海",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "尔",
      "start_offset" : 7,
      "end_offset" : 8,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "bcd",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "470wdpg",
      "start_offset" : 13,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "十",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "字",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "对",
      "start_offset" : 22,
      "end_offset" : 23,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "开",
      "start_offset" : 23,
      "end_offset" : 24,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "门",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "风",
      "start_offset" : 25,
      "end_offset" : 26,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "冷",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "变",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    },
    {
      "token" : "频",
      "start_offset" : 28,
      "end_offset" : 29,
      "type" : "<IDEOGRAPHIC>",
      "position" : 13
    },
    {
      "token" : "一",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<IDEOGRAPHIC>",
      "position" : 14
    },
    {
      "token" : "级",
      "start_offset" : 30,
      "end_offset" : 31,
      "type" : "<IDEOGRAPHIC>",
      "position" : 15
    },
    {
      "token" : "节",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<IDEOGRAPHIC>",
      "position" : 16
    },
    {
      "token" : "能",
      "start_offset" : 32,
      "end_offset" : 33,
      "type" : "<IDEOGRAPHIC>",
      "position" : 17
    },
    {
      "token" : "家",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "<IDEOGRAPHIC>",
      "position" : 18
    },
    {
      "token" : "用",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<IDEOGRAPHIC>",
      "position" : 19
    },
    {
      "token" : "官",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "<IDEOGRAPHIC>",
      "position" : 20
    },
    {
      "token" : "方",
      "start_offset" : 36,
      "end_offset" : 37,
      "type" : "<IDEOGRAPHIC>",
      "position" : 21
    },
    {
      "token" : "冰",
      "start_offset" : 37,
      "end_offset" : 38,
      "type" : "<IDEOGRAPHIC>",
      "position" : 22
    },
    {
      "token" : "箱",
      "start_offset" : 38,
      "end_offset" : 39,
      "type" : "<IDEOGRAPHIC>",
      "position" : 23
    }
  ]
}
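
The output above shows the default analyzer keeping alphanumeric runs together (lowercased) while emitting one token per Chinese character. A minimal Python sketch that mimics just that behaviour for this example is below. The regular expression and the function name approx_standard_tokens are assumptions made for illustration; the real standard analyzer follows Unicode text segmentation rules and handles far more cases.

```python
import re

# Rough approximation only: alphanumeric runs stay together (lowercased),
# while every CJK ideograph in the basic block becomes its own token.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+|[\u4e00-\u9fff]")


def approx_standard_tokens(text):
    """Return tokens roughly the way the default analyzer split the title above."""
    return TOKEN_PATTERN.findall(text.lower())


title = "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱"
tokens = approx_standard_tokens(title)
print(tokens[:6])   # ['haier', '海', '尔', 'bcd', '470wdpg', '十']
print(len(tokens))  # 24
```

With single-character tokens like these, a search for 冰箱 only matches on the separate characters 冰 and 箱, which is why a Chinese-aware analyzer such as IK is used in the rest of this post.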

ik_max_word

GET _analyze?pretty
  {
    "analyzer": "ik_max_word",
    "text": "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱"
  }
{
  "tokens" : [
    {
      "token" : "haier",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "海尔",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "bcd-470wdpg",
      "start_offset" : 9,
      "end_offset" : 20,
      "type" : "LETTER",
      "position" : 2
    },
    {
      "token" : "bcd",
      "start_offset" : 9,
      "end_offset" : 12,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "470",
      "start_offset" : 13,
      "end_offset" : 16,
      "type" : "ARABIC",
      "position" : 4
    },
    {
      "token" : "wdpg",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "十字",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "十",
      "start_offset" : 20,
      "end_offset" : 21,
      "type" : "TYPE_CNUM",
      "position" : 7
    },
    {
      "token" : "字",
      "start_offset" : 21,
      "end_offset" : 22,
      "type" : "COUNT",
      "position" : 8
    },
    {
      "token" : "对开",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "开门",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "门风",
      "start_offset" : 24,
      "end_offset" : 26,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "风冷",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "变频",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "一级",
      "start_offset" : 29,
      "end_offset" : 31,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "一",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "TYPE_CNUM",
      "position" : 15
    },
    {
      "token" : "级",
      "start_offset" : 30,
      "end_offset" : 31,
      "type" : "COUNT",
      "position" : 16
    },
    {
      "token" : "节能",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 17
    },
    {
      "token" : "家用",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 18
    },
    {
      "token" : "官方",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 19
    },
    {
      "token" : "冰箱",
      "start_offset" : 37,
      "end_offset" : 39,
      "type" : "CN_WORD",
      "position" : 20
    }
  ]
}

ik_smart

GET _analyze?pretty
  {
    "analyzer": "ik_smart",
    "text": "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱"
  }
{
  "tokens" : [
    {
      "token" : "haier",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "ENGLISH",
      "position" : 0
    },
    {
      "token" : "海尔",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "bcd-470wdpg",
      "start_offset" : 9,
      "end_offset" : 20,
      "type" : "LETTER",
      "position" : 2
    },
    {
      "token" : "十字",
      "start_offset" : 20,
      "end_offset" : 22,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "对开",
      "start_offset" : 22,
      "end_offset" : 24,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "门",
      "start_offset" : 24,
      "end_offset" : 25,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "风冷",
      "start_offset" : 25,
      "end_offset" : 27,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "变频",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "一级",
      "start_offset" : 29,
      "end_offset" : 31,
      "type" : "CN_WORD",
      "position" : 8
    },
    {
      "token" : "节能",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "家用",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "官方",
      "start_offset" : 35,
      "end_offset" : 37,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "冰箱",
      "start_offset" : 37,
      "end_offset" : 39,
      "type" : "CN_WORD",
      "position" : 12
    }
  ]
}

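ik_max_word: splits the text at the finest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌”, exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国, 国歌”.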
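
To make the difference between the two granularities concrete, here is a toy Python sketch. It is not IK's actual algorithm: it assumes a tiny hand-made dictionary (DICTIONARY is hypothetical, and it leaves out the single characters and 国国 that the list above includes), and it contrasts "emit every dictionary word found anywhere" with "greedily take the longest word at each position".

```python
# Hypothetical mini-dictionary; the real IK dictionary is far larger.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def all_dictionary_matches(text, dictionary):
    """Fine-grained: emit every dictionary word found at every start offset."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def longest_match_first(text, dictionary):
    """Coarse-grained: greedily take the longest dictionary word at each position."""
    found = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                break
        else:
            end = pos + 1  # no dictionary word starts here: emit one character
        found.append(text[pos:end])
        pos = end
    return found


text = "中华人民共和国国歌"
print(all_dictionary_matches(text, DICTIONARY))
print(longest_match_first(text, DICTIONARY))  # ['中华人民共和国', '国歌']
```

The fine-grained list is longer, so it matches more queries at the cost of a larger index; the coarse-grained list is shorter and more precise.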

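Using the analyzer

Once a field is analyzed, its tokens are stored in an inverted index, and that is what lets a query match on part of the text.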
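
As a rough picture of what "stored in an inverted index" means, the Python sketch below maps each token to the ids of the documents that contain it. The token lists are hand-written for illustration and are not real analyzer output, and build_inverted_index is a hypothetical helper; Elasticsearch's actual index structures (built on Lucene) are far more elaborate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            postings[token].add(doc_id)
    return {token: sorted(ids) for token, ids in postings.items()}


# Hand-written token lists (illustrative only, not real analyzer output).
docs = {
    1: ["海尔", "风冷", "变频", "冰箱"],
    2: ["海尔", "节能", "冰箱"],
    3: ["松下", "风冷", "冰箱"],
}

index = build_inverted_index(docs)
print(index["冰箱"])  # [1, 2, 3]
print(index["风冷"])  # [1, 3]
```

Looking up a term is then a single dictionary access instead of a scan over every document.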

Create an index whose title field is analyzed with IK

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "name": {
        "type": "text"
      },
      "age": {
        "type": "integer"
      },
      "created": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      }
    }
  }
}

Insert documents into the index

POST /my_index/_bulk
{ "index": { "_id": 1 }}
{ "title" : "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱", "name" : "王二" , "age": 10, "created": "2019-01-01" }
{ "index": { "_id": 2 }}
{ "title" : "【爆款秒杀】海尔冰箱三门家用小型节能省电双门电冰箱官方旗舰店", "name" : "王二" , "age": 10, "created": "2019-01-01" }
{ "index": { "_id": 3 }}
{ "title" : "Panasonic/松下 NR-TC28WS1-N 风冷无霜家用抑菌三门小体积冰箱", "name" : "王二" , "age": 10, "created": "2019-01-01" }
{ "index": { "_id": 4 }}
{ "title" : "小米电视4A50英寸4K高清智能网络平板液晶屏家电视机家电官方旗舰", "name" : "王二" , "age": 10, "created": "2019-01-01" }
{ "index": { "_id": 5 }}
{ "title" : "创维40X6 40英寸高清电视机智能网络wifi平板液晶屏家用彩电32 43", "name" : "王二" , "age": 10, "created": "2019-01-01" }
{ "index": { "_id": 6 }}
{ "title" : "Changhong/长虹 50D4P 50英寸超薄无边全面屏4K超高清智能电视机", "name" : "王二" , "age": 10, "created": "2019-01-01" }

Inspect the tokens

GET _analyze?pretty
  {
    "analyzer": "ik_max_word",
    "text": "Haier/海尔 BCD-470WDPG十字对开门风冷变频一级节能家用官方冰箱"
  }

Search with a condition

GET /my_index/_search?pretty
{
    "query": {
         "match": {"title": "对"}
     }
}

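You will find that a query only matches terms the analyzer actually produced. In the ik_max_word output above, the title yields 对开 and 开门 but no standalone 对, so a search for 对 alone should return no hits, whereas a search for 对开 should match.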
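
The same point as a small Python sketch. The indexed terms are copied from the ik_max_word output shown earlier for the span "对开门风冷"; the assumption (not confirmed by the output in this post) is that the query text is analyzed into the single terms listed in each call. The matches helper is hypothetical and only checks term overlap, which is the essence of what a match query does.

```python
# Terms ik_max_word produced for "对开门风冷" in the output shown earlier.
indexed_terms = {"对开", "开门", "门风", "风冷"}


def matches(query_terms, indexed_terms):
    """A document matches if at least one query term is an indexed term."""
    return any(term in indexed_terms for term in query_terms)


print(matches(["对"], indexed_terms))    # False: "对" was never indexed on its own
print(matches(["对开"], indexed_terms))  # True: "对开" is an indexed term
```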

Custom analyzers

Reference: https://blog.csdn.net/Barbarousgrowth_yp/article/details/80242811

Reference: https://blog.csdn.net/zhou870498/article/details/80501972

