Advanced, Lecture 29: Mastering IK Chinese Word Segmentation: Hands-On Installation and Use of the IK Chinese Analyzer

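Up to now, you may have noticed that we have worked entirely in English. Was that fun? Not really.

In practice, most of the search applications we build in China are Chinese-language applications; English ones are rare.

standard: this analyzer cannot segment Chinese sensibly. It simply cuts the text into individual characters, for example 中国人 --> 中 国 人

English still had to come first: we used the material from the core knowledge part to learn es, a search engine that natively targets English, before anything else. Some topics are easier to explain in English, for example stemming: a stemmer reduces analyzed, played, students --> analyze, play, student. Some topics apply only to English and do not carry over well to Chinese.

From this lecture on, things get more comfortable, because everything is in the Chinese we are familiar with: the advanced topics, search, and aggregations all use Chinese.

In the search engine field, one of the more mature and popular Chinese analyzers is the ik analyzer.

Example sentence: 中国人很喜欢吃油条 (Chinese people love eating fried dough sticks)

standard (the default analyzer, designed around English-like text): 中 国 人 很 喜 欢 吃 油 条

ik: 中国人 很 喜欢 吃 油条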
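To make the contrast concrete, here is a minimal Python sketch. It is not how either analyzer is implemented: `standard_like` just emits one token per character, and `forward_max_match` is a simple greedy longest-match segmenter over a tiny hand-made dictionary (`TOY_DICT` is an assumption for illustration; ik ships its own, much larger dictionaries and more elaborate logic).

```python
def standard_like(text):
    """Mimic the per-character splitting seen for Chinese: one token per character."""
    return [ch for ch in text if not ch.isspace()]


def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary word
        # (or a single character) is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, chosen only to cover the example sentence.
TOY_DICT = {"中国人", "喜欢", "油条"}
sentence = "中国人很喜欢吃油条"

print(" ".join(standard_like(sentence)))
print(" ".join(forward_max_match(sentence, TOY_DICT)))
```

With this toy dictionary the two printed lines reproduce the standard and ik splits of the example sentence. The point is only that word-level tokens require a dictionary, which the standard analyzer does not have for Chinese.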

 

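1. Installing the ik Chinese analyzer in elasticsearch

(1) git clone https://github.com/medcl/elasticsearch-analysis-ik, then cd into the cloned elasticsearch-analysis-ik directory

(2) git checkout tags/v5.2.0 (the plugin version has to match the es version)

(3) mvn package

(4) Copy target/releases/elasticsearch-analysis-ik-5.2.0.zip into the es/plugins/ik directory

(5) Unzip elasticsearch-analysis-ik-5.2.0.zip inside es/plugins/ik

(6) Restart es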

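2. ik analyzer basics

ik provides two analyzers. Choose according to your needs, but ik_max_word is the usual choice.

ik_max_word: splits the text at the finest granularity. For example, 中华人民共和国国歌 (National Anthem of the People's Republic of China) is split into 中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 人, 民, 共和国, 共和, 和, 国国, 国歌, exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, 中华人民共和国国歌 is split into 中华人民共和国, 国歌.

A question to think about: if the text was indexed with ik_smart as 中华人民共和国 and 国歌, can a search for 共和国 find it? It cannot, because 共和国 is not among the indexed terms, whereas ik_max_word does index it. That is why ik_max_word is generally preferred.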
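The difference between the two granularities, and its effect on the 共和国 search, can be sketched in Python. This is only an illustration under stated assumptions: `TOY_DICT` contains exactly the words listed for ik_max_word above, `all_dictionary_words` is a hypothetical helper that enumerates every dictionary word found in the text, and the coarse list is copied from the ik_smart example rather than computed. The real plugin uses its own dictionaries and disambiguation rules.

```python
def all_dictionary_words(text, dictionary):
    """Return every substring of text that is a dictionary word,
    ordered by start position, longest first."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in dictionary:
                found.append(piece)
    return found


# Toy dictionary: just the words from the ik_max_word example (an assumption).
TOY_DICT = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}
text = "中华人民共和国国歌"

fine = all_dictionary_words(text, TOY_DICT)   # fine-grained, ik_max_word style
coarse = ["中华人民共和国", "国歌"]             # copied from the ik_smart example

print(fine)
# A term query can only hit terms that were actually indexed.
print("共和国" in fine, "共和国" in coarse)
```

The fine-grained list contains 共和国, the coarse one does not, which is the reason a search for 共和国 only succeeds against the fine-grained index.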

 

3. Using the ik analyzer

Create an index whose text field uses the ik analyzer

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

Result:

{
  "acknowledged": true,
  "shards_acknowledged": true
}

Add data

POST /my_index/my_type/_bulk
{ "index": { "_id": "1"} }
{ "text": "男子偷上万元发红包求交女友 被抓获时仍然单身" }
{ "index": { "_id": "2"} }
{ "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
{ "index": { "_id": "3"} }
{ "text": "深圳女孩骑车逆行撞奔驰 遭索赔被吓哭(图)" }
{ "index": { "_id": "4"} }
{ "text": "女人对护肤品比对男票好?网友神怼" }
{ "index": { "_id": "5"} }
{ "text": "为什么国内的街道招牌用的都是红黄配?" }
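The bulk body is newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. If you generate such a body from code rather than typing it into the console, a small helper along these lines may be useful. `build_bulk_body` is a hypothetical name, and the sketch only builds the string; it does not send anything to es.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON body for the _bulk API from (doc_id, text) pairs."""
    lines = []
    for doc_id, text in docs:
        # Action line, then the document source, each on its own line.
        lines.append(json.dumps({"index": {"_id": doc_id}}, ensure_ascii=False))
        lines.append(json.dumps({"text": text}, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


docs = [
    ("1", "男子偷上万元发红包求交女友 被抓获时仍然单身"),
    ("4", "女人对护肤品比对男票好?网友神怼"),
]
body = build_bulk_body(docs)
print(body, end="")
```

`ensure_ascii=False` keeps the Chinese text readable in the generated body instead of escaping it as `\uXXXX` sequences; both forms are valid JSON.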

 

Check that the data was added (the response below comes from a search over the whole index, for example GET /my_index/my_type/_search):

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "5",
        "_score": 1,
        "_source": {
          "text": "为什么国内的街道招牌用的都是红黄配?"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": {
          "text": "女人对护肤品比对男票好?网友神怼"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "text": "男子偷上万元发红包求交女友 被抓获时仍然单身"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "text": "深圳女孩骑车逆行撞奔驰 遭索赔被吓哭(图)"
        }
      }
    ]
  }
}

Test how ik_max_word segments a sentence

GET /my_index/_analyze
{
  "text": "男子偷上万元发红包求交女友 被抓获时仍然单身",
  "analyzer": "ik_max_word"
}

Result:

{
  "tokens": [
    { "token": "男子", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "偷上", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "上万", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2 },
    { "token": "万元", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 },
    { "token": "万", "start_offset": 4, "end_offset": 5, "type": "CN_WORD", "position": 4 },
    { "token": "元", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 5 },
    { "token": "发红包", "start_offset": 6, "end_offset": 9, "type": "CN_WORD", "position": 6 },
    { "token": "发红", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 7 },
    { "token": "发", "start_offset": 6, "end_offset": 7, "type": "CN_WORD", "position": 8 },
    { "token": "红包", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 },
    { "token": "求", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 10 },
    { "token": "交", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 11 },
    { "token": "女友", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 12 },
    { "token": "抓获", "start_offset": 15, "end_offset": 17, "type": "CN_WORD", "position": 13 },
    { "token": "获", "start_offset": 16, "end_offset": 17, "type": "CN_WORD", "position": 14 },
    { "token": "时", "start_offset": 17, "end_offset": 18, "type": "CN_CHAR", "position": 15 },
    { "token": "仍然", "start_offset": 18, "end_offset": 20, "type": "CN_WORD", "position": 16 },
    { "token": "单身", "start_offset": 20, "end_offset": 22, "type": "CN_WORD", "position": 17 }
  ]
}
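To read the offsets in this result: start_offset and end_offset delimit the token inside the original string, and with ik_max_word neighbouring tokens may cover overlapping ranges. The Python sketch below checks both points against a few tokens copied from the result. `offsets_match` and `overlapping_pairs` are hypothetical helper names, and the check assumes every character is in the Basic Multilingual Plane, so that Python string indices line up with the offsets es reports.

```python
text = "男子偷上万元发红包求交女友 被抓获时仍然单身"

# (token, start_offset, end_offset) triples copied from the ik_max_word result.
tokens = [
    ("男子", 0, 2), ("偷上", 2, 4), ("上万", 3, 5), ("万元", 4, 6),
    ("发红包", 6, 9), ("红包", 7, 9), ("女友", 11, 13),
    ("抓获", 15, 17), ("仍然", 18, 20), ("单身", 20, 22),
]


def offsets_match(text, tokens):
    """True if every token equals the slice of text given by its offsets."""
    return all(text[start:end] == token for token, start, end in tokens)


def overlapping_pairs(tokens):
    """Return pairs of consecutive tokens whose character ranges overlap."""
    pairs = []
    for (a, _, a_end), (b, b_start, _) in zip(tokens, tokens[1:]):
        if b_start < a_end:
            pairs.append((a, b))
    return pairs


print(offsets_match(text, tokens))
print(overlapping_pairs(tokens))
```

The overlaps (for example 偷上 and 上万 sharing the character 上) are what "finest granularity" means in practice: the same characters are indexed under several candidate words.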

Test the same sentence with the standard analyzer for comparison

GET /my_index/_analyze
{
  "text": "男子偷上万元发红包求交女友 被抓获时仍然单身",
  "analyzer": "standard"
}

Result:

{
  "tokens": [
    { "token": "男", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "子", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "偷", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "上", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "万", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "元", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "发", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "红", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "包", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "求", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "交", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "女", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "友", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 },
    { "token": "被", "start_offset": 14, "end_offset": 15, "type": "<IDEOGRAPHIC>", "position": 13 },
    { "token": "抓", "start_offset": 15, "end_offset": 16, "type": "<IDEOGRAPHIC>", "position": 14 },
    { "token": "获", "start_offset": 16, "end_offset": 17, "type": "<IDEOGRAPHIC>", "position": 15 },
    { "token": "时", "start_offset": 17, "end_offset": 18, "type": "<IDEOGRAPHIC>", "position": 16 },
    { "token": "仍", "start_offset": 18, "end_offset": 19, "type": "<IDEOGRAPHIC>", "position": 17 },
    { "token": "然", "start_offset": 19, "end_offset": 20, "type": "<IDEOGRAPHIC>", "position": 18 },
    { "token": "单", "start_offset": 20, "end_offset": 21, "type": "<IDEOGRAPHIC>", "position": 19 },
    { "token": "身", "start_offset": 21, "end_offset": 22, "type": "<IDEOGRAPHIC>", "position": 20 }
  ]
}

Search using the ik analyzer

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": "16岁少女结婚好还是单身好?"
    }
  }
}

Result:

{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 3.603062,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 3.603062,
        "_source": {
          "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1.3862944,
        "_source": {
          "text": "女人对护肤品比对男票好?网友神怼"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2699054,
        "_source": {
          "text": "男子偷上万元发红包求交女友 被抓获时仍然单身"
        }
      }
    ]
  }
}
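Why these three documents? A match query analyzes the query string with the field's analyzer and, by default, returns every document that shares at least one term with it. The Python sketch below illustrates only that OR logic. The term sets are hand-written guesses at what ik_max_word might produce, not real analyzer output (only the terms for document 1 are taken from the _analyze result earlier), and scoring is not modelled at all.

```python
# Hypothetical query terms: a guess at how ik might split the query string.
query_terms = {"16", "岁", "少女", "结婚", "好", "还是", "单身"}

# Hypothetical, partial term sets per document id (assumptions, except doc "1",
# whose terms are a subset of the ik_max_word tokens shown earlier).
doc_terms = {
    "1": {"男子", "红包", "女友", "抓获", "仍然", "单身"},
    "2": {"16", "岁", "少女", "结婚", "22", "离婚", "法院", "拒绝"},
    "3": {"深圳", "女孩", "骑车", "逆行", "奔驰", "索赔"},
    "4": {"女人", "护肤品", "男票", "好", "网友"},
    "5": {"为什么", "国内", "街道", "招牌"},
}


def match_or(query_terms, doc_terms):
    """Return {doc_id: shared terms} for documents sharing at least one term."""
    hits = {}
    for doc_id, terms in doc_terms.items():
        shared = query_terms & terms
        if shared:
            hits[doc_id] = shared
    return hits


hits = match_or(query_terms, doc_terms)
for doc_id, shared in hits.items():
    print(doc_id, sorted(shared))
```

Under these assumed term sets, documents 1, 2 and 4 are the ones that share a term with the query, which agrees with the three hits in the response. The actual ranking comes from es relevance scoring, which this sketch leaves out.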

 

 
