Advanced, Lesson 29: Mastering IK Chinese Word Segmentation: Hands-On Installation and Use of the IK Chinese Analyzer

Article link: https://blog.csdn.net/qq_35524586/article/details/88543565

Up to now, as you have probably noticed, we have been working entirely in English. Was that fun? Not really.

We are Chinese, and the vast majority of the search applications we build are Chinese-language ones; we rarely build English ones.

standard: it cannot segment Chinese sensibly. It simply cuts the text into individual Chinese characters, for example 中国人 --> 中, 国, 人

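English still has to be learned, though. That is why we first used the material from the core knowledge part to learn es as what it natively is, an English-oriented search engine. Some topics are better explained in English, for example stemming: analyzed, played, students --> stemmer --> analyze, play, student. Some topics apply only to English and not really to Chinese.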

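Starting with this lesson things get more enjoyable, because everything is in the Chinese we are familiar with: the advanced topics, search, and aggregations are all done on Chinese text.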

In the search engine field, the mature and popular choice for Chinese is the ik analyzer.

中国人很喜欢吃油条

Standard (the English-oriented analyzer): 中 国 人 很 喜 欢 吃 油 条

ik: splits the sentence into words such as 中国人, 喜欢 and 油条 rather than into single characters
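To make the contrast concrete, here is a minimal Python sketch. It is not the Lucene or ik implementation: the hypothetical helper `standard_like_tokens` imitates only the single behaviour described above, namely that every Chinese character becomes its own token.

```python
def standard_like_tokens(text):
    """Imitate how the standard analyzer treats Chinese: one token per character.

    Assumption: only CJK ideographs matter here, so whitespace, punctuation
    and Latin words are deliberately dropped in this sketch.
    """
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


sentence = "中国人很喜欢吃油条"
tokens = standard_like_tokens(sentence)
print(tokens)       # nine single-character tokens
print(len(tokens))  # 9
```

With tokens like these, a search for 油条 can only be matched character by character (油 and 条), which is why a word-level analyzer such as ik is preferred for Chinese text.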

1. Installing the ik Chinese analyzer in elasticsearch

(1) git clone https://github.com/medcl/elasticsearch-analysis-ik

(2) git checkout tags/v5.2.0 (run inside the cloned elasticsearch-analysis-ik directory)

(3) mvn package

(4) Copy target/releases/elasticsearch-analysis-ik-5.2.0.zip into the es/plugins/ik directory

(5) Unzip elasticsearch-analysis-ik-5.2.0.zip inside es/plugins/ik

(6) Restart es

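2. ik analyzer basics

There are two analyzers; choose whichever fits your needs, although ik_max_word is the usual choice.

ik_max_word: splits the text at the finest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”, exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, it splits “中华人民共和国国歌” into “中华人民共和国,国歌”.

Think about it: if the index holds only 中华人民共和国 and 国歌, can a search for 共和国 find the document? It cannot, because 共和国 is not one of the indexed terms, which is why ik_max_word is the usual choice.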
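The difference between the two granularities can be sketched with a toy dictionary. This is only an illustration under stated assumptions: `TOY_DICT` is a hypothetical stand-in for ik's much larger lexicon, `fine_grained` simply lists every dictionary word it finds, and `coarse_grained` uses greedy forward maximum matching, which is a simplification and not ik's actual disambiguation logic. The real ik_max_word also emits single characters and a few other terms that this sketch omits.

```python
# Hypothetical toy lexicon; ik ships a far larger built-in dictionary.
TOY_DICT = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def fine_grained(text, words=TOY_DICT):
    """Emit every dictionary word found at every start position (ik_max_word-like)."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):  # longest candidates first
            if text[start:end] in words:
                found.append(text[start:end])
    return found


def coarse_grained(text, words=TOY_DICT):
    """Greedy forward maximum matching (a simple stand-in for ik_smart)."""
    found, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                found.append(text[start:end])
                start = end  # continue after the matched word
                break
        else:
            start += 1  # no dictionary word starts here; skip one character
    return found


text = "中华人民共和国国歌"
max_word_terms = fine_grained(text)
smart_terms = coarse_grained(text)
print(max_word_terms)
print(smart_terms)  # ['中华人民共和国', '国歌']
print("共和国" in max_word_terms, "共和国" in smart_terms)  # True False
```

The last line shows the trade-off: the fine-grained term list contains 共和国, so a search for it can match, while the coarse-grained list does not.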

3. Using the ik analyzer

Create an index that uses the ik analyzer

PUT /my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

Result:

{
  "acknowledged": true,
  "shards_acknowledged": true
}

Add data

POST /my_index/my_type/_bulk
{ "index": { "_id": "1"} }
{ "text": "男子偷上万元发红包求交女友被抓获时仍然单身" }
{ "index": { "_id": "2"} }
{ "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
{ "index": { "_id": "3"} }
{ "text": "深圳女孩骑车逆行撞奔驰遭索赔被吓哭(图)" }
{ "index": { "_id": "4"} }
{ "text": "女人对护肤品比对男票好？网友神怼" }
{ "index": { "_id": "5"} }
{ "text": "为什么国内的街道招牌用的都是红黄配？" }
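The _bulk body is newline-delimited JSON: each document takes one action line followed by one source line, and the body has to end with a newline. If you generate it from a program rather than typing it by hand, it can be assembled roughly as in the sketch below. The helper name `build_bulk_body` is hypothetical, and only two of the five documents are repeated to keep the example short.

```python
import json


def build_bulk_body(docs):
    """Return a _bulk request body: an action line plus a source line per document.

    `docs` maps a document id to its source dict. The result is
    newline-delimited JSON and ends with a newline.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_id": doc_id}}))
        # ensure_ascii=False keeps the Chinese text readable in the body
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"


docs = {
    "1": {"text": "男子偷上万元发红包求交女友被抓获时仍然单身"},
    "5": {"text": "为什么国内的街道招牌用的都是红黄配？"},
}
body = build_bulk_body(docs)
print(body, end="")
```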

Check that the data was added successfully (for example, with a search that matches all documents):

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "5",
        "_score": 1,
        "_source": { "text": "为什么国内的街道招牌用的都是红黄配？" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": { "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1,
        "_source": { "text": "女人对护肤品比对男票好？网友神怼" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": { "text": "男子偷上万元发红包求交女友被抓获时仍然单身" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": { "text": "深圳女孩骑车逆行撞奔驰遭索赔被吓哭(图)" }
      }
    ]
  }
}

Test tokenization with ik_max_word

GET /my_index/_analyze

{

"text": "男子偷上万元发红包求交女友被抓获时仍然单身",

"analyzer": "ik_max_word"

}

Result:

{
  "tokens": [
    { "token": "男子", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0 },
    { "token": "偷上", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "上万", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2 },
    { "token": "万元", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 },
    { "token": "万", "start_offset": 4, "end_offset": 5, "type": "CN_WORD", "position": 4 },
    { "token": "元", "start_offset": 5, "end_offset": 6, "type": "CN_CHAR", "position": 5 },
    { "token": "发红包", "start_offset": 6, "end_offset": 9, "type": "CN_WORD", "position": 6 },
    { "token": "发红", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 7 },
    { "token": "发", "start_offset": 6, "end_offset": 7, "type": "CN_WORD", "position": 8 },
    { "token": "红包", "start_offset": 7, "end_offset": 9, "type": "CN_WORD", "position": 9 },
    { "token": "求", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 10 },
    { "token": "交", "start_offset": 10, "end_offset": 11, "type": "CN_CHAR", "position": 11 },
    { "token": "女友", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 12 },
    { "token": "抓获", "start_offset": 15, "end_offset": 17, "type": "CN_WORD", "position": 13 },
    { "token": "获", "start_offset": 16, "end_offset": 17, "type": "CN_WORD", "position": 14 },
    { "token": "时", "start_offset": 17, "end_offset": 18, "type": "CN_CHAR", "position": 15 },
    { "token": "仍然", "start_offset": 18, "end_offset": 20, "type": "CN_WORD", "position": 16 },
    { "token": "单身", "start_offset": 20, "end_offset": 22, "type": "CN_WORD", "position": 17 }
  ]
}
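To read this response, it helps to remember that start_offset and end_offset are character positions in the analyzed text, and that ik_max_word tokens may overlap. The sketch below checks both points using the first four tokens copied from the response above; the helper names `slice_by_offsets` and `overlapping_pairs` are hypothetical.

```python
text = "男子偷上万元发红包求交女友被抓获时仍然单身"

# First four tokens copied from the ik_max_word response.
tokens = [
    {"token": "男子", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "偷上", "start_offset": 2, "end_offset": 4, "position": 1},
    {"token": "上万", "start_offset": 3, "end_offset": 5, "position": 2},
    {"token": "万元", "start_offset": 4, "end_offset": 6, "position": 3},
]


def slice_by_offsets(source, token):
    """Return the substring that a token's offsets point at in the analyzed text."""
    return source[token["start_offset"]:token["end_offset"]]


def overlapping_pairs(token_list):
    """Return pairs of consecutive tokens whose character ranges overlap."""
    return [
        (a["token"], b["token"])
        for a, b in zip(token_list, token_list[1:])
        if b["start_offset"] < a["end_offset"]
    ]


for tok in tokens:
    # Each token equals the slice of the text that its offsets describe.
    print(tok["token"], slice_by_offsets(text, tok))

overlaps = overlapping_pairs(tokens)
print(overlaps)  # [('偷上', '上万'), ('上万', '万元')]
```

The overlapping pairs are what "finest granularity" means in practice: the same characters are indexed under several candidate words.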

Test with the English (standard) analyzer

GET /my_index/_analyze

{

"text": "男子偷上万元发红包求交女友被抓获时仍然单身",

"analyzer": "standard"

}

Result:

{
  "tokens": [
    { "token": "男", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "子", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "偷", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "上", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "万", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "元", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "发", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "红", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "包", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 8 },
    { "token": "求", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 9 },
    { "token": "交", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 10 },
    { "token": "女", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 11 },
    { "token": "友", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 12 },
    { "token": "被", "start_offset": 14, "end_offset": 15, "type": "<IDEOGRAPHIC>", "position": 13 },
    { "token": "抓", "start_offset": 15, "end_offset": 16, "type": "<IDEOGRAPHIC>", "position": 14 },
    { "token": "获", "start_offset": 16, "end_offset": 17, "type": "<IDEOGRAPHIC>", "position": 15 },
    { "token": "时", "start_offset": 17, "end_offset": 18, "type": "<IDEOGRAPHIC>", "position": 16 },
    { "token": "仍", "start_offset": 18, "end_offset": 19, "type": "<IDEOGRAPHIC>", "position": 17 },
    { "token": "然", "start_offset": 19, "end_offset": 20, "type": "<IDEOGRAPHIC>", "position": 18 },
    { "token": "单", "start_offset": 20, "end_offset": 21, "type": "<IDEOGRAPHIC>", "position": 19 },
    { "token": "身", "start_offset": 21, "end_offset": 22, "type": "<IDEOGRAPHIC>", "position": 20 }
  ]
}

Search using the ik analyzer

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "text": "16岁少女结婚好还是单身好？"
    }
  }
}

Result:

{
  "took": 18,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 3.603062,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 3.603062,
        "_source": { "text": "16岁少女为结婚“变”22岁 7年后想离婚被法院拒绝" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 1.3862944,
        "_source": { "text": "女人对护肤品比对男票好？网友神怼" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.2699054,
        "_source": { "text": "男子偷上万元发红包求交女友被抓获时仍然单身" }
      }
    ]
  }
}
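The same search can be issued from a program. The sketch below only builds the HTTP request using Python's standard library and does not send it. Several things here are assumptions rather than part of the lesson: the address http://localhost:9200, the helper name `build_match_search`, the use of the index-level /my_index/_search path instead of /my_index/my_type/_search, and POST instead of GET (the _search endpoint accepts both).

```python
import json
import urllib.request


def build_match_search(base_url, index, field, query_text):
    """Build (but do not send) a _search request carrying a match query."""
    body = {"query": {"match": {field: query_text}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search accepts POST as well as GET
    )


request = build_match_search(
    "http://localhost:9200",  # assumed address of a local node
    "my_index",
    "text",
    "16岁少女结婚好还是单身好？",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To run the search against a live node, uncomment:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```

Because the match query text is analyzed with the analyzer of the text field, the query is split into words by ik before matching, which is how documents that share only some of those words still appear in the results.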