1. Comparison
Using the built-in analyzer
The built-in analyzer splits a Chinese sentence into individual characters, which is of little use for search.
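For example, running the default standard analyzer over a phrase (the phrase here is just an illustration) returns one token per Chinese character:

GET _analyze
{
  "analyzer": "standard",
  "text": "中华人民共和国国歌"
}

Each character comes back as its own token, so a whole word like 国歌 can never be matched as a unit.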
Using the IK analyzer
IK, by contrast, segments text into meaningful Chinese words, which makes it the usual choice for Chinese full-text search.
2. Installing the IK Analyzer
Resources
https://github.com/medcl/elasticsearch-analysis-ik/releases
Link: https://pan.baidu.com/s/1dTzBN6fr1ieks25qDqA26A
Extraction code: 0cc3
Create an ik directory under the plugins directory of the ES installation:
mkdir /usr/local/es/elasticsearch-7.2.0/plugins/ik
Install the unzip command:
yum -y install unzip
Unzip the plugin archive into that directory:
unzip elasticsearch-analysis-ik-7.2.0.zip -d /usr/local/es/elasticsearch-7.2.0/plugins/ik
Restart ES to load the plugin.
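To check that the plugin was picked up, you can list the installed plugins (the path assumes the install directory used above); the output should include the ik plugin:

/usr/local/es/elasticsearch-7.2.0/bin/elasticsearch-plugin list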
3. Usage
IK provides two analyzers:
ik_max_word: the finest-grained segmentation; it extracts as many words as possible from the text
ik_smart: the coarsest-grained segmentation; text already split into one word is not reused by another word
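The token list below is the result of running ik_max_word over 中华人民共和国国歌; the request that produces it looks roughly like this (no index is needed, since the plugin registers the analyzers globally):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}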
{
  "tokens": [
    {
      "token": "中华人民共和国",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中华人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中华",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "华人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和国",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和国",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "国",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "国歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
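For comparison, the same request with "analyzer": "ik_smart" does the coarsest split; on this phrase it typically yields just 中华人民共和国 and 国歌:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "中华人民共和国国歌"
}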
4. Creating an Index with the IK Analyzer
Create the index
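The body below defines an analyzer named ik based on ik_max_word and applies ik_max_word to the username field. In the Kibana console it is sent as an index-creation request; the index name ik_index is only an assumption:

PUT /ik_index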
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "username": { "type": "text", "analyzer": "ik_max_word" }
    }
  }
}
Add data
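For example, a document can be indexed like this (index name, id, and field value are assumptions for illustration):

PUT /ik_index/_doc/1
{
  "username": "朴国昌"
}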
Query
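A match query on the username field is analyzed with the same IK analyzer (again, names and values are assumptions):

GET /ik_index/_search
{
  "query": {
    "match": {
      "username": "朴国昌"
    }
  }
}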
5. Custom Dictionary
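Without a custom dictionary, words that are not in IK's built-in dictionary, such as the name 朴国昌, are broken into single characters. The output below comes from a request along these lines (the exact sentence is inferred from the offsets):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好,我朴国昌"
}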
{
  "tokens": [
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "我",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "朴",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "国",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "昌",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 4
    }
  ]
}
Create a custom directory in the IK plugin's config directory (where IKAnalyzer.cfg.xml lives):
mkdir custom
custom/myext.dic: extension dictionary for custom words
custom/myext_stopword.dic: extension stopword dictionary
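Each .dic file holds one word per line. Judging from the output after the restart, myext.dic would contain the custom words to recognize, for example:

custom/myext.dic:
史珍香
朴国昌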
vim IKAnalyzer.cfg.xml
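In IKAnalyzer.cfg.xml, point the ext_dict and ext_stopwords entries at the new files; a minimal sketch, with only these two entries changed:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/myext.dic</entry>
    <entry key="ext_stopwords">custom/myext_stopword.dic</entry>
</properties>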
Restart ES
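After the restart, the same kind of analyze request now recognizes the custom words (the sentence is again inferred from the offsets below):

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "你好史珍香,我朴国昌"
}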
{
  "tokens": [
    {
      "token": "你好",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "史珍香",
      "start_offset": 2,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "朴国昌",
      "start_offset": 7,
      "end_offset": 10,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}