Integrating the IK Analyzer with Elasticsearch

2401_86359166

Published 2024-09-05 00:53:15

Tags: elasticsearch, big data, search engine

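IK Analyzer 3.0 has the following features:

1. It uses its own "forward-iteration, finest-granularity segmentation algorithm" and can process about 600,000 characters per second.
2. It uses a multi-sub-processor analysis mode that handles English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese quantifiers, Roman numerals, scientific notation), and Chinese vocabulary (including personal names and place names).
3. Mixed Chinese-English text is not handled very well: it is cumbersome to process and requires an additional query. On the other hand, dictionary storage is optimized, with support for user-defined entries and a smaller memory footprint.
4. Users can extend the dictionary with their own definitions.
5. It provides IKQueryParser, a query analyzer optimized for Lucene full-text search. It uses an ambiguity-analysis algorithm to optimize the permutations of query keywords, which can greatly improve Lucene's hit rate.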

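Integrating the IK Analyzer into Elasticsearch

Installing the IK Analyzer

1) Download: https://github.com/medcl/elasticsearch-analysis-ik/releases (pick the release that matches your Elasticsearch version)

2) Unzip the archive, copy the extracted elasticsearch folder into elasticsearch\plugins, and rename the folder to analysis-ik

3) Restart Elasticsearch; the IK analyzer is loaded on startup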
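
Step 2 can also be scripted. The following is a minimal sketch, assuming the release was downloaded as a zip archive whose contents sit under a top-level elasticsearch folder, as described in step 2. The function name install_ik and its arguments are hypothetical and not part of the plugin.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_ik(zip_path, es_home):
    """Unpack the IK release and place it at <es_home>/plugins/analysis-ik.

    Assumes the archive contains a top-level ``elasticsearch`` folder, as
    in step 2; the function name and arguments are illustrative only.
    """
    target = Path(es_home) / "plugins" / "analysis-ik"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        extracted = Path(tmp) / "elasticsearch"
        if not extracted.is_dir():
            raise FileNotFoundError("no 'elasticsearch' folder in the archive")
        target.parent.mkdir(parents=True, exist_ok=True)
        # Copying under the new name covers both the copy and the rename.
        shutil.copytree(extracted, target)
    return target
```

After the function returns, Elasticsearch still has to be restarted as in step 3 before the analyzer becomes available.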

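Testing the IK Analyzer

IK provides two analyzers, ik_smart and ik_max_word: ik_smart produces the coarsest segmentation (the fewest tokens), while ik_max_word produces the finest-grained segmentation.

1) Coarsest segmentation: enter the following address in the browser's address bar

Request: GET http://localhost:9200/_analyze?analyzer=ik_smart&pretty=true&text=IKAnalyzer是一个中文分词工具包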

{
  "tokens": [
    {
      "token": "ikanalyzer",
      "start_offset": 0,
      "end_offset": 10,
      "type": "ENGLISH",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 10,
      "end_offset": 11,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "一个",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "中文",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "分词",
      "start_offset": 15,
      "end_offset": 17,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "工具包",
      "start_offset": 17,
      "end_offset": 20,
      "type": "CN_WORD",
      "position": 5
    }
  ]
}
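
The same analysis can be requested from a script instead of the browser. The sketch below sends the analyzer name and the text in a JSON request body, which the _analyze API accepts; newer Elasticsearch versions (6.0 and later, to the best of my knowledge) no longer accept them as query-string parameters. The helper names build_analyze_request and analyze are hypothetical, and the host and port are the ones assumed throughout this article.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="ik_smart",
                          base_url="http://localhost:9200"):
    """Build (but do not send) an _analyze request with a JSON body."""
    body = json.dumps({"analyzer": analyzer, "text": text},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_analyze?pretty=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, analyzer="ik_smart"):
    """Send the request and return the token strings from the response."""
    request = build_analyze_request(text, analyzer)
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [entry["token"] for entry in result["tokens"]]
```

Calling analyze("IKAnalyzer是一个中文分词工具包") requires a running node with the plugin installed; build_analyze_request alone can be used to inspect the request without any network access.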

2) Finest-grained segmentation: enter the following address in the browser's address bar

Request: GET http://localhost:9200/_analyze?analyzer=ik_max_word&pretty=true&text=IKAnalyzer是一个中文分词工具包

{
  "tokens": [
    {
      "token": "ikanalyzer",
      "start_offset": 0,
      "end_offset": 10,
      "type": "ENGLISH",
      "position": 0
    },
    {
      "token": "是",
      "start_offset": 10,
      "end_offset": 11,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "一个",
      "start_offset": 11,
      "end_offset": 13,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "一",
      "start_offset": 11,
      "end_offset": 12,
      "type": "TYPE_CNUM",
      "position": 3
    },
    {
      "token": "个中",
      "start_offset": 12,
      "end_offset": 14,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "个",
      "start_offset": 12,
      "end_offset": 13,
      "type": "COUNT",
      "position": 5
    },
    {
      "token": "中文",
      "start_offset": 13,
      "end_offset": 15,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "分词",
      "start_offset": 15,
      "end_offset": 17,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "工具包",
      "start_offset": 17,
      "end_offset": 20,
      "type": "CN_WORD",
      "position": 8
    },
    {
      "token": "工具",
      "start_offset": 17,
      "end_offset": 19,
      "type": "CN_WORD",
      "position": 9
    },
    {
      "token": "包",
      "start_offset": 19,
      "end_offset": 20,
      "type": "CN_CHAR",
      "position": 10
    }
  ]
}
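
To make the difference between the two analyzers concrete, the short sketch below compares the two token lists returned for the sample sentence. The lists are copied from the two responses in this article; the helper names are hypothetical, and the comparison says nothing about other inputs.

```python
# Tokens copied from the ik_smart and ik_max_word responses for
# "IKAnalyzer是一个中文分词工具包", in order of appearance.
SMART = ["ikanalyzer", "是", "一个", "中文", "分词", "工具包"]
MAX_WORD = ["ikanalyzer", "是", "一个", "一", "个中", "个",
            "中文", "分词", "工具包", "工具", "包"]


def extra_tokens(coarse, fine):
    """Return the tokens present in ``fine`` but not in ``coarse``, in order."""
    seen = set(coarse)
    return [token for token in fine if token not in seen]


def is_superset(coarse, fine):
    """True if every token of ``coarse`` also appears in ``fine``."""
    return set(coarse) <= set(fine)


print(extra_tokens(SMART, MAX_WORD))  # ['一', '个中', '个', '工具', '包']
print(is_superset(SMART, MAX_WORD))   # True
```

For this sentence, ik_max_word keeps every ik_smart token and adds the overlapping sub-words, which is why it tends to suit indexing while ik_smart yields a shorter token stream.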

Modifying the index mapping:

1. Create the blog index, using the ik_max_word analyzer

Request: PUT http://localhost:9200/blog

{
  "mappings": {
    "hello": {
      "properties": {
        "id": {
          "type": "long",
          "store": true,
          "index": "not_analyzed"
        },
        "title": {
          "type": "text",
          "store": true,
          "index": "analyzed",
          "analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "store": true,
          "index": "analyzed",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

Test result in Postman: (screenshot not included)
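
The mapping in this article names a type (hello) and uses string values for index, which matches older Elasticsearch releases. As a hedged sketch only: assuming Elasticsearch 7 or later, where mappings are written without a type name and index is a boolean that defaults to true, the same three fields might be declared as follows. The helper names ik_text_field and build_blog_mapping are hypothetical.

```python
import json


def ik_text_field(analyzer="ik_max_word"):
    """A stored, analyzed text field that uses an IK analyzer."""
    return {"type": "text", "store": True, "analyzer": analyzer}


def build_blog_mapping():
    """Typeless mapping for the blog index (assumes Elasticsearch 7+)."""
    return {
        "mappings": {
            "properties": {
                # Numeric fields are indexed as exact values by default.
                "id": {"type": "long", "store": True},
                "title": ik_text_field(),
                "content": ik_text_field(),
            }
        }
    }


# Body to send with PUT http://localhost:9200/blog
print(json.dumps(build_blog_mapping(), indent=2))
```

With such a typeless index, documents would be indexed at /blog/_doc/{id} and searched at /blog/_search rather than through the /blog/hello/ paths used in this article.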

2. Create documents

Request: POST http://localhost:9200/blog/hello/{id}, sending one request per document (ids 1, 2, and 3). The type in the URL must be hello, the type defined in the mapping, so that the documents use the ik_max_word fields and are found by searches against blog/hello.

{
  "id": 1,
  "title": "IK分词器测试",
  "content": "IK提供了两个分词算法ik_smart 和 ik_max_word,其中 ik_smart 为最少切分，ik_max_word为最细粒度划分"
}

{
  "id": 2,
  "title": "ElasticSearch是一个基于Lucene的搜索服务器",
  "content": "它提供了一个分布式多用户能力的全文搜索引擎，基于RESTfulweb接口。Elasticsearch是用Java开发的，并作为Apache许可条款下的开放源码发布，是当前流行的企业级搜索引擎。设计用于云计算中，能够达到实时搜索，稳定，可靠，快速，安装使用方便。"
}

{
  "id": 3,
  "title": "ElasticSearch概述",
  "content": "Elasticsearch是面向文档(document oriented)的，这意味着它可以存储整个对象或文档(document)。然而它不仅仅是存储，还会索引(index)每个文档的内容使之可以被搜索。在Elasticsearch中，你可以对文档（而非成行成列的数据）进行索引、搜索、排序、过滤。"
}
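
Posting three documents by hand is repetitive, so the requests can be generated in a loop. This is a minimal sketch, assuming the documents go into the blog index under the type hello defined in the mapping, one request per id. The helper name build_index_requests is hypothetical, and the content of documents 2 and 3 is shortened to the first sentence to keep the example small.

```python
import json
import urllib.request

# Documents from this article; contents of 2 and 3 are shortened.
DOCS = [
    {"id": 1, "title": "IK分词器测试",
     "content": "IK提供了两个分词算法ik_smart 和 ik_max_word,其中 ik_smart 为最少切分，ik_max_word为最细粒度划分"},
    {"id": 2, "title": "ElasticSearch是一个基于Lucene的搜索服务器",
     "content": "它提供了一个分布式多用户能力的全文搜索引擎，基于RESTfulweb接口。"},
    {"id": 3, "title": "ElasticSearch概述",
     "content": "Elasticsearch是面向文档(document oriented)的，这意味着它可以存储整个对象或文档(document)。"},
]


def build_index_requests(docs, base_url="http://localhost:9200",
                         index="blog", doc_type="hello"):
    """Build one POST request per document, addressed by the document id."""
    requests = []
    for doc in docs:
        body = json.dumps(doc, ensure_ascii=False).encode("utf-8")
        requests.append(urllib.request.Request(
            f"{base_url}/{index}/{doc_type}/{doc['id']}",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        ))
    return requests


for request in build_index_requests(DOCS):
    print(request.get_method(), request.full_url)
```

The loop only prints the requests; passing each one to urllib.request.urlopen would send it to a running node.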

3. Test a term query

Request: POST http://localhost:9200/blog/hello/_search

Request body:

{
  "query": {
    "term": {
      "content": "搜索"
    }
  }
}
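
A term query does not analyze its value: the value is looked up as-is among the terms the analyzer produced at index time. The toy sketch below illustrates that lookup using the ik_max_word tokens returned for the sample sentence in the _analyze test earlier in this article. It is a simplification for illustration, not how Elasticsearch is implemented, and the helper names are hypothetical.

```python
# Terms that ik_max_word produced for "IKAnalyzer是一个中文分词工具包"
# in the _analyze output shown earlier in this article.
INDEXED_TERMS = {"ikanalyzer", "是", "一个", "一", "个中", "个",
                 "中文", "分词", "工具包", "工具", "包"}


def build_term_query(field, term):
    """Build the request body for a term query on one field."""
    return {"query": {"term": {field: term}}}


def term_matches(term, indexed_terms):
    """A term query matches only if the exact term was indexed."""
    return term in indexed_terms


print(build_term_query("content", "搜索"))
print(term_matches("工具", INDEXED_TERMS))        # True: "工具" is a token
print(term_matches("中文分词", INDEXED_TERMS))    # False: indexed as two tokens
print(term_matches("IKAnalyzer", INDEXED_TERMS))  # False: token is lowercased
```

This is why the choice of analyzer in the mapping matters for term queries: a value matches only if the analyzer emitted exactly that token when the document was indexed.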