1. Installing the IK analysis plugin
In a browser, open: https://github.com/medcl/elasticsearch-analysis-ik
Pick a version from the Releases page; the releases are pre-built packages that only need to be unzipped before use.
Find version 5.4.1 and download it; make sure to pick the first asset, elasticsearch-analysis-ik-5.4.1.zip.
Go into the elasticsearch plugins directory, unzip the IK plugin and copy it in, then restart elasticsearch.
Then switch to the root user and re-grant ownership of the installation to the elastic user:
chown -R elastic /home/elasticsearch/elasticsearch-5.4.1
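A minimal shell sketch of the download-and-unzip steps, assuming Elasticsearch lives in /home/elasticsearch/elasticsearch-5.4.1 (the release URL and paths are examples taken from the project's Releases page; adjust them to your environment), with the chown above run as root afterwards:
cd /home/elasticsearch/elasticsearch-5.4.1/plugins
# download the pre-built 5.4.1 release of the IK plugin (example URL)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.1/elasticsearch-analysis-ik-5.4.1.zip
# unzip it into its own directory under plugins/ and remove the archive
unzip elasticsearch-analysis-ik-5.4.1.zip -d elasticsearch-analysis-ik-5.4.1
rm elasticsearch-analysis-ik-5.4.1.zip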
2. Restarting elasticsearch
ps -ef | grep elastic
kill -9 <the ES process id from the previous command>
Restart ES: ./bin/elasticsearch &
The startup log shows that the IK analyzer plugin has been loaded.
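Note that Elasticsearch refuses to start as the root user, so the restart has to be done as the elastic user; the -d option (a standard Elasticsearch flag) runs it as a daemon instead of backgrounding it with &. A sketch, assuming the same install path as above:
su - elastic
cd /home/elasticsearch/elasticsearch-5.4.1
./bin/elasticsearch -d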
3. Testing the IK analyzer
Create the test index:
PUT test
Fine-grained tokenization of the test index with ik_max_word:
GET test/_analyze?analyzer=ik_max_word
{
  "text": "武汉市长江大桥"
}
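If the query-string form of _analyze does not work on your version, the analyzer can also be passed in the request body instead (supported on 5.x):
GET test/_analyze
{
  "analyzer": "ik_max_word",
  "text": "武汉市长江大桥"
}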
Tokenization result:
{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "武汉",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "汉",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "市长",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "长江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "江",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "大桥",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "桥",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}
Coarse-grained tokenization of the test index with ik_smart:
GET test/_analyze?analyzer=ik_smart
{
  "text": "武汉市长江大桥"
}
Tokenization result:
{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
4. Running the test case from the IK plugin documentation
Run the following commands:
// create an index named index
PUT /index
// create the _mapping for the fulltext type
POST /index/fulltext/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}
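To confirm that the mapping was applied, it can be read back with the standard mapping API:
GET /index/_mapping/fulltext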
// index four documents
POST /index/fulltext/1
{"content":"美国留给伊拉克的是个烂摊子吗"}
POST /index/fulltext/2
{"content":"公安部:各地校车将享最高路权"}
POST /index/fulltext/3
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}
POST /index/fulltext/4
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}
// search for documents containing 中国
POST /index/fulltext/_search
{
  "query": { "match": { "content": "中国" } },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}
Query result: two documents match.
{
  "took": 96,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5347766,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.5347766,    // relevance score
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.27638745,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      }
    ]
  }
}
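The same search can also be run from the shell instead of Kibana, assuming Elasticsearch listens on localhost:9200 (a sketch, not tied to this setup):
curl -XPOST 'http://localhost:9200/index/fulltext/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": { "match": { "content": "中国" } }
}'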
5. Extending the dictionary
The steps: add a new dictionary file newdic.dic under the custom subdirectory of /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1/config, point IKAnalyzer.cfg.xml at it (the entry is shown further below), and restart elasticsearch.
Before the extension, run:
GET test/_analyze?analyzer=ik_smart
{
  "text": "厉害了我的我的哥"
}
GET test/_analyze?analyzer=ik_smart
{
  "text": "蓝瘦香菇"
}
Output:
Result 1:
{
  "tokens": [
    {
      "token": "厉",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "害了",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "我",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "的哥",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
Result 2:
{
  "tokens": [
    {
      "token": "蓝",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "瘦",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "香菇",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
To perform the extension: in the custom directory, create newdic.dic with vi.
Add the two new words, one per line, then save and exit; a sketch of the file is shown below.
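Contents of newdic.dic, inferred from the post-extension results further below (plain UTF-8 text, one word per line):
厉害了我的哥
蓝瘦香菇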
Register the new dictionary file in IKAnalyzer.cfg.xml:
<entry key="ext_dict">custom/newdic.dic;custom/mydict.dic;custom/single_word_low_freq.dic</entry>
Restart Elasticsearch and Kibana.
Tokenization results after the extension (re-running the _analyze requests):
Result 1:
{
  "tokens": [
    {
      "token": "厉害了我的哥",
      "start_offset": 0,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}
Result 2:
{
  "tokens": [
    {
      "token": "蓝瘦香菇",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}