Elasticsearch (Part 4): The IK Analysis Plugin


1. Installing the IK Analysis Plugin

Open the plugin's GitHub page in a browser: https://github.com/medcl/elasticsearch-analysis-ik

On the releases page, each release is a pre-built package that only needs to be unzipped before use.

Find version 5.4.1 (the plugin version must match the Elasticsearch version) and download the first asset, elasticsearch-analysis-ik-5.4.1.zip.

Go to the Elasticsearch plugins directory, unzip the IK plugin and copy it in, then restart Elasticsearch.

Then switch to the root user and re-grant ownership of the installation to the elastic user:

chown -R elastic /home/elasticsearch/elasticsearch-5.4.1 
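Put together, a minimal shell sketch of the installation, assuming Elasticsearch lives at /home/elasticsearch/elasticsearch-5.4.1 as above (the exact download URL should be checked against the releases page):

# Download the pre-built release
cd /tmp
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.4.1/elasticsearch-analysis-ik-5.4.1.zip

# Unzip into its own directory under plugins/
mkdir -p /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1
unzip elasticsearch-analysis-ik-5.4.1.zip -d /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1

# Re-grant ownership to the elastic user (run as root)
chown -R elastic /home/elasticsearch/elasticsearch-5.4.1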

2. Restarting Elasticsearch

ps -ef | grep elastic

Kill the Elasticsearch process found above: kill -9 <pid>

Restart Elasticsearch:  ./bin/elasticsearch &

The startup log shows that the IK analyzer has been loaded.
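To double-check that the plugin was picked up, list the installed plugins from the Kibana console; analysis-ik should appear in the output:

// List installed plugins on every node
GET _cat/plugins?v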

3. Testing the IK Analyzer

Create the test index:

 PUT test

Fine-grained analysis of the test index with ik_max_word:

GET test/_analyze?analyzer=ik_max_word
{
  "text":"武汉市长江大桥"
}

Analysis result:

{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "武汉",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "汉",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "市长",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "长江",
      "start_offset": 3,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "江",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "大桥",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "桥",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}

 

Coarse-grained analysis of the test index with ik_smart:

GET test/_analyze?analyzer=ik_smart
{
  "text":"武汉市长江大桥"
}

Analysis result:

{
  "tokens": [
    {
      "token": "武汉市",
      "start_offset": 0,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "长江大桥",
      "start_offset": 3,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}
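The same requests can also be sent with curl outside of Kibana. A minimal sketch, assuming Elasticsearch is listening on the default localhost:9200 (the analyzer may be passed in the request body instead of as a query parameter):

curl -XGET 'http://localhost:9200/test/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "ik_smart",
  "text": "武汉市长江大桥"
}'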

4. Running the Test Case from the IK Plugin's README

Run the following commands:

// Create an index named index
PUT /index 
  
// Create a _mapping for the fulltext type
POST /index/fulltext/_mapping      
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}

// Index four sample documents
POST /index/fulltext/1
{"content":"美国留给伊拉克的是个烂摊子吗"}

POST /index/fulltext/2
{"content":"公安部:各地校车将享最高路权"}

POST /index/fulltext/3
{"content":"中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"}

POST /index/fulltext/4
{"content":"中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"}

// Search for documents containing 中国, with highlighting
POST /index/fulltext/_search
{
  "query": { "match": { "content": "中国" } },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}

Search result: two documents match.

{
  "took": 96,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5347766,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 0.5347766,  //评分
        "_source": {
          "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight": {
          "content": [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.27638745,
        "_source": {
          "content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
        },
        "highlight": {
          "content": [
            "中韩渔警冲突调查:韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      }
    ]
  }
}

 

5. Extending the Dictionary

Add a new dictionary file, newdic.dic, under /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1/config (the custom subdirectory is used below).

Specify the location of the custom dictionary in IKAnalyzer.cfg.xml.

Restart Elasticsearch.

Before extending the dictionary, run:

GET test/_analyze?analyzer=ik_smart
{
  "text":"厉害了我的我的哥"
}

GET test/_analyze?analyzer=ik_smart
{
  "text":"蓝瘦香菇"
}

Output:

Result 1:
{
  "tokens": [
    {
      "token": "厉",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "害了",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "我",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "我",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "的哥",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}
Result 2:
{
  "tokens": [
    {
      "token": "蓝",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "瘦",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "香菇",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}

 

To extend the dictionary, open the new file in the custom directory: vi newdic.dic

Add two lines (one word per line), then save and quit; a shell sketch of this step follows below.
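A sketch of the same step from the shell, assuming the plugin config path used above and taking the two test phrases as the new entries (dictionary files must be UTF-8 encoded, one word per line):

cd /home/elasticsearch/elasticsearch-5.4.1/plugins/elasticsearch-analysis-ik-5.4.1/config
mkdir -p custom
# Append the two new words, one per line
cat >> custom/newdic.dic <<'EOF'
厉害了我的哥
蓝瘦香菇
EOF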

Then register the new dictionary file in IKAnalyzer.cfg.xml:

 <entry key="ext_dict">custom/newdic.dic;custom/mydict.dic;custom/single_word_low_freq.dic</entry>
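For context, a sketch of what the whole IKAnalyzer.cfg.xml might look like after the change (the other entries are assumed to be the defaults shipped with the plugin; only the ext_dict value needs editing):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Semicolon-separated custom dictionaries, relative to the plugin config directory -->
    <entry key="ext_dict">custom/newdic.dic;custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Custom stop-word dictionaries -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
</properties>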

Restart Elasticsearch and Kibana.

Analysis results after extending the dictionary:

Result 1:
{
  "tokens": [
    {
      "token": "厉害了我的哥",
      "start_offset": 0,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}

Result 2:
{
  "tokens": [
    {
      "token": "蓝瘦香菇",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    }
  ]
}


 

 

 
