Elasticsearch Guide Series (3): Integrating the IK Analyzer

yinni11

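Elasticsearch ships with many built-in analyzers, such as standard, english, and chinese. The standard analyzer naively splits Chinese text one character at a time, so it applies broadly but is imprecise. The english analyzer is smarter about English: it handles singular and plural forms and letter case, and filters stopwords (such as "the"). The chinese analyzer performs poorly. This post covers three things: installing the IK Chinese analyzer, comparing the output of different analyzers, and arriving at a reasonably good configuration.

The IK Analysis plugin integrates the Lucene IK analyzer (http://code.google.com/p/ik-analyzer/) into Elasticsearch and supports custom dictionaries.

Analyzers: ik_smart, ik_max_word. Tokenizers: ik_smart, ik_max_word.

Tips:

ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌", exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国,国歌".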
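
To make the difference concrete, here is a toy sketch in Python. It is not IK's actual algorithm. The dictionary is a small hypothetical subset, `all_dictionary_matches` simply lists every dictionary word found anywhere in the text (roughly the spirit of ik_max_word), and `longest_match_first` uses greedy forward maximum matching (roughly the spirit of ik_smart). The real ik_max_word output above also contains tokens such as "人" and "国国" that this toy does not produce.

```python
# Hypothetical toy dictionary, far smaller than IK's real main dictionary.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}


def all_dictionary_matches(text, dictionary):
    """List every dictionary word found at any position (fine-grained view)."""
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


def longest_match_first(text, dictionary):
    """Greedy forward maximum matching (coarse-grained view)."""
    tokens, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens


fine = all_dictionary_matches("中华人民共和国国歌", DICTIONARY)
coarse = longest_match_first("中华人民共和国国歌", DICTIONARY)
print(fine)
print(coarse)  # ['中华人民共和国', '国歌']
```

The fine-grained list helps recall because more query terms can hit the document, while the coarse list keeps the index smaller and the matches more precise.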

Analyzer comparison:

POST http://192.168.159.159:9200/index1/_analyze
{
  "analyzer": "ik_max_word",
  "text": "联想召回笔记本电源线"
}

IK test result:

{
    "tokens": [
        {
            "token": "联想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "召回",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "笔记本",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "电源线",
            "start_offset": 7,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}

Result from the built-in chinese and standard analyzers:

{
    "tokens": [
        {
            "token": "联",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "想",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "召",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "回",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "笔",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "记",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "本",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "电",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "源",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "线",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        }
    ]
}
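
The comparison above can also be scripted. The following is a minimal sketch using only the Python standard library. It assumes the `_analyze` endpoint accepts a JSON body with `analyzer` and `text` keys, and the helper names (`build_analyze_request`, `extract_tokens`, `analyze`) are made up for this illustration.

```python
import json
import urllib.request


def build_analyze_request(base_url, index, analyzer, text):
    """Build a POST request for the _analyze API with a JSON body."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tokens(analyze_response):
    """Pull just the token strings out of an _analyze response dict."""
    return [entry["token"] for entry in analyze_response["tokens"]]


def analyze(base_url, index, analyzer, text):
    """Send the request to a running node and return the token strings."""
    request = build_analyze_request(base_url, index, analyzer, text)
    with urllib.request.urlopen(request) as response:
        return extract_tokens(json.load(response))
```

With a node running, something like `analyze("http://192.168.159.159:9200", "index1", "standard", "联想召回笔记本电源线")` could be called once per analyzer name to compare the token lists side by side.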

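As you can see, the built-in analyzers split the text into individual characters, which is not very helpful in practice, so the IK analyzer is the better choice. Next, let's look at how to install and use it.
Installing IK:
1. Download or build

Option 1: download a prebuilt package from https://github.com/medcl/elasticsearch-analysis-ik/releases

Unzip the plugin into the your-es-root/plugins/ folder

Option 2: install with elasticsearch-plugin (version > v5.5.1):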

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.0.0/elasticsearch-analysis-ik-6.0.0.zip

Restart Elasticsearch

Note: choose the plugin version that matches your Elasticsearch version

Preparation: create the index and load the test data
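
Because the plugin version has to match the Elasticsearch version, it can help to derive the download URL from the version string rather than typing it by hand. This sketch assumes the URL pattern of the single v6.0.0 example above. Other releases may be hosted under a different pattern, and both function names are hypothetical.

```python
RELEASE_BASE = "https://github.com/medcl/elasticsearch-analysis-ik/releases/download"


def ik_release_url(es_version):
    """Build the plugin zip URL for a given ES version (pattern assumed from the v6.0.0 example)."""
    version = es_version.lstrip("v")
    return f"{RELEASE_BASE}/v{version}/elasticsearch-analysis-ik-{version}.zip"


def versions_match(es_version, plugin_version):
    """Return True when both version strings are identical, ignoring a leading 'v'."""
    return es_version.lstrip("v") == plugin_version.lstrip("v")


print(ik_release_url("6.0.0"))
```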

PUT http://localhost:9200/index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   3, 
     "number_of_replicas" : 1 
  },
  "mappings": {
    "resource": {
      "dynamic": false, 
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_max_word"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

POST http://localhost:9200/_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影，最好，新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

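Note that every line of the bulk API body, including the last one, must end with a newline, otherwise the request fails.

Searching for the keywords "最新" and "fox"
Test method: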
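
One way to avoid the missing-newline mistake is to generate the bulk body programmatically. This is a minimal sketch; `build_bulk_body` is a hypothetical helper, and it numbers the documents from 1 as in the example above.

```python
import json


def build_bulk_body(index, doc_type, titles):
    """Build a newline-delimited bulk body with one action line and one source line per title."""
    lines = []
    for doc_id, title in enumerate(titles, start=1):
        action = {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # ensure_ascii=False keeps the Chinese text readable in the payload.
        lines.append(json.dumps({"title": title}, ensure_ascii=False))
    # The trailing newline after the final line is required by the bulk API.
    return "\n".join(lines) + "\n"


body = build_bulk_body("index1", "resource", ["周星驰最新电影", "I'm not happy about the foxes"])
print(body, end="")
```

The resulting string would be encoded as UTF-8 and sent as the POST body, typically with a `Content-Type: application/x-ndjson` header.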

POST http://localhost:9200/index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "最新",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

We vary the query and fields values to compare the results.

1) Searching for "最新" with the fields limited to title.cn (only the hits section is shown):
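
Since only the query text and the field list change between runs, the request body can be generated by a small helper. This is a sketch with a hypothetical function name and a `COMPARISONS` list that simply enumerates the combinations tried in this post.

```python
def multi_match_query(query, fields):
    """Build a most_fields multi_match search body for the given fields."""
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": query,
                "fields": list(fields),
            }
        }
    }


# The (query, fields) combinations compared in this post.
COMPARISONS = [
    ("最新", ["title.cn"]),
    ("最新", ["title", "title.en"]),
    ("fox", ["title", "title.cn"]),
    ("fox", ["title.en"]),
]

bodies = [multi_match_query(query, fields) for query, fields in COMPARISONS]
```

Each element of `bodies` can then be serialized with `json.dumps` and posted to the `_search` endpoint shown above.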

"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 1.0537746,
        "_source": {
            "title": "周星驰最新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.9057159,
        "_source": {
            "title": "周星驰最新电影，最好，新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 0.5319481,
        "_source": {
            "title": "最最最最好的新新新新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.33246756,
        "_source": {
            "title": "周星驰最好看的新电影"
        }
    }
]

Searching for "最新" again, with the fields limited to title and title.en (only the hits section is shown):

"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 1,
        "_source": {
            "title": "最最最最好的新新新新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 0.75,
        "_source": {
            "title": "周星驰最新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.70710677,
        "_source": {
            "title": "周星驰最新电影，最好，新电影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.625,
        "_source": {
            "title": "周星驰最好看的新电影"
        }
    }
]

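Conclusion: without the IK Chinese analyzer, "最新" is treated as two independent characters, so search precision is low.

2) Searching for "fox" with the fields limited to title and title.cn returns nothing, because for those two analyzers fox and foxes are different terms. Searching for "fox" again with the fields limited to title.en gives: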

"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "5",
        "_score": 0.9581454,
        "_source": {
            "title": "I'm not happy about the foxes"
        }
    }
]

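Conclusion: the Chinese and standard analyzers do no processing of English words (singular and plural forms, for example), so recall is low.

My best configuration

In fact, the index created at the start is already the best configuration: adding the cn and en fields under title gives somewhat better results for Chinese, English, and whatever other languages turn up. As mentioned earlier, title uses the standard analyzer, title.cn uses the IK analyzer, and title.en uses the built-in english analyzer, and every search covers all three.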

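Hot word update configuration
Internet slang changes every day. How can newly coined hot words (or domain-specific terms) be picked up by our search in near real time?
First, test with IK: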
POST http://192.168.159.159:9200/index1/_analyze
{
  "analyzer": "ik_max_word",
  "text": "成龙原名陈港生"
}
Result:

{
  "tokens" : [ {
    "token" : "成龙",
    "start_offset" : 1,
    "end_offset" : 3,
    "type" : "CN_WORD",
    "position" : 0
  }, {
    "token" : "原名",
    "start_offset" : 3,
    "end_offset" : 5,
    "type" : "CN_WORD",
    "position" : 1
  }, {
    "token" : "陈",
    "start_offset" : 5,
    "end_offset" : 6,
    "type" : "CN_CHAR",
    "position" : 2
  }, {
    "token" : "港",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "CN_WORD",
    "position" : 3
  }, {
    "token" : "生",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "CN_CHAR",
    "position" : 4
  } ]
}

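For example, IK's main dictionary does not contain "陈港生", so the name was split apart.
Now let's configure it.
Edit the IK configuration file: ES directory/plugins/ik/config/ik/IKAnalyzer.cfg.xml
Change it as follows: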

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">  
<properties>  
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users can configure their own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Users can configure their own extension stopword dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://192.168.1.136/hotWords.php</entry>
    <!-- Users can configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

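Here I use the remote extension dictionary, because other programs can update it and ES does not need a restart, which is convenient. The custom mydict.dic dictionary is also easy to use: one word per line, just add your own.
Since this is a remote dictionary, it has to be a reachable URL. It can be a page or a plain txt file, as long as the output is UTF-8 encoded.
Contents of hotWords.php: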

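<?php
// One word per line; the output must be UTF-8.
$s = <<<'EOF'
陈港生
元楼
蓝瘦
EOF;
// Last-Modified is set to the current time on every request.
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;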
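
For readers without a PHP setup, the same idea can be sketched with the Python standard library. The assumptions here: the plugin decides whether to reload by looking at the Last-Modified and ETag response headers, it may poll with HEAD before fetching with GET, and `HOT_WORDS`, `LAST_CHANGED`, `dictionary_response`, and `serve` are all hypothetical names. Unlike the PHP script, this version derives the ETag from the content, so the headers only change when the word list does.

```python
import hashlib
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

HOT_WORDS = ["陈港生", "元楼", "蓝瘦"]  # one dictionary entry per output line
LAST_CHANGED = 1500000000.0  # hypothetical Unix timestamp of the last edit to HOT_WORDS


def dictionary_response(words, modified_ts):
    """Return the UTF-8 body and the headers for the remote dictionary."""
    body = ("\n".join(words) + "\n").encode("utf-8")
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(body)),
        "Last-Modified": formatdate(modified_ts, usegmt=True),
        # The ETag is a hash of the content, so it changes only when the words change.
        "ETag": '"%s"' % hashlib.md5(body).hexdigest(),
    }
    return body, headers


class HotWordsHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        body, headers = dictionary_response(HOT_WORDS, LAST_CHANGED)
        self.send_response(200)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_HEAD(self):
        self._send(include_body=False)

    def do_GET(self):
        self._send(include_body=True)


def serve(port=8000):
    """Serve the dictionary until interrupted."""
    HTTPServer(("0.0.0.0", port), HotWordsHandler).serve_forever()
```

Calling `serve()` starts the server; the `remote_ext_dict` entry would then point at that host and port instead of hotWords.php.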

Test again and you can see that the IK analyzer now matches "陈港生":

...
  }, {
    "token" : "陈港生",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "CN_WORD",
    "position" : 2
  }, {
...

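That completes the Chinese analyzer setup for ES. Adapt the configuration to your actual needs, and if you spot any problems, please point them out so we can discuss and learn together!

Reposted from: https://blog.csdn.net/mjwwjcoder/article/details/79104859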