[ELK] Introduction to and Usage of Elasticsearch Analyzers

Cry丶

Category: elk  Tags: elasticsearch, IK analyzer

Article link: https://blog.csdn.net/haohaoxuexiyai/article/details/114260198

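Built-in Analyzers

What Is an Analyzer

An analyzer is a tool that breaks a piece of input text into logically meaningful terms (tokens).

Common Built-in Analyzers
- Standard Analyzer - the default analyzer; splits text on word boundaries and lowercases the terms
- Simple Analyzer - splits on non-letter characters (symbols are discarded) and lowercases the terms
- Stop Analyzer - lowercases the terms and removes stop words (the, a, is)
- Whitespace Analyzer - splits on whitespace and does not lowercase
- Pattern Analyzer - splits with a regular expression; the default is \W+ (split on non-word characters)
- Language - analyzers for more than 30 common languages

Standard Analyzer
- The standard analyzer is the default; it is used when no analyzer is specified.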
```
POST /_analyze
{
  "analyzer": "standard",
  "text":"The quick brown fox."
}
```

Simple Analyzer

Splits on non-letter characters (symbols are discarded) and lowercases the terms.

```
POST /_analyze
{
  "analyzer": "simple",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```
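
To make the behavior concrete, here is a rough Python approximation of what the simple analyzer does to this sentence. It is only a sketch: `simple_analyze` is a hypothetical helper, not part of Elasticsearch, and it imitates the analyzer with a regular expression instead of calling the `_analyze` API.

```python
import re

def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercased."""
    # [^\W\d_]+ matches one or more letters (word characters minus digits and "_")
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

tokens = simple_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
# ['the', 'quick', 'brown', 'foxes', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit 2 disappears and dog's becomes two terms, dog and s, because the apostrophe is not a letter.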

Whitespace Analyzer

Splits on whitespace and does not lowercase the terms.

```
POST /_analyze
{
  "analyzer": "whitespace",
  "text":"The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```
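
For comparison, the whitespace analyzer can be approximated in Python with a plain split. As before, this is a sketch under assumptions: `whitespace_analyze` is a hypothetical helper that imitates the analyzer locally and does not talk to Elasticsearch.

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace only."""
    # Case and punctuation are left untouched
    return text.split()

tokens = whitespace_analyze("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(tokens)
# ['The', '2', 'QUICK', 'Brown-Foxes', 'jumped', 'over', 'the', 'lazy', "dog's", 'bone.']
```

The trailing period stays attached to bone. and QUICK keeps its capitals, which matters as soon as the same text is searched with a different analyzer.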

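Specifying Analyzers for a Field

First index a document without an explicit mapping and search it. The title field then uses the default standard analyzer, which keeps dog's as a single term, so a query for dog returns no hits.

```
PUT /my-index-000001/_doc/1
{
  "title": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": "dog"
    }
  }
}
```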

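Testing the Search

Create the index with an explicit mapping so that title is indexed with the whitespace analyzer and queried with the simple analyzer. If my-index-000001 already exists from the previous step, delete it first (DELETE /my-index-000001), because an existing index cannot be created again and the analyzer of an existing field cannot be changed.

```
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace",
        "search_analyzer": "simple"
      }
    }
  }
}
```

```
GET /my-index-000001/_mapping
```

```
PUT /my-index-000001/_doc/1
{
  "title": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
GET /my-index-000001/_search
{
  "query": {
    "match": {
      "title": "dog's jumped"
    }
  }
}
```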
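
Why this query finds the document can be traced with a small Python sketch. It assumes the two analyzers behave like the regex and split approximations below (hypothetical helpers, not Elasticsearch code) and that the match query uses its default OR operator, so a hit needs only one query term to equal an indexed term.

```python
import re

def whitespace_analyze(text):
    # Index-time approximation: split on whitespace, keep case and punctuation
    return text.split()

def simple_analyze(text):
    # Search-time approximation: keep runs of letters, lowercased
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

indexed_terms = set(whitespace_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."))
query_terms = simple_analyze("dog's jumped")

# Keep the query terms that appear verbatim among the indexed terms
matching = [t for t in query_terms if t in indexed_terms]
print(query_terms)  # ['dog', 's', 'jumped']
print(matching)     # ['jumped']
```

Only jumped matches: the index holds dog's as one term, while the query side produces dog and s. By the same logic a query for quick would miss, because the index holds QUICK and the simple analyzer lowercases the query.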

IK Chinese Analyzer

Using the default standard analyzer, which splits Chinese text into single characters:

```
POST /_analyze
{
  "analyzer": "standard",
  "text":"中华人民共和国国歌"
}
```

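IK Analyzer
- Download: https://github.com/medcl/elasticsearch-analysis-ik (the plugin version, 7.8.0 here, must match the Elasticsearch version)
- Extract into plugins/ik
  - chmod 777 elasticsearch-analysis-ik-7.8.0.zip
  - cd /usr/local/elk/plugins
  - mkdir ik
  - cd ik
  - cp /opt/soft/elasticsearch-analysis-ik-7.8.0.zip .
  - unzip elasticsearch-analysis-ik-7.8.0.zip
  - rm elasticsearch-analysis-ik-7.8.0.zip
- Restart ES (5691 is the Elasticsearch process ID in the author's environment; substitute your own)
  - kill -15 5691
  - bin/elasticsearch -d -p fx.pid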

Testing

```
POST /_analyze
{
  "analyzer": "ik_max_word",
  "text":"中华人民共和国国歌"
}
```

```
POST /_analyze
{
  "analyzer": "ik_smart",
  "text":"中华人民共和国国歌"
}
```

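What is the difference between ik_max_word and ik_smart?
- ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" (National Anthem of the People's Republic of China) is split into "中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌", exhausting every possible combination;
- ik_smart: splits the text at the coarsest granularity. For example, "中华人民共和国国歌" is split into "中华人民共和国,国歌".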
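
The contrast between the two modes can be illustrated with a toy dictionary segmenter in Python. This is a sketch only: `TOY_DICT`, `all_words` and `longest_match` are hypothetical names, the dictionary is a hand-picked subset, and the two strategies (exhaustive dictionary lookup and greedy forward longest match) are simple stand-ins, not IK's actual algorithm.

```python
# Hypothetical mini-dictionary; the real IK dictionary is far larger
TOY_DICT = {"中华人民共和国", "中华人民", "中华", "华人",
            "人民共和国", "人民", "共和国", "共和", "国歌"}
MAX_LEN = max(len(word) for word in TOY_DICT)

def all_words(text):
    """Fine-grained: emit every dictionary word found at any position."""
    found = []
    for start in range(len(text)):
        # Try longer candidates first, so longer words are listed before shorter ones
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in TOY_DICT:
                found.append(text[start:end])
    return found

def longest_match(text):
    """Coarse-grained: greedily take the longest dictionary word at each position."""
    tokens, start = [], 0
    while start < len(text):
        for end in range(min(len(text), start + MAX_LEN), start, -1):
            if text[start:end] in TOY_DICT:
                break
        else:
            end = start + 1  # no dictionary word here: fall back to one character
        tokens.append(text[start:end])
        start = end
    return tokens

print(all_words("中华人民共和国国歌"))
# ['中华人民共和国', '中华人民', '中华', '华人', '人民共和国', '人民', '共和国', '共和', '国歌']
print(longest_match("中华人民共和国国歌"))
# ['中华人民共和国', '国歌']
```

The fine-grained output favors recall (a search for 人民 or 共和国 can hit), while the coarse-grained output produces fewer, longer terms.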

Viewing the Dictionary
```
head config/main.dic
```

Custom Dictionary

Edit the configuration file config/IKAnalyzer.cfg.xml:

```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionaries here -->
        <entry key="ext_dict">fx.dic</entry>
        <!-- Configure your own extension stop-word dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Configure remote extension dictionaries here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure remote extension stop-word dictionaries here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
```

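fx.dic lists the custom words, one per line (here 网红, "internet celebrity", and 社畜, "corporate drone"):

```
cat fx.dic
网红
社畜
```

Elasticsearch must be restarted for the change to take effect.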
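
If the custom dictionary is maintained by a script rather than by hand, something like the following Python sketch could generate it. The helper name `write_ext_dict` and the demo path are hypothetical; the sketch assumes the file format seen in fx.dic above (one word per line) and UTF-8 encoding, which the IK README asks for in extension dictionaries.

```python
import os
import tempfile

def write_ext_dict(path, words):
    """Write words to a dictionary file: UTF-8, one entry per line, no duplicates."""
    # dict.fromkeys drops duplicates while keeping the first-seen order
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("".join(word + "\n" for word in unique))
    return unique

# Demo in a temporary directory; a real run would target the plugin's fx.dic
demo_path = os.path.join(tempfile.mkdtemp(), "fx.dic")
print(write_ext_dict(demo_path, ["网红", "社畜", "网红"]))
# ['网红', '社畜']
```

Writing the file does not reload it; the restart described above is still required.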

```
POST /my-index-000001/_analyze
{
  "analyzer": "ik_max_word",
  "text":"社畜"
}
```

```
POST /my-index-000001/_analyze
{
  "analyzer": "ik_smart",
  "text":"网红"
}
```

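The IK analyzer supports hot updates of its dictionaries, but the feature is unstable. A more reliable hot update can be implemented by modifying the source code, which is worth exploring if you are interested.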
