ELK Advanced Search, Part 8: Analyzers in Practice

yangyanping20108

Last modified 2023-07-19 10:05:01

Category: Search. Tags: elk, microservices, distributed

First published 2023-07-18 21:03:40

Article link: https://blog.csdn.net/yangyanping20108/article/details/131796386

Introduction to analyzers and how to use them

What is a tokenizer

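A tokenizer accepts a string as input, splits it into individual words or tokens (possibly discarding some characters such as punctuation), and emits a token stream as output.

What is interesting is the algorithm used to identify words. The whitespace tokenizer simply splits on whitespace (spaces, tabs, line feeds, and so on) and assumes that contiguous non-whitespace characters form a single token.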
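The whitespace rule described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Elasticsearch's implementation; the class name `WhitespaceTokenizerSketch` and its `Token` type are made up for this example. Each token carries the same three pieces of information that the `_analyze` responses in this article show: start offset, end offset, and position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Elasticsearch or Lucene.
final class WhitespaceTokenizerSketch {

    /** One entry of the token stream: the text plus its offsets and position. */
    static final class Token {
        final String text;
        final int startOffset; // index of the first character (inclusive)
        final int endOffset;   // index after the last character (exclusive)
        final int position;    // 0-based index of the token in the stream

        Token(String text, int startOffset, int endOffset, int position) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + "[" + startOffset + "," + endOffset + ")@" + position;
        }
    }

    /** Splits on whitespace; every run of non-whitespace characters becomes one token. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int start = -1; // start of the current token, or -1 while between tokens
        for (int i = 0; i <= input.length(); i++) {
            // The end of the input closes the last token just like whitespace does.
            boolean boundary = i == input.length() || Character.isWhitespace(input.charAt(i));
            if (!boundary && start < 0) {
                start = i;
            } else if (boundary && start >= 0) {
                tokens.add(new Token(input.substring(start, i), start, i, tokens.size()));
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : tokenize("this is\ta book")) {
            System.out.println(token);
        }
    }
}
```

Punctuation is not treated specially here: a string such as `book,` would stay a single token, which is the main practical difference from the letter-based splitting used by other analyzers.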

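An analyzer, which wraps a tokenizer together with optional character filters and token filters, is a tool that breaks a piece of user-entered text into multiple terms according to some logic. Commonly used built-in analyzers:
standard analyzer, simple analyzer, whitespace analyzer, stop analyzer, language analyzer, pattern analyzer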

standard analyzer

The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.

POST http://127.0.0.1:9200/_analyze
{
 "analyzer":"standard",
 "text":"我是程序员"
}


{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "程",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "序",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}
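The response above shows the standard analyzer emitting every Chinese character as its own token. A rough approximation of that behaviour is sketched below in plain Java. It is an assumption-laden simplification: the real analyzer follows the Unicode text segmentation rules, which this sketch does not implement, and the class name `StandardLikeTokenizerSketch` is invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; a rough stand-in, not the real standard analyzer.
final class StandardLikeTokenizerSketch {

    /**
     * Emits each ideographic character as a single token, groups runs of other
     * letters and digits into lowercase words, and treats everything else as a separator.
     */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (Character.isIdeographic(codePoint)) {
                flush(word, tokens);
                tokens.add(new String(Character.toChars(codePoint)));
            } else if (Character.isLetterOrDigit(codePoint)) {
                word.appendCodePoint(Character.toLowerCase(codePoint));
            } else {
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Moves the pending word, if any, into the token list. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("我是程序员"));
        System.out.println(tokenize("Hello, ES 8!"));
    }
}
```

Splitting Chinese text one character at a time loses word boundaries such as 程序员, which is the motivation for a dictionary-based Chinese analyzer.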

simple analyzer

The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and it lowercases every term.

POST http://127.0.0.1:9200/_analyze
{
  "analyzer":"simple",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
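The rule behind the simple analyzer (split at every non-letter, then lowercase) is small enough to sketch directly. The code below is a minimal illustration under that description, with a made-up class name `SimpleAnalyzerSketch`; it is not the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not the Lucene implementation.
final class SimpleAnalyzerSketch {

    /** Splits at every character that is not a letter and lowercases the resulting terms. */
    static List<String> analyze(String input) {
        List<String> terms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (digit, space, punctuation) ends the current term.
                terms.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            terms.add(current.toString());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is a book"));
        System.out.println(analyze("The 2 QUICK Brown-Foxes"));
    }
}
```

Note that digits are separators under this rule, so the `2` in the second input disappears entirely rather than becoming a term.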

whitespace analyzer

The whitespace analyzer splits the text into terms whenever it encounters a whitespace character.

POST  http://127.0.0.1:9200/_analyze
{
  "analyzer":"whitespace",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "word",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 8,
            "end_offset": 9,
            "type": "word",
            "position": 2
        },
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}

stop analyzer

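The stop analyzer is much like the simple analyzer. The only difference is that it adds support for removing stop words, using the english stop word list by default.

stopwords: a predefined list of stop words, for example (the, a, an, this, of, at) and so on.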

POST  http://127.0.0.1:9200/_analyze

{
  "analyzer":"stop",
  "text":"this is a book"
}

{
    "tokens": [
        {
            "token": "book",
            "start_offset": 10,
            "end_offset": 14,
            "type": "word",
            "position": 3
        }
    ]
}
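In the response above the only surviving token, book, still has position 3: removed stop words leave gaps in the position sequence. The sketch below illustrates that behaviour in plain Java. The stop word set is a small illustrative subset chosen for this example (the six words listed above plus "is"), not the full built-in list, and `StopAnalyzerSketch` is a made-up name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not the Lucene implementation.
final class StopAnalyzerSketch {

    // Illustrative subset only; the built-in english list contains more words.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "this", "of", "at", "is"));

    /** Lowercases, splits at non-letters, drops stop words, and keeps original positions. */
    static List<String> analyze(String input) {
        List<String> kept = new ArrayList<>();
        int position = 0;
        for (String term : input.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (term.isEmpty()) {
                continue; // a leading separator produces one empty first element
            }
            if (!STOP_WORDS.contains(term)) {
                kept.add(term + "@" + position);
            }
            position++; // stop words still consume a position
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(analyze("this is a book"));
        System.out.println(analyze("The Art of War"));
    }
}
```

Keeping the gaps matters for phrase queries, which rely on the distance between token positions.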

Chinese analyzers

Installation
Download the plugin from https://github.com/medcl/elasticsearch-analysis-ik/releases and unzip the archive into es/plugins/ik.

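Using the IK analyzer

ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国，中华人民，中华，华人，人民共和国，人民，共和国，共和，国，国歌”, exhausting every possible combination;
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” is split into “中华人民共和国，国歌”.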
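The difference between the two modes can be illustrated with a toy dictionary-based segmenter. The sketch below is not IK's algorithm (IK also resolves ambiguities and handles single characters, numerals and quantifiers); it assumes a tiny hand-written dictionary and uses forward maximum matching for the coarse mode and "every dictionary hit at every offset" for the fine mode. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; a toy segmenter, not the IK implementation.
final class DictionarySegmenterSketch {

    /** Coarse mode: at each offset take the longest dictionary word (forward maximum matching). */
    static List<String> coarse(String text, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = i + 1; // fall back to a single character when nothing matches
            for (int j = text.length(); j > i + 1; j--) {
                if (dictionary.contains(text.substring(i, j))) {
                    end = j;
                    break;
                }
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }

    /** Fine mode: emit every dictionary word found at every offset, longest first. */
    static List<String> fine(String text, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = text.length(); j > i; j--) {
                String candidate = text.substring(i, j);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tiny hand-written dictionary, assumed for this example only.
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "中华人民共和国", "中华人民", "中华", "华人",
                "人民共和国", "人民", "共和国", "共和", "国歌"));
        System.out.println(coarse("中华人民共和国国歌", dictionary));
        System.out.println(fine("中华人民共和国国歌", dictionary));
    }
}
```

Because the toy dictionary has no single-character entries, the fine mode here does not emit the standalone 国 that ik_max_word produces; the point is only the contrast between one long match and all overlapping matches.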

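Configuration files

File	Description
IKAnalyzer.cfg.xml	Used to configure custom dictionaries
main.dic	IK's built-in Chinese dictionary, with more than 270,000 entries; any word listed here is kept together as a single token
preposition.dic	Prepositions
quantifier.dic	Unit-related words and measure words (quantifiers)
suffix.dic	Common suffixes
surname.dic	Chinese surnames
stopword.dic	English stop words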

IKAnalyzer.cfg.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
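	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict"></entry>
	<!-- Users can configure their own extension stop word dictionaries here -->
	<entry key="ext_stopwords"></entry>
	<!-- Users can configure a remote extension dictionary here -->
	<!-- <entry key="remote_ext_dict">words_location</entry> -->
	<!-- Users can configure a remote extension stop word dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->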
</properties>

Tokenizing with ik_smart:

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"中华人民共和国国歌"
}


{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

Tokenizing with ik_max_word:

POST localhost:9200/_analyze
{
    "analyzer":"ik_max_word",
    "text":"中华人民共和国国歌"
}

{
    "tokens": [
        {
            "token": "中华人民共和国",
            "start_offset": 0,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "中华人民",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "中华",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "华人",
            "start_offset": 1,
            "end_offset": 3,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "人民共和国",
            "start_offset": 2,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 4
        },
        {
            "token": "人民",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 5
        },
        {
            "token": "共和国",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 6
        },
        {
            "token": "共和",
            "start_offset": 4,
            "end_offset": 6,
            "type": "CN_WORD",
            "position": 7
        },
        {
            "token": "国",
            "start_offset": 6,
            "end_offset": 7,
            "type": "CN_CHAR",
            "position": 8
        },
        {
            "token": "国歌",
            "start_offset": 7,
            "end_offset": 9,
            "type": "CN_WORD",
            "position": 9
        }
    ]
}

Custom dictionary

Without a custom dictionary, ik_smart splits “魔兽世界” into two tokens:

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}


// Tokens
{
    "tokens": [
        {
            "token": "魔兽",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "世界",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

Create a mydic.dic file in the /plugins/ik/config/ folder and add the entry “魔兽世界”.
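If the dictionary is generated by a program rather than typed by hand, the file can be written as sketched below: one entry per line, encoded as UTF-8 (the encoding the IK README asks for, as far as I am aware). The class name, the helper methods, and the temporary directory used in `main` are assumptions made for this example; in a real installation the target would be the plugin's config folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; not part of the IK plugin.
final class CustomDictionaryWriterSketch {

    /** Writes the words to configDir/fileName, one entry per line, as UTF-8. */
    static Path writeDictionary(Path configDir, String fileName, List<String> words) {
        try {
            Files.createDirectories(configDir);
            return Files.write(configDir.resolve(fileName), words, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the dictionary back, which is handy for checking the encoding. */
    static List<String> readDictionary(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the plugin's config folder in this sketch.
        Path configDir = Files.createTempDirectory("ik-config");
        Path file = writeDictionary(configDir, "mydic.dic", Arrays.asList("魔兽世界"));
        System.out.println(file + " -> " + readDictionary(file));
    }
}
```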

After modifying IKAnalyzer.cfg.xml, restart ES, then test the tokenization again.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
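	<comment>IK Analyzer extension configuration</comment>
	<!-- Users can configure their own extension dictionaries here -->
	<entry key="ext_dict">mydic.dic</entry>
	<!-- Users can configure their own extension stop word dictionaries here -->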
	<entry key="ext_stopwords"></entry>
</properties>

POST localhost:9200/_analyze
{
    "analyzer":"ik_smart",
    "text":"魔兽世界"
}

// Tokens
{
    "tokens": [
        {
            "token": "魔兽世界",
            "start_offset": 0,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}

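Best practice

The best practice for the two analyzers is to use ik_max_word at index time and ik_smart at search time. In other words, split book titles as exhaustively as possible when indexing, and search more precisely for the wanted result when querying.

For example: I want to search for a book and type “天龙八部”. What I have in mind is finding the book “天龙八部” itself, not other novels; that is, the book information must contain the exact term 天龙八部.

Result of ik_max_word: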

POST /_analyze
{
 "analyzer":"ik_max_word",
 "text":"天龙八部"
}


----Result----
{
  "tokens" : [
    {
      "token" : "天龙八部",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "天龙",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "八部",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "八",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "TYPE_CNUM",
      "position" : 3
    },
    {
      "token" : "部",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "COUNT",
      "position" : 4
    }
  ]
}

Result of ik_smart:

POST /_analyze
{
 "analyzer":"ik_smart",
 "text":"天龙八部"
}


----Result----
{
  "tokens" : [
    {
      "token" : "天龙八部",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 0
    }
  ]
}

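The difference between the two analyzers is now visible: because “天龙八部” is a single word, ik_smart does not split it any further. So we can use ik_max_word at index time and ik_smart at search time.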

PUT /book_index/_mapping
{
  "properties": {
    "bookName": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}
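Why the analyzer/search_analyzer split helps can be seen with a toy inverted index. In the sketch below, documents are indexed with fine-grained tokens, and a search matches any document containing at least one query token (the default OR behaviour of a match query). The token list for document 1 is the ik_max_word output shown earlier; document 2 and its tokens are hypothetical, and the class name is made up. Scoring is ignored completely.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration class; a toy inverted index, not how Elasticsearch stores data.
final class IndexVsSearchAnalyzerSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index time: register every token of the document. */
    void index(int docId, List<String> indexTokens) {
        for (String token : indexTokens) {
            postings.computeIfAbsent(token, key -> new TreeSet<>()).add(docId);
        }
    }

    /** Search time: a document is a hit if it contains any of the query tokens. */
    Set<Integer> search(List<String> queryTokens) {
        Set<Integer> hits = new TreeSet<>();
        for (String token : queryTokens) {
            hits.addAll(postings.getOrDefault(token, Collections.emptySet()));
        }
        return hits;
    }

    public static void main(String[] args) {
        IndexVsSearchAnalyzerSketch index = new IndexVsSearchAnalyzerSketch();
        // Document 1: "天龙八部", tokens as produced by ik_max_word in this article.
        index.index(1, Arrays.asList("天龙八部", "天龙", "八部", "八", "部"));
        // Document 2: a hypothetical title with assumed tokens.
        index.index(2, Arrays.asList("天龙", "传奇"));

        // Query analyzed the ik_smart way: one coarse token.
        System.out.println(index.search(Arrays.asList("天龙八部")));
        // Query analyzed the ik_max_word way: the fragments pull in document 2 as well.
        System.out.println(index.search(Arrays.asList("天龙八部", "天龙", "八部", "八", "部")));
    }
}
```

With the coarse query only document 1 is returned, while the fine-grained query also matches the unrelated document through the shared fragment 天龙.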

Summary

1. In the usual case, for full-text queries, the relevant document field is analyzed with ik_max_word, and a match query from the client is enough.

POST /book_index/_search
{
  "query": {
    "match": {
      "bookName": "天龙八部"
    }
  }
}

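2. In special cases where the business needs to query in both ik_max_word and ik_smart modes, create a sub-field (an auxiliary field) to query the corresponding information. If the results need to be ordered by priority, assign a boost weight to each clause.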

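import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// fileName is the name of the field to search (for example bookName) and value is the
// user's search text; both are assumed to be defined by the surrounding code.
BoolQueryBuilder qb = QueryBuilders.boolQuery();
// Priority: e-book > tag > second-level category > author
// Suggestion weight: 最强 > 最强xx > xx最强 > xx最强xx > 最xx
BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
// Exact match on the keyword sub-field. boost() is set on each sub-query rather than on
// the bool query, so that every clause keeps its own weight.
boolQueryBuilder.should(QueryBuilders.termsQuery(fileName + ".keyword", new String[]{value}).boost(100));
// Prefix match
boolQueryBuilder.should(QueryBuilders.prefixQuery(fileName, value).boost(80));
// Wildcard match; value must contain * or ? for this to differ from an exact term match
boolQueryBuilder.should(QueryBuilders.wildcardQuery(fileName, value).boost(60));
// Full-text match query
boolQueryBuilder.should(QueryBuilders.matchQuery(fileName, value).boost(40));
qb.must(boolQueryBuilder);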
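The effect of the boost values on ordering can be illustrated with a deliberately simplified model. The sketch below assumes that every matching clause contributes exactly its boost and that the contributions are summed; real Elasticsearch scores also depend on the relevance model and on how the field is analyzed. Plain string operations on the whole field value stand in for the exact, prefix and wildcard clauses, the match clause is left out because it depends on the analyzer, and the class name and sample titles are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration class; a toy scoring model, not Elasticsearch scoring.
final class BoostRankingSketch {

    /** Sums the boosts of the clauses that match, assuming a base score of 1 per clause. */
    static int score(String fieldValue, String query) {
        int score = 0;
        if (fieldValue.equals(query)) {
            score += 100; // stands in for the exact match on the keyword sub-field
        }
        if (fieldValue.startsWith(query)) {
            score += 80;  // stands in for the prefix clause
        }
        if (fieldValue.contains(query)) {
            score += 60;  // stands in for a wildcard clause of the form *value*
        }
        return score;
    }

    /** Returns the values ordered from highest to lowest score. */
    static List<String> rank(List<String> values, String query) {
        List<String> ranked = new ArrayList<>(values);
        ranked.sort(Comparator.comparingInt((String v) -> score(v, query)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical titles used only to show the ordering.
        List<String> titles = Arrays.asList("新天龙八部", "天龙八部传", "天龙八部");
        System.out.println(rank(titles, "天龙八部"));
    }
}
```

An exact match satisfies all three stand-in clauses and therefore ranks first, a title that merely starts with the query ranks second, and a title that only contains it ranks last.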

Reference: ES mapping management · Elasticsearch · 看云 (Kancloud)
