Elasticsearch (3): The Built-in Analyzers in Elasticsearch


Catalina_yep


Article link: https://blog.csdn.net/Apple_Andy/article/details/109704291


I. The standard analyzer

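A grammar-based analyzer that splits text on word boundaries and works well for English. For the sample sentence "Set the shape to semi-transparent by calling set_trans(5)", the resulting tokens are: set, the, shape, to, semi, transparent, by, calling, set_trans, 5. This is also the default analyzer in Elasticsearch. It does not remove stop words (the, a, an, and so on) during tokenization. It lowercases every word and drops common symbols such as hyphens (-) and parentheses.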
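The behavior described above can be approximated locally. The following is a minimal sketch, not Elasticsearch's actual implementation: it assumes that "runs of word characters, lowercased" is close enough for plain English input, and the function name approx_standard_analyze is made up for illustration.

```python
import re
from typing import List


def approx_standard_analyze(text: str) -> List[str]:
    """Rough stand-in for the standard analyzer on plain English text."""
    # \w+ keeps letters, digits and underscores together, so "set_trans"
    # survives as one token while "-", "(" and ")" act as separators.
    # Assumption: this is only an approximation; it does not follow the
    # real analyzer for apostrophes, CJK text and other edge cases.
    return re.findall(r"\w+", text.lower())


SAMPLE = "Set the shape to semi-transparent by calling set_trans(5)"
tokens = approx_standard_analyze(SAMPLE)
print(tokens)
```

Stop words such as "the" and "to" remain in the output, which matches the note that the standard analyzer does not remove them.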

II. The simple analyzer

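Resulting tokens: set, the, shape, to, semi, transparent, by, calling, set, trans. It splits the text into individual words at every non-letter character and lowercases them, so the underscore splits set_trans in two and the digit 5 is dropped. It is rarely used, because it often breaks English text apart in ways that lose meaning.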
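A small sketch of the same idea, assuming that "split at anything that is not a letter, then lowercase" captures the simple analyzer closely enough for English input. The function name approx_simple_analyze is hypothetical, and this is an illustration rather than the analyzer's real code.

```python
import re
from typing import List


def approx_simple_analyze(text: str) -> List[str]:
    """Rough stand-in for the simple analyzer: letter runs only, lowercased."""
    # [^\W\d_] means "a word character that is neither a digit nor an
    # underscore", i.e. a letter. Digits, "_", "-", "(" and ")" all split.
    return re.findall(r"[^\W\d_]+", text.lower())


SAMPLE = "Set the shape to semi-transparent by calling set_trans(5)"
tokens = approx_simple_analyze(SAMPLE)
print(tokens)
```

Compared with the standard analyzer, the identifier set_trans is no longer searchable as a single term, which illustrates why this analyzer sees little use.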

III. The whitespace analyzer

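Resulting tokens: Set, the, shape, to, semi-transparent, by, calling, set_trans(5). It splits the text only on whitespace characters such as spaces and tabs, and leaves case and punctuation untouched. It is rarely used, because the resulting tokens often do not correspond to proper English words.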
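Whitespace splitting is simple enough to reproduce directly. This sketch assumes that Python's str.split() with no arguments, which splits on any run of whitespace, is an adequate model; the function name approx_whitespace_analyze is hypothetical.

```python
from typing import List


def approx_whitespace_analyze(text: str) -> List[str]:
    """Split on runs of whitespace only; keep case and punctuation as-is."""
    return text.split()


SAMPLE = "Set the shape to\tsemi-transparent by calling set_trans(5)"
tokens = approx_whitespace_analyze(SAMPLE)
print(tokens)
```

Note that "Set" keeps its capital letter and "set_trans(5)" keeps its parentheses, so a query for "set" or "set_trans" would not match these tokens exactly.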

IV. Language analyzers

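For example, the English analyzer (english). Resulting tokens: set, shape, semi, transpar, call, set_tran, 5. It tokenizes according to English grammar: it removes stop words, lowercases, and reduces words to their stems (normalizing singular and plural forms, verb tenses, and so on). Apart from that, its tokenization is similar to that of the standard analyzer.
Note: the commonly used analyzers that ship with Elasticsearch are all geared toward English. Applied to Chinese text, they emit one token per character.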
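Stemmed output such as "transpar" is hard to reproduce by hand, so the practical way to check any of the token lists in this article is to ask a running cluster through the _analyze API. The sketch below builds such a request with the Python standard library. The base URL http://localhost:9200 and the helper names are assumptions for illustration, and the analyze() function needs a reachable Elasticsearch node to return anything.

```python
import json
from typing import List
from urllib import request


def build_analyze_body(analyzer: str, text: str) -> bytes:
    """JSON body for POST /_analyze: which analyzer to run on which text."""
    return json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")


def analyze(base_url: str, analyzer: str, text: str) -> List[str]:
    """Send the request and return only the token strings.

    Assumption: base_url points at a reachable node, for example
    "http://localhost:9200" (hypothetical; adjust to your cluster).
    """
    req = request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=build_analyze_body(analyzer, text),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        payload = json.load(resp)
    # The response lists one object per token under the "tokens" key.
    return [item["token"] for item in payload["tokens"]]


body = build_analyze_body(
    "english", "Set the shape to semi-transparent by calling set_trans(5)"
)
print(body.decode("utf-8"))
```

Calling analyze() with "standard" and a short Chinese sentence is a quick way to observe the one-token-per-character behavior mentioned in the note.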

V. Chinese analyzers

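The IK Chinese analyzer can be downloaded directly from the release page below. Note: download the elasticsearch-analysis-ik-xxx.zip package, not the source archive.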
https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v6.8.4

1. IK configuration files

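The configuration files are:
main.dic:
IK's built-in main dictionary. It records all the Chinese words that IK has collected, one word per line. IK cannot effectively segment words that are not recorded in this file, for example the internet slang 雨女无瓜. Editing this file is not recommended. It is the core Chinese word list; much internet slang is missing from it, just as it is missing from a reference dictionary such as Cihai (辞海).
quantifier.dic:
IK's built-in dictionary of measure words and units.
suffix.dic:
IK's built-in suffix dictionary.
surname.dic:
IK's built-in surname dictionary.
stopword.dic:
IK's built-in English stop words.
preposition.dic:
IK's built-in Chinese stop words (prepositions).
IKAnalyzer.cfg.xml:
Used to configure custom dictionaries.
A custom dictionary is a special word list supplied by the user, for example trending internet slang or business-specific terms.
ext_dict (not present by default; it must be created manually)
The custom dictionary. Its location is configured as a path relative to the directory that contains IKAnalyzer.cfg.xml. It acts as a user-defined main.dic file and extends main.dic. In testing with the current version, configuring ext_dict had no effect; adding the words to main.dic can be used instead.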
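The layout described above (a one-word-per-line dictionary file, referenced from IKAnalyzer.cfg.xml by a relative path under the ext_dict key) can be sketched as follows. This is an illustration under assumptions, not a verified fix for the ext_dict problem reported above: the file name custom/ext.dic and the helper write_ik_custom_dict are hypothetical, the demo writes into a temporary directory rather than a real plugin directory, and in a real installation you would edit the existing IKAnalyzer.cfg.xml instead of overwriting it.

```python
import tempfile
from pathlib import Path
from typing import Iterable
from xml.sax.saxutils import escape

# Java "properties" XML, the format IKAnalyzer.cfg.xml uses.
CFG_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">{ext_dict}</entry>
</properties>
"""


def write_ik_custom_dict(
    config_dir: Path, words: Iterable[str], rel_path: str = "custom/ext.dic"
) -> Path:
    """Write a custom dictionary and a cfg file that points at it.

    rel_path is relative to the directory holding IKAnalyzer.cfg.xml.
    """
    dict_path = config_dir / rel_path
    dict_path.parent.mkdir(parents=True, exist_ok=True)
    # One word per line, saved as UTF-8.
    dict_path.write_text("\n".join(words) + "\n", encoding="utf-8")
    cfg_path = config_dir / "IKAnalyzer.cfg.xml"
    cfg_path.write_text(
        CFG_TEMPLATE.format(ext_dict=escape(rel_path)), encoding="utf-8"
    )
    return dict_path


# Demo in a throwaway directory (stand-in for the plugin's config directory).
with tempfile.TemporaryDirectory() as tmp:
    created = write_ik_custom_dict(Path(tmp), ["雨女无瓜"])
    print(created.read_text(encoding="utf-8"))
    print((Path(tmp) / "IKAnalyzer.cfg.xml").read_text(encoding="utf-8"))
```

If a custom dictionary appears to have no effect, two things worth checking are that the file is saved as UTF-8 and that the node was restarted after the change; both are general suggestions rather than a confirmed diagnosis of the case above.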
