The Internals of Elasticsearch Analyzers


微风中的一只小刺猬

Reposted from: https://www.aliyun.com/jiaocheng/785749.html

1 This article introduces the various analyzers and the use cases each one suits.
Concepts covered
Character filter
Tokenizer
Token filter
Analyzer
Term query
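An analyzer generally consists of three parts: zero or more character filters, exactly one tokenizer, and zero or more token filters. Once you understand how an analyzer works, you can configure one to fit your use case.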

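Elasticsearch ships with 10 tokenizers, 31 token filters, and 3 character filters (the exact counts vary by version), plus a large number of configuration options. Plugins can add more. These are the raw materials for building an analyzer.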

2 Components of an Analyzer
Internally, an analyzer is a pipeline:
Step 1 Character filtering (character filters)
Step 2 Tokenization (tokenizer)
Step 3 Token filtering (token filters)
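
The three steps above can be sketched in plain Python. This is only a toy model of the pipeline, not how Elasticsearch implements it: the function names, the regular expressions, and the four-word stop list are all assumptions made for illustration.

```python
import re

# Tiny illustrative stop list (an assumption, not the Elasticsearch default list).
STOP_WORDS = {"the", "a", "an", "and"}

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: keep runs of word characters, which also drops punctuation."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    """Token filter: normalise case."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the three stages in order: char filters, one tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<p>The Two Lazy Dogs</p>",
    [strip_html],
    tokenize,
    [lowercase, remove_stop_words],
)
print(tokens)  # ['two', 'lazy', 'dogs']
```

The order matters: character filters see the raw string, the tokenizer turns it into tokens, and every token filter only ever sees the output of the previous stage.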

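Elasticsearch comes with 8 built-in analyzers. If none of them meets your needs, you can define a custom analyzer through the index settings API (on an existing index, the index must be closed before its analysis settings can be changed):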

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "customHTMLSnowball": {
          "type": "custom",
          "char_filter": [
            "html_strip"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "stop",
            "snowball"
          ]
        }
      }
    }
  }
}
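The custom analyzer above is named customHTMLSnowball. It does the following:

Strips HTML tags (html_strip character filter).

Tokenizes the text and drops punctuation (standard tokenizer).

Lowercases every token (lowercase token filter).

Removes stop words (stop token filter), such as "the", "they", "a", "an", and "and".

Reduces words to their stems (snowball token filter; Snowball is one of the most widely used stemming algorithms for English). The examples below show what a stemmer aims for; the exact Snowball output can differ for some words.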

cats -> cat

catty -> cat

stemmer -> stem

stemming -> stem

stemmed -> stem
A picture is worth a thousand words: the original post used a figure to show how customHTMLSnowball processes the following text.
The two lazy dogs, were slower than the less lazy dog

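3 How to Choose the Right Analyzer
3.1 Which analyzer for long English text?
When the content being searched is long English text such as blog posts, news articles, or forum threads, prefer an analyzer that includes a stemming token filter.

Common stemming token filters include stemmer, snowball, and porter_stem.

Take the snowball analyzer as an example: it reduces sing / sings / singing to the stem sing, and it also drops the stop words "they" and "are". Whether the user searches for sing, sings, or singing, the query runs against the term "sing", so the result set is the same.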

GET http://localhost:9200/_analyze?text=I%20sing%20he%20sings%20they%20are%20singing&analyzer=snowball

// Output (abbreviated)
{
  "tokens": [
    {"token": "i", "position": 1, ...},
    {"token": "sing", "position": 2, ...},
    {"token": "he", "position": 3, ...},
    {"token": "sing", "position": 4, ...},
    {"token": "sing", "position": 7, ...}
  ]
}

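Stemming is widely used in English search, but it has limitations:

Stemming is of little (arguably no) use for Chinese.

When searching for technical terms or personal names, stemming can make results worse.

For example, flying fish and fly fishing mean very different things, yet after Snowball processing both yield the same terms: fli fish.

A user searching for fly fishing then gets results about flying fish, which is far from ideal.

In such cases, prefer exact matching: a simple analysis strategy (lowercase only, no stemming) combined with a fuzzy query may be the better choice.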
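
The collision described above can be made concrete with a few lines of Python. This is a sketch under a stated assumption: the stems are hardcoded from the values quoted in this article rather than computed by a real Snowball stemmer, and the helper names are made up for the example.

```python
# Stems copied from the article's example; a real stemmer would compute them.
STEMS = {"flying": "fli", "fly": "fli", "fishing": "fish", "fish": "fish"}

def stemmed_terms(phrase):
    """Lowercase, split on whitespace, then map each word to its stem."""
    return [STEMS.get(word, word) for word in phrase.lower().split()]

def lowercase_terms(phrase):
    """The simpler strategy: lowercase only, no stemming."""
    return phrase.lower().split()

a, b = "flying fish", "fly fishing"

print(stemmed_terms(a), stemmed_terms(b))      # ['fli', 'fish'] ['fli', 'fish']
print(lowercase_terms(a), lowercase_terms(b))  # ['flying', 'fish'] ['fly', 'fishing']
```

With stemming the two phrases become indistinguishable at the term level; with lowercase-only analysis they stay distinct, which is why the simpler strategy suits terminology and names.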
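3.2 Which analyzer for Chinese?
English is relatively easy to tokenize: splitting on whitespace and punctuation gets you most of the way. Chinese, however, has no spaces between words (and German sometimes joins two words into one), so the default standard analyzer does not work well.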

> curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty=true' -d '耶稣登山宝训'
{
  "tokens" : [
    { "token" : "耶", "start_offset" : 0, "end_offset" : 1, "type" : "<IDEOGRAPHIC>", "position" : 1 },
    { "token" : "稣", "start_offset" : 1, "end_offset" : 2, "type" : "<IDEOGRAPHIC>", "position" : 2 },
    { "token" : "登", "start_offset" : 2, "end_offset" : 3, "type" : "<IDEOGRAPHIC>", "position" : 3 },
    { "token" : "山", "start_offset" : 3, "end_offset" : 4, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "宝", "start_offset" : 4, "end_offset" : 5, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "训", "start_offset" : 5, "end_offset" : 6, "type" : "<IDEOGRAPHIC>", "position" : 6 }
  ]
}

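The standard analyzer splits "耶稣登山宝训" into six separate characters, which is not very useful. The ideal result would be ["耶稣", "登山宝训"].

To segment Chinese text properly we need a plugin. mmseg is a fairly reliable choice; once installed it provides mmseg-analyzer, which handles Chinese reasonably well.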
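
To give a feel for how dictionary-based segmentation differs from a per-character split, here is a minimal forward maximum matching sketch in Python. It is a simplification: MMSEG builds on maximum matching but adds disambiguation rules that are not modelled here, and the three-word dictionary is a hypothetical stand-in for a real lexicon.

```python
def forward_max_match(text, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy dictionary, just large enough for this sentence.
DICTIONARY = {"耶稣", "登山", "登山宝训"}

print(list("耶稣登山宝训"))                           # ['耶', '稣', '登', '山', '宝', '训']
print(forward_max_match("耶稣登山宝训", DICTIONARY))  # ['耶稣', '登山宝训']
```

The quality of the result depends entirely on the dictionary, which is why Chinese analysis plugins let you extend it.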

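3.3 Searching Tokens Exactly
When searching usernames, product categories, or tags, we usually want exact matches. At index time it is best not to tokenize or stem such values; the analysis step can be skipped entirely.

In Elasticsearch 2.x and earlier, set "index": "not_analyzed" in the field's mapping so that the raw text becomes a single term. From 5.0 onward, map the field with "type": "keyword" instead.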
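
The difference between an analyzed field and an unanalyzed one can be sketched with a toy inverted index in Python. Everything here (function names, the sample tags, the crude regex tokenizer) is an illustrative assumption and not Elasticsearch code.

```python
import re
from collections import defaultdict

def analyzed_terms(value):
    """Rough stand-in for an analyzed text field: split on non-word characters, lowercase."""
    return [token.lower() for token in re.findall(r"\w+", value)]

def whole_value(value):
    """Unanalyzed field: the entire value is stored as one term."""
    return [value]

def build_index(docs, to_terms):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in to_terms(value):
            index[term].add(doc_id)
    return index

tags = {1: "New-York", 2: "York"}
analyzed = build_index(tags, analyzed_terms)
exact = build_index(tags, whole_value)

# Analyzed: "New-York" was split, so a lookup for "york" hits both documents.
print(sorted(analyzed.get("york", set())))  # [1, 2]
# Unanalyzed: only the document whose whole value is "York" matches.
print(sorted(exact.get("York", set())))     # [2]
```

For tags, categories, and usernames the second behaviour is usually the one you want.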
4 Configuring the IK Chinese Analyzer
First, test the basic behaviour of the IK analyzer:

POST _analyze?pretty
{
"analyzer": "ik_smart",
"text": "中华人民共和国国歌"
}
Result:

{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}
]
}
As you can see, ik_smart segments "中华人民共和国国歌" correctly.

Another example:

POST _analyze?pretty
{
"analyzer": "ik_smart",
"text": "王者荣耀是最好玩的游戏"
}
Result:

{
"tokens": [
{
"token": "王者",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "荣耀",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "最",
"start_offset": 5,
"end_offset": 6,
"type": "CN_CHAR",
"position": 2
},
{
"token": "好玩",
"start_offset": 6,
"end_offset": 8,
"type": "CN_WORD",
"position": 3
},
{
"token": "游戏",
"start_offset": 9,
"end_offset": 11,
"type": "CN_WORD",
"position": 4
}
]
}
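If your result differs from the one you want here, that is expected: the stock IK dictionary splits "王者荣耀" into separate words, but we would rather keep it as one. Following the instructions on GitHub, this can be configured.

IKAnalyzer.cfg.xml is located in: elasticsearch-5.4.0/plugins/ik/config

In the plugin's standard properties format, the file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- custom extension dictionaries -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- custom extension stop-word dictionaries -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- TODO -->
</properties>

Once this is configured, rerun the request above to see the result.

While we are at it, let's test ik_max_word: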

POST _analyze?pretty
{
"analyzer": "ik_max_word",
"text": "中华人民共和国国歌"
}
The result, for reference:
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
}
]
}
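
The contrast with ik_smart is that ik_max_word emits every dictionary word it can find, including overlapping ones. The Python sketch below imitates that idea by brute-force enumeration. It is not the IK algorithm: the dictionary is a hypothetical set containing just the words seen in the output above, and single-character tokens such as "国" (type CN_CHAR) are not modelled.

```python
def all_dictionary_words(text, dictionary):
    """Return (word, start, end) for every dictionary word occurring in text.

    Results are ordered by start offset, and by decreasing length within one offset.
    """
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                found.append((word, start, end))
    return found

# Hypothetical dictionary holding only the words from the example output.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人",
    "人民共和国", "人民", "共和国", "共和", "国歌",
}

for word, start, end in all_dictionary_words("中华人民共和国国歌", DICTIONARY):
    print(word, start, end)
```

Emitting overlapping words raises recall (a search for 共和国 or 人民 both hit the document) at the cost of a larger index, whereas the coarser ik_smart output keeps the index smaller.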
Now let's look at an example from GitHub:

POST /index/fulltext/_mapping
{
"fulltext": {
"_all": {
"analyzer": "ik_smart"
},
"properties": {
"content": {
"type": "text"
}
}
}
}
Index some documents:

POST /index/fulltext/1
{
"content": "美国留给伊拉克的是个烂摊子吗"
}
POST /index/fulltext/2
{
"content": "公安部:各地校车将享最高路权"
}
POST /index/fulltext/3
{
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
POST /index/fulltext/4
{
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首"
}
Search:

POST /index/fulltext/_search
{
"query": {
"match": {
"content": "中国"
}
}
}
Result:

{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1.0869478,
"hits": [
{
"_index": "index",
"_type": "fulltext",
"_id": "4",
"_score": 1.0869478,
"_source": {
"content": "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首"
}
},
{
"_index": "index",
"_type": "fulltext",
"_id": "3",
"_score": 0.61094594,
"_source": {
"content": "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船"
}
},
{
"_index": "index",
"_type": "fulltext",
"_id": "1",
"_score": 0.27179778,
"_source": {
"content": "美国留给伊拉克的是个烂摊子吗"
}
}
]
}
}
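Elasticsearch indexes the tokens produced by analysis and returns the hits for your query ranked by score. Note that in the mapping shown above the content field has no analyzer set, so unless the index defines a different default it is analyzed by the standard analyzer, which splits Chinese into single characters. That is why document 1 matches: it contains 国, even though it does not contain 中国.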
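
Why three of the four documents match can be reproduced with a small Python sketch. It assumes the content field is split into one token per character (as the standard analyzer does for Chinese) and that the match query uses its default OR semantics; the function names are invented for the example and scoring is not modelled.

```python
DOCS = {
    1: "美国留给伊拉克的是个烂摊子吗",
    2: "公安部:各地校车将享最高路权",
    3: "中韩渔警冲突调查:韩警平均每天扣1艘中国渔船",
    4: "中国驻洛杉矶领事馆遭亚裔男子枪击嫌犯已自首",
}

def per_character(text):
    """One token per letter, digit, or ideograph; punctuation is dropped."""
    return [ch for ch in text if ch.isalnum()]

def match(query, docs):
    """Ids of documents sharing at least one term with the query (OR semantics)."""
    query_terms = set(per_character(query))
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if query_terms & set(per_character(text))
    )

print(match("中国", DOCS))  # [1, 3, 4]
```

Document 1 matches only on 国 and document 2 contains neither character, which lines up with the hit list above.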

The project page has an example worth studying: https://github.com/medcl/elasticsearch-analysis-ik

Here is another interesting example:

PUT /index1
{
  "settings": {
    "refresh_interval": "5s",
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    },
    "resource": {
      "dynamic": false,
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}
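Multi-fields (fields) serve two purposes:

1. A single string can be mapped as text for full-text search and as keyword for sorting and aggregations.
2. They act like aliases for the same value, each analyzed with a different analyzer.

Bulk-insert some documents: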

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星驰最新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星驰最好看的新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星驰最新电影,最好,新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新电影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }
Search:

POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"fox",
"fields": "title"
}
}
}
Result:

{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
The reason: the query looks for fox in title, which uses the standard analyzer, so the indexed term is foxes and nothing matches. The following query does return results:

POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"fox",
"fields": "title.en"
}
}
}
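The results are omitted here; the query matches because title.en uses the english analyzer, which stems foxes to fox.

Compare the output of the following queries to get a feel for how fields are used: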

GET /index1/resource/_search
{
"query": {
"match": {
"title.cn": "the最好游戏"
}
}
}
POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"the最新游戏",
"fields": [ "title", "title.cn", "title.en" ]
}
}
}
POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"the最新",
"fields": "title.cn"
}
}
}
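Study the results to get a feel for the behaviour.

Next, test with "王者荣耀". Here the HotWords.php dictionary configured earlier turns out to be a double-edged sword: once "王者荣耀" is in it, the phrase is treated as a single token and is no longer split into "王者" and "荣耀". But what if you do want to search for 王者? This is where fields show their strength, as the queries below demonstrate.

First index the data: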

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者荣耀最好玩的游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者荣耀最好玩的新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者荣耀最新游戏,最好玩,新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新游戏" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }
POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"王者荣耀",
"fields": "title.cn"
}
}
}
# The next query returns no results
POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"王者",
"fields": "title.cn"
}
}
}
POST /index1/resource/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query":"王者",
"fields": "title"
}
}
}
Comparing the results makes the difference clear (results omitted).
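
The behaviour of the three queries can be imitated in Python. This is a toy model under stated assumptions: title is split into single characters (like the standard analyzer on Chinese), title.cn uses a greedy longest-match segmenter with a hypothetical dictionary that contains "王者荣耀", and matching means sharing at least one term. None of this is IK or Elasticsearch code.

```python
def forward_max_match(text, dictionary):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical dictionary; "王者荣耀" plays the role of the custom hot word.
DICTIONARY = {"王者荣耀", "王者", "荣耀", "好玩", "游戏"}

# One analyzer per (sub-)field, mirroring the multi-field mapping.
ANALYZERS = {
    "title": list,  # one token per character
    "title.cn": lambda text: forward_max_match(text, DICTIONARY),
}

DOCS = {6: "王者荣耀最好玩的游戏", 9: "最最最最好的新新新新游戏"}

def search(field, query, docs):
    """Analyze query and documents with the field's analyzer; match on any shared term."""
    analyze = ANALYZERS[field]
    query_terms = set(analyze(query))
    return sorted(d for d, text in docs.items() if query_terms & set(analyze(text)))

print(search("title.cn", "王者荣耀", DOCS))  # [6]
print(search("title.cn", "王者", DOCS))      # []
print(search("title", "王者", DOCS))         # [6]
```

In title.cn the document only holds the term "王者荣耀", so the query term "王者" finds nothing; the character-level title field still matches, which is the safety net multi-fields provide.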

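So you need a thorough understanding of the business requirements up front in order to design a good mapping; it also makes searching much easier.

References: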

https://github.com/medcl/elasticsearch-analysis-ik

http://keenwon.com/1404.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html#_example_output
