Analyzers
An analyzer is a tool that splits a piece of user-supplied text into individual terms according to a given set of rules.
standard analyzer
The standard analyzer is the default; it is used whenever no analyzer is specified.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"standard",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
simple analyzer
The simple analyzer splits text into terms at every character that is not a letter, and lowercases all terms (all non-letter characters are dropped).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"simple",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
whitespace analyzer
The whitespace analyzer splits text into terms whenever it encounters a whitespace character (the sentence is simply split on spaces).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"whitespace",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3-points",
"start_offset": 9,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 4
},
{
"token": "Curry!",
"start_offset": 29,
"end_offset": 35,
"type": "word",
"position": 5
}
]
}
stop analyzer
The stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the english stop word list by default.
stopwords is a predefined list of stop words, for example the, an, a, this, of, at, and so on.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"stop",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 3
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 5
}
]
}
language analyzer
Language analyzers are tuned for a specific language (for example, english for English text). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"english",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document (note the stemmed tokens point and curri):
{
"tokens": [
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "<NUM>",
"position": 2
},
{
"token": "point",
"start_offset": 11,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "curri",
"start_offset": 29,
"end_offset": 34,
"type": "<ALPHANUM>",
"position": 6
}
]
}
pattern analyzer
The pattern analyzer splits text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters).
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"pattern",
"text":"The best 3-points shooter is Curry!"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "best",
"start_offset": 4,
"end_offset": 8,
"type": "word",
"position": 1
},
{
"token": "3",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 2
},
{
"token": "points",
"start_offset": 11,
"end_offset": 17,
"type": "word",
"position": 3
},
{
"token": "shooter",
"start_offset": 18,
"end_offset": 25,
"type": "word",
"position": 4
},
{
"token": "is",
"start_offset": 26,
"end_offset": 28,
"type": "word",
"position": 5
},
{
"token": "curry",
"start_offset": 29,
"end_offset": 34,
"type": "word",
"position": 6
}
]
}
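The default pattern can be overridden in a custom analyzer defined in the index settings. As a sketch (the index name my_pattern_index and the comma pattern are illustrative, not from the text above), an analyzer that splits on commas could be declared like this:

```shell
# Sketch: a custom pattern analyzer that splits terms on commas.
# "my_pattern_index" and "comma_analyzer" are hypothetical names.
curl -X PUT "localhost:9200/my_pattern_index" -H 'Content-Type:application/json' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "comma_analyzer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}
'
```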
Exercise
Create an index with a custom analyzer
curl -X PUT "localhost:9200/my_index" -H 'Content-Type:application/json' -d '
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{
"type":"whitespace"
}
}
}
},
"mappings":{
"properties":{
"name":{
"type":"text"
},
"team_name":{
"type":"text"
},
"position":{
"type":"text"
},
"play_year":{
"type":"long"
},
"jerse_no":{
"type":"keyword"
},
"title":{
"type":"text",
"analyzer":"my_analyzer"
}
}
}
}
'
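Before indexing any documents, the custom analyzer can be checked with the index-scoped _analyze API. This request runs my_analyzer (defined above as type whitespace) against the sample sentence, so it should split on spaces only, keeping tokens like 3-points and Curry! intact:

```shell
# Verify the custom whitespace-based analyzer defined in my_index.
curl -X POST "localhost:9200/my_index/_analyze" -H 'Content-Type:application/json' -d '
{
  "analyzer":"my_analyzer",
  "text":"The best 3-points shooter is Curry!"
}
'
```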
Add a document
curl -X PUT 'localhost:9200/my_index/_doc/1' -H 'Content-Type:application/json' -d '
{
"name":"库里",
"team_name":"勇士",
"position":"控球后卫",
"play_year":10,
"jerse_no":"30",
"title":"The best 3-points shooter is Curry!"
}
'
Search
A document is only found when the query terms match the tokens the field's analyzer produced at index time. The title field uses the whitespace analyzer, so it was indexed with the token Curry! (punctuation kept); the match query below for Curry therefore returns no hits.
curl -X POST 'localhost:9200/my_index/_search' -H 'Content-Type:application/json' -d '
{
"query":{
"match":{
"title":"Curry"
}
}
}
'
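Because the whitespace analyzer kept the exclamation mark, querying for the exact indexed token Curry! should return the document (same index and field as above):

```shell
# The whitespace analyzer indexed the token "Curry!" (punctuation kept),
# so only a query matching that exact token hits the document.
curl -X POST 'localhost:9200/my_index/_search' -H 'Content-Type:application/json' -d '
{
  "query":{
    "match":{
      "title":"Curry!"
    }
  }
}
'
```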
Common Chinese analyzers
Using the default (standard) analyzer on Chinese text splits it into individual characters:
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"standard",
"text": "火箭明年总冠军"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "火",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "箭",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "明",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "年",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "总",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "冠",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "军",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
}
]
}
Common analyzers: smartCN
A simple analyzer for Chinese text and mixed Chinese-English text
Install the plugin
From the Elasticsearch bin directory, /Users/xxx/workspace/elasticsearch-7.2.0/bin, run:
sh elasticsearch-plugin install analysis-smartcn
Restart Elasticsearch after installing.
Usage
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"smartcn",
"text": "火箭明年总冠军"
}
'
The response is a JSON document:
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "word",
"position": 1
},
{
"token": "总",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "冠军",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 3
}
]
}
Uninstall the plugin
sh elasticsearch-plugin remove analysis-smartcn
Common analyzers: the IK analyzer
A smarter and more flexible Chinese analyzer
Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Download the release matching your Elasticsearch version (here ES 7.2).
Unzip the archive and move it into the plugins directory of the installation:
mv elasticsearch-analysis-ik-7.2.0 /Users/wuyihong/workspace/elasticsearch-7.2.0/plugins/
Restart Elasticsearch.
Usage
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
"analyzer":"ik_max_word",
"text": "火箭明年总冠军"
}
'
The response is a JSON document (ik_max_word emits overlapping terms: both 总冠军 and 冠军 are produced):
{
"tokens": [
{
"token": "火箭",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "明年",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "总冠军",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 2
},
{
"token": "冠军",
"start_offset": 5,
"end_offset": 7,
"type": "CN_WORD",
"position": 3
}
]
}
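The IK plugin ships two analyzers: ik_max_word, used above, exhaustively emits overlapping terms, while ik_smart produces the coarsest-grained split with no overlaps. The same text can be run through ik_smart for comparison:

```shell
# ik_smart: coarsest-grained segmentation, no overlapping terms.
curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
  "analyzer":"ik_smart",
  "text": "火箭明年总冠军"
}
'
```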