ElasticSearch (6): Introduction and Usage of Analyzers

Analyzers

An analyzer is a tool that splits a piece of input text into multiple terms according to a defined set of rules.

standard analyzer

The standard analyzer is the default; it is used whenever no analyzer is specified explicitly.

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"standard",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "3",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<NUM>",
            "position": 2
        },
        {
            "token": "points",
            "start_offset": 11,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "is",
            "start_offset": 26,
            "end_offset": 28,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "curry",
            "start_offset": 29,
            "end_offset": 34,
            "type": "<ALPHANUM>",
            "position": 6
        }
    ]
}
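
For plain ASCII text like the example above, the standard analyzer's behavior can be approximated with a short sketch (`standard_like` is an illustrative name; the real analyzer implements Unicode text segmentation, UAX #29, so this is only an approximation):

```python
import re

def standard_like(text):
    # Rough approximation for ASCII input: split on anything that is not
    # a letter or digit, and lowercase the result. This is why "3-points"
    # above becomes the two tokens "3" and "points".
    return re.findall(r"[a-z0-9]+", text.lower())

standard_like("The best 3-points shooter is Curry!")
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```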

simple analyzer

The simple analyzer splits text into terms at every character that is not a letter, and lowercases all terms; every non-letter character (digits and punctuation included) is discarded.

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"simple",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "word",
            "position": 1
        },
        {
            "token": "points",
            "start_offset": 11,
            "end_offset": 17,
            "type": "word",
            "position": 2
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "word",
            "position": 3
        },
        {
            "token": "is",
            "start_offset": 26,
            "end_offset": 28,
            "type": "word",
            "position": 4
        },
        {
            "token": "curry",
            "start_offset": 29,
            "end_offset": 34,
            "type": "word",
            "position": 5
        }
    ]
}
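
The simple analyzer's letters-only behavior can be mimicked in a few lines (a sketch; `simple_like` is a made-up name):

```python
import re

def simple_like(text):
    # Keep runs of letters only, lowercased. Digits and punctuation are
    # dropped, which is why "3-points" above yields just "points".
    return re.findall(r"[a-z]+", text.lower())

simple_like("The best 3-points shooter is Curry!")
# → ['the', 'best', 'points', 'shooter', 'is', 'curry']
```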

whitespace analyzer

The whitespace analyzer splits text into terms at whitespace characters, i.e. it simply breaks the sentence on spaces.

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"whitespace",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "The",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "word",
            "position": 1
        },
        {
            "token": "3-points",
            "start_offset": 9,
            "end_offset": 17,
            "type": "word",
            "position": 2
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "word",
            "position": 3
        },
        {
            "token": "is",
            "start_offset": 26,
            "end_offset": 28,
            "type": "word",
            "position": 4
        },
        {
            "token": "Curry!",
            "start_offset": 29,
            "end_offset": 35,
            "type": "word",
            "position": 5
        }
    ]
}
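
The same splitting can be sketched directly with Python's `str.split` (`whitespace_like` is an illustrative name):

```python
def whitespace_like(text):
    # Split on runs of whitespace only; case and punctuation survive,
    # so "Curry!" stays a single token.
    return text.split()

whitespace_like("The best 3-points shooter is Curry!")
# → ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```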

stop analyzer

The stop analyzer is similar to the simple analyzer; the only difference is that it also removes stop words, using the english stop set by default.
stopwords is the predefined stop-word list, e.g. the, an, a, this, of, at, and so on.

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"stop",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "word",
            "position": 1
        },
        {
            "token": "points",
            "start_offset": 11,
            "end_offset": 17,
            "type": "word",
            "position": 2
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "word",
            "position": 3
        },
        {
            "token": "curry",
            "start_offset": 29,
            "end_offset": 34,
            "type": "word",
            "position": 5
        }
    ]
}
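
A minimal sketch of the stop analyzer: tokenize like the simple analyzer, then drop stop words. The small set below is only a sample; the real english stop set is larger, and `stop_like` is a made-up name:

```python
import re

# Sample of English stop words (the analyzer's real english set is larger).
STOP_WORDS = {"the", "an", "a", "this", "of", "at", "is"}

def stop_like(text):
    # Letters-only lowercased tokens, with stop words filtered out.
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

stop_like("The best 3-points shooter is Curry!")
# → ['best', 'points', 'shooter', 'curry']
```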

language analyzer

Language-specific analyzers (for example, english for English text). Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"english",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response (note the stemming: points → point, Curry → curri):

{
    "tokens": [
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "3",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<NUM>",
            "position": 2
        },
        {
            "token": "point",
            "start_offset": 11,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "curri",
            "start_offset": 29,
            "end_offset": 34,
            "type": "<ALPHANUM>",
            "position": 6
        }
    ]
}

pattern analyzer

The pattern analyzer uses a regular expression to split text into terms; the default pattern is \W+ (one or more non-word characters).

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"pattern",
	"text":"The best 3-points shooter is Curry!"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "best",
            "start_offset": 4,
            "end_offset": 8,
            "type": "word",
            "position": 1
        },
        {
            "token": "3",
            "start_offset": 9,
            "end_offset": 10,
            "type": "word",
            "position": 2
        },
        {
            "token": "points",
            "start_offset": 11,
            "end_offset": 17,
            "type": "word",
            "position": 3
        },
        {
            "token": "shooter",
            "start_offset": 18,
            "end_offset": 25,
            "type": "word",
            "position": 4
        },
        {
            "token": "is",
            "start_offset": 26,
            "end_offset": 28,
            "type": "word",
            "position": 5
        },
        {
            "token": "curry",
            "start_offset": 29,
            "end_offset": 34,
            "type": "word",
            "position": 6
        }
    ]
}
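
The default \W+ splitting can be sketched with Python's `re.split` (`pattern_like` is an illustrative name, not part of ES):

```python
import re

def pattern_like(text, pattern=r"\W+"):
    # Split on the regex (default \W+, matching the analyzer's default)
    # and lowercase; empty strings from a trailing match are dropped.
    return [t for t in re.split(pattern, text.lower()) if t]

pattern_like("The best 3-points shooter is Curry!")
# → ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```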

Exercise

Create an index with a custom analyzer

curl -X PUT "localhost:9200/my_index" -H 'Content-Type:application/json' -d '
{
	"settings":{
		"analysis":{
			"analyzer":{
				"my_analyzer":{
					"type":"whitespace"
				}
			}
		}
	},
	"mappings":{
		"properties":{
			"name":{
				"type":"text"
			},
			"team_name":{
				"type":"text"
			},
			"position":{
				"type":"text"
			},
			"play_year":{
				"type":"long"
			},
			"jerse_no":{
				"type":"keyword"
			},
			"title":{
				"type":"text",
				"analyzer":"my_analyzer"
			}
		}
	}
}
'

Add a document

curl -X PUT 'localhost:9200/my_index/_doc/1' -H 'Content-Type:application/json' -d '
{
	"name":"库里",
	"team_name":"勇士",
	"position":"控球后卫",
	"play_year":10,
	"jerse_no":"30",
	"title":"The best 3-points shooter is Curry!"
}
'

Query

A document can only be found when the query terms match the tokens produced by analysis at index time.

curl -X POST 'localhost:9200/my_index/_search' -H 'Content-Type:application/json' -d '
{
	"query":{
		"match":{
			"title":"Curry"
		}
	}
}
'
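
Why the analyzer choice matters here can be sketched offline: my_analyzer is a whitespace analyzer, so the title is indexed with "Curry!" keeping its punctuation, and a match only occurs when a query token equals an indexed token (illustrative Python, not the ES matching code):

```python
# Tokens the whitespace-based my_analyzer would index for the title field:
indexed = "The best 3-points shooter is Curry!".split()

"Curry!" in indexed   # the exact token, punctuation included, is indexed
"Curry" in indexed    # the bare word never appears as a token
```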


Common Chinese analyzers

Tokenizing Chinese text with the default (standard) analyzer:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"standard",
	"text": "火箭明年总冠军"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "火",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "箭",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "明",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "年",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "总",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "冠",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "军",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        }
    ]
}
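
As the output shows, the standard analyzer has no Chinese dictionary and falls back to one token per CJK character, which this sketch reproduces (`standard_cjk_like` is an illustrative name):

```python
def standard_cjk_like(text):
    # For pure CJK input the standard analyzer emits one token per character.
    return list(text)

standard_cjk_like("火箭明年总冠军")
# → ['火', '箭', '明', '年', '总', '冠', '军']
```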

smartcn analyzer

smartcn is a simple analyzer for Chinese text, or mixed Chinese and English text.
To install the plugin, change into the bin directory of the ES installation (/Users/xxx/workspace/elasticsearch-7.2.0/bin) and run:

sh elasticsearch-plugin install analysis-smartcn

Restart Elasticsearch after installing, then use it:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"smartcn",
	"text": "火箭明年总冠军"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "火箭",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0
        },
        {
            "token": "明年",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1
        },
        {
            "token": "总",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 2
        },
        {
            "token": "冠军",
            "start_offset": 5,
            "end_offset": 7,
            "type": "word",
            "position": 3
        }
    ]
}

Uninstalling the plugin

sh elasticsearch-plugin remove analysis-smartcn

IK analyzer

A smarter and more user-friendly Chinese analyzer.

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases
Pick the release matching your ES version (ES 7.2 here).
Unpack the archive and move it into the plugins directory of the installation:

mv elasticsearch-analysis-ik-7.2.0 /Users/wuyihong/workspace/elasticsearch-7.2.0/plugins/

Restart ES, then use it:

curl -X POST "localhost:9200/_analyze" -H 'Content-Type:application/json' -d '
{
	"analyzer":"ik_max_word",
	"text": "火箭明年总冠军"
}
'

Returns a JSON response:

{
    "tokens": [
        {
            "token": "火箭",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "明年",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "总冠军",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "冠军",
            "start_offset": 5,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 3
        }
    ]
}
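
Dictionary-based Chinese segmentation of the kind IK and smartcn perform can be illustrated with forward maximum matching, a toy greedy longest-first scan (not the plugins' actual algorithms; note that ik_max_word additionally emits overlapping sub-words such as 冠军 inside 总冠军, which this sketch does not):

```python
def forward_max_match(text, dictionary, max_len=4):
    # Greedy longest-first scan: at each position try the longest dictionary
    # word first; fall back to a single character when nothing matches.
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + j]
            if j == 1 or piece in dictionary:
                tokens.append(piece)
                i += j
                break
    return tokens

forward_max_match("火箭明年总冠军", {"火箭", "明年", "总冠军", "冠军"})
# → ['火箭', '明年', '总冠军']
```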