ElasticSearch Search Engine from Beginner to Practice 5 -- Using the Common Built-in Analyzers

What Is an Analyzer
An analyzer is a tool that takes a piece of text entered by the user and, following certain rules, breaks it up into individual terms.

Common Built-in Analyzers
standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer

standard analyzer
The standard analyzer is the default analyzer; it is used whenever no analyzer is explicitly specified.

POST /_analyze
{
  "analyzer": "standard",
  "text": "The best 3-points shooter is Curry!"
}

//Tokenization result. Note: to save space, the irrelevant fields have been removed by hand; only the tokens are shown.
{
  "tokens": ["the", "best", "3", "points", "shooter", "is", "curry"]
}
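
The standard analyzer can also be configured; for example, it accepts a stopwords parameter. Below is a minimal sketch, assuming a hypothetical index named test_standard and a hypothetical analyzer name my_standard_analyzer:

PUT /test_standard
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_standard_analyzer": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /test_standard/_analyze
{
  "analyzer": "my_standard_analyzer",
  "text": "The best 3-points shooter is Curry!"
}

With the built-in _english_ stop-word list enabled, the and is are removed, leaving best, 3, points, shooter and curry.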

simple analyzer
The simple analyzer splits the text into a new term whenever it encounters a character that is not a letter, and every term is lowercased (that is why the digit 3 is missing from the result below).

POST /_analyze
{
  "analyzer": "simple",
  "text": "The best 3-points shooter is Curry!"
}

//Result
{
  "tokens": ["the", "best", "points", "shooter", "is", "curry"]
}

whitespace analyzer
The whitespace analyzer splits the text into terms whenever it encounters a whitespace character; it does not lowercase the terms and keeps punctuation attached.

POST /_analyze
{
  "analyzer": "whitespace",
  "text": "The best 3-points shooter is Curry!"
}

//Result
{
  "tokens": ["The", "best", "3-points", "shooter", "is", "Curry!"]
}

stop analyzer
The stop analyzer is very similar to the simple analyzer; the only difference is that it also removes stop words, using the English stop-word list by default.
stopwords is the predefined list of stop words, for example the, a, an, this, of, at, and so on.

POST /_analyze
{
  "analyzer": "stop",
  "text": "The best 3-points shooter is Curry!"
}

//Result
{
  "tokens": ["best", "points", "shooter", "curry"]
}
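
The default English stop-word list can be replaced with a custom one. A minimal sketch, assuming a hypothetical index named test_stop and a hypothetical analyzer name my_stop_analyzer:

PUT /test_stop
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "is", "best"]
        }
      }
    }
  }
}

POST /test_stop/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The best 3-points shooter is Curry!"
}

Only the words in the custom list are removed, so the tokens become points, shooter and curry.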

language analyzer
An analyzer for a specific language, for example english, the English analyzer. Language analyzers remove the stop words of that language and stem the remaining terms, which is why points becomes point and Curry becomes curri in the result below.

POST /_analyze
{
  "analyzer": "english",
  "text": "The best 3-points shooter is Curry!"
}

//Result
{
  "tokens": ["best", "3", "point", "shooter", "curri"]
}
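
If certain words should not be stemmed, the language analyzers accept a stem_exclusion list. A minimal sketch, assuming a hypothetical index named test_english and a hypothetical analyzer name my_english_analyzer:

PUT /test_english
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "english",
          "stem_exclusion": ["curry"]
        }
      }
    }
  }
}

POST /test_english/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The best 3-points shooter is Curry!"
}

With curry excluded from stemming, the last token stays curry instead of curri.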

pattern analyzer
Splits the text into terms using a regular expression; the default pattern is \W+ (one or more non-word characters).

POST /_analyze
{
  "analyzer": "pattern",
  "text": "The best 3-points shooter is Curry!"
}

//Result
{
  "tokens": ["the", "best", "3", "points", "shooter", "is", "curry"]
}
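
The regular expression can be overridden when defining a custom analyzer. A minimal sketch, assuming a hypothetical index named test_pattern whose analyzer splits on commas:

PUT /test_pattern
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_comma_analyzer": {
          "type": "pattern",
          "pattern": ",",
          "lowercase": true
        }
      }
    }
  }
}

POST /test_pattern/_analyze
{
  "analyzer": "my_comma_analyzer",
  "text": "Curry,Thompson,Green"
}

This splits only on commas and lowercases the result, producing the tokens curry, thompson and green.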


Use Case
Create a new index and, when defining the mapping, use a custom analyzer for one of the fields.

PUT /my_index
{
	"settings": {
		"analysis": {
			"analyzer": {
				"my_analyzer": {
					"type": "whitespace"
				}
			}
		}
	},
	"mappings": {
		"properties": {
			"name": {
				"type": "text"
			},
			"team_name": {
				"type": "text"
			},
			"position": {
				"type": "text"
			},
			"play_year": {
				"type": "long"
			},
			"jerse_no": {
				"type": "keyword"
			},
			"title": {
				"type": "text",
				"analyzer": "my_analyzer"
			}
		}
	}
}
PUT /my_index/_doc/1
{
  "name": "Curry",
  "team_name": "Warriors",
  "position": "point guard",
  "play_year": 10,
  "jerse_no": "30",
  "title": "The best 3-points shooter is Curry!"
}

POST /my_index/_search
{
  "query": {
    "match": {
      "title": "Curry!"
    }
  }
}
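
To see why the search above matches, run the custom analyzer against the title text with the index's _analyze API:

POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The best 3-points shooter is Curry!"
}

The title is indexed as the tokens The, best, 3-points, shooter, is and Curry!. The match query analyzes its input with the same whitespace-based analyzer, so searching for Curry! hits document 1, while searching for curry (or Curry without the exclamation mark) would not.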


 
