Tokenization and Built-in Analyzers

@所谓伊人

Article link: https://blog.csdn.net/qq_39750835/article/details/119543269

1. Global analysis
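
The request for this step did not survive in the text. As a minimal sketch, assuming a cluster at the hypothetical address http://localhost:9200, the global form of the `_analyze` API needs no index: it takes an analyzer name and the text to analyze. The helper name `global_analyze_request` is made up for illustration.

```python
import json
from urllib import request

# Assumption: a local single-node cluster; change the address for your setup.
ES_URL = "http://localhost:9200"


def global_analyze_request(analyzer: str, text: str) -> request.Request:
    """Build (without sending) a POST /_analyze request, which needs no index."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return request.Request(
        f"{ES_URL}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = global_analyze_request("standard", "My name is Peter Parker")
# With a cluster running, send it with: json.load(request.urlopen(req))
```

Swapping "standard" for simple, whitespace, stop, or keyword selects the other built-in analyzers covered below.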

2. Index-level analysis
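
The original request is missing here as well. A sketch of the index-scoped form, assuming a hypothetical index named `shop` with a hypothetical text field `desc`: putting the index in the path and naming a field lets Elasticsearch pick whichever analyzer that field's mapping resolves to.

```python
import json
from urllib import request

# Assumptions: a local cluster, plus a hypothetical index "shop" whose
# hypothetical text field "desc" is defined in the index mapping.
ES_URL = "http://localhost:9200"


def index_analyze_request(index: str, field: str, text: str) -> request.Request:
    """Build (without sending) a POST /<index>/_analyze request.

    Passing "field" instead of "analyzer" asks Elasticsearch to use the
    analyzer configured for that field in the index mapping.
    """
    body = json.dumps({"field": field, "text": text}).encode("utf-8")
    return request.Request(
        f"{ES_URL}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = index_analyze_request("shop", "desc", "My name is Peter Parker")
# With a cluster running, send it with: json.load(request.urlopen(req))
```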

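3. standard: splits text on word boundaries and converts uppercase to lowercase

The input here has no space after "Hero.", so "hero.i" comes back as a single token: the standard tokenizer does not break on a period between two letters. Analysis result: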

{
	"tokens": [
		{
			"token": "my",
			"start_offset": 0,
			"end_offset": 2,
			"type": "<ALPHANUM>",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "<ALPHANUM>",
			"position": 1
		},
		{
			"token": "is",
			"start_offset": 8,
			"end_offset": 10,
			"type": "<ALPHANUM>",
			"position": 2
		},
		{
			"token": "peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "<ALPHANUM>",
			"position": 3
		},
		{
			"token": "parker",
			"start_offset": 17,
			"end_offset": 23,
			"type": "<ALPHANUM>",
			"position": 4
		},
		{
			"token": "i",
			"start_offset": 24,
			"end_offset": 25,
			"type": "<ALPHANUM>",
			"position": 5
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "<ALPHANUM>",
			"position": 6
		},
		{
			"token": "a",
			"start_offset": 29,
			"end_offset": 30,
			"type": "<ALPHANUM>",
			"position": 7
		},
		{
			"token": "super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "<ALPHANUM>",
			"position": 8
		},
		{
			"token": "hero.i",
			"start_offset": 37,
			"end_offset": 43,
			"type": "<ALPHANUM>",
			"position": 9
		},
		{
			"token": "don't",
			"start_offset": 44,
			"end_offset": 49,
			"type": "<ALPHANUM>",
			"position": 10
		},
		{
			"token": "like",
			"start_offset": 50,
			"end_offset": 54,
			"type": "<ALPHANUM>",
			"position": 11
		},
		{
			"token": "the",
			"start_offset": 55,
			"end_offset": 58,
			"type": "<ALPHANUM>",
			"position": 12
		},
		{
			"token": "criminals",
			"start_offset": 59,
			"end_offset": 68,
			"type": "<ALPHANUM>",
			"position": 13
		}
	]
}
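
To make the fields in that response concrete, here is a rough sketch that reproduces it for this one sentence. It is not the real standard tokenizer, which follows Unicode word-boundary rules; the regex below is an ASCII-only approximation, and `standard_like` is a made-up name.

```python
import re

# Rough ASCII-only stand-in for the standard tokenizer: runs of letters or
# digits, optionally joined by a "." or "'" that sits between two such runs.
_WORD = re.compile(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*")


def standard_like(text: str) -> list[dict]:
    """Return tokens shaped like the _analyze response."""
    return [
        {
            "token": m.group().lower(),   # lowercase filter
            "start_offset": m.start(),    # index of the first character
            "end_offset": m.end(),        # index just past the last character
            "type": "<ALPHANUM>",         # simplification: always this type
            "position": pos,              # ordinal of the token in the stream
        }
        for pos, m in enumerate(_WORD.finditer(text))
    ]


text = "My name is Peter Parker,I am a Super Hero.I don't like the Criminals."
tokens = standard_like(text)
print([t["token"] for t in tokens])
```

The comma in "Parker,I" splits the two words, while the period in "Hero.I" and the apostrophe in "don't" do not, which matches the response above.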

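4. simple: splits on every non-letter character and also converts uppercase to lowercase

don't is split into don and t.

Digits added to a word are dropped from the result too, because simple splits on every non-letter character.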
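
Both effects can be sketched in a few lines. This is an approximation of the behavior described above, not Elasticsearch code, and `simple_like` is a made-up name.

```python
import re

# [^\W\d_] matches a word character that is neither a digit nor "_",
# i.e. a letter, so every other character acts as a separator.
_LETTERS = re.compile(r"[^\W\d_]+")


def simple_like(text: str) -> list[str]:
    """Split on non-letters and lowercase, roughly like the simple analyzer."""
    return [m.group().lower() for m in _LETTERS.finditer(text)]


print(simple_like("I don't like the Criminals99"))
# ['i', 'don', 't', 'like', 'the', 'criminals']
print(simple_like("Super4Hero"))
# ['super', 'hero']
```

The apostrophe and the digits are treated the same way: they end the current token and are discarded.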

5. whitespace: splits on whitespace only, so Parker,I is treated as a single word; uppercase is not converted to lowercase

Analysis result:

{
	"tokens": [
		{
			"token": "My",
			"start_offset": 0,
			"end_offset": 2,
			"type": "word",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "word",
			"position": 1
		},
		{
			"token": "is",
			"start_offset": 8,
			"end_offset": 10,
			"type": "word",
			"position": 2
		},
		{
			"token": "Peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "word",
			"position": 3
		},
		{
			"token": "Parker,I",
			"start_offset": 17,
			"end_offset": 25,
			"type": "word",
			"position": 4
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "word",
			"position": 5
		},
		{
			"token": "a",
			"start_offset": 29,
			"end_offset": 30,
			"type": "word",
			"position": 6
		},
		{
			"token": "Super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "word",
			"position": 7
		},
		{
			"token": "Hero.",
			"start_offset": 37,
			"end_offset": 42,
			"type": "word",
			"position": 8
		},
		{
			"token": "I",
			"start_offset": 43,
			"end_offset": 44,
			"type": "word",
			"position": 9
		},
		{
			"token": "don't",
			"start_offset": 45,
			"end_offset": 50,
			"type": "word",
			"position": 10
		},
		{
			"token": "like",
			"start_offset": 51,
			"end_offset": 55,
			"type": "word",
			"position": 11
		},
		{
			"token": "the",
			"start_offset": 56,
			"end_offset": 59,
			"type": "word",
			"position": 12
		},
		{
			"token": "Criminals.",
			"start_offset": 60,
			"end_offset": 70,
			"type": "word",
			"position": 13
		}
	]
}
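
The same response can be approximated by matching runs of non-whitespace characters and leaving them untouched. A sketch with a made-up helper name, `whitespace_like`:

```python
import re


def whitespace_like(text: str) -> list[dict]:
    """Split on whitespace only; keep case and punctuation as they are."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "type": "word",
            "position": pos,
        }
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]


# This input has a space after "Hero.", as in the response above.
text = "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
tokens = whitespace_like(text)
print([t["token"] for t in tokens])
```

Punctuation stays attached to its neighbors ("Hero.", "Criminals."), which is why a search for "hero" would not match a field analyzed this way.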

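6. stop: removes low-information words such as the, a, and is. Tokenization otherwise matches simple (split on non-letters, then lowercase), and each removed word leaves a gap in the position values.

Result returned: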

{
	"tokens": [
		{
			"token": "my",
			"start_offset": 0,
			"end_offset": 2,
			"type": "word",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "word",
			"position": 1
		},
		{
			"token": "peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "word",
			"position": 3
		},
		{
			"token": "parker",
			"start_offset": 17,
			"end_offset": 23,
			"type": "word",
			"position": 4
		},
		{
			"token": "i",
			"start_offset": 24,
			"end_offset": 25,
			"type": "word",
			"position": 5
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "word",
			"position": 6
		},
		{
			"token": "super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "word",
			"position": 8
		},
		{
			"token": "hero",
			"start_offset": 37,
			"end_offset": 41,
			"type": "word",
			"position": 9
		},
		{
			"token": "i",
			"start_offset": 43,
			"end_offset": 44,
			"type": "word",
			"position": 10
		},
		{
			"token": "don",
			"start_offset": 45,
			"end_offset": 48,
			"type": "word",
			"position": 11
		},
		{
			"token": "t",
			"start_offset": 49,
			"end_offset": 50,
			"type": "word",
			"position": 12
		},
		{
			"token": "like",
			"start_offset": 51,
			"end_offset": 55,
			"type": "word",
			"position": 13
		},
		{
			"token": "criminals",
			"start_offset": 60,
			"end_offset": 69,
			"type": "word",
			"position": 15
		}
	]
}
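
The position gaps in that response (2, 7, and 14 are missing) fall out naturally if positions are assigned before the stop words are filtered. A sketch under that reading; the stop list here is deliberately abbreviated for illustration and is shorter than the default English list Elasticsearch uses, and `stop_like` is a made-up name.

```python
import re

# Assumption: a shortened stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def stop_like(text: str, stop_words: set[str] = STOP_WORDS) -> list[tuple[str, int]]:
    """Tokenize like simple, then drop stop words but keep original positions."""
    kept = []
    for pos, m in enumerate(re.finditer(r"[^\W\d_]+", text)):
        token = m.group().lower()
        if token not in stop_words:
            kept.append((token, pos))
    return kept


text = "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
result = stop_like(text)
print(result)
```

Keeping the gaps matters for phrase queries, which compare positions: "name peter" is still two positions apart, not adjacent.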

7. keyword: performs no tokenization; the entire text is emitted as one single token.
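
No response is shown for this analyzer, so the following is only a sketch of the shape to expect, assuming non-empty ASCII input: one token that spans the whole string. `keyword_like` is a made-up name.

```python
def keyword_like(text: str) -> list[dict]:
    """Return the whole input as one token, with no splitting or lowercasing."""
    return [
        {
            "token": text,
            "start_offset": 0,
            "end_offset": len(text),
            "type": "word",
            "position": 0,
        }
    ]


tokens = keyword_like("My name is Peter Parker")
print(tokens[0]["token"])
```

This suits fields that are matched exactly, such as IDs or status codes, where splitting the value would only produce false matches.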