Tokenization and built-in analyzers

1. Global analysis (the analyze request is not tied to any index)

2. Index-level analysis (the analyze request targets a specific index)
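The difference between the two forms comes down to the request path. The sketch below only builds the request pieces and sends nothing; it assumes the Elasticsearch `_analyze` endpoint, and both the helper `build_analyze_request` and the index name `my_index` are hypothetical names chosen for illustration.

```python
import json

def build_analyze_request(text, analyzer="standard", index=None):
    """Return (method, path, body) for an Elasticsearch _analyze call."""
    # Without an index the request goes to the cluster-wide endpoint;
    # with one, analysis runs in the context of that index.
    path = "/_analyze" if index is None else f"/{index}/_analyze"
    body = json.dumps({"analyzer": analyzer, "text": text})
    return "POST", path, body

text = "My name is Peter Parker"
print(build_analyze_request(text))                    # global analysis
print(build_analyze_request(text, index="my_index"))  # hypothetical index name
```

The body is the same in both cases; only the path changes.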

3. standard: in the analysis result, uppercase letters are converted to lowercase.

The returned analysis result:

{
	"tokens": [
		{
			"token": "my",
			"start_offset": 0,
			"end_offset": 2,
			"type": "<ALPHANUM>",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "<ALPHANUM>",
			"position": 1
		},
		{
			"token": "is",
			"start_offset": 8,
			"end_offset": 10,
			"type": "<ALPHANUM>",
			"position": 2
		},
		{
			"token": "peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "<ALPHANUM>",
			"position": 3
		},
		{
			"token": "parker",
			"start_offset": 17,
			"end_offset": 23,
			"type": "<ALPHANUM>",
			"position": 4
		},
		{
			"token": "i",
			"start_offset": 24,
			"end_offset": 25,
			"type": "<ALPHANUM>",
			"position": 5
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "<ALPHANUM>",
			"position": 6
		},
		{
			"token": "a",
			"start_offset": 29,
			"end_offset": 30,
			"type": "<ALPHANUM>",
			"position": 7
		},
		{
			"token": "super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "<ALPHANUM>",
			"position": 8
		},
		{
			"token": "hero.i",
			"start_offset": 37,
			"end_offset": 43,
			"type": "<ALPHANUM>",
			"position": 9
		},
		{
			"token": "don't",
			"start_offset": 44,
			"end_offset": 49,
			"type": "<ALPHANUM>",
			"position": 10
		},
		{
			"token": "like",
			"start_offset": 50,
			"end_offset": 54,
			"type": "<ALPHANUM>",
			"position": 11
		},
		{
			"token": "the",
			"start_offset": 55,
			"end_offset": 58,
			"type": "<ALPHANUM>",
			"position": 12
		},
		{
			"token": "criminals",
			"start_offset": 59,
			"end_offset": 68,
			"type": "<ALPHANUM>",
			"position": 13
		}
	]
}
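When comparing analyzers it is often easier to look only at the token strings. A minimal sketch, assuming the response has the same shape as the output above; the helper name `token_strings` and the shortened sample are made up for illustration.

```python
import json

def token_strings(response_json):
    """Return the token strings from an _analyze response, in position order."""
    tokens = json.loads(response_json)["tokens"]
    return [t["token"] for t in sorted(tokens, key=lambda t: t["position"])]

# A shortened response in the same shape as the output above.
sample = """
{"tokens": [
  {"token": "my", "start_offset": 0, "end_offset": 2, "type": "<ALPHANUM>", "position": 0},
  {"token": "name", "start_offset": 3, "end_offset": 7, "type": "<ALPHANUM>", "position": 1}
]}
"""
print(token_strings(sample))
```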

4. simple: splits the text on every non-letter character and also converts uppercase to lowercase.

don't is split into don and t.

If digits are added to the words, they are dropped from the analysis result as well, because simple splits on anything that is not a letter.
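The behavior can be approximated in a few lines of Python. This is a rough local imitation for building intuition, not the analyzer's actual implementation; `simple_analyze` is a hypothetical helper name.

```python
import re

# Runs of letters (approximately: word characters minus digits and underscore).
LETTERS = re.compile(r"[^\W\d_]+")

def simple_analyze(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    return [m.group().lower() for m in LETTERS.finditer(text)]

print(simple_analyze("I don't like the Criminals"))
print(simple_analyze("Peter2 Parker99"))
```

The apostrophe and the digits are not letters, so they act as separators and never appear in a token.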

5. whitespace: splits on whitespace only, so Parker,I is treated as a single word; uppercase letters are not converted to lowercase.

The analysis result:

{
	"tokens": [
		{
			"token": "My",
			"start_offset": 0,
			"end_offset": 2,
			"type": "word",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "word",
			"position": 1
		},
		{
			"token": "is",
			"start_offset": 8,
			"end_offset": 10,
			"type": "word",
			"position": 2
		},
		{
			"token": "Peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "word",
			"position": 3
		},
		{
			"token": "Parker,I",
			"start_offset": 17,
			"end_offset": 25,
			"type": "word",
			"position": 4
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "word",
			"position": 5
		},
		{
			"token": "a",
			"start_offset": 29,
			"end_offset": 30,
			"type": "word",
			"position": 6
		},
		{
			"token": "Super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "word",
			"position": 7
		},
		{
			"token": "Hero.",
			"start_offset": 37,
			"end_offset": 42,
			"type": "word",
			"position": 8
		},
		{
			"token": "I",
			"start_offset": 43,
			"end_offset": 44,
			"type": "word",
			"position": 9
		},
		{
			"token": "don't",
			"start_offset": 45,
			"end_offset": 50,
			"type": "word",
			"position": 10
		},
		{
			"token": "like",
			"start_offset": 51,
			"end_offset": 55,
			"type": "word",
			"position": 11
		},
		{
			"token": "the",
			"start_offset": 56,
			"end_offset": 59,
			"type": "word",
			"position": 12
		},
		{
			"token": "Criminals.",
			"start_offset": 60,
			"end_offset": 70,
			"type": "word",
			"position": 13
		}
	]
}
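The offsets and positions above can be reproduced with a small sketch. It is a local approximation, not the analyzer itself; `whitespace_analyze` is a hypothetical helper, and the input sentence is reconstructed from the tokens and offsets in the output, so treat it as an assumption.

```python
import re

def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, keep case."""
    return [
        {"token": m.group(), "start_offset": m.start(),
         "end_offset": m.end(), "position": pos}
        for pos, m in enumerate(re.finditer(r"\S+", text))
    ]

# Input reconstructed from the offsets in the output above (an assumption).
text = "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
for tok in whitespace_analyze(text):
    print(tok)
```

Punctuation stays attached to its neighbors because only whitespace separates tokens.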

6. stop: words that carry little meaning, such as the, a, and is, are removed.

The returned result:

{
	"tokens": [
		{
			"token": "my",
			"start_offset": 0,
			"end_offset": 2,
			"type": "word",
			"position": 0
		},
		{
			"token": "name",
			"start_offset": 3,
			"end_offset": 7,
			"type": "word",
			"position": 1
		},
		{
			"token": "peter",
			"start_offset": 11,
			"end_offset": 16,
			"type": "word",
			"position": 3
		},
		{
			"token": "parker",
			"start_offset": 17,
			"end_offset": 23,
			"type": "word",
			"position": 4
		},
		{
			"token": "i",
			"start_offset": 24,
			"end_offset": 25,
			"type": "word",
			"position": 5
		},
		{
			"token": "am",
			"start_offset": 26,
			"end_offset": 28,
			"type": "word",
			"position": 6
		},
		{
			"token": "super",
			"start_offset": 31,
			"end_offset": 36,
			"type": "word",
			"position": 8
		},
		{
			"token": "hero",
			"start_offset": 37,
			"end_offset": 41,
			"type": "word",
			"position": 9
		},
		{
			"token": "i",
			"start_offset": 43,
			"end_offset": 44,
			"type": "word",
			"position": 10
		},
		{
			"token": "don",
			"start_offset": 45,
			"end_offset": 48,
			"type": "word",
			"position": 11
		},
		{
			"token": "t",
			"start_offset": 49,
			"end_offset": 50,
			"type": "word",
			"position": 12
		},
		{
			"token": "like",
			"start_offset": 51,
			"end_offset": 55,
			"type": "word",
			"position": 13
		},
		{
			"token": "criminals",
			"start_offset": 60,
			"end_offset": 69,
			"type": "word",
			"position": 15
		}
	]
}
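Note the gaps in the position values above (2, 7 and 14 are missing): removed stop words still consume a position. A minimal sketch of that behavior, assuming letter-based splitting plus lowercasing; the stop list here is a small illustrative subset, not the analyzer's real default list, and the input sentence is reconstructed from the offsets in the output.

```python
import re

LETTERS = re.compile(r"[^\W\d_]+")
# Illustrative subset only; the real default English stop list is longer.
STOP_WORDS = {"a", "is", "the"}

def stop_analyze(text, stop_words=STOP_WORDS):
    """Approximate the stop analyzer: letter tokens, lowercased, stop words removed."""
    tokens = []
    for pos, m in enumerate(LETTERS.finditer(text)):
        word = m.group().lower()
        if word in stop_words:
            continue  # dropped, but its position number is still consumed
        tokens.append({"token": word, "start_offset": m.start(),
                       "end_offset": m.end(), "position": pos})
    return tokens

# Input reconstructed from the offsets in the output above (an assumption).
text = "My name is Peter Parker,I am a Super Hero. I don't like the Criminals."
print([(t["token"], t["position"]) for t in stop_analyze(text)])
```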

7. keyword: performs no tokenization. The entire text is emitted as a single keyword token.
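For comparison with the other analyzers, a local imitation is nearly trivial. A sketch only, with `keyword_analyze` as a hypothetical helper name:

```python
def keyword_analyze(text):
    """Approximate the keyword analyzer: the whole input becomes one token."""
    return [{"token": text, "start_offset": 0,
             "end_offset": len(text), "position": 0}]

print(keyword_analyze("My name is Peter Parker"))
```

Case, spaces and punctuation are all kept, so only an exact match on the whole string would hit this token.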
