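ElasticSearch ships with built-in analyzers, but they are generally only good at tokenizing English. English is easy to tokenize because words are already separated by spaces, so splitting on whitespace gets you most of the way. (The ES analyzer looks very similar to Lucene's; I don't know whether it uses Lucene's directly.) For Chinese, the built-in analyzer can only split the text into individual characters, which clearly does not meet the needs of real development work.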
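To make the contrast concrete, here is a rough Python sketch of the behaviour described above. It is not how ElasticSearch or Lucene actually implement tokenization; `toy_tokenize` and `is_cjk` are hypothetical helpers, and the sketch assumes that only the basic CJK Unified Ideographs block counts as Chinese. Words are split on whitespace, while every Chinese character becomes its own token.

```python
def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
    return "\u4e00" <= ch <= "\u9fff"


def toy_tokenize(text):
    """Return (token, start_offset, end_offset) tuples.

    Whitespace-separated runs become one token each;
    every CJK ideograph becomes a token of its own.
    """
    tokens = []
    start = None  # start offset of the word currently being collected
    for i, ch in enumerate(text):
        if is_cjk(ch) or ch.isspace():
            # A space or an ideograph ends the word in progress, if any.
            if start is not None:
                tokens.append((text[start:i], start, i))
                start = None
            if is_cjk(ch):
                tokens.append((ch, i, i + 1))
        elif start is None:
            start = i
    # Flush a word that runs to the end of the text.
    if start is not None:
        tokens.append((text[start:], start, len(text)))
    return tokens


print(toy_tokenize("hello world"))
print(toy_tokenize("中华人民共和国国歌"))
```

The English input comes back as two whole words, while the Chinese input is broken into nine single-character tokens, none of which is a meaningful word on its own.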
First, let's look at how the built-in analyzer tokenizes Chinese (again using the Sense tool):
POST /_analyze
{
"analyzer":"standard",
"text":"中华人民共和国国歌"
}
The result is as follows:
{
"tokens": [
{
"token": "中",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "华",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "人",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "民",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "共",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "和",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "国",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"