Background
Our project uses Elasticsearch (ES) as its search engine. As is well known, the query_string query supports Lucene syntax, so we used it to let users build their own personalized searches. After the project went live, users found that searches containing special characters returned no results, and the special characters could not be highlighted either.
Cause and Solution
Because our index did not specify an analyzer, the default standard analyzer was used. The standard analyzer splits a string on special characters and whitespace and stores the resulting terms individually. So let's see which terms the standard analyzer actually produces for a string that contains special characters:
GET _analyze
{
"analyzer": "standard",
"text": ["A2654|10|09|022"]
}
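For reference, the response looks roughly like the following (a sketch; the standard analyzer also lowercases the terms, and the exact type labels can vary by ES version):
//expected response (sketch)
{
  "tokens" : [
    { "token" : "a2654", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "10", "start_offset" : 6, "end_offset" : 8, "type" : "<NUM>", "position" : 1 },
    { "token" : "09", "start_offset" : 9, "end_offset" : 11, "type" : "<NUM>", "position" : 2 },
    { "token" : "022", "start_offset" : 12, "end_offset" : 15, "type" : "<NUM>", "position" : 3 }
  ]
}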
It's obvious: after the standard analyzer processes "A2654|10|09|022", the string has been split into 4 terms and the special characters are gone. In other words, if my index uses the standard analyzer, the special characters are already lost at indexing time, so any later search on those special characters is guaranteed to find nothing.
// create the index
PUT test003
{
"mappings": {
"doc": {
"properties": {
"text": {
"analyzer": "standard",
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
// index a document
POST test003/doc
{
"text":"A2654|10|09|022"
}
// query with the special character
GET test003/_search
{
"query": {
"query_string": {
"query": "\\|"
}
}
}
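The response comes back with no hits, roughly like this (a sketch; on 6.x hits.total is still a plain number, on 7.x it is an object):
//response (sketch), took/_shards fields omitted
{
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}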
Everything behaves as expected, and this is exactly why our production searches could not find the data.
So how can we make special-character search work?
After reading up on a few tokenizers, I found the ngram tokenizer.
The NGram Tokenizer
Let's look at the description in the official documentation. What does it say, roughly? The tokenizer slides a window of a specified length across the text and cuts it into grams of that length, which is very effective for languages that have no spaces between words or have very long compound words, such as German.
Seen this way it is quite clear: NGram splits the text according to the window size (gram length) we specify rather than on special characters or whitespace, which means the special characters themselves also become terms.
Since the gram length is something we specify, how do we set it, and are there other parameters? The main ones are min_gram and max_gram (the minimum and maximum gram length); there is also token_chars, which restricts which character classes are allowed inside a gram.
Now let's try it out.
// create the index with an ngram tokenizer
PUT specialchar001
{
"settings": {
"analysis": {
"analyzer": {
"specialchar_analyzer": {
"tokenizer": "specialchar_tokenizer"
}
},
"tokenizer": {
"specialchar_tokenizer": {
"type": "ngram",
"min_gram": 1,
"max_gram": 2
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"analyzer": "specialchar_analyzer",
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
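One caveat, in case you later want wider grams: starting with ES 7.x the difference between max_gram and min_gram may not exceed index.max_ngram_diff (default 1), so a configuration such as min_gram 1 / max_gram 3 also needs that setting raised. A sketch, using a hypothetical index specialchar002:
//only needed when max_gram - min_gram > 1 (enforced since ES 7.x)
PUT specialchar002
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "tokenizer": {
        "specialchar_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 3
        }
      }
    }
  }
}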
// index a document
POST specialchar001/_doc
{
"text": "A2654|10|09|022"
}
// search for the special character
GET specialchar001/_search
{
"query": {
"query_string": {
"query": "\\|"
}
}
}
We can see that the document containing the special character is now returned.
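And because the pipe is now an actual indexed term, highlighting works again as well. A minimal sketch, reusing the query above and adding the standard highlight option on the text field:
//query with highlighting (sketch)
GET specialchar001/_search
{
  "query": {
    "query_string": {
      "query": "\\|"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}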
Now let's look at what terms this string is actually split into:
GET specialchar001/_analyze
{
"analyzer": "specialchar_analyzer",
"text": ["A2654|10|09|022"]
}
{
"tokens" : [
{
"token" : "A",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "A2",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "2",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 2
},
{
"token" : "26",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 3
},
{
"token" : "6",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 4
},
{
"token" : "65",
"start_offset" : 2,
"end_offset" : 4,
"type" : "word",
"position" : 5
},
{
"token" : "5",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 6
},
{
"token" : "54",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 7
},
{
"token" : "4",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 8
},
{
"token" : "4|",
"start_offset" : 4,
"end_offset" : 6,
"type" : "word",
"position" : 9
},
{
"token" : "|",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 10
},
{
"token" : "|1",
"start_offset" : 5,
"end_offset" : 7,
"type" : "word",
"position" : 11
},
{
"token" : "1",
"start_offset" : 6,
"end_offset" : 7,
"type" : "word",
"position" : 12
},
{
"token" : "10",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 13
},
{
"token" : "0",
"start_offset" : 7,
"end_offset" : 8,
"type" : "word",
"position" : 14
},
{
"token" : "0|",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 15
},
{
"token" : "|",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 16
},
{
"token" : "|0",
"start_offset" : 8,
"end_offset" : 10,
"type" : "word",
"position" : 17
},
{
"token" : "0",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 18
},
{
"token" : "09",
"start_offset" : 9,
"end_offset" : 11,
"type" : "word",
"position" : 19
},
{
"token" : "9",
"start_offset" : 10,
"end_offset" : 11,
"type" : "word",
"position" : 20
},
{
"token" : "9|",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 21
},
{
"token" : "|",
"start_offset" : 11,
"end_offset" : 12,
"type" : "word",
"position" : 22
},
{
"token" : "|0",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 23
},
{
"token" : "0",
"start_offset" : 12,
"end_offset" : 13,
"type" : "word",
"position" : 24
},
{
"token" : "02",
"start_offset" : 12,
"end_offset" : 14,
"type" : "word",
"position" : 25
},
{
"token" : "2",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 26
},
{
"token" : "22",
"start_offset" : 13,
"end_offset" : 15,
"type" : "word",
"position" : 27
},
{
"token" : "2",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 28
}
]
}
The result matches expectations, and it also exposes the trade-off: with the standard analyzer I had only 4 terms, while with ngram there are 29 (for a 15-character string: 15 unigrams plus 14 bigrams). In other words, using ngram is bound to take up more index space.
Indexing the same data with the two different analyzers and comparing the indices confirms exactly that.
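A quick way to compare, assuming the two example indices above, is the _cat API, which shows the document count and on-disk size side by side:
//compare document count and store size of the two indices
GET _cat/indices/test003,specialchar001?v&h=index,docs.count,store.size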