After installing the Chinese analyzer plugin (it provides two analyzers, ik_max_word and ik_smart; the first is the more commonly used), it is not automatically applied to an index's inverted index, nor to the query keywords during a full-text search. This can be verified with the following experiment:
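Before running the experiment, it is worth confirming the plugin is actually installed. Listing the node plugins should show an entry such as analysis-ik (the exact name depends on the distribution installed):

```
GET _cat/plugins
```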
Create the following documents:
POST /student/_doc/1
{
"name":"徐小小",
"address":"杭州",
"age":3,
"interests":"唱歌 画画 跳舞",
"birthday":"2017-06-19"
}
POST /student/_doc/2
{
"name":"刘德华",
"address":"香港",
"age":28,
"interests":"演戏 旅游 小",
"birthday":"1980-06-19"
}
POST /student/_doc/3
{
"name":"张小斐",
"address":"北京",
"age":28,
"interests":"小品 旅游 小米手机",
"birthday":"1990-06-19"
}
POST /student/_doc/4
{
"name":"王小宝",
"address":"德州",
"age":63,
"interests":"演戏 小品 打牌 小米电视",
"birthday":"1956-06-19"
}
POST /student/_doc/5
{
"name":"向华强",
"address":"香港",
"age":31,
"interests":"演戏",
"birthday":"1958-06-19"
}
Running the following query returns three hits rather than the expected two:
GET student/_search
{
"query": {
"match": {
"interests": {
"query": "小米"
}
}
}
}
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.8391738,
"hits": [
{
"_index": "student",
"_type": "_doc",
"_id": "4",
"_score": 0.8391738,
"_source": {
"name": "王小宝",
"address": "德州",
"age": 63,
"interests": "演戏 小品 打牌 小米电视",
"birthday": "1956-06-19"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "3",
"_score": 0.68324494,
"_source": {
"name": "张小斐",
"address": "北京",
"age": 28,
"interests": "小品 旅游 小米手机",
"birthday": "1990-06-19"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "2",
"_score": 0.21110918,
"_source": {
"name": "刘德华",
"address": "香港",
"age": 28,
"interests": "演戏 旅游 小",
"birthday": "1980-06-19"
}
}
]
}
}
Manually setting the query analyzer to ik_max_word, a search for "小米" surprisingly returns no results at all:
GET student/_search
{
"query": {
"match": {
"interests": {
"query": "小米",
"analyzer": "ik_max_word"
}
}
}
}
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
From these results we can infer that not only was the query keyword "小米" split into the two single characters "小" and "米", but the data stored in ES was also indexed one character per term when the inverted index was built. This is because the student index was created with the default English (standard) analyzer, which treats every Chinese character as a separate token; the ik_max_word query then produced the whole term "小米", which matches nothing in a per-character index.
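This inference can be checked directly with the _analyze API, without going through a search at all. The standard analyzer should return "小米" as the two single-character tokens "小" and "米", while ik_max_word should emit the single term "小米":

```
GET _analyze
{
  "analyzer": "standard",
  "text": "小米"
}

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "小米"
}
```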
Next, delete and recreate the student index, this time declaring ik_max_word (the most fine-grained ik segmentation) as the index's default analyzer, then repeat the same experiment: "小米" is no longer split into two separate characters.
DELETE /student
#Create the student index and specify its analyzer. This analyzer appears to be applied both when
#building the inverted index and when analyzing query keywords, so there is no need to additionally
#specify a default search analyzer such as:
# "default_search": {
# "type": "ik_max_word"
# }
#However, the official documentation notes that a separate default search analyzer can be set:
#https://www.elastic.co/guide/en/elasticsearch/reference/current/specify-analyzer.html
#Create the student index
PUT student
{
"settings": {
"analysis": {
"analyzer": {
"default": {
"type": "ik_max_word"
}
}
}
}
}
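Following the documentation page referenced above, a separate query-time default can also be declared with default_search. A sketch for illustration only (it would replace the PUT above; a common ik recommendation is ik_max_word for indexing combined with the coarser ik_smart for searching):

```
PUT student
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "ik_max_word"
        },
        "default_search": {
          "type": "ik_smart"
        }
      }
    }
  }
}
```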
#Insert the documents
POST /student/_doc/1
{
"name":"徐小小",
"address":"杭州",
"age":3,
"interests":"唱歌 画画 跳舞",
"birthday":"2017-06-19"
}
POST /student/_doc/2
{
"name":"刘德华",
"address":"香港",
"age":28,
"interests":"演戏 旅游 小",
"birthday":"1980-06-19"
}
POST /student/_doc/3
{
"name":"张小斐",
"address":"北京",
"age":28,
"interests":"小品 旅游 小米手机",
"birthday":"1990-06-19"
}
POST /student/_doc/4
{
"name":"王小宝",
"address":"德州",
"age":63,
"interests":"演戏 小品 打牌 小米电视",
"birthday":"1956-06-19"
}
POST /student/_doc/5
{
"name":"向华强",
"address":"香港",
"age":31,
"interests":"演戏",
"birthday":"1958-06-19"
}
#Search for "小米" again; this time it returns the correct results
GET student/_search
{
"query": {
"match": {
"interests": "小米"
}
}
}
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.6288345,
"hits": [
{
"_index": "student",
"_type": "_doc",
"_id": "4",
"_score": 0.6288345,
"_source": {
"name": "王小宝",
"address": "德州",
"age": 63,
"interests": "演戏 小品 打牌 小米电视",
"birthday": "1956-06-19"
}
},
{
"_index": "student",
"_type": "_doc",
"_id": "3",
"_score": 0.2876821,
"_source": {
"name": "张小斐",
"address": "北京",
"age": 28,
"interests": "小品 旅游 小米手机",
"birthday": "1990-06-19"
}
}
]
}
}
If we explicitly specify the system's default English (standard) analyzer for the search keywords instead, only one hit is returned: standard splits "小米" into "小" and "米", and only document 2 contains the standalone term "小" in the ik-built index. This indirectly confirms that ik_max_word is being used both for tokenizing the inverted index and for tokenizing the search keywords.
GET student/_search
{
"query": {
"match": {
"interests": {
"query": "小米",
"analyzer": "standard"
}
}
}
}
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.7721133,
"hits": [
{
"_index": "student",
"_type": "_doc",
"_id": "2",
"_score": 0.7721133,
"_source": {
"name": "刘德华",
"address": "香港",
"age": 28,
"interests": "演戏 旅游 小",
"birthday": "1980-06-19"
}
}
]
}
}
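As a final check, the terms actually stored in the inverted index for a single document can be inspected with the _termvectors API (computed on the fly; term vectors do not need to be stored in advance). For document 3, the interests field should now contain whole terms such as "小米" and "手机" rather than isolated characters:

```
GET /student/_doc/3/_termvectors?fields=interests
```

On clusters from 7.x onward, the equivalent form is GET /student/_termvectors/3?fields=interests.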