Reposted from: https://blog.csdn.net/chengyuqiang/column/info/18392 (Elasticsearch version 6.3.0)
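High-level full-text queries are usually used to run full-text searches on full-text fields, such as the body of an email. They understand how the queried field is analyzed, and before executing they apply each field's analyzer (or search_analyzer) to the query string.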
match query
(1) Introductory example (a term query, for comparison)
GET website/_search
{
  "query": {
    "term": {
      "title": "centos升级"
    }
  }
}
Response
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
(2) The and operator
GET website/_search
{
  "query": {
    "match": {
      "title": {
        "query": "centos升级",
        "operator": "and"
      }
    }
  }
}
Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      }
    ]
  }
}
(3) The or operator
GET website/_search
{
  "query": {
    "match": {
      "title": {
        "query": "centos升级",
        "operator": "or"
      }
    }
  }
}
Response
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url.cn/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      }
    ]
  }
}
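Summary: term performs an exact lookup. The query string is not analyzed, so a document is returned only if the indexed terms of its title field include the literal term centos升级. match analyzes the query string first and then matches the resulting terms. With operator set to and, a document must contain every term produced from centos升级; with or (the default), containing any one of those terms is enough.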
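The difference can be sketched in Python. This is a minimal illustration under stated assumptions, not Elasticsearch's implementation: toy_analyze is a hypothetical stand-in analyzer (lowercase, then Latin letter runs, digit runs, and single CJK characters as tokens), and the tokens a real index holds depend on the analyzer configured for the title field.

```python
import re

# Hypothetical stand-in for an analyzer: lowercase the text, then emit
# Latin letter runs, digit runs, and single CJK characters as tokens.
def toy_analyze(text):
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

def term_query(field_text, term):
    # term: the query string is NOT analyzed; it must equal one indexed token.
    return term in toy_analyze(field_text)

def match_query(field_text, query, operator="or"):
    # match: the query string is analyzed first, then its tokens are matched.
    indexed = set(toy_analyze(field_text))
    wanted = toy_analyze(query)
    if operator == "and":
        return all(tok in indexed for tok in wanted)
    return any(tok in indexed for tok in wanted)

# Titles copied from the responses above, keyed by document id.
titles = {"3": "CentOS升级gcc", "6": "CentOS更换国内yum源"}

term_hits = [i for i, t in titles.items() if term_query(t, "centos升级")]
and_hits = [i for i, t in titles.items() if match_query(t, "centos升级", "and")]
or_hits = [i for i, t in titles.items() if match_query(t, "centos升级", "or")]
print(term_hits, and_hits, or_hits)  # [] ['3'] ['3', '6']
```

Under these assumptions the term lookup finds nothing, and finds document 3, and or finds documents 3 and 6, which mirrors the three responses above. Scoring is left out entirely.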
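match_phrase query (phrase query)
match_phrase is similar to the match query, but it matches an exact phrase, so it can be called a phrase query.
The match_phrase query analyzes the query text (the analyzer can be customized). A document is retrieved only if it satisfies both of the following conditions: a. every term produced by the analysis appears in the field; b. the terms appear in the field in the same order (and, with the default slop of 0, next to one another).
(1) Create the index and insert data
DELETE test
PUT test
PUT test/hello/1
{ "content": "World Hello" }
PUT test/hello/2
{ "content": "Hello World" }
PUT test/hello/3
{ "content": "I just said hello world" }
(2) Query "hello world" with match_phrase
GET test/_search
{
  "query": {
    "match_phrase": {
      "content": "hello world"
    }
  }
}
The response is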
{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
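Why document 1 ("World Hello") is missing can be seen in a small Python sketch of the two match_phrase conditions. It assumes a simplified analysis (lowercase, split on whitespace) and checks that the query tokens occur as one contiguous, ordered run; the function names are illustrative and not part of any Elasticsearch API.

```python
def tokens(text):
    # Assumed analysis: lowercase and split on whitespace.
    return text.lower().split()

def phrase_match(field_text, phrase):
    doc, query = tokens(field_text), tokens(phrase)
    n = len(query)
    # The query tokens must appear as one contiguous run, in the same order.
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

# The three documents indexed into test/hello above.
docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

phrase_hits = [i for i, text in docs.items() if phrase_match(text, "hello world")]
print(phrase_hits)  # ['2', '3']
```

Document 1 contains both terms but in the wrong order, so it fails condition b.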
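match_phrase_prefix query (prefix query)
match_phrase_prefix is the same as match_phrase, except that it allows prefix matching on the last term of the query text. In other words, it extends match_phrase: the last term of the analyzed query only needs to match as a prefix, while the preceding terms must still match as a phrase.
GET test/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "hello worl"
    }
  }
}
Response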
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "test",
        "_type": "hello",
        "_id": "2",
        "_score": 0.5753642,
        "_source": {
          "content": "Hello World"
        }
      },
      {
        "_index": "test",
        "_type": "hello",
        "_id": "3",
        "_score": 0.5753642,
        "_source": {
          "content": "I just said hello world"
        }
      }
    ]
  }
}
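The prefix rule can be sketched the same way in Python. This is an approximation under a simplified whitespace analysis, with illustrative function names: all query tokens except the last must match exactly and in order, and the last only has to be a prefix of the next document token. As far as I know, Elasticsearch itself expands the last term into a limited number of indexed terms (the max_expansions parameter) rather than scanning documents like this.

```python
def tokens(text):
    # Assumed analysis: lowercase and split on whitespace.
    return text.lower().split()

def phrase_prefix_match(field_text, phrase):
    doc, query = tokens(field_text), tokens(phrase)
    if not query:
        return False
    *head, last = query
    n = len(query)
    for i in range(len(doc) - n + 1):
        window = doc[i:i + n]
        # Exact match on every term but the last; prefix match on the last.
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False

docs = {
    "1": "World Hello",
    "2": "Hello World",
    "3": "I just said hello world",
}

prefix_hits = [i for i, text in docs.items() if phrase_prefix_match(text, "hello worl")]
print(prefix_hits)  # ['2', '3']
```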
multi_match query
The multi_match query builds on the match query and searches across multiple fields.
GET website/_search
{
  "query": {
    "multi_match": {
      "query": "centos",
      "fields": ["title", "abstract"]
    }
  }
}
Response
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0.9227539,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "6",
        "_score": 0.9227539,
        "_source": {
          "title": "CentOS更换国内yum源",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS更换国内yum源",
          "url": "http://url.cn/53946911"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "2",
        "_score": 0.41360322,
        "_source": {
          "title": "watchman源码编译",
          "author": "程裕强",
          "postdate": "2016-12-23",
          "abstract": "CentOS7.x的watchman源码编译",
          "url": "http://url.cn/53844169"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "CentOS升级gcc",
          "author": "程裕强",
          "postdate": "2016-12-25",
          "abstract": "CentOS升级gcc",
          "url": "http://url.cn/53868915"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "7",
        "_score": 0.20725916,
        "_source": {
          "title": "搭建Ember开发环境",
          "author": "程裕强",
          "postdate": "2016-12-30",
          "abstract": "CentOS下搭建Ember开发环境",
          "url": "http://url.cn/53947507"
        }
      },
      {
        "_index": "website",
        "_type": "blog",
        "_id": "1",
        "_score": 0.1627405,
        "_source": {
          "title": "Ambari源码编译",
          "author": "程裕强",
          "postdate": "2016-12-21",
          "abstract": "CentOS7.x下的Ambari2.4源码编译",
          "url": "http://url.cn/53788351"
        }
      }
    ]
  }
}
As the results show, a document is retrieved as long as either its title field or its abstract field matches.
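The "any field matches" behavior can be illustrated with a short Python sketch. It is a toy model, not the real query: toy_analyze is a hypothetical analyzer (lowercase, then Latin letter runs, digit runs, and single CJK characters as tokens), matching_fields is an illustrative helper, and scoring is ignored. In Elasticsearch the default multi_match type (best_fields) takes the document score from the best-matching field.

```python
import re

# Hypothetical stand-in analyzer; the real tokens depend on the field mapping.
def toy_analyze(text):
    return re.findall(r"[a-z]+|[0-9]+|[\u4e00-\u9fff]", text.lower())

def matching_fields(doc, query, fields):
    # Return the fields in which at least one query token occurs.
    wanted = toy_analyze(query)
    return [f for f in fields
            if any(tok in toy_analyze(doc.get(f, "")) for tok in wanted)]

# Two of the documents from the response above.
docs = {
    "6": {"title": "CentOS更换国内yum源", "abstract": "CentOS更换国内yum源"},
    "2": {"title": "watchman源码编译", "abstract": "CentOS7.x的watchman源码编译"},
}

matched = {i: matching_fields(d, "centos", ["title", "abstract"])
           for i, d in docs.items()}
print(matched)  # {'6': ['title', 'abstract'], '2': ['abstract']}
```

Document 2 is returned even though its title does not contain centos, because its abstract does.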
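common_terms query (common terms query)
(1) Stop words
Some words occur very frequently in text but contribute little to the information the text carries, for example a, an, the, and of in English, "的", "了", "着", and "是" in Chinese, and punctuation marks. After text is analyzed, stop words are usually filtered out and are not indexed. At search time, stop words in the user's query are filtered out as well, because the query string goes through the same analysis. Excluding stop words speeds up indexing and reduces the size of the index files.
(2) Although stop words have little effect on document scoring, they are sometimes meaningful, and removing them is then clearly inappropriate. If stop words are removed, "happy" cannot be distinguished from "not happy", "to be or not to be" cannot be indexed, and search accuracy drops.
(3) The common_terms query offers a solution. It divides the terms of the analyzed query into important terms (low frequency terms) and less important terms (high frequency terms which would previously have been stopwords). At search time it first looks for documents that match the important terms, and then runs a second search for the high-frequency terms, which contribute less to the score; that second search only scores documents already matched by the first.
Whether a term counts as high frequency or low frequency is controlled by the cutoff_frequency threshold, which can be an absolute frequency (>= 1) or a relative frequency (0.0 to 1.0).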
GET website/_search
{
  "query": {
    "common": {
      "title": {
        "query": "to be",
        "cutoff_frequency": 0.0001,
        "low_freq_operator": "and"
      }
    }
  }
}
Response
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}
The blog I was following, however, shows this response:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 2.364739,
    "hits": [
      {
        "_index": "website",
        "_type": "blog",
        "_id": "9",
        "_score": 2.364739,
        "_source": {
          "title": "to be or not to be",
          "author": "somebody",
          "postdate": "2018-01-03",
          "abstract": "to be or not to be,that is the question",
          "url": "http://url/63991802"
        }
      }
    ]
  }
}
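I do not know why the results differ. One thing to check is whether document 9 ("to be or not to be") was actually indexed into this website index.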
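The low-frequency / high-frequency split described in (3) can be sketched in Python. This is a simplified model, not the Lucene implementation: the helper name, the whitespace analysis, and the exact threshold rule (a term is high frequency when its document count exceeds the cutoff, with a relative cutoff scaled by the number of documents and rounded up) are assumptions made for illustration. The corpus reuses the three test/hello documents from the match_phrase section.

```python
import math

def tokens(text):
    # Assumed analysis: lowercase and split on whitespace.
    return text.lower().split()

def split_by_frequency(query, docs, cutoff_frequency):
    """Split query terms into (low_frequency, high_frequency) lists."""
    doc_tokens = [set(tokens(d)) for d in docs]
    if cutoff_frequency >= 1:
        # Absolute cutoff: compare against a raw document count.
        threshold = cutoff_frequency
    else:
        # Relative cutoff: scale by corpus size (assumed rounding rule).
        threshold = math.ceil(cutoff_frequency * len(docs))
    low, high = [], []
    for term in tokens(query):
        df = sum(term in toks for toks in doc_tokens)  # document frequency
        (high if df > threshold else low).append(term)
    return low, high

docs = ["World Hello", "Hello World", "I just said hello world"]

low, high = split_by_frequency("just hello", docs, 0.5)
print(low, high)  # ['just'] ['hello']
```

With a relative cutoff of 0.5, hello (present in all three documents) is treated as high frequency and just as low frequency, so in this model just decides which documents match and hello only affects scoring.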