Implementing Same-Paragraph and Same-Sentence Search with Elasticsearch

Same-sentence search requires that, when searching for multiple keywords, the returned documents not only contain all of the keywords, but also contain them within a single sentence.
Same-paragraph search is similar, except the scope is a single paragraph.

SpanQuery

Same-paragraph and same-sentence search cannot be implemented with the usual term and match queries.
Elasticsearch provides span queries, which the official documentation introduces as follows:

Span queries are low-level positional queries which provide expert control over the order and proximity of the specified terms. These are typically used to implement very specific queries on legal documents or patents.

As mentioned above, span queries are often used for specialized searches on legal documents or patents, and those domains commonly offer same-paragraph / same-sentence search.
Let's look at three types of span query and see whether they can meet our needs.

Preparing the data

PUT article

POST article/_mapping
{
  "properties": {
    "maincontent": {
      "type": "text"
    }
  }
}

POST article/_doc/1
{
  "maincontent": "the quick red fox jumps over the sleepy cat"
}

POST article/_doc/2
{
  "maincontent": "the quick brown fox jumps over the lazy dog"
}

SpanTermQuery

SpanTermQuery is similar to a term query. The following query returns the doc with _id 1: "the quick red fox jumps over the sleepy cat".

POST article/_search
{
  "profile": "true",
  "query": {
    "span_term": {
      "maincontent": {
        "value": "red"
      }
    }
  }
}

SpanNearQuery

SpanNearQuery performs proximity search: it checks whether multiple terms occur near each other. slop sets the allowed distance; with slop set to 0 the terms must be adjacent, which is equivalent to match_phrase. The in_order parameter requires the terms in the document to appear in the same order as in the query.

POST article/_search
{
  "query": {
    "span_near": {
      "clauses": [
        {
          "span_term": {
            "maincontent": {
              "value": "quick"
            }
          }
        },
        {
          "span_term": {
            "maincontent": {
              "value": "brown"
            }
          }
        }
      ],
      "slop": 0,
      "in_order": true
    }
  }
}

The query above returns the doc with _id 2:

the quick brown fox jumps over the lazy dog
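Conceptually, span_near operates on term positions in the token stream. The following Python sketch illustrates only the slop/in_order idea under that positional model; it is not Lucene's actual span algorithm, and the function name and structure are invented for this example:

```python
def span_near_matches(tokens, term1, term2, slop, in_order):
    """Rough sketch of the span_near idea: do term1 and term2 occur
    within `slop` intervening positions of each other?
    (Illustration only, not Lucene's real implementation.)"""
    positions1 = [i for i, t in enumerate(tokens) if t == term1]
    positions2 = [i for i, t in enumerate(tokens) if t == term2]
    for p1 in positions1:
        for p2 in positions2:
            if in_order and p2 <= p1:
                continue  # in_order: term2 must come after term1
            # slop = number of positions allowed between the two terms
            if abs(p2 - p1) - 1 <= slop:
                return True
    return False

doc2 = "the quick brown fox jumps over the lazy dog".split()
print(span_near_matches(doc2, "quick", "brown", slop=0, in_order=True))  # → True (adjacent)
print(span_near_matches(doc2, "quick", "fox", slop=0, in_order=True))    # → False (one word apart)
```

With slop=0 and in_order=true, only adjacent, ordered terms match, mirroring the match_phrase equivalence described above.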

SpanNotQuery

SpanNotQuery is very important: it requires that the spans matched by two span queries do not overlap.


Consider the following example:

  • include: the span query to match; here, a proximity search that must contain the two terms quick and fox.

  • exclude: a span query whose matches must not overlap any span matched by include.

POST article/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "maincontent": {
                  "value": "quick"
                }
              }
            },
            {
              "span_term": {
                "maincontent": {
                  "value": "fox"
                }
              }
            }
          ],
          "slop": 1,
          "in_order": true
        }
      },
      "exclude": {
        "span_term": {
          "maincontent": {
            "value": "red"
          }
        }
      }
    }
  }
}

The query above returns the doc with _id 2.
Although "quick red fox" in the doc with _id 1 matches the include span query, "red" also matches the exclude span query, so that document is excluded: "the quick red fox jumps over the sleepy cat".

How same-sentence / same-paragraph search works

Same-sentence search, viewed the other way around, means that the search terms must not cross a sentence boundary. Going one step further: no sentence-ending punctuation such as 。, ?, or ! may appear between the search terms.


The corresponding query looks like this:

POST article/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "maincontent": {
                  "value": "word1"
                }
              }
            },
            {
              "span_term": {
                "maincontent": {
                  "value": "word2"
                }
              }
            }
          ],
          "slop": 1,
          "in_order": true
        }
      },
      "exclude": {
        "span_term": {
          "maincontent": {
            "value": "。/?/!"
          }
        }
      }
    }
  }
}

Same-paragraph search is analogous; the separator becomes \n, or tags such as <p>.
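The principle can be illustrated outside Elasticsearch with a few lines of Python. This is a simplified sketch of the idea (split on sentence-ending punctuation, check that both terms fall into the same piece), not what the span queries do internally:

```python
import re

def same_sentence(text, word1, word2):
    """Return True if word1 and word2 co-occur in at least one sentence.
    Sentences are delimited by Chinese/English end punctuation.
    (Sketch only; uses substring matching, no real tokenization.)"""
    sentences = re.split(r"[。?!.?!]", text)
    return any(word1 in s and word2 in s for s in sentences)

text = "the quick brown fox! it jumps over the lazy dog."
print(same_sentence(text, "quick", "fox"))  # → True  (same sentence)
print(same_sentence(text, "quick", "dog"))  # → False (different sentences)
```

The span_not query above achieves the same effect inside the index: the exclude clause rejects any span_near match whose span contains a sentence separator.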

Implementing same-paragraph / same-sentence search

HTML text

Creating the index

PUT sample1
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "maincontent_analyzer": {
          "type": "custom",
          "char_filter": [
            "sentence_paragrah_mapping",
            "html_strip"
          ],
          "tokenizer": "ik_max_word"
        }
      },
      "char_filter": {
        "sentence_paragrah_mapping": {
          "type": "mapping",
          "mappings": [
            """<h1> => \u0020paragraph\u0020""",
            """</h1> => \u0020sentence\u0020paragraph\u0020""",
            """<h2> => \u0020paragraph\u0020""",
            """</h2> => \u0020sentence\u0020paragraph\u0020""",
            """<p> => \u0020paragraph\u0020""",
            """</p> => \u0020sentence\u0020paragraph\u0020""",
            """! => \u0020sentence\u0020""",
            """? => \u0020sentence\u0020""",
            """。=> \u0020sentence\u0020""",
            """?=> \u0020sentence\u0020""",
            """!=> \u0020sentence\u0020"""
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mainContent": {
        "type": "text",
        "analyzer": "maincontent_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

We created a char filter named sentence_paragrah_mapping, which serves two purposes:

Replace the p, h1, and h2 tags with a uniform paragraph separator: paragraph

Replace the Chinese and English !, ?, and 。 punctuation marks with a uniform sentence separator: sentence

A few details need explanation:

  • paragraph and sentence must have a space before and after them, and the space must be written as Unicode \u0020.

# Expected
hello world! => hello world sentence
# With a misconfigured mapping (no surrounding spaces), you can end up with
hello world! => hello worldsentence
  • The closing tags </p>, </h1>, </h2> need both the sentence and paragraph separators, to handle content that does not end with punctuation.
# Expected
<h1>hello world</h1> <p>hello china</p> => paragraph hello world sentence paragraph hello china sentence
# Result when </p>, </h1>, </h2> are replaced with only paragraph:
# "hello world" and "hello china" end up in the same sentence
<h1>hello world</h1> <p>hello china</p> => paragraph hello world paragraph hello china sentence
# Our configuration is slightly redundant: it produces two consecutive paragraph markers.
# If the HTML is guaranteed to be well-formed, you can replace only </p>, </h1>, </h2>
# and skip <p>, <h1>, <h2>
<h1>hello world</h1> <p>hello china</p> => paragraph hello world sentence paragraph paragraph hello china sentence
  • Note the order of sentence_paragrah_mapping and html_strip in the char_filter list: char filters run in the listed order, so the mapping must run before html_strip removes the tags.
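To see what the char filter does end to end, it can be approximated with plain string replacement in Python. This is only a simulation of the mapping rules above (the real work is done by Elasticsearch's mapping char filter and the tokenizer):

```python
# Rough simulation of the sentence_paragrah_mapping char filter:
# each HTML tag / punctuation mark is rewritten to a space-padded marker.
MAPPINGS = [
    ("<h1>", " paragraph "), ("</h1>", " sentence paragraph "),
    ("<h2>", " paragraph "), ("</h2>", " sentence paragraph "),
    ("<p>",  " paragraph "), ("</p>",  " sentence paragraph "),
    ("!", " sentence "), ("?", " sentence "),
    ("。", " sentence "), ("?", " sentence "), ("!", " sentence "),
]

def apply_char_filter(text):
    for src, dst in MAPPINGS:
        text = text.replace(src, dst)
    # collapse whitespace, roughly what whitespace tokenization would see
    return " ".join(text.split())

html = "<p>java python javascript</p><p>oracle mysql sqlserver</p>"
print(apply_char_filter(html))
# → paragraph java python javascript sentence paragraph paragraph oracle mysql sqlserver sentence paragraph
```

The output matches the token stream produced by the _analyze test below: every paragraph boundary is marked by paragraph and every sentence boundary by sentence.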

Inserting test data

POST sample1/_doc/1
{
  "mainContent": "<p>java python javascript</p><p>oracle mysql sqlserver</p>"
}

# Test the analyzer
POST sample1/_analyze
{
  "text": ["<p>java python javascript</p><p>oracle mysql sqlserver</p>"],
  "analyzer": "maincontent_analyzer"
}

# Result
{
  "tokens" : [
    {"token" : "paragraph", "start_offset" : 1, "end_offset" : 2, "type" : "ENGLISH", "position" : 0},
    {"token" : "java", "start_offset" : 3, "end_offset" : 7, "type" : "ENGLISH", "position" : 1},
    {"token" : "python", "start_offset" : 8, "end_offset" : 14, "type" : "ENGLISH", "position" : 2},
    {"token" : "javascript", "start_offset" : 15, "end_offset" : 25, "type" : "ENGLISH", "position" : 3},
    {"token" : "sentence", "start_offset" : 26, "end_offset" : 28, "type" : "ENGLISH", "position" : 4},
    {"token" : "paragraph", "start_offset" : 28, "end_offset" : 28, "type" : "ENGLISH", "position" : 5},
    {"token" : "paragraph", "start_offset" : 30, "end_offset" : 31, "type" : "ENGLISH", "position" : 6},
    {"token" : "oracle", "start_offset" : 32, "end_offset" : 38, "type" : "ENGLISH", "position" : 7},
    {"token" : "mysql", "start_offset" : 39, "end_offset" : 44, "type" : "ENGLISH", "position" : 8},
    {"token" : "sqlserver", "start_offset" : 45, "end_offset" : 54, "type" : "ENGLISH", "position" : 9},
    {"token" : "sentence", "start_offset" : 55, "end_offset" : 57, "type" : "ENGLISH", "position" : 10},
    {"token" : "paragraph", "start_offset" : 57, "end_offset" : 57, "type" : "ENGLISH", "position" : 11}
  ]
}

Testing queries

  • Same-paragraph query: java python

GET sample1/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "mainContent": {
                  "value": "java"
                }
              }
            },
            {
              "span_term": {
                "mainContent": {
                  "value": "python"
                }
              }
            }
          ],
          "slop": 12,
          "in_order": false
        }
      },
      "exclude": {
        "span_term": {
          "mainContent": {
            "value": "paragraph"
          }
        }
      }
    }
  }
}

# Result
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.1655603,
    "hits" : [
      {
        "_index" : "sample1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.1655603,
        "_source" : {
          "mainContent" : "<p>java python javascript</p><p>oracle mysql sqlserver</p>"
        }
      }
    ]
  }
}
  • Same-paragraph query: java oracle

GET sample1/_search
{
  "query": {
    "span_not": {
      "include": {
        "span_near": {
          "clauses": [
            {
              "span_term": {
                "mainContent": {
                  "value": "java"
                }
              }
            },
            {
              "span_term": {
                "mainContent": {
                  "value": "oracle"
                }
              }
            }
          ],
          "slop": 12,
          "in_order": false
        }
      },
      "exclude": {
        "span_term": {
          "mainContent": {
            "value": "paragraph"
          }
        }
      }
    }
  }
}

# Result: no documents returned
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Plain text

Plain text differs from HTML only in the paragraph separator, which becomes \n.

Creating the index

PUT sample2
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "analyzer": {
        "maincontent_analyzer": {
          "type": "custom",
          "char_filter": [
            "sentence_paragrah_mapping"
          ],
          "tokenizer": "ik_max_word"
        }
      },
      "char_filter": {
        "sentence_paragrah_mapping": {
          "type": "mapping",
          "mappings": [
            """\n => \u0020sentence\u0020paragraph\u0020""",
            """! => \u0020sentence\u0020""",
            """? => \u0020sentence\u0020""",
            """。=> \u0020sentence\u0020""",
            """?=> \u0020sentence\u0020""",
            """!=> \u0020sentence\u0020"""
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "mainContent": {
        "type": "text",
        "analyzer": "maincontent_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}

Testing the analyzer

POST sample2/_analyze
{
  "text": ["java python javascript\noracle mysql sqlserver"],
  "analyzer": "maincontent_analyzer"
}

# Result
{
  "tokens" : [
    {"token" : "java", "start_offset" : 0, "end_offset" : 4, "type" : "ENGLISH", "position" : 0},
    {"token" : "python", "start_offset" : 5, "end_offset" : 11, "type" : "ENGLISH", "position" : 1},
    {"token" : "javascript", "start_offset" : 12, "end_offset" : 22, "type" : "ENGLISH", "position" : 2},
    {"token" : "sentence", "start_offset" : 22, "end_offset" : 22, "type" : "ENGLISH", "position" : 3},
    {"token" : "paragraph", "start_offset" : 22, "end_offset" : 22, "type" : "ENGLISH", "position" : 4},
    {"token" : "oracle", "start_offset" : 23, "end_offset" : 29, "type" : "ENGLISH", "position" : 5},
    {"token" : "mysql", "start_offset" : 30, "end_offset" : 35, "type" : "ENGLISH", "position" : 6},
    {"token" : "sqlserver", "start_offset" : 36, "end_offset" : 45, "type" : "ENGLISH", "position" : 7}
  ]
}

End of article

Author: trycatchfinal

Editor: 妃尔

Original article: http://elasticsearch.cn/article/13677
