Retrieving Your Data with Elasticsearch (Part 2)

Original article: https://blog.csdn.net/Suubyy/article/details/118420788

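Highlighting

Highlighters let you get highlighted snippets from your search results so you can show users where the query matches are. When you request highlights, the response contains an additional highlight element for each search hit, which includes the highlighted fields and the highlighted fragments.

Highlighters need the actual content of a field. If the field is not stored, the _source is loaded and the relevant field is extracted from it.

For example, the following request returns highlights for the content field of each search hit, using the default highlighter settings: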

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "content": "kimchy" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
'
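
As a rough sketch of how a client might consume such a response, the Python helper below collects the highlight element of each hit. The sample response is made up for illustration, and the function name collect_highlights is hypothetical; only the hits.hits[].highlight layout is taken from the search response format.

```python
def collect_highlights(response):
    """Map each hit's _id to its highlight dict (empty if the hit has none)."""
    hits = response.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit.get("highlight", {}) for hit in hits}


# Hypothetical response, trimmed to the parts used here.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "highlight": {"content": ["written by <em>kimchy</em>"]}},
            {"_id": "2"},  # a hit without any highlighted fragment
        ]
    }
}

highlights = collect_highlights(sample_response)
print(highlights["1"]["content"][0])
```

Hits that produced no fragment simply map to an empty dict, so callers do not need to special-case them.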

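Elasticsearch supports three highlighters: unified, plain, and fvh (fast vector highlighter).

Unified highlighter

The unified highlighter uses the Lucene Unified Highlighter. It breaks the text into sentences and uses the BM25 algorithm to score individual sentences as if they were documents in the corpus. It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the default highlighter.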
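
To make the idea of "scoring sentences as if they were documents" more concrete, here is a minimal Python sketch that applies the textbook BM25 formula to the sentences of a single text. It is not Lucene's implementation: the naive sentence splitter, the tokenizer, and the k1 and b defaults are all assumptions chosen for illustration.

```python
import math
import re


def bm25_sentence_scores(text, query_terms, k1=1.2, b=0.75):
    """Score each sentence of `text` against `query_terms` with textbook BM25."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    if not tokenized:
        return []
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    scored = []
    for sentence, tokens in zip(sentences, tokenized):
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)
            if tf == 0:
                continue
            # Number of sentences containing the term (the "document frequency").
            df = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((sentence, score))
    return scored


text = "Lucene powers the index. Kimchy wrote the first version. Everyone uses it."
scored = bm25_sentence_scores(text, ["kimchy"])
best_sentence = max(scored, key=lambda pair: pair[1])[0]
print(best_sentence)
```

The sentence containing the query term receives a positive score while the others stay at zero, which is the property the highlighter relies on when choosing fragments.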

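`Plain` highlighter

The plain highlighter uses the standard Lucene highlighter. It attempts to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.

Note: the plain highlighter works best for highlighting simple query matches in a single field. To reflect the query logic accurately, it creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current document. This is repeated for every field and every document that needs to be highlighted. If you want to highlight many fields in many documents with complex queries, we recommend using the unified highlighter on postings or term_vector fields.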

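`Fast vector` highlighter

The fvh highlighter uses the Lucene Fast Vector highlighter. It can be used on fields whose mapping sets term_vector to with_positions_offsets. The fast vector highlighter:

Can be customized with a boundary_scanner.
Requires term_vector to be set to with_positions_offsets, which increases the size of the index.
Can combine matches from multiple fields into one result. See matched_fields.
Can assign different weights to matches at different positions, so that, for example, phrase matches can be sorted above term matches.

Note: the fvh highlighter does not support span queries. If you need support for span queries, try an alternative highlighter, such as the plain highlighter.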
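
As an illustration of the term_vector requirement, the Python sketch below builds an index-creation body whose text field stores positions and offsets. The field name comment and the helper name fvh_index_body are assumptions for this example; the resulting JSON would be sent as the body of an index-creation request.

```python
import json


def fvh_index_body(field_name):
    """Index-creation body that stores positions and offsets for one text field."""
    return {
        "mappings": {
            "properties": {
                field_name: {
                    "type": "text",
                    # Needed by the fvh highlighter; increases index size.
                    "term_vector": "with_positions_offsets",
                }
            }
        }
    }


body = fvh_index_body("comment")
print(json.dumps(body, indent=2))
```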

Overriding global settings

You can specify highlighter settings globally and selectively override them for individual fields.

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "number_of_fragments" : 3,
    "fragment_size" : 150,
    "fields" : {
      "body" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
      "blog.title" : { "number_of_fragments" : 0 },
      "blog.author" : { "number_of_fragments" : 0 },
      "blog.comment" : { "number_of_fragments" : 5, "order" : "score" }
    }
  }
}
'
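
One way to picture the override behaviour is as a per-field merge in which field-level values win over global ones. The Python sketch below is only a mental model under that assumption, not Elasticsearch's code, and the function name effective_highlight_settings is hypothetical.

```python
def effective_highlight_settings(highlight):
    """Return per-field settings, with field-level values overriding global ones."""
    global_settings = {k: v for k, v in highlight.items() if k != "fields"}
    return {
        field: {**global_settings, **overrides}
        for field, overrides in highlight.get("fields", {}).items()
    }


highlight = {
    "number_of_fragments": 3,
    "fragment_size": 150,
    "fields": {
        "body": {"pre_tags": ["<em>"], "post_tags": ["</em>"]},
        "blog.title": {"number_of_fragments": 0},
        "blog.comment": {"number_of_fragments": 5, "order": "score"},
    },
}

settings = effective_highlight_settings(highlight)
print(settings["blog.title"])
```

In this model, body inherits both global values, while blog.title and blog.comment keep the global fragment_size but replace number_of_fragments.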

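Specifying a highlight query

You can specify a highlight_query to take additional information into account when highlighting. For example, the highlight_query in the following request contains both the search query and the rescore query. Without highlight_query, highlighting would only take the search query into account.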

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "comment": {
        "query": "foo bar"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "comment": {
            "query": "foo bar",
            "slop": 1
          }
        }
      },
      "rescore_query_weight": 10
    }
  },
  "_source": false,
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": {
                "comment": {
                  "query": "foo bar"
                }
              }
            },
            "should": {
              "match_phrase": {
                "comment": {
                  "query": "foo bar",
                  "slop": 1,
                  "boost": 10.0
                }
              }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}
'
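
The highlight_query in the request above follows a simple pattern: the search query goes into must and the rescore query into should of a bool query. A small Python sketch of that composition, with the hypothetical helper name combined_highlight_query, might look like this:

```python
def combined_highlight_query(search_query, rescore_query):
    """Wrap a search query and a rescore query in one bool query for highlighting."""
    return {
        "bool": {
            "must": search_query,
            "should": rescore_query,
            # The rescore clause is optional for a document to be highlighted.
            "minimum_should_match": 0,
        }
    }


search_query = {"match": {"comment": {"query": "foo bar"}}}
rescore_query = {
    "match_phrase": {"comment": {"query": "foo bar", "slop": 1, "boost": 10.0}}
}

highlight_query = combined_highlight_query(search_query, rescore_query)
print(highlight_query["bool"]["should"])
```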

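Setting the highlighter type

The type field lets you force a specific highlighter type for a field. Allowed values are unified, plain, and fvh. The following example forces the use of the plain highlighter: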

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "user.id": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "type": "plain" }
    }
  }
}
'

Configuring highlighting tags

By default, highlighted text is wrapped in <em> and </em> tags. The tags can be changed with pre_tags and post_tags, for example:

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "pre_tags" : ["<tag1>"],
    "post_tags" : ["</tag1>"],
    "fields" : {
      "body" : {}
    }
  }
}
'
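
To show what the tag settings change in a returned fragment, the Python sketch below wraps whole-word occurrences of a term in a configurable pair of tags. It only imitates the visible effect: Elasticsearch locates matches through analyzed tokens, not through a regular expression, and the helper name wrap_matches is hypothetical.

```python
import re


def wrap_matches(text, term, pre_tag="<em>", post_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of `term` in the given tags."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return pattern.sub(lambda m: pre_tag + m.group(0) + post_tag, text)


default_fragment = wrap_matches("kimchy wrote this", "kimchy")
custom_fragment = wrap_matches("kimchy wrote this", "kimchy", "<tag1>", "</tag1>")
print(default_fragment)
print(custom_fragment)
```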

When using the fast vector highlighter, you can specify additional tags; they are ordered by "importance".

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "pre_tags" : ["<tag1>", "<tag2>"],
    "post_tags" : ["</tag1>", "</tag2>"],
    "fields" : {
      "body" : {}
    }
  }
}
'

You can also use the built-in styled tag schema:

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "tags_schema" : "styled",
    "fields" : {
      "comment" : {}
    }
  }
}
'

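Highlighting on `source`

Setting force_source forces highlighting to be based on the source, even if the field is stored separately. It defaults to false: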

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "comment" : {"force_source" : true}
    }
  }
}
'

Highlighting all fields

By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields.

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "require_field_match": false,
    "fields": {
      "body" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
    }
  }
}
'

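Combining matches on multiple fields

Note: this is only supported by the fvh highlighter.

The fast vector highlighter can combine matches on multiple fields to highlight a single field. This is most intuitive for multi-fields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field into which the matches are combined is loaded.

In the following examples, comment is analyzed by the english analyzer and comment.plain is analyzed by the standard analyzer: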

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "comment.plain:running scissors",
      "fields": [ "comment" ]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": [ "comment", "comment.plain" ],
        "type": "fvh"
      }
    }
  }
}
'

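The request above matches both "run with scissors" and "running with scissors", and it highlights "running" and "scissors" but not "run". If both phrases appear in a large document, "running with scissors" is sorted above "run with scissors" in the fragment list because that fragment contains more matches.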

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["comment", "comment.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment", "comment.plain"],
        "type" : "fvh"
      }
    }
  }
}
'

The request above highlights "run" as well as "running" and "scissors", but it still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": [ "comment", "comment.plain^10" ]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": [ "comment.plain" ],
        "type": "fvh"
      }
    }
  }
}
'

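Explicitly ordering highlighted fields

Elasticsearch highlights fields in the order in which they are sent, but according to the JSON specification, objects are unordered. If you need to be explicit about the order in which fields are highlighted, specify fields as an array: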

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "highlight": {
    "fields": [
      { "title": {} },
      { "text": {} }
    ]
  }
}
'

None of the highlighters built into Elasticsearch care about the order in which fields are highlighted, but a plugin might.
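
If the field order is held in application code, the array form can be generated from a list, which preserves order by construction. A minimal Python sketch, with the hypothetical helper name ordered_highlight_fields:

```python
import json


def ordered_highlight_fields(field_names):
    """Build the array form of `fields`: one single-key object per field, in order."""
    return [{name: {}} for name in field_names]


highlight = {"fields": ordered_highlight_fields(["title", "text"])}
print(json.dumps({"highlight": highlight}))
```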

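Controlling highlighted fragments

Each highlighted field can control the size of the highlighted fragments in characters (default 100) and the maximum number of fragments to return (default 5). For example: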

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
    }
  }
}
'

In addition, you can specify that the highlighted fragments be sorted by score:

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "order" : "score",
    "fields" : {
      "comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
    }
  }
}
'

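If number_of_fragments is set to 0, no fragments are produced. Instead, the entire content of the field is returned and highlighted. This is handy when highlighting short texts. In the following example, fragment_size is ignored for blog.title: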

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match": { "user.id": "kimchy" }
  },
  "highlight" : {
    "fields" : {
      "body" : {},
      "blog.title" : {"number_of_fragments" : 0}
    }
  }
}
'

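When using fvh, you can use the fragment_offset parameter to control the margin from which highlighting starts.

When there is no matching fragment to highlight, nothing is returned by default. Instead, you can return a snippet of text from the beginning of the field by setting no_match_size (default 0) to the length of text you want returned. The actual length may be shorter or longer than specified because the highlighter tries to break on a word boundary.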

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": { "user.id": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "no_match_size": 150
      }
    }
  }
}
'

Specifying a fragmenter for the `plain` highlighter

When using the plain highlighter, you can choose between the simple and span fragmenters:

curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "simple"
      }
    }
  }
}
'

Response:

{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "message": "some message with the number 1",
          "context": "bar"
        },
        "highlight": {
          "message": [
            " with the <em>number</em>",
            " <em>1</em>"
          ]
        }
      }
    ]
  }
}

curl -X GET "localhost:9200/my-index-000001/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "span"
      }
    }
  }
}
'

Response:

{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "my-index-000001",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "message": "some message with the number 1",
          "context": "bar"
        },
        "highlight": {
          "message": [
            " with the <em>number</em> <em>1</em>"
          ]
        }
      }
    ]
  }
}

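If the number_of_fragments option is set to 0, NullFragmenter is used, which does not fragment the text at all. This is useful for highlighting the entire content of a document or field.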

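How highlighters work internally

Given a query and a text (the content of a document field), the goal of a highlighter is to find the best text fragments for the query and to highlight the query terms in the fragments it finds. To do this, a highlighter needs to address the following questions:

How to break the text into fragments?
How to find the best fragments among all fragments?
How to highlight the query terms in a fragment?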

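How to break the text into fragments?

Relevant settings: fragment_size, fragmenter, highlighter type, boundary_chars, boundary_max_scan, boundary_scanner, boundary_scanner_locale.

The plain highlighter begins by analyzing the text with the given analyzer and creating a token stream from it. It then uses a very simple algorithm to break the token stream into fragments: it loops through the terms in the token stream, and every time the current term's end_offset exceeds fragment_size multiplied by the number of fragments created so far, a new fragment is created. The span fragmenter does a little more computation to avoid splitting the text between highlighted terms. Overall, though, because the breaks are determined only by fragment_size, some fragments can look quite odd, for example starting with a punctuation mark.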
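
The loop described above can be sketched in a few lines of Python. This is a simplified model, not Lucene's code: tokens are found with a \w+ regular expression instead of a real analyzer, and the comparison is written as >= (an assumption), because that reproduces the three-way split visible in the simple fragmenter response shown earlier for fragment_size 15.

```python
import re


def simple_fragments(text, fragment_size):
    """Split `text` the way the simple fragmenter is described: by token end offsets."""
    fragments = []
    created = 1      # number of fragments created so far, including the current one
    start = 0        # character offset where the current fragment begins
    prev_end = 0     # end offset of the previous token
    for token in re.finditer(r"\w+", text):
        if token.end() >= fragment_size * created:
            # Close the current fragment at the end of the previous token.
            if prev_end > start:
                fragments.append(text[start:prev_end])
            start = prev_end
            created += 1
        prev_end = token.end()
    if prev_end > start:
        fragments.append(text[start:prev_end])
    return fragments


fragments = simple_fragments("some message with the number 1", 15)
print(fragments)
```

Under these assumptions the sample message splits into "some message", " with the number" and " 1", which lines up with the fragments " with the <em>number</em>" and " <em>1</em>" in that response.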

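The unified and FVH highlighters do a better job of breaking the text into fragments by using Java's BreakIterator. This ensures that a fragment is a valid sentence, as long as fragment_size allows it.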