14. ES nested in detail (2019-05-22)

eighthroute

Article link: https://blog.csdn.net/eighthroute/article/details/98223524

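1. Introducing the problem:
Because creating, deleting, and updating a single document are atomic operations in ES, it makes sense to store closely related entities in the same document.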
PUT /blog/_doc/1
{
"title":"Nest eggs",
"body":"Making your money work...",
"tags":[
"cash",
"shares"
],
"comments":[
{
"name":"John Smith",
"comment":"Great article",
"age":28,
"stars":4,
"date":"2014-09-01"
},
{
"name":"Alice White",
"comment":"More like this please",
"age":31,
"stars":5,
"date":"2014-10-22"
}
]
}

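Because all of the content lives in one document, there is no need to join multiple documents at query time, so search performance is better. However, the document above also matches a query like this: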
POST /blog/_search
{
"query":{
"bool":{
"must":[
{
"match":{
"comments.name":"Alice"
}
},
{
"match":{
"comments.age":28
}
}
]
}
}
}

It returns a hit! But Alice is 31, not 28!
This cross-object match happens because the JSON document is flattened into a simple key-value format at the Lucene level, like this:
{
"title": ["eggs","nest"],
"body": ["making","money","work","your"],
"tags": ["cash","shares"],
"comments.name": ["alice","john","smith","white"],
"comments.comment": ["article","great","like","more","please","this"],
"comments.age": [28, 31],
"comments.stars": [4, 5],
"comments.date": ["2014-09-01","2014-10-22"]
}

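Clearly, associations such as 'alice'/31 or 'john'/'2014-09-01' are inevitably lost.
Although fields of type object are useful for storing a single object, they are useless, from a search point of view, for storing an array of objects.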
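
The flattening described above can be reproduced outside Elasticsearch. The following Python sketch is only an illustration, not how Lucene is implemented: the `flatten` and `matches` helpers are hypothetical names, and text analysis is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

doc = {
    "comments": [
        {"name": "John Smith", "age": 28},
        {"name": "Alice White", "age": 31},
    ]
}


def flatten(document):
    """Merge an array of objects into multi-valued fields, as the object type does."""
    fields = defaultdict(set)
    for comment in document["comments"]:
        for key, value in comment.items():
            # Crude stand-in for analysis: lowercase and split strings on whitespace.
            tokens = value.lower().split() if isinstance(value, str) else [value]
            fields["comments." + key].update(tokens)
    return fields


def matches(fields, name, age):
    """Both conditions are checked against the merged fields, not per comment."""
    return name in fields["comments.name"] and age in fields["comments.age"]


flat = flatten(doc)
print(sorted(flat["comments.name"]))  # ['alice', 'john', 'smith', 'white']
print(matches(flat, "alice", 28))     # True, although Alice is 31
```

Once the values are merged per field, nothing records which name belonged to which age, so the combination alice/28 cannot be told apart from a real comment.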

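2. Solving the problem with nested:
nested objects were designed to solve exactly this problem. When the comments field is mapped as type nested instead of type object, each nested object is indexed as a hidden separate document, as follows: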
{
"comments.name": ["john", "smith"],
"comments.comment": ["article", "great"],
"comments.age": [28],
"comments.stars": [4],
"comments.date": ["2014-09-01"]
}
{
"comments.name": ["alice","white"],
"comments.comment": ["like","more","please","this"],
"comments.age": [31],
"comments.stars": [5],
"comments.date": ["2014-10-22"]
}
{
"title": ["eggs","nest"],
"body": ["making","money","work","your"],
"tags": ["cash","shares"]
}

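Because each nested object is indexed separately, the relationships between the fields inside an object are preserved. A query matches only when its conditions are met within the same nested object. In addition, because of the way nested objects are stored, joining the root document with its nested object documents at query time is fast, almost as fast as if they were a single document. These extra nested documents are hidden, and we cannot access them directly. To update, add, or remove a nested object, the whole document has to be reindexed. One more thing to remember: a search request returns the whole document, not just the matching nested objects.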
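
As a rough sketch of this per-object matching, the Python snippet below checks both conditions against one comment at a time. `nested_match` is a hypothetical helper written for this illustration, and tokenization is again simplified to lowercasing and splitting on whitespace.

```python
comments = [
    {"name": "John Smith", "age": 28},
    {"name": "Alice White", "age": 31},
]


def nested_match(nested_docs, name, age):
    """Return True only if a single comment satisfies both conditions."""
    for comment in nested_docs:
        name_tokens = comment["name"].lower().split()
        if name in name_tokens and comment["age"] == age:
            return True
    return False


print(nested_match(comments, "alice", 28))  # False: no single comment has both
print(nested_match(comments, "alice", 31))  # True
print(nested_match(comments, "john", 28))   # True
```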

3. Setting up the mapping for nested:
1). Delete the previous index:
DELETE /blog

2). Creating a nested field is simple: wherever you would normally specify type object, use type nested instead:
PUT /blog
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"body": {
"type": "text"
},
"tags": {
"type": "keyword"
},
"comments": {
"type": "nested",
"properties": {
"name": {
"type": "text"
},
"comment": {
"type": "text"
},
"age": {
"type": "short"
},
"stars": {
"type": "short"
},
"date": {
"type": "date"
}
}
}
}
}
}

3). Index the earlier document again:
PUT /blog/_doc/1
{
"title":"Nest eggs",
"body":"Making your money work...",
"tags":[
"cash",
"shares"
],
"comments":[
{
"name":"John Smith",
"comment":"Great article",
"age":28,
"stars":4,
"date":"2014-09-01"
},
{
"name":"Alice White",
"comment":"More like this please",
"age":31,
"stars":5,
"date":"2014-10-22"
}
]
}

4). The earlier query no longer finds anything:
POST /blog/_search
{
"query":{
"bool":{
"must":[
{
"match":{
"comments.name":"Alice"
}
},
{
"match":{
"comments.age":28
}
}
]
}
}
}

4. Searching nested fields:
Because nested objects are indexed as separate hidden documents, we cannot query them directly. Instead, we have to use the nested query or the nested filter to reach them:
GET /blog/_search
{
"query":{
"bool":{
"must":[
{
"match":{
"title":"eggs"
}
},
{
"nested":{
"path":"comments",
"query":{
"bool":{
"must":[
{
"match":{
"comments.name":"john"
}
},
{
"match":{
"comments.age":28
}
}
]
}
}
}
}
]
}
}
}

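A nested field can contain other nested fields. Similarly, a nested query can contain other nested queries. The nesting hierarchy can be as deep as you need. Of course, a nested query can match several nested documents. Each matching nested document has its own relevance score, but these scores have to be reduced to a single score that is applied to the root document. By default, the scores of all matching nested documents are averaged. This can be changed by setting the score_mode parameter to avg, max, min, or sum, or even to none (the nested scores are ignored and the root document gets a constant score of 1.0; note that the current Elasticsearch reference documents this constant as 0).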
GET /blog/_search
{
"query":{
"bool":{
"must":[
{
"match":{
"title":"eggs"
}
},
{
"nested":{
"path":"comments",
"score_mode":"max",
"query":{
"bool":{
"must":[
{
"match":{
"comments.name":"john"
}
},
{
"match":{
"comments.age":28
}
}
]
}
}
}
}
]
}
}
}
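
To make the effect of score_mode concrete, here is a small Python sketch of how several nested scores could be folded into one root-document score. The `combine_scores` helper and the sample scores are made up for this illustration; `min` is included because the nested query also accepts it, and `none` is left out because it simply ignores the nested scores.

```python
from statistics import mean


def combine_scores(nested_scores, score_mode="avg"):
    """Fold the scores of matching nested documents into one root-document score."""
    reducers = {"avg": mean, "max": max, "min": min, "sum": sum}
    return reducers[score_mode](nested_scores)


scores = [0.5, 1.5, 2.5]  # made-up scores of three matching nested documents
for mode in ("avg", "max", "min", "sum"):
    print(mode, combine_scores(scores, mode))
```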

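The nested filter is very similar to the nested query, except that a filter does not compute scores. In current Elasticsearch versions, this means placing the nested query in a filter context, for example in the filter clause of a bool query.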

5. Sorting by nested fields:
1). We can sort by the value of a nested field. First index one more document:
PUT /blog/_doc/2
{
"title":"Investment secrets",
"body":"What they don't tell you ...",
"tags":[
"shares",
"equities"
],
"comments":[
{
"name":"Mary Brown",
"comment":"Lies, lies, lies",
"age":42,
"stars":1,
"date":"2014-10-18"
},
{
"name":"John Smith",
"comment":"You're making it up!",
"age":28,
"stars":2,
"date":"2014-10-16"
}
]
}

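2). Suppose we want to retrieve blog posts that received comments in October, sorted by the lowest number of stars each post received. The search request looks like this: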
POST /blog/_search
{
"query": {
"nested": {
"path": "comments",
"query": {
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
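},
"sort": {
"comments.stars": {
"order": "asc",
"mode": "min",
"nested": {
"path": "comments",
"filter": {
"range": {
"comments.date": {
"gte": "2014-10-01",
"lt": "2014-11-01"
}
}
}
}
}
}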
}
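
Which comments take part in mode min depends on whether the sort is restricted to the comments that fall in the date range. The Python sketch below, using only the stars and date values of the two sample documents, computes the sort key both ways; `in_october` and `min_stars` are hypothetical helpers for this illustration.

```python
posts = {
    1: [  # "Nest eggs"
        {"stars": 4, "date": "2014-09-01"},
        {"stars": 5, "date": "2014-10-22"},
    ],
    2: [  # "Investment secrets"
        {"stars": 1, "date": "2014-10-18"},
        {"stars": 2, "date": "2014-10-16"},
    ],
}


def in_october(comment):
    # ISO dates compare correctly as plain strings.
    return "2014-10-01" <= comment["date"] < "2014-11-01"


def min_stars(comments, only_october):
    """Sort key of one post: lowest star rating among the considered comments."""
    considered = [c for c in comments if in_october(c)] if only_october else comments
    return min(c["stars"] for c in considered)


for only_october in (False, True):
    keys = {post_id: min_stars(comments, only_october) for post_id, comments in posts.items()}
    order = sorted(keys, key=keys.get)
    print(only_october, keys, order)
```

For document 1 the key is 4 when all comments are considered and 5 when only October comments count. The resulting order happens to be the same for this sample data, but it would not be in general.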

6. Aggregating on nested fields:
1). Just as a dedicated nested query is needed to reach nested objects at search time, a dedicated nested aggregation lets us aggregate on fields inside nested objects:
POST /blog/_search
{
"aggs":{
"comments":{
"nested":{①
"path":"comments"
},
"aggs":{
"by_month":{
"date_histogram":{②
"field":"comments.date",
"interval":"month",
"format":"yyyy-MM"
},
"aggs":{
"avg_stars":{
"avg":{③
"field":"comments.stars"
}
}
}
}
}
}
}
}

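①: The nested aggregation steps into the nested comments objects.
②: Comments are bucketed by month based on the comments.date field. (From Elasticsearch 7.2 on, interval is deprecated for date_histogram in favor of calendar_interval.)
③: For each bucket, the average number of stars is calculated.
The result shows that the aggregation really happened at the nested document level: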
...
"aggregations": {
"comments": {
"doc_count": 4,
"by_month": {
"buckets": [
{
"key_as_string": "2014-09",
"key": 1409529600000,
"doc_count": 1,
"avg_stars": {
"value": 4
}
},
{
"key_as_string": "2014-10",
"key": 1412121600000,
"doc_count": 3,
"avg_stars": {
"value": 2.6666666666666665
}
}
]
}
}
}
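
The numbers in this response can be checked by hand. The following Python sketch recomputes the monthly buckets from the four sample comments; it only mirrors the arithmetic and does not call Elasticsearch.

```python
from collections import defaultdict

# (date, stars) of all four comments from documents 1 and 2
comments = [
    ("2014-09-01", 4),
    ("2014-10-22", 5),
    ("2014-10-18", 1),
    ("2014-10-16", 2),
]

buckets = defaultdict(list)
for date, stars in comments:
    buckets[date[:7]].append(stars)  # "yyyy-MM" bucket key

for month in sorted(buckets):
    values = buckets[month]
    # Prints the month, its doc_count, and the average number of stars.
    print(month, len(values), sum(values) / len(values))
```

The total of four comments and the split of one in September and three in October agree with the doc_count values above.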

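2). The reverse_nested aggregation:
A nested aggregation can only access fields inside the nested documents; it cannot see fields in the root document or in a different nested document. However, we can step out of the nested scope and back into the parent with a reverse_nested aggregation. For example, we can find out which tags our commenters are interested in, based on the commenter's age. comments.age is a nested field, while tags is in the root document: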
POST /blog/_search
{
"aggs":{
"comments":{
"nested":{①
"path":"comments"
},
"aggs":{
"age_group":{
"histogram":{②
"field":"comments.age",
"interval":10
},
"aggs":{
"blogposts":{
"reverse_nested":{},③
"aggs":{
"tags":{
"terms":{④
"field":"tags"
}
}
}
}
}
}
}
}
}
}

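①: The nested aggregation steps into the nested comments objects.
②: The histogram aggregation groups on the comments.age field, in buckets of 10 years.
③: The reverse_nested aggregation steps back up to the root document.
④: The terms aggregation counts the popular terms per age group of the commenter.
Here is the abbreviated result: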
...
"aggregations": {
"comments": {
"doc_count": 4,
"age_group": {
"buckets": [
{
"key": 20,
"doc_count": 2,
"blogposts": {
"doc_count": 2,
"tags": {
"doc_count_error_upper_bound": 0,
"buckets": [
{ "key": "shares", "doc_count": 2 },
{ "key": "cash", "doc_count": 1 },
{ "key": "equities", "doc_count": 1 }
]
}
}
},
...
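
The two-step logic of nested followed by reverse_nested can be imitated in plain Python. This is a sketch under simplifying assumptions: each post is reduced to its tags and the ages of its commenters, and the variable names are chosen only for this illustration.

```python
from collections import Counter, defaultdict

posts = [
    {"id": 1, "tags": ["cash", "shares"], "ages": [28, 31]},
    {"id": 2, "tags": ["shares", "equities"], "ages": [42, 28]},
]

# Step 1 (nested + histogram): bucket every comment by decade of age,
# remembering which root document it belongs to.
comment_counts = Counter()
roots_per_bucket = defaultdict(set)
for post in posts:
    for age in post["ages"]:
        bucket = age // 10 * 10
        comment_counts[bucket] += 1
        roots_per_bucket[bucket].add(post["id"])

# Step 2 (reverse_nested + terms): count tags once per root document in each bucket.
tags_by_id = {post["id"]: post["tags"] for post in posts}
tag_counts = {
    bucket: Counter(tag for post_id in ids for tag in tags_by_id[post_id])
    for bucket, ids in roots_per_bucket.items()
}

for bucket in sorted(tag_counts):
    print(bucket, comment_counts[bucket], len(roots_per_bucket[bucket]), dict(tag_counts[bucket]))
```

For the bucket with key 20, this gives two comments, two root documents, and the tag counts shares 2, cash 1, equities 1, in line with the result shown above.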