Elasticsearch Basics: Parent-Child Relationships (Legacy 5.x)

Java全栈研发大联盟


Original link: https://blog.csdn.net/ctwy291314/article/details/81407925

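This article looks at the Parent-Child model in Elasticsearch: its advantages, its uses, and how to query it, including how to define the parent-child relationship, how to bulk-index documents, and how to find parents by their children and children by their parents.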

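The Parent-Child relationship in Elasticsearch is similar to the nested model: both can be used to model complex data structures. The difference is that the nested type packs all entities into a single document, whereas Parent-Child keeps them relatively independent, with each entity stored as its own document.
Advantages of Parent-Child
1. A parent document can be updated without reindexing its children.
2. Child documents can be added, changed, or deleted without affecting the parent or any other children. This is especially useful when child documents are numerous and change frequently.
3. Child documents can be returned as search results in their own right.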

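Elasticsearch maintains an internal map of the Parent-Child relationships (explanation: the map's key is a parent document ID and its value is the IDs of that parent's children). This map is what makes the join queries fast, but it comes with a restriction: a parent document and all of its children must be stored on the same shard.
The parent-child ID map is stored in doc values, so lookups are fast when it fits in memory, and part of it can spill to disk when the map is very large.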

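Parent-Child Mapping

To set up a Parent-Child model, declare which type is the parent of which when you create the mapping, or use the update-mapping API before any child documents have been created.

For example, suppose a company has branches all over the country and we want to analyze the relationship between employees and branches.
We use a Parent-Child structure for this.
We need an employee type and a branch type, and we declare branch as the _parent of employee. The // annotations in the requests below are explanatory only; remove them before sending a request.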

PUT /company
{
  "mappings": {   // define the parent-child relationship between branch and employee
    "branch": {},
    "employee": {
      "_parent": {  // documents of type employee have a parent of type branch
        "type": "branch"
      }
    }
  }
}

Indexing Parents and Children

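Indexing parent documents is no different from indexing any other document. Parents do not need to know anything about their children.

POST /company/branch/_bulk    // _bulk means bulk indexing; here we bulk-create branches
{ "index": { "_id": "london" }}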
{ "name": "London Westminster", "city": "London", "country": "UK" }
{ "index": { "_id": "liverpool" }}
{ "name": "Liverpool Central", "city": "Liverpool", "country": "UK" }
{ "index": { "_id": "paris" }}
{ "name": "Champs Élysées", "city": "Paris", "country": "France" }

When indexing a child document, you must specify the ID of its parent.

PUT /company/employee/1?parent=london // create an employee whose parent ID is london
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

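Specifying the parent ID serves two purposes: it links the child to its parent, and it ensures that the child is stored on the same shard as the parent.
As explained in the chapter on routing, Elasticsearch uses a routing value to decide which shard a document belongs to. If no routing value is specified, it defaults to the document's _id. The shard is computed as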

shard = hash(routing) % number_of_primary_shards

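However, if a parent ID is specified, it is used as the routing value instead of the _id. In other words, the parent and the child share the same routing value, which guarantees that they are allocated to the same shard.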
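The routing rule above can be sketched in a few lines of Python. This is only an illustration: CRC32 stands in for Elasticsearch's real hash function, the shard count of 5 is an assumed value, and the helper names `shard_for` and `routing_value` are hypothetical.

```python
import zlib


def shard_for(routing: str, number_of_primary_shards: int) -> int:
    """shard = hash(routing) % number_of_primary_shards (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def routing_value(doc_id: str, parent=None) -> str:
    """The routing value defaults to the document's own _id; a parent ID replaces it."""
    return parent if parent is not None else doc_id


shards = 5  # assumed number of primary shards

# The branch "london" has no parent, so it is routed by its own _id.
branch_shard = shard_for(routing_value("london"), shards)
# Employee 1 is indexed with parent=london, so it is routed by "london" too.
employee_shard = shard_for(routing_value("1", parent="london"), shards)

print(branch_shard == employee_shard)  # True
```

Because both documents are hashed on the same string, the result holds for any hash function and any shard count.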
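When retrieving a child document with a GET request, we must specify the parent ID, and the same applies when indexing, updating, or deleting a child document. Unlike search requests, which are fanned out to all shards, these single-document requests are sent only to the shard that holds the document. Without the parent ID, the request may well be sent to the wrong shard.
The parent ID must also be specified when using the bulk API.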

POST /company/employee/_bulk   // _bulk means bulk indexing; here we bulk-create employees
{ "index": { "_id": 2, "parent": "london" }}  // specify the parent ID for each employee
{ "name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving" }
{ "index": { "_id": 3, "parent": "liverpool" }} // specify the parent ID for each employee
{ "name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking" }
{ "index": { "_id": 4, "parent": "paris" }} // specify the parent ID for each employee
{ "name": "Adrien Grand", "dob": "1987-05-11", "hobby": "horses" }
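When the bulk body is generated by a program rather than typed by hand, the action line and the source line for each child can be produced in pairs. The following is a minimal Python sketch that only builds the request body as a string; the helper name `bulk_body` is hypothetical, and sending the body to the cluster is left out.

```python
import json


def bulk_body(children):
    """Build a newline-delimited bulk body; each item is (doc_id, parent_id, source)."""
    lines = []
    for doc_id, parent_id, source in children:
        # Action line: carries the child's _id and its parent ID.
        lines.append(json.dumps({"index": {"_id": doc_id, "parent": parent_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


employees = [
    (2, "london", {"name": "Mark Thomas", "dob": "1982-05-16", "hobby": "diving"}),
    (3, "liverpool", {"name": "Barry Smith", "dob": "1979-04-01", "hobby": "hiking"}),
]
body = bulk_body(employees)
print(body)
```

Each line of the body is a complete JSON object, so no annotations can appear inside it.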

Finding Parents by Their Children

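The has_child query and filter find parent documents based on the contents of their children. For example, the following query finds all branches that have employees born in 1980 or later.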

GET /company/branch/_search  // search across all branches
{
  "query": {
    "has_child": {   // match parents that have at least one matching child
      "type": "employee",  // the child type is employee
      "query": {
        "range": {
          "dob": {
            "gte": "1980-01-01" // the employee's dob is on or after 1980-01-01
          }
        }    // result: the branches that have matching employees
      }
    }
  }
}

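The has_child query may match several child documents, each with a different relevance score. How these scores are reduced to a single score for the parent document depends on the score_mode parameter. The default is none, which ignores the child scores and assigns the parent a score of 1.0.

The following query returns both london and liverpool. (Hint: of the four employees created earlier, employee 1 is named "Alice Smith" and employee 3 is named "Barry Smith". Employee 1, whose parent is london, matches the query better than employee 3, whose parent is liverpool.) london therefore gets the higher score, because Alice Smith is the better match.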

GET /company/branch/_search // search across all branches
{
  "query": {
    "has_child": {   // match parents that have at least one matching child
      "type":       "employee",   // the child type is employee
      "score_mode": "max", // the parent takes the score of its best-matching child
      "query": {
        "match": {
          "name": "Alice Smith"  // condition: the employee's name matches "Alice Smith"
        }
      }
    }
  }
}
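The way score_mode collapses several child scores into one parent score can be illustrated with a small Python sketch. It assumes the modes none, min, max, sum, and avg; the child scores are made-up numbers, the function name `parent_score` is hypothetical, and real scoring in Elasticsearch involves more than this.

```python
def parent_score(child_scores, score_mode="none"):
    """Reduce the scores of a parent's matching children to a single parent score."""
    if not child_scores:
        raise ValueError("a parent only matches if at least one child matches")
    if score_mode == "none":
        return 1.0  # child scores are ignored
    reducers = {
        "min": min,
        "max": max,
        "sum": sum,
        "avg": lambda scores: sum(scores) / len(scores),
    }
    return reducers[score_mode](child_scores)


scores = [0.5, 2.0]  # hypothetical scores of two matching children

print(parent_score(scores))         # 1.0
print(parent_score(scores, "max"))  # 2.0
print(parent_score(scores, "avg"))  # 1.25
```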

min_children and max_children

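Both the has_child query and filter accept the min_children and max_children parameters, which restrict the results to parents whose number of matching children falls within the given range.
The following query returns the branches that have at least two employees.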

GET /company/branch/_search
{
  "query": {
    "has_child": {
      "type":         "employee",
      "min_children": 2,  // require at least two matching children
      "query": {
        "match_all": {}
      }
    }
  }
}
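The effect of min_children and max_children can be mimicked in plain Python by counting children per parent. This is only a sketch of the counting logic, using the employee-to-branch assignments from the examples above; the function name `parents_with_child_count` is hypothetical.

```python
from collections import Counter


def parents_with_child_count(child_to_parent, min_children=1, max_children=None):
    """Return the parents whose number of children lies in [min_children, max_children]."""
    counts = Counter(child_to_parent.values())
    return sorted(
        parent
        for parent, n in counts.items()
        if n >= min_children and (max_children is None or n <= max_children)
    )


# Employee _id -> parent branch, as indexed earlier.
employees = {1: "london", 2: "london", 3: "liverpool", 4: "paris"}

print(parents_with_child_count(employees, min_children=2))  # ['london']
```

With this data only london has two employees, so it is the only branch returned.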

Finding Children by Their Parents

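Unlike nested queries, which can only return the root document, parent and child documents in a Parent-Child structure are independent and can each be queried on their own. The has_child query returns parents based on their children, while the has_parent query returns children based on their parents.
It looks very similar to the has_child query. The following query returns the employees who work in the UK.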

GET /company/employee/_search // search across all employees
{
  "query": {
    "has_parent": { // match children whose parent matches
      "type": "branch",   // the parent type is branch
      "query": {
        "match": {
          "country": "UK"  // the branch's country is "UK"
        }
      }
    }
  }
}

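The has_parent query also supports score_mode, but with only two settings: none (the default) and score. Each child can have only one parent, so there is no need to reduce multiple scores into a single score for the child. The choice is simply whether to use the parent's score (score) or not (none).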

Grandparents and Grandchildren

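A parent-child relationship is not limited to two generations. It can span multiple generations, but all related documents must be stored on the same shard.
Let's modify the previous example slightly and make country the parent of branch.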

PUT /company
{
  "mappings": {
    "country": {},
    "branch": {
      "_parent": {   // the parent of branch is country
        "type": "country"
      }
    },
    "employee": {
      "_parent": {   // the parent of employee is branch
        "type": "branch"
      }
    }
  }
}

Countries and branches have a simple parent-child relationship, so we index them the same way as before.

POST /company/country/_bulk  // _bulk means bulk indexing; here we bulk-create countries
{ "index": { "_id": "uk" }}
{ "name": "UK" }
{ "index": { "_id": "france" }}
{ "name": "France" }

POST /company/branch/_bulk   // _bulk means bulk indexing; here we bulk-create branches
{ "index": { "_id": "london", "parent": "uk" }} // the parent ID is "uk"
{ "name": "London Westminster" }
{ "index": { "_id": "liverpool", "parent": "uk" }} // the parent ID is "uk"
{ "name": "Liverpool Central" }
{ "index": { "_id": "paris", "parent": "france" }} // the parent ID is "france"
{ "name": "Champs Élysées" }

The parent ID ensures that each branch is routed to the same shard as its parent country.
But what happens if we index the employee data the same way as before?

PUT /company/employee/1?parent=london  // create employee 1 with parent ID london
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

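The shard for the employee document is chosen from its parent ID, london, but the london document itself was routed by its own parent ID, uk (the parent of london is the UK). The employee is therefore very likely to end up on a different shard from its country and branch. (Explanation: the employee still records london as its parent, but parent-child joins are resolved within a single shard, so documents on different shards will not be matched to each other.)
So we need an extra routing parameter to ensure that all related documents are allocated to the same shard.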

PUT /company/employee/1?parent=london&routing=uk // routing selects the shard; here we use the same routing value as uk
{
  "name":  "Alice Smith",
  "dob":   "1970-10-24",
  "hobby": "hiking"
}

The parent parameter is still used to link the child to its parent, while the routing parameter determines which shard the document is stored on.
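The three-generation routing problem can be reproduced with a short Python sketch. As before, CRC32 is only a stand-in for Elasticsearch's real hash, 5 primary shards is an assumed value, and the helper names `shard_for` and `effective_routing` are hypothetical.

```python
import zlib


def shard_for(routing, number_of_primary_shards=5):
    """shard = hash(routing) % number_of_primary_shards (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


def effective_routing(doc_id, parent=None, routing=None):
    """An explicit routing value wins, then the parent ID, then the document's own _id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id


country = shard_for(effective_routing("uk"))
branch = shard_for(effective_routing("london", parent="uk"))
employee_default = shard_for(effective_routing("1", parent="london"))
employee_routed = shard_for(effective_routing("1", parent="london", routing="uk"))

# With routing=uk, all three generations hash the same string.
print(country == branch == employee_routed)  # True
# Without it, the employee hashes "london"; landing on the same shard is not guaranteed.
print(employee_default == country)
```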
Queries and aggregations still work across multiple generations. For example, to find the countries whose employees enjoy hiking:

GET /company/country/_search // search across all countries
{
  "query": {
    "has_child": {  // match countries that have a matching child
      "type": "branch",  // the child type is branch
      "query": {
        "has_child": {   // match branches that have a matching child
          "type": "employee",  // the child type is employee
          "query": {
            "match": {
              "hobby": "hiking" // the employee's hobby is "hiking"
            }
          }
        }
      }
    }
  }
}
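The same nesting pattern extends to any number of generations: one has_child clause per level, wrapped around the query on the youngest generation. A minimal Python sketch that builds such a request body as a dictionary follows; the helper name `nested_has_child` is hypothetical, and sending the request is left out.

```python
import json


def nested_has_child(child_types, leaf_query):
    """Wrap leaf_query in one has_child clause per generation, outermost type first."""
    query = leaf_query
    # Build from the innermost generation outwards.
    for child_type in reversed(child_types):
        query = {"has_child": {"type": child_type, "query": query}}
    return query


body = {
    "query": nested_has_child(
        ["branch", "employee"],
        {"match": {"hobby": "hiking"}},
    )
}
print(json.dumps(body, indent=2))
```

The resulting body has the same structure as the country search above, without the annotations.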