Elastic Certified Engineer复习记录-复习题详解篇-索引数据（2）

最新推荐文章于 2022-10-19 23:05:23 发布

死敌wen

最新推荐文章于 2022-10-19 23:05:23 发布

阅读量300

点赞数

分类专栏： ECE考试开发基础程序人生

本文链接：https://blog.csdn.net/weixin_40601534/article/details/113993435

版权

开发基础同时被 3 个专栏收录

28 篇文章 0 订阅

订阅专栏

程序人生

24 篇文章 0 订阅

订阅专栏

ECE考试

21 篇文章 3 订阅

订阅专栏

MAPPINGS AND TEXT ANALYSIS

索引和文档的分析（分词）

GOAL: Model relational data

目标：规整带关系的数据模型

REQUIRED SETUP:

初始化步骤
建议docker-compose文件：1e1k_base_cluster.yml

a running Elasticsearch cluster with at least one node and a Kibana instance,
1. 运行一个至少有1个节点的ES集群，以及1个kibana节点
the cluster has no index with name hamlet,
1. 保证这个集群里没有叫hamlet的索引
the cluster has no template that applies to indices starting by `hamlet
1. 保证这个集群里没有能匹配以hamlet开头的索引模板
```
DELETE hamlet_*
DELETE _template/hamlet_*
```

第1题，对象（object）型数据

Create the index hamlet_1 with one primary shard and no replicas
1. 创建一个包含1分片0副本的索引hamlet_1
Add some documents to hamlet_1 by running the following command
1. 用下面的命令给hamlet_插入一些数据

Verify that the items of the relationship array cannot be searched independently - e.g., searching for a friend named Gertrude will return 1 hit

校验一下relationship字段数组里的元素不能被独立搜索，比如搜索"name": "Gertrude"而且"type": "friend"的数据有一个返回

PUT hamlet_1/_doc/_bulk
{"index":{"_index":"hamlet_1","_id":"C0"}}
{"name":"HAMLET","relationship":[{"name":"HORATIO","type":"friend"},{"name":"GERTRUDE","type":"mother"}]}
{"index":{"_index":"hamlet_1","_id":"C1"}}
{"name":"KING CLAUDIUS","relationship":[{"name":"HAMLET","type":"nephew"}]}

第1题，题解

创建索引

PUT hamlet_1
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

插数据，运行上面的命令，过程略。数据结构：GET hamlet_1

{
  "hamlet_1" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "relationship" : {
          "properties" : {
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "type" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1606270886689",
        "number_of_shards" : "1",
        "number_of_replicas" : "0",
        "uuid" : "BaWwDy_eSaKPaynt8rWW3g",
        "version" : {
          "created" : "7020199"
        },
        "provided_name" : "hamlet_1"
      }
    }
  }
}

校验数据

POST hamlet_1/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "relationship.type": "friend"
          }
        },
        {
          "match": {
            "relationship.name": "Gertrude"
          }
        }
      ]
    }
  }
}

返回值

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.2199391,
    "hits" : [
      {
        "_index" : "hamlet_1",
        "_type" : "_doc",
        "_id" : "C0",
        "_score" : 1.2199391,
        "_source" : {
          "name" : "HAMLET",
          "relationship" : [
            {
              "name" : "HORATIO",
              "type" : "friend"
            },
            {
              "name" : "GERTRUDE",
              "type" : "mother"
            }
          ]
        }
      }
    ]
  }
}

第1题，题解说明

这题主要考察object型的数据，对ES来说所有的字段都支持数组，所以relationship这个数组里可以保存多个object型的数据。
- 在没指定数据结构的时候，ES会尝试按数据的结构匹配合理的索引结构，像relationship这种带嵌套结构的数据会默认被解析成object型的数据
- object型的数据是一个类似 map 结构的数据，可以通过里面的key进行检索，但是它和nested型数据的区别在于，列表中的所有对象会被当作一个整体来搜索，而nested型数据的每个对象中的字段可以分别进行搜索
1. 参考链接
2. 页面路径：Mapping =》 Field datatypes =》 Object

第2题，嵌套（nested）型数据

Create the index hamlet_2 with one primary shard and no replicas
1. 创建一个含有1分片0副本的索引hamlet_2
Define a mapping for the default type “_doc” of hamlet_2, so that the inner objects of the relationship field
1. 给hamlet_2的type是默认的"_doc"，同时它的字段需要满足以下条件
2. can be searched independently,
  1. 字段可以被独立搜索
3. have only unanalyzed fields
  1. 只有没分词的字段
Reindex hamlet_1 to hamlet_2
1. 把hamlet_1 reindex 到 hamlet_2里面
Verify that the items of the relationship array can now be searched independently - e.g., searching for a friend named Gertrude will return no hits
1. 校验一下relationship数组里的元素可以被独立搜索，比如，搜索"type": "friend" 而且 "name":"Gertrude"的数据没有返回

第2题，题解

创建索引

PUT hamlet_2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "relationship": {
        "type": "nested"
      }
    }
  }
}

reindex

POST _reindex
{
  "source": {
    "index": "hamlet_1"
  },
  "dest": {
    "index": "hamlet_2"
  }
}

校验数据

直接请求

POST hamlet_2/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "relationship.type": "friend"
          }
        },
        {
          "match": {
            "relationship.name": "Gertrude"
          }
        }
      ]
    }
  }
}

返回值

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

嵌套检索

POST hamlet_2/_search
{
  "query": {
    "nested": {
      "path": "relationship",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "relationship.type": "friend"
              }
            },
            {
              "match": {
                "relationship.name": "Gertrude"
              }
            }
          ]
        }
      }
    }
  }
}

返回值

{
  "took" : 178,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

第2题，题解说明

这题主要考察嵌套（nested）类型数据，它和对象（object）型数据的区别在于nested型数据可以通过指定路径（path）的方式对指定层/位置的数据进行分别的检索
1. 参考链接-nested-datatype
2. 页面路径：Mapping =》 Field datatypes =》 Nested

第3题，父子文档（`parent-join`)

Add more documents to hamlet_2 by running the following command

用下面命令给hamlet_2多塞点数据

POST _bulk
{"index":{"_index":"hamlet_2", "_id":"LO"}}
{"line_number":"1.4.1","speaker":"HAMLET","text_entry":"The air bites shrewdly; it is very cold."}
{"index":{"_index":"hamlet_2","_id":"L1"}}
{"line_number":"1.4.2","speaker":"HORATIO","text_entry":"It is a nipping and an eager air."}
{"index":{"_index":"hamlet_2","_id":"L2"}}
{"line_number":"1.4.3","speaker":"HAMLET","text_entry":"What hour now?"}

Create the index hamlet_3 with only one primary shard and no replicas
1. 创建一个1分片0副本的索引hamlet_3
Copy the mapping of hamlet_2 into hamlet_3, but also add a join field to define a relation between a character (the parent) and a line (the child). The name of such field is “character_or_line”
1. 把hamlet_2的索引结构拷贝到hamlet_3里，同时添加一个名叫character_or_line的join字段来描述character（父文档）和line（子文档）的关系，
Reindex hamlet_2 to hamlet_3
1. 把hamlet_2 reindex 到 hamlet_3里面
Create a script named init_lines and save it into the cluster state. The script:
1. has a parameter named characterId,
2. adds the field character_or_line to the document,
3. sets the value of character_or_line.name to “line” ,
4. sets the value of character_or_line.parent to the value of the characterId parameter
Update the document with id C0 (i.e., the character document of Hamlet) by adding the field character_or_line and setting its character_or_line.name value to “character”
Update the documents in hamlet_3 that have “HAMLET” as a speaker, by running the init_lines script with characterId set to “C0”

第3题，题解

添加数据，略。

创建索引

PUT hamlet_3
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "properties": {
      "character_or_line": {
        "type": "join",
        "relations": {
          "character": "line"
        }
      }
    }
  }
}

reindex

POST _reindex
{
  "source": {
    "index": "hamlet_2"
  },
  "dest": {
    "index": "hamlet_3"
  }
}

创建script

PUT _ingest/pipeline/character_update_pipeline
{
  "description": "set the 'character_or_linne', 'character_or_line.name', 'character_or_line.parent'",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx.character_or_line = new HashMap(); 
          ctx.character_or_line.name = "line";
          ctx.character_or_line.parent = params.characterId;
          """,
          "params": {
            "characterId": "C0"
          }
      }
    }
  ]
}

（由于join field需要routing配置）添加新数据

POST hamlet_3/_doc/C2?routing=C0
{
  "line_number": "1.2.1",
  "speaker": "KING CLAUDIUS",
  "text_entry": "Though yet of Hamlet our dear brothers death"
}

套用刚才的script定点更新

POST hamlet_3/_update_by_query?routing=C0&pipeline=character_update_pipeline
{
  "query":{
    "term":{
      "_id":"C2"
    }
  }
}

这里如果不加routing的设置直接进行更新，可能会报这个错：大意是对于父子关联的字段，routing是必须存在的。

{
  "took": 10,
  "timed_out": false,
  "total": 1,
  "updated": 0,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1,
  "throttled_until_millis": 0,
  "failures": [
    {
      "index": "hamlet_3",
      "type": "_doc",
      "id": "C2",
      "cause": {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "[routing] is missing for join field [character_or_line]"
        }
      },
      "status": 400
    }
  ]
}

校验数据：GET hamlet_3/_doc/C2

返回值

{
  "_index" : "hamlet_3",
  "_type" : "_doc",
  "_id" : "C2",
  "_version" : 4,
  "_seq_no" : 5,
  "_primary_term" : 1,
  "_routing" : "C0",
  "found" : true,
  "_source" : {
    "character_or_line" : {
      "parent" : "C0",
      "name" : "line"
    },
    "line_number" : "1.2.1",
    "text_entry" : "Though yet of Hamlet our dear brothers death",
    "speaker" : "KING CLAUDIUS"
  }
}

第3题，题解说明

这题主要考察的是父子关联数据（parent join），reindex和_update_by_query
- 关联数据可以代替部分关系型数据库的联表查询，但是毕竟是文档型数据存储，ES这部分的处理做的有些差强人意。
- 在校验结果的部分主要关注的是原始文档里不存在character_or_line和_routing字段，在处理完之后会添上
- reindex和_update_by_query其他章节已经讲过，这里略。
1. 参考链接
2. 页面路径：Mapping =》 Field datatypes =》 Join

第3题，拓展

@老杨还提供了另一种题解方式，但是会存在一些问题，比如子文档需要指定routing，但是用 script 做 _update_by_query 的时候又不能直接更新这个属性。

创建script

POST _scripts/character_update_script
{
  "script": {
    "lang": "painless",
    "source": """
      Map map = new HashMap();
      map.name = "line";
      map.parent = params.characterId;
      ctx._source.character_or_line = map;
    """
  }
}

创建指定routing用的pipeline

PUT _ingest/pipeline/set_routing
{
  "description": "assign the routing attribute for doc",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx._routing = 'C0'"
      }
    }
  ]
}

对文档进行定点更新

POST hamlet_3/_update_by_query?pipeline=set_routing
{
  "query":{
    "term":{
      "_id":"C2"
    }
  },
  "script": {
    "id": "character_update_script",
    "params": {
      "characterId": "C0"
    }
  }
}

校验数据同上，略。

死敌wen

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
Elastic Certified Engineer复习记录-复习题详解篇-索引数据（2）

MAPPINGS AND TEXT ANALYSIS索引和文档的分析（分词）GOAL: Model relational data目标：规整带关系的数据模型REQUIRED SETUP:初始化步骤建议docker-compose文件：1e1k_base_cluster.ymla running Elasticsearch cluster with at least one node and a Kibana instance,运行一个至少有1个节点的ES集群，以及1个kibana节点
复制链接

扫一扫