Elastic Certified Engineer Review Notes - Practice Question Walkthroughs - Analysis

Analysis (tokenization)

GOAL: Set analyzers on an index according to the requirements

Suggested docker-compose file: 1e1k_base_cluster.yml

Question 1: Specify an analyzer for a data field

  1. Create the index hamlet_1 with one primary shard and no replicas
  2. Define a mapping for the default type “_doc” of hamlet_1, so that
    1. the type has three fields, named speaker, line_number, and text_entry,
    2. text_entry is associated with the language “english” analyzer
  3. Add some documents to hamlet_1 by running the following command
      PUT hamlet_1/_bulk
      {"index":{"_index":"hamlet_1","_id":0}}
      {"line_number":"1.1.1","speaker":"BERNARDO","text_entry":"Whos there?"}
      {"index":{"_index":"hamlet_1","_id":1}}
      {"line_number":"1.1.2","speaker":"FRANCISCO","text_entry":"Nay, answer me: stand, and unfold yourself."}
      {"index":{"_index":"hamlet_1","_id":2}}
      {"line_number":"1.1.3","speaker":"BERNARDO","text_entry":"Long live the king!"}
      {"index":{"_index":"hamlet_1","_id":3}}
      {"line_number":"1.2.1","speaker":"KING CLAUDIUS","text_entry":"Though yet of Hamlet our dear brothers death"}
      

Question 1: Solution

  1. Create the index
    PUT hamlet_1
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "properties": {
          "speaker": {
            "type": "text"
          },
          "line_number": {
            "type": "text"
          },
          "text_entry": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }
    }
    
  2. Insert the data (omitted; run the bulk request from the question).

Question 1: Notes on the solution

  • This question mainly tests setting the analyzer in the index mapping. When Elasticsearch processes a text field, it runs the content through the analyzer to break it into tokens, and then builds the inverted index on those tokens (see the _analyze sketch after this list).
    1. Reference: the Analysis documentation
    2. Page path: Analysis
    3. Page path: Mapping =》 Mapping parameters =》 analyzer
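
  • As a quick sanity check (a minimal sketch, not part of the exam task), you can ask the field's analyzer what tokens it produces. The request below reuses one of the sample documents; with the english analyzer you should see lowercased, stemmed tokens, with English stopwords such as "the" dropped.
    GET hamlet_1/_analyze
    {
      "field": "text_entry",
      "text": "Long live the king!"
    }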

Question 2: Custom analyzer

  1. Create the index hamlet_2 with one primary shard and no replicas
  2. Add to hamlet_2 a custom analyzer named shy_hamlet_analyzer, consisting of
    1. a char filter to replace the characters “Hamlet” with “[CENSORED]”,
    2. a tokenizer to split tokens on whitespaces and columns (i.e., commas),
    3. a token filter to ignore any token with less than 5 characters
  3. Define a mapping for the default type “_doc” of hamlet_2, so that
    1. the type has one field named text_entry,
    2. text_entry is associated with the shy_hamlet_analyzer created in the previous step
  4. Reindex the text_entry field of hamlet_1 into hamlet_2
  5. Verify that documents have been reindexed to hamlet_2 as expected - e.g., by searching for “censored” into the text_entry field

Question 2: Solution

  1. Create the index
    PUT hamlet_2
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0,
        "analysis": {
          "analyzer": {
            "shy_hamlet_analyzer": {
              "char_filter": [
                "hamlet_char_filter"
              ],
              "tokenizer": "hamlet_tokenizer",
              "filter": [
                "hamlet_filter"
              ]
            }
          },
          "char_filter": {
            "hamlet_char_filter": {
              "type": "mapping",
              "mappings": [
                "Hamlet => [CENSORED]"
              ]
            }
          },
          "tokenizer": {
            "hamlet_tokenizer": {
              "type": "pattern",
              "pattern": "[\\s,]"
            }
          },
          "filter": {
            "hamlet_filter": {
              "type": "length",
              "min": 5
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "text_entry": {
            "type": "text",
            "analyzer": "shy_hamlet_analyzer"
          }
        }
      }
    }
    
  2. reindex
    POST _reindex
    {
      "source": {
        "index": "hamlet_1",
        "_source": ["text_entry"]
      },
      "dest": {
        "index": "hamlet_2"
      }
    }
    
  3. Verify the data
    POST hamlet_2/_search
    {
      "query": {
        "match": {
          "text_entry": "[CENSORED]"
        }
      }
    }
    
    • Response
    {
      "took" : 1,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 1,
          "relation" : "eq"
        },
        "max_score" : 0.65592396,
        "hits" : [
          {
            "_index" : "hamlet_2",
            "_type" : "_doc",
            "_id" : "3",
            "_score" : 0.65592396,
            "_source" : {
              "text_entry" : "Though yet of Hamlet our dear brothers death"
            }
          }
        ]
      }
    }
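
    • You can also inspect the custom analyzer directly with the _analyze API (a quick sketch, not required by the task). Running it on the line that mentions Hamlet should return only tokens of five or more characters, with "Hamlet" already rewritten to "[CENSORED]" by the char filter before tokenization.
    GET hamlet_2/_analyze
    {
      "analyzer": "shy_hamlet_analyzer",
      "text": "Though yet of Hamlet our dear brothers death"
    }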
    

Question 2: Notes on the solution

  • This question mainly tests the various custom settings that make up an analyzer.
    • An analyzer definition is mainly composed of tokenizer, char_filter, filter, and position_increment_gap:
      • tokenizer: decides how the string is split into tokens (in this question, on whitespace and commas)
      • char_filter: character filters that preprocess the raw text before tokenization (here, replacing "Hamlet" with "[CENSORED]")
      • filter: token filters that drop or transform the emitted tokens according to some rule (here, removing tokens shorter than 5 characters)
      • position_increment_gap: the position gap inserted between the values of a multi-valued text field, so that phrase queries do not match across neighbouring values (see the sketch after this list)
    • reindex: so far we have mostly copied documents in full; here the source is restricted to a single field, so only text_entry is copied.
    1. References: custom-analyzer, pattern-tokenizer, mapping-char-filter, length-token-filter
      1. Page path (custom analyzer): Analysis =》 Analyzers =》 Custom Analyzer
      2. Page path (pattern tokenizer): Analysis =》 Tokenizers =》 Pattern Tokenizer
      3. Page path (mapping char filter): Analysis =》 Character Filters =》 Mapping Char Filter
      4. Page path (length token filter): Analysis =》 Token Filters =》 Length Token Filter
    2. Reference: reindex
      1. Page path: Document APIs =》 Reindex API
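
  • position_increment_gap is not needed in this question, but here is a minimal sketch of what it controls (the index name gap_demo and the sample values are made up for illustration). Because a gap of 100 positions (the default, set explicitly here) is inserted between the two array values, the match_phrase below finds no hit; setting position_increment_gap to 0, or using a slop of 100 or more, would make it match.
    PUT gap_demo
    {
      "mappings": {
        "properties": {
          "quotes": {
            "type": "text",
            "position_increment_gap": 100
          }
        }
      }
    }

    PUT gap_demo/_doc/1?refresh
    {
      "quotes": ["to be or not to be", "that is the question"]
    }

    GET gap_demo/_search
    {
      "query": {
        "match_phrase": {
          "quotes": "be that"
        }
      }
    }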