Elasticsearch Advanced (Part 1)

chinusyan

Last modified 2022-10-20 17:36:06

Category: Distributed systems. Tags: elasticsearch, big data, search engine

First published 2022-10-20 14:35:03

Article link: https://blog.csdn.net/chinus_yan/article/details/127425881

I. Understanding mapping, the ES storage structure

3.1 Field data types

3.1.1 Text types

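The text family includes the following field types:

text, the traditional field type for full-text content such as the body of an email or the description of a product.
match_only_text, a space-optimized variant of text that disables scoring and performs slower on queries that need positions. It is best suited for indexing log messages.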

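The text field accepts the following parameters. The request below sets analyzer and search_analyzer on a content field; ik_max_word and ik_smart are provided by the IK analysis plugin, which must be installed for this mapping to be accepted: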

curl -XPOST http://localhost:9200/index/_mapping -H 'Content-Type:application/json' -d'
{
    "properties": {
        "content": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart"
        }
    }
}'


analyzer
The analyzer which should be used for the text field, both at index-time and at search-time (unless overridden by the search_analyzer). Defaults to the default index analyzer, or the standard analyzer.
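
The fallback rule in the analyzer description can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch resolves analyzers internally: resolve_analyzers and its index_default parameter are hypothetical names, and the sketch ignores other settings such as search_quote_analyzer or an index-level default search analyzer.

```python
# Field mapping copied from the curl example above.
CONTENT_MAPPING = {
    "type": "text",
    "analyzer": "ik_max_word",
    "search_analyzer": "ik_smart",
}


def resolve_analyzers(field_mapping, index_default=None):
    """Return (index_time, search_time) analyzer names for a text field.

    Hypothetical helper: a simplified reading of the documented rule.
    """
    # Index time: the field's analyzer, else the index default, else standard.
    index_time = field_mapping.get("analyzer") or index_default or "standard"
    # Search time: search_analyzer overrides; otherwise reuse the index-time one.
    search_time = field_mapping.get("search_analyzer") or index_time
    return index_time, search_time


print(resolve_analyzers(CONTENT_MAPPING))    # ('ik_max_word', 'ik_smart')
print(resolve_analyzers({"type": "text"}))   # ('standard', 'standard')
```

Pairing ik_max_word at index time with ik_smart at search time is a common choice: the index holds fine-grained terms while queries are split more coarsely.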

II. Text analysis

2.1 Built-in analyzer reference

2.1.1 Standard analyzer

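The standard analyzer is the default analyzer, used when no other analyzer is specified.
It provides grammar-based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.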

POST /_analyze
{
  "analyzer": "standard",
  "text": "中华民族伟大复兴"
}
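
For Chinese text like the request above, the standard analyzer does not find words: it emits each Han character as its own token. The Python sketch below approximates what the token list looks like for Han-only input. It is an illustration rather than real analyzer output; ideographic_tokens is a hypothetical helper, and the field names are assumed to mirror the _analyze response.

```python
def ideographic_tokens(text):
    """Approximate the tokens for Han-only text: one token per character.

    Assumes every character in `text` is a Han ideograph.
    """
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


tokens = ideographic_tokens("中华民族伟大复兴")
print([t["token"] for t in tokens])
# ['中', '华', '民', '族', '伟', '大', '复', '兴']
```

Single-character tokens are rarely what a Chinese search needs, which is one reason to use a dictionary-based analyzer such as the IK ones in the mapping example.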

The standard analyzer accepts the following parameters:

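max_token_length
The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Defaults to 255.
stopwords
A predefined stop word list such as _english_, or an array containing a list of stop words. Defaults to _none_.
stopwords_path
The path to a file containing stop words.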

// Custom analyzer
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST /my-index-000001/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
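
To see what max_token_length and stopwords do to the sentence above, here is a small Python approximation of the pipeline. It is a sketch under clear assumptions: the regular expression only handles ASCII words (keeping apostrophes inside a word) and single Han characters, which is far less than the full UAX #29 algorithm, and ENGLISH_STOPWORDS is written out by hand to resemble the _english_ list. The name standard_like_analyze is hypothetical.

```python
import re

# Hand-written list intended to resemble the _english_ stop words.
ENGLISH_STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

# ASCII words (apostrophes allowed inside a word) or one Han character.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*|[\u4e00-\u9fff]")


def standard_like_analyze(text, max_token_length=255, stopwords=()):
    """Rough imitation of: tokenize, split long tokens, lowercase, drop stop words."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        raw = match.group(0)
        # Split tokens longer than max_token_length at fixed intervals.
        for start in range(0, len(raw), max_token_length):
            tokens.append(raw[start:start + max_token_length])
    lowered = [t.lower() for t in tokens]
    stop = set(stopwords)
    return [t for t in lowered if t not in stop]


result = standard_like_analyze(
    "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.",
    max_token_length=5,
    stopwords=ENGLISH_STOPWORDS,
)
print(result)
# ['2', 'quick', 'brown', 'foxes', 'jumpe', 'd', 'over', 'lazy', "dog's", 'bone']
```

Note how "jumped" is cut into "jumpe" and "d" because it is longer than 5 characters, and both occurrences of "the" are removed by the stop word list.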

The standard analyzer consists of:

Tokenizer

Standard Tokenizer

Token Filters

Lower Case Token Filter
Stop Token Filter (disabled by default)

2.2 Token filter reference

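If you need to customize the standard analyzer beyond its configuration parameters, recreate it as a custom analyzer and modify it, usually by adding token filters.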

PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       
          ]
        }
      }
    }
  }
}
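
Extending rebuilt_standard means appending more token filters after lowercase. The Python sketch below builds such a settings body. It only manipulates the JSON and does not talk to a cluster; add_token_filters is a hypothetical helper, and asciifolding is picked purely as an example of a built-in token filter to add.

```python
import copy
import json

# Settings body copied from the rebuilt_standard example above.
REBUILT_STANDARD = {
    "settings": {
        "analysis": {
            "analyzer": {
                "rebuilt_standard": {
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}


def add_token_filters(body, analyzer_name, extra_filters):
    """Return a copy of `body` with `extra_filters` appended to the filter chain."""
    updated = copy.deepcopy(body)  # leave the original body untouched
    analyzer = updated["settings"]["analysis"]["analyzer"][analyzer_name]
    analyzer["filter"] = list(analyzer.get("filter", [])) + list(extra_filters)
    return updated


customized = add_token_filters(REBUILT_STANDARD, "rebuilt_standard", ["asciifolding"])
print(json.dumps(customized, indent=2))
```

Token filters run in the order listed, so here lowercase is applied before asciifolding. The printed JSON could be used as the body of a PUT request when creating a new index.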