Aggregating on text fields in Elasticsearch

Latest recommended article published on 2024-07-25 16:30:40

ASN_forever

Categories: elk, big data

Article link: https://blog.csdn.net/ASN_forever/article/details/103720686

Environment: Elasticsearch 6.0

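The text type is an analyzed type, and aggregations on it are disabled by default. To aggregate on a text field, you need to set its fielddata to true. That only makes aggregation possible, though, not exact: a text field still goes through analysis (tokenization), so aggregating on it can give inaccurate results. To fix that, also give the field a keyword sub-field through the fields parameter. With both settings in place, exact aggregation works.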

The test steps follow.

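First, create an index named my_index with a type named my_type (in version 6, each index supports only one type), and give it a mapping in which the testText field is of type text: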

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "testText": {
          "type": "text"
        }
      }
    }
  }
}

Next, index one document:

POST my_index/my_type
{
  "testText":"v1/v2"
}

Now try a bucket (terms) aggregation:

POST /my_index/my_type/_search
{
  "size": 0,
  "aggs": {
    "buk": {
      "terms": {
        "field": "testText"
      }
    }
  }
}

The request fails with the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "WsOTyQlISXKvOkxoqlqAJA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    }
  },
  "status": 400
}

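The error says that fielddata is disabled on text fields by default because it can use a significant amount of memory. If you really want to aggregate on a text field, set fielddata=true on that field so that fielddata is loaded into memory by uninverting the inverted index. As an alternative, the message suggests using a keyword field instead of text.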

Setting the memory cost aside for now, let's follow the hint and set fielddata=true:

POST /my_index/_mapping/my_type
{
  "properties": {
    "testText": { 
      "type": "text",
      "fielddata": true
    }
  }
}

With that in place, running the same aggregation again returns the following:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "buk": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "v1",
          "doc_count": 1
        },
        {
          "key": "v2",
          "doc_count": 1
        }
      ]
    }
  }
}

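The aggregation on testText returns two buckets, one with key "v1" and one with key "v2", each matching one document. But we indexed only one document, whose testText value is "v1/v2". Because a text field is analyzed, the default analyzer drops the "/" and the inverted index ends up with two tokens, "v1" and "v2". The aggregation therefore runs over those two tokens rather than over the original input "v1/v2". This is what was meant at the start by aggregations on text fields being potentially inaccurate.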
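
One way to check the tokenization directly is the _analyze API, which reports the tokens an analyzer produces for a given input. The request below is a minimal sketch, assuming the my_index mapping created above; the response should list "v1" and "v2" as separate tokens, matching the two buckets.

```
GET my_index/_analyze
{
  "field": "testText",
  "text": "v1/v2"
}
```

Passing "field" makes the request use whichever analyzer is configured for testText, so it reflects what is actually stored in the inverted index for that field.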

Now let's fix the inaccuracy.

Add a keyword sub-field to the testText field:

POST /my_index/_mapping/my_type
{
  "properties": {
    "testText": {
      "type": "text",
      "fielddata": true,
      "fields": {
        "subField": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

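Now test the aggregation again, this time on subField, the sub-field of testText. The sub-field receives the same value as its parent field, so it can stand in for the parent in aggregations. If you do not index any new documents, however, the aggregation returns no buckets: documents indexed before the mapping change were processed under the old mapping, so they have no value in the new sub-field. Here I simply index a new document and then aggregate.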

POST my_index/my_type
{
  "testText":"v3/v4"
}

POST /my_index/my_type/_search
{
  "size": 0,
  "aggs": {
    "buk": {
      "terms": {
        "field": "testText.subField"
      }
    }
  }
}

After indexing the new value "v3/v4" and aggregating on subField, the result is:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "buk": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "v3/v4",
          "doc_count": 1
        }
      ]
    }
  }
}

There is only one bucket, and its key is the original input "v3/v4", so the aggregation is now exact.
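
The older "v1/v2" document is missing from these buckets because it was indexed before subField existed. One way to make existing documents pick up the new sub-field is to run _update_by_query with no query or script, which re-indexes each document in place under the current mapping. This is a sketch of that approach rather than part of the original test:

```
POST my_index/_update_by_query?refresh&conflicts=proceed
```

After it completes, the aggregation on testText.subField should also return a "v1/v2" bucket. On a large index this touches every document, so it is worth trying on a small index first.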

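Keep in mind that enabling fielddata still consumes memory. For string fields that need to be aggregated, it is better to map them as keyword. Note also that the keyword sub-field by itself is enough for exact aggregation: fielddata only needs to be true if you aggregate on the text field itself.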
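
As a minimal sketch of that recommendation, the mapping below declares the field as keyword from the start. The index name my_keyword_index is hypothetical and used only for illustration; the type and field names follow the earlier examples.

```
PUT my_keyword_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "testText": {
          "type": "keyword"
        }
      }
    }
  }
}
```

With this mapping, a terms aggregation on testText buckets by the whole original string and needs no fielddata. The trade-off is that the field is no longer analyzed, so full-text matching on individual tokens such as "v1" will not work against it.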