Elasticsearch: Deduplicating Query Results on a Single Field

ES Queries

Full data set

GET wkl_test/_search
{
  "query": {
    "match_all": {}
  }
}

Result:

{
  "took" : 123,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aK0tFpABTkLj5j4c34pE",
        "_score" : 1.0,
        "_source" : {
          "name" : "zhangsan",
          "aa" : 1
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aa0uFpABTkLj5j4cFYrJ",
        "_score" : 1.0,
        "_source" : {
          "name" : "lisi",
          "aa" : 2
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aq0uFpABTkLj5j4cKYqF",
        "_score" : 1.0,
        "_source" : {
          "name" : "wangwu",
          "aa" : 2
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "a60uFpABTkLj5j4c2IoF",
        "_score" : 1.0,
        "_source" : {
          "name" : "maliu",
          "aa" : 2
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "bK1IFpABTkLj5j4cqYop",
        "_score" : 1.0,
        "_source" : {
          "name" : "gouqi",
          "aa" : 3
        }
      }
    ]
  }
}

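1: collapse (field collapsing) - return the deduplicated list of documents (supported since ES 5.3)

  • Why it is recommended: it performs well and uses little memory.
  • Note: deduplicating this way does not drop documents that lack the collapse field.
  • The collapse field must be a single-valued keyword or numeric field (for example long) with doc_values enabled.
  • Field collapsing cannot be combined with scroll or rescore. search_after is also unavailable in older versions; newer versions allow it only when sorting and collapsing on the same field.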
GET wkl_test/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "aa"
  }
}

Result: hits.total is still 5, but only the 3 deduplicated documents are returned.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aK0tFpABTkLj5j4c34pE",
        "_score" : 1.0,
        "_source" : {
          "name" : "zhangsan",
          "aa" : 1
        },
        "fields" : {
          "aa" : [
            1
          ]
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aa0uFpABTkLj5j4cFYrJ",
        "_score" : 1.0,
        "_source" : {
          "name" : "lisi",
          "aa" : 2
        },
        "fields" : {
          "aa" : [
            2
          ]
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "bK1IFpABTkLj5j4cqYop",
        "_score" : 1.0,
        "_source" : {
          "name" : "gouqi",
          "aa" : 3
        },
        "fields" : {
          "aa" : [
            3
          ]
        }
      }
    ]
  }
}
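
To make the result above concrete, here is a small in-memory model of what collapsing does to an already sorted hit list: it keeps the first document for each distinct value of the collapse field. This is only a sketch of the idea, not Elasticsearch client code; the class CollapseSketch, the Doc type, and the helper names are hypothetical, and the sample values are copied from the wkl_test data shown earlier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory model of field collapsing; all names here are illustrative.
final class CollapseSketch {

    // Hypothetical stand-in for a document in wkl_test.
    static final class Doc {
        final String name;
        final long aa;

        Doc(String name, long aa) {
            this.name = name;
            this.aa = aa;
        }
    }

    // Keeps the first document seen for each distinct value of aa.
    // The input is assumed to be in the order the search would return it.
    static List<Doc> collapseByAa(List<Doc> sortedHits) {
        Map<Long, Doc> firstPerKey = new LinkedHashMap<>();
        for (Doc doc : sortedHits) {
            firstPerKey.putIfAbsent(doc.aa, doc);
        }
        return new ArrayList<>(firstPerKey.values());
    }

    // The five sample documents from the match_all result.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
                new Doc("zhangsan", 1),
                new Doc("lisi", 2),
                new Doc("wangwu", 2),
                new Doc("maliu", 2),
                new Doc("gouqi", 3));
    }

    public static void main(String[] args) {
        // Prints one line per collapsed hit: zhangsan, lisi, gouqi.
        for (Doc doc : collapseByAa(sampleDocs())) {
            System.out.println(doc.name + " aa=" + doc.aa);
        }
    }
}
```

The model returns three entries out of five inputs, which mirrors why hits.total stays at 5 while the hits array holds only 3 documents.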

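2: cardinality - return the number of distinct values

  • Aggregation + cardinality: a distinct count, similar to count(distinct) in SQL: deduplicate first, then count. The cardinality aggregation is approximate (it is based on HyperLogLog++), and precision_threshold controls the trade-off between memory and accuracy.
  • Note: counting this way excludes documents that lack the field.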
GET wkl_test/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0, 
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "aa"
      }
    }
  }
}

Result: distinct_count = 3, meaning 3 distinct values of aa remain after deduplication. The aggregations section returns only that count, not the documents themselves.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "distinct_count" : {
      "value" : 3
    }
  }
}
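
The counting rule can be sketched in plain Java as well. The snippet below is a minimal illustration, assuming that a null entry stands for a document without the field; DistinctCountSketch and distinctCount are hypothetical names. Unlike the real cardinality aggregation, which is approximate, this model counts exactly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Exact in-memory model of a distinct count; illustrative only.
final class DistinctCountSketch {

    // Counts distinct non-null values. A null entry models a document
    // that does not have the field, so it is skipped.
    static long distinctCount(List<Long> fieldValues) {
        Set<Long> seen = new HashSet<>();
        for (Long value : fieldValues) {
            if (value != null) {
                seen.add(value);
            }
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Values of aa from the five sample documents.
        List<Long> sample = Arrays.asList(1L, 2L, 2L, 2L, 3L);
        System.out.println(distinctCount(sample)); // prints 3

        // A document without the field does not change the count.
        List<Long> withMissing = Arrays.asList(1L, 2L, 2L, 2L, 3L, null);
        System.out.println(distinctCount(withMissing)); // prints 3
    }
}
```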

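3: Combined query

  • After a collapse query, the deduplicated documents are returned, but total is still the number of all matching documents.
  • The cardinality aggregation returns the correct distinct count under aggregations, but hits still contains all the documents.
  • So we need to use the two together, as follows: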
GET wkl_test/_search
{
  "query": {
    "match_all": {}
  },
  "collapse": {
    "field": "aa"
  }, 
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "aa"
      }
    }
  }
}

Result:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 5,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aK0tFpABTkLj5j4c34pE",
        "_score" : 1.0,
        "_source" : {
          "name" : "zhangsan",
          "aa" : 1
        },
        "fields" : {
          "aa" : [
            1
          ]
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "aa0uFpABTkLj5j4cFYrJ",
        "_score" : 1.0,
        "_source" : {
          "name" : "lisi",
          "aa" : 2
        },
        "fields" : {
          "aa" : [
            2
          ]
        }
      },
      {
        "_index" : "wkl_test",
        "_type" : "_doc",
        "_id" : "bK1IFpABTkLj5j4cqYop",
        "_score" : 1.0,
        "_source" : {
          "name" : "gouqi",
          "aa" : 3
        },
        "fields" : {
          "aa" : [
            3
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "distinct_count" : {
      "value" : 3
    }
  }
}

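Note: use distinct_count from the cardinality aggregation as the deduplicated total, and the collapsed hits list as the result set.

Notes on pagination:

  • 1. total in hits is the document count before deduplication, i.e. the raw number of matches. It is good to know, but we do not use it for paging. As long as size is at least the distinct count, the length of the hits array equals the distinct_count aggregation value, and the entries in the array are the deduplicated documents.

  • 2. distinct_count in aggregations is the real number of entries after deduplication, and it is the total to use for paging.

  • 3. from is the query offset, i.e. where to start.

  • 4. size is the number of entries to return per request.

  • From here you can treat it like an ordinary paged query: just pass from and size.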
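
The paging arithmetic can be sketched as follows. This is a minimal example under stated assumptions: pages are 1-based, the class CollapsePagingSketch and its methods are hypothetical helpers, and the field name passed to requestBody is assumed to be a simple identifier that needs no JSON escaping. The generated body is the combined collapse + cardinality query with from and size added.

```java
// Paging helpers for a collapse + cardinality query; plain Java, no client library.
final class CollapsePagingSketch {

    // Converts a 1-based page number into the "from" offset.
    static int fromOffset(int page, int size) {
        if (page < 1 || size < 1) {
            throw new IllegalArgumentException("page and size must be >= 1");
        }
        return (page - 1) * size;
    }

    // Number of pages needed for distinctCount collapsed entries (ceiling division).
    static long totalPages(long distinctCount, int size) {
        if (distinctCount < 0 || size < 1) {
            throw new IllegalArgumentException("distinctCount must be >= 0 and size >= 1");
        }
        return (distinctCount + size - 1) / size;
    }

    // Builds the search request body as a JSON string.
    // Assumption: field contains no characters that require JSON escaping.
    static String requestBody(String field, int from, int size) {
        return "{\"from\":" + from
                + ",\"size\":" + size
                + ",\"query\":{\"match_all\":{}}"
                + ",\"collapse\":{\"field\":\"" + field + "\"}"
                + ",\"aggs\":{\"distinct_count\":{\"cardinality\":{\"field\":\"" + field + "\"}}}}";
    }

    public static void main(String[] args) {
        int size = 2;
        int page = 2;
        long distinctCount = 3; // value of aggregations.distinct_count in the sample

        System.out.println("from = " + fromOffset(page, size));
        System.out.println("pages = " + totalPages(distinctCount, size));
        System.out.println(requestBody("aa", fromOffset(page, size), size));
    }
}
```

With the sample data (3 distinct values) and size 2, the second page starts at from = 2 and there are 2 pages in total; the page count comes from distinct_count, not from hits.total.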

Using the Java API

1: collapse - the deduplicated result set

// Use collapse to specify the deduplication field, e.g. "your_distinct_field"
CollapseBuilder collapseBuilder = new CollapseBuilder("your_distinct_field");
searchSourceBuilder.collapse(collapseBuilder);

2: cardinality - the number of distinct values

// Add a cardinality aggregation to count the distinct values of the field
CardinalityAggregationBuilder aggregation = AggregationBuilders
        .cardinality("distinct_count") // name of the aggregation in the response
        .field("your_distinct_field") // field to aggregate on
        .precisionThreshold(40000); // adjust as needed; 40000 is the maximum supported value
searchSourceBuilder.aggregation(aggregation);

3: Putting it together

package com.wenge.system.utils;

import org.apache.http.HttpHost;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.metrics.CardinalityAggregationBuilder;
import org.elasticsearch.search.aggregations.metrics.ParsedCardinality;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.collapse.CollapseBuilder;

import java.io.IOException;
import java.util.Map;

/**
 * @author wangkanglu
 * @version 1.0
 * @description Demo: single-field deduplication with collapse plus a cardinality aggregation
 * @date 2024-06-17 16:48
 */
public class TestES {

    public static void main(String[] args) throws IOException {
        // Create the ES client
        RestHighLevelClient esClient = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http"))
        );

        try {
            // Create a search request and set the index name
            SearchRequest searchRequest = new SearchRequest("your_index");

            // Build the search source
            SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();

            // Set the query, e.g. match all documents; change this to fit your use case
            searchSourceBuilder.query(QueryBuilders.matchAllQuery());

            // Use collapse to specify the deduplication field, e.g. "your_distinct_field"
            CollapseBuilder collapseBuilder = new CollapseBuilder("your_distinct_field");
            searchSourceBuilder.collapse(collapseBuilder);

            // Add a cardinality aggregation to count the distinct values of the field
            CardinalityAggregationBuilder aggregation = AggregationBuilders
                    .cardinality("distinct_count") // name of the aggregation in the response
                    .field("your_distinct_field") // field to aggregate on
                    .precisionThreshold(40000); // adjust as needed; 40000 is the maximum supported value
            searchSourceBuilder.aggregation(aggregation);

            // Attach the search source to the request
            searchRequest.source(searchSourceBuilder);

            // Execute the search
            SearchResponse searchResponse = esClient.search(searchRequest, RequestOptions.DEFAULT);

            // Print the collapsed (deduplicated) hits
            SearchHit[] hits = searchResponse.getHits().getHits();
            for (SearchHit hit : hits) {
                Map<String, Object> sourceAsMap = hit.getSourceAsMap();
                System.out.println("Deduplicated hit: " + sourceAsMap);
            }

            // Read the distinct count from the aggregation result
            ParsedCardinality parsedCardinality = searchResponse.getAggregations().get("distinct_count");
            long distinctCount = parsedCardinality.getValue();
            System.out.println("Distinct count: " + distinctCount);

        } finally {
            // Close the client
            esClient.close();
        }
    }
}
