elasticsearch aggregations_Elasticsearch

Latest recommended article published on 2024-05-19 21:09:34

weixin_39921023

Tags: elasticsearch aggregations

This article is a study of Elasticsearch, based on the official Get Started guide.

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time.

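-Concepts

Near Realtime (NRT): Elasticsearch is a near-real-time search platform. There is a slight latency (normally 1s) between indexing a document and it becoming searchable.
Cluster: a collection of one or more nodes that together hold all the data and provide federated indexing and search. The default cluster name is elasticsearch. A node can be added to a specific cluster.
Node: a single server in a cluster that stores data and takes part in the cluster's indexing and search. By default a node's name is a random UUID assigned at startup, but you can also name it yourself. A node can be added to a specific cluster; by default, every node joins the cluster named elasticsearch. Starting the first node creates the elasticsearch cluster by default and adds the node to it.
Index: a collection of documents with similar characteristics. An index name must be lowercase, and the name is used when indexing, searching, updating, and deleting documents. A cluster can hold any number of indices.
Type: a logical partition of an index that allows documents of several types to be stored in the same index (types are deprecated from Elasticsearch 6.0 onward).
Document: the basic unit of information in an index, expressed in JSON. A document physically resides in an index and is assigned a type.
Shards & Replicas: an index can hold a very large amount of data. When the data grows too large for a single node, the index is split into multiple shards that are stored on different nodes. Each shard can also be copied one or more times; the copies are called replicas.

Shards address capacity and parallelism: the index scales horizontally, and operations can run across shards in parallel. Replicas improve availability. The number of shards and replicas can be set when an index is created, and the number of replicas can be changed at any time. By default, each index has 5 primary shards and 1 replica, that is, one complete copy of the index. This default applies to the 6.x line that this guide appears to follow; from 7.0 onward the default is 1 primary shard. A single shard can hold at most Integer.MAX_VALUE-128 documents.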
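The shard arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name total_shards and the constant name are assumptions made for this example, not part of any Elasticsearch API.

```python
# Hypothetical helper for illustration; not an Elasticsearch API.
def total_shards(primaries: int, replicas: int) -> int:
    """Total shard copies for one index: every primary plus its replica copies."""
    if primaries < 1 or replicas < 0:
        raise ValueError("need at least 1 primary and a non-negative replica count")
    return primaries * (1 + replicas)


# Per-shard document limit quoted in the text: Integer.MAX_VALUE - 128.
MAX_DOCS_PER_SHARD = (2**31 - 1) - 128

print(total_shards(5, 1))    # 5 primaries + 5 replica shards
print(MAX_DOCS_PER_SHARD)
```

With the defaults described above (5 primaries, 1 replica), an index is made up of 10 shards in total.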

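-Installation

You can skip installation by using the Elasticsearch Service on Elastic Cloud, which is available on both AWS and GCP. It is a paid service, however.

Elasticsearch requires at least JDK 8, so install JDK 8 before installing Elasticsearch. For more about installation, see the installation documentation.

With the settings used for our installation, Elasticsearch serves its REST API on port 9200 and creates a cluster named elasticsearch that contains one node, my_first_node. Running elasticsearch.exe from the installation directory starts this cluster, which contains only that single node.

Elasticsearch provides clients for different programming languages; see Elasticsearch Client.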

-Cluster

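After installing and starting Elasticsearch, we have a default cluster with one node. Elasticsearch provides a REST API for interacting with the cluster.

The REST API can be used to:

Check cluster, node, and index health, status, and statistics
Administer cluster, node, and index data and metadata
Perform CRUD and search operations against indices
Execute advanced search operations such as paging, sorting, filtering, scripting, and aggregations

Use the _cat API to check cluster health: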

curl -X GET "localhost:9200/_cat/health?v"

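The response reports the cluster status as green, yellow, or red. Green means the cluster is fully functional. Yellow means all data is available but some replicas have not been allocated yet. Red means some data is unavailable, so the cluster is only partially functional.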
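The three status colors can be modeled roughly as follows. This is a simplified sketch for intuition only, not how Elasticsearch computes health internally; the function name and its two parameters are assumptions made for this example.

```python
# Simplified, hypothetical model of the health colors; not Elasticsearch code.
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Map unassigned shard counts to a health color."""
    if unassigned_primaries > 0:
        return "red"      # some data is unavailable
    if unassigned_replicas > 0:
        return "yellow"   # all data available, some replicas not allocated
    return "green"        # everything allocated


print(cluster_status(0, 0))
print(cluster_status(0, 5))
print(cluster_status(1, 0))
```

On a single-node cluster, an index with replicas typically stays yellow, because a replica is not allocated on the same node as its primary.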

The following command lists all nodes in the cluster:

curl -X GET "localhost:9200/_cat/nodes?v"

The following command lists all indices in the cluster:

curl -X GET "localhost:9200/_cat/indices?v"

No index has been created yet, so no index information is listed.

-Index

The following commands create an index named customer and then list the indices again:

curl -X PUT "localhost:9200/customer?pretty"
curl -X GET "localhost:9200/_cat/indices?v"

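The pretty parameter asks Elasticsearch to pretty-print the JSON response.

The output shows that the customer index has 5 primary shards and 1 replica, and contains 0 documents.

The following commands delete the customer index and then list the indices again: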

curl -X DELETE "localhost:9200/customer?pretty"
curl -X GET "localhost:9200/_cat/indices?v"

-Document

Next we index a simple document into the customer index, with ID 1 and type _doc:

curl -X PUT "localhost:9200/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "name": "John Doe"
}
'

Response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}

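Note: Elasticsearch does not require you to create an index explicitly before indexing documents into it. In the example above, the customer index is created automatically if it does not already exist.

The following command retrieves the document with type _doc and ID 1 from the customer index: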

curl -X GET "localhost:9200/customer/_doc/1?pretty"

Response:

{
  "_index" : "customer",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 25,
  "_primary_term" : 1,
  "found" : true,
  "_source" : { "name": "John Doe" }
}

Elasticsearch uses the following pattern for operations on a document:

<HTTP Verb> /<Index>/<Type>/<ID>

A document can be modified in two ways: by replacing it or by updating it.

The following command replaces the contents of the document with type _doc and ID 1 in the customer index:

curl -X PUT "localhost:9200/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "name": "Jane Doe"
}
'

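If a different ID is used, the command creates a new document instead. The ID is optional when indexing: if it is omitted (using POST instead of PUT), Elasticsearch generates a random ID and returns it in the response.

The following command updates the document in place, changing the name field and adding an age field:

curl -X POST "localhost:9200/customer/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
  "doc": { "name": "Jane Doe", "age": 20 }
}
'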

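A document can also be updated with a script:

curl -X POST "localhost:9200/customer/_doc/1/_update?pretty" -H 'Content-Type: application/json' -d'
{
  "script" : "ctx._source.age += 5"
}
'

The script above increases age by 5; ctx._source refers to the source of the document being updated.

The following command deletes a document: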

curl -X DELETE "localhost:9200/customer/_doc/2?pretty"

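Note: deleting a whole index is more efficient than deleting all the documents in it.

Batch processing _bulk

The following call indexes two documents (ID 1 and ID 2) in a single bulk request: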

POST /customer/_doc/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }

The following call updates the document with ID 1 and deletes the document with ID 2:

POST /customer/_doc/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}

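A bulk request does not fail as a whole when one of its actions fails. The remaining actions are still processed, and the response contains a status for each action so that you can check the results.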
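The bulk request body is newline-delimited JSON: each action is one line of metadata, followed by one line of source for index and update actions, while delete actions have no source line. The body must end with a newline. A minimal sketch of building such a body in Python is shown below; the helper name bulk_body and its tuple-based input format are assumptions made for this example.

```python
import json


# Hypothetical helper for illustration; it only builds the request body text.
def bulk_body(actions):
    """Build an NDJSON bulk body from (metadata, source_or_None) pairs."""
    lines = []
    for meta, source in actions:
        lines.append(json.dumps(meta))
        if source is not None:          # delete actions carry no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"      # the final newline is required


body = bulk_body([
    ({"update": {"_id": "1"}}, {"doc": {"name": "John Doe becomes Jane Doe"}}),
    ({"delete": {"_id": "2"}}, None),
])
print(body, end="")
```

The resulting text corresponds to the three body lines of the update-and-delete request shown above.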

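Search _search

Searches use the _search endpoint. There are two ways to run a search: sending the search parameters in the REST request URI, or sending them in the REST request body.

The following request passes the parameters in the URI and searches the bank index. q=* matches all documents in the index, and the results are sorted by account_number in ascending order. The response contains the time the search took and the result set.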

curl -X GET "localhost:9200/bank/_search?q=*&sort=account_number:asc&pretty"

The same search using the REST request body:

GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

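Note: once the search results have been returned, Elasticsearch is done with the request. It keeps no server-side resources and no open cursor into the result set.

Elasticsearch provides a JSON-style domain-specific language for running queries, called the Query DSL.

The following call returns a single document from the bank index. If size is not specified, it defaults to 10.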

GET /bank/_search
{
  "query": { "match_all": {} },
  "size": 1
}

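The following call returns documents 10 through 19 of the bank index. The from parameter is zero-based, so these are the 11th through 20th hits. This feature is commonly used to page through search results.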

GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}
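The paging arithmetic behind from and size can be sketched as follows. The helper name page_params and its 1-based page numbering are assumptions made for this example.

```python
# Hypothetical helper for illustration; pages are numbered from 1.
def page_params(page: int, page_size: int = 10) -> dict:
    """Translate a 1-based page number into zero-based from/size parameters."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {"from": (page - 1) * page_size, "size": page_size}


# Page 2 with 10 hits per page corresponds to the request above.
print(page_params(2))
```

The returned dictionary can be merged into the request body next to the query.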

By default, a search returns the full JSON document for each hit, but it can also return only selected fields. The following returns only the account_number and balance fields.

GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}

The following search returns only the account numbered 20, similar to a WHERE clause in SQL:

GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}

The following returns only accounts whose address contains the term mill:

GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

The following returns accounts whose address contains the term mill or the term lane, similar to IN in SQL:

GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

The following returns accounts whose address contains the phrase "mill lane":

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}
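The difference between match and match_phrase can be illustrated with a small in-memory sketch. It assumes a very simplified analysis step (lowercasing and splitting on whitespace), which is not Elasticsearch's real analyzer, and the sample addresses are made up for this example.

```python
# Simplified illustration; real analysis and scoring are far more involved.
def tokens(text: str) -> list:
    """Assumed analysis: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_value: str, query: str) -> bool:
    """True if the field contains at least one of the query terms."""
    field_terms = set(tokens(field_value))
    return any(term in field_terms for term in tokens(query))


def match_phrase(field_value: str, query: str) -> bool:
    """True if the query terms appear adjacent and in order in the field."""
    haystack, needle = tokens(field_value), tokens(query)
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


for address in ["198 Mill Lane", "990 Mill Road", "12 Elm Street"]:
    print(address, match(address, "mill lane"), match_phrase(address, "mill lane"))
```

In this sketch only the first address satisfies the phrase query, while the first two satisfy the plain match query.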

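Bool query: a bool query lets us compose smaller queries into a larger one. must means that every listed match must be true. should means that at least one of the listed matches must be true. must_not means that none of the listed matches may be true.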

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

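The must, should, and must_not clauses can also be combined within one bool query to express more complex conditions. The following returns accounts whose age is 40 and whose state is not ID: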

GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}
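When such queries are assembled in application code, a small builder keeps the nesting manageable. The sketch below only constructs the request body as a Python dictionary; the function name bool_query is an assumption made for this example and not part of any Elasticsearch client.

```python
import json


# Hypothetical builder for illustration; it only assembles the JSON body.
def bool_query(must=(), should=(), must_not=()):
    """Assemble a bool query body, leaving out clauses that are empty."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "must_not": list(must_not),
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}


body = bool_query(
    must=[{"match": {"age": "40"}}],
    must_not=[{"match": {"state": "ID"}}],
)
print(json.dumps(body, indent=2))
```

The printed JSON has the same structure as the request body shown above.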

Range query: restricts results to a range of values, typically for numeric or date fields.

The following finds accounts with a balance between 20000 and 30000, inclusive:

GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

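Aggregations group data and extract statistics from it.

The following call groups accounts by state and returns the top 10 states (the default), sorted by count in descending order (also the default). Setting size to 0 leaves out the search hits so that only the aggregation results are returned: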

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword"
      }
    }
  }
}

It is similar to the following SQL:

SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC LIMIT 10;
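For intuition, the same grouping can be reproduced in memory with Python's collections.Counter. The sample accounts are made up for this example, and the sketch ignores everything a distributed terms aggregation has to deal with, such as per-shard counts.

```python
from collections import Counter

# Made-up sample documents; only the state field matters here.
accounts = [
    {"state": "TX"}, {"state": "ID"}, {"state": "TX"},
    {"state": "MD"}, {"state": "TX"}, {"state": "ID"},
]


def group_by_state(docs, size=10):
    """Count documents per state and keep the `size` most frequent states."""
    counts = Counter(doc["state"] for doc in docs)
    return counts.most_common(size)


print(group_by_state(accounts))
```

Each (state, count) pair plays the role of one bucket in the aggregation response.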

The following again groups by state, computes the average balance within each group, and sorts the groups by average_balance in descending order:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          "average_balance": "desc"
        }
      },
      "aggs": {
        "average_balance": {
          "avg": {
            "field": "balance"
          }
        }
      }
    }
  }
}

The following groups by age range (20-29, 30-39, and 40-49), then by gender within each age range, and computes the average balance for each group:

GET /bank/_search
{
  "size": 0,
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
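The nested grouping can be sketched in memory as follows. In a range aggregation the from value is inclusive and the to value is exclusive, which the comparison below mirrors. The sample accounts and the function name are assumptions made for this example.

```python
from collections import defaultdict

RANGES = [(20, 30), (30, 40), (40, 50)]

# Made-up sample documents for illustration.
accounts = [
    {"age": 25, "gender": "F", "balance": 1000},
    {"age": 29, "gender": "F", "balance": 3000},
    {"age": 30, "gender": "M", "balance": 500},
    {"age": 45, "gender": "M", "balance": 2000},
    {"age": 55, "gender": "F", "balance": 9999},  # outside every range
]


def average_balance_by_age_and_gender(docs, ranges=RANGES):
    """Bucket by age range (from inclusive, to exclusive), then by gender."""
    sums = defaultdict(lambda: [0, 0])  # (range, gender) -> [total, count]
    for doc in docs:
        for low, high in ranges:
            if low <= doc["age"] < high:
                entry = sums[((low, high), doc["gender"])]
                entry[0] += doc["balance"]
                entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


result = average_balance_by_age_and_gender(accounts)
for key, avg in sorted(result.items()):
    print(key, avg)
```

An account aged exactly 30 lands in the 30-40 bucket, not in the 20-30 bucket, because the upper bound is exclusive.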

For more on aggregations, see search-aggregations.

To learn more about Elasticsearch, see the detailed sections of the ElasticSearch documentation, where you can also find the documentation for Logstash and Kibana.
