Elasticsearch Reference [5.2] » Getting Started

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.


Here are a few sample use-cases that Elasticsearch could be used for:

  • You run an online web store where you allow your customers to search for products that you sell. In this case, you can use Elasticsearch to store your entire product catalog and inventory and provide search and autocomplete suggestions for them.
  • You want to collect log or transaction data and you want to analyze and mine this data to look for trends, statistics, summarizations, or anomalies. In this case, you can use Logstash (part of the Elasticsearch/Logstash/Kibana stack) to collect, aggregate, and parse your data, and then have Logstash feed this data into Elasticsearch. Once the data is in Elasticsearch, you can run searches and aggregations to mine any information that is of interest to you.
  • You run a price alerting platform which allows price-savvy customers to specify a rule like "I am interested in buying a specific electronic gadget and I want to be notified if the price of gadget falls below $X from any vendor within the next month". In this case you can scrape vendor prices, push them into Elasticsearch and use its reverse-search (Percolator) capability to match price movements against customer queries and eventually push the alerts out to the customer once matches are found.
  • You have analytics/business-intelligence needs and want to quickly investigate, analyze, visualize, and ask ad-hoc questions on a lot of data (think millions or billions of records). In this case, you can use Elasticsearch to store your data and then use Kibana (part of the Elasticsearch/Logstash/Kibana stack) to build custom dashboards that can visualize aspects of your data that are important to you. Additionally, you can use the Elasticsearch aggregations functionality to perform complex business intelligence queries against your data.



For the rest of this tutorial, I will guide you through the process of getting Elasticsearch up and running, taking a peek inside it, and performing basic operations like indexing, searching, and modifying your data. At the end of this tutorial, you should have a good idea of what Elasticsearch is, how it works, and hopefully be inspired to see how you can use it to either build sophisticated search applications or to mine intelligence from your data.









1. Basic Concepts






There are a few concepts that are core to Elasticsearch. Understanding these concepts from the outset will tremendously help ease the learning process.



Near Realtime (NRT)



Elasticsearch is a near real time search platform. What this means is there is a slight latency (normally one second) from the time you index a document until the time it becomes searchable.

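For example (a minimal sketch; the test index, type, and field here are made-up names, while the refresh parameter is part of the standard index API), you can force a refresh so that a document becomes searchable immediately rather than after the refresh interval:

PUT /test/doc/1?refresh=true
{
  "message": "hello"
}

GET /test/_search?q=message:hello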


Cluster


A cluster is a collection of one or more nodes (servers) that together holds your entire data and provides federated indexing and search capabilities across all nodes. A cluster is identified by a unique name which by default is "elasticsearch". This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.
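For example, the cluster name can be set in config/elasticsearch.yml (the value below is just a placeholder):

cluster.name: logging-prod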


Note that it is valid and perfectly fine to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters each with its own unique cluster name.




Node



A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.



A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named  elasticsearch which means that if you start up a number of nodes on your network and—assuming they can discover each other—they will all automatically form and join a single cluster named  elasticsearch.



In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named  elasticsearch.



Index


An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (that must be all lowercase) and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.


In a single cluster, you can define as many indexes as you want.



Type



Within an index, you can define one or more types. A type is a logical category/partition of your index whose semantics is completely up to you. In general, a type is defined for documents that have a set of common fields. For example, let’s assume you run a blogging platform and store all your data in a single index. In this index, you may define a type for user data, another type for blog data, and yet another type for comments data.



Document


A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation) which is an ubiquitous internet data interchange format.


Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.



Shards & Replicas



An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.



To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully-functional and independent "index" that can be hosted on any node in the cluster.


Sharding is important for two primary reasons:

  • It allows you to horizontally split/scale your content volume
  • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput


The mechanics of how a shard is distributed and also how its documents are aggregated back into search requests are completely managed by Elasticsearch and is transparent to you as the user.

In a network/cloud environment where failures can be expected anytime, it is very useful and highly recommended to have a failover mechanism in case a shard/node somehow goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.


Replication is important for two primary reasons:

  • It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  • It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.



To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards). The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you may change the number of replicas dynamically anytime but you cannot change the number of shards after-the-fact.

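As a sketch of what this looks like in practice (my_index is a placeholder name; number_of_shards and number_of_replicas are the standard index settings), you could create an index with explicit counts and later raise or lower only the replica count:

PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}

PUT /my_index/_settings
{
  "index": { "number_of_replicas": 1 }
}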


By default, each index in Elasticsearch is allocated 5 primary shards and 1 replica which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica) for a total of 10 shards per index.


Note

Each Elasticsearch shard is a Lucene index. There is a maximum number of documents you can have in a single Lucene index. As of LUCENE-5843, the limit is 2,147,483,519 (= Integer.MAX_VALUE - 128) documents. You can monitor shard sizes using the _cat/shards API.

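For example:

GET /_cat/shards?v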





2. Installation



Elasticsearch requires at least Java 8. Specifically as of this writing, it is recommended that you use
the Oracle JDK version 1.8.0_73. Java installation varies from platform to platform so we won’t go
into those details here. Oracle’s recommended installation documentation can be found on
Oracle’s website. Suffice to say, before you install Elasticsearch, please check your Java version
first by running (and then install/upgrade accordingly if needed):


java -version
echo $JAVA_HOME

Once we have Java set up, we can then download and run Elasticsearch. The binaries are available
from  www.elastic.co/downloads along with all the releases that have been made in the past.
For each release, you have a choice among a  zip or  tar archive, or a  DEB or  RPM package.
For simplicity, let’s use the tar file.


Let’s download the Elasticsearch 5.2.0 tar as follows (Windows users should download the zip package):

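curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.2.0.tar.gz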




Then extract it as follows (Windows users should unzip the zip package):


tar -xvf elasticsearch-5.2.0.tar.gz


It will then create a bunch of files and folders in your current directory. We then go into the bin directory as follows:



cd elasticsearch-5.2.0/bin


And now we are ready to start our node and single cluster (Windows users should run the elasticsearch.bat file):


./elasticsearch



If everything goes well, you should see a bunch of messages that look like below:


[2016-09-16T14:17:51,251][INFO ][o.e.n.Node               ] [] initializing ...
[2016-09-16T14:17:51,329][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [317.7gb], net total_space [453.6gb], spins? [no], types [ext4]
[2016-09-16T14:17:51,330][INFO ][o.e.e.NodeEnvironment    ] [6-bjhwl] heap size [1.9gb], compressed ordinary object pointers [true]
[2016-09-16T14:17:51,333][INFO ][o.e.n.Node               ] [6-bjhwl] node name [6-bjhwl] derived from node ID; set [node.name] to override
[2016-09-16T14:17:51,334][INFO ][o.e.n.Node               ] [6-bjhwl] version[5.2.0], pid[21261], build[f5daa16/2016-09-16T09:12:24.346Z], OS[Linux/4.4.0-36-generic/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_60/25.60-b23]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [aggs-matrix-stats]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [ingest-common]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-expression]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-groovy]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-mustache]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [lang-painless]
[2016-09-16T14:17:51,967][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [percolator]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [reindex]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty3]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded module [transport-netty4]
[2016-09-16T14:17:51,968][INFO ][o.e.p.PluginsService     ] [6-bjhwl] loaded plugin [mapper-murmur3]
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] initialized
[2016-09-16T14:17:53,521][INFO ][o.e.n.Node               ] [6-bjhwl] starting ...
[2016-09-16T14:17:53,671][INFO ][o.e.t.TransportService   ] [6-bjhwl] publish_address {192.168.8.112:9300}, bound_addresses {{192.168.8.112:9300}
[2016-09-16T14:17:53,676][WARN ][o.e.b.BootstrapCheck     ] [6-bjhwl] max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]
[2016-09-16T14:17:56,731][INFO ][o.e.h.HttpServer         ] [6-bjhwl] publish_address {192.168.8.112:9200}, bound_addresses {[::1]:9200}, {192.168.8.112:9200}
[2016-09-16T14:17:56,732][INFO ][o.e.g.GatewayService     ] [6-bjhwl] recovered [0] indices into cluster_state
[2016-09-16T14:17:56,748][INFO ][o.e.n.Node               ] [6-bjhwl] started

Without going too much into detail, we can see that our node named "6-bjhwl" (which will be a different set of characters in
your case) has started and elected itself as a master in a single cluster. Don’t worry yet at the moment what master means.
The main thing that is important here is that we have started one node within one cluster.



As mentioned previously, we can override either the cluster or node name. This can be done from the command line when
starting Elasticsearch as follows:


./elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name

Also note the line marked http with information about the HTTP address ( 192.168.8.112) and port ( 9200) that our node is
reachable from. By default, Elasticsearch uses port  9200 to provide access to its REST API. This port is configurable if
necessary.










3. Exploring Your Cluster



The REST API



Now that we have our node (and cluster) up and running, the next step is to understand how to communicate with it. Fortunately, Elasticsearch provides a very comprehensive and powerful REST API that you can use to interact with your cluster. Among the few things that can be done with the API are as follows:



  • Check your cluster, node, and index health, status, and statistics
  • Administer your cluster, node, and index data and metadata
  • Perform CRUD (Create, Read, Update, and Delete) and search operations against your indexes
  • Execute advanced search operations such as paging, sorting, filtering, scripting, aggregations, and many others












4. Cluster Health



Let’s start with a basic health check, which we can use to see how our cluster is doing. We’ll be using curl to do this but you can use any tool that allows you to make HTTP/REST calls. Let’s assume that we are still on the same node where we started Elasticsearch, and open another command shell window.



To check the cluster health, we will be using the  _cat API. You can run the command below in  Kibana’s Console by clicking "VIEW IN CONSOLE" or with  curl by clicking the "COPY AS CURL" link below and pasting it into a terminal.




GET /_cat/health?v


And the response:



epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1475247709 17:01:49  elasticsearch green           1         1      0   0    0    0        0             0                  -                100.0%



We can see that our cluster named "elasticsearch" is up with a green status.

Whenever we ask for the cluster health, we either get green, yellow, or red. Green means everything is good (cluster is fully functional), yellow means all data is available but some replicas are not yet allocated (cluster is fully functional), and red means some data is not available for whatever reason. Note that even if a cluster is red, it still is partially functional (i.e. it will continue to serve search requests from the available shards) but you will likely need to fix it ASAP since you have missing data.


Also from the above response, we can see a total of 1 node and that we have 0 shards since we have no data in it yet. Note that since we are using the default cluster name (elasticsearch) and since Elasticsearch uses unicast network discovery by default to find other nodes on the same machine, it is possible that you could accidentally start up more than one node on your computer and have them all join a single cluster. In this scenario, you may see more than 1 node in the above response.



We can also get a list of nodes in our cluster as follows:



GET /_cat/nodes?v


And the response:



ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           10           5   5    4.46                       mdi      *      PB2SGZY



Here, we can see our one node named "PB2SGZY", which is the single node that is currently in our cluster.











5. List All Indices

Now let’s take a peek at our indices:

GET /_cat/indices?v

And the response:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Which simply means we have no indices yet in the cluster.











6. Create an Index



Now let’s create an index named "customer" and then list all the indexes again:


PUT /customer?pretty
GET /_cat/indices?v

The first command creates the index named "customer" using the PUT verb. We simply append  pretty to the end of the call to tell it to pretty-print the JSON response (if any).



And the response:


health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   customer 95SQ4TSUT7mWBT7VNHH67A   5   1          0            0       260b           260b

The results of the second command tell us that we now have 1 index named customer and it has 5 primary shards and 1 replica (the defaults) and it contains 0 documents in it.



You might also notice that the customer index has a yellow health tagged to it. Recall from our previous discussion that yellow means that some replicas are not (yet) allocated. The reason this happens for this index is because Elasticsearch by default created one replica for this index. Since we only have one node running at the moment, that one replica cannot yet be allocated (for high availability) until a later point in time when another node joins the cluster. Once that replica gets allocated onto a second node, the health status for this index will turn to green.

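If you want a single-node index to report green without adding a second node, one option (a sketch using the _settings API mentioned earlier) is to drop its replica count to zero:

PUT /customer/_settings
{
  "index": { "number_of_replicas": 0 }
}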











7. Index and Query a Document


Let’s now put something into our customer index. Remember previously that in order to index a document, we must tell Elasticsearch which type in the index it should go to.


Let’s index a simple customer document into the customer index, "external" type, with an ID of 1 as follows:


PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

And the response:

{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

From the above, we can see that a new customer document was successfully created inside the customer index and the external type. The document also has an internal id of 1 which we specified at index time.


It is important to note that Elasticsearch does not require you to explicitly create an index first before you can index documents into it. In the previous example, Elasticsearch will automatically create the customer index if it didn’t already exist beforehand.



Let’s now retrieve that document that we just indexed:


GET /customer/external/1?pretty

And the response:


{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : { "name": "John Doe" }
}

Nothing out of the ordinary here other than a field,  found, stating that we found a document with the requested ID 1 and another field,  _source, which returns the full JSON document that we indexed from the previous step.












8. Delete an Index



Now let’s delete the index that we just created and then list all the indexes again:


DELETE /customer?pretty
GET /_cat/indices?v

And the response:

health status index uuid pri rep docs.count docs.deleted store.size pri.store.size

Which means that the index was deleted successfully and we are now back to where we started with nothing in our cluster.



Before we move on, let’s take a closer look again at some of the API commands that we have learned so far:


PUT /customer
PUT /customer/external/1
{
  "name": "John Doe"
}
GET /customer/external/1
DELETE /customer

If we study the above commands carefully, we can actually see a pattern of how we access data in Elasticsearch. That pattern can be summarized as follows:


<REST Verb> /<Index>/<Type>/<ID>

This REST access pattern is so pervasive throughout all the API commands that if you can simply remember it, you will have a good head start at mastering Elasticsearch.













9. Modifying Your Data



Elasticsearch provides data manipulation and search capabilities in near real time. By default, you can expect a one second delay (refresh interval) from the time you index/update/delete your data until the time that it appears in your search results. This is an important distinction from other platforms like SQL wherein data is immediately available after a transaction is completed.




Indexing/Replacing Documents



We’ve previously seen how we can index a single document. Let’s recall that command again:


PUT /customer/external/1?pretty
{
  "name": "John Doe"
}

Again, the above will index the specified document into the customer index, external type, with the ID of 1. If we then executed the above command again with a different (or same) document, Elasticsearch will replace (i.e. reindex) a new document on top of the existing one with the ID of 1:


PUT /customer/external/1?pretty
{
  "name": "Jane Doe"
}

The above changes the name of the document with the ID of 1 from "John Doe" to "Jane Doe". If, on the other hand, we use a different ID, a new document will be indexed and the existing document(s) already in the index remains untouched.


PUT /customer/external/2?pretty
{
  "name": "Jane Doe"
}

The above indexes a new document with an ID of 2.



When indexing, the ID part is optional. If not specified, Elasticsearch will generate a random ID and then use it to index the document. The actual ID Elasticsearch generates (or whatever we specified explicitly in the previous examples) is returned as part of the index API call.



This example shows how to index a document without an explicit ID:


POST /customer/external?pretty
{
  "name": "Jane Doe"
}

Note that in the above case, we are using the  POST verb instead of PUT since we didn’t specify an ID.

In fact, if you do not specify an ID, the PUT request fails:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "No endpoint or operation is available at [external]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "No endpoint or operation is available at [external]"
  },
  "status" : 400
}
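With POST, the response reports the ID that Elasticsearch generated; it has the same shape as the earlier index response, along the lines of this sketch (the _id value below is made up and will differ on every run):

{
  "_index" : "customer",
  "_type" : "external",
  "_id" : "AVp3VN8qVgNrBO0mDTsd",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}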











10. Updating Documents



In addition to being able to index and replace documents, we can also update documents. Note though that Elasticsearch does not actually do in-place updates under the hood. Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.



This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe":


POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe" }
}

This example shows how to update our previous document (ID of 1) by changing the name field to "Jane Doe" and at the same time add an age field to it:


POST /customer/external/1/_update?pretty
{
  "doc": { "name": "Jane Doe", "age": 20 }
}

Updates can also be performed by using simple scripts. This example uses a script to increment the age by 5:


POST /customer/external/1/_update?pretty
{
  "script" : "ctx._source.age += 5"
}

In the above example,  ctx._source refers to the current source document that is about to be updated.



Note that as of this writing, updates can only be performed on a single document at a time. In the future, Elasticsearch might provide the ability to update multiple documents given a query condition (like an  SQL UPDATE-WHERE statement).












11. Deleting Documents


Deleting a document is fairly straightforward. This example shows how to delete our previous customer with the ID of 2:


DELETE /customer/external/2?pretty

See the Delete By Query API to delete all documents matching a specific query. It is worth noting that it is much more efficient to delete a whole index instead of deleting all documents with the Delete By Query API.

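For reference, a delete-by-query request looks roughly like this (a sketch; the match condition is made up):

POST /customer/external/_delete_by_query?pretty
{
  "query": { "match": { "name": "John Doe" } }
}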










12. Batch Processing



In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the  _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as few network roundtrips as possible.



As a quick example, the following call indexes two documents (ID 1 - John Doe and ID 2 - Jane Doe) in one bulk operation:


POST /customer/external/_bulk?pretty
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }

This example updates the first document (ID of 1) and then deletes the second document (ID of 2) in one bulk operation:



POST /customer/external/_bulk?pretty
{"update":{"_id":"1"}}
{"doc": { "name": "John Doe becomes Jane Doe" } }
{"delete":{"_id":"2"}}

Note above that for the delete action, there is no corresponding source document after it since deletes only require the ID of the document to be deleted.



The Bulk API does not fail due to failures in one of the actions. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.

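The response has roughly the following shape (a sketch rather than verbatim output): a top-level errors flag, plus one entry per action in items, each carrying its own status:

{
  "took": 3,
  "errors": false,
  "items": [
    { "index": { "_index": "customer", "_type": "external", "_id": "1", "status": 201 } },
    { "index": { "_index": "customer", "_type": "external", "_id": "2", "status": 201 } }
  ]
}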











13. Exploring Your Data


Sample Dataset



Now that we’ve gotten a glimpse of the basics, let’s try to work on a more realistic dataset. I’ve prepared a sample of fictitious JSON documents of customer bank account information. Each document has the following schema:


{
    "account_number": 0,
    "balance": 16623,
    "firstname": "Bradshaw",
    "lastname": "Mckenzie",
    "age": 29,
    "gender": "F",
    "address": "244 Columbus Place",
    "employer": "Euron",
    "email":  "bradshawmckenzie@euron.com",
    "city": "Hobucken",
    "state": "CO"
}

For the curious, I generated this data from  www.json-generator.com/ so please ignore the actual values and semantics of the data as these are all randomly generated.



Loading the Sample Dataset



You can download the sample dataset (accounts.json) from  here. Extract it to our current directory and let’s load it into our cluster as follows:


curl -XPOST 'localhost:9200/bank/account/_bulk?pretty&refresh' --data-binary "@accounts.json"
curl 'localhost:9200/_cat/indices?v'

And the response:


health status index uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   bank  l7sSYV2cQXmu6_4rJWVIww   5   1       1000            0    128.6kb        128.6kb

Which means that we just successfully bulk indexed 1000 documents into the bank index (under the account type).












14. The Search API



Now let’s start with some simple searches. There are two basic ways to run searches: one is by sending search parameters through the  REST request URI and the other by sending them through the  REST request body. The request body method allows you to be more expressive and also to define your searches in a more readable JSON format. We’ll try one example of the request URI method but for the remainder of this tutorial, we will exclusively be using the request body method.



The REST API for search is accessible from the  _search endpoint. This example returns all documents in the bank index:


GET /bank/_search?q=*&sort=account_number:asc&pretty

Let’s first dissect the search call. We are searching ( _search endpoint) in the bank index, and the  q=* parameter instructs Elasticsearch to match all documents in the index. The  sort=account_number:asc parameter indicates to sort the results using the  account_number field of each document in an ascending order. The  pretty parameter, again, just tells Elasticsearch to return pretty-printed JSON results.


And the response (partially shown):


{
  "took" : 63,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1000,
    "max_score" : null,
    "hits" : [ {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "0",
      "sort": [0],
      "_score" : null,
      "_source" : {"account_number":0,"balance":16623,"firstname":"Bradshaw",
"lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"bradshawmckenzie@euron.com","city":"Hobucken",
"state":"CO"}
    }, {
      "_index" : "bank",
      "_type" : "account",
      "_id" : "1",
      "sort": [1],
      "_score" : null,
      "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,
"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"amberduke@pyrami.com","city":"Brogan","state":"IL"}
    }, ...
    ]
  }
}

As for the response, we see the following parts:

  • took – time in milliseconds for Elasticsearch to execute the search
  • timed_out – tells us if the search timed out or not
  • _shards – tells us how many shards were searched, as well as a count of the successful/failed searched shards
  • hits – search results
  • hits.total – total number of documents matching our search criteria
  • hits.hits – actual array of search results (defaults to first 10 documents)
  • hits.sort - sort key for results (missing if sorting by score)
  • hits._score and max_score - ignore these fields for now



Here is the same exact search above using the alternative request body method:


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "account_number": "asc" }
  ]
}

The difference here is that instead of passing  q=* in the URI, we POST a JSON-style query request body to the  _search API. We’ll discuss this JSON query in the next section.



It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.











15. Introducing the Query Language



Elasticsearch provides a JSON-style domain-specific language that you can use to execute queries. This is referred to as the  Query DSL. The query language is quite comprehensive and can be intimidating at first glance but the best way to actually learn it is to start with a few basic examples.



Going back to our last example, we executed this query:


GET /bank/_search
{
  "query": { "match_all": {} }
}

Dissecting the above, the  query part tells us what our query definition is and the  match_all part is simply the type of query that we want to run. The  match_all query is simply a search for all documents in the specified index.



In addition to the  query parameter, we also can pass other parameters to influence the search results. In the example in the section above we passed in  sort, here we pass in  size:


GET /bank/_search
{
  "query": { "match_all": {} },
  "size": 1
}

Note that if  size is not specified, it defaults to 10.



This example does a  match_all and returns documents 11 through 20:


GET /bank/_search
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}

The from parameter (0-based) specifies which document index to start from and the size parameter specifies how many documents to return starting at the from parameter. This feature is useful when implementing paging of search results. Note that if from is not specified, it defaults to 0.



This example does a  match_all and sorts the results by account balance in descending order and returns the top 10 (default size) documents.


GET /bank/_search
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}











16. Executing Searches



Now that we have seen a few of the basic search parameters, let’s dig in some more into the Query DSL. Let’s first take a look at the returned document fields. By default, the full JSON document is returned as part of all searches. This is referred to as the source ( _source field in the search hits). If we don’t want the entire source document returned, we have the ability to request only a few fields from within source to be returned.



This example shows how to return two fields,  account_number and  balance (inside of  _source), from the search:


GET /bank/_search
{
  "query": { "match_all": {} },
  "_source": ["account_number", "balance"]
}

Note that the above example simply reduces the _source field. It will still only return one field named _source but within it, only the fields account_number and balance are included.

If you come from a SQL background, the above is somewhat similar in concept to the SQL SELECT FROM field list.


Now let’s move on to the query part. Previously, we’ve seen how the  match_all query is used to match all documents. Let’s now introduce a new query called the  match query, which can be thought of as a basic fielded search query (i.e. a search done against a specific field or set of fields).



This example returns the account numbered 20:



GET /bank/_search
{
  "query": { "match": { "account_number": 20 } }
}

This example returns all accounts containing the term "mill" in the address:

(You can think of this as roughly analogous to a SQL LIKE match.)

GET /bank/_search
{
  "query": { "match": { "address": "mill" } }
}

This example returns all accounts containing the term "mill" or "lane" in the address:


GET /bank/_search
{
  "query": { "match": { "address": "mill lane" } }
}

This example is a variant of  match ( match_phrase) that returns all accounts containing the phrase "mill lane" in the address:

(Roughly analogous to an exact = match in SQL.)

GET /bank/_search
{
  "query": { "match_phrase": { "address": "mill lane" } }
}

Let’s now introduce the  bool(ean) query. The  bool query allows us to compose smaller queries into bigger queries using boolean logic.



This example composes two  match queries and returns all accounts containing "mill" and "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the  bool must clause specifies all the queries that must be true for a document to be considered a match.

(Note: analogous to SQL’s AND.)


In contrast, this example composes two  match queries and returns all accounts containing "mill" or "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the  bool should clause specifies a list of queries either of which must be true for a document to be considered a match.

This behaves like the earlier { "query": { "match": { "address": "mill lane" } } } example.


This example composes two  match queries and returns all accounts that contain neither "mill" nor "lane" in the address:


GET /bank/_search
{
  "query": {
    "bool": {
      "must_not": [
        { "match": { "address": "mill" } },
        { "match": { "address": "lane" } }
      ]
    }
  }
}

In the above example, the  bool must_not clause specifies a list of queries none of which must be true for a document to be considered a match.



We can combine must, should, and must_not clauses simultaneously inside a bool query. Furthermore, we can compose bool queries inside any of these bool clauses to mimic any complex multi-level boolean logic.



This example returns all accounts of anybody who is 40 years old but doesn’t live in ID(aho):


GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "age": "40" } }
      ],
      "must_not": [
        { "match": { "state": "ID" } }
      ]
    }
  }
}












17. Executing Filters



In the previous section, we skipped over a little detail called the document score ( _score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document is, the lower the score, the less relevant the document is.



But queries do not always need to produce scores, in particular when they are only used for "filtering" the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.



The bool query that we introduced in the previous section also supports filter clauses, which allow us to use a query to restrict the documents that will be matched by other clauses, without changing how scores are computed. As an example, let’s introduce the range query, which allows us to filter documents by a range of values. This is generally used for numeric or date filtering.



This example uses a bool query to return all accounts with balances between 20000 and 30000, inclusive. In other words, we want to find accounts with a balance that is greater than or equal to 20000 and less than or equal to 30000.


GET /bank/_search
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": {
          "balance": {
            "gte": 20000,
            "lte": 30000
          }
        }
      }
    }
  }
}

Dissecting the above, the bool query contains a match_all query (the query part) and a range query (the filter part). We can substitute any other queries into the query and the filter parts. In the above case, the range query makes perfect sense since documents falling into the range all match "equally", i.e., no document is more relevant than another.



In addition to the match_all, match, bool, and range queries, there are a lot of other query types that are available and we won’t go into them here. Since we already have a basic understanding of how they work, it shouldn’t be too difficult to apply this knowledge in learning and experimenting with the other query types.














18. Executing Aggregations



Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.



To start with, this example groups all the accounts by state, and then returns the top 10 (default) states sorted by count descending (also default):


GET /bank/_search
{"size":0,"aggs":{"group_by_state":{"terms":{"field":"state.keyword"}}}} 

In SQL, the above aggregation is similar in concept to:


SELECT state, COUNT(*) FROM bank GROUP BY state ORDER BY COUNT(*) DESC

And the response (partially shown):

{"took":29,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1000,"max_score":0.0,"hits":[]},"aggregations":{"group_by_state":{"doc_count_error_upper_bound":20,"sum_other_doc_count":770,"buckets":[{"key":"ID","doc_count":27},{"key":"TX","doc_count":27},{"key":"AL","doc_count":25},{"key":"MD","doc_count":25},{"key":"TN","doc_count":23},{"key":"MA","doc_count":21},{"key":"NC","doc_count":21},{"key":"ND","doc_count":21},{"key":"ME","doc_count":20},{"key":"MO","doc_count":20}]}}} 

We can see that there are 27 accounts in  ID (Idaho), followed by 27 accounts in  TX (Texas), followed by 25 accounts in  AL (Alabama), and so forth.



Note that we set  size=0 to not show search hits because we only want to see the aggregation results in the response.



Building on the previous aggregation, this example calculates the average account balance by state (again only for the top 10 states sorted by count in descending order):


GET /bank/_search
{"size":0,"aggs":{"group_by_state":{"terms":{"field":"state.keyword"},"aggs":{"average_balance":{"avg":{"field":"balance"}}}}}} 

Notice how we nested the  average_balance aggregation inside the  group_by_state aggregation. This is a common pattern for all the aggregations. You can nest aggregations inside aggregations arbitrarily to extract pivoted summarizations that you require from your data.



Building on the previous aggregation, let’s now sort on the average balance in descending order:


GET /bank/_search
{"size":0,"aggs":{a"group_by_state":{"terms":{"field":"state.keyword","order":{"average_balance":"desc"}},"aggs":{"average_balance":{"avg":{"field":"balance"}}}}}} 

This example demonstrates how we can group by age brackets (ages 20-29, 30-39, and 40-49), then by gender, and then finally get the average account balance, per age bracket, per gender:


GET /bank/_search
{"size":0,"aggs":{"group_by_age":{"range":{"field":"age","ranges":[{"from":20,"to":30},{"from":30,"to":40},{"from":40,"to":50}]},"aggs":{"group_by_gender":{"terms":{"field":"gender.keyword"},"aggs":{"average_balance":{"avg":{"field":"balance"}}}}}}}} 

There are many other aggregation capabilities that we won’t go into detail here. The aggregations reference guide is a great starting point if you want to do further experimentation.












19. Conclusion



Elasticsearch is both a simple and complex product. We’ve so far learned the basics of what it is, how to look inside of it, and how to work with it using some of the REST APIs. I hope that this tutorial has given you a better understanding of what Elasticsearch is and more importantly, inspired you to further experiment with the rest of its great features!










