Introduction
The original text comes from the Datadog documentation. This article introduces monitoring Elasticsearch performance. Part 1 covers how Elasticsearch works and explores the key metrics you must monitor. Part 2 explains how to collect Elasticsearch performance metrics. Part 3 describes how to monitor Elasticsearch with Datadog. Part 4 discusses how to solve five common Elasticsearch problems.
Part 1: How to Monitor Elasticsearch Performance
1. What is Elasticsearch?
Elasticsearch is an open source distributed document store and search engine that stores and retrieves data structures in near real-time. Developed by Shay Banon and released in 2010, it relies heavily on Apache Lucene, a full-text search engine written in Java.
Elasticsearch represents data in the form of structured JSON documents, and makes full-text search accessible via RESTful API and web clients for languages like PHP, Python, and Ruby. It’s also elastic in the sense that it’s easy to scale horizontally—simply add more nodes to distribute the load. Today, many companies, including Wikipedia, eBay, GitHub, and Datadog, use it to store, search, and analyze large amounts of data on the fly.
1.1 The elements of Elasticsearch
Before we start exploring performance metrics, let’s examine what makes Elasticsearch work. In Elasticsearch, a cluster is made up of one or more nodes, as illustrated below:
Each node is a single running instance of Elasticsearch, and its elasticsearch.yml configuration file designates which cluster it belongs to (cluster.name) and what type of node it can be. Any property (including cluster name) set in the configuration file can also be specified via command line argument. The cluster in the diagram above consists of one dedicated master node and five data nodes.
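For illustration, the config for the dedicated master node in the diagram above might look something like this (the cluster name is a placeholder, and these node.* settings apply to the pre-7.x versions discussed in this article):

$ cat config/elasticsearch.yml
cluster.name: my-cluster   # which cluster this node joins
node.master: true          # eligible to be elected master
node.data: false           # stores no data (dedicated master)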
The three most common types of nodes in Elasticsearch are:
Master-eligible nodes: By default, every node is master-eligible unless otherwise specified. Each cluster automatically elects a master node from all of the master-eligible nodes. In the event that the current master node experiences a failure (such as a power outage, hardware failure, or an out-of-memory error), master-eligible nodes elect a new master. The master node is responsible for coordinating cluster tasks like distributing shards across nodes, and creating and deleting indices. Any master-eligible node is also able to function as a data node. However, in larger clusters, users may launch dedicated master-eligible nodes that do not store any data (by adding node.data: false to the config file), in order to improve reliability. In high-usage environments, moving the master role away from data nodes helps ensure that there will always be enough resources allocated to tasks that only master-eligible nodes can handle.
Data nodes: By default, every node is a data node that stores data in the form of shards (more about that in the section below) and performs actions related to indexing, searching, and aggregating data. In larger clusters, you may choose to create dedicated data nodes by adding node.master: false to the config file, ensuring that these nodes have enough resources to handle data-related requests without the additional workload of cluster-related administrative tasks.
Client nodes: If you set node.master and node.data to false, you will end up with a client node, which is designed to act as a load balancer that helps route indexing and search requests. Client nodes help shoulder some of the search workload so that data and master-eligible nodes can focus on their core tasks. Depending on your use case, client nodes may not be necessary because data nodes are able to handle request routing on their own. However, adding client nodes to your cluster makes sense if your search/index workload is heavy enough to benefit from having dedicated client nodes to help route requests.
1.2 How Elasticsearch Organizes Data
In Elasticsearch, related data is often stored in the same index, which can be thought of as the equivalent of a logical wrapper of configuration. Each index contains a set of related documents in JSON format. Elasticsearch’s secret sauce for full-text search is Lucene’s inverted index. When a document is indexed, Elasticsearch automatically creates an inverted index for each field; the inverted index maps terms to the documents that contain those terms.
An index is stored across one or more primary shards, and zero or more replica shards, and each shard is a complete instance of Lucene, like a mini search engine.
When creating an index, you can specify the number of primary shards, as well as the number of replicas per primary. The defaults are five primary shards per index, and one replica per primary. The number of primary shards cannot be changed once an index has been created, so choose carefully, or you will likely need to reindex later on. The number of replicas can be updated later on as needed. To protect against data loss, the master node ensures that each replica shard is not allocated to the same node as its primary shard.
2. Key Elasticsearch Performance Metrics to Monitor
Elasticsearch provides plenty of metrics that can help you detect signs of trouble and take action when you’re faced with problems like unreliable nodes, out-of-memory errors, and long garbage collection times. A few key areas to monitor are:
- Search and indexing performance
- Memory and garbage collection
- Host-level system and network metrics
- Cluster health and node availability
- Resource saturation and errors
This article references metric terminology from our Monitoring 101 series, which provides a framework for metric collection and alerting.
All of these metrics are accessible via Elasticsearch’s API as well as single-purpose monitoring tools like Elastic’s Marvel and universal monitoring services like Datadog. For details on how to collect these metrics using all of these methods, see Part 2 of this series.
2.1 Search and Indexing performance
2.1.1 Search performance metrics
Search requests are one of the two main request types in Elasticsearch, along with index requests. These requests are somewhat akin to read and write requests, respectively, in a traditional database system. Elasticsearch provides metrics that correspond to the two main phases of the search process (query and fetch). The diagrams below illustrate the path of a search request from start to finish.
Step 1: Client sends a search request to Node 2.
Step 2: Node 2 (the coordinating node) sends the query to a copy (either replica or primary) of every shard in the index.
Step 3: Each shard executes the query locally and delivers results to Node 2. Node 2 sorts and compiles them into a global priority queue.
Step 4: Node 2 finds out which documents need to be fetched and sends a multi GET request to the relevant shards.
Step 5: Each shard loads the documents and returns them to Node 2.
Step 6: Node 2 delivers the search results to the client.
If you are using Elasticsearch mainly for search, or if search is a customer-facing feature that is key to your organization, you should monitor query latency and take action if it surpasses a threshold. It’s important to monitor relevant metrics about queries and fetches that can help you determine how your searches perform over time. For example, you may want to track spikes and long-term increases in query requests, so that you can be prepared to tweak your configuration to optimize for better performance and reliability.
Metric description | Name | Metric type |
---|---|---|
Total number of queries | indices.search.query_total | Work: Throughput |
Total time spent on queries | indices.search.query_time_in_millis | Work: Performance |
Number of queries currently in progress | indices.search.query_current | Work: Throughput |
Total number of fetches | indices.search.fetch_total | Work: Throughput |
Total time spent on fetches | indices.search.fetch_time_in_millis | Work: Performance |
Number of fetches currently in progress | indices.search.fetch_current | Work: Throughput |
Search performance metrics to watch
Query load: Monitoring the number of queries currently in progress can give you a rough idea of how many requests your cluster is dealing with at any particular moment in time. Consider alerting on unusual spikes or dips that may point to underlying problems. You may also want to monitor the size of the search thread pool queue, which we will explain in further detail later on in this post.
Query latency: Though Elasticsearch does not explicitly provide this metric, monitoring tools can help you use the available metrics to calculate the average query latency by sampling the total number of queries and the total elapsed time at regular intervals. Set an alert if latency exceeds a threshold, and if it fires, look for potential resource bottlenecks, or investigate whether you need to optimize your queries.
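As a rough sketch of that calculation, the snippet below samples the node stats API twice and divides the deltas; the host, port, and 10-second interval are assumptions, and jq is used for JSON parsing:

# Sum query time and query count across all nodes, twice, 10 seconds apart
t1=$(curl -s localhost:9200/_nodes/stats/indices | jq '[.nodes[].indices.search.query_time_in_millis] | add')
c1=$(curl -s localhost:9200/_nodes/stats/indices | jq '[.nodes[].indices.search.query_total] | add')
sleep 10
t2=$(curl -s localhost:9200/_nodes/stats/indices | jq '[.nodes[].indices.search.query_time_in_millis] | add')
c2=$(curl -s localhost:9200/_nodes/stats/indices | jq '[.nodes[].indices.search.query_total] | add')
# Average latency per query over the interval (guard against a zero delta)
[ "$c2" -gt "$c1" ] && echo "avg query latency: $(( (t2 - t1) / (c2 - c1) )) ms"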
Fetch latency: The second part of the search process, the fetch phase, should typically take much less time than the query phase. If you notice this metric consistently increasing, this could indicate a problem with slow disks, enriching of documents (highlighting relevant text in search results, etc.), or requesting too many results.
2.1.2 Indexing performance metrics
Indexing requests are similar to write requests in a traditional database system. If your Elasticsearch workload is write-heavy, it’s important to monitor and analyze how effectively you are able to update indices with new information. Before we get to the metrics, let’s explore the process by which Elasticsearch updates an index. When new information is added to an index, or existing information is updated or deleted, each shard in the index is updated via two processes: refresh and flush.
Index refresh
Newly indexed documents are not immediately made available for search. First they are written to an in-memory buffer where they await the next index refresh, which occurs once per second by default. The refresh process creates a new in-memory segment from the contents of the in-memory buffer (making the newly indexed documents searchable), then empties the buffer, as shown below.
A SPECIAL SEGMENT ON SEGMENTS
Shards of an index are composed of multiple segments. The core data structure from Lucene, a segment is essentially a change set for the index. These segments are created with every refresh and subsequently merged together over time in the background to ensure efficient use of resources (each segment uses file handles, memory, and CPU).
Segments are mini-inverted indices that map terms to the documents that contain those terms. Every time an index is searched, a primary or replica version of each shard must be searched by, in turn, searching every segment in that shard.
A segment is immutable, so updating a document means:
- writing the information to a new segment during the refresh process
- marking the old information as deleted
The old information is eventually deleted when the outdated segment is merged with another segment.
Index flush
At the same time that newly indexed documents are added to the in-memory buffer, they are also appended to the shard’s translog: a persistent, write-ahead transaction log of operations. Every 30 minutes, or whenever the translog reaches a maximum size (by default, 512MB), a flush is triggered. During a flush, any documents in the in-memory buffer are refreshed (stored on new segments), all in-memory segments are committed to disk, and the translog is cleared.
The translog helps prevent data loss in the event that a node fails. It is designed to help a shard recover operations that may otherwise have been lost between flushes. The log is committed to disk every 5 seconds, or upon each successful index, delete, update, or bulk request (whichever occurs first).
The flush process is illustrated below:
Elasticsearch provides a number of metrics that you can use to assess indexing performance and optimize the way you update your indices.
Metric description | Name | Metric type |
---|---|---|
Total number of documents indexed | indices.indexing.index_total | Work: Throughput |
Total time spent indexing documents | indices.indexing.index_time_in_millis | Work: Performance |
Number of documents currently being indexed | indices.indexing.index_current | Work: Throughput |
Total number of index refreshes | indices.refresh.total | Work: Throughput |
Total time spent refreshing indices | indices.refresh.total_time_in_millis | Work: Performance |
Total number of index flushes to disk | indices.flush.total | Work: Throughput |
Total time spent on flushing indices to disk | indices.flush.total_time_in_millis | Work: Performance |
Indexing performance metrics to watch
Indexing latency: Elasticsearch does not directly expose this particular metric, but monitoring tools can help you calculate the average indexing latency from the available index_total and index_time_in_millis metrics. If you notice the latency increasing, you may be trying to index too many documents at one time (Elasticsearch’s documentation recommends starting with a bulk indexing size of 5 to 15 megabytes and increasing slowly from there).
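The same delta-sampling approach sketched above for query latency works here too; the raw counters come from the node stats API (host and port are placeholders):

$ curl -s localhost:9200/_nodes/stats/indices | jq '([.nodes[].indices.indexing.index_time_in_millis] | add), ([.nodes[].indices.indexing.index_total] | add)'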
If you are planning to index a lot of documents and you don’t need the new information to be immediately available for search, you can optimize for indexing performance over search performance by decreasing refresh frequency until you are done indexing. The index settings API enables you to temporarily disable the refresh interval:
curl -XPUT <nameofhost>:9200/<name_of_index>/_settings -d '{
"index" : {
"refresh_interval" : "-1"
}
}'
You can then revert back to the default value of “1s” once you are done indexing. This and other indexing performance tips will be explained in more detail in part 4 of this series.
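For reference, the matching call to restore the default might look like this (same placeholder host and index name as above):

curl -XPUT <nameofhost>:9200/<name_of_index>/_settings -d '{
"index" : {
"refresh_interval" : "1s"
}
}'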
Flush latency: Because data is not persisted to disk until a flush is successfully completed, it can be useful to track flush latency and take action if performance begins to take a dive. If you see this metric increasing steadily, it could indicate a problem with slow disks; this problem may escalate and eventually prevent you from being able to add new information to your index. You can experiment with lowering the index.translog.flush_threshold_size in the index’s flush settings. This setting determines how large the translog size can get before a flush is triggered. However, if you are a write-heavy Elasticsearch user, you should use a tool like iostat or the Datadog Agent to keep an eye on disk IO metrics over time, and consider upgrading your disks if needed.
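A sketch of lowering that threshold via the index settings API; the 256mb value is an arbitrary example below the 512MB default, not a recommendation:

curl -XPUT <nameofhost>:9200/<name_of_index>/_settings -d '{
"index.translog.flush_threshold_size" : "256mb"
}'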
2.2 Memory usage and garbage collection
When running Elasticsearch, memory is one of the key resources you’ll want to closely monitor. Elasticsearch and Lucene utilize all of the available RAM on your nodes in two ways: JVM heap and the file system cache. Elasticsearch runs in the Java Virtual Machine (JVM), which means that JVM garbage collection duration and frequency will be other important areas to monitor.
2.2.1 JVM heap: A Goldilocks tale
Elasticsearch stresses the importance of a JVM heap size that’s “just right”—you don’t want to set it too big, or too small, for reasons described below. In general, Elasticsearch’s rule of thumb is allocating less than 50 percent of available RAM to JVM heap, and never going higher than 32 GB.
The less heap memory you allocate to Elasticsearch, the more RAM remains available for Lucene, which relies heavily on the file system cache to serve requests quickly. However, you also don’t want to set the heap size too small because you may encounter out-of-memory errors or reduced throughput as the application faces constant short pauses from frequent garbage collections. Consult this guide, written by one of Elasticsearch’s core engineers, to find tips for determining the correct heap size.
Elasticsearch’s default installation sets a JVM heap size of 1 gigabyte, which is too small for most use cases. You can export your desired heap size as an environment variable and restart Elasticsearch:
$ export ES_HEAP_SIZE=10g
The other option is to set the JVM heap size (with equal minimum and maximum sizes to prevent the heap from resizing) on the command line every time you start up Elasticsearch:
$ ES_HEAP_SIZE="10g" ./bin/elasticsearch
In both of the examples shown, we set the heap size to 10 gigabytes. To verify that your update was successful, run:
$ curl -XGET http://<nameofhost>:9200/_cat/nodes?h=heap.max
The output should show you the correctly updated max heap value.
2.2.2 Garbage collection
Elasticsearch relies on garbage collection processes to free up heap memory. If you want to learn more about JVM garbage collection, check out this guide.
Because garbage collection uses resources (in order to free up resources!), you should keep an eye on its frequency and duration to see if you need to adjust the heap size. Setting the heap too large can result in long garbage collection times; these excessive pauses are dangerous because they can lead your cluster to mistakenly register your node as having dropped off the grid.
Metric description | Name | Metric type |
---|---|---|
Total count of young-generation garbage collections | jvm.gc.collectors.young.collection_count (jvm.gc.collectors.ParNew.collection_count prior to vers. 0.90.10) | Other |
Total time spent on young-generation garbage collections | jvm.gc.collectors.young.collection_time_in_millis (jvm.gc.collectors.ParNew.collection_time_in_millis prior to vers. 0.90.10) | Other |
Total count of old-generation garbage collections | jvm.gc.collectors.old.collection_count (jvm.gc.collectors.ConcurrentMarkSweep.collection_count prior to vers. 0.90.10) | Other |
Total time spent on old-generation garbage collections | jvm.gc.collectors.old.collection_time_in_millis (jvm.gc.collectors.ConcurrentMarkSweep.collection_time_in_millis prior to vers. 0.90.10) | Other |
Percent of JVM heap currently in use | jvm.mem.heap_used_percent | Resource: Utilization |
Amount of JVM heap committed | jvm.mem.heap_committed_in_bytes | Resource: Utilization |
JVM metrics to watch
JVM heap in use: Elasticsearch is set up to initiate garbage collections whenever JVM heap usage hits 75 percent. As shown above, it may be useful to monitor which nodes exhibit high heap usage, and set up an alert to find out if any node is consistently using over 85 percent of heap memory; this indicates that the rate of garbage collection isn’t keeping up with the rate of garbage creation. To address this problem, you can either increase your heap size (as long as it remains below the recommended guidelines stated above), or scale out the cluster by adding more nodes.
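One quick way to spot-check heap usage per node is the cat nodes API; the column names below are valid in recent versions but may vary in older ones:

$ curl -s '<nameofhost>:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'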
JVM heap used vs. JVM heap committed: It can be helpful to get an idea of how much JVM heap is currently in use, compared to committed memory (the amount that is guaranteed to be available). The amount of heap memory in use will typically take on a sawtooth pattern that rises when garbage accumulates and dips when garbage is collected. If the pattern starts to skew upward over time, this means that the rate of garbage collection is not keeping up with the rate of object creation, which could lead to slow garbage collection times and, eventually, OutOfMemoryErrors.
Garbage collection duration and frequency: Both young- and old-generation garbage collectors undergo “stop the world” phases, as the JVM halts execution of the program to collect dead objects. During this time, the node cannot complete any tasks. Because the master node checks the status of every other node every 30 seconds, if any node’s garbage collection time exceed 30 seconds, it will lead the master to believe that the node has failed.
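The underlying GC counters from the table above can be pulled from the node stats API, for example (host is a placeholder, jq assumed):

$ curl -s <nameofhost>:9200/_nodes/stats/jvm | jq '.nodes[].jvm.gc.collectors'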
2.2.3 Memory usage
As mentioned above, Elasticsearch makes excellent use of any RAM that has not been allocated to JVM heap. Like Kafka, Elasticsearch was designed to rely on the operating system’s file system cache to serve requests quickly and reliably.
A number of variables determine whether or not Elasticsearch successfully reads from the file system cache. If the segment file was recently written to disk by Elasticsearch, it is already in the cache. However, if a node has been shut off and rebooted, the first time a segment is queried, the information will most likely have to be read from disk. This is one reason why it’s important to make sure your cluster remains stable and that nodes do not crash.
Generally, it’s very important to monitor memory usage on your nodes, and give Elasticsearch as much RAM as possible, so it can leverage the speed of the file system cache without running out of space.
2.3 Host-level network and system metrics
Name | Metric type |
---|---|
Available disk space | Resource: Utilization |
I/O utilization | Resource: Utilization |
CPU usage | Resource: Utilization |
Network bytes sent/received | Resource: Utilization |
Open file descriptors | Resource: Utilization |
While Elasticsearch provides many application-specific metrics via API, you should also collect and monitor several host-level metrics from each of your nodes.
2.3.1 Host metrics to alert on
Disk space: This metric is particularly important if your Elasticsearch cluster is write-heavy. You don’t want to run out of disk space because you won’t be able to insert or update anything and the node will fail. If less than 20 percent is available on a node, you may want to use a tool like Curator to delete certain indices residing on that node that are taking up too much valuable disk space.
If deleting indices is not an option, the other alternative is to add more nodes, and let the master take care of automatically redistributing shards across the new nodes (though you should note that this creates additional work for a busy master node). Also, keep in mind that documents with analyzed fields (fields that require textual analysis—tokenizing, removing punctuation, and the like) take up significantly more disk space than documents with non-analyzed fields (exact values).
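Alongside OS-level tools, the cat allocation API gives a per-node view of disk usage and shard counts, e.g.:

$ curl -s '<nameofhost>:9200/_cat/allocation?v'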
2.3.2 Host metrics to watch
I/O utilization: As segments are created, queried, and merged, Elasticsearch does a lot of writing to and reading from disk. For write-heavy clusters with nodes that are continually experiencing heavy I/O activity, Elasticsearch recommends using SSDs to boost performance.
CPU utilization on your nodes: It can be helpful to visualize CPU usage in a heat map (like the one shown above) for each of your node types. For example, you could create three different graphs to represent each group of nodes in your cluster (data nodes, master-eligible nodes, and client nodes, for example) to see if one type of node is being overloaded with activity in comparison to another. If you see an increase in CPU usage, this is usually caused by a heavy search or indexing workload. Set up a notification to find out if your nodes’ CPU usage is consistently increasing, and add more nodes to redistribute the load if needed.
Network bytes sent/received: Communication between nodes is a key component of a balanced cluster. You’ll want to monitor the network to make sure it’s healthy and that it keeps up with the demands on your cluster (e.g. as shards are replicated or rebalanced across nodes). Elasticsearch provides transport metrics about cluster communication, but you can also look at the rate of bytes sent and received to see how much traffic your network is receiving.
Open file descriptors: File descriptors are used for node-to-node communications, client connections, and file operations. If this number reaches your system’s max capacity, then new connections and file operations will not be possible until old ones have closed. If over 80 percent of available file descriptors are in use, you may need to increase the system’s max file descriptor count. Most Linux systems ship with only 1024 file descriptors allowed per process. When using Elasticsearch in production, you should reset your OS file descriptor count to something much larger, like 64000.
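As a quick check, recent versions expose file descriptor usage through the cat nodes API (the column names below may differ in older releases):

$ curl -s '<nameofhost>:9200/_cat/nodes?v&h=name,file_desc.current,file_desc.percent,file_desc.max'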
HTTP connections
Metric description | Name | Metric type |
---|---|---|
Number of HTTP connections currently open | http.current_open | Resource: Utilization |
Total number of HTTP connections opened over time | http.total_opened | Resource: Utilization |
Requests sent in any language but Java will communicate with Elasticsearch using RESTful API over HTTP. If the total number of opened HTTP connections is constantly increasing, it could indicate that your HTTP clients are not properly establishing persistent connections. Reestablishing connections adds extra milliseconds or even seconds to your request response time. Make sure your clients are configured properly to avoid negative impact on performance, or use one of the official Elasticsearch clients, which already properly configure HTTP connections.
2.4 Cluster health and node availability
Metric description | Name | Metric type |
---|---|---|
Cluster status (green, yellow, red) | cluster.health.status | Other |
Number of nodes | cluster.health.number_of_nodes | Resource: Availability |
Number of initializing shards | cluster.health.initializing_shards | Resource: Availability |
Number of unassigned shards | cluster.health.unassigned_shards | Resource: Availability |
Cluster status: If the cluster status is yellow, at least one replica shard is unallocated or missing. Search results will still be complete, but if more shards disappear, you may lose data.
A red cluster status indicates that at least one primary shard is missing, and you are missing data, which means that searches will return partial results. You will also be blocked from indexing into that shard. Consider setting up an alert to trigger if status has been yellow for more than 5 min or if the status has been red for the past minute.
Initializing and unassigned shards: When you first create an index, or when a node is rebooted, its shards will briefly be in an “initializing” state before transitioning to a status of “started” or “unassigned”, as the master node attempts to assign shards to nodes in the cluster. If you see shards remain in an initializing or unassigned state too long, it could be a warning sign that your cluster is unstable.
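All of the metrics in the table above come from the cluster health API, which you can query directly:

$ curl -s '<nameofhost>:9200/_cluster/health?pretty'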
2.5 Resource saturation and errors
Elasticsearch nodes use thread pools to manage how threads consume memory and CPU. Since thread pool settings are automatically configured based on the number of processors, it usually doesn’t make sense to tweak them. However, it’s a good idea to keep an eye on queues and rejections to find out if your nodes aren’t able to keep up; if so, you may want to add more nodes to handle all of the concurrent requests. Fielddata and filter cache usage is another area to monitor, as evictions may point to inefficient queries or signs of memory pressure.
2.5.1 Thread pool queues and rejections
Each node maintains many types of thread pools; the exact ones you’ll want to monitor will depend on your particular usage of Elasticsearch. In general, the most important ones to monitor are search, merge, and bulk (also known as the write thread pool, depending on your version), which correspond to the request type (search, and merge and bulk/write operations). As of version 6.3.x+, the bulk thread pool is now known as the write thread pool. The write thread pool handles each write request, whether it writes/updates/deletes a single document or many documents (in a bulk operation). Starting in version 7.x, the index thread pool will be deprecated, but you may also want to monitor this thread pool if you’re using an earlier version of Elasticsearch (prior to 6.x).
The size of each thread pool’s queue represents how many requests are waiting to be served while the node is currently at capacity. The queue allows the node to track and eventually serve these requests instead of discarding them. Thread pool rejections arise once the thread pool’s maximum queue size (which varies based on the type of thread pool) is reached.
Metric description | Name | Metric type |
---|---|---|
Number of queued threads in a thread pool | thread_pool.search.queue thread_pool.merge.queue thread_pool.write.queue (or thread_pool.bulk.queue*) thread_pool.index.queue* | Resource: Saturation |
Number of rejected threads in a thread pool | thread_pool.search.rejected thread_pool.merge.rejected thread_pool.write.rejected (or thread_pool.bulk.rejected*) thread_pool.index.rejected* | Resource: Error |
METRICS TO WATCH
Thread pool queues: Large queues are not ideal because they use up resources and also increase the risk of losing requests if a node goes down. If you see the number of queued and rejected threads increasing steadily, you may want to try slowing down the rate of requests (if possible), increasing the number of processors on your nodes, or increasing the number of nodes in the cluster. As shown in the screenshot below, query load spikes correlate with spikes in search thread pool queue size, as the node attempts to keep up with rate of query requests.
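To inspect queue and rejection counts per pool, the cat thread pool API is handy; the generic column names below work in 5.x and later (older versions expose per-pool columns such as search.queue instead):

$ curl -s '<nameofhost>:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected'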
Bulk rejections and bulk queues: Bulk operations are a more efficient way to send many requests at one time. Generally, if you want to perform many actions (create an index, or add, update, or delete documents), you should try to send the requests as a bulk operation instead of many individual requests.
Bulk rejections are usually related to trying to index too many documents in one bulk request. According to Elasticsearch’s documentation, bulk rejections are not necessarily something to worry about. However, you should try implementing a linear or exponential backoff strategy to efficiently deal with bulk rejections.
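A minimal sketch of such a backoff, assuming a hypothetical requests.ndjson payload file: resend the bulk request whenever the cluster answers HTTP 429 (the status code for rejections), doubling the wait between attempts:

# Retry a bulk request on 429 rejections with exponential backoff
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -w '%{http_code}' -H 'Content-Type: application/x-ndjson' \
    -XPOST <nameofhost>:9200/_bulk --data-binary @requests.ndjson)
  [ "$status" != "429" ] && break   # success, or an error retrying won't fix
  sleep "$delay"
  delay=$((delay * 2))              # exponential backoff
done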
2.5.2 Cache usage metrics
Each query request is sent to every shard in an index, which then hits every segment of each of those shards. Elasticsearch caches queries on a per-segment basis to speed up response time. On the flip side, if your caches hog too much of the heap, they may slow things down instead of speeding them up!
In Elasticsearch, each field in a document can be stored in one of two forms: as an exact value or as full text. An exact value, such as a timestamp or a year, is stored exactly the way it was indexed because you do not expect a query for 1/1/16 to match “January 1st, 2016.” If a field is stored as full text, that means it is analyzed—basically, it is broken down into tokens, and, depending on the type of analyzer, punctuation and stop words like “is” or “the” may be removed. The analyzer converts the field into a normalized format that enables it to match a wider range of queries.
For example, let’s say that you have an index that contains a type called location; each document of the type location contains a field, city, which is stored as an analyzed string. You index two documents: one with “St. Louis” in the city field, and the other with “St. Paul”. Each string would be lowercased and transformed into tokens without punctuation. The terms are stored in an inverted index that looks something like this:
Term | Doc1 | Doc2 |
---|---|---|
st | x | x |
louis | x | |
paul | | x |
The benefit of analysis is that you can search for “st.” and the results would show that both documents contain the term. If you had stored the city field as an exact value, you would have had to search for the exact term, “St. Louis”, or “St. Paul”, in order to see the resulting documents.
Elasticsearch uses two main types of caches to serve search requests more quickly: fielddata and filter.
FIELDDATA CACHE
The fielddata cache is used when sorting or aggregating on a field, a process that basically has to uninvert the inverted index to create an array of every field value per field, in document order. For example, if we wanted to find a list of unique terms in any document that contained the term “st” from the example above, we would:
1. Scan the inverted index to see which documents contain that term (in this case, Doc1 and Doc2)
2. For each of the documents found in step 1, go through every term in the index to collect tokens from that document, creating a structure like the below:
Doc | Terms |
---|---|
Doc1 | st, louis |
Doc2 | st, paul |
3. Now that the inverted index has been “uninverted,” compile the unique tokens from each of the docs (st, louis, and paul). Compiling fielddata like this can consume a lot of heap memory, especially with large numbers of documents and terms. All of the field values are loaded into memory.
For versions prior to 1.3, the fielddata cache size was unbounded. Starting in version 1.3, Elasticsearch added a fielddata circuit breaker that is triggered if a query tries to load fielddata that would require over 60 percent of the heap.
FILTER CACHE
Filter caches also use JVM heap. In versions prior to 2.0, Elasticsearch automatically cached filtered queries with a max value of 10 percent of the heap, and evicted the least recently used data. Starting in version 2.0, Elasticsearch automatically began optimizing its filter cache, based on frequency and segment size (caching only occurs on segments that have fewer than 10,000 documents or less than 3 percent of total documents in the index). As such, filter cache metrics are only available to Elasticsearch users who are using a version prior to 2.0.
For example, a filter query could return only the documents for which values in the year field fall in the range 2000–2005. During the first execution of a filter query, Elasticsearch will create a bitset of which documents match the filter (1 if the document matches, 0 if not). Subsequent executions of queries with the same filter will reuse this information. Whenever new documents are added or updated, the bitset is updated as well. If you are using a version of Elasticsearch prior to 2.0, you should keep an eye on the filter cache as well as eviction metrics (more about that below).
Metric description | Name | Metric type |
---|---|---|
Size of the fielddata cache (bytes) | indices.fielddata.memory_size_in_bytes | Resource: Utilization |
Number of evictions from the fielddata cache | indices.fielddata.evictions | Resource: Saturation |
Size of the filter cache (bytes) (only pre-version 2.x) | indices.filter_cache.memory_size_in_bytes | Resource: Utilization |
Number of evictions from the filter cache (only pre-version 2.x) | indices.filter_cache.evictions | Resource: Saturation |
CACHE METRICS TO WATCH
Fielddata cache evictions: Ideally, you want to limit the number of fielddata evictions because they are I/O intensive. If you’re seeing a lot of evictions and you cannot increase your memory at the moment, Elasticsearch recommends a temporary fix of limiting fielddata cache to 20 percent of heap; you can do so in your config/elasticsearch.yml file. When fielddata reaches 20 percent of the heap, it will evict the least recently used fielddata, which then allows you to load new fielddata into the cache.
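That cap corresponds to the indices.fielddata.cache.size setting; a sketch of the config entry:

$ cat config/elasticsearch.yml
indices.fielddata.cache.size: 20%   # evict least recently used fielddata past this point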
Elasticsearch also recommends using doc values whenever possible because they serve the same purpose as fielddata. However, because they are stored on disk, they do not rely on JVM heap. Although doc values cannot be used for analyzed string fields, they do save fielddata usage when aggregating or sorting on other types of fields. In version 2.0 and later, doc values are automatically built at document index time, which has reduced fielddata/heap usage for many users. However, if you are using a version between 1.0 and 2.0, you can also benefit from this feature—simply remember to enable them when creating a new field in an index.
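On those 1.x versions, enabling doc values is a per-field mapping option; a hedged sketch reusing this article's location example (the index and field names are illustrative):

curl -XPUT <nameofhost>:9200/<name_of_index> -d '{
"mappings" : {
"location" : {
"properties" : {
"year" : { "type" : "integer", "doc_values" : true }
}
}
}
}'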
Filter cache evictions: As mentioned earlier, filter cache eviction metrics are only available if you are using a version of Elasticsearch prior to 2.0. Each segment maintains its own individual filter cache. Since evictions are costlier operations on large segments than small segments, there’s no clear-cut way to assess how serious each eviction may be. However, if you see evictions occurring more often, this may indicate that you are not using filters to your best advantage—you could just be creating new ones and evicting old ones on a frequent basis, defeating the purpose of even using a cache. You may want to look into tweaking your queries (for example, using a bool query instead of an and/or/not filter).
2.5.3 Pending tasks
Metric description | Name | Metric type |
---|---|---|
Number of pending tasks | pending_task_total | Resource: Saturation |
Number of urgent pending tasks | pending_tasks_priority_urgent | Resource: Saturation |
Number of high-priority pending tasks | pending_tasks_priority_high | Resource: Saturation |
Pending tasks can only be handled by master nodes. Such tasks include creating indices and assigning shards to nodes. Pending tasks are processed in priority order—urgent comes first, then high priority. They start to accumulate when the number of changes occurs more quickly than the master can process them. You want to keep an eye on this metric if it keeps increasing. The number of pending tasks is a good indication of how smoothly your cluster is operating. If your master node is very busy and the number of pending tasks doesn’t subside, it can lead to an unstable cluster.
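You can check the queue directly via the pending tasks API, which lists each task's priority and how long it has been waiting:

$ curl -s '<nameofhost>:9200/_cluster/pending_tasks?pretty'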
2.5.4 Unsuccessful GET requests
Metric description | Name | Metric type |
---|---|---|
Total number of GET requests where the document was missing | indices.get.missing_total | Work: Error |
Total time spent on GET requests where the document was missing | indices.get.missing_time_in_millis | Work: Error |
A GET request is more straightforward than a normal search request—it retrieves a document based on its ID. An unsuccessful get-by-ID request means that the document ID was not found. You shouldn’t usually have a problem with this type of request, but it may be a good idea to keep an eye out for unsuccessful GET requests when they happen.
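Both counters from the table above are available in the node stats output; for example, summing missed GETs across all nodes (jq assumed):

$ curl -s <nameofhost>:9200/_nodes/stats/indices | jq '[.nodes[].indices.get.missing_total] | add'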
Conclusion
In this post, we’ve covered some of the most important areas of Elasticsearch to monitor as you grow and scale your cluster:
- Search and indexing performance
- Memory and garbage collection
- Host-level system and network metrics
- Cluster health and node availability
- Resource saturation and errors
As you monitor Elasticsearch metrics along with node-level system metrics, you will discover which areas are the most meaningful for your specific use case. Read Part 2 to learn how to start collecting and visualizing the Elasticsearch metrics that matter most to you, or check out Part 3 to see how you can monitor Elasticsearch metrics, request traces, and logs in one platform. In Part 4, we’ll discuss how to solve five common Elasticsearch performance and scaling problems.