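Consul Server Performance (Part 2)
This article is translated from: https://www.consul.io/docs/install/performance
It covers the second half of the Server Performance page, which explains how to tune the read and write capacity of a Consul cluster. It is organized into four areas: reads, writes, networking, and rate limiting.
Please credit the source when reposting 🙂, and if you found it useful, please like, favorite, and share 😊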
Read/Write Tuning
Translation
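Consul's write capacity is limited by disk I/O, and its read capacity is limited by CPU. Memory requirements depend on the total size of the stored KV pairs, so memory should be sized according to that data (as should hard-disk storage). The value of a key is limited to 512KB.
For [write]-heavy workloads (frequent writes), memory can be sized as: RAM needed = number of KV pairs * average size * 2-3x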
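The sizing rule above is plain arithmetic, so it can be sketched as a small helper. This is only an illustration: the function name estimate_ram_bytes and the workload figures (1,000,000 keys of 1 KiB each) are assumptions made up for the example, not numbers from the Consul documentation.

```python
def estimate_ram_bytes(num_keys: int, avg_size_bytes: int,
                       overhead_factor: float = 3.0) -> int:
    """Rule-of-thumb RAM estimate: keys * average size * (2 to 3)."""
    if not 2.0 <= overhead_factor <= 3.0:
        raise ValueError("overhead_factor is expected to be between 2 and 3")
    return int(num_keys * avg_size_bytes * overhead_factor)


# Hypothetical workload: 1,000,000 keys averaging 1 KiB each.
low = estimate_ram_bytes(1_000_000, 1024, 2.0)
high = estimate_ram_bytes(1_000_000, 1024, 3.0)
print(f"estimated RAM: {low / 2**30:.2f} GiB to {high / 2**30:.2f} GiB")
```

Reporting both ends of the 2-3x range makes it clear that the result is a planning range rather than a single exact figure.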
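Because a write must be synced to disk (persistent storage) on a quorum of servers before it is committed, deploying disks with high write throughput (or SSDs) improves performance on the write side.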
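For [read]-heavy workloads (frequent reads), configure all Consul server agents with the allow_stale DNS option, or query the API with the stale consistency mode. By default, every query made to a server is forwarded by RPC to the leader and served by it. With stale reads enabled, any server will answer any query, which reduces the load on the leader. Typically, a stale response lags the consistent result by 100ms or less, but it drastically improves performance and reduces latency under high load.
(PS: The "100ms or less" sentence is my own reading based on other documentation, and corrections are welcome. I take it to mean that even with stale mode enabled, under normal conditions the data is stale by only 100ms or less. My source for this reading is the documentation of the maximum staleness setting, max_stale.)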
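As a rough illustration of how the consistency mode is chosen per request, the sketch below builds KV read URLs for the HTTP API, where the mode is passed as a bare query parameter (?stale or ?consistent). The helper names, the local address, and the 100 ms threshold are assumptions for the example. The freshness check relies on the X-Consul-KnownLeader and X-Consul-LastContact response headers and assumes the caller passes them with exactly that capitalization.

```python
from urllib.parse import quote

CONSISTENCY_MODES = {"default", "stale", "consistent"}


def kv_read_url(base_url: str, key: str, mode: str = "default") -> str:
    """Build a KV read URL; 'stale' and 'consistent' become bare query flags."""
    if mode not in CONSISTENCY_MODES:
        raise ValueError(f"unknown consistency mode: {mode}")
    url = f"{base_url.rstrip('/')}/v1/kv/{quote(key)}"
    return url if mode == "default" else f"{url}?{mode}"


def is_fresh_enough(headers: dict, max_lag_ms: int = 100) -> bool:
    """Accept a stale response only if the answering server knows a leader
    and was in contact with it within max_lag_ms milliseconds."""
    known_leader = headers.get("X-Consul-KnownLeader")
    last_contact = headers.get("X-Consul-LastContact")
    if known_leader is None or last_contact is None:
        return False  # missing headers: treat the response as too stale
    return known_leader.lower() == "true" and int(last_contact) <= max_lag_ms


# Hypothetical local agent address and key.
print(kv_read_url("http://127.0.0.1:8500", "app/config", "stale"))
```

A client could send the stale request first and fall back to the default mode when is_fresh_enough returns False.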
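If the leader server runs out of memory or its disk is full, the server eventually stops responding, loses its election, and cannot move past its last commit. However, by configuring max_stale and setting it to a large value (since version 0.7.1 the default is already 10 years, which means servers keep responding even when their data is stale), Consul servers will continue to answer read requests during such outage scenarios (for example, a power failure).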
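For reference, the two DNS options mentioned so far live under dns_config in the agent configuration. The sketch below only generates a JSON fragment. The 10-year figure is written out explicitly as hours to mirror the default described above, and on 0.7.1 or later setting it by hand should not be necessary. The variable names are illustrative.

```python
import json

# 10 years expressed in hours, matching the default described in the text.
TEN_YEARS_HOURS = 10 * 365 * 24

dns_fragment = {
    "dns_config": {
        "allow_stale": True,                  # let any server answer DNS queries
        "max_stale": f"{TEN_YEARS_HOURS}h",   # upper bound on accepted staleness
    }
}

print(json.dumps(dns_fragment, indent=2))
```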
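Note that stale mode is not appropriate where strong consistency matters (for example, locking or application leader election). For such critical cases, the optional consistent API query mode is required for true linearizability. The trade-off is that it turns a read into a full quorum write, so it needs more resources and takes longer.
Read-heavy clusters can use Consul Enterprise, which provides an enhanced read feature for better scalability. This feature allows a cluster to contain both voting servers and non-voting servers. A non-voting server still participates in data replication, but it does not block the leader from committing log entries.
(PS: My understanding is that the non-voting servers added in the Enterprise edition still take part in data replication, but whether replication to them succeeds does not affect the leader's commit of log entries. The document referenced below explains that non-voting servers take part in replication but not in leader election. Reference: Enterprise documentation on enhanced read scalability with Non-Voting Servers.)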
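[Networking] Consul agents use network sockets to communicate with the other nodes (gossip) and with the server agents. In addition, file descriptors are opened for watch handlers, health checks, and log files (every I/O operation needs a file descriptor). For write-heavy clusters, the ulimit size (a Linux setting) must be increased from its default value (1024) to prevent the leader from running out of file descriptors.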
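One way to check the limit in effect for a process is Python's standard resource module (Unix only). The sketch below is illustrative: the helper name needs_raise is made up, and it reads the limit of the Python process itself, so the result applies to a Consul agent only if it is started from the same environment.

```python
import resource

DEFAULT_SOFT_LIMIT = 1024  # the default value quoted in the text


def needs_raise(soft_limit: int, default: int = DEFAULT_SOFT_LIMIT) -> bool:
    """True when the soft open-file limit is still at or below the default."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # unlimited, nothing to raise
    return soft_limit <= default


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
if needs_raise(soft):
    print("soft limit is at or below 1024; consider raising it before starting the agent")
```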
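[Rate limiting] To prevent server CPU spikes caused by a misconfigured client, RPC requests from clients to servers should be rate limited. See Rate-Limit.
Note: rate limiting should be configured on client agents only.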
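In the agent configuration, client-side RPC rate limiting is set through the limits block, using rpc_rate (requests per second) and rpc_max_burst (token bucket size). The sketch below generates such a fragment. The function name and the values 100 and 1000 are assumptions for illustration, not recommendations.

```python
import json


def client_rate_limit_config(rpc_rate: float, rpc_max_burst: int) -> str:
    """Return a JSON config fragment limiting this client's RPCs to servers."""
    if rpc_rate <= 0 or rpc_max_burst <= 0:
        raise ValueError("rpc_rate and rpc_max_burst must be positive")
    fragment = {"limits": {"rpc_rate": rpc_rate, "rpc_max_burst": rpc_max_burst}}
    return json.dumps(fragment, indent=2)


# Hypothetical values: 100 requests per second with bursts of up to 1000.
print(client_rate_limit_config(100, 1000))
```

The generated fragment belongs in the configuration of client agents, in line with the note above.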
Original English Text
Read/Write Tuning
Consul is write limited by disk I/O and read limited by CPU. Memory requirements will be dependent on the total size of KV pairs stored and should be sized according to that data (as should the hard drive storage). The limit on a key’s value size is 512KB.
For write-heavy workloads, the total RAM available for overhead must approximately be equal to
RAM NEEDED = number of keys * average key size * 2-3x
Since writes must be synced to disk (persistent storage) on a quorum of servers before they are committed, deploying a disk with high write throughput (or an SSD) will enhance performance on the write side. (Documentation)
For a read-heavy workload, configure all Consul server agents with the allow_stale DNS option, or query the API with the stale consistency mode. By default, all queries made to the server are RPC forwarded to and serviced by the leader. By enabling stale reads, any server will respond to any query, thereby reducing overhead on the leader. Typically, the stale response is 100ms or less from consistent mode but it drastically improves performance and reduces latency under high load.
If the leader server is out of memory or the disk is full, the server eventually stops responding, loses its election and cannot move past its last commit time. However, by configuring max_stale and setting it to a large value, Consul will continue to respond to queries during such outage scenarios. (max_stale documentation).
It should be noted that stale is not appropriate for coordination where strong consistency is important (i.e. locking or application leader election). For critical cases, the optional consistent API query mode is required for true linearizability; the trade off is that this turns a read into a full quorum write so requires more resources and takes longer.
Read-heavy clusters may take advantage of the enhanced reading feature (Enterprise) for better scalability. This feature allows additional servers to be introduced as non-voters. Being a non-voter, the server will still participate in data replication, but it will not block the leader from committing log entries.
Consul’s agents use network sockets for communicating with the other nodes (gossip) and with the server agent. In addition, file descriptors are also opened for watch handlers, health checks, and log files. For a write heavy cluster, the ulimit size must be increased from the default value (1024) to prevent the leader from running out of file descriptors.
To prevent any CPU spikes from a misconfigured client, RPC requests to the server should be rate limited.
NOTE Rate limiting is configured on the client agent only.
In addition, two performance indicators, consul.runtime.alloc_bytes and consul.runtime.heap_objects, can help diagnose if the current sizing is not adequately meeting the load.
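A minimal sketch of pulling these two indicators out of a metrics snapshot follows. It assumes a payload shaped like the JSON returned by the agent's /v1/agent/metrics endpoint (a "Gauges" list of entries with "Name" and "Value"), and it assumes the gauge names appear exactly as written above. Depending on telemetry settings, the names may carry a hostname component. The sample payload and its values are invented for the example.

```python
WANTED = {"consul.runtime.alloc_bytes", "consul.runtime.heap_objects"}


def runtime_gauges(metrics: dict) -> dict:
    """Pick the two runtime sizing indicators out of a metrics snapshot."""
    return {
        gauge["Name"]: gauge["Value"]
        for gauge in metrics.get("Gauges", [])
        if gauge.get("Name") in WANTED
    }


# Invented sample snapshot, for illustration only.
sample = {
    "Gauges": [
        {"Name": "consul.runtime.alloc_bytes", "Value": 4_704_344},
        {"Name": "consul.runtime.heap_objects", "Value": 15_624},
        {"Name": "consul.runtime.num_goroutines", "Value": 34},
    ]
}
print(runtime_gauges(sample))
```

Tracking these values over time, rather than reading a single snapshot, is what shows whether the current sizing keeps up with the load.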