Production Checklist (Translation)

Original article link

 

Production Checklist

Overview

Data services such as RabbitMQ often have many tunable parameters. Some configurations make a lot of sense for development but are not really suitable for production. No single configuration fits every use case. It is, therefore, important to assess system configuration and have a plan for "day two operations" activities such as upgrades before going into production.


Production systems have concerns that go beyond configuration: a certain degree of system observability (monitoring, metrics), application resource usage and security (e.g. firewalls, credentials and shared secret generation) are essential. This guide provides an overview of such topics as well.


Monitoring and metrics are the foundation of a production-grade system. Besides helping detect issues, they provide operators with data that can be used to size and configure both RabbitMQ nodes and applications.

Operators should also keep RabbitMQ support timelines in mind when picking a version to deploy.

Virtual Hosts, Users, Permissions

It is often necessary to seed a cluster with virtual hosts, users, permissions, topologies, policies and so on. The recommended way of doing this at deployment time is via definition file import.


Virtual Hosts

In a single-tenant environment, for example, when your RabbitMQ cluster is dedicated to powering a single system in production, using the default virtual host (/) is perfectly fine.

In multi-tenant environments, use a separate vhost for each tenant/environment, e.g. project1_development, project1_production, project2_development, project2_production, and so on.


Users

For production environments, delete the default user (guest). By default, this user can only connect from localhost because it has well-known credentials. Instead of enabling remote connections for it, consider creating a separate user with administrative permissions and a generated password.

It is recommended to use a separate user per application. For example, if you have a mobile app, a Web app, and a data aggregation system, you'd have 3 separate users. This makes a number of things easier:

  • Correlating client connections with applications
  • Using fine-grained permissions
  • Credentials roll-over (e.g. periodically or in case of a breach)
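The per-application users above can be seeded through a definition file. The following is a minimal sketch, not a definitive implementation: it assumes the JSON layout of an exported definitions file (vhosts, users, permissions) and the salted SHA-256 password hash format; the vhost and user names are hypothetical, and the field names and the format of "tags" should be checked against an export from your own cluster.

```python
import base64
import hashlib
import json
import secrets


def hash_password(password, salt=None):
    """Salted SHA-256 hash, base64-encoded as salt + digest (assumed format)."""
    if salt is None:
        salt = secrets.token_bytes(4)  # 32-bit random salt
    digest = hashlib.sha256(salt + password.encode("utf-8")).digest()
    return base64.b64encode(salt + digest).decode("ascii")


def build_definitions(vhost, apps):
    """Return (definitions dict, generated passwords keyed by user name)."""
    # One generated password per application, so credentials can be
    # rolled over for one application without touching the others.
    passwords = {app: secrets.token_urlsafe(24) for app in apps}
    definitions = {
        "vhosts": [{"name": vhost}],
        "users": [
            {
                "name": app,
                "password_hash": hash_password(passwords[app]),
                "hashing_algorithm": "rabbit_password_hashing_sha256",
                "tags": "",  # no management tags; format may differ by version
            }
            for app in apps
        ],
        "permissions": [
            # ".*" grants everything in the vhost; narrow these patterns
            # when fine-grained permissions are needed.
            {"user": app, "vhost": vhost,
             "configure": ".*", "write": ".*", "read": ".*"}
            for app in apps
        ],
    }
    return definitions, passwords


# Hypothetical vhost and application names, for illustration only.
definitions, passwords = build_definitions(
    "project1_production", ["mobile_app", "web_app", "data_aggregator"]
)
print(json.dumps(definitions, indent=2))
```

The generated plaintext passwords are returned separately so they can be handed to a secrets store; only the hashes end up in the definition file.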

In case there are many instances of the same application, there's a trade-off between better security (having a set of credentials per instance) and convenience of provisioning (sharing a set of credentials between some or all instances).


For IoT applications that involve many clients performing the same or similar function and having fixed IP addresses, it may make sense to authenticate using X.509 certificates or source IP address ranges.

Monitoring and Resource Limits


RabbitMQ nodes are limited by various resources, both physical (e.g. the amount of RAM available) and software (e.g. the maximum number of file handles a process can open). It is important to evaluate resource limit configurations before going into production and to continuously monitor resource usage after that.

Monitoring

Monitoring several aspects of the system, from infrastructure and kernel metrics to RabbitMQ to application-level metrics, is essential. While monitoring requires an upfront investment in terms of time, it is very effective at catching issues and noticing potentially problematic trends early (or at all).

Memory

RabbitMQ uses resource-driven alarms to throttle publishers when consumers do not keep up.

By default, RabbitMQ will not accept any new messages when it detects that it's using more than 40% of the available memory (as reported by the OS): vm_memory_high_watermark.relative = 0.4. This is a safe default and care should be taken when modifying this value, even when the host is a dedicated RabbitMQ node.

The OS and file system use system memory to speed up operations for all system processes. Failing to leave enough free system memory for this purpose will have an adverse effect on system performance due to OS swapping, and can even result in RabbitMQ process termination.


A few recommendations when adjusting the default vm_memory_high_watermark:

  • Nodes hosting RabbitMQ should have at least 256 MiB of memory available at all times. Deployments that use quorum queues, Shovel and Federation may need more.
  • The recommended vm_memory_high_watermark.relative range is 0.4 to 0.7
  • Values above 0.7 should be used with care and with solid memory usage and infrastructure-level monitoring in place. The OS and file system must be left with at least 30% of the memory, otherwise performance may degrade severely due to paging.
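As a rough illustration of the recommendations above, the sketch below computes where the memory alarm would trigger for a given vm_memory_high_watermark.relative value and flags values outside the 0.4 to 0.7 range. The function name and the returned structure are made up for this example; only the thresholds come from the text.

```python
def check_watermark(total_memory_bytes, relative=0.4):
    """Return the alarm threshold in bytes plus any warnings for `relative`."""
    if not 0 < relative <= 1:
        raise ValueError("relative watermark must be in the range (0, 1]")

    alarm_at = int(total_memory_bytes * relative)
    warnings = []
    if relative > 0.7:
        # Less than 30% of memory would be left for the OS and file system.
        warnings.append("above 0.7: paging may severely degrade performance")
    elif relative < 0.4:
        warnings.append("below the recommended 0.4 to 0.7 range")
    return {
        "alarm_at_bytes": alarm_at,
        "left_for_os_bytes": total_memory_bytes - alarm_at,
        "warnings": warnings,
    }


# A host with 8 GiB of memory and the default watermark.
report = check_watermark(8 * 1024**3, relative=0.4)
print(report["alarm_at_bytes"], report["warnings"])
```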


These are some very broad-stroke guidelines. As with every tuning scenario, monitoring, benchmarking and measuring are required to find the best setting for the environment and workload.
Learn more about RabbitMQ and system memory in a separate guide.


Disk Space

The current 50MB disk_free_limit default works very well for development and tutorials. Production deployments require a much greater safety margin. Insufficient disk space will lead to node failures and may result in data loss as all disk writes will fail.


Why is the default 50MB then? Development environments sometimes use really small partitions to host /var/lib, for example, which means nodes go into resource alarm state right after booting. The very low default ensures that RabbitMQ works out of the box for everyone. As for production deployments, we recommend the following:


  • disk_free_limit.relative = 1.0 is the minimum recommended value and it translates to the total amount of memory available. For example, on a host dedicated to RabbitMQ with 4GB of system memory, if available disk space drops below 4GB, all publishers will be blocked and no new messages will be accepted. Queues will need to be drained, normally by consumers, before publishing will be allowed to resume.
  • disk_free_limit.relative = 1.5 is a safer production value. On a RabbitMQ node with 4GB of memory, if available disk space drops below 6GB, all new messages will be blocked until the disk alarm clears. If RabbitMQ needs to flush to disk 4GB worth of data, as can sometimes be the case during shutdown, there will be sufficient disk space available for RabbitMQ to start again. In this specific example, if free disk space was just at the 6GB limit, flushing 4GB leaves 2GB free, so RabbitMQ will start and immediately block all publishers since 2GB is well under the required 6GB.
  • disk_free_limit.relative = 2.0 is the most conservative production value; we cannot think of any reason to use anything higher. If you want full confidence in RabbitMQ having all the disk space that it needs, at all times, this is the value to use.
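The arithmetic behind the disk_free_limit.relative examples can be sketched as follows. This is only an illustration: the helper names are hypothetical, and "GB" is treated as 1024**3 bytes, which the text does not specify.

```python
GB = 1024**3  # assumption: the text does not say whether GB is binary or decimal


def disk_free_limit_bytes(total_memory_bytes, relative):
    """Free disk space below which the disk alarm would be raised."""
    return int(total_memory_bytes * relative)


def publishers_blocked(free_disk_bytes, total_memory_bytes, relative):
    """True when free disk space is under the configured relative limit."""
    return free_disk_bytes < disk_free_limit_bytes(total_memory_bytes, relative)


memory = 4 * GB
limit = disk_free_limit_bytes(memory, 1.5)   # 6 GB for a 4 GB node

# The node sits right at the limit, then flushes 4 GB to disk on shutdown.
free_after_flush = limit - 4 * GB            # 2 GB left
print(limit // GB, free_after_flush // GB)
print(publishers_blocked(free_after_flush, memory, 1.5))
```

The node still has room to restart after the flush, but publishers stay blocked until free disk space climbs back above the limit.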

Open File Handles Limit

Operating systems limit the maximum number of concurrently open file handles, which includes network sockets. Make sure that you have limits set high enough to allow for the expected number of concurrent connections and queues.

Make sure your environment allows for at least 50K open file descriptors for the effective RabbitMQ user, including in development environments.

As a rule of thumb, multiply the 95th percentile number of concurrent connections by 2 and add the total number of queues to calculate the recommended open file handle limit. Values as high as 500K are not unreasonable and won't consume a lot of hardware resources, and therefore are recommended for production setups.
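The rule of thumb above fits in a few lines. The function name is made up for this sketch, and the 50K floor is taken from the minimum recommended earlier in this section.

```python
MIN_FD_LIMIT = 50_000  # minimum recommended for the effective RabbitMQ user


def recommended_fd_limit(p95_connections, total_queues):
    """2 x the 95th percentile of concurrent connections, plus queue count."""
    estimate = 2 * p95_connections + total_queues
    return max(MIN_FD_LIMIT, estimate)


# A small deployment stays at the floor; a larger one exceeds it.
print(recommended_fd_limit(p95_connections=1_000, total_queues=200))
print(recommended_fd_limit(p95_connections=40_000, total_queues=5_000))
```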


See the Networking guide for more information.

Log Collection


It is highly recommended that logs of all RabbitMQ nodes and applications (when possible) are collected and aggregated. Logs can be crucially important in investigating unusual system behaviour.


Application Considerations


The way applications are designed and use RabbitMQ client libraries is a major contributor to the overall system resilience.

Applications that use resources inefficiently or leak them will eventually affect the rest of the system. For example, an app that continuously opens connections but never closes them will cause cluster nodes to run out of file descriptors, so no new connections will be accepted.

This and similar problems can manifest themselves in more complex scenarios, e.g. those collectively known as the thundering herd problem.

This section covers a number of the most common problems. Most of these problems are generally not protocol-specific or new.

They can be hard to detect, however. Adequate monitoring of the system is critically important as it is the only way to spot problematic trends (e.g. channel leaks, growing file descriptor usage from poor connection management) early.


Connection Management

Messaging protocols generally assume long-lived connections. Some applications connect to RabbitMQ on start and only close the connection(s) when they have to terminate. Others open and close connections more dynamically. For the latter group it is important to close them when they are no longer used.


Connections can be closed for reasons outside of the application developer's control. Messaging protocols supported by RabbitMQ use a feature called heartbeats (the name may vary but the concept does not) to detect such connections quicker than the TCP stack. Developers should be careful about using heartbeat timeouts that are too low (less than 5 seconds) as that may produce false positives when network congestion or system load goes up.

Very short-lived connections should be avoided when possible. The following section will cover this in more detail.

Connection Churn

 

As mentioned above, messaging protocols generally assume long-lived connections. Some applications may open a new connection to perform a single operation (e.g. publish a message) and then close it. This is highly inefficient as opening a connection is an expensive operation (compared to reusing an existing one).

Such a workload also leads to connection churn. Nodes experiencing high connection churn must be tuned to release TCP connections much quicker than kernel defaults, otherwise they will eventually run out of file handles or memory and will stop accepting new connections.


If a small number of long-lived connections is not an option, connection pooling can help reduce peak resource usage.
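A connection pool caps the number of open connections and reuses them instead of opening one per operation. The sketch below is deliberately library-agnostic: ConnectionPool and the factory argument are hypothetical names, and the factory stands in for whatever call your client library uses to open a connection. It performs no health checks or reconnection.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())  # open all connections up front

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every connection is in use, which bounds peak usage.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back for reuse


# Stand-in factory that only counts how many "connections" were opened.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # publish a message using conn here
print(len(opened))  # number of connections actually opened
```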


Recovery from Connection Failures


Some client libraries, for example, Java, .NET and Ruby, support automatic connection recovery after network failures. If the client used provides this feature, it is recommended to use it instead of developing your own recovery mechanism.

Other clients (Go, Pika) do not support automatic connection recovery as a feature but do provide examples that demonstrate how to recover from connection failures.
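For clients without built-in recovery, a retry loop with exponential backoff is one common approach. The sketch below is not taken from any client library's examples: connect_with_retry and its parameters are hypothetical, and connect is whatever callable opens a connection in your client. Random jitter is added so that many clients reconnecting at once do not produce a thundering herd.

```python
import random
import time


def connect_with_retry(connect, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep,
                       retry_on=(OSError,)):
    """Call connect() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Backoff cap doubles per attempt: 1s, 2s, 4s, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Sleep a random amount below the cap to spread reconnects out.
            sleep(random.uniform(0, cap))
```

Since ConnectionError is a subclass of OSError, the default retry_on covers typical socket failures; library-specific exception types can be passed in instead.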


Excessive Channel Usage

Channels also consume resources in both client and server. Applications should minimize the number of channels they use when possible and close channels that are no longer necessary. Channels, like connections, are meant to be long-lived.

Note that closing a connection automatically closes all channels on it.


Polling Consumers

Polling consumers (consumption with basic.get) is a feature that application developers should avoid in most cases as polling is inherently inefficient.


Security Considerations

Users and Permissions

See the section on vhosts, users, and credentials above.

Inter-node and CLI Tool Authentication

RabbitMQ nodes authenticate to each other using a shared secret stored in a file. On Linux and other UNIX-like systems, it is necessary to restrict cookie file access only to the OS users that will run RabbitMQ and CLI tools.

It is important that the value is generated in a reasonably secure way (e.g. not computed from an easy to guess value). This is usually done using deployment automation tools at the time of initial deployment. Those tools can use default or placeholder values: don't rely on them. Allowing the runtime to generate a cookie file on one node and copying it to all other nodes is also a poor practice: it makes the generated value more predictable since the generation algorithm is known.
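One way a deployment tool could generate the shared secret is sketched below. It rests on assumptions not stated in this guide: that the cookie is an alphanumeric string of at most 255 characters, and that owner-only file permissions (mode 600) are what the node expects. The function names and the file location are hypothetical; the demo writes to a temporary directory rather than a real cookie path.

```python
import os
import secrets
import string
import tempfile

COOKIE_ALPHABET = string.ascii_letters + string.digits


def generate_cookie(length=64):
    """Random alphanumeric secret from the OS's cryptographic RNG."""
    if not 1 <= length <= 255:  # assumed upper bound on cookie size
        raise ValueError("length must be between 1 and 255")
    return "".join(secrets.choice(COOKIE_ALPHABET) for _ in range(length))


def write_cookie(path, cookie):
    """Create the file with owner-only access; refuse to overwrite."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w") as f:
        f.write(cookie)


# Demo in a throwaway directory; a real deployment would write the same
# value to the cookie file location on every node and CLI host.
with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookie")
    write_cookie(cookie_path, generate_cookie())
    print(oct(os.stat(cookie_path).st_mode & 0o777))
```

The value is generated once by the tool and distributed to all nodes, rather than letting one node's runtime generate it.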

CLI tools use the same authentication mechanism. It is recommended that inter-node and CLI communication port access is limited to the hosts that run RabbitMQ nodes or CLI tools.

Securing inter-node communication with TLS is recommended. It implies that CLI tools are also configured to use TLS.

Firewall Configuration

Ports used by RabbitMQ can be broadly put into one of two categories:

  • Ports used by client libraries (AMQP 0-9-1, AMQP 1.0, MQTT, STOMP, HTTP API)
  • All other ports (inter node communication, CLI tools and so on)

Access to ports from the latter category generally should be restricted to hosts running RabbitMQ nodes or CLI tools. Ports in the former category should be accessible to hosts that run applications, which in some cases can mean public networks, for example, behind a load balancer.

TLS

We recommend using TLS connections when possible, at least to encrypt traffic. Peer verification (authentication) is also recommended. Development and QA environments can use self-signed TLS certificates. Self-signed certificates can be appropriate in production environments when RabbitMQ and all applications run on a trusted network or are isolated using technologies such as VMware NSX.

While RabbitMQ tries to offer a secure TLS configuration by default (e.g. SSLv3 is disabled), we recommend evaluating TLS configuration (versions, cipher suites and so on) using tools such as testssl.sh. Please refer to the TLS guide to learn more.

Note that TLS can have significant impact on overall system throughput, including CPU usage of both RabbitMQ and applications that use it.

Networking Configuration

Production environments may require network configuration tuning, for example, to sustain a high number of concurrent clients. Please refer to the Networking Guide for details.

Clustering Considerations

Cluster Size

When determining cluster size, it is important to take several factors into consideration:

  • Expected throughput
  • Expected replication (number of mirrors)
  • Data locality

Since clients can connect to any node, RabbitMQ may need to perform inter-node routing of messages and internal operations. Try making consumers and producers connect to the same node, if possible: this will reduce inter-node traffic. Equally helpful would be making consumers connect to the node that currently hosts the queue master (which can be inferred using the HTTP API). When data locality is taken into consideration, total cluster throughput can reach non-trivial volumes.

For most environments, mirroring to more than half of cluster nodes is sufficient. It is recommended to use clusters with an odd number of nodes (3, 5, and so on).
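"More than half of cluster nodes" is a simple majority. The short sketch below (function names made up for this example) shows how many replicas that means and how many node failures still leave a majority standing, which illustrates one reason odd cluster sizes are preferred: going from 3 to 4 nodes adds a replica without tolerating any additional failure.

```python
def majority(cluster_size):
    """Smallest node count that is more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Nodes that can be lost while a majority is still available."""
    return cluster_size - majority(cluster_size)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_failures(n))
```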

Partition Handling Strategy

It is important to pick a partition handling strategy before going into production. When in doubt, use the autoheal strategy.

Node Time Synchronization

A RabbitMQ cluster will typically function well without clocks of participating servers being synchronized. However, some plugins, such as the Management UI, make use of local timestamps for metrics processing and may display incorrect statistics when the current time of nodes drifts apart. It is therefore recommended that servers use NTP or similar to ensure clocks remain in sync.

Getting Help and Providing Feedback

If you have questions about the contents of this guide or any other topic related to RabbitMQ, don't hesitate to ask them on the RabbitMQ mailing list.

Help Us Improve the Docs <3

If you'd like to contribute an improvement to the site, its source is available on GitHub. Simply fork the repository and submit a pull request. Thank you!
