Prometheus High Availability and Fault Tolerance Strategy, Long-Term Storage with VictoriaMetrics

weixin_26711867

Original article: https://medium.com/miro-engineering/prometheus-high-availability-and-fault-tolerance-strategy-long-term-storage-with-victoriametrics-82f6f3f0409e

“Why” of this article?

Prometheus is a great tool for monitoring small, medium, and big infrastructures.

Prometheus, and the development team behind it, focuses on scraping metrics. It is a particularly good solution for short-term retention of metrics. Long-term retention is another story, unless only a small number of metrics is collected. This is reasonable to a degree: most of the time, when we investigate a problem using metrics scraped by Prometheus, we look at data no older than 10 days. But this is not always the case, especially when the statistics we are after correlate different periods, such as different weeks of a month or different months, or when we want to keep historical summaries.

Actually, Prometheus is perfectly able to collect metrics and to store them for a long time, but storage becomes extremely expensive because Prometheus needs fast storage. Prometheus is also not known as a solution that reaches high availability (HA) and fault tolerance (FT) in a sophisticated way (as we will explain, there is a way; it is not sophisticated, but it exists). This article explains how to achieve HA and FT for Prometheus, and why long-term storage of metrics is better achieved with another tool.

That said, over the past years many tools have started to compete, and many are still competing, to solve these problems and more.

The common components of a Prometheus installation are:

Prometheus
Blackbox
Exporters
AlertManager
PushGateway

HA and FT of Prometheus

Prometheus can use federation (hierarchical and cross-service), which lets you configure one Prometheus instance to scrape selected metrics from other Prometheus instances (https://prometheus.io/docs/prometheus/latest/federation/). This kind of solution works well when you want to expose only a subset of selected metrics to tools like Grafana, or when you want to aggregate cross-functional metrics (for example, business metrics from one Prometheus and a subset of service metrics from another, federated one). This is perfectly fine and covers many use cases, but it satisfies neither High Availability nor Fault Tolerance: we are still talking about a subset of metrics, and if one of the Prometheus instances goes down, its metrics will not be collected during the downtime. Making Prometheus HA and FT must be done differently: there is no native solution from the Prometheus project itself.
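
To make the federation setup concrete, here is a minimal sketch of the scrape job that a federating Prometheus could use, following the pattern in the federation documentation linked above. The target host names and the `match[]` selectors are hypothetical placeholders, not values from the original setup.

```yaml
# prometheus.yml on the federating instance (sketch, not a complete config)
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    # Keep the labels exposed by the source instances instead of overwriting them.
    honor_labels: true
    # Federation is served by each Prometheus on the /federate endpoint.
    metrics_path: '/federate'
    params:
      # Only series matching these selectors are pulled (hypothetical examples).
      'match[]':
        - '{job="business-metrics"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          # Hypothetical source Prometheus instances.
          - 'prometheus-business.example.internal:9090'
          - 'prometheus-services.example.internal:9090'
```

Because only the series selected by `match[]` are copied, and only while the source instance is up, this illustrates why federation alone provides neither HA nor FT.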

Prometheus can achieve HA and FT very simply, without complex clusters or consensus strategies.

What we have to do is duplicate the same configuration file, prometheus.yml, on two different instances configured in the same manner, which will scrape the same metrics from the same sources. The only difference is that instance A also monitors instance B, and vice versa. The good old concept of redundancy is easy to implement and solid, and with IaC (Infrastructure as Code, such as Terraform) and a CM (Configuration Manager, such as Ansible) it is also extremely easy to manage and maintain. You do not want to duplicate an extremely big and expensive instance; it is better to duplicate a small instance and keep only short-term metrics on it. This also makes the instances quick to recreate.
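
The redundant pair described above could be configured roughly as follows. This is a sketch under stated assumptions: the host names, job names, exporter port, and scrape interval are hypothetical, and only the peer-monitoring job differs between the two instances.

```yaml
# prometheus.yml on instance A (hypothetical host: prometheus-a.example.internal)
global:
  scrape_interval: 30s   # assumed value; use the same on both instances

scrape_configs:
  # Shared section: identical on instance A and instance B,
  # so both scrape the same metrics from the same sources.
  - job_name: 'node'
    static_configs:
      - targets:
          - 'app-1.example.internal:9100'
          - 'app-2.example.internal:9100'

  # The only difference: instance A monitors instance B.
  # On instance B this target is 'prometheus-a.example.internal:9090'.
  - job_name: 'prometheus-peer'
    static_configs:
      - targets:
          - 'prometheus-b.example.internal:9090'
```

Keeping the shared section in one template and rendering only the peer target per instance (for example with a configuration manager such as Ansible) keeps the two files from drifting apart.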