Fast Real-Time Granular Monitoring with the Tools You Already Have

As engineers, we often need real-time, accurate, and granular timings of our code. These are useful for optimizing data processing, general monitoring, incident response, and many other purposes. At Klaviyo we already have a robust infrastructure for timing processes at scale via our statsd pipeline, as described by John Meichle in his earlier post. For certain real-time monitoring requirements, however, we need a different system that uses granular rather than statistical timings. Statistical systems have two downsides that make them unsuitable for measuring individual events that have a wide primary key distribution and need specific alerting.

The first downside is that the number of metrics is limited. This means, for example, that we cannot have granular timings that include primary keys, so we cannot know how long a particular process took to execute for a particular set of data. We can only keep track of averages, percentiles, and other statistics for each metric that we create.

Another downside is that monitoring only occurs when we explicitly call the monitoring function for a particular metric in the code. This becomes a problem for long-running processes that need to be timed while they are running. It is possible to put a timer at every intermediate step of the process, but doing that for large chunks of code becomes cumbersome. Not only do we end up with many lines that look like timer.time('step_7_of_28_in_process_38'), but aggregating them on dashboards becomes a chore because we have to know every step of every process. Some steps may also take a very long time for just one line of code, such as a particularly heavy database insert, and we still cannot time those while they are happening, since our timer can only tell us how long the insert took after it finished.

When we combine these downsides, alerting in real time on a particular process stuck in the middle of an expensive query that takes tens to hundreds of minutes becomes incredibly difficult. We can only know after the fact that the process was stuck, and not in a granular way but in a statistical way that involves a lot of percentile math and potentially flappy, non-specific alerting.

There is, however, a better way to do this, and it can be accomplished using a few tools that most programmers already work with on a daily basis: threading, a centralized cache, and an alerting cron job that looks for the right data in the right place. This article describes how we implemented such a system for real-time alerting and outlines a path for building one from scratch for general use cases.

The monitoring thread

The first component of real-time granular monitoring is the monitoring thread. A background thread is invoked by surrounding the code we want to time with a context manager. The context manager starts a thread that counts how long it has been running, occasionally calling a save function that persists its elapsed-time integral. Once the code finishes running, the context manager stops the thread and records any leftover time in the integral, as shown in figure 1. It is easy to integrate time within a thread because its context is shared with its parent process, so accessing the leftover time after the thread is gone is just a matter of reading the memory that the thread was modifying while it ran.

Figure 1: Timer thread example implementation, including usage example
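
Since figure 1 is an image in the original post, here is a minimal sketch of what such a context manager and timer thread could look like in Python. The names (TimerThread, timed, save_fn) and the tick and flush intervals are illustrative assumptions, not Klaviyo's actual implementation.

```python
import threading
import time
from contextlib import contextmanager

class TimerThread(threading.Thread):
    """Background thread that tracks elapsed time for a block of code and
    periodically persists the running total via a caller-supplied save function."""

    def __init__(self, save_fn, tick=0.01, flush_every=1.0):
        super().__init__(daemon=True)
        self._save_fn = save_fn        # persists the elapsed-time integral
        self._tick = tick              # timer granularity, in seconds
        self._flush_every = flush_every
        self._stop_event = threading.Event()
        self.elapsed = 0.0             # shared memory readable by the parent

    def run(self):
        start = last_flush = time.monotonic()
        # Wake up every tick; exit as soon as the stop event is set.
        while not self._stop_event.wait(self._tick):
            now = time.monotonic()
            self.elapsed = now - start
            if now - last_flush >= self._flush_every:
                self._save_fn(self.elapsed)
                last_flush = now
        self.elapsed = time.monotonic() - start  # capture leftover time

    def stop(self):
        self._stop_event.set()
        self.join()

@contextmanager
def timed(save_fn):
    """Surround a block of code with a background timer thread."""
    timer = TimerThread(save_fn)
    timer.start()
    try:
        yield timer
    finally:
        timer.stop()
        save_fn(timer.elapsed)  # record any leftover time after the thread stops

# Usage example:
# with timed(lambda s: print(f"elapsed: {s:.3f}s")):
#     run_expensive_step()
```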

Depending on the granularity of the timer thread, this method can be as precise as we want; the tradeoff is how much CPU time we are willing to spend on timing versus how granular we want our timings to be. Production testing on modern cloud infrastructure has shown that a granularity of 10 milliseconds or even 1 millisecond does not add significant CPU load compared to regular code, which means our timings can be at least that granular.

When we started this project, we were hesitant to use threads because our codebase is primarily written in Python, which is not considered a thread-safe environment despite supporting threads. The reason Python often fails to thread properly is that many of its third-party libraries assume they will be run in a single-thread context. This means that using the same library to perform the same kinds of operations from multiple threads at the same time can lead to unexpected results at best and difficult-to-diagnose segmentation faults at worst. We ended up working around the problem by not writing to the cache directly from the thread and writing to a file instead, which is covered in more detail in the optimization section.

The monitoring infrastructure

Once the monitoring thread is in place, it needs to write somewhere. The natural choices are a database or a cache. When dealing with granular, real-time monitoring we usually do not care much about historical data: once something is done processing, we only care about it in statistical terms. The amount of data generated by granular monitoring is also considerably greater than that generated by statistical monitoring, because each of our primary keys has at least one distinct set of data associated with it, as opposed to being rolled up into an average of some sort. We can therefore use a relatively short-lived cache to store the results of the thread timings along with the primary keys of the entities being timed, and combine our real-time monitoring with regular database-based statistical monitoring to get historical statistics on the overall process.

For our system, we decided to use Redis Sentinel clusters to store the timing data, with a TTL of 30 days for any data point. This fixed TTL let us accurately estimate the amount of data the Redis clusters would contain at any given time, which made it easy to size the hardware for this approach. The main concern here was not data volume but request volume. Granular timings are usually a write-heavy, read-light workload, because the bulk of the work consists of accumulating the timings. When it comes time to read the data, a single process can query the combined results of all the granular data points quickly and cheaply.

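As a rough sketch, connecting to such a cluster and writing a data point with redis-py's Sentinel support might look like the following; the sentinel addresses and the service name "timings" are placeholders, not the actual production configuration.

```python
from redis.sentinel import Sentinel

# Placeholder sentinel addresses and service name.
sentinel = Sentinel(
    [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    socket_timeout=0.5,
)

# Writes go to the current master of the monitored service.
master = sentinel.master_for("timings", socket_timeout=0.5)

# Every data point carries the fixed 30-day TTL.
master.set("example_timing_key", 150, ex=30 * 24 * 3600)
```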

Figure 2: Monitoring infrastructure

To support a write-heavy workload we had to go with sharded Sentinel clusters, where each cluster was responsible for handling some percentage of the huge write firehose coming from the worker threads. The basic infrastructure pattern is illustrated in figure 2.

Publishing data

The hardest part of using multiple sharded clusters to absorb huge numbers of writes from many different worker threads working on different timer keys is designing the sharding scheme. We want to write in such a way that all the clusters are used evenly, but most of our primary key volumes follow a pareto distribution due to the nature of real-world data. Inevitably, some entities will use the system much more than others, and their primary keys will be considerably hotter in terms of worker usage. Sharding a timer storage cache on pareto-distributed keys will immediately produce hot shards, with one shard utilized far more than the others, nullifying most of the benefit of sharding.

To use all clusters evenly, we decided to shard on the key of the metric being written to the cache itself. This metric key differs between processes, for example by embedding batch_ids or process_ids. Sharding by metric keys results in an approximately uniform distribution, because such ids tend to be randomly generated. This gives roughly even load on all shards even when the primary keys of the data being timed are pareto distributed.

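A minimal sketch of that shard selection, assuming a list of per-cluster Redis clients; the shard_for name and the choice of md5 are our own illustrative assumptions:

```python
import hashlib

def shard_for(metric_key: str, shards: list):
    """Pick a shard by hashing the metric key. Because metric keys embed
    randomly generated ids (batch_id, process_id), the hash values spread
    evenly across shards even when entity primary keys are pareto-distributed."""
    digest = hashlib.md5(metric_key.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Usage: shard_for("timing_step_1:batch_1234", SHARDS).hincrby(...)
```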

However, this kind of distribution presents a problem when it comes time to read or query the data. As far as the reader of the data is concerned, the data is now randomly distributed between the shards and therefore there is no way of knowing which shard the data is on. Worse, the data’s key now contains a random element such as a batch_id or a process_id, so there is no way to find that data without scanning all keys on all shards every time we need to read.

There are a few solutions to this conundrum. We can simply implement fast data scanning on our readers and actually go through all the data on every read. As mentioned before, the amount of data ends up being small compared to the number of writes, since the writes are being aggregated and the data expires over time. We can also create our own secondary index, stored in the cache as well, so that we know which random ids to look for when reading data for a particular primary key.

Figure 3: Namespace keys and random record keys

In our case, though, we hit upon a much simpler solution, illustrated in figure 3. We split the key into two parts: a namespace key and a record key. The namespace key would be the same on all shards and would contain the primary key of the data. The record key would be different for every data point and would contain the random element. Redis allowed us to implement this approach using its hashes: the key of the hash would be the namespace key, the field would be the record key, and the data would be stored as the field's value. We could use the HSET and HINCRBY operations to make both writing to and reading from the shards extremely fast.

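A rough sketch of the write path under this schema; the key formats mirror the examples in the next section, while the elapsed value and random id are made up:

```python
import redis

r = redis.Redis()  # in practice, the shard client returned by shard_for(...)

namespace_key = "granular_timings:entity:12345"  # identical on every shard
record_key = "timing_step_1:8f2c91"              # random_id suffix is illustrative

# Accumulate elapsed milliseconds for this record inside the namespace hash.
r.hincrby(namespace_key, record_key, 150)

# Let the whole namespace age out on its own (the fixed 30-day TTL).
r.expire(namespace_key, 30 * 24 * 3600)
```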

Consuming data

Given our new namespace and record key schema, reading the data from the cache became very simple. We needed to determine the namespace key for a given primary key, then query all the shards for the contents of that namespace key using HGETALL. Once in memory, we could pull out and use the record keys, which contained information about the kind of data stored in the key as well as the unpredictable random element. A namespace key looks something like "granular_timings:entity:primary_key_id" and a record key looks something like "timing_step_1:random_id". After stripping the random id at the end, we can use the remainder of the record key to determine that the value of that key is the timing for step 1 of our timed process.

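A minimal sketch of that read path, assuming the same illustrative key formats and a list of shard clients:

```python
from collections import defaultdict

def read_timings(primary_key_id, shards):
    """Fold every shard's copy of the entity's namespace hash into per-step
    timings. Record keys look like 'timing_step_1:<random_id>'."""
    namespace_key = f"granular_timings:entity:{primary_key_id}"
    timings = defaultdict(int)
    for shard in shards:
        for record_key, value in shard.hgetall(namespace_key).items():
            # Strip the random id; the remainder names the timed step.
            step, _, _random_id = record_key.decode().rpartition(":")
            timings[step] += int(value)
    return dict(timings)  # e.g. {"timing_step_1": 150, ...}
```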

A cron pulling this kind of data can very quickly get granular, real-time timings for a particular entity, which is the main goal of the project. We were able to get the timings we needed and set up our alerting infrastructure to use this new information for real-time alerts. However, we soon ran into issues with both our usage of threads and the extremely high write load on our Redis Sentinel shards.

Optimization

When the worker threads wrote data to the Redis clusters, they did so in high volume, mostly incrementing the same keys. Because of the pareto-distributed primary keys mentioned earlier, certain keys received a very large number of increments. Threads would also write to the same key over and over simply because of the flush mechanism needed to keep the data real-time.

All these writes caused two problems in production. The first was that the threads were causing segfaults, because they were using the Redis driver at the same time as the core application was using it for other things. The second was that, even with an even key distribution, the load on the sharded clusters was high enough that they were overheating during the peak periods of our application's usage.

To solve both problems, we had to optimize the threads so that they would write less and would not use the Redis driver directly. We wrote an accumulator, a global dictionary that accumulates writes within a given worker process. This let us consolidate writes not only within a thread but also across different invocations of the context manager, and therefore across different threads. The threads increment counters inside the accumulator's dictionary, and the accumulator itself flushes to Redis. This handily solved the case where some workloads finished so quickly that their threads barely had time to record timings before needing to flush to Redis, resulting in a huge number of writes: previously each step was timed individually and each thread wrote to Redis individually, but with the accumulator the flush only happens at the end of the process or when the flush interval arrives, whichever comes first. This, along with the accumulation of the increments themselves, already greatly reduced our write load.

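A sketch of what such an accumulator could look like; the module-level dictionary, the lock, and the file path are illustrative assumptions:

```python
import threading
from collections import defaultdict

# Process-global accumulator shared by every timer thread in this worker.
_accumulator = defaultdict(int)
_lock = threading.Lock()

def record(namespace_key, record_key, elapsed_ms):
    """Timer threads increment in memory instead of touching Redis directly."""
    with _lock:
        _accumulator[(namespace_key, record_key)] += elapsed_ms

def flush(path="/var/tmp/timings.out"):
    """Drain the accumulated increments to a local file for the cron to pick up."""
    with _lock:
        snapshot = dict(_accumulator)
        _accumulator.clear()
    with open(path, "a") as f:
        for (namespace_key, record_key), ms in snapshot.items():
            f.write(f"{namespace_key}\t{record_key}\t{ms}\n")
```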

Figure 4: Optimized system

The next step was to have the accumulator flush to the file system on the worker and have a local cron pick up the resulting files, accumulate them once more, and insert them all into Redis in a single pipeline. The completed system is shown in figure 4. This solved the segfault issue, because Redis was no longer used directly for timing inside the worker process or timer thread. It also drastically reduced the overhead of Redis timing operations: the cron now only needs to establish a single connection and use a single pipeline to write all the timings from all the timer threads on all the worker processes on that box between write intervals. This reduced the load on our Redis timer clusters by two orders of magnitude, making the timing system much faster and more scalable as a whole. Our dream of granular, real-time timers that are easy to use for alerting purposes was finally realized.

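A sketch of the cron-side flush, reusing the hypothetical tab-separated file format from the accumulator sketch above:

```python
import glob
import os
from collections import defaultdict

import redis

def flush_files_to_redis(pattern="/var/tmp/timings-*.out"):
    """Re-aggregate every worker's flushed file and write the totals to Redis
    in a single pipeline over a single connection."""
    totals = defaultdict(int)
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                namespace_key, record_key, ms = line.rstrip("\n").split("\t")
                totals[(namespace_key, record_key)] += int(ms)
        os.remove(path)  # sketch only: real code would guard against races

    pipe = redis.Redis().pipeline()
    for (namespace_key, record_key), ms in totals.items():
        pipe.hincrby(namespace_key, record_key, ms)
        pipe.expire(namespace_key, 30 * 24 * 3600)  # keep the 30-day TTL
    pipe.execute()
```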

Conclusion

This timing method is best suited to situations where we need to know the precise timings for a particular primary key with respect to a specific process, and we need to know them before the process finishes executing. In our case this usually means alerting on slow SQL queries, but it is generally applicable to any tech stack with threading and any process that meets these criteria.

It is important to note that this kind of data can grow very quickly, because each new primary key produces a new set of data and the data therefore cannot be compacted like regular time-series data. For this reason, it is best used in conjunction with a statistical monitoring system that keeps historic data, while expiring this kind of data on an aggressive schedule. We must use the right tool for the right job. It may be possible to aggregate this data into historical data for storage, but that would add a layer of complexity: we would need additional business logic to determine which statistical keys the primary key data rolls into, some process would have to do this asynchronously, and the application would then have to query both types of data, because it is not easy to determine when a set of specific data has been turned into statistical data. For this reason, we recommend treating this data as a temporary but highly relevant cache for a real-time alerting process.

These patterns were implemented on a Python stack with Redis Sentinel clusters, but they are highly applicable to any stack that needs real-time process monitoring. We have tried to show that all that is needed for a reliable and scalable real-time granular monitoring system is a threading implementation, a centralized cache, and an asynchronous monitoring process that reads and makes use of this data.

Translated from: https://klaviyo.tech/fast-real-time-granular-monitoring-with-the-tools-you-already-have-be33661008b3
