How not to practice observability

Hi folks, hope you're all having a great day. I'm Anand and I'm part of the leadership team at ManageEngine, one of the three major divisions of Zoho Corporation. ManageEngine is a suite of more than 120 award-winning products focused on enterprise IT management, and we have built deep expertise in this domain over the last 25 years.

I'm here to share how not to do observability, based on our own experience across the many products we have built inside Zoho Corporation. We often feel it is better to know the anti-patterns, because they help you spot gaps of varying natures, and most of the time those gaps eventually turn into people issues if left unattended for too long. We will get into those details gradually. But first, are we clear on what observability actually is?

The first and most fundamental concept to understand is that monitoring is not observability. Monitoring is largely reactive: it relies on alerts and events, and most of those alerts are generated from thresholds configured on manual assumptions, which leads to poor outcomes in many situations. Observability adds a proactive, dynamic layer on top of monitoring by relying heavily on historical data. It also gives developers a platform for reasoning, so they can make the decisions needed to bring about change in the existing ecosystem.

That all sounds a bit abstract, so let's consider a simple analogy to understand observability better. Think of observability as the brain in the human body. The brain keeps the heart functioning effectively by reacting to situations based on past experience. In a similar manner, an observability solution learns from your entire ecosystem and presents you with the right set of options when you face a situation that closely resembles something that happened in the past.

With this idea of observability in mind, we will dive into the core tenets you should be aware of when bringing observability solutions into your ecosystem, and the challenges you may face along the way. I will break these down from various angles: usability, reliability, cost efficiency, and of course the human quotient attached to all of them.

There is a widespread assumption that observability gets better the more information you add to it. That may hold in theory, but in practice it is not always true. The quality of your observability improves when you sample the right data at the right intervals.

Here is a real-world analogy: how much water do you carry on a hike? Carry too much and you get bogged down by the weight, may not travel far, and could lose out on the hike itself. Carry too little and you run out of fluids before the end of the day and get bogged down again. You need exactly the right amount of water to reach the peak at the optimal rate.

Now map that to a classic CPU-utilization problem in a high-traffic application, where CPU utilization is sampled at 5-minute intervals. In many cases you get 4 minutes at 25% CPU and a spike (say, to 100%) in every 5th minute, so the averaged value for the window is (4 × 25 + 100) / 5 = 40%. If this repeats all day, you end up with more than 200 spikes you never saw - there are 288 five-minute windows in a day - and that is a real problem when your customers experience application slowness during every one of those occasional spikes.

On the other extreme, if you sample too aggressively - say every 10 seconds - you see a lot of CPU spikes and tend to over-provision your systems. We have seen customers at both ends of the spectrum, but the ones sampling at 10 seconds end up storing a lot of data, over-provisioning the storage behind their observability solution, and, if they are sending the data to a SaaS-based solution, overpaying for that as well.
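To make the arithmetic concrete, here is a minimal sketch in Python (my own illustration, not something from the talk); the workload shape and the 25%/100% numbers are the assumptions from the example above.

```python
# Minimal sketch (not from the talk): how 5-minute averaging hides a recurring spike.
# The workload and numbers below are illustrative assumptions, not real customer data.

def five_minute_average(per_minute_samples):
    """Average per-minute CPU readings over a 5-minute window."""
    return sum(per_minute_samples) / len(per_minute_samples)

# Four minutes at 25% CPU, then a 100% spike in the fifth minute.
window = [25, 25, 25, 25, 100]

print(five_minute_average(window))   # 40.0 -> looks healthy on a dashboard
print(max(window))                   # 100  -> the spike the 5-minute average hides

# Repeated all day, that is 24 * 60 / 5 = 288 windows, i.e. ~288 hidden spikes.
spikes_per_day = 24 * 60 // 5
print(spikes_per_day)                # 288
```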

Now that we understand how the sampling problem clusters and how to sort it out, we move on to the real world: creating dashboards and visualizing these metrics. Let's first admit that most of us still need to learn how to create a good dashboard; this is an acute skill gap we have observed across many customers as well.

We have seen customers who feel that if there is no dashboard to answer a question, the question has no answer at all. This is not an easy problem to solve, because there is a large human quotient attached to it. So let's get this right: dashboards are meant to answer questions that come up frequently. You are not supposed to create a dashboard for something that will rarely happen and will rarely be referred to.

Why am I stating this? Every dashboard you create is a piece of technical debt you carry forward into the next set of incidents of a similar nature. Say you have a network issue on a regular basis and you put the RX/TX ratio into almost 15 dashboards, one more every time you see the issue. Then whenever you face that issue again, you end up wading through 15 to 20 dashboards each time, and at the end of the day you create yet another dashboard.

So you should always measure the usage of your dashboards - how frequently each one is referred to and how many people are looking at it - before iterating on an existing dashboard or creating a new one.
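As a rough illustration of what measuring dashboard usage could look like, here is a small sketch; the DashboardStats fields, dashboard names, and thresholds are hypothetical, not a ManageEngine feature.

```python
# Minimal sketch (my own illustration): audit dashboard usage before deciding
# to iterate on an existing dashboard or create a new one.
from dataclasses import dataclass

@dataclass
class DashboardStats:
    name: str
    views_last_90_days: int
    distinct_viewers: int

def candidates_for_retirement(dashboards, min_views=10, min_viewers=2):
    """Return dashboards that are rarely viewed, or viewed by very few people."""
    return [
        d for d in dashboards
        if d.views_last_90_days < min_views or d.distinct_viewers < min_viewers
    ]

stats = [
    DashboardStats("network-rx-tx-overview", 4, 1),
    DashboardStats("checkout-latency", 220, 14),
]
for d in candidates_for_retirement(stats):
    print(f"Consider retiring or merging: {d.name}")
```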

As you dig deeper into the dashboard problem, there is another human quotient attached to it: assumptions. One assumption that kills the overall observability story is assuming that because the parts look fine, everything is fine.

As the depiction here shows, a lot of our customers tell us they are fine on end-to-end observability. Let me pick one part of a synthetic transaction as a sample scenario: an Amazon-style cart process, where you pick items, put them in the cart, go through checkout, and get an order ID. You capture that entire sequence of transactions in a script and replay it again and again using a bot.

You replay these transactions every 5, 10, or 15 minutes, so you get a success rate for every 5-, 10-, or 15-minute window. But what happens if a customer hits a failure inside one of those windows? You may also fail to capture a performance issue that customers experience in between those runs.
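Here is a minimal sketch of the kind of scripted replay described above, to make the blind spot visible; the shop.example.com endpoints, paths, and 5-minute interval are placeholder assumptions.

```python
# Minimal sketch (illustrative only): a synthetic check that replays a cart/checkout
# flow every 5 minutes. Endpoints and payloads are hypothetical placeholders.
import time
import urllib.request
import urllib.error

CHECK_INTERVAL_SECONDS = 300  # one run every 5 minutes

def run_transaction(base_url="https://shop.example.com"):
    """Replay the scripted flow: add an item to the cart, then check out."""
    try:
        for path in ("/cart/add?item=123", "/checkout"):
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                if resp.status >= 400:
                    return False
        return True
    except (urllib.error.URLError, OSError):
        return False

while True:
    ok = run_transaction()
    print("synthetic check:", "PASS" if ok else "FAIL")
    # Anything that breaks and recovers inside this sleep window is invisible
    # to the bot - which is exactly the blind spot described above.
    time.sleep(CHECK_INTERVAL_SECONDS)
```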

So it is important to cover your entire application layer by layer and metric by metric, completely captured and linked together. Why do I say this? When you are faced with an incident and have to put a lot of effort into linking all the data just to build an RCA, your mean time to resolution gets a lot longer.

So what we suggest is to link all the data as much as possible, ideally within a single solution. If you are using multiple tools, make sure the data is linked across them, so you don't spend the middle of an incident trying to correlate data from different sources.
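One common way to keep data linkable across tools (an assumption of mine, not necessarily the speaker's method) is to stamp every log line and metric with the same correlation ID, as sketched below; the record_metric helper and field names are hypothetical.

```python
# Minimal sketch (assumed approach): attach one correlation ID per request to both
# logs and metrics so data from different sources can be joined during an incident.
import logging
import uuid

logging.basicConfig(format="%(asctime)s corr=%(corr_id)s %(message)s", level=logging.INFO)
log = logging.getLogger("checkout")

def record_metric(name, value, tags):
    """Stand-in for whatever metrics client you use (hypothetical)."""
    print(name, value, tags)

def handle_request():
    corr_id = str(uuid.uuid4())          # one ID per request, propagated everywhere
    extra = {"corr_id": corr_id}
    log.info("order received", extra=extra)
    record_metric("checkout.latency_ms", 182, tags={"corr_id": corr_id})
    log.info("order completed", extra=extra)

handle_request()
```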

Now you have fixed the sampling rate and you know how to create your dashboards. What comes next? The operational aspects. Think about a simple configuration problem: if whatever configuration you set up to trigger an alert or incident is not interpreted the same way by your engineer, you are effectively training them not to react, and they will tend to ignore it when a similar scenario recurs.

In other words, you are training the engineer to treat it as a false positive in future scenarios. And remember that false positives drain the sanity of your engineers and the health of the overall ecosystem. Keep an eye on one particular metric, the noise-to-signal ratio, to make sure you stay within tolerable limits and that your customers and the engineers handling the incidents remain in harmony.

One more thing to keep a tab on: every alerting or incident configuration you create drives spend on SMS, voice alerts, and ticketing-management systems, none of which come for free. So make your alerting configurations as optimal as possible.
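A back-of-the-envelope sketch of tracking both the noise-to-signal ratio and the notification spend might look like this; the alert records and the per-SMS and per-call prices are invented for illustration.

```python
# Minimal sketch (illustrative assumptions): compute the noise-to-signal ratio of
# your alerts and the notification cost they generate. Prices and fields are made up.

alerts = [
    {"actionable": False, "sms_sent": 2, "calls_made": 0},
    {"actionable": True,  "sms_sent": 1, "calls_made": 1},
    {"actionable": False, "sms_sent": 3, "calls_made": 0},
]

SMS_COST = 0.05    # per message, hypothetical
CALL_COST = 0.25   # per voice alert, hypothetical

noise = sum(1 for a in alerts if not a["actionable"])
signal = sum(1 for a in alerts if a["actionable"])

noise_to_signal = noise / signal if signal else float("inf")
notification_cost = sum(a["sms_sent"] * SMS_COST + a["calls_made"] * CALL_COST for a in alerts)

print(f"noise-to-signal ratio: {noise_to_signal:.2f}")
print(f"notification spend:    ${notification_cost:.2f}")
```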

If you dig deep into this problem of misconfiguration creating too many alerts, as we have done inside our organization, you reach the same conclusion we did: it tends to get out of hand whenever configuration is over-centralized for any part of the ecosystem.

Observability works best when it is available for everyone to look into and everyone can draw insights from it. In most scenarios, the best practices for observing one application do not carry over to another application.

One application may be IOPS-heavy, another memory-heavy. And even within the memory-heavy case, the application may be built on different technologies: it may run as a single instance, as a cluster, or in containers. Each and every application is different.

So it is best left to the teams that own each application to make the call, so they don't end up with infrastructure bloat created by the "dev guru" syndrome, where a few individuals push their preferred best practices across the whole organization.

The objective of those gurus should instead be to build the platform the entire organization uses to contribute to observability. Otherwise, a major headache I would suggest you watch for is this: when you have a handful of people treated as "dev gurus", hoarding of data starts to happen, and that data becomes inaccessible to you exactly when you are handling incidents.

There is one major misconception about observability, which is to bring in a lot of access restrictions. In the era of containerization and microservices, you should always make sure cross-team observability is possible, so that getting into the details and solving problems is not delayed by process.

There is another scenario where the data is stored in multiple tools across the organization and you cannot get access to all of it. This is where the recent trend of platform engineering is taking off, with platform teams taking care of exactly this data-unification problem.

While platform engineering teams are busy with data unification, one big thing they also have to think about is the single point of failure in capturing the data needed for observability. The failure could be in the agent capturing the data, or it could be a network failure caused by something like a firewall configuration. You should have effective failover and high-availability mechanisms to ensure that the data required for observability is always available.
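One simple failure mechanism (a sketch under my own assumptions, not any particular product's agent) is to buffer samples locally when the collector is unreachable and flush them once it comes back.

```python
# Minimal sketch (assumed design): buffer samples locally when the collector is
# unreachable and flush later, so a network or firewall hiccup doesn't leave a
# gap in the observability data.
import collections

class BufferingAgent:
    def __init__(self, send_fn, max_buffered=10_000):
        self.send_fn = send_fn
        self.buffer = collections.deque(maxlen=max_buffered)  # drop oldest if full

    def report(self, sample):
        self.buffer.append(sample)
        self.flush()

    def flush(self):
        while self.buffer:
            sample = self.buffer[0]
            try:
                self.send_fn(sample)   # e.g. an HTTP POST to the collector
            except ConnectionError:
                return                 # keep everything buffered, retry next time
            self.buffer.popleft()
```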

There are also scenarios where the observability system itself crashes the very system it is supposed to observe. This should be engineered for: put circuit breakers in place, and apply chaos-engineering practices to your observability solutions as well.
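A circuit breaker around telemetry emission could be as small as the sketch below (an assumed design, not from the talk): after a few consecutive failures it stops calling the backend for a cooldown period, so the observability path cannot take the application down with it.

```python
# Minimal sketch (assumed design): a tiny circuit breaker around telemetry emission,
# so a misbehaving observability backend can't drag down the application it observes.
import time

class TelemetryCircuitBreaker:
    def __init__(self, emit_fn, failure_threshold=5, cooldown_seconds=60):
        self.emit_fn = emit_fn
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def emit(self, event):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_seconds:
                return                  # circuit open: drop telemetry, protect the app
            self.opened_at = None       # cooldown over: try again
            self.failures = 0
        try:
            self.emit_fn(event)
            self.failures = 0
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
```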

Looking at this problem in depth, there are also places where we tend to over-delegate responsibilities. For example, when people move from shift 1 to shift 2, we expect a knowledge transfer and for the new shift to own a pile of data, and what happens is they start reinventing the wheel in an attempt to make things better.

Another pattern we have seen, including in our own platform teams, is trying out new tools just because they are pretty cool. One mantra to keep in mind when trying out new tools: tools don't solve problems, people do. Whenever you adopt a new tool, you will never get the full benefit out of it without changing how you work internally.

I bring up the internal factor because adopting a new tool into your organization means answering a lot of questions: who is going to manage it, who is going to take care of capturing the metrics, who is going to look at it on a regular basis, and which other details you need visibility into before you can make a decision on changing tools.

All of this sounds complicated, I agree. But that doesn't mean you should never change tools; whatever the scenario, we would not suggest sticking with a tool you overpay for and underuse.

This is where we at ManageEngine have pioneered over the last 25 years, building the right set of tools at the right price.

I hope you got some valuable information today. Thanks for coming, and if you are eager to understand how we do observability across multiple products at Zoho Corporation, and how we solve observability problems for more than 280,000 customers across the globe, please visit us at Booth 406 for some inside information. Have a great time reinventing your ideologies. Thank you for coming!
