Observability best practices at Amazon


Hello, everybody. Thank you for coming today to observe this talk on observability. It's one of my favorite topics, and I assume it's also Ian's favorite topic. I'm David, and we are here to share some stories about Amazon's never-ending journey toward the perfect granularity of observability.

This session covers the full range of monitoring and observability at Amazon, from really understanding our customer experience and using metrics to observe and empathize with what our customers are experiencing on the systems that we run. We'll also get into troubleshooting: we're going to go through some stories about how we troubleshoot and find the needle in the haystack using our observability tools. And then we'll talk about how we measure from everywhere in order to understand exactly what our customers are seeing, and how we're always questioning whether we have the right perspective on our operations.

At Amazon, we are just obsessed with observability. So this will hopefully give you an understanding as to why that is.

In this chapter of our session, I'm going to cover some grounding in how we think about observability at Amazon, some of the places we use our observability systems, and how that comes together to drive a cultural flywheel of improved operations, which in turn leads to a better customer experience.

I like to think of observability less as a way to view the running system and more as a way to view the customer experience. We observe things like CPU and memory, but when we think about characteristics of the system such as availability and latency, that's all about measuring and empathizing with what our customers are seeing in our services.

At Amazon, my role for the past 10 years has been to build better and better observability systems for my customers. So during this talk, I'm going to bring in some examples of the systems I've worked on, some of the problems I've solved and how we brought some of those solutions into CloudWatch so you can apply them too.

As builders at Amazon, we own and operate all the services we build in a DevOps model. So in my specific role, I'm on call for the CloudWatch team. As such, I carry a pager, or a pager app as it is now. I must say that when I joined Amazon, that kind of ownership was new to me, and it took a lot to learn. But having seen its benefits firsthand for the last decade, I'm convinced that being closer to the operational culture of our services allows us to build better services from day one.

Just out of curiosity, who here is on call for their services? Cool, quite a few people. In terms of using our observability, one of the most obvious places is when incidents occur and alarms start to fire. We make sure that our alarms are actionable and not too noisy, but engage our teams before there's customer impact.

During the operational event, our teams use their observability systems, in the form of metrics, logs and traces, to troubleshoot and pinpoint the specific issue that needs attention. When they've found that specific issue, they use a range of tools to return the system to good health. But that's what happens in the middle of an event.

After an event, we use our observability data to ask ourselves questions and examine whether we could have done better. This cultural aspect of continually looking to improve our operations and processes is what we call the correction of error process, or COE, more commonly known as a retrospective or post-mortem.

In these reviews, we dive very deep into an operational event and ask ourselves 5 whys to get to the various root causes and look for opportunities to improve. If you're interested, this 5 whys technique was first used in car manufacturing at Toyota, and it has since been incorporated into multiple methodologies such as Kaizen and lean manufacturing. So if you want some structure around root cause analysis, it's well worth looking at.

In order to answer these 5 whys, we use metrics and log analysis to gather evidence about the systems we're analyzing as part of this process. We ask questions of our observability data, we come away with new learnings, and we often find that we need to further refine our telemetry to gather new data that lets us mitigate the issue quicker next time.

Another place we heavily use our observability data is in operational reviews. These meetings happen at various levels across the organization. They start in individual teams, which we call two-pizza teams, work upwards through organizational-level reviews, and culminate in an AWS-wide operations review every Wednesday. This meeting is attended by thousands of engineers and our senior leadership so that together we remain close to the operational culture and experience that our services are offering.

But what's common in all of these meetings is that we review and inspect the dashboards that best represent the customer experience of the service being reviewed. We ask lots of questions to see if we can further improve our understanding and refine the customer experience. We also review the COEs I previously mentioned, share best practices from across the company and celebrate operational wins.

Now, Amazon gathers huge amounts of observability data, so I felt that sharing some numbers about the amount of data we gather would help give a sense of the importance we place on observability at Amazon.

As some of you may know, CloudWatch powers observability at Amazon. As of this month, CloudWatch is ingesting over 9 quadrillion metric observations and over 5 exabytes of log data per month.

Now, if you're like me, comprehending such enormous numbers is difficult. A quadrillion is a 1 with 15 zeros after it. Another way to think about it: this month the world's population passed a significant milestone, with over 8 billion people on Earth. So if you were to evenly allocate metric observations across every person on the planet, CloudWatch would have over a million observations per person per month. It's pretty mind-blowing scale.
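
As a quick back-of-the-envelope check on that claim, using the rounded figures from the talk:

```python
# Rounded figures from the talk.
observations_per_month = 9 * 10**15   # ~9 quadrillion metric observations
people_on_earth = 8 * 10**9           # ~8 billion people

print(f"{observations_per_month / people_on_earth:,.0f} observations per person per month")
# -> 1,125,000, i.e. over a million per person per month
```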

However, collecting this volume of data takes the guesswork out of deciding whether our customers are getting a great experience. And in my view, it speaks to and empowers many of our core leadership principles at Amazon, such as being customer obsessed, being right a lot, diving deep and insisting on the highest standards.

What I've highlighted in the past few minutes can be represented as a cultural flywheel of improved operations. This flywheel is turned by our observability systems, which in turn require us to have instrumented our services. Instrumenting our services generates telemetry data in the form of metrics, logs and traces, which is often consumed in the form of alarms and dashboards. This then allows us to ask better questions of our system.

But a key point, and we'll see this quite a few times in the talk, is that it's really important to keep the act of adding new instrumentation simple, because we need to do it all the time, not just once. As we operate our systems, we gain experience in them and, more often than not, we learn what matters to customers. And we use that observability to alarm, to triangulate, to mitigate the impact and to ask further questions.

More often than not, we generate new learnings. So we have to enhance our observability data to get the key metrics we need, mitigate the issue, and speed up the learning cycle so we can resolve issues quicker next time.

Given that our culture of operations and our observability systems are so interlinked, it can be challenging to say which came first. In my view, the culture of operational understanding and improving the customer experience came first, and that led us to build the observability systems we use today, which we also offer in CloudWatch so many of you can use them in your own services.

Next, we're going to go top to bottom through how we use observability tools, following an example of how we find the root cause of something, whether it's an operational incident or just a question we happen to have.

We'll follow this journey using the needle-in-the-haystack metaphor, and we'll use this landscape to paint the picture.

I wanted to show the needle-in-the-haystack story essentially. There are many entry points into the observability journey. Maybe I just have a question, or maybe we're reviewing something in our operations meeting and wondering why the graph looks the way it does. But often the entry point is that an alarm goes off, and that alerts us to the existence of a needle we didn't otherwise know was there.

So now we know there's a needle because of the alarm. But in a distributed system, the reality is we don't even know which haystack to look in. These are complex systems, or at least systems with many components, so we don't actually know where to go.

So what do we use for that? Well, we might look at a dashboard. At Amazon, we hand-curate dashboards with just the key metrics and indicators and some sense of where we should be looking. From there, we might say, okay, this is the metric on the dashboard that shows me which haystack, which component, to look at.

But maybe we look there and realize that wasn't the right haystack, so we have to keep looking and use other tools. Maybe we use a service map. A service map is built from trace data, and it summarizes which parts of a distributed system look healthy from their metrics and which are in alarm.

So we look at the service map and it shows us, aha, now we know which haystack to look in. From there, we'll pull up the dashboard for that microservice, or whatever size of application it is, and use the metrics on it to break things down and find which part of the haystack to look in next.

And then finally, we'll do some log analysis to really sift through everything and get to: okay, here is the actual exception or log statement that shows exactly what went wrong. That's the needle. So let's talk about how we get there.

There are a bunch of concrete steps I want to go into in more detail, because there are a lot of tools and techniques involved. It's not as easy as just zooming into a haystack.

In this section, we're going to briefly discuss tracing, or distributed tracing. Continuing with our haystack metaphor, I'm going to provide some tips on how we use tracing and dashboards at Amazon to find the right haystack. Sometimes we find it via tracing, sometimes directly via a dashboard, and sometimes it's a combination of both.

This diagram provides a view of a typical distributed system many of you may encounter in your day-to-day work. There's a load balancer in front, behind that a front-end fleet, it uses some Lambda, there's queuing and there's a polling fleet. However, even in this relatively simple service, there are multiple places where performance issues can occur, and finding the needle isn't trivial. This is where tracing can be of benefit.

Distributed tracing allows for deep insight into how requests are traversing through our distributed system. We enable tracing by attaching an identifier we call a trace ID onto each request. As the request traverses through the system, we pass the trace ID between each component very much like a baton in a relay race.

In this diagram, we can see the ID being passed across the system. As the trace ID is passed around, we capture timing information and metrics on the performance of the system, which allows us to spot early warning signs that performance bottlenecks may be occurring.
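
A minimal sketch of that baton-passing, using a plain HTTP call and the X-Amzn-Trace-Id header that X-Ray uses; the downstream URL and the fallback ID format here are simplified placeholders, and a real service would normally let the X-Ray SDK manage this for it:

```python
import uuid
import urllib.request

def handle_request(incoming_headers: dict) -> None:
    # Reuse the caller's trace ID if one arrived; otherwise start a new trace.
    trace_id = incoming_headers.get(
        "X-Amzn-Trace-Id",
        f"Root=1-{uuid.uuid4().hex[:8]}-{uuid.uuid4().hex[:24]}",
    )

    # Hand the same trace ID to the downstream dependency, like a baton,
    # so its timings land in the same trace. The URL is a placeholder.
    request = urllib.request.Request(
        "https://downstream.example.com/get-product",
        headers={"X-Amzn-Trace-Id": trace_id},
    )
    urllib.request.urlopen(request)
```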

Because tracing collects timings and metrics around the interactions across the system, distributed tracing allows for some powerful summarizations. Using tracing, we can dive deep into a specific trace and understand how that request behaved as it traversed the system. But you can also gather all of the traces into a single view in X-Ray, which we call the service map, and that lets you easily troubleshoot the full distributed system.

Tracing lets you find which haystack you're having trouble with. And then it lets you jump into the dashboard for that specific haystack or service, which is a good segue into talking about dashboards.

Now, at Amazon, we love our dashboards. Every team I know has literally dozens of dashboards to represent the various aspects of their services. So in this section, I'm going to discuss how we think about dashboarding at Amazon overall.

For even the simplest of services, we use many unique dashboards to provide a holistic view of the performance and health of our services. This view allows us to understand how our systems are operating and behaving from different perspectives and over different time intervals.

We work backwards from the customer to create a hierarchy of dashboards, starting with the top-level customer experience, with each subsequent level of dashboard delving into a lower-level aspect of the service.

During operational events, people frequently assume different roles in the process of resolving an issue. Business leaders are busy communicating and informing our customers of the ongoing impact, while our service teams and engineering teams are busy working to mitigate the underlying issue and restore service to good health as quickly as possible.

For both groups, having a top-level view of the customer experience is a great starting point. It allows business leaders to communicate effectively with our customers, and the tech teams use it to understand whether the actions they're taking are improving the customer experience.

A good customer experience dashboard focuses on the key customer experiences the service offers, but also makes the breadth and depth of any customer impact very clear.

The entry points to our web-based services are typically UI or API endpoints. So we have a dedicated system-level dashboard that contains enough data for our operators to see how the system is operating and how its customer-facing endpoints are behaving.

We also build microservice dashboards to facilitate fast and comprehensive evaluation of the customer experience within a single service instance or partition. This narrow view ensures that our operators are not distracted by irrelevant information during an operational event. It may also include a view into the other microservices that this specific microservice depends on.

We also build dashboards that enable our teams to view the amount of resources our services are using. This allows them to do longer term capacity planning and forecasting of their services, for example, to ensure they have enough compute and storage.

As you can see, there are lots and lots of dashboards. In fact, there are a lot more than I'm showing here. But the key thing is that during an operational event, which always seems to happen at 3 in the morning, you quickly have the data you need at hand without having to go searching for the right metric, which can be quite stressful.

At Amazon, we've been building dashboards for decades. So during this time, we've learned what works best for our teams. I'm going to briefly highlight some of the key points here:

  • We have found that time series graphs work best for our teams. There are times when displaying a single number can be useful, but the value of a time series graph is that it allows our teams to easily visualize when the performance of a system has changed, and by how much.

  • We ensure we place the most critical data at the top of our dashboards: in this case availability, latency and request rate. Less important metrics can be placed lower down on the dashboard. Again, during an operational event, which can happen at any time of day, we want to make it easy for our teams to see all the critical data.

  • We ensure our dashboards always have the critical data visible and are not too wide, so that during an event at 3am we don't have to scroll horizontally.

  • Time zone mapping is difficult. We're a multinational company operating in different regions, and we want to avoid doing that mapping in our heads. So having consistent time zones, and displaying the time zone on our dashboards, is really useful; it makes it easier to correlate across teams and across events. In this example, we're using UTC.

  • We also annotate our graphs with SLAs and SLOs so our teams can understand what good looks like for our specific metrics.

  • Similarly, you can't assume you're the only person using your dashboards at any time, so make it easy for your teams by adding descriptive text to your dashboards.

  • We want to make sure that we don't overload the y-axis by having multiple time series at very different scales, as that makes it very difficult to compare two similar metrics.

A typical example would be plotting the minimum and the maximum of a metric on the same graph when they are orders of magnitude apart; it's not very useful. So keep the y-axis range narrow.

Going back to keeping things simple for your teams during operational events, you want to avoid clutter on your dashboards. Keep your dashboards simple; break them into multiple dashboards if you need to, and don't overload a single dashboard with hundreds and hundreds of graphs.

However, one of the key things, again, is that you don't want to go fishing for metrics during an operational event. So build a detailed set of dashboards in advance, before your services go live. It's important that you can get the things you've learned from previous events onto your dashboards, which is why it's critical to build a culture of continually inspecting and refining them.

One way to do this is to manage your dashboards programmatically with an infrastructure-as-code approach. In my organization, most teams have so many dashboards that we can't review them all in a single meeting, so we've created a wheel of fortune where each service has a specific slot on the wheel. At the beginning of our operational meeting, we spin that wheel and randomly select two services to audit. This random approach works well for us because it ensures each team needs to be prepared to speak to their dashboards during operational reviews, which in turn requires them to audit their services.

For those who are interested in this topic, my colleague John O'Shea has written an excellent Amazon Builders' Library article called Building Dashboards for Operational Visibility.

Alright. So now our dashboards have pointed us to the part of the haystack where we think the problem is, and we need to use metrics to drill down. So now we're going to talk about how to use metrics, and drilling down with metrics is all about this concept of dimensionality.

But in order to understand how metrics work and how they show up, I want to talk about where they come from, which is the instrumentation in our code. At Amazon, every application or service collects a bunch of instrumentation for every unit of work (for a web service, that's a request), and at the end of it we write it all out to a file.

We append it to some kind of structured log file like this one. This is the CloudWatch Embedded Metric Format. There are a lot of different ways to format a structured log file that you can parse later; this is one way. It's structured so that it has the instrumentation in one spot.

Those are the facts about what happened during that request: what it was about and what we were doing during it. And then each log record has another part that tells the metric system which of those facts to turn into metrics and how to break them down.
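
To make that concrete, here's roughly what one of those structured log lines can look like in the CloudWatch Embedded Metric Format; the namespace and field names are illustrative, not anything prescribed by the talk:

```python
import json, time

# Roughly an EMF-style record: one structured log line per unit of work.
# The "_aws" envelope tells CloudWatch which facts to turn into metrics;
# everything else is just facts about this request.
record = {
    "_aws": {
        "Timestamp": int(time.time() * 1000),
        "CloudWatchMetrics": [{
            "Namespace": "ProductInfoService",
            # No dimensions yet: this just defines the overall Time metric.
            "Metrics": [{"Name": "Time", "Unit": "Milliseconds"}],
        }],
    },
    # Facts about this request. Only "Time" becomes a metric; the rest stay
    # queryable in the logs.
    "Operation": "GetProductInfo",
    "ProductId": "B000123",
    "CacheHit": 1,
    "Time": 42,
}
print(json.dumps(record))  # emitted as one log line
```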

OK, so what do I mean by that? Let's go to where the actual instrumentation comes from, which is in some code somewhere. Let's say we have some kind of product catalog service, a product info service for an e-commerce site. It serves up different APIs, like GetProductInfo, which returns information about some product to display on the site.

Well, if this is the code that I implement and then I try to operate this product info service, I'm going to have no clue as to what's going on with the service. Why? Because if I'm trying to troubleshoot why a request failed or why something is slow, I don't have any sense of what that request was even trying to do or who it was for.

Did this request make it to the cache? Did it get anything useful from the cache, or did it have to go back to the database? If either of those things failed, why? Did they time out? Timeouts are very different from errors when I'm trying to troubleshoot, debug and mitigate impact. Did the database take a long time to query? Did it return anything? There are so many questions that this code by itself doesn't help me answer, and so I won't be able to operate this thing effectively.

What we do at Amazon doesn't actually add too much to the code, but it adds some really important stuff. We use a common metrics library throughout the company. Whenever a framework gets a new request, it creates a new instance of a metrics object, and we pass that around throughout all the code.

Sometimes people use aspect weaving or similar techniques to do it magically, and some languages more naturally pass a context object around. I just like to explicitly pass the metrics object around, because I like to see what's happening. But anyway, we pass this metrics object to the SDKs if we use an SDK to make a remote call, or to a caching library if we access a cache. Those libraries also speak in terms of this metrics library, so they add instrumentation to that request automatically.

So we keep getting more and more instrumentation, often pretty automatically, because our different libraries and SDKs participate in this metrics API. But then in my code, I'm also going to add my own timings and facts. I'll record what a request was working on so I can tell later when I'm trying to debug it. And all of that instrumentation, whether it comes from my code or from other libraries and dependencies, gets written out.
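
As a rough illustration of that pattern, here's a minimal, hypothetical metrics helper, not Amazon's internal library, that gets created per request, collects facts and timings, and flushes one structured log line at the end:

```python
import json, time

class Metrics:
    """Tiny stand-in for a shared metrics library: one instance per unit of
    work, passed to everything that handles the request."""
    def __init__(self):
        self._facts = {}
    def add_property(self, name, value):   # strings: facts about the request
        self._facts[name] = value
    def add_time(self, name, millis):      # numbers: timings and counts
        self._facts[name] = millis
    def flush(self):                       # one structured log line at the end
        print(json.dumps(self._facts))

def get_product_info(product_id, metrics):
    metrics.add_property("Operation", "GetProductInfo")
    metrics.add_property("ProductId", product_id)
    start = time.monotonic()
    product = {"id": product_id}           # stand-in for cache/database calls,
                                           # which would add their own timings
    metrics.add_time("Time", (time.monotonic() - start) * 1000)
    return product

metrics = Metrics()                        # created by the framework per request
get_product_info("B000123", metrics)
metrics.flush()
```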

And so you can see there's a whole bunch of information about a request that's going to be useful for understanding what happened. There's who made this request to us and where they came from, the trace ID. There's information about the infrastructure that processed the request: what instance am I, what availability zone did I run in, what cache node am I talking to? And then there are a bunch of timings and other measurements: how long did things take, how many things did it process, and so on.

If you break it down more generally, there are measurements, which are numbers, like timings and counts of whether things happened, and there are attributes, which are strings, facts about the request. Going back to that format, we can see that time measurement, and then we need to be able to tell the metric system how to turn those into metrics. That's what the metric definition part does: it says I want to take that time measurement and turn it into a metric.

OK, so what does that look like when we do that? These concepts apply to any tool you use to look at metrics, but here we're looking at CloudWatch. We're plotting the overall latency for the product info service, and the way it gets there is that each of those EMF records has a definition saying: hey, this blob of JSON is about the product info service. So now we're narrowing it down. We're not looking at all latency across the entire universe, just the product info service.

There are no dimensions here; we're not breaking it down further than the overall latency for the product info service. We're going to use the time measurement from the request, these are the units to display and how to interpret it, and then the measurement itself contributes to the actual line. Fantastic, now we have a metric. But now we're back in the haystack and we see a spike in that metric: the product info service has become slow. This metric doesn't help us understand why; it just tells us that it happened.

So how can we use metrics from here to find out why? To think about that, let's go back to the architecture of what could even happen. The product info service might have different APIs: GetProduct, which does a simple key-value lookup in a database like DynamoDB; maybe an UpdateProduct API so merchants can update the information about the products they're listing; and maybe a SearchProducts API so people on the site can search for things, which might use a search index and none of those other dependencies.

Each of these APIs might have its own set of dependencies and things that can go wrong. This is where metric dimensions become useful. Dimensionality lets us break down measurements like latency into something more insightful and finer-grained than an all-up aggregation; it helps us refine and get deeper into the haystack.

So when we're writing that kind of instrumentation, one thing we can also do is log with every request an attribute that says: this is the API name I'm working on right now. It's just another fact about the request I'm logging, in this case that it's a SearchProducts API invocation. And then in the dimensions section, the instrumentation can say: hey, metric system, also break that down by operation. Whatever string you find in this operation field, make each distinct value its own metric. That's what you see here.
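
In EMF terms, that amounts to adding the operation as both a logged fact and a dimension in the metric definition; again, the names here are illustrative:

```python
# The same kind of record, now asking the metric system to also break "Time"
# down by whatever string was logged in "Operation".
record = {
    "_aws": {
        "Timestamp": 1700000000000,
        "CloudWatchMetrics": [{
            "Namespace": "ProductInfoService",
            "Dimensions": [["Operation"]],   # one metric per distinct Operation value
            "Metrics": [{"Name": "Time", "Unit": "Milliseconds"}],
        }],
    },
    "Operation": "SearchProducts",           # just another fact about the request
    "Time": 187,
}
```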

OK, so dimensionality helps us get a little bit lower; we can see which API might be affected. But I want to use metrics to understand the real why, or to get even more information. That's where we use the same idea of dimensionality, but with what we call high cardinality metrics: a dimension that has a ton of distinct values.

Dimensionality lets you say: show me this measurement per dimension. That's what a dimension is all about: show me the latency per availability zone, errors per API. That's how metrics and dimensions relate to each other. The metric is a measurement, a number, and the dimension is a string, a fact about something.

And how many unique string values that dimension has is what we call the cardinality. For example, if we're looking at availability zones in a region, there's probably a handful; we could plot the latency per availability zone on a graph, see a handful of lines, and make sense of that and learn something. But then we start to get to things that are higher cardinality, like per instance, per product, per customer.

Now there would be an absurd number of lines on the graph. If we tried to look at them all at once, there probably aren't enough pixels on the screen to render a graph that shows the latency or errors per customer. So the dimensionality is still useful, but we need different tools to analyze it. High cardinality metrics are still really useful; there are really important questions that we need high cardinality to be able to answer.

So now we're getting into the interesting questions of what's going on and where that needle really is. Say I have a fleet of EC2 instances: are all of them returning errors, or is it really just a subset of them, maybe just one? You need to sift through all the per-instance metrics. Or are different customers having different types of impact?

Now we need to start looking at each customer, but we can't look at every single one, so we need some way to sift through that. There are really interesting questions we need to answer, but we can't just plot them all and use our eyeballs to figure it out. That's where some tools help us make sense of high cardinality metrics.

One of those is CloudWatch Metrics Insights. Here I'm plotting the per-API dimension, and I said per API is a pretty low cardinality dimension. Well, maybe. If you have a handful of resources on some web service and you're doing CRUD operations on each of them, you easily have many tens of APIs, and graphing each of those is still going to be a bit of an eye chart to figure out.

So we have CloudWatch Metrics Insights, which lets you use SQL to describe which metrics you want to be looking at. You can do things like ORDER BY and LIMIT, so you can say: look at all of my latency or errors per API and show me just the five with the most errors. This is really useful; it helps sift some of the hay away from the haystack. I have a high cardinality dimension with a lot of distinct values, and this isn't just picking five arbitrarily, it's picking the five that are probably the most interesting to me, the ones that help me answer my question.
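
As a sketch of what such a query can look like, run through the GetMetricData API; the namespace, metric and dimension names are made up for this example:

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

# Metrics Insights query: of all the per-Operation error metrics, return only
# the five operations currently producing the most errors.
query = (
    'SELECT SUM(Errors) '
    'FROM SCHEMA("ProductInfoService", Operation) '
    'GROUP BY Operation '
    'ORDER BY SUM() DESC '
    'LIMIT 5'
)

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_data(
    MetricDataQueries=[{"Id": "topErrorOperations", "Expression": query, "Period": 60}],
    StartTime=end - timedelta(hours=3),
    EndTime=end,
)
```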

But let's go to an extreme high cardinality case. I might not actually want a different metric for every value. Say I care about errors per product in my catalog: I might not really have a use case for "given this arbitrary product in the catalog, show me the errors over time for some period." That's probably not something I'm going to do, but the breakdown can still help me answer a question.

For that, we use CloudWatch Contributor Insights. It looks at all the distinct values coming in, but it only keeps track of, and only creates metrics for, the top say 100. So it sifts through this very noisy, very big haystack to find the few you might want, and tracks only those. That's how CloudWatch Contributor Insights can help us narrow down and find just the one distinct value we might be looking for.
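
Here's roughly what defining such a rule can look like against a structured request log; the log group, field names and rule name are hypothetical:

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")

# Contributor Insights rule: scan the structured request logs and keep track
# of only the top product IDs that are producing errors, instead of
# materializing one metric per product.
rule = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/product-info-service/requests"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.ProductId"],                        # the high-cardinality dimension
        "Filters": [{"Match": "$.Error", "EqualTo": 1}],
    },
    "AggregateOn": "Count",
}

cloudwatch.put_insight_rule(
    RuleName="ErrorsPerProduct",
    RuleState="ENABLED",
    RuleDefinition=json.dumps(rule),
)
```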

But it's also a useful tool for summarizing metrics in a different way. Let me share a story about how we try to understand the customer experience for client-caused errors in a web service. In a website or web service, there are generally two categories of errors. There are things the server caused, like our own code failing. Those are our fault; we need to know about them, and we have control to drive those server faults, the 500s, down to zero. But there's this other category of client faults, where the customer just called the thing wrong.

Maybe they passed a bad request, or they weren't authorized or authenticated. So how do we make sense of these? On one hand, we should really only alarm on the server faults, because those are the ones that are actionable. But we also really need to care about the client faults.

Somebody could show up and just start calling us with bad arguments, and we don't want our pager to go off. But maybe there are times when it should. Say we deploy a bug in our service where an argument that used to be allowed changes: you used to be able to pass in a larger input and we shortened the limit. Now everyone calling our service with those longer inputs would be failing because of our change, and we wouldn't notice because we weren't alarming on it. So we can't just ignore client faults entirely, but we also can't simply alarm on them, because if one caller shows up and starts sending a bunch of bad arguments, we'd wake up and there'd be nothing for us to do.

We would still want to know, and talk to you, and say: hey, maybe we have something awkward to use in our API, or maybe there's something we need to work with you on. But it's a different type of visibility that we need. Contributor Insights is a tool that helps us plot this graph here; this is a rendering from the CloudWatch Contributor Insights console.

It shows the errors per client. If we wake up and find a graph that looks like this, we'd say, OK, this is probably one customer; it's unlikely we just did a code push that caused a broader issue. But if we see this other graph, it's probably our fault, we probably made a mistake here.

"Um and so, but we can't, we just create an alarm on this. I can, I can't create an alarm on every one of them. Like how do I actually do the alerting here?

Well, that's where we can combine this idea of cardinality and dimensionality in an interesting way. The graph we would typically set an alarm on is the percent of requests with errors. If you think about that, it's the number of requests that errored divided by the total number of requests.

Wouldn't it be great if we could do the graph on the right, which shows the percentage of clients with errors: the number of clients who have an error divided by the total number of clients? Well, that's something we can do, thanks to this idea of dimensionality and having the right tool to deal with that dimensionality. We want to take the number of customers who have errors, divide by the total number of customers, and plot that over time.

Well, CloudWatch supports metric math, and Contributor Insights not only keeps track of the top N, it also keeps a very good estimate of the total number of distinct values over time. So you can use metric math along with the Contributor Insights unique contributors count: have two rules, one that tracks the customers who have errors and one that tracks requests per customer overall, divide the two with metric math, and you can create an alarm on it. You get that holy grail, the percent of clients that have errors. It's a super powerful feature.
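
A sketch of what that combination can look like, assuming two hypothetical Contributor Insights rules named ClientErrorsPerClient and RequestsPerClient already exist; the threshold and periods are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm on "% of clients with errors": divide the unique-contributor counts of
# two Contributor Insights rules with metric math.
cloudwatch.put_metric_alarm(
    AlarmName="PercentOfClientsWithErrors",
    Metrics=[
        {"Id": "errclients", "Period": 300, "ReturnData": False,
         "Expression": 'INSIGHT_RULE_METRIC("ClientErrorsPerClient", "UniqueContributors")'},
        {"Id": "allclients", "Period": 300, "ReturnData": False,
         "Expression": 'INSIGHT_RULE_METRIC("RequestsPerClient", "UniqueContributors")'},
        {"Id": "pct", "ReturnData": True,
         "Expression": "100 * errclients / allclients",
         "Label": "% of clients with errors"},
    ],
    EvaluationPeriods=3,
    Threshold=5.0,                       # e.g. alert if >5% of clients see errors
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```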

I know I'm talking through it pretty fast, but it's this dimensionality concept applied in a really useful way. OK, so now we have a finer understanding of where in the haystack the problem is, or even what the problem is, but this isn't always going to get us to an actual understanding of what happened yet.

Sometimes we have to aggregate data in a way that we didn't plan for ahead of time. When I worked on the AWS IoT Core team, I remember helping a customer troubleshoot an issue where they had a bunch of devices out in the field talking to AWS IoT, and their service was forwarding the data to an analytics app. They had this problem where periodically their analytics app was getting overloaded by all of those devices and all the data they were sending. It wasn't clear why the app was getting overloaded, because the requests-per-second graph was well under their capacity line; they didn't understand it. But it turns out that graph was showing the number of requests per minute divided by 60 to get an average request rate, and that's lossy, right? It's just the average.

What was happening is the devices would be idle, and then right at the top of every five minutes, all the devices would wake up, send their data, and a second later go back to sleep. So we needed to introduce jitter here. But how did we know this? Because they were logging every single request with its timestamp attached, they could break down the data in a way they weren't doing with metrics: they could break it down by second, whereas before they were breaking it down by minute. We use log analysis tools that go through all of our logs and do distributed querying with a simple query language, so we can slice and dice the data in a different way than we planned initially.
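
A Logs Insights query along these lines is what makes the per-second view possible; the log group name here is made up:

```python
import time
import boto3

logs = boto3.client("logs")

# Re-bucket the request logs by second instead of by minute, which is what
# exposes a thundering-herd spike that a per-minute average hides.
logs.start_query(
    logGroupName="/iot-forwarder/requests",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="stats count(*) as requests by bin(1s)",
)
# ...then poll get_query_results() with the returned queryId.
```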

Like I said, everything is in the logs, so we can break things down in different ways after the fact, even if we didn't materialize a metric in the metric system ahead of time. We already talked about using the metric definition in EMF to say: I always want to know the latency per operation, that should just be a metric. But for other questions, like "show me the top 20 products in my catalog that are returning errors to this particular IP address," I might only get to that kind of question in my troubleshooting. I can actually ask it and get an answer very quickly by going through all of the logs and using a query to bin things that way, and then I can see the results in CloudWatch Logs Insights. Or maybe: show me the products with the lowest cache hit rate per node, so I can find out why a particular cache node isn't behaving the way I expected.
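
An ad-hoc breakdown like that top-products-by-error question might look roughly like this, run through the same start_query call shown above; the field names depend entirely on what you logged:

```python
# Slice by dimensions we never materialized as metrics: top 20 products
# returning errors to one particular client IP (example values only).
query = """
filter Error = 1 and ClientIp = '203.0.113.42'
| stats count(*) as errors by ProductId
| sort errors desc
| limit 20
"""
```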

So we can ask very interesting questions based on any dimension that I didn't even think ahead of time to materialize as something I need to look at all the time. You can just keep digging down, because everything is instrumented and logged.

Finally, that's log analysis where we're broadly looking across all the logs, but the needle is often actually just a single log entry. So we have to be able to sift through all of those logs to find the one we need, like an archaeologist digging through all the sand to find the artifact they're looking for. Using log analysis tools, we can eventually find the specific log entries we're actually looking for.

For example, the raw logs are really important. I worked on DynamoDB, and I would work with a lot of customers to help troubleshoot things. Sometimes someone would say: I'm seeing errors at this particular time. Well, we need to be able to ask a question like: did the request even make it to us? If we get to the raw logs, maybe we find a stack trace like this showing that, actually, no, the client errored in DNS lookup, so it didn't even try to make the request. From there we'd say, OK, let's troubleshoot the Route 53 Resolver configuration for your private DNS; maybe that's not configured right. So the actual log gets us to what actually happened. Or, if your request did get to DynamoDB, what did it say back? Maybe it said: you already executed that transaction. Well, I didn't think I executed that transaction. OK, let's go right before that, and we can see that your call a second ago timed out, so the SDK retried, and that's why it looked like a duplicate. So you can paint the story by actually looking at the raw logs too, even, in this case, the unstructured logs.

Now we're going to talk about using profiling, in particular using profiling in production. We've all written code that is, let's say, not the most optimal; at least I know I have in my career. I don't know if you can spot it, but there are some parts of the code here that might be slow to run, so I've highlighted them. The key point, similar to what David was talking about, is that we don't always add timers around our code. We do a lot of the time, but not always. So how do we catch and observe issues like that in production? This is where profiling is extremely useful.

Profilers show how long different parts of the code take to run, whether that's waiting for something to return or real CPU time. They show flame graphs like the one on the screen here, where the bottom of the graph is the whole application and the call stacks build up on top. The wider the bar, the more CPU or time that function is taking, and the topmost box shows the function that was on the CPU at the moment of a snapshot.

I used to think of profiling as something we did offline or on a desktop, but that misses out on the value and the learnings you can get from profiling in production. So at Amazon, we've built always-on profiling. That way we can look at the profiles when we're trying to optimize our code, but also during operational events, when we want to investigate what's happening, we can use profiling to observe the performance of our code before and during the event and find out what aspects of our code ran unexpectedly slowly.

It's a super handy feature, and I'm not going to dwell on it too long. It's very easy to install; it's one click if you're using services such as Lambda. But I know from my own experience in CloudWatch that my teams have delivered some fabulous operational wins, availability and latency improvements, just by looking at profiling data and being curious. You can do something very similar with Amazon CodeGuru Profiler.
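
If you want to try something similar, enabling the CodeGuru Profiler Python agent in an application's entry point is roughly this; the profiling group name is a placeholder, and you should check the agent documentation for the current parameters:

```python
# Turn on always-on profiling via the CodeGuru Profiler Python agent.
from codeguru_profiler_agent import Profiler

Profiler(profiling_group_name="ProductInfoService").start()
# The application then runs as normal; the agent samples call stacks in the
# background, and the service renders them as flame graphs you can inspect
# during or after an operational event.
```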

The key takeaway I'd like to leave you with from this chapter on finding the root cause is that it needs a wide range of observability capabilities; there isn't just one technique or tool to solve it all. You want to make the most critical data you need easily available and with low latency, and metrics are a typical way to solve that challenge.

There are times you need to dive deeper and deeper into a problem. In those cases, as the cardinality increases, you don't want hundreds or thousands of lines in your graph. This is where services such as Contributor Insights, Metrics Insights and Logs Insights are really beneficial.

If something was slow or hard to find during an operational event, make it easier to get to next time. And it's all underpinned by having a solid logging strategy, because once you have that, you can always go back to the tape, or the disk, to find the specific data you need.

OK. Having walked through the details of how we find root causes, in this chapter of our talk we're going to cover some of the best practices we've employed at Amazon to get even more useful observability data. Here, we're going to discuss ways we've optimized our understanding of the customer experience.

Going back to the distributed system I introduced in the previous chapter: as we've already discussed, any good team at Amazon would typically have multiple dashboards to understand how each specific component is performing. However, our customers' journeys usually require the use of multiple APIs and multiple components to complete their use case.

One of the first projects I had at Amazon was to help improve the overall latency and availability of our CloudWatch console. One of the common questions we faced was: what exactly were our console customers experiencing at any particular time? At that time, the console had a series of tests that validated that the API behavior was correct. However, testing the user experience can be more complex than testing an API and introduces multiple variables: the customer's browser version, OS version, mobile clients, client latency.

So it was challenging to consistently understand whether our console customers were getting a great experience, and we were not satisfied that we were measuring the customer experience as well as we could. So we created a testing framework that used a headless browser to continually validate and test our service in production. We found this testing invaluable; it helped uncover bugs and issues that we previously couldn't detect. It had some neat features, such as being able to snapshot images of the UI when a test fails, which made it easier to debug issues.

Having started as an experiment, we found that multiple AWS teams were having the same challenges, so we decided to vend this service across all teams in AWS to minimize duplicated work across the org. In fact, the service became so essential to how we test our user experiences at Amazon that we launched it as an external AWS service in 2020: it's called CloudWatch Synthetics, and it offers many of the same capabilities I previously described. It enables you to monitor your APIs and user experiences every minute, 24/7, using modular canary scripts, and you can be alerted when your application does not behave as expected.
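
The kind of check a canary runs can be sketched with a plain headless-browser script like the one below; this illustrates the idea only, not the Synthetics canary API, and the URL and assertion are hypothetical:

```python
# Drive a headless browser through a key customer journey and save a
# screenshot if it breaks, the way a continuously running canary would.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://console.example.com/dashboard")
    assert "Dashboard" in driver.title, "landing page did not render"
except Exception:
    driver.save_screenshot("failure.png")  # snapshot the UI to debug later
    raise
finally:
    driver.quit()
```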

So that's great. Now we have a client that's measuring the end-to-end customer experience from the outside, but we can do even more. We also found that it's important to get as close to our customers as possible, so we can see things from their perspective and in real time.

Going back to our system architecture, we now have robot users continually testing things, but we don't have insight into what our real customers are doing. For that, we built CloudWatch RUM, real user monitoring. When you add the CloudWatch RUM web client to the HTML header of your application, you'll be able to collect performance telemetry including navigation events, JavaScript and HTTP errors, and client-side performance metrics. You'll be able to visualize the data in CloudWatch RUM and derive insights to improve your application's client-side performance. You can also set alarms on your key metrics, including the number of errors, page load times and the core web vitals.

And now we have a complete picture of the customer experience. At Amazon, our builders make extensive use of alarming to detect operational issues. But now that we're measuring from everywhere, you can see that we have a lot of potential signals to process. One of the challenges with this is that a single incident may result in multiple alarms going off. This can lead to alarm fatigue, which is where an operator is getting paged too often and gets overwhelmed by the volume of alarms they're processing, which in turn leads to missed alarms or delayed responses.

In an example that will follow, some customers could generate millions of alerts. In those situations, engineers and operators can spend more time triaging the incoming alarms than finding the actual root cause.

Amazon faced these issues over a decade ago, and similar to the previous example, we scrappily built a system to address them. Like our browser-testing example, it started small, as an experiment; my team owns this service today. A core concept is that we use boolean logic to aggregate our alarms. For example, if one host is impaired, that can probably be resolved using automation. However, if hundreds of hosts are impaired, you want a human to get engaged and check it out.

Amazonians loved these capabilities, usage grew quickly, and it's now the preferred way that teams manage their alarms at Amazon. It's so foundational to our operations that we launched it in 2019 as a standalone feature in CloudWatch called composite alarms.
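
A sketch of what such an aggregation can look like with the composite alarms API; the child alarm names and the SNS topic are hypothetical:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Composite alarm: only page a human when several underlying alarms agree
# there is real customer impact, instead of paging on every host-level alarm.
cloudwatch.put_composite_alarm(
    AlarmName="ProductInfoService-CustomerImpact",
    AlarmRule=(
        'ALARM("FrontEnd-Availability-Low") '
        'AND (ALARM("FrontEnd-Latency-High") OR ALARM("FrontEnd-ErrorRate-High"))'
    ),
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:page-the-oncall"],
)
```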

Many customers are using composite alarms today, but one great example of using them at massive scale comes from BT. They're using CloudWatch composite alarms to monitor millions of their smart home router devices across the UK. Using EMF, which David introduced earlier, and composite alarms, they're ingesting per-device log data and creating a hierarchy of aggregated metrics, which in turn is represented in a corresponding hierarchy of alarms. This allows them to quickly triangulate, per locale and per geo, where they're having operational events, and it has enabled them to react faster and put things right sooner. For those who are interested, there's a reference on the screen to an interesting blog you can check out on how BT have done it.

The theme of all of this is that we need to measure from everywhere so we can fully measure the customer experience. We've talked throughout this session about three different categories of how we use observability, and a bunch of tips for drilling down into the haystack to make sure we understand and empathize with what customers are seeing. But the meta theme across all of these is that there are a lot of tools in our observability toolkits, making up layers, and what we need to do is use those tools to answer more and more specific questions. You can stack these tools: starting with alarms, which are an important signal but not the most precise, then service maps and dashboards, you get more and more precision and get closer to the needle. But sometimes, especially if we have to walk through each tool or each process to get to the next, it takes more effort to move up that stack and tool chain, or at least you have to think more and do more steps to sift through and get to the root cause.

Because of that, we use things like traces to navigate this stack pretty quickly: if you look at a trace, it takes you all the way through, even to the raw logs for that event. But even outside of traces, we're always trying to take the information that's harder to get to but important, once we realize, oh, it was really useful to be able to look at errors per customer, and make sure we spend the time on it. Maybe I had to do a log analysis query that time to break down errors per customer. Well, let's make that easier to get to next time: create a CloudWatch Contributor Insights rule to have that breakdown all the time and just put it on our dashboard.

So it's all about pushing the things that we found were hard to get to, but valuable, up into those higher-level abstractions for next time, so we have a more summarized view and can connect the dots and drill down faster. The way we do that at Amazon is by having a culture of doing this continuously: adding instrumentation, which results in logs and metrics and alarms, helps us ask better questions the next time we have an issue.

So we're refining and improving our operations and making it easier to answer questions, so we can spend our important time asking better and more interesting questions next time, to keep turning that flywheel and have a better customer experience come out of the service.

We get together and look at dashboards, and we have another opportunity to ask questions every week when we're asking each other and ourselves: what's real? What does this dashboard really mean? Maybe there's a more interesting question that we could ask and represent on the dashboard. And of course, when something does go wrong, we want to dig in and understand, navigating into those deeper layers and spending a lot of time in logs. Out of that, we often have a lot of actions like: OK, let's make that easier to get to and more immediately obvious to an on-call next time, rather than having to dig it out, maybe even after the fact, in order to figure it out. So that's the journey we go through continuously in observability.

Thank you all for spending your time with us today.
