Building observability to increase resiliency

Hello, everybody. Thank you for coming to what I anticipate to be the most ambitious crossover re:Invent session of all time: a talk covering both observability and resiliency, two of my favorite topics. After all, you can design and architect the most resilient system in the world, but if you don't operate it using resilient practices, your customers won't see your resilient architecture; they'll see something else.

Now, these are topics that I've been really passionate about as I've worked on services at Amazon for the last 17 years, from DynamoDB to Lambda, and they are the two areas that I care about a lot. That's actually why, for the last year, I've been working on observability services like CloudWatch.

Now, for this talk, I need everybody to board this ship that we're going to be sailing together throughout the session. We will be navigating the seas of observability data, steering around outages and stormy weather, riding out storms and rough seas as we encounter them, and fixing problems with the ship itself. Of course, this is all just a metaphor for operating a service together. But it's also a chance for me to show off my skills as an artist.

Now, how many people in here have sailing expertise? Raise your hand if you do. Okay, if you're sitting next to somebody who raised their hand, please ignore them as they roll their eyes and groan at my sailing metaphors that are not quite on point, because this is the extent of my sailing experience, which is not extensive by any means. This is an observability talk, but it's not organized around the typical observability pillars of logs, metrics, and traces. We're going to organize it in terms of what can go wrong in a service and how we would use observability to get out of that situation or avoid it altogether: diagnosing issues, uncovering things that our observability data wasn't showing us, and then preventing future issues.

The first chapter, diagnosing issues, is where we'll spend the bulk of our time during this session. We'll talk a lot about using dimensionality to slice and dice data, cutting it just the right way to see exactly what we're looking for and find the needle in the haystack. We'll look at high-cardinality dimensions to make sense of situations where you have hundreds of thousands of time series to sift through, and we'll use tracing to build service maps so we can navigate a distributed system to the one piece that is having trouble.

For uncovering hidden issues, we'll learn how to use things like synthetic workloads and real user monitoring, and how to aggregate server-side metrics in interesting ways so we can look at observability data in a more customer-focused way than we tend to.

And then finally, for preventing future issues, we'll talk about how to use auto scaling and how to watch every metric that could be a sign of something creeping up and becoming a problem. And when we run controlled experiments that test our system's response to failures, we'll talk about how to use observability data to make sure we're seeing everything we need to see during those experiments.

So all aboard, let's begin our journey on the seas of observability data as we diagnose issues. There are four different types of failures we'll talk about here, things that can go wrong in a system: a bad dependency, a bad component, a bad deployment, and a traffic spike.

We'll begin, though, before we get into the failure modes, with a leak aboard our ship. We already have a problem, but we don't actually know that there's a leak; it's happening beneath the surface and we can't see it. So let's figure out how we would even know there's a problem in the first place before we get into specific failure modes...

Each component sends something called a trace segment. We need to gather all of those in one place so that we can map out the whole system. And when you map out the whole system by aggregating those trace segments into something like X-Ray, you can piece them together and produce a nice visualization that shows the entire architecture.

We didn't have to manually create dashboards or anything; it just shows the system. And if you overlay the telemetry data, the health data, especially in a summarized way, you get that nice red, yellow, green status for each component. So the nice visualization we had for the ship, we now have for our distributed system as well.

Now, this isn't just PowerPoint drawings; this is a real thing. This is the CloudWatch ServiceLens map that you get when you use X-Ray tracing. This one is from the One Observability Workshop; it's open source, and you can deploy it into your own account if you want. I did, and I took this screenshot. Sure enough, you can see it's a fairly complex system with a bunch of things talking to each other, but the summarization makes it easy to spot a little bit of red around the circles on a couple of those components. Those are the ones we should start investigating first, because there's some sign of a problem there.

We have the automatic mapping with the overlay of the telemetry data. Fantastic. So now we have a few more takeaways on how to navigate a distributed system and find the bad dependency. First, propagate the incoming trace context to every outbound call that you make to every dependency. Every component in the system needs to pass this baton; sometimes that's easy with certain frameworks, and sometimes you have to instrument your systems to explicitly pass trace IDs around. One way or another, make sure you're doing it. Second, enable collection of trace segments from all AWS services and from your apps: get all those trace segments together in one place so that X-Ray and ServiceLens can map this out for you. Propagating is good, but getting those traces together is even better. And you can often use sampling, since you don't need every trace segment in order to build the service map. Sample, say, 1% of requests, send those, and you still get that nice service map; sample whatever you want.
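As a rough illustration of those first two takeaways, here is a minimal sketch of trace propagation and sampling with the AWS X-Ray SDK for Python. The service name and the function are made up for the example; the point is that patched clients carry the trace ID to every dependency automatically.

```python
from aws_xray_sdk.core import xray_recorder, patch_all

# Patch common libraries (requests, boto3, etc.) so outbound calls
# automatically propagate the incoming trace context to every dependency.
patch_all()

xray_recorder.configure(
    service="web-server",   # how this component appears on the service map
    sampling=True,          # sample a fraction of requests; you don't need every trace
)

@xray_recorder.capture("render_product_page")
def render_product_page(product_id):
    # Calls made here through patched clients emit trace segments that
    # X-Ray stitches together into the service map.
    ...
```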

The third and final lesson here is to use service maps derived from traces to triangulate failing dependencies during incidents. That nice visualization helps you find where the problem is. So now let's dig deeper and find out why this service was having a problem, and that could be because of a failed component.

Looking at that web server layer of the architecture one more time, rotated vertically this time: users send requests through an Application Load Balancer, and then we have an EC2-based web server fleet with nine servers spread across Availability Zones. These are the things rendering the HTML that gets sent back to the customer. The load balancer, using its algorithm, picks one of these targets, one of the EC2 instances, to render the HTML for a given request.

Now, the web server does something with it; the details don't really matter for this part. What would happen, though, if the web server process on this EC2 instance ran out of memory and crashed? There's no longer a process listening for new HTTP requests, so all the requests that were destined for that host would fail, and we would see an overall error rate of about 11%, because that's one host out of nine, 1/9.

So you would see that kind of sustained error rate, because requests are being sent to this host but it's not giving customers a response. Fortunately, load balancers are better than that. If you configure them, they will do health checks: they'll check every host every few seconds and say, hey, I need you to render your ping page, basically a blank page, just to make sure that you're still running and that you think you're healthy. Since the web server process has crashed, after a few seconds of failing those ping requests the host will be taken out of service, and the load balancer will start sending requests to other servers instead. Great. So with properly configured health checks, this is what we would see in the observability data: just a blip. With retries, probably nobody even notices.
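For concreteness, here is a small sketch of configuring those health checks on an ALB target group with boto3. The target group ARN, path, and thresholds are placeholders, not values from the talk.

```python
import boto3

elbv2 = boto3.client("elbv2")
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/web/abc123",
    HealthCheckPath="/ping",        # the lightweight "are you still running?" page
    HealthCheckIntervalSeconds=10,  # check every host every few seconds
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,      # fail a couple of checks and the host is taken out of service
)
```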

So let's talk about a more interesting failure mode. What would happen if the process on that one host, instead of crashing, just got wedged in a weird way? For example, maybe the HTTP client library it uses to communicate with the product info service is impaired; maybe there's a deadlock in that HTTP client, so it can't talk to the product info service but it can talk to everything else. Some of the functionality rendered from this web server is totally fine, but some of it is broken in a weird way. We call this a gray failure. That's why I made that box gray, because it's a gray failure.

Okay, so what would happen here? Well, we had those nice health checks, but they're not going to save us, because the health check is just asking whether the web server is running. Now, you could say, ah, clearly we need the health check logic to verify that it can talk to all of its dependencies, or something like that. That's fine, but unfortunately health checks can't enumerate every possible failure mode. From this server's perspective, the problem looks like it's in the product info service, not in itself, even though it's really this server that has the wedged code in it. So it's a little tricky; we can't enumerate all possible failure modes in a health check. And second, there's the problem of what happens if we overreact and all of the servers start reporting the same "I'm unhealthy" failure mode at once. It gets really complicated, more complicated than we have time to go into here, but I've written an article in the Amazon Builders' Library that goes into great detail about the pros and cons of what to put in a health check and what not to.

Now, that article basically says that because you can't win with health checks alone, you have to do something else using observability data, and that's the part we'll talk about here. This is what we would see during a gray failure: that same roughly one out of nine requests failing, at least for the use cases that involve the product info service. And this is what we would see in the telemetry data: the overall site-wide error rate has increased, but we don't know why. So what observability tool can we use to zoom in and understand where the failure is?

Well, we can use dimensionality again. We did that for per-web-page metrics; we can do the same for per-instance metrics. For each instance, we can plot the error rate, the ratio of errors on that instance, and then we have a very clear signal on which EC2 instance running our application is having the problem. Now, it's important that these aren't just EC2 metrics like CPU; these are our application health metrics that we emit. Your code has to emit this success/failure metric per EC2 instance: application health broken down per piece of infrastructure, per component.

Now you might say, okay, let's use that thing David was talking about earlier, composite alarms, and create a separate alarm threshold for every host. You can, but I actually wouldn't recommend it; it would be tricky because every time an instance comes into existence, you'd need automation to create a new alarm just for it. You can do it, it would just be a little clumsy. There's a better way, or at least a different way, the one I would favor, and that's using CloudWatch Metrics Insights to query for the worst instance, the most failing one. That's what this query does. You write it in SQL and say: show me the number of failures per instance ID and give me the top 10 most failing ones. That gives us this graph, well, the hand-drawn version of it. But I haven't fixed the problem yet; this is just a graph, and you can't alarm on those 10 things directly. You can, however, alarm on the top one. You can write a metric math expression that alarms on the topmost failing instance, and if it's failing more than 1% of the time for two minutes, in this case, then I get an alarm. That's a nice, clear signal that will fire, even though I didn't animate the needle swinging again. And just to make this a little more real, here's a screenshot of what that would actually look like.

This is the actual CloudWatch console showing the Metrics Insights query experience, a nice wizard that lets you build queries. If you click the button on the right, you get the SQL I was showing. And actually, as of yesterday, there's another option here that isn't in this screenshot: if you don't want to use the wizard and you don't want to write the SQL, you can just use natural language to say "show me the top failing instances" and it will generate the query for you. We launched a new query generator for Metrics Insights, Logs Insights, and so on. So that's kind of neat; you can build these queries in whatever way you want.
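To sketch what alarming on the worst instance might look like through the API, here is a hedged example using boto3. The WebServer namespace, the Errors metric with an InstanceId dimension, and the threshold are assumptions for the illustration, not values from the talk.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="worst-instance-error-rate",
    Metrics=[{
        "Id": "worst",
        # MAX over the per-instance error metric returns a single series for the
        # most failing instance, so one alarm covers however many hosts exist.
        "Expression": "SELECT MAX(Errors) FROM SCHEMA(WebServer, InstanceId)",
        "Period": 60,
        "ReturnData": True,
    }],
    EvaluationPeriods=2,            # breaching for two minutes...
    Threshold=1.0,                  # ...above this placeholder threshold fires the alarm
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```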

Okay. Now let's get into a similar type of failure but a different kind of challenge, where more than one instance is having an issue. Here we have the web server environment again, spread across Availability Zones, with the load balancer sending requests across the different zones. And now we have a caching fleet: say the product info team decided to make things more efficient by putting a distributed caching layer in front of the database, with the web servers talking to cache nodes in their same Availability Zone to save on latency, save on bandwidth, and so on. It's a good pattern.

So what would happen if one of these cache nodes gray failed in whatever way, some weird software bug, one cache node just acting up? Well, to our observability data, this would look like a gray failure of all of the instances in that zone, and that wouldn't immediately point us at the cache node.

Okay, so now I'd break this down: I have the per-instance metrics and I would see all of those failing, but they wouldn't really tell me what was in common across them. And this is more impact than a single host; this is a whole zone's worth. So we need to act quickly, zoom in, and tell which piece of infrastructure is having the problem.

So what tool can we use to drill down and quickly understand where the problem is and why I'm seeing an overall error rate? You guessed it, we can use dimensionality again: emit latency and errors per Availability Zone as well as per instance. That way we get a really clear signal and can break it down and see.

"Ok. Yep, this our software in this availability zone. For some reason, we don't really need to know why it is behaving differently than in the others. It's just a, it's a component boundary. It's designed to behave differently. And so once we have that signal, we can react very quickly using Route 53 Application Recovery Controller and tell it to ok. I want to tell all my load balancers to just stop routing traffic to that availability zone and it'll just kind of get out and the application recovers right away. So that's great.

So we have the takeaways again for using observability data to deal with a bad component. Split key application health metrics on separate dimensions for each infrastructure boundary, like EC2 instance or Availability Zone; again, measure things that can fail separately, separately. And then find and alarm on the poorest performing of your components using a Metrics Insights query. That way you write one query and have one alarm on the worst performing one. Okay, fantastic.

Now, let's get into another failure mode that can happen: a bad deployment. So we're aboard the ship again, and we need to make a change, because as you can see, the sail is not getting very much wind. There's a lot of wind going this way, but the sail is also facing this way, so we need to turn the sail. Let's make a change; we need to do a deployment. This side, you're going to let go of your ropes, and you need to pull in these ropes so the sail catches more wind. Keep going, keep going, a little more. Okay, now we're getting a lot of strain, things are moving, this is maybe bad... and sure enough, we're now dismasted. I do know that term. We're dismasted, which is pretty much game over for our voyage.

So let's do it a different way, a better way. Let's take that same change, fix our mast, and this time use monitoring to do it a little better. Once again: pull in your ropes, let go of your ropes, we start adjusting the sail... okay, we're in alarm now. I don't know why, but let's just undo the thing we were doing. Let's roll back because we're in alarm: stop doing what we're doing and put it back. Let it out, pull back in. Okay, great. Now we're back and we're safe again and happy, because we have avoided a catastrophe.

Okay, so how does this work in our website world? This is the observability data we would see if we hadn't done anything, the first case where everything broke and it was catastrophic. If we had been looking at this, we would have seen the error rate go up a little bit when we deployed. We're doing good, resilient operations; we're deploying incrementally, not just pushing the change out to the whole fleet at once. So: one-box deployment, a little bit of an error rate increase. Wave one deployment. Wave two deployment. You can see the error rate just keeps creeping up. That's the first scenario, the one we didn't like because it broke everything.

The second one is where we do a manual rollback. What would this look like if we deployed a bad change and manually rolled it back? We would get an alarm. That's great, we got the alarm. But then somebody wakes up, gets on their laptop, and starts investigating: okay, there's an elevated error rate, why could that be? Oh, there was a deployment recently, we should maybe look into that. Let's call the people doing that deployment and ask them whether we should roll back. They say it's probably not that change, it's probably something else. Eventually they decide to just roll it back, and of course that fixes it. So that's the manual rollback case, what we would see when we used the alarm to initiate a rollback. The problem is that there are two phases here; you can break any incident down like this: the time to detect that there was a problem, and the time to then mitigate it. We don't need to know exactly what happened; we just need to fix it.

So let's use a more resilient operational practice: auto rollbacks. Instead of waiting for a human to make a judgment call and roll back, just have the computer do it. This is such a better, more resilient operational practice. As soon as the alarm fires, whatever is running the deployment, which is always watching the alarm, rolls the change back immediately. Again, you can design and architect the most resilient system in the world, but if you don't operate it with resilient practices, your customers won't see that resilient result.

So we should do auto rollbacks, and that takes the time to mitigate way down. Any type of change we make, whether it's a code deployment or an infrastructure change, should use this practice. For example, even when you're using CloudFormation (this is a CloudFormation screenshot), you can define, for any change you make to your stack, an alarm that CloudFormation will watch, and it will roll back that change if the alarm fires. It's just a great practice; every change you do should work this way.
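Here is a small sketch of wiring an alarm into a CloudFormation change so the stack rolls itself back. The stack name, alarm ARN, and monitoring time are placeholders for the example.

```python
import boto3

cfn = boto3.client("cloudformation")
cfn.update_stack(
    StackName="web-server-stack",
    UsePreviousTemplate=True,
    RollbackConfiguration={
        "RollbackTriggers": [{
            "Arn": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:deployment-health",
            "Type": "AWS::CloudWatch::Alarm",
        }],
        # Keep watching the alarm for a while after the update completes.
        "MonitoringTimeInMinutes": 15,
    },
)
```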

Great. So how do we tackle the other part, the time to detect? There we need some more straight-up observability techniques. How can we get a sense that there's a problem earlier? We have an overall metric; how do we drill down to a better, more precise signal? What observability tool can we use? Is everybody catching on at this point? Dimensionality. There we go. We can emit those application success and failure metrics per code revision. If we do that, the result is pretty much immediately visible: the old-code metric behaves just the same, it's a flat line, and the overall metric bumps up a little, but the new-code metric appears for the first time, because there wasn't any new code before, and it spikes immediately. You don't have to be a data scientist to understand that there's a problem; it's a very pronounced spike, exactly what we need. We get a quick alarm, auto rollback starts, so the time to detect is short, and the time to mitigate is even faster because we didn't have to roll back through so much of the fleet. It's pretty much as good as we can get here, other than testing and not deploying a broken change in the first place. That's the other option. But you have to be prepared for this; resiliency means that you're ready for anything.

Okay, so what do we roll back on? What should that alarm be? We should use composite alarms again. We roll back on the overall alarms, and we could add the new-code-revision-specific alarms and even the old-code-revision-specific alarms, because it's possible to deploy, say, a serialization bug that affects the old code and not the new code. There's all kinds of stuff that can go wrong. Because a deployment can break all kinds of things, we should alarm on literally everything we have: take all of the alarms, roll them up, and roll back. If that's too noisy, then we need to fix the alarms rather than leave them noisy. So roll back on a single composite alarm made of literally every alarm you have. It's fantastic.
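A minimal sketch of that roll-up, assuming hypothetical child alarm names, looks roughly like this with boto3:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_composite_alarm(
    AlarmName="deployment-rollback",
    # Go into alarm if anything at all goes into alarm: overall, per code
    # revision, worst instance, and so on. The deployment tool watches this one.
    AlarmRule=(
        "ALARM(overall-error-rate) OR "
        "ALARM(new-revision-error-rate) OR "
        "ALARM(old-revision-error-rate) OR "
        "ALARM(worst-instance-error-rate)"
    ),
)
```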

So how do we diagnose the bad deployment? Again, split key application health metrics on logical boundaries, this time like deployment ID, to minimize the time to detect bad changes; again, measure things that can fail separately, separately. Then roll back all types of changes automatically, whether it's CloudFormation, CodeDeploy, whatever, to minimize time to recover. And finally, combine literally all of your alarms into a single CloudWatch composite alarm for triggering those rollbacks. Okay, great.

Now, how do we deal with the failure of a traffic spike? Which, ironically, is more of a success than a failure, because more people are using the application. But okay, a traffic spike. What would diagnosing a traffic spike look like in terms of our observability data and our incident response? Let's follow how this would go down.

We get the latency alarm. We might start by looking at our nice service map and see, okay, this is the component that seems to be at the root of the latency. Let's pull up its dashboard and look at its metrics. Yep, the latency has spiked in that component. Did we make any changes to it recently that we could roll back? No, it's been a while; that's not the problem. Let's look at other metrics on its dashboard. Fleet-wide CPU has gone up; okay, that correlates really nicely. Did we lose a bunch of hosts? No, the fleet size did not go down; in fact, it's going up, so maybe auto scaling is kicking in. And sure enough, the request volume, the work being asked of the system, has gone up, and that corresponds with the latency. So we're getting into a bit of an overload situation. But what makes up that traffic? We have this coarse-grained signal, but we don't know what changed and what's within that delta in traffic. So what observability tool can we use to drill down and get a closer look? Dimensionality again. I hope we're catching on to this by now.

But this is a different kind of dimension. We can look at the request volume per customer, breaking it down per independent source of the traffic: maybe client IP address, or if you have an identity, then per identity, requests per identity.

Now this is getting to be a little interesting. You'll notice this table has been sorted in what looks like an arbitrary order. It's not arbitrary; there's another concept within dimensionality called cardinality. Some dimensions don't have many values: for the per-website dimension we have the one website we're working on together, we have two code revisions at once, ish, and a few Availability Zones. But we have a ton of customer IDs, maybe hundreds of thousands of distinct customers using the site all at once.

So let's just go ahead and plot those lines. Okay, that's not going to work; we don't have enough pixels on the screen to render a graph of traffic per customer. When we get into these high-cardinality metrics, we need a better way to sift through the data than we had with the low-cardinality ones; there are just too many. And what do we actually want? We don't really care about the number of requests for every customer; we just care about the ones sending the most requests. That's what matters for diagnosing these kinds of traffic spikes.

So we don't actually need 100,000 distinct metrics; we just need, say, 500. That's exactly what CloudWatch Contributor Insights does. Instead of materializing an actual CloudWatch metric for every distinct value, it only keeps track of the top 500 at any moment in time. So you don't pay the expense of all those metrics, and you don't have to sift through them and try to render them on one screen all at once.

So how do we get that? What's the machinery we need in order to use Contributor Insights? Let's talk about how telemetry works in CloudWatch. There are a lot of ways, but this is my preferred one: our application routes all of its instrumentation into a single structured log file, so every request gets a log record. We put that into CloudWatch Logs, and from the logs, CloudWatch Logs generates metrics server-side, however we've configured it, and from there, alarms.

So what is a structured log? Let's look at a request log record. Structured means it's, say, JSON or something else easy to parse, and it has a few different types of things in it. It has properties of the request: what was the trace ID, what was the client IP address, who made the request, which of our instances did it run on, which Availability Zone, which API was it serving. And then it has things that happened during the request: how long it took to render, whether or not it was an error, and whether it was a cache hit or miss for each cache.
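A minimal sketch of emitting one structured record per request might look like the following; the field names are illustrative, not a required schema.

```python
import json
import time

def log_request(trace_id, client_ip, customer_id, instance_id, az, api,
                code_revision, latency_ms, error, cache_hit):
    record = {
        "timestamp": int(time.time() * 1000),
        # properties of the request
        "traceId": trace_id,
        "clientIp": client_ip,
        "customerId": customer_id,
        "instanceId": instance_id,
        "availabilityZone": az,
        "api": api,
        "codeRevision": code_revision,
        # things that happened during the request
        "latencyMs": latency_ms,
        "error": 1 if error else 0,
        "cacheHit": 1 if cache_hit else 0,
    }
    # One JSON line per request; ship stdout or the log file to CloudWatch Logs.
    print(json.dumps(record))
```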

Okay, so now we have a structured log record for every request. Then we can create a Contributor Insights rule. You register one of these with CloudWatch and configure it to look at particular attributes. This rule would say: give me the count of matching log records grouped by customer ID, whatever the values of the customer ID are in the log records. That's a Contributor Insights rule, and this is what you get: a nice graph showing the top N requesting client IPs, just as we configured. We can have as many of these rules as we want. That's the high-cardinality data that was hard to make sense of when we had real metrics, and wasteful, because we don't need all of them.
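Here is a rough sketch of registering such a rule with boto3; the log group name and the field path match the hypothetical structured log record above and are assumptions.

```python
import json
import boto3

cloudwatch = boto3.client("cloudwatch")
rule_definition = {
    "Schema": {"Name": "CloudWatchLogRule", "Version": 1},
    "LogGroupNames": ["/web-server/requests"],
    "LogFormat": "JSON",
    "Contribution": {
        "Keys": ["$.customerId"],   # the high-cardinality dimension to rank by
        "Filters": [],
    },
    "AggregateOn": "Count",         # count of matching log records per key
}
cloudwatch.put_insight_rule(
    RuleName="requests-per-customer",
    RuleState="ENABLED",
    RuleDefinition=json.dumps(rule_definition),
)
```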

Now, what if we didn't have that Contributor Insights rule in place ahead of time and we're trying to make sense of this traffic spike? Because we have the structured logs, we can write a Logs Insights query to compute whatever metric we want on the fly. Logs aren't just text with stack traces: if you put structured logs in place, those logs contain any metric you might want but didn't decide to materialize. Materialize the ones you need on dashboards, use Contributor Insights, but for the metrics you don't need up front, just compute them afterward on the fly with a Logs Insights query when you're investigating. Fantastic.
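For example, a hedged sketch of computing "requests per customer" after the fact, against the assumed log group and fields from the earlier example:

```python
import time
import boto3

logs = boto3.client("logs")
query = logs.start_query(
    logGroupName="/web-server/requests",
    startTime=int(time.time()) - 3600,   # last hour
    endTime=int(time.time()),
    queryString=(
        "stats count(*) as requests by customerId "
        "| sort requests desc | limit 20"
    ),
)
# In practice you would poll until the query status is Complete.
results = logs.get_query_results(queryId=query["queryId"])
```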

So how did we diagnose this traffic spike? What were the practices? We found it's important to emit logs that are rich with data so that you can cut your metrics on many dimensions. We found that you can record and analyze high-cardinality metrics, like per-customer request volume, by configuring Contributor Insights; we don't need to materialize every dimension value as its own metric, just keep track of the ones we're ever going to look at. And finally, slice and dice the metrics you didn't materialize up front by writing Logs Insights queries and computing them on the fly. Okay.

Now let's look into how we could discover things that might not be obvious from the observability data we have. We're back aboard the ship, and, well, it's kind of a spoiler, but there's a hidden storm. There's nothing on the horizon; our sailing is fine. We see some clouds, but those are just normal clouds, not really stormy; they look fine from what we can tell. But it turns out, as we get a little closer, oh my goodness, the ship is rocking back and forth, and there's lightning.

Okay, yeah, the animation takes a second to kick in. Now we're talking: lightning crashing down, the ship swaying, we're taking on water, a really bad situation. So we have that happening, and it's something our observability data wasn't telling us about.

So what are the real scenarios here? We might have some external issues that we weren't measuring, or we might be misattributing some errors. Let's start with the external things. So far we've been measuring everything from the web server; the metrics and the structured logs are emitted from our web server. But what about all the stuff that can go wrong before the request even gets to us?

Let's take a real-world thing that can go wrong. Customers access our site from a web browser or mobile device, and it's getting some of the code it uses to render pages from a CloudFront distribution, code that we've deployed there. The client then sends messages back and forth to our server, the origin, saying things like, hey, get me the product data.

Now, what would happen if we deployed a new version of that client code that was incompatible with the server logic? We deploy to the CloudFront distribution, it starts getting picked up by customers' web browsers, and they start sending these yellow messages, and the instances reply back: I've never seen that data format in my life, I have no idea what that is. So those would be errors. Customers would be getting error responses, and we would have a harder time seeing that using server-side metrics.

Aboard ships, what we might do is have every ship report its local weather to some central weather station, which sends out summaries: hey, a ship in the direction you're headed was seeing a storm. How can we apply that to our world? We can use something called CloudWatch Real User Monitoring.

Here, there would be code in the customers' web browsers sending data back through a side channel: hey, by the way, here's the success ratio I'm seeing, here's the latency, the time to render, for all these things. So you have all of your customers sending in data about how you're doing. It's pretty convenient: you're learning from customers without them having to call you up. That's a useful pattern, and there's a service, CloudWatch Real User Monitoring, that will do it for you.

Now, there's another type of failure mode here: we can botch a migration. Say we're running all of this on EC2, but we're trying to switch to a containerized environment. So we set up a whole other stack, a new web server stack with a new load balancer and the containers. Great. And to shift customers from the old stack to the new one over time, we'll use DNS weights in Route 53. But what would happen if we broke our new environment? Maybe we made some network ACL change or something like that, and it isn't actually working; we thought it was, but it's not. When we start to dial up the traffic, customers are going to start seeing errors as they talk to the new stack. But we might not be measuring it, because the failure happens before the request even reaches our application, so our application metrics won't show it.

There's another technique we can use here. Real user monitoring works when there happens to be a ship near the storm. But what if there are no ships near that storm right now? Well, we can deploy a bunch of weather balloons, monitoring stations everywhere, to get really nice coverage of all of the seas, so they can report back and we can pick up on where the storms are.

To do that in the CloudWatch world, we might use CloudWatch Synthetics. Synthetics essentially runs your integration tests continuously, asking: is this part of the site working? Is that part? Even if your customers aren't using that part of the site right now, it still gets exercised and tested, just to verify that everything is working all the time. So that's another tool we can use to measure things outside the typical purview of our server-side metrics.
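As a flavor of the kind of check a Synthetics canary (or any scheduled job) could run continuously, here is a minimal sketch; the URLs are placeholders and this is not the canary runtime's own API, just the idea.

```python
import urllib.request

PAGES = [
    "https://example.com/",
    "https://example.com/product/123",
    "https://example.com/cart",
]

def handler(event=None, context=None):
    # Exercise each page even if no real customer is hitting it right now,
    # and fail loudly so the failure shows up as a metric and an alarm.
    for url in PAGES:
        with urllib.request.urlopen(url, timeout=5) as resp:
            if resp.status != 200:
                raise RuntimeError(f"{url} returned {resp.status}")
```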

So the takeaway here is pretty simple: look at metrics emitted by synthetic workloads and by real clients of your application to catch issues outside of your own measurements. Okay, fantastic.

Now, this is actually my favorite part of the talk; to me it's the most interesting, mind-bending part. Let's talk about another type of failure where we don't know whose fault it is, or we have a misunderstanding about whose fault it is. What do I mean? There's another class of bug we can deploy: input validation bugs. It's kind of like that mismatched protocol problem we had between the CDN code and our server code, where the clients were making requests that were just wrong.

Take an API like, let's say, CloudWatch Alarms: we have an API that you all can call to create alarms, and when you create an alarm you give it an alarm name, and we allow alarm names within some configured bounds, just to put limits on things and make sure the inputs make sense.

Say we have a limit of 140 characters on that input. What would happen if we deployed a change that said, actually, you can now only make requests with inputs up to 100 characters in length? Well, we've broken all of you who are creating alarms with names over 100 characters. We would be giving you errors, but we wouldn't think it was our fault. Why? Because there are two categories, two dimension values, of errors.

There are server faults: these are when there's a server-side error in HTTP. You know this from using web pages; if you're getting a 500-something, a 5xx, back from a server, that's the server's fault, right? There's a null pointer exception or something; it's very clearly the server's fault.

What about the 400s? You see a whole bunch of different things there: the resource you're requesting isn't found, well, that's not the service's fault; or Conflict, Gone, Length Required, Precondition Failed, Payment Required (I don't know that one). Those are a little more ambiguous, or in this case, stated to be the client's fault.

So let's break it down per dimension: we have a dimension for server faults and a dimension for client faults. Server faults are really easy to alarm on: if there are any, you should alarm. What about the client errors? Is there a problem here? What should I do? How do I set an alarm threshold? These things aren't our fault... except, like I just said, I could deploy a bug. Is that what the second spike is? What's the first one? I don't know; are our customers just calling us wrong? The problem is I don't know where to put this alarm threshold. It's either going to be too high and I'll miss issues that were our fault, or too low and it'll cry wolf all the time, noisy and alarming when it's not actually my fault.

So how can we dig in? How can we take this client fault line and use some observability tool to drill in and understand what makes up that metric? Is there an observability tool we could use for that? Dimensionality. We can plot the client faults per client, per customer, to see how many of these client faults each of our customers is getting. Great. And the blue line is the overall rate.

Here, in this scenario, one customer suddenly spiked and started getting a lot of client faults. Maybe they were starting to implement their application, trying it out, and getting it wrong; maybe it's fraud; whatever. It's a bunch of errors for that one customer, and that's pulling up the overall error rate, and that would put us into alarm, but it's not our fault. This is clearly them doing something wrong. We might want to talk to them and help them, but we don't need to wake up in the middle of the night about it.

Now, if we were to see this picture instead: many of our clients, our top 500 clients or whatever, are seeing errors categorized as their fault, all at the same time. It's unlikely that all of our customers got together and conspired to send requests to our service with invalid parameters in the same moment, right? So this is probably our fault, and we should definitely investigate when this happens.

But I don't have a feature in CloudWatch for you to alarm on this shape. I mean, we do have prediction bands, but they wouldn't really work here. The other shape was one client; this shape is many clients. So how do we do the math to get the right alarm?

The problem is that the thing on the left is what we've typically been looking at: the percent of requests that have errors; the numerator and denominator are requests. What we want is to know what percent of our customers are seeing errors. It's a more customer-focused metric, the one on the right: I want to know how many customers of my site are having a problem that's nominally their fault, because if it's a lot of them, again, they didn't conspire; it's my fault.

The reason these are different, and hard, is the math. The one on the left, the noisy one, is the number of requests with errors divided by the number of requests. What I want is the number of customers with errors divided by the number of customers. So how do I fill in those two variables? What metric shows me the number of customers? That's actually pretty hard. If you think about it, for the per-customer dimension, I would need to know the number of distinct values within that dimension, and that's a harder thing to calculate.

It's really hard to calculate in a distributed system; I would basically need a database of all customers to know how many there are. So how can we do it? Fortunately, there's a way to do this pretty easily using some CloudWatch features. It's easy, but it's a little novel; it's an interesting trick.

I mentioned that Contributor Insights keeps track of only the top N, the top 500, dimension values for you; if you have a per-customer dimension, it keeps track of the top 500 customers. Great, and that saves you from having a time series for every single customer that goes to waste. But it has a second capability: while it's doing that, it also estimates the total number of distinct values in that dimension. So while it only keeps the time series for the top 500, it knows that there are, say, 100,000 in total; it estimates that with pretty good accuracy.

So with two rules, one that tracks the errors that are the customer's fault per customer, and another that tracks requests per customer, we now have the number of customers with errors and the number of customers. Divide the two with metric math and you get the graph we wanted: the percent of clients with errors. Fantastic.
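A hedged sketch of that metric math, assuming the two hypothetical Contributor Insights rule names used here (the exact quoting of the INSIGHT_RULE_METRIC arguments may need adjusting):

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_data(
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=3),
    EndTime=datetime.datetime.utcnow(),
    MetricDataQueries=[
        # estimated number of distinct customers getting client-fault errors
        {"Id": "err_customers",
         "Expression": 'INSIGHT_RULE_METRIC("client-faults-per-customer", "UniqueContributors")',
         "Period": 60, "ReturnData": False},
        # estimated number of distinct customers overall
        {"Id": "all_customers",
         "Expression": 'INSIGHT_RULE_METRIC("requests-per-customer", "UniqueContributors")',
         "Period": 60, "ReturnData": False},
        # the customer-focused metric we actually want to graph and alarm on
        {"Id": "pct_customers_with_errors",
         "Expression": "100 * err_customers / all_customers",
         "Period": 60, "ReturnData": True},
    ],
)
```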

Okay, so that was the misattributed fault case. The key takeaways: when many customers suddenly see errors that are categorized as their fault, it may actually be your fault. To catch that case, calculate metrics like percent of customers instead of percent of requests, using Contributor Insights rules to estimate the cardinality of a dimension. You get aggregate metrics in a different dimensional way by using Contributor Insights' ability to estimate the number of things in a dimension. It's pretty cool. We can be more customer-focused in our alarms and understand what our customers are seeing rather than what requests are seeing. Requests aren't customers; customers are customers.

So now let's spend a little bit of time on preventing future issues. We'll talk about using utilization metrics and then about running controlled experiments.

Resource utilization is important on our ship because we don't want to run out of food. So let's pay really close attention to how much food we have, or else we will have a very abrupt end to our journey.

So why do we need to pay attention to this? After all, everything I've been using in this architecture has been elastic; it's in the name of literally every component here: Elastic Compute Cloud (those are the EC2 instances), Elastic Load Balancing, Amazon EC2 Auto Scaling (the word elastic is technically in there). So if these are so elastic, why do I have to pay attention? They should just grow for me, right?

Well, they do; they're all built on elastic building blocks. But at the end of the day, everything we have with some min and max is something I should be paying attention to. At any instant in time, I have so many EC2 instances and they're running at a certain CPU, so clearly I need to watch that. I have so many disks, and their file systems are at some percent utilization, so I should be watching disk utilization. Even inside our application we have things with a min and max: when you configure a thread pool to handle requests or do asynchronous work, you probably configure it with a maximum, because if you don't, you could run out of memory. So we have these limits throughout.

So we need instrumentation and good visibility into all of these things that have limits at any point in time. We should have alarms on everything. A good alarm on file systems would be: if my average disk storage utilization across the fleet is above 50%, I want an alarm; and if the maximum, the most full disk anywhere in my fleet, is above 80%, I also want an alarm. So think multiple alarm thresholds: per component boundary, but also aggregated across the fleet. Think through how to measure and alarm on these things.

We should have dashboards. I like to make dedicated capacity dashboards that hold all of these things: application capacity, server capacity, everything. And then finally, auto scale on all of these things. Auto scaling is how we take advantage of the elasticity and operate our resilient architecture in a resilient way. You might say, great, I'll use auto scaling and then I don't have to worry about utilization anymore; that's the whole point.

Well, it makes it easier, but you still have to worry about it, because any time you configure an Auto Scaling group there are other properties, like the maximum it's allowed to scale up to. Now I have another thing with a min and max, so I need to pay attention to it. It's definitely better, but I need to know when I'm approaching the configured maximum so that I can do something differently. A nice way to do that: AWS emits a whole bunch of metrics for you, including, for an Auto Scaling group, the maximum size you've configured and the current number of instances. So you can do some simple metric math, get a nice graph for your dashboard, and set an alarm threshold that stays stable no matter how many instances you have. This is the one alarm you need: the percent utilization of your Auto Scaling group. You define the alarm once and it keeps working no matter how big the group gets.
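A minimal sketch of that "percent of max size" alarm, assuming group metrics collection is enabled on the Auto Scaling group and using a placeholder group name:

```python
import boto3

dims = [{"Name": "AutoScalingGroupName", "Value": "web-server-asg"}]
cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="asg-approaching-max-size",
    Metrics=[
        {"Id": "in_service", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/AutoScaling",
                       "MetricName": "GroupInServiceInstances",
                       "Dimensions": dims},
            "Period": 60, "Stat": "Average"}},
        {"Id": "max_size", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/AutoScaling",
                       "MetricName": "GroupMaxSize",
                       "Dimensions": dims},
            "Period": 60, "Stat": "Average"}},
        # One stable threshold no matter how large the group grows.
        {"Id": "pct_of_max", "Expression": "100 * in_service / max_size",
         "ReturnData": True},
    ],
    EvaluationPeriods=3,
    Threshold=80,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
)
```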

What we're talking about now is monitoring the utilization of many different kinds of resources. There are many compute types and many database types to choose from. On the left we have things with a whole bunch of customizability and knobs, which is important; you need that customizability: how many read replicas do I have in RDS? On the right we give you abstractions that remove things, giving you fewer things to worry about. But there's still going to be something to pay attention to. In Lambda, you don't have to think about all the different dimensions of utilization, but you do need to think about concurrency; it's a nice abstraction. And there are all kinds of usage metrics that we emit for the APIs you call, so pay attention to those too. Pay attention to anything that has a min and max.

So to prevent future issues, configure auto scaling on all elastic resources so they react quickly to changes in load and maintain healthy headroom, and measure the utilization of everything, from CPU to thread pools to quotas, by creating alarms and a capacity dashboard.

Finally, let's talk briefly about controlled experiments. We're on our ship, it's a nice sunny day, so: you, just jump into the water, and we'll practice rescuing you, just to make sure we can do it. Like I said, it's important to have practiced operational procedures so that we're able to respond to whatever unknown situation comes up in the future; we can at least practice responding to the knowns.

Good experiments like this, good game days, are well reasoned: they start with a hypothesis, and we make sure the system reacted the way we expected. They're realistic: done in a test environment that exactly mirrors production, using production-like workloads. They're regular: ideally run as part of a code pipeline or something like that. And they're controlled, to make sure they don't run away and do something we didn't expect.

Fortunately, there's an AWS service that helps with this, called the AWS Fault Injection Service. It has these properties and helps you run controlled experiments just like this. But what about the observability side? It's important to also replicate the production observability configuration, all the alarms, dashboards, and runbooks, into whatever environment you're running a controlled experiment in, because we're also measuring our own ability to deal with these known unknowns. I've heard this called observability as code: include all of those definitions in your infrastructure as code, so you're not clicking around creating alarms. And then, when you run the experiments, verify that everything looked the way you expected; you're not just verifying the behavior of the system but also the behavior of the observability.
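For flavor, kicking off such an experiment from automation might look roughly like this; the experiment template ID and tag are placeholders, and the template itself (targets, actions, and stop-condition alarms) is assumed to be defined ahead of time.

```python
import boto3

fis = boto3.client("fis")
experiment = fis.start_experiment(
    experimentTemplateId="EXT123456789abcdef",
    tags={"gameday": "az-failure-drill"},
)
# Track the run so dashboards and alarms can be reviewed against it afterward.
print(experiment["experiment"]["id"])
```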

So to prevent future issues, we recommend running controlled experiments, running them using, say, the AWS Fault Injection Service, and using the same observability tools during those experiments as you use in production, by including things like alarm and dashboard definitions in your infrastructure as code.

So we've now navigated the seas of observability data successfully. We diagnosed issues by using dimensionality a whole lot, in a lot of important different ways. We uncovered hidden issues by measuring from everywhere, aggregating data, and slicing and dicing it in a customer-obsessed way. And then we auto scaled on everything and monitored our controlled experiments and game days.

Now, this is just scratching the surface. There's some depth here, but there's even more in the Amazon Builders' Library articles about building dashboards, doing safe deployments, and everything like that. So thank you, everybody, for coming. Please fill out the talk surveys; they are how we make sure we have the right content for you throughout all of re:Invent. Fill it out for this talk, fill it out for everything. Thank you very much for coming.
