So welcome, all. Thank you for coming. My name is Jim Roskind, and Frank Stone and I will be presenting how Amazon uses better metrics for improved website performance. This talk is for you if you're in charge of groups that worry about latency, latency for your end users, and for you developers working on improving latency: it's important to have the right metrics. Actually, it's even more important for managers to be able to understand whether your staff is producing good results.
So I'm going to talk about good and bad metrics, and there will be some surprises in here which I think can change the way you think about measuring latency. So let's dive into it. I'm going to talk about good versus bad latency goals. The critical thing is this interesting phrase I'm going to introduce, called organizational dynamics, which is more than just math: it's about how a whole group of people act in response to your goals. I'll explain why percentile latency goals are problematic; in fact, I'll show that they don't just fail to help latency, they actively harm it. I'll talk about trimmed mean and why it's a better metric. I'll then discuss histograms as a way to visualize a large amount of data and allow individual contributors to make big progress. And finally, Frank will come up and talk about how to measure your latency in four easy steps using the facilities supplied by AWS.
First question you might ask is, why is this Jim Roskind guy talking about metrics? Well, it starts with a little bit of my history. Back in '94 I co-founded a company called Infoseek, and at the time I implemented the Python profiler, which was actually used in open source for about 20 years. The big thing there is that visualization is key: the hard part is to see what's going on, to understand it, and to identify a problem clearly. That profiler did that and really helped Python and a lot of interesting people. Then I worked at Google on making Chrome go faster. I designed and implemented the metrics infrastructure that's inside Chrome. By the way, if you ever type about:histograms in Chrome, you'll see about 2,000 histograms written by dozens and dozens of developers, gathering statistics about you, for you. I also proposed, architected, and led the development of QUIC, which has now evolved into HTTP/3, and experiments and metrics were critical to that. So metrics are so near and dear to my heart, you wouldn't believe it.
Eventually Amazon hired me in 2016 as a Vice President, Distinguished Engineer, and I helped drive latency down for the ecommerce site using better metrics, which is the context of this talk. You're probably here for this reason: faster is better, customers want their responses as soon as possible, and that is end user latency. But there's a surprise question that comes up here: what is a good goal to lead an organization? Well, maybe I should say what good latency should mean. In general, it should be indistinguishable from instantaneous, but there is the speed of light thing that we have to worry about. We can't really break that, and variance in computer components causes us to have different amounts of latency. In general, a range of like 200 to 500 milliseconds would be darn excellent. But the surprise is that changes of as little as 10 to 40 milliseconds can have large dollar results. This is a surprise. Some people say, oh, under two seconds is acceptable, beyond two seconds they're gone. The answer is no, every user is different. It's a continuum. Every millisecond that you wait, assuming you have, you know, a million or a billion customers, and hopefully you do real soon, every millisecond that you wait costs money, distracts the user, causes their mind to think about other things. So when we look at ecommerce and Amazon: we switched to what I'm going to talk about here, good metrics, and we improved by about 30 to 40% in the first year after we switched over. That was in 2020. In 2021, just in the first half, we were able to identify latency regressions that saved us from losing 500 milliseconds of latency, half a second. The idea is that if you find them early enough and you convince the programmer to roll the change back, you don't end up with the worse latency. And you all know the criticality of finding bugs sooner; that's the time to fix them, while the programmer is still there.
Customers should expect good latency from your product. Well, what do I mean by expect? Expectation is the average value. Amazon needs great averages. When I sit down to buy a product at Amazon, I want to know that when I click buy, it happens. I don't have to go wait, wait, wait, wait, wait. I don't want that. I want to expect good results. So then you'll say, hey, so Jim, why not use average? What are you doing all this talking for? And the answer is that the average, bad word at Amazon, is dominated by the long tail. You'll be surprised when you measure end user latency, especially on the web; you'll get some surprising results. For instance, at one point I measured TCP connection latency, and the truth is the real number is about 85 milliseconds. And then suddenly I got a 24-hour TCP connect: it took 24 hours to complete a TCP connect. Mind you, that sounds like a bad response time. Who thinks it's bad? Yeah, a couple of you, right. 24 hours is bad. How did that happen? The answer is a person was sitting at their desk working. Probably a kid walked in and kicked out the cable, and then the spouse says, come on down, have dinner, and he goes, oh, I'll fix this later. Goes down, has dinner, decides not to come back up and play. Comes in the next night, goes to work, goes, oh, cable disconnected, plugs it in. I get a 24-hour connect. Imagine what that does if I try averaging it in with my 85 milliseconds. A 24-hour sample is kind of a problem. So you can't use all the data that you're getting, because the long tail will dominate. And that's why we don't use average. But I've got to show you something better. Don't leave home without it.
What are bad goals? Before we talk about the really cool goals, let's talk about some of the things that some of you might be using. The first one I'll talk about is under-stats: the percentage of responses under some threshold. Here's an example of an interesting goal, and it was actually used for quite a while at Amazon: we wanted 99% of Amazon's pages to be served to 99% of Amazon customers in under one second. Doesn't that sound like a good goal? Maybe you'd like that to happen for your products? Sounds like a good goal. The advantage is it's one metric; if you're a manager, you get to look down and see how it's going. Unfortunately, under-stats are what I call fence post goals, and organizations don't do well with fence posts. You can think of a fence post as a line of fences: if you're on one side, you're good; if you're on the other side, you're bad. That means if you're on the good side, no one really cares whether it's 0.99 seconds or 200 milliseconds. Who thinks 200 milliseconds is better? That's right, you're all good at this arithmetic stuff. OK. And on the bad side, there's really no difference between 1.5 and 6.5 seconds. That's the evil lurking inside a fence post goal. It doesn't reward the developer, it doesn't measure improvements in productivity, and let's watch what happens when we start to try to use it.
Suppose I went to India and I was talking to a manager. It turns out India historically had about a 400 millisecond round trip time to the server, and I spoke to the manager. This is a mythical conversation. Hey, manager, you've heard about the push we have for latency? Absolutely. And, you know, latency in India is terrible. He goes, oh, absolutely, absolutely. So you heard about the goal we have with under-stats? Yup. I said, and you have the worst latency in the world. Yup. So you must have a lot of programmers working on this, right? He goes, no, I have zero. Zero? Hold it, I thought you agreed that this was important. He says, Jim, you don't understand. 400 millisecond round trip time. TCP takes one round trip, 400 milliseconds. TLS takes two round trips, another 800 milliseconds. I don't even get the request until 1.2 seconds. If I assign a programmer to work on it, they'll never be promoted. They didn't move the metric. My group won't get credit. What should I do? The answer is obviously I assign zero people to that and focus on delivering features where I get credit for my group. Now you start to realize what's happening with a fence post. He's not being unreasonable; he's doing what the organization asked him to do. It's a surprise, but it gets worse than that. I'll just mention in general: if you can't better a fence post, don't even try. And once you've bettered it, stop trying, because you've already passed it and you'll get no additional credit. In fact, you can often get worse if you have reason to do so. Someone says, hey, I got a new feature.
The manager says, "Well, we care about latency. How did you do on the under-stat goal?" The under-stat didn't change. Ship it. Does it matter that he changed two seconds of latency to four seconds of latency? You see the problem. Let's take a look at another common metric. I've actually walked around ahead of time and I know some of you use this: here's P50, the 50th percentile, and P90, on what's sort of called the normal curve. There's a thing called the law of large numbers, which everyone takes to mean that if you add up enough samples, thousands, millions, billions, you get a nice little normal curve like this.
This is a histogram: think of it as the number of fast samples over on the left and the number of slower samples over to the right, with the 50th percentile point marked. Everyone says, oh, that's the mean, that's the median, that's the mode, that's the 50th percentile. And P90. Everyone thinks when they say "I'm going to optimize P50" that they're causing that mode, that bulge, to move to the left and get better. Everyone thinks when they say "we're also monitoring P90" that they're making sure the variance doesn't spread out too much. It sounds really good. Why doesn't this work? In fact, people actually thought it worked, and a lot of you may be using it. Watch what actually happens.
It turns out I could take fast samples that are below the median and make them worse. Remember what I told you? Can I ship the product? Did P50 or P90 move? No, they didn't move. Ship it. I can move samples that are above the median but below P90 and make them slower. Can I ship it, boss? Sure, go ahead, as long as P50 and P90 didn't move. We got a new feature; we'll get a lot of credit for this. And what about bad samples beyond P90? By the way, I used to make fun of this at Amazon, which is customer obsessed, and I'd say, do you really not care about 10% of your customers? Maybe instead of P50 and P90 some of you say, no, I look at P99. The same things apply. And watch what happens. You see the two fence posts, P50 and P90, and you have all these little sheep. They're developing, and as they land features, they huddle up toward the two guardrails at P50 and P90.
You'll say, Jim, I don't believe it. In fact, this is a nice animation I had. I thought it was an interesting thought and I started pushing people on it, and then I spoke to a senior engineer. He said, oh yeah? Show me the data. Which is a good thing to ask for. So I started pushing the metrics folks to gather histograms of real data, and this is the first one I got. This is Search. Notice what you see: remember, on the left are good samples, to the right are slow, bad samples. Two big modes. It's not that normal curve we thought we had with P50 and P90. It's two bulges sitting right up at P50 and P90. This is called reality.
He says, oh well, that's just one sample. Show me another one. Checkout page. Huh? What happened? What's that bulge doing at P90? What's that bulge doing at P50? Holy cow, the sheep really might have been moving. They weren't trying to move; this was happening accidentally. Organizational dynamics. And to quote a vice president I had a big talk with, he said, I get it, Jim: we got what we asked for, but it wasn't what we wanted. Organizational dynamics. And in fact, when a measure becomes a target, it ceases to be a good measure.
P50 and P90, once you set them up and once you start asking your groups to optimize them, become very bad measures. They are no longer what you dreamt they were, an indication of how well the normal curve is behaving. So just to review before we go into the fancy stuff of what good goals are: what are good versus bad goals? Under-stats are bad goals. They are fence post goals. That is bad. We saw what happens in India; it happens all over the place. People huddle up against the fence post.
P50 and P90 are bad. Again, they huddle, but it's worse than that. It's not just that you don't improve; you get worse. The metrics get worse. Why? The answer is, some bigwig at your company comes and says, listen, I want you to implement backflips on your web page. Oh, backflips. We spend six months working on it with the top engineers. They finally come back to the bigwig and say, we can do it, boss, but I will tell you it'll cost us a little bit on the P50 and P90 metrics. Could I make them worse just this one time? And the bigwig says, well, of course, because we know we optimize P50 and P90 all the time; I'm sure we'll recover it. But you never recover it. P50 and P90 get worse year over year. This is what's happening. And you say, well, Jim, you're dissing P50 and P90, what have you got for me then? And the answer is trimmed mean 99, TM99. I'll talk about it in a second. It's a good goal for latency. It measures everything. It doesn't ignore success, and it notices failures; it notices regressions in latency.
Histograms will also help us reach and surpass a goal. What is the trimmed mean? What is this magical thing I'm talking about? The definition: trimmed mean 99, I say TM99 for short, discards the slowest 1% and then averages the rest. TM99.9 would discard the slowest 0.1%. Why would you do that? The answer is, I know there's noise in the average. I know I've got to get rid of that silly 24-hour sample. I'm not sure exactly what it is, but I know there's noise at the end, and that's what I want to get rid of. By the way, in case there are statisticians here who want to yell at me in the Q&A session, just to get this out of the way: officially, a statistician will say, archly, "Jim, trimmed mean means you trim at the top and the bottom. You should learn that." And yeah, yeah, that's true. But it turns out the zeros and the little low values weren't distorting the average; the 24-hour samples were the problem. So I typically only trim at the top. You could trim at both ends. And then another statistician will say there's this fancy thing called Winsorizing, where instead of discarding the samples at the edge you replace them with the value at exactly P99. Again, it's only 1%; it doesn't really affect the mean. OK, if someone in your group says, I want to Winsorize, have at it. It's fine. That's still a good metric.
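To make the definition concrete, here is a minimal Python sketch of computing a trimmed mean from raw latency samples; the function name, the 99% cut, and the sample values are all just illustrative:

    def trimmed_mean(samples_ms, keep_fraction=0.99):
        """Discard the slowest (1 - keep_fraction) of samples, then average the rest."""
        ordered = sorted(samples_ms)
        keep = max(1, int(len(ordered) * keep_fraction))  # how many of the fastest samples to keep
        kept = ordered[:keep]
        return sum(kept) / len(kept)

    # The 24-hour outlier wrecks the plain average but barely moves TM99.
    samples = [85] * 990 + [300] * 9 + [24 * 60 * 60 * 1000]  # mostly ~85 ms, plus one 24-hour connect
    print(sum(samples) / len(samples))   # plain mean: tens of thousands of ms, dominated by the outlier
    print(trimmed_mean(samples, 0.99))   # TM99: close to the typical 85 ms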
But now I want to say something really important. I actually didn't know what trimmed mean was; I worked on this problem for like six months, while I was at Google, trying to understand a good metric. Eventually I spoke to a statistician and she says, oh, that's just trimmed mean. I go, cool, now I have a word for it. But trimmed mean is not "discard above N." It's not "discard above 10 seconds." Let me make it clear: it is discarding a percentage. As an example, it turned out I was at a company where their policy was, yes, Jim, we heard about latency. Yes, Jim, we heard about those outliers. So we always throw away all the results that are over one minute for, you know, connect; obviously that's noise. And I said, really? He says, yep, absolutely, we throw away everything over one minute and then we average the rest. I said, I have good news for you. I know how to improve your stats. They go, what? I said, it's really simple. Your back end response time is 100 milliseconds on average, and it varies, right? He says, yes, yes, yes. I said, good. Call the back end. If you get the answer in less than 100 milliseconds, ship that thing to the user. If it's over 100 milliseconds, wait one minute and then ship it. Your slow responses will now land past the one-minute cutoff and get thrown away, and you'll see your metrics get better.
They laugh and say, well, you would never do that to me, would you? Well, the answer is: unwittingly, yes. When you shift your distribution out, you're throwing away bad results at the edge. Another way to think of trimmed mean is Olympic scoring. Suppose I'm doing Olympic gymnastics scoring and I have 10 judges. What do I do? I always throw away the high and the low and average the rest. Why do I do that? The answer is the high comes from the Martian judge who hopes that the Martian gymnast is going to win, and the low comes from a judge from Mercury.
And they're trying to get the Mercury tribe to win. Those are both noise. They need to be discarded, and now we average the middle and we get an answer. That's doing a trimmed mean. It really is used. So trimmed mean is the expected value, the real expected value sans the noise. It discards almost nothing; it discards noise. Unlike P90, trimmed mean 99 watches the worst 10%, and it watches all the things in between. TM99 provides constant organizational pressure: a single organizational metric that's easy to watch. You get credit and blame for almost all latency changes. You can't sneak in around the fence posts. It's customer obsessed, as it watches almost everything. TM99 is smooth. I didn't even read the cool thing on the slide: it's smooth over time. And if it's really smooth and isn't changing, you might look at TM99.9.
The truth is, the internet has a lot of problems in it, so TM99 proved to be a very, very good metric of success. So, do goals really drive an organization? One of my leaders said, Jim, we've been using, sorry, P90 and P50 for many years. How long will it take to heal? And I gave my answer: I don't freaking know. But we tried it, and let me tell you what happened.
This is an example of January 2020 versus October 2020 for ecommerce. This is what happened. Look at the x axis: it turns out you always use a log scale so you can see everything. The first number on the left is 10 squared, that's 100 milliseconds. In the middle it says 10 cubed, that's 1,000 milliseconds, which means one second. At the far right is 10 to the fourth, that's 10,000 milliseconds, which is 10 seconds. You see what happened? The yellow curve is the bimodal distribution around P50 and P90. A mere 10 months later, do you see what happened? All these samples, remember, left is good, you want mass on the left. This is what happened. And you say, Jim, that's a little bit interesting, maybe you got lucky. OK, hold on to your seats. I'm going to show you an animation, month by month, of these histograms. You ready? Here we go.
This is April. September, October, November, December, January, February, March, April, May, June, July, August. Did anyone see a change? Raise your hand if you saw a change. OK. That's what happens when you switch to a good metric. That's what happens to your system and your latency. It's organizational dynamics. I didn't come and say, get rid of the second mode. I just said, this is how you're being judged, and every time you land something that's problematic, we remove it, and this is what happens. We stop the sheep from going up to that fence post and hanging out there.
Well, now let's talk about histograms. You've already seen pictures of histograms, which show the obvious value add, but why do I go after histograms? Here's an example of a histogram. This is from Chrome, when I measured TCP connect latency on Windows. By the way, this is a summary of 9 billion, with a B, samples. It's pretty cool how much data you can show in a histogram. Notice, by the way, it doesn't exactly look like a nice normal curve, or if you're fancy, a log-normal curve, because the x axis is exponentially spaced, which means it's a log axis.
Well, when you look at this, you can see P50, it's 53 milliseconds, and you can see, since it's exponential, P90 right next to it is 300 milliseconds. And that's kind of interesting. Then you start looking at this and wondering. This is where individual contributors can do stuff: they start wondering about pieces. Now, mind you, if you were looking at P50 and P90 and I didn't show you this, you would have assumed it was a normal curve. It's not. We're not in Kansas anymore.
All right. So then you say, well, what's this thing over on the left? I'm going to talk more about that. This turned out to be cached connections. I had pre-connected TCP for some interesting reasons, I'll mention that in a second, and they were sitting there and just got reused. The largest one, you can't quite read the numbers on the bottom, is about 10 milliseconds. In general, no TCP connection over the real internet happens in 10 milliseconds, let me tell you. But it turns out those connections were already sitting there in the local Windows box. And then you look at this other thing: what the heck is that? Again, if you were looking at P50 and P90, would you have noticed it? This is where histograms come into their own: showing you what's going on with latency, showing it to you in a way that you couldn't see before, visualizing data that you didn't understand.
It turns out that the internet is an equal opportunity destroyer of packets. Roughly 1 to 2% of all packets on the internet, it varies, are destroyed, and it doesn't really care what the packets are. It turns out that's because of congestion. When you have congestion going across the internet, queues build up; when the queues overflow, the packet gets thrown away, and TCP is trained to go, whoa, congestion, slow down. And that's why the internet works. In the meantime, it's constantly throwing away packets as TCP causes this overflow. So consider the SYN packet: when you do a TCP connect, you send a SYN, it's like "can we talk?", and you get back a SYN-ACK. That SYN packet can be killed, it can be dropped, and when it's dropped, Windows cleverly waits three seconds. You can't see the numbers at the bottom; that mode is at about 3,000 milliseconds. That's the three-second timeout. Which means, let's do the multiplication: 2% of the time we have a three-second timeout; multiply those together and that's 60 milliseconds of latency added on average. By the way, this was interesting because the servers couldn't even see it at the time; this is measuring at the end user. Someone asked me about this before: I always look at end user latency. That was an additional 60 milliseconds of latency. And once I saw this, I had a great suggestion: listen, after we pass P90, that's only one out of 10 connections, after 300 milliseconds, if Chrome still hasn't heard a response, raise a second connection. And I did. So I took most of the connections that would have had to wait three seconds and brought them down to 350 milliseconds.
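The arithmetic behind that 60 milliseconds is worth writing down; a quick sketch using the rough numbers from the talk (a ~2% drop rate and the 3-second Windows SYN retransmission timeout):

    # Roughly 2% of SYN packets are dropped, and Windows waits ~3 seconds before retrying.
    drop_rate = 0.02
    syn_retry_timeout_ms = 3000
    added_expected_latency_ms = drop_rate * syn_retry_timeout_ms
    print(added_expected_latency_ms)   # 60.0 ms added to the *average* connect time

    # Racing a second connection after ~300 ms caps the penalty near 350 ms instead of 3000 ms.
    capped_penalty_ms = drop_rate * 350
    print(capped_penalty_ms)           # ~7 ms: most of the 60 ms of expected value comes back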
Suppose I was a P50/P90 shop and I did this. I saved 60 milliseconds. Who thinks 60 milliseconds is good? Let the record show that some people raised their hands. OK. So the interesting thing is, this was a good thing, but in a P50/P90 shop my boss would say: you added a lot of complexity, you didn't really save anything, P50 and P90 didn't move, what are you doing? But the answer is I did something very good, and I got credit, because I would be looking at trimmed mean, I'd be looking at the expected value, and a reduction of 60 milliseconds in expected value is good. And by the way, that thing over on the far left, all those fast samples, was a vindication and validation of an existing optimization that I had already done. Chrome does this for you today: when you connect to a site, it remembers all the other subsites that site connects to, and it pre-connects, making extra TCP connections in anticipation of their use, without asking Google or anyone else. It learns for you, makes these connections, puts them in a queue, and then they're available when needed. So this histogram is vindicating me, in terms of showing the optimizations I've done, and it's showing me potential for additional optimization. That's why you look at histograms.
Histograms show all the data, and every surprise is an opportunity. It's better than watching P99: outliers are visible, and so are surprising modes. Be sure you use a log scale on the x axis so that you have tremendous dynamic range and it still fits on the page; your entire dynamic range will be visible and you can see things.
So in summary, trimmed mean is a better goal and histograms are a better visualization. And then you say, well, gee Jim, that's great, but how do I apply this to my applications in AWS? To talk about that, Frank will take the wheel. Thank you, Jim.
You want to be able to respond: to know when you have a regression in your latency. When something's going wrong in your environment, you need to take action, and that's something that you can do with CloudWatch as well. Then finally, you want to use, as Jim talked about, the power of histograms to really look at your data, to find those surprises, and to identify the opportunities you have to improve your latency.
So when you're collecting your latency metrics, there are three different ways you can do that. Jim talked about really measuring out at the end user; you can do that with CloudWatch Real User Monitoring. That's a tool that essentially allows you to instrument your web pages so that you collect the metrics from the clients, bring those into CloudWatch, and then visualize them and use them as your latency metrics.
If you have your own metrics that you want to bring into CloudWatch, metrics you're collecting through your own application process in some way, you can use an API. We have the PutMetricData API, which is another way that you can bring metrics into CloudWatch.
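For instance, with boto3 (the AWS SDK for Python), a single latency sample could be published like this; the namespace, metric name, and dimension are placeholders:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish one end-user latency sample (in milliseconds) to a custom namespace.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Latency",                      # hypothetical namespace
        MetricData=[{
            "MetricName": "PageLoadTime",
            "Dimensions": [{"Name": "PageType", "Value": "Checkout"}],
            "Value": 412.0,
            "Unit": "Milliseconds",
        }],
    )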
The third approach, which I think is kind of fun, is that if you're already bringing logs into CloudWatch for your logging, you can just embed the metrics in those logs and CloudWatch pulls them out. You essentially get the metrics into CloudWatch for free. So that's another nice way to do it.
This shows you the actual screen that you see when you go into CloudWatch Real User Monitoring, and it's a pretty straightforward process. If you want to, say, monitor your corporate website, you put in the application name, you fill out a few questions, and then you get a snippet of JavaScript code that you put at the top of each web page. Then, as your clients run, that data gets moved into CloudWatch and you can see information about how long it took to load the page. You can also set up some extended metrics so you can see what region it came from, what type of OS it was running on, that type of thing, so you can slice and dice the metrics.
Now, if you have your own data that you want to bring in, you can achieve a similar result by using the API, or with support from the command line interface. And you can do it a couple of different ways: each time you collect a piece of information, you can put it into CloudWatch, or you can aggregate the information, make fewer calls, and upload it in bulk.
And one of the things that you can do when you bring that information in is use dimensions. It's one thing to look at your data and say, here's the overall data for the application, but often you want to slice and dice it a different way. Maybe you want information about whether it was on your mobile devices, or what version of the application was running, that type of thing. You can add those in as pieces of information that you upload along with your metric, and it gives you the ability to view your metrics from different angles.
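Putting the last two ideas together, here's a hedged boto3 sketch of a bulk upload that also attaches dimensions; the namespace, metric name, dimensions, and sample values are all made up for illustration:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Send a batch of raw samples with counts; CloudWatch aggregates them server-side.
    # Publishing raw values (rather than pre-summarized StatisticValues) is what lets
    # CloudWatch compute percentile and trimmed mean statistics on this metric later.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Latency",
        MetricData=[{
            "MetricName": "PageLoadTime",
            "Dimensions": [
                {"Name": "PageType", "Value": "Checkout"},
                {"Name": "Platform", "Value": "Mobile"},   # extra dimension to slice by later
            ],
            "Values": [95.0, 110.0, 240.0, 3050.0],        # distinct sample values observed
            "Counts": [120, 80, 15, 2],                    # how many times each value occurred
            "Unit": "Milliseconds",
        }],
    )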
And the third method that I mentioned was the embedded metric format. You can see on the side of the screen here an example of the JSON that you would insert into your logs to add information that CloudWatch can then interpret and bring in as metrics that you can view.
Now, one of the nice things about this approach is that you can actually add a whole bunch of extra context into your log. So if you do see a degradation in your latency and you want to look into it more deeply, you have that information right there at your fingertips, easily accessible.
Now, if you're doing this in AWS on one of the AWS services, a nice way to put this information into your logs is just to use Lambda. You can quickly grab a Lambda function and put some information in the logs, or do it from a Lambda you already have in the process, to capture information about the activity.
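As a sketch of what that might look like, here's a minimal Python Lambda handler that prints one embedded-metric-format record to its log; the namespace, dimension, and metric names are hypothetical:

    import json
    import time

    def handler(event, context):
        start = time.time()
        # ... do the real work of the function here ...
        elapsed_ms = (time.time() - start) * 1000.0

        # One EMF-formatted log line; CloudWatch extracts "BackendLatency" as a metric.
        print(json.dumps({
            "_aws": {
                "Timestamp": int(time.time() * 1000),
                "CloudWatchMetrics": [{
                    "Namespace": "MyApp/Latency",                     # hypothetical namespace
                    "Dimensions": [["Operation"]],
                    "Metrics": [{"Name": "BackendLatency", "Unit": "Milliseconds"}],
                }],
            },
            "Operation": "Checkout",
            "BackendLatency": elapsed_ms,
            "RequestId": context.aws_request_id,   # extra context for debugging, not a metric
        }))
        return {"statusCode": 200}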
If you're using the CloudWatch agent on EC2 to bring logs in, you can just grab the application logs and set up your application so it records the information that you want to capture from a latency perspective. And again, that comes in and is presented in CloudWatch as a metric that you can visualize.
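If you go the agent route, the relevant piece is the logs section of the CloudWatch agent configuration file, which tells the agent which application log files to ship to a log group; a rough sketch, where the file path and log group name are placeholders and the latency information itself is written by your application, for example as the embedded-metric-format lines shown above:

    {
      "logs": {
        "logs_collected": {
          "files": {
            "collect_list": [
              {
                "file_path": "/var/log/myapp/latency.log",
                "log_group_name": "myapp-latency",
                "log_stream_name": "{instance_id}"
              }
            ]
          }
        }
      }
    }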
So now I've captured my data; the next step is I want to measure it. When you visualize your data, you have to pick the statistic that's used to calculate the data points, and we've just added, as you can see in the dropdown here, trimmed mean, or TM, as one of the options you have. So really, to get the benefits that Jim talked about, don't pick the percentile statistics; pick the trimmed mean statistic. That's what we really want you to take away from this talk: that's the way you should think about this.
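Outside the console, the same choice is just the statistic string you pass to the metric APIs. Here's a sketch with boto3's get_metric_data, reusing the hypothetical namespace and metric from earlier, where "tm99" requests the trimmed mean that drops the slowest 1%:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=[{
            "Id": "checkout_tm99",
            "MetricStat": {
                "Metric": {
                    "Namespace": "MyApp/Latency",
                    "MetricName": "PageLoadTime",
                    "Dimensions": [{"Name": "PageType", "Value": "Checkout"}],
                },
                "Period": 300,          # 5-minute buckets; keep them wide enough to hold many samples
                "Stat": "tm99",         # trimmed mean, discarding the slowest 1%
            },
        }],
        StartTime=end - timedelta(hours=3),
        EndTime=end,
    )
    print(resp["MetricDataResults"][0]["Values"])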
Now, when you do this, another thing you have to think about is the number of samples. Jim talked about the billions of samples he had when he was doing the work on Chrome. If you just have a few samples, if you pick a very small period or you have a low number of samples, say I've got five samples and I throw out one of them, that's not an outlier; you're throwing away real information. So really aim for a minimum of 10, but preferably hundreds or thousands, of data points when you calculate a trimmed mean value. I'll talk a little bit more about how you can track that in a minute. But it's important to really think about the density of your data and the period that you use to analyze trimmed mean.
So, a few tips in terms of how you set it up. One of the things you can do with trimmed mean is also look at your outliers. In general you want to measure your average with trimmed mean, but you've said, hey, that last 1%, I'm just going to slice it off with TM99. It's interesting to know if that 1% changes significantly, so you can monitor that as well: you set up another calculation on the metric, and you basically bring the left-hand guardrail up to 99%, leaving the right one at 100%. That then shows you the trimmed mean of essentially that last 1%, which can be useful to monitor. Another thing that you can do:
Jim mentioned that usually with trimmed mean you really want to start at zero; that's where you're going to start. But he showed you that data where caching was in the metric, where you had these cached connections that were below the 10 millisecond level. In that instance, you could actually set the trimmed mean to start at 20%, take that out of the picture, and really focus on the real data, not the cached data. So you have the ability to trim the left side as well as the right side, and depending on how your data works out, you can use that approach.
And then there's another statistic that you can use in CloudWatch to look at the density of data going into your calculations. Remember, I just mentioned that you want enough data points to calculate the trimmed mean accurately. Well, this one will count them for you, and you can watch that. So if something changes, say you start getting a lower number of users bringing in information and your number of samples starts decreasing, you'll see it. Instead of TM, you pick TC, with the same guardrails that you used for TM, and you get that count so you can watch it.
And Jim mentioned there are other statistics, like Winsorized mean versus trimmed mean; they're very similar. If you want to play around with that, we have it as well. It's just WM, for Winsorized mean, instead of TM, and you pick similar guardrails. It's available and out there if you want it.
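To pull together the variations Frank has been describing, here's a small Python reference of the statistic strings involved; the exact boundary percentages are just the examples from the talk:

    # Statistic strings you can plug into the console or the Stat field of the metric APIs:
    STATS = {
        "tm99":         "trimmed mean: average of everything except the slowest 1%",
        "TM(99%:100%)": "trimmed mean of only the slowest 1% -- watch your outlier slice",
        "TM(20%:99%)":  "also trim the fastest 20%, e.g. to exclude cached connections",
        "TC(:99%)":     "trimmed count: how many samples landed in the TM99 range",
        "wm99":         "Winsorized mean: clamp the slowest 1% to the 99th percentile instead of dropping it",
    }
    for stat, meaning in STATS.items():
        print(f"{stat:14s} {meaning}")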
Now, once you've got the statistics set up, you've picked your period, and you've got your data loaded in, you actually want to use that data to engage and improve your environment as well as to respond to any issues you might see. CloudWatch has both alarms and dashboards that you can use to make that happen.
So this is a picture of a sample application that I put together, and I just grabbed some statistics trending over time. The line at the top represents the alarm threshold that I'd set. One of the nice things about CloudWatch is that every time the data crosses the alarm threshold, it doesn't necessarily trigger, and that's good: I can pick essentially how long I want it to stay above the threshold I set, so I avoid erroneous triggers of the alarm, that type of thing.
Once I have the alarm and it's triggered, then I have to decide what I want to do, and CloudWatch gives you a variety of options. The first thing you typically do is notify somebody, so you'd send a text message or an email that says, hey, my alarm changed. The color will also change on your dashboard, but people aren't always watching that.
Another thing that we do internally is that we often automate it: we basically create a ticket in our issue management system. So I've already got the information set up in the ticket that I'm responding to; it does the automatic paging and the automatic escalation, and that starts the process of responding to this potential user issue where I've seen a change in the latency of our applications.
Now, maybe you have a good sense of what you need to do when you see your latency increase, because oftentimes that might be based on load, and I need to add more compute than I'd originally provisioned to be able to support that increasing load. So with CloudWatch, you can set an action that says, when this alarm occurs, go ahead and add some compute, and that can address your latency issue in real time.
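Here's a rough boto3 sketch of that kind of alarm; the alarm name, threshold, and SNS topic ARN are placeholders, and it assumes the alarm APIs in your region accept the tm99 extended statistic (worth confirming against the current CloudWatch documentation):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="checkout-latency-tm99-high",            # hypothetical alarm name
        Namespace="MyApp/Latency",
        MetricName="PageLoadTime",
        Dimensions=[{"Name": "PageType", "Value": "Checkout"}],
        ExtendedStatistic="tm99",                           # alarm on the trimmed mean, not a percentile
        Period=300,
        EvaluationPeriods=6,                                # look at the last 6 five-minute periods...
        DatapointsToAlarm=4,                                # ...and require 4 of them to breach before firing
        Threshold=800.0,                                    # milliseconds
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:latency-alerts"],  # notify, page, or open a ticket
    )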
Now, I've got these alarms, they've gone off, and I've set up how I deal with the immediate issues. But I also want more information. With CloudWatch, you can bring together the information not only on your latency but on other parts of your environment, so that your teams have ready access to the information that will give them a sense of what they need to do when something breaks.
Now, this is a simple example. I've just got a picture of the latency of my application; on the left side I've got a histogram. But it's easy to add things like CPU utilization, Lambda starts, database performance, memory utilization for my application, things that might give my team, as we're engaging in resolving the issue, a better sense of what's going on. It's also a really good tool, and something that we do, for a regular review of the metrics to see how they've changed, because maybe I didn't trigger the alarm but I'm starting to see some anomalies. By looking at that every week, or every couple of days, through a process, you really can make some significant improvements, and these are the tools you can use to quickly get to that information and dive into the details.
Now, Jim talked about histograms, and those of you who have used CloudWatch probably know that there's not a quick histogram button that you push to get your histograms out there. But you can do it; it just takes a couple of steps. So I want to walk you through how you would do that with CloudWatch.
So the first thing you need to do is look at the range of data that you want to put into your histogram. Typically you're going to start at zero on the left-hand side, and then you have to decide what makes sense in terms of what you want to look at. You're going to use a logarithmic approach, so make sure that you can cover a wide range of data. You can see in this particular chart, my histogram looks a little bit like a normal curve on the left side, but off to the right there's a little bit of data that might be interesting. That's the kind of thing you want to make sure you can see in your histogram.
So once you've done that, we've added a statistic called percentile rank, or PR, and what that allows you to do is basically create slices of your data that you can represent as bars on your histogram. Once you pick that, typically you say, oh, I want 40 or 50 bars on my chart, and you set your smallest slice and define the ranges: 0 to 1, 1 to 2, and so on. An easy way to do it, instead of trying to calculate the exact logarithmic spacing, is just to take a multiplier: take the last boundary and multiply it by 1.2, or 1.3, that type of thing, 20 to 30% larger each time. That will give you a rough logarithmic spacing for your histogram.
Now, once you've put that in place, the data comes in and populates over time, and you can put it in your dashboards; it's very easy to access. Make sure that you select bar as the visualization method; that's the last thing you do, and then you have your histograms.
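If you'd rather generate those slices than type them by hand, here's a rough Python sketch that builds log-spaced bucket edges with a 1.25 multiplier and asks for one PR(low:high) value per bar via GetMetricData; the namespace, metric, and bounds are illustrative:

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    # Log-spaced bucket edges: start small and multiply each edge by ~1.25.
    edges, edge = [0.0], 10.0
    while edge < 10_000:                 # cover 10 ms up to roughly 10 s
        edges.append(round(edge, 1))
        edge *= 1.25

    # One PR(low:high) statistic per bucket: the percentage of samples in that range.
    queries = [{
        "Id": f"bucket{i}",
        "Label": f"{low}-{high} ms",
        "MetricStat": {
            "Metric": {"Namespace": "MyApp/Latency", "MetricName": "PageLoadTime"},
            "Period": 3600,              # one value per hour; use a single long period for one bar per bucket
            "Stat": f"PR({low}:{high})",
        },
    } for i, (low, high) in enumerate(zip(edges, edges[1:]))]

    end = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_data(
        MetricDataQueries=queries,
        StartTime=end - timedelta(days=1),
        EndTime=end,
    )
    for result in resp["MetricDataResults"]:
        print(result["Label"], result["Values"])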
To summarize what we've talked about: Jim really shared the issues you have when you choose fence post goals and set them as the targets your organization goes after. Using trimmed mean takes that away: it allows you to look at all of the data, measure the benefits you get, and reward the people who are driving improvements, even if they're not moving past a specific number that you picked arbitrarily. You can use histograms to really find patterns. There's a lot of detail; if you look at the histogram, you can find interesting things. You can find the SYN packets that are causing the timeouts in the Windows clients out there, the ones that may really be impacting some of your customers, and fix them. Those examples are out there; the surprises you find are really going to be opportunities, and you'll see those in your histograms.
As I mentioned, we support trimmed mean in CloudWatch, and it's a really easy way to get going with this. Just change the statistic that you might have, if you're already using CloudWatch, from P99 or P50 to trimmed mean, look at the results, and see how things change in your environment when you start using that metric. And don't just measure the metric in CloudWatch: use the tools that are available to you. Automate responses, put the alarms in there, set up your dashboards, and look at them regularly so that you can find the areas where there are opportunities to drive improvements. You'll see results similar to what we saw through the changes in process that we put in place.
In terms of going to the next level of detail and exercising this, here are some resources that you can use. This particular link gets you to the documentation on the trimmed mean statistic. So if you want to see more details about PR, TM, and TC and how you set them up, that's all documented with some good examples.
This next link takes you to the documentation on Real User Monitoring. That gives you not only the details on how to get the basic pieces set up, but also how to add some of the extended metrics so that you get the region detail or the OS detail, those types of things, as well as some of the considerations you have when you're implementing RUM.
If you want to use the API, here's a link to the API documentation.
And finally, here are a couple of links on the embedded metric format. The first one shows you the details of how you set up the JSON and put the information into your logs. The second one is a really interesting blog post on British Telecom and what they did with the embedded metric format in their IoT fleet: basically bringing that information in with the logs from the IoT devices and using it to optimize the latency and the performance for their customers.
So with that, I want to thank you for the time that you've spent.