Surviving overloads: How Amazon Prime Day avoids congestion collapse

Hello and welcome. I'm hoping a lot of you are here to hear about surviving overloads and how Amazon Prime Day avoids congestion collapse; if you're not, this is a great time to leave. I'm Jim Roskind, and I'll be presenting with Ankit Chadha. I'm a Distinguished Engineer over on the Amazon.com side of the house, and he's a Solutions Architect over in Amazon Web Services.

So the first thing you might wonder, other than whether I do some pantomime or something, is: who is Jim Roskind, exactly? I was hired by Amazon ecommerce, hired to reduce compute costs. We use a lot of computers over on the .com side, and we have to rent them from AWS, and as you get bigger and bigger that becomes a large part of your cost. They hired me to reduce that.

It was sort of a nice thing to hear: I went over and spoke to folks at AWS and told them, my job over here is to buy less of your stuff. And they said, that's great. They said, if you can figure out how to buy less stuff, we're going to teach all of our customers to buy less of our stuff, and then all those people who aren't buying from us will come over and say they want to buy here too. It's a very long-term perspective. It was interesting to hear that from AWS. For me this was an interesting story, in that it's a real story.

I was also the architect of QUIC, the protocol which is the basis of HTTP/3. I did that while I was working over at Google. I drove the metrics and performance development in Google Chrome; if you ever type about:histograms while you're using Chrome, you'll see 2,000 metrics being gathered, and uploaded in a very privacy-conscious way. In general, I understand performance. I designed the Python profiler, which was used for about 20 years in the open source world.

The big thing, though, is that as I try to reduce compute cost, I'm always a little bit afraid that if I reduce it too much, we'll have a brownout; we'll be under-scaled. So there's a delicate balance between the two, and the nervousness was about problems. And a big problem to worry about is congestion collapse.

So without too much further ado, we'll get into that. I'm going to be presenting the first half of the talk, which will be pragmatic examples and the theory of congestion collapse, trying to get you to understand what it is, in a broader sense than just hearing about TCP congestion avoidance or something like that. It turns out it's all around us, and it's a big problem.

Part two will be practical AWS tooling to avoid congestion collapse, and Ankit Chadha will be presenting that section.

So the overview of my section: what is congestion collapse; showing that there are examples all around us; ways to avoid problems; how to find problems. And then I'll constantly argue the theme, which I will repeat: test past the point of failure, because you can't trust code reviews.

So the first question is: what is congestion collapse? It's this very strange phenomenon. A system gets even slightly overloaded, and when it's overloaded, obviously something is a limiting factor: maybe it's CPU, maybe the disk, maybe the network. Something is at 100%. And when that happens, this interesting thing typically follows: queues start to build up, things start falling behind. And then the surprise, which will make more sense as we see examples: I claim we get to a point where we have zero productive work. All your machines are running at 100%, and none of your users are benefiting from it.

Who has ever seen their systems do this? A lot of you, I think, are SRE types who have seen systems go to 100% and you go, I don't know what it is, there must be an infinite loop hiding in there. It happens a lot, and congestion collapse is often the real culprit, and it gets confused for an infinite loop. And then the most critical thing about congestion collapse is that it gets stuck in this state. You can't even get out of it as workloads seem to get smaller. You're probably saying, well, this still doesn't make sense to me, Jim.

So I'm going to give five examples, starting around the world and eventually ending with Prime Day 2018, of how this happens. We'll talk about metering lights on the highway, and congestion on the highway. The Mother's Day problem: this is the Mother's Day phenomenon I learned about when I was working at Bell Labs, and it's a very interesting realization of what happens to phone networks under congestion. TCP congestion: most of you have heard about that, and I'll try to explain the subtlety of what's going on there, which is different from what everyone says, well, I use exponential backoff, so I know I'm safe. The answer is, when you hear people saying that to you, be afraid; they're probably misusing that phrase. I'll talk about web server overloads, which finally brings us closer and closer to the things you're worried about. And then you came for the fun part: to hear that we tripped over some of this stuff on Prime Day 2018.

First thing: metering lights. How many of you have ever seen metering lights on the highway, where you pull up to an entrance ramp and the light flicks red and then green and then red, one car per green light? And you wonder, what are they doing? You go, oh, they're just trying to be nice and smooth the flow. It's actually much more interesting what they're doing, and it all relates to congestion collapse.

So what happens without metering lights? Assume you have low density and the average speed of cars is 65 miles an hour. I know maybe some of you speed, but we'll just call it 65 miles an hour and then no one will arrest us. How many cars pass a given line? If I put a line across the highway, I'd see car, car, car, car: I get throughput. The road is working for us. It's actually conveying traffic. That's wonderful.

But then what happens when it starts getting congested? Let's go to the extreme and consider high density. What happens when there's one inch between you and the car in front of you? How fast do you drive when it's one inch between you and the car in front of you? He says he stops; it's unusual that someone in the audience is honest with me. I had one woman one time tell me, I drive at 60, and I go, I don't want to be on the road with her anyway. So you stop, and you're going at zero miles an hour.

So now if I put a line across the highway, how much throughput do I get? The answer is zero; it's a standstill. You've been in these standstill traffic jams; you're getting zero throughput. You wonder why they built the highway. It's not conveying traffic. What's going on here? Well, the most obvious thing is no productive work is done by the highway.

So how did this happen? Now we dig a little deeper. Drivers usually get on the highway when it's flowing, travel briefly, and get off. That sounds nice; that was a good day. Unfortunately, when you have high density, they sit in traffic using the resource, using the highway, for hours. So the interesting realization is they're maintaining the high density; they're preventing the traffic flow.

So what happens is we got congestion, and that caused them to go slower, and as they go slower, they use the highway more, and they cause more congestion. It's a spiral of death. That's congestion collapse, and you can't even escape it. Once you have this traffic jam, it will take hours to recover. In fact, often you are in a traffic jam only to find out that the breakdown cleared miles away, and you're still stuck on a congested highway. That's what happened. It was a spiral of death on the highway.

How do metering lights help? Well, they force you to sit on the side of the road, but they don't let the traffic density get too high. If the density gets too high, traffic will slow down. When you're allowed on, you get on the highway, you travel, and you get off, and that keeps the flow going. Now you start to realize they actually meter the ramp to evade congestion collapse; the fact that you're at a standstill waiting at a light to get on the highway is not a problem. They don't want the highway to look like that. But once congestion collapse has taken place, the spiral of death remembers the state, and when traffic demand reduces, you're still stuck in the big traffic jam. It takes hours to recover. Congested users of the highway overuse it and delay the recovery.

OK, well, that's highways, Jim. Can you say something with electrons? They go faster. OK. This is the Mother's Day problem. It was also called the Mother's Day phenomenon when I was working at Bell Labs, and Bell Labs was worried about architecting telephone networks all around the country.

So the blue line here is AT&T Long Lines crossing the country, and it's Mother's Day. We have a student on the West Coast who decides she wants to call Mom. Pretty reasonable: who calls their mom on Mother's Day? I'm afraid to say it, but a lot of you. OK, good. Well, this student has got to try.

So it starts... whoops, too many clicks. First thing that happens: they pick up their phone. This is an old-fashioned phone; I hope some of you remember when you had to pick up and get a dial tone. What a dial tone did is it said there's a reserved line to make an outbound connection from your local switching office. Right now she's waiting for the dial tone; you can see that. No, not yet. And eventually she gets the dial tone. Great: I have a dedicated outbound line from my switching office going to Long Lines. Dial the number.

Once Long Lines hears about it, bam, it connects across the country, reserves a line, no problem, and reaches the point where it tries to egress and get over to the switching center near Mom. And what happens? Bam: sorry, there are no lines available going into her switching office. And that doesn't seem that bad; that happens all the time.

Well, the problem is the caller misunderstands what's going on, because what does the caller hear? Actually, they don't hear a dial tone; they hear something that only a geek would know about. They hear what's called a rapid busy. It sounds more like eeee-eeee-eeee. Unfortunately, for the non-geeks (remember, I checked to see who was a geek here; I'm a geek, by the way), the interesting thing is the callers think Mom must be on the phone, because the line is busy. I got a busy signal; Mom is on the phone right now in the house.

Mom goes in and out all day; I'd better call right back, right now. This is my only shot, because it's probably my brother or my sister who has her on the line. You see how that's kind of bad. The network is saturated, everyone waiting for dial tones: zero end-to-end connectivity, zero revenue, and 100% utilization. We are maxing out the capacity at every single switching center. No productive work is done on the phone system, and the bad news is we encouraged people to dial again by telling them busy, when busy didn't mean busy; busy meant system under duress.

And at some point the phone company realized this, and they changed it to something you might all have heard: a recording which says, I'm sorry, all lines are busy now, please try your call later. They could have added, and your mom is not on the line, but they didn't bother with that part. But they said try your call later, and it turns out it worked pretty nicely. People said, oh, Mom isn't on the line; I'll call back in an hour, at some random time. And that's how they broke out of this: by notifying people of congestion. It completely changed the behavior of the caller.

Let's take a look at TCP congestion. Now, TCP runs over a packet switching network. In a packet switching network, you send packets: packet, packet, packet. To try to send at an even rate, you have what's called a send window; we'll say maybe there are 16 packets you leave outstanding, you wait until you get an ACK back, an acknowledgment that they arrived, and then you send more, so you have 16 in flight all the time, and maybe you grow a little higher, a little lower. I won't go much deeper into TCP congestion control because that's much deeper than this talk. But you're sending packets at some speed, and routers all around have many links coming in and many links going out, and with stochastic processes, random processes, sometimes a lot of packets happen to come in on one side and they are all destined for one outbound link. When that happens...

What's the router to do? Well, the obvious thing: they went to computer school, they have a queue, they put the packets in a queue. But if you went to electrical engineering school, you'd also know that it's a finite queue. So what happens when the queue overflows? They throw packets in the garbage; they discard them. The internet, I like to say, is an equal opportunity destroyer of packets. That's what it does. That's the way it hints; it gives a subtle hint that there's some congestion. And that's the hint, after queues build and routers run out of their finite memory, that makes the person at the far end say, hey, I didn't yet receive packet four, would you send it again? Well, that sounds like a nice request. But what happens? You can't simply retransmit the packet, because when you really have congestion, you start to see what congestion is really like.

First, the packet makes it halfway across the country and gets discarded. Then I send it again, and it makes it a quarter of the way across the country and gets discarded. Then it makes it three quarters of the way and gets discarded. And finally, it makes it. What you start to realize is that under congestion, a packet is using more of the network than if it just got on and got off. So as the system gets more congested, for the same amount of traffic, the same number of packets we're trying to convey, we start repeating packets and using more resources. You start to smell the similarity to what happened on the highway.

So the answer is... well, the answer was hard. It took this very clever guy called Van Jacobson, who realized that it was a hard problem and came up with a way of solving it. I think of him as the savior of the internet; that happened way back in time, when the internet almost melted down. And the critical thing is he shrank the send window on each connection that was encountering loss by about 50%: he cut it in half. He tried to reduce the rate you're sending by about 50%. The interesting thing to notice is this achieves exponential backoff on the aggregate flow, the total flow that's being sent. He doesn't just say, well, send the packet again after 100 milliseconds, then 200, then 400 milliseconds, then 800 milliseconds, because I'm doing exponential backoff; that wouldn't have helped a whit. So the interesting thing is he developed a system which reduced the aggregate flow very, very nicely. And that saved the internet.
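To make the idea concrete, here is a minimal sketch of that shrink-the-window behavior in Python. This is not Van Jacobson's actual algorithm or a real TCP stack, just the shape of the idea: grow the in-flight allowance slowly while transfers succeed, and cut it roughly in half whenever a loss hints at congestion, which reduces the aggregate flow across all senders.

```python
# Sketch of additive-increase / multiplicative-decrease on a send window.
# Halving the per-connection window on loss is what shrinks the *aggregate*
# flow across every sender sharing the congested link.

class SendWindow:
    def __init__(self, initial=16, minimum=1, maximum=64):
        self.window = initial      # packets allowed in flight
        self.minimum = minimum
        self.maximum = maximum

    def on_ack(self):
        # Additive increase: probe gently for more bandwidth.
        self.window = min(self.maximum, self.window + 1)

    def on_loss(self):
        # Multiplicative decrease: a lost packet hints at congestion,
        # so cut the in-flight allowance by about 50%.
        self.window = max(self.minimum, self.window // 2)
```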

Now let's take a look at something closer to what a lot of you folks do: building web applications. Think about a web application getting overloaded. Assume the application can handle 100 transactions a second. What happens when it gets 110 transactions per second? You went to computer school: what do we do when things come in faster than they go out? We have a queue; everyone knows that. So we start building a queue. Unfortunately, at 110 transactions per second, what happens? The queue rises and rises, and as the queue rises, what does the user perceive? Delay, that's right. They see the little spinner going. So queues build up, and everyone gets served... not quite. What do users do? What do you do when you don't get an answer for 10 seconds? You hit the refresh button. What a surprise.

So you hit the refresh button, and now what have we done? We had congestion at the server, and users hit the refresh button. The user says, you know, maybe it's my system, maybe it's my ISP, I don't know what it is, I'll just hit retry. Remember, we really only had 110 transactions per second of real demand, but now the effective demand jumps to 220. A second ago we were just 10% over; now we're more than 100% over what we were expecting our server to take. The spiral of death has begun: an overloaded web application.

Assume the web server can handle 100, and now the effective demand is 220. Guess what happens? The queue delay grows faster and faster and faster. It's feeding on itself; it's a feeding frenzy of congestion collapse. Three or more reloads later, we're at 440. This has gotten gigantic. And the bad news is, typically a server will go: OK, give me the next one. They wanted a detail page on the product; I'll assemble it here. Here you go. Oh, darn, they hung up the phone. OK, what do you want? You want the home page? OK, that was easy, here you go. Huh, they hung up the phone. How much effective work is being done? Zero effective work. Is the server running at 100%? Darn straight it is. Your system is at 100%, zero effective work. The system feeds on itself, it invites people to overload it even more, and it can't get out of this.

So with FIFO request handling, first in first out, the requesters terminate before their response is even constructed. There's no effective work done. So let's throttle load, you say. I only missed my load estimate by a few percent. As they say on the show, what could possibly go wrong? Anyone seen that show? Yeah, OK. Maybe you have, maybe you haven't; this could be a cultural thing we saw in America. Anyway.

So let's say we throttle. I say to the upstream receiver, throw away half of the requests that arrive. Now, that's pretty darn aggressive. You say, I scaled for 100 transactions; no way I really get 200. But the bad news is, by now we've already pushed it up to 440. You've discarded half the requests, so now we're really receiving 220 to process. The queue is still growing. We're throttling 50% of the load, and the queue is still growing. Most users have tried hitting reload and finally they go away, but throttling 50% still lets 220 through. The queue continues to grow and grow, and the collapse continues. The spiral of death is stateful: the application is stuck. It's a weird thing. Does this start to explain, for some people, what might have been going on when they've seen 100%? Yeah, at least some heads are nodding yes.

Now, recovery made simple. The common trick, and it's actually funny: the IT people go, someone put in an infinite loop, I'm tired of this, I'm going to bounce all the machines (bounce being the technical word for rebooting). And what that does, by accident, is flush all the queues. The good news is, if you flush all the queues (and by the way, a lot of your users have given up on your site by now, so your actual demand is probably a mere 50, mind you multiplied by four so it feels like 200), you take a brand new request, formulate the response, hand it back, and you're actually doing productive work and users are being satisfied. You have to respond while they're still there. Serving the next pending request during congestion collapse is futile: once you're in there, you're stuck.
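A toy simulation, using the hypothetical numbers from this example (100 transactions per second of capacity, demand amplified to 440 by reloads), shows why throttling half the arrivals isn't enough: the queue keeps growing until you shed far more load than intuition suggests.

```python
# Toy simulation of the overload spiral described above, using the talk's
# hypothetical numbers: a server that can do 100 transactions/second and
# reload-amplified effective demand of 440. Even dropping 50% of arrivals
# leaves the backlog growing every second.

def simulate(capacity=100, effective_demand=440, throttle=0.5, seconds=30):
    queue = 0
    for _ in range(seconds):
        admitted = effective_demand * (1 - throttle)   # requests let through
        queue += max(0, admitted - capacity)           # backlog keeps growing
    return queue

print(simulate())              # after 30s the queue is thousands deep
print(simulate(throttle=0.8))  # you must shed far more than you'd guess
```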

So why doesn't user back-off always save us? You'd think that as users go away, and as more and more word gets out that your site is down, the load should eventually start coming down. Well, it turns out a lot of people use a service-oriented architecture with multiple layers, where the user calls service one, which calls service two, which calls service three, which accesses the database. Suppose the database is overloaded. Latency starts increasing, and service three says, you took too darn long, something must have gone wrong, and it, of course, does a retry, and now the database is really overloaded. So both of those fail. And now service three fails to serve service two, and service two says, oh, this must be some freak accident, they must have switched computers on me, these things happen now and then, I know what I've got to do. I've got to do... who here can say the word all together now: retry. At least some of you could do that. Yeah, my stage presence, I'm still working on it. But you can see now, if each level does one retry, one innocent little retry, it doubles the traffic. So we have a doubling between service three and the database, a doubling between service two and service three, a doubling between service one and service two. We didn't even have to ask the user to retry. In this little picture, with one retry per level, the database is getting eight times the effective load it was intended to handle. So you have to understand that this is unbounded. By the way, even going down to one retry is still problematic, though it's a lot better than five. And as you'll see, I know of systems where we've had five retries at each level, and then you get 5, 25, 125, 625 times the load on the underlying database. It's not pretty. You don't need much actual load when you multiply it by 625; that's orders of magnitude of amplification.

Which brings us to Prime Day 2018. We had a distributed hash table. With a hash table, you give it a name and you get a value: the name might be the name of a product, or a number for a product, and the value might be all sorts of goodies about the product. A distributed hash table means we break sections of the dictionary up across different machines, and when a request comes in, a router decides which one of those databases has the answer, goes to it, looks it up, and returns the answer. Unfortunately (or actually, fortunately), on Prime Day we have some interesting sales. This being Cyber Monday, that might be on the minds of some of you. Sometimes they're really good sales, and we sometimes get what's called a hot key, which means a lot of people ask about one specific product. Now a lot of traffic funnels in asking about that specific product at the bottom end of the distributed hash table. We're not getting the nice spread we were hoping to get from randomness. When that happened, we overloaded the partition, and then it timed out, and then the systems above it (I give a personification to all these computer systems) started retrying, and we got a retry storm. There was a bunch of systems here at Amazon.com that were actually retrying about five times, and I think it was more than just three or four levels.
So it was pretty darn impressive, and that's why, unfortunately, we were out after the latency failures drove retries at numerous levels, for about three hours. And the thing that really killed us at the end was a hidden queue, a queue that people didn't even know about, and I bet most of you wouldn't even realize existed if you were doing a code review. This shows how hard it is to do a code review and find hidden queues. It turns out the TCP receive buffer was set at one megabyte down at the bottom-level database, and we were using HTTP keep-alive: we were reusing a single TCP connection to pipeline multiple requests. Each request was about 1,000 bytes, and we had one megabyte of buffer. So each one of those TCP connections queued up, invisible to everyone, about 1,000 requests, from every single one of the servers, to this hot server. We had a ridiculously sized queue at the bottom, by accident. But how many of you would have suspected the receive buffer is a queue? Raise your hand if it's obvious. I see one or two; either that or he's lying, but he's smart too. I can't tell if you're raising your hand... right, yeah, you're raising your hand.

Maybe. Good. So there are a bunch of really good geeks competing with me. Anyway, the interesting thing is we had hidden queues. So now you see this really exploding, and you can read back about 2018: Amazon was out for about three hours. It took us that long to massively throttle, slow things down (remember how low we had to get), and release patches to fix several of these pieces, until eventually we got control of it. No productive work was getting done, and it was Prime Day, so some of the big people upstairs were watching. A bad thing. So changes were made.
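The back-of-the-envelope math behind that hidden TCP receive-buffer queue is worth writing down. The buffer and request sizes are the ones Jim quotes; the caller count below is a made-up illustration.

```python
# With HTTP keep-alive pipelining over a single TCP connection, the
# receiver's TCP buffer itself becomes a queue nobody sees in a code review.

receive_buffer_bytes = 1_000_000   # ~1 MB TCP receive buffer on the hot database host
request_bytes = 1_000              # ~1 KB per pipelined request
callers = 50                       # hypothetical number of upstream servers

hidden_per_connection = receive_buffer_bytes // request_bytes
print(hidden_per_connection)             # ~1,000 requests queued per connection
print(hidden_per_connection * callers)   # ~50,000 requests queued invisibly, fleet-wide
```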

The first thing: diligent congestion-related reviews and testing, most specifically of retries. We brought retries down almost immediately, to never more than one. (How am I doing on time? Not too bad. Good.) Never more than one retry per request. The good news is that getting someone to change the code from five retries to one retry probably doesn't introduce bugs; you're changing one character, a very simple change. The goal, though, is to get asymptotically to no retries. The way you do that is, in a given service, you aggregate (across the service, sometimes across the host) a count of how many successful responses you've gotten from the lower-level service, and you only allow yourself a retry budget on the order of one retry per 100 successes. Until you've seen those successes, you turn off retries at your level. So the magnification factor is about 1.01 instead of 2.
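Here is a minimal sketch of that kind of retry budget, assuming a shared in-process object rather than whatever Amazon actually uses internally: retries are only earned in proportion to recent successes, so the worst-case amplification stays near 1.01 instead of 2 or 6.

```python
# Sketch of a per-host retry budget: a retry is only allowed if recent
# successes have "paid" for it, which caps aggregate retry amplification.

import threading

class RetryBudget:
    def __init__(self, retries_per_success=0.01, initial_tokens=1.0, max_tokens=10.0):
        self._lock = threading.Lock()
        self._tokens = initial_tokens
        self._ratio = retries_per_success
        self._max = max_tokens

    def record_success(self):
        # Each successful downstream response earns a small fraction of a retry.
        with self._lock:
            self._tokens = min(self._max, self._tokens + self._ratio)

    def try_acquire_retry(self):
        # True only if this host has earned the right to retry right now.
        with self._lock:
            if self._tokens >= 1.0:
                self._tokens -= 1.0
                return True
            return False
```

A client would call record_success() after each good downstream response, and before retrying a timeout it would ask try_acquire_retry(), failing fast instead of retrying when the answer is no.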

So we pushed retries down, and then we went into bigger changes: we put restrictions on large and unbounded queues. We found a lot of queues, cleaned them out, brought them down to finite size. But we knew better; we knew we couldn't find them all. So we had requirements for crush testing. Crush testing means we take each of these services up over the limit, over what we scaled them for, and we watch to see if latency grows. If latency grows, it means there's a... what's the word we have to say? You're smart, a lot of you answered bottleneck; no, today's magical word was queue. There's a queue hiding there, and that's what's increasing the latency, and you didn't realize it was going on. So crush testing reveals that. Understand, this whole mission I was on had to do with availability versus efficiency. Everyone likes to say, oh, if you think you'll get more traffic, scale up and be safe, and then the person with the dollars says, don't be wasting money on computers we're not using, scale down. So there's this constant pressure: efficiency requires minimizing the capacity, but random traffic variations happen, and overloads happen.
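As a sketch of that crush-test-and-recover check (Jim returns to the recovery-time point at the end of his half), you push load past the scaling target with your load generator, then drop back to normal and time how long tail latency takes to return near baseline; slow recovery is the smell of a hidden queue. The endpoint, sample counts, and thresholds below are placeholders.

```python
# Sketch of the "recovery" half of a crush test: after the overload phase,
# measure how long p99 latency takes to come back near baseline.

import time
import statistics
import urllib.request

URL = "https://test.example.com/health"   # hypothetical test endpoint

def p99_latency(samples=50):
    times = []
    for _ in range(samples):
        start = time.monotonic()
        urllib.request.urlopen(URL, timeout=10).read()
        times.append(time.monotonic() - start)
    return statistics.quantiles(times, n=100)[98]   # 99th percentile

def recovery_seconds(baseline, limit=300):
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if p99_latency() <= 2 * baseline:           # back within 2x of normal
            return time.monotonic() - start
        time.sleep(5)
    return None   # still degraded after `limit` seconds: go find the queue
```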

So here are the things you've got to do; I'll go into more detail in a second. We need to handle random overloads better and find ways to avoid congestion collapse, things I'll be talking about more. Under overutilization, you need graceful degradation; you have to be willing to fail some requests. In fact, the way to explain it to the marketing people is to say: I have two choices as a programmer, I can serve no one or I can serve some people. What would you like? And they say, well, why don't you just wait and serve everyone with a FIFO queue? And I go, maybe you didn't hear me: if I try to serve everyone, I will serve no one. I have to be willing to fail. I have to fail fast and gracefully, with minimal work.

So let's talk about evading congestion collapse. There are three big prongs to think about, because this is all about living with overloads. It's sort of funny; some people comment that as you build your service, the worst thing that could happen is it actually becomes popular and suddenly you have a lot of traffic. Or a common thing is you get slash-dotted and bam, a factor of 10: oh, you know, I wasn't really ready for that. So: overloads can and do happen; be cautious about retries; and get more efficient during overload. That last one is an even more complicated, interesting thing to think about. But let's talk about each of these three.

First of all, overloads happen. If the caller can't tell you're overloaded, it can't decide whether to retry or not. So be sure you increase the information being passed back in the error messages; you have to indicate an overload explicitly. We saw the phone company do it. Realize that overloads require service degradation: you can't serve everyone all the time; it can't be done. You're overloaded; be willing to fail and throttle requests rather than spiral into a collapse. And the third thing: don't feed the overload. Be very careful about retries, especially for timeouts; retry storms can be devastating. Bound the retries at less than one, but at a minimum get it to one. By the way, we went to the groups and they said, no, we think we need all those retries. I said, OK, show me the numbers, show me the data. And when they examined the data, they concluded that the second retry maybe helps them once a century, and at that point they were willing to reduce five down to one. And if you do retry, share and gossip that fact around to most or all callers from that host. If someone says, well, I do this exponential backoff, 100 milliseconds, 200, 400, 800 milliseconds, that's not exponential backoff on the aggregate; that's misusing the term, and it doesn't help you. Catch them on that, and if they don't believe you, say, please go and watch Jim Roskind's half of the talk. You have to restrain the growth of the aggregate retry budget, not just the retries on one individual request.

The other thing that's really cool, if you want to be super cool: try to get more efficient during overload. Systems naturally tend to get less efficient at load, potentially leading to this death spiral, which we don't like. Who likes it when their systems go down? Good, no one raised their hand. Just to make sure you're awake: who likes it when their systems stay up? Ah, they're not all asleep, good. The optimal pattern is for systems to become more efficient. And here's a quote: put off for tomorrow what you don't have to do today, because tomorrow you may not have to do it. This should be a programmer's mantra, unlike what my mom taught me, something about don't put off until tomorrow... This is a programmer's mantra. How does this work? You have to consider things like adaptive batching. Fundamentally, there are setup costs in requests. Can you batch things together, reuse the setup cost, and reduce the aggregate cost of the call? But of course you have to support partial success; you wouldn't want a single request in a batch to poison the whole result. Otherwise you'll actually have failure amplification, and that would be bad.
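Before the examples, here is a minimal sketch of what adaptive batching with partial success can look like in code, where handle() stands in for whatever your application does per item: the busier the system is, the more items share one drain cycle, and one bad item is reported as a failure without poisoning the batch.

```python
# Sketch of adaptive batching with partial success: drain whatever has
# accumulated, process it as one batch, and report per-item outcomes.

import queue

work_queue = queue.Queue()

def drain_batch(max_batch=100):
    batch = [work_queue.get()]                     # block for at least one item
    while len(batch) < max_batch:
        try:
            batch.append(work_queue.get_nowait())  # grab whatever else is waiting
        except queue.Empty:
            break
    return batch

def process_batch(batch, handle):
    # `handle` is your per-item application logic (assumed, not shown here).
    results = []
    for item in batch:
        try:
            results.append(("ok", handle(item)))
        except Exception as err:
            results.append(("failed", err))        # partial success, not all-or-nothing
    return results
```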

But let's give an example. First an easy one, related to something a lot of people have seen before: word processors. Here's a college student who just wrote a word processor. His basic loop is: read a character from the keyboard buffer, update the entire document with that one character, then refresh the screen completely to be sure it's displayed properly, rinse and repeat. What could possibly go wrong? I don't know if you've seen this in an editor; if you're older, you've seen it in a lot of the old editors. You think, what page do I want? Oh, I think I have to go down about eight pages from here, and you press page down, page down, page down, boom, boom, boom, eight times, and you watch it go: oh, there's the picture, but unfortunately you hit it eight times. So now you're in a lot of trouble, and you start realizing, oh, I'd better not type too fast. And you realize, holy cow, it laid out the entire page and I didn't really want it; it didn't put off until tomorrow what it didn't have to do today.

The correct and cool thing to do (you've seen the terrible response to page-downs) is: read a batch of keystrokes from the keyboard. Update the formatting on the first modified line in the document; don't worry about text that flows off onto the next line. Then go to the display buffer and update the first line on the screen that needs to be updated; don't worry about the other ones, you're probably going to change them in a second anyway. Then rinse and repeat. It turns out I implemented this; I was working with a fellow who later went on to found FrameMaker, or Frame. And this was able to keep up with typists no matter how fast they typed, on an 8088 processor. The cool thing is your eyes wouldn't even see it, as it would fly across the screen with bunches of characters, and once you stopped, the rest of the page displayed, but that wasn't what you were looking at. So that's an example of getting more efficient. You say, OK Jim, you did some cool stuff with a word processor; what can you do for me in the world of applications?

So here's an example: de-duping price updates. A simple implementation: read a product price change, update the database; read a product price change, update the database. That sounds nice. Unfortunately, if you start falling behind because the database can't keep up, the queue starts growing. So how can we be more efficient? The answer is: read numerous price changes, as many as are buffered up, sort them by product, then remove any duplicates. Someone raised the price, raised the price, raised the price again, but you hadn't gotten around to applying it; you remove the duplicates. So you actually call the database less. Your system is actually getting more efficient as it falls behind.
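A sketch of that de-duplication loop, with price_queue and write_price standing in for your real buffer and database call:

```python
# Drain everything that has buffered up, keep only the newest price per
# product, and make one database write per product instead of one per change.

def apply_buffered_price_changes(price_queue, write_price):
    latest = {}
    while not price_queue.empty():
        change = price_queue.get_nowait()          # e.g. {"product_id": ..., "price": ...}
        latest[change["product_id"]] = change      # later changes overwrite earlier ones
    for change in latest.values():
        write_price(change["product_id"], change["price"])
    # Under load, len(latest) is far smaller than the number of buffered
    # changes, so the system gets *more* efficient as the backlog grows.
```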

And the last thing, the thing to take home: be sure you're testing your system. Crush test beyond the maximum load. Testing at expected load is not enough; it doesn't explore the stateful collapse. So what you've got to do is test to failure, to the level you thought you scaled to and beyond. Watch for unbounded latency, because that's a sign of... what's the right word now? Queue. See, he was listening the whole time too. That's a sign of a queue, and that's the latency. Failing fast is good; failing slowly with a lot of latency is bad in these systems. Watch for problematic retry storms. It's sort of interesting: once you have these retry storms funneling down, they overload a system that wasn't really at the bottom, wasn't really at the top, but now that it's overloaded, it starts causing latency delays for the systems around it. So the fire spreads, and often you can't even tell where the collapse began as it spreads across your entire fleet. Watch for problematic retry storms. And then the final thing (oops, I don't know what that button did; didn't do anything bad, good): after the overload, reduce the test call rate back to a normal rate and see how quickly the latency recovers. Less than a minute is good. More than a minute says there's an evil queue that you have to find, and you want to find it today in your testing, rather than your customers finding it on Prime Day. Which brings us to the second half of the talk.

So I think you're supposed to read that. Yeah, why don't you read that. By the way, I'm Jim, if you wondered who the guy up there is. Thank you very much, Jim Roskind. So folks, now that we understand how congestion collapse is all around us and how it can impact real-world applications, let's explore how you can use AWS to tackle this phenomenon. We'll also see a couple of cat pictures on the way, if we have some time. We'll structure this conversation in three buckets. First, we'll talk about detection, or monitoring for symptoms of congestion collapse. Then, how you can avoid overloading your systems in the first place. And finally, what specific testing methodologies you should adopt on AWS when it comes to handling congestion collapse. When it comes to detection, Amazon CloudWatch provides you with metrics like CPU utilization, disk reads, and disk writes out of the box. You should monitor these metrics because high CPU usage and high disk usage are the classic symptoms of congestion collapse.

In addition, you should also create user-defined thresholds, and every time a metric goes beyond its threshold, it would be in an alarm state and would then alert your infrastructure admins about the occurrence. Talking about Amazon.com, the ecommerce side of the house monitors these metrics on a one-minute basis, and they alert their infrastructure teams once a metric has been in alarm status for three consecutive periods, or after three minutes. Now, this becomes an interesting balance: do you invoke your infrastructure admin teams every minute, every three minutes, every five minutes, or every 15 minutes? The exact time will really depend on your business processes.
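As a hedged example of that alarm pattern with boto3 (the alarm name, Auto Scaling group, threshold, and SNS topic ARN below are all placeholders): evaluate CPUUtilization every 60 seconds and alert only after three consecutive breaching periods.

```python
# Create a CloudWatch alarm: one-minute periods, alert after 3 breaches.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="web-fleet-cpu-high",                                           # hypothetical name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "web-fleet"}],      # hypothetical ASG
    Statistic="Average",
    Period=60,                      # one-minute granularity
    EvaluationPeriods=3,            # three consecutive breaches (~3 minutes)
    Threshold=80.0,                 # pick a threshold from your own baseline
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],        # hypothetical SNS topic
)
```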

Talking about networking, Amazon CloudWatch also gives you aggregate metrics like NetworkIn and NetworkOut, which are the total amount of bandwidth going in and out of your EC2 instances. Additionally, you can monitor NetworkPacketsIn and NetworkPacketsOut, which are the total number of packets in and out of your instances.

When you monitor these metrics, you can define a baseline: you can figure out how much bandwidth and how many packets per second you see on a happy day, and then create thresholds. For example, if my NetworkIn and NetworkOut go beyond a certain value, this is when I start alarming on that metric.

Now, in addition to looking at the network packets in, network packets out, and the total bandwidth (which are aggregate metrics), it is also critical to understand whether you're queuing or dropping any packets because of network-related throughput limits in your environment. And for that, you can use high resolution metrics on Amazon CloudWatch. You need to install the Elastic Network Adapter, or ENA, driver before you can start using the high resolution metrics.

Now, the ENA comes pre-installed on various popular AMIs like Amazon Linux 2 and Amazon Linux 2023. But just in case it's not installed or pre-installed on your AMI or the AMI of your choosing, you can always install it later as a package.

Once you install the ENA driver, you get access to metrics like pps_allowance_exceeded (packets per second allowance exceeded). This is the metric that tells you whether you were queuing or dropping any packets on your EC2 instance because that instance was already at its packets-per-second limit. Once you start monitoring metrics like this, and more importantly alarming and alerting based on them, that's when you will realize whether you have network congestion building up in your deployment or not.
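If you want to check those counters directly on an instance, here is a small sketch that reads the ENA driver's allowance-exceeded statistics with ethtool (the interface name is an assumption; yours may be ens5 or similar). A steadily climbing counter means packets are being queued or dropped at the instance's network limits.

```python
# Read the ENA driver's allowance-exceeded counters via ethtool and parse
# them into a dict. Run on the instance itself.

import subprocess

def ena_allowance_counters(interface="eth0"):        # interface name is an assumption
    output = subprocess.run(
        ["ethtool", "-S", interface], capture_output=True, text=True, check=True
    ).stdout
    counters = {}
    for line in output.splitlines():
        if "allowance_exceeded" in line:
            name, value = line.strip().split(":")
            counters[name.strip()] = int(value)
    return counters

print(ena_allowance_counters())   # e.g. {"pps_allowance_exceeded": 0, ...}
```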

Let's focus on load balancers for a second. Here I'm showing an Application Load Balancer that is spraying its load across four backend targets. Now, one of the common symptoms of congestion collapse is that you start seeing an increased number of HTTP 500 error messages from your web servers. As of yesterday, the Application Load Balancer, or ALB, supports a new feature called Automatic Target Weights. By using this, the ALB can track whether any backend targets are sending an increased number of HTTP 500 error messages. Once it detects any such backend instances, it marks them as anomalous.

Now, you can monitor this in CloudWatch as well, and once you track these metrics in CloudWatch, you can take the corresponding action. One corresponding action, for the case on this slide, could be: if four instances are showing an increased number of HTTP 500 errors, if they are anomalous, then you can update your Auto Scaling group policy and say, let's add four more, just so that we can handle the load, at least for the short term, while your teams debug and figure out what happened to those four instances.
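A hedged sketch of that "add capacity while you debug" reaction: read the target group's anomalous host count from CloudWatch and bump the Auto Scaling group's desired capacity. The metric name, dimensions, and resource names below are assumptions; verify them against your own target group's CloudWatch metrics.

```python
# Read an assumed AnomalousHostCount metric and scale out by that many hosts.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")
autoscaling = boto3.client("autoscaling")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="AnomalousHostCount",                                           # assumed metric name
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/web/0123456789abcdef"},  # hypothetical
        {"Name": "LoadBalancer", "Value": "app/web-alb/0123456789abcdef"},     # hypothetical
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(minutes=5),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)
anomalous = max((p["Maximum"] for p in resp["Datapoints"]), default=0)

if anomalous > 0:
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=["web-fleet"])["AutoScalingGroups"][0]           # hypothetical ASG
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="web-fleet",
        DesiredCapacity=int(group["DesiredCapacity"] + anomalous),             # one extra per anomalous host
    )
```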

All right. So in this section, we discussed how you can use various CloudWatch metrics to monitor for symptoms of congestion collapse. I also recommend that you create dashboards on CloudWatch. A dashboard is a collection of metrics, and looking at a dashboard is an easy way to visualize the current status of your deployment and whether any actions need to be taken.

Now, let's talk about how you can avoid overload within your applications in the first place. Let's take the example of a web app that you may deploy in a region inside of a VPC or a Virtual Private Cloud. The VPC may span multiple Availability Zones. Now, at the bottom, I'm showing a fleet of two instances and you may be running your application code on these two instances. Then there is the good old ALB spraying load into the two instances or the backend instances. And then finally, the users hit the ALB to be able to access the application.

When it comes to avoiding overload for a web app, there are three specific best practices here. Firstly, keep malicious traffic out. Then, ask your users to back off early, which will sound a lot like avoiding the Mother's Day problem. And thirdly, you must throttle upstream within your application, which would sound very similar to using metering lights on a highway.

Now, when it comes to keeping malicious traffic out, you should use Amazon CloudFront to serve all of your static and dynamic content. Amazon CloudFront has a vast presence across 500+ points of presence globally and that provides low latency to users. Additionally, this vast footprint also means that any DDoS attacks can be absorbed at the edge layer and hence it would reduce load on your systems. When you use CloudFront, user requests will first hit CloudFront at the edge locations and if needed, the request or the traffic will go back to the backend targets.

Secondly, you should use AWS Shield Advanced to detect and block any DDoS attacks right at the edge. And then finally, you should use AWS Web Application Firewall or WAF that allows you to create rules for your web application and using these rules, you can detect and reject any malicious traffic right at the edge.

Now, for keeping malicious traffic out - when you use a combination of these three services that is Amazon CloudFront, AWS WAF and AWS Shield Advanced, any malicious traffic is again detected and rejected at the edge location. And hence that traffic never even goes downstream into your application and hence it reduces the load on your systems.

Now, let's talk about asking users to back off early. In the previous section, Jim spoke about the Mother's Day problem and how the phone companies solve that problem by playing a pre-recorded message to their users. And that message really made it clear to the users that the issue is on the phone company side and they should not hit the proverbial refresh button five times in a row.

Now, let's see how you can avoid accidental overload in AWS. So AWS Web Application Firewall allows you to send custom response codes to your users. Tactically speaking, a configuration sample that you can use is - create rate limit rules on the WAF. A rate limit rule looks at the aggregate number of requests that are coming in for your web app. Once this rate limit rule is breached, that's when you can reply with an HTTP 429 code or a custom web page back to your users. Since we are talking about Amazon.com, Amazon actually does a great job of showing a custom error page - have any of you seen a custom error page from Amazon - the Dogs of Amazon page?

Yes, so this is the Dogs of Amazon page that we show whenever there is an issue on the Amazon.com site. And this is what informs the user that the issue is on the far end. More importantly, it informs the users that the issue is not with their browsers or with their internet connection. And this right there informs the users that they don't need to hit refresh three times in the next one second to be able to access the application. And this is your mechanism or your AWS mechanism to ensure that you can influence human behavior and hence avoid a situation where real legit users are causing accidental overloads for your web apps.
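A sketch of what such a rate-based rule can look like in the AWS WAFv2 API, expressed as the rule entry you would attach to your web ACL. The limit, rule name, and custom response body key are placeholders, and the "back-off-page" body is assumed to be defined on the web ACL itself.

```python
# WAFv2 rate-based rule: block clients above the rate limit and reply with
# an HTTP 429 plus a custom "please back off" body defined on the web ACL.

rate_limit_rule = {
    "Name": "throttle-heavy-clients",
    "Priority": 1,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 2000,                # requests per 5-minute window per IP; pick from your baseline
            "AggregateKeyType": "IP",
        }
    },
    "Action": {
        "Block": {
            "CustomResponse": {
                "ResponseCode": 429,
                "CustomResponseBodyKey": "back-off-page",   # custom body defined on the web ACL
            }
        }
    },
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "throttle-heavy-clients",
    },
}
```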

Now let's talk about how you can throttle upstream within your application. So since we are talking about within your application, we assume that the requests are already at your application level. This is all legit traffic. Jim spoke about the service oriented architecture. Now, in architectures like those, multiple different components come together to build up an application. And for applications like those, it becomes interesting or extremely important to understand the component to component interaction.

Let's take an example. So out here, I'm showing two components of an application - Service 1 and Service 2. For the next few slides, the term service and application component are interchangeable. So in this example, Service 1 at the top is communicating directly with Service 2 at the bottom. Because of this direct communication, this is an example of a tightly coupled architecture.

Now let's go back to the service oriented architecture conversation. It's quite possible that different software engineering teams own these different components of the application. It's also possible that those teams have different auto scaling policies for their components. So what happens when Service 1 at the top scales out faster than Service 2? Service 1 continues to push its messages or requests down to Service 2. But Service 2 at the bottom is not as strong as Service 1. And hence, as you can see, the queue is building up. Before you know it, the CPU usage of all the instances within Service 2 might hit 100%, the queue keeps growing as well, and zero productive work is getting done.

To avoid this, you should use a decoupled, or loosely coupled, system. So in this case, the component at the top becomes the producer of messages and it puts its messages on a broker system like Amazon SQS, or Simple Queue Service. And then the consumer fetches those messages from the queue. Since the consumer is pulling the messages, your application now behaves in a pull-based model. And since your components are not communicating directly with each other, this right there is your AWS mechanism to ensure that one component of your application doesn't overwhelm or overload another component of your application.
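A minimal boto3 sketch of that producer/consumer split (the queue URL is a placeholder): the producer only ever talks to SQS, and the consumer long-polls and pulls at its own pace, so a fast producer cannot push work directly onto a slow consumer.

```python
# Loosely coupled producer/consumer over Amazon SQS with boto3.

import boto3, json

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/price-updates"  # hypothetical

def produce(message: dict):
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(message))

def consume_forever(handle):
    # `handle` is your per-message processing logic (assumed, not shown here).
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,     # small batch per poll
            WaitTimeSeconds=20,         # long polling: no busy spinning
        )
        for msg in resp.get("Messages", []):
            handle(json.loads(msg["Body"]))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```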

But there is more here. So let's take a break from technology and consider a use case. Yesterday when I was flying in along with many other users, I tried to call my Uber cab and the pricing on those cab hailing services can change at any point in time. Similarly for Amazon.com or in ecommerce like businesses, the prices of items can change at any point in time. So for businesses like those or for use cases like those, you would want to ensure that your applications are investing their precious processing resources only in processing the most up to date messages - for example, in processing the messages that have the most up to date pricing information.

So with that background, let's go back to the slides. The starting point is the consumer is already overwhelmed, maybe it's processing a previous message that's taking a lot of time or it's overwhelmed for a different reason, but the producer does not know about that. So what does the producer do? Producer starts putting more messages on the SQS queue. Now, in this case, the queue is still building up, but it is the SQS queue. SQS by itself is a fully managed service, it scales horizontally and SQS can manage this load. But let's consider this from the consumer's perspective. Eventually the consumer will free up its cycles and start consuming or pulling these messages from the queue. But there is a chance that at that point in time, the consumer is investing its resources in processing messages that are already stale and that might be wasteful expenditure of those resources.

So one mechanism is throttling upstream, which is also very similar to how metering lights on highways work - the highway is already congested and you want to ensure that new traffic is metered. So now translating that into AWS and this architecture, you would first need to understand whether your highway is congested or not, which means whether your consumers are overloaded or not. And for that, this expanding queue depth might be a good indicator of that.

So you can use the SQS ApproximateAgeOfOldestMessage metric, which is indicative of your queue depth. You can track this in Amazon CloudWatch and define a threshold; that threshold defines whether your queue size is healthy or not. If your queue size goes above the threshold, that's when you can alarm on that metric and take the corresponding action. In this case, the corresponding action would be to have a Lambda function on the side that informs the producer that, hey, the consumer at this point in time is getting congested, so please slow down.

So this is going to be your mechanism for ensuring that you can throttle upstream and hence meter your application's traffic, especially at high load. While the producer is throttled, the consumer will eventually start pulling messages from the SQS queue, and eventually your queue size will again become healthy. And once this metric is healthy, that's when the Lambda function can stop throttling the producer, and at that point the producer can resume putting messages on the SQS queue.
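One hedged way to wire that up: let the ApproximateAgeOfOldestMessage alarm's state changes invoke a Lambda function that flips a flag the producer checks before sending. The parameter name and the EventBridge-style event shape below are assumptions; adapt the parsing if you deliver the alarm through SNS instead.

```python
# Lambda handler: turn queue-age alarm state changes into a throttle flag
# that the producer consults before putting more messages on the queue.

import boto3

ssm = boto3.client("ssm")
THROTTLE_PARAM = "/price-pipeline/producer-throttled"    # hypothetical parameter name

def handler(event, context):
    # Assumes an EventBridge "CloudWatch Alarm State Change" event.
    state = event["detail"]["state"]["value"]            # "ALARM" or "OK"
    ssm.put_parameter(
        Name=THROTTLE_PARAM,
        Value="true" if state == "ALARM" else "false",
        Type="String",
        Overwrite=True,
    )

# Producer side: check the flag and slow down while the consumer catches up.
def producer_should_throttle():
    value = ssm.get_parameter(Name=THROTTLE_PARAM)["Parameter"]["Value"]
    return value == "true"
```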

Now, a natural question here: there are only two components on this slide. What if there's a three tier web app, a four tier web app? What if there are multiple components? For situations like those, you can use the same mechanism in a cascading manner and throttle upstream at each level, until you reach the top component of your application. And what sits above the top component of your application? The user! And how do you ask users to back off early? That is what we discussed in the previous section, by using WAF and a custom error page that you can show them.

So in this section, we discussed how you can use various techniques to avoid overloading your systems. Let's talk about testing.

As you heard in the previous section, Amazon.com has adopted the techniques of crush testing, which is testing your application to its failure point and beyond. Now again, if we focus on Amazon.com, it's going to be extremely tough to create a test bed or a test deployment that is exactly the size of Amazon.com and to be able to stress test it or crush test it.

So because of that, you should create a simulation or a small scale model of your application. Now, this is where the promised cat pictures come into the picture. There is enough time. So if your entire deployment is indicated by this big gray cat, you must create a kitten version of your application and then put it to a good workout.

Now, let's go back to the service oriented architecture conversation again. Let's say your application is built of 10 different components. When you create the small scale app, you should test with all the dependencies because this testing methodology would inform you that when you crush test one specific component, does that have any adverse effect on any other component as well or not? And then finally, you should crush test each layer or each component of your application separately or independently. And this is what will inform you whether there are any weak links in the various components that comprise your app.

Now let's talk about traffic. There are two different kinds of traffic that you should use during your testing cycles. Firstly, you should capture some real world traffic that you get from your legitimate users. Using techniques like Amazon VPC Traffic Mirroring would allow you to capture traffic on an Elastic Network Interface and send it to a destination. Once the traffic is at a destination, you can save it into a PCAP file and then replay that during your testing cycles.

Now, if you keep replaying the same traffic over and over again, there is a chance that your systems might start caching information. And if your systems end up caching the flow information, you would truly not be stress testing or crush testing your application. And to alleviate that, it's also recommended that you add some net new traffic or simulated traffic that you can accomplish using open source tools like Locust.
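A minimal Locust file for that simulated-traffic portion (paths, task weights, and the host are placeholders); you would run it with something like `locust -f locustfile.py --host https://test.example.com` and ramp the user count well past what you scaled for.

```python
# locustfile.py: simulated shopper traffic for a crush test.

from locust import HttpUser, task, between

class Shopper(HttpUser):
    wait_time = between(0.5, 2)       # think time between requests

    @task(3)
    def browse_product(self):
        self.client.get("/product/B000TEST01")   # hypothetical hot key

    @task(1)
    def home_page(self):
        self.client.get("/")
```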

This brings us to the second testing methodology that you should consider, which is the principles of chaos engineering. And this means inserting artificial faults into your environment to see how your application behaves while those faults were introduced and what the user experience was when you introduce those faults. Now please insert those faults only during maintenance windows!

So to insert these faults, you can use an AWS service called AWS Fault Injection Simulator or FIS. And FIS allows you to create experiment templates and then run those experiments. These experiments can be something like simulating an instance failure - what happens if an instance goes down and then comes back up? Or simulating a networking failure - and the example of that would be a faulty network access control list entry that might prevent communication across two Availability Zones.

Since we were talking about high CPU being a classic symptom of congestion collapse, using FIS you can also simulate high CPU on your instances. And here is a snapshot of an FIS experiment that is in running stage and this experiment is going to cause high CPU on an EC2 instance. Now, while this experiment was running, the corresponding CloudWatch metric shows that the CPU usage spiked.
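For reference, here is a hedged sketch of creating such a CPU-stress experiment template with boto3. The role ARN, target tag, and document parameters are assumptions, and the exact parameter shapes should be checked against the current FIS documentation.

```python
# FIS experiment template: 5 minutes of CPU stress on one tagged EC2
# instance, delivered through the AWSFIS-Run-CPU-Stress SSM document.

import boto3, json

fis = boto3.client("fis")

fis.create_experiment_template(
    clientToken="crush-test-drill-1",                                      # idempotency token
    description="Crush-test drill: 5 minutes of high CPU on one web instance",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",          # hypothetical role
    stopConditions=[{"source": "none"}],                                   # or stop on a CloudWatch alarm
    targets={
        "web-instance": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"fis-target": "true"},                        # hypothetical tag
            "selectionMode": "COUNT(1)",
        }
    },
    actions={
        "cpu-stress": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1::document/AWSFIS-Run-CPU-Stress",
                "documentParameters": json.dumps({"DurationSeconds": "300"}),
                "duration": "PT5M",
            },
            "targets": {"Instances": "web-instance"},
        }
    },
)
```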

So by inserting such faults into your deployment during maintenance windows, you get two benefits. Firstly, you get to understand what your application's behavior is during the failure and what the user experience was during that time. And secondly, it gives you a peek into whether your operational runbooks are set up in a way that supports your business. For example, while this experiment was running, which was causing 99% CPU on an instance for 5 minutes, did you alarm on that? Did you alert the right infrastructure admins or not?

All right. So let's summarize all the things that we discussed.

Firstly, we discussed the specific CloudWatch metrics that you can use to monitor for symptoms of congestion collapse in your environment.

Secondly, we discussed how congestion collapse is such a stateful phenomenon, and because of that, it's best to avoid it as much as possible. For that, we discussed three specific best practices:

  1. Keep malicious traffic out - we discussed how you can do that by using a combination of AWS edge services like Amazon CloudFront, AWS WAF and Shield Advanced.

  2. We discussed how you can use AWS WAF rate limit rules along with a custom response code to ask users to back off early and hence solve for the proverbial Mother's Day problem.

  3. We discussed how you can use Amazon SQS and a Lambda function to throttle upstream and keep the highway of your application decongested.

Finally, when it comes to testing:

  1. Adopt the principles of chaos engineering, which means insert artificial faults into your environment and then test your application to its failure point and beyond - which is the principles of crush testing.

We hope this session was useful. Jim and I would like to thank you all for your time, and please fill out the session survey in your mobile apps; that's what will inform us whether these sessions are useful and whether we should continue hosting similar sessions. And don't be shy to say that this was the best talk you went to - it's OK, you can say that!
