Centralize your operations

All right. Hello everyone. Welcome to the session. It seems like we can go ahead and get started now. So thanks for taking the time to attend today's cloud operations session, COP 320: Centralize your operations.

My name's Eric Weber. I'm a Senior Specialist Solutions Architect here at AWS. I've been with AWS for a little over 7.5 years now, helping customers adopt some of the centralized operations management that will be the main focus of our session today.

And alongside me, I have...

Hello, everyone. My name is Oren. I'm the Senior Manager for Operations Management products. My whole reason for being, at least at work, is to build products that remove the drudgery from your day-to-day operations. We can always take you down to metrics and say we want to reduce your MTTR, which we do, but at the end of the day, we want to make operations fun. We're not quite there yet, but we're definitely on the journey towards it.

Hello. I'm Badri Govindarajan. I work at MuleSoft, and I'm here to share our journey of how we use the SSM framework to manage our patching.

Awesome. So, just to give you folks a high-level overview of what we're going to be talking about today. You'll see we'll jump around a little bit, so we're not going to be too boxed in by the slides, at least by how they're written today. Yes, we did write the slides, but it's a presentation and we treat it a little more flexibly.

We're going to be talking about automated operations at scale. One of the biggest things that we see with customers is that they start moving into the cloud, they get a toe in, they get a couple of EC2 instances, a couple of Lambda functions, and then they blink and suddenly those numbers jump to 10,000, and the tools that they were using to manage those small environments, sometimes hybrid, sometimes on premises, sometimes in the cloud, don't really scale. So we're going to talk about how we do scale to those kinds of numbers and even higher.

We're going to talk about how we centralize those, because one of the tenets that we have when you manage your operations is that you're going to have lots of accounts, and we all know that managing resources across accounts and regions can get a little confusing. I mean, who here has ever had 15 different tabs open just to find out the status of one EC2 instance? Anyone? Yeah, there we go. For me, 15 is lightweight, by the way.

And then also managing nodes at scale. We all like to talk about serverless, and serverless is incredible, and we use a ton of it; in fact, a lot of our products themselves are built on top of serverless. But we still end up having nodes, and those nodes can be EC2 nodes, they can be hybrid nodes, they can be nodes running basically anywhere. So we're going to talk about how we manage those nodes at scale.

And then we're going to move to Badri, who has done an incredible job of putting together a lot of the building blocks that we're going to be talking about into a cohesive end-to-end solution for centralizing part of their operations. So we're going to be diving into that.

Just a quick question. So we're talking about how to centralize your operations; maybe just a little bit of a bragging contest in some ways. How many people are currently managing, let's say, tens of accounts, just to start? Quick hands up. All right. Actually, one second: who here has only got one AWS account? Just one, OK. We will chat afterwards; I will tell you about Organizations and how to get there. But that's awesome, it's awesome to see that adoption of multi-account.

Sorry, more than 100 accounts? 500 accounts? Yeah. It's always somewhere right in between those two points. So, like Oren was mentioning, logging into every single account trying to find your resource is untenable.

So the customer asks that we generally see fall into three primary categories. The first one is what I touched on when I spoke about the agenda: how do I bridge from what I used to do to what I need to do tomorrow? For example, I sometimes work with systems admins or network operations folks who are transitioning from a primarily on-prem environment to a cloud environment, maybe changing title to DevOps engineer. And they say: how did I used to deploy? Well, I used to SSH or SCP onto three boxes and Bob's your uncle. Suddenly that doesn't really scale when you're talking about thousands of boxes across multiple regions and accounts, et cetera.

So how do I build tools that will work today but also work for the scale that I'm going to have tomorrow? And then the second one is: OK, fine, I built a tool, but how do I actually manage operations at that scale? Because it's not just about the tools; how do I run the tools? How do I know it's safe? What happens when they fail? Not everything's perfect, things happen, right? And we need to fix them.

And then, of course, manual work. We've gone from a world where it is very normal to have only one or two servers, with people literally sitting on those servers all day long, to a world of automation. And as a software dev, to me that's the most exciting thing in the world, because I hate manual work, and so do a lot of people. But the problem is we still have some manual work. So how do you do that safely? How do you remove the human error when you still need humans in the equation?

So just before we go on, we have the numbers slide. I love this slide because to me the numbers are just mind-blowing, and I work at AWS and see these numbers regularly, and I still think they are absolutely incredible.

We have a bunch of services that run at huge scale. Now, I'm not putting these up here to say, oh my gosh, look at how awesome we are, but rather: we have encountered nearly every problem that you have encountered, and we have made nearly every mistake that our customers have made, and we have learned from them. We're not perfect yet, but we're constantly improving.

And what we are doing in our products is trying to bring in the learnings that allowed us to scale to 11 quadrillion metrics, to 20 million concurrent instances managed within Systems Manager, to 9 billion compliance checks across Config, and to 850 billion auditing API requests that come through CloudTrail. What has allowed us to scale to that is a lot of these learnings, and a lot of them are now codified into the products that we're going to be talking about today.

So when we talk about automated operations across all of your environments, one of the biggest problems is: what do I use? Because welcome to AWS, we never have just one way to do things. There is no one right way. I wish I could come up and say, you know what, I gave an AWS re:Invent talk back in 2016, it had the one true way of doing things, go watch the YouTube video, and I'll see you out on the floor. That would be the whole talk, and it would be amazing if we could do that. But the reality is that everyone's infrastructure is different, everyone's applications are different, everyone's services are different. The scales are different, the resiliency requirements are different, the fallback requirements are different.

And so, in order to meet your differences and your requirements, we have a lot of different services that are built to help you build to the requirements that you have, versus the requirements that we have. For configuration, I'm not going to talk through all of them. This is the slide that, if you're really interested, you come back to later and say: OK, Config sounds really interesting, I have not looked at Config for compliance, I really should.

So we have a suite of configuration, observability, security, and compliance data services that all work together to make sure you can see the metrics and the data you need to know what is happening on your systems, and where. On top of that, we have Automation from Systems Manager, and Automation allows you to say, in a codified manner: let me run a bunch of code that will go and probe my systems and remediate my issues automatically.

You can kick these off automatically via triggers that come from that observability layer, the layer underneath, or you can kick them off manually. Now, again, we layer on top; we like to build layers here. This is like Lego, we're building up into a beautiful castle.

Systems Manager is split into a bunch of different segments that allow you to deal with the issues you have at hand. Everything from change management, for example. You have an automation document, but it's a scary one. We've all heard the stories, and everyone thinks they're mythical, but I can tell you for a fact they're not: the person who added the extra zero onto the end of a parameter in a command and took down 100% of a fleet instead of 10%. These stories bounce around, and these things actually happen.

So change management services like Change Manager are there to prevent some of these problems by saying: let me go vet some of the inputs to the automation. Event management and incident management come under a larger umbrella here; this is everything from OpsCenter to Incident Manager.

And then, of course, we have node management, which I mentioned briefly before, which is how we take those tens of thousands of EC2 instances from you and say: what you care about is your services running on those EC2 instances. You don't care about the kernel, you don't care about patching, you just care that it's done. That's the beauty of operations; where we want to get to, the utopia of operations, is "I don't care."

I have a tenet. Amazon loves to set tenets for its teams; every team sets tenets. My number one tenet for my team is "be lazy." It might sound funny, and I'll be honest that some of my new hires look at me like I'm a little crazy when I give them the talk.

But there is an inherent efficiency in being lazy. I want my developers focused on building the stuff that's actually giving value. I don't want them spending time patching instances. I have a report that runs daily (we look at it once a week, though) that tells me how many of my instances are red, i.e., unpatched. We have yellow as well, which means still in a patching cycle, so those are OK. And then there's red. I never want that number above zero, because I never want anyone to have to look at it.

I don't want anyone to worry about it. I want it to be automated, I want it to be taken care of. And that's where node management comes into play. I don't want to worry about SSH behind a VPC, I don't want to worry about any of those bits, I don't want to worry about patching, I don't want to worry about how I run code on the devices. I just want someone else to take care of that for me.

And then on top of that we say: hey, you folks may currently have or be using an ITSM solution, for example Jira, ServiceNow, et cetera.

So we have the Service Management Connector, which helps you take the outputs from the stack below and bring them into the ticketing systems or ITSM solutions that you are already using.

And then, of course, on top of that, we can start segmenting some of those items. We can say: hey, we don't often look at all of our infrastructure. Some people do; if you're a security group, you're often looking at all of it. But if you're a dev group, for example, you're just looking at your application or your component. And so we have things like Application Manager and Fleet Manager to say: I'm going to segment my area into fleets or applications and look at them and manage them like that.

And then on top of that, audit. Audit is very specific to folks who have specific auditing requirements, so this is not a general-purpose service that everyone would use; I'm more than happy to chat about that afterwards. But this is the stack that builds up. The point here is that for every layer of your operational environment, we have services that you can put together, again like Lego, to build the processes and the operational environment that works for you. And it helps you, your devs, your engineers, your support people, your ops people, whatever titles you use in your organization, not worry about the things that don't matter, but worry about the things that do.

So I'm going to take a break and have something to drink, but I'd love to make this a bit real. It all sounded cool, right? But let's actually make it real by running through a real scenario.

So, Eric, I'd love you to make this real for us.

All right, thanks Oren. So, talking about all those different AWS services, again building them together to have a more powerful platform underneath, let's put that into a picture with a real-world example: microservice instability on some EC2 instances.

So, going back to what we touched on in the beginning: if you have a single AWS account and a single region, you spin up some EC2 instances and configure your CloudWatch alarms, monitoring some metrics within the operating system such as CPU, memory, disk space, and application health. It's fairly rudimentary to identify that resource when a problem occurs and that CloudWatch alarm starts firing.

But once we start going to 50 of these accounts, 100 of these accounts, and beyond, we start provisioning our resources across all of these different locations and trying to map them correctly back to our internal CMDB. It can be very problematic once those alarms start firing: trying to identify where the issue is occurring, even just surfacing the issue in the first place based on some of the observability signals that Oren was talking about, and then starting to actually take action and respond to it.

So centralizing all of this data, whether into a delegated admin account, which we'll be talking about, or through that IT Service Management connector, is critical. You need to pull that information back to a location where your engineers or responders can quickly identify where they actually need to go and start troubleshooting.

So let's go ahead and flip back into some of the tools that AWS provides that can help with that process.

Thought you were done with me?

So, as I said, I manage the group that builds operations management products within AWS Systems Manager. A lot of you are probably familiar with AWS Systems Manager, and you're probably also familiar with the fact that it is huge. We have a lot of stuff packed into Systems Manager; it is literally your operations toolbox. But that can be a little daunting, because when you open that toolbox, everything kind of pops out. So our goal is to make that a little easier, and we're going to break down some of the services today that we find fairly useful, especially in this context.

One of those is OpsCenter. Like I said before, our goal is to reduce your mean time to resolution, and the idea here is to take all of your operational issues and see them in one place across all of your accounts within an organization. If you're not using AWS Organizations, you absolutely should be, but we do also support a mode where you run without AWS Organizations if you happen to be set up that way.

The goal here is to say: hey, we've taken a signal from the observability layer, but a signal is not enough. Because when I'm woken up at two o'clock in the morning, when I get paged, if all I have is a ticket that says, I don't know, a CloudWatch alarm for low disk space fired, that doesn't tell me much. You might give me the instance name, but that doesn't tell me much either. Is the instance still alive? What's the disk space looking like now? What's the CPU load looking like? What's the networking looking like? Has anything changed? Have there been any deployments? Has anything happened?

So the goal behind OpsCenter, and its sister service Incident Manager, which we'll see in the demo (Incident Manager is basically OpsCenter for when things are on fire: not "disk space is low" but "disk space has run out and I was woken up at two o'clock in the morning"), is to enrich those signals you're getting from your observability layer and bring them all into one place. You don't need to open 15 tabs just to see what the status of your Auto Scaling group is. I'm going to tell you what your Auto Scaling group looks like right now, and I'm going to show you the CloudTrail logs related to it, in one place. You might want to dive deeper and investigate if needed; that's fine, but it at least gives you a starting point, and it gives it to you enriched in one location.

It has deep integrations: single-click integration with CloudWatch alarms (of course, I do not advocate for click-ops, but it's a way to get started), and a single-line integration through CloudFormation. We integrate with CloudWatch, Security Hub, EventBridge, and a bunch of others. And then, of course, we have bidirectional integration with the Service Management Connector, because we also recognize that you still have your ticketing systems with their own flows, their automation, and the tools built around them, so you probably want both. The connector allows you to say: when an Ops item is created, send it over to Jira and run whatever workflows you've already got there.

One of the cool things about OpsCenter is that it's not just about enrichment, it's also about remediation. OpsCenter provides over 460 built-in runbooks (I think it's growing, so maybe by re:Invent it will be over 500) that we provide to you and that you can use to remediate your issues. Now, these are built on Automation, and Automation is something that you can build and extend yourself. What we often see is customers starting with Automation by putting a toe in: OK, I'll use what Amazon has provided; I'll reboot my EC2 instance from an Ops item, or I'll have an instance in my ASG enter standby.

But the true power of Automation comes when you say: hey, I've recognized a pattern. I have a service that's deployed onto an EC2 instance, or it's a Lambda service, and it's consuming too much memory. This is something we see; for example, we have Lambdas that run in a high-traffic region versus Lambdas that run in a low-traffic region. So you have one region where the memory is set high and you're paying more, and one region where memory is lower and you're using cheaper Lambdas. And then you have a runbook that says: once it ramps up, yes, I'll change it in my infrastructure as code so it's locked in, but in the meanwhile, I'll have an automation document that bumps up my Lambda memory, and I won't need a human to even worry about how to do that. Automation documents allow you to go off and write those automations that are common for you.
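As a rough illustration of what the core of such a runbook step might look like (this is a sketch, not the speakers' actual document; the function name and memory figure are made up), an aws:executeScript step in an automation document can run a small Python handler like this:

```python
import boto3

def bump_lambda_memory(events, context):
    """Raise a Lambda function's memory as a stop-gap until the IaC is updated.

    'events' carries the inputs defined in the automation document, e.g.
    {"FunctionName": "orders-api", "MemorySize": 1024} (hypothetical values).
    """
    lam = boto3.client("lambda")
    response = lam.update_function_configuration(
        FunctionName=events["FunctionName"],
        MemorySize=events["MemorySize"],
    )
    # Whatever is returned becomes an output of the automation step.
    return {"NewMemorySize": response["MemorySize"]}
```

The same handler could just as well live in a standalone script; wrapping it in an automation document is what lets the Ops item or alarm trigger it without a human at the keyboard.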

And the cool thing is that you can then share that across your organization (you'll see a theme here about how you manage your AWS accounts with Organizations). You can share those runbooks across your whole organization; by the way, that 380 on the slide is an old number, it's definitely 460-plus, if not 500. So, for example, what we often see is teams sharing into a central repository of vetted runbooks, where the company says: hey, these runbooks are really great for these scenarios, so use them, because you know they'll work; they've been reviewed and essentially approved.

Now, what's also cool about Automation is that yes, you can write different steps to call any AWS API, but you can also use Python or PowerShell, which is obviously going to be very familiar to a lot of folks in the infrastructure provisioning and operations space. So they can use the standard code they're already used to, they can run it on Automation, and it's serverless. Now, that's a lie, because there are servers, but you don't have to worry about them. From a customer's perspective, you give us the automation, you tell us "run this, run it at scale," and we will just go off and run it. For those of you who are familiar with Run Command: Run Command takes similar documents and runs them on top of your existing infrastructure. Automation takes that to the next level and says: you don't even need to run it on your infrastructure, don't provision infrastructure, we'll just run it for you and get it done however you need it to be done.
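For a sense of what kicking one of these off programmatically looks like, here is a minimal sketch; the tag key and rate-control values are illustrative, not from the talk, though AWS-RestartEC2Instance is one of the AWS-provided runbooks:

```python
import boto3

ssm = boto3.client("ssm")

# Run a runbook against every instance carrying a given tag, letting
# Systems Manager handle the fan-out and throttling rather than any
# infrastructure you provision yourself.
execution = ssm.start_automation_execution(
    DocumentName="AWS-RestartEC2Instance",
    TargetParameterName="InstanceId",
    Targets=[{"Key": "tag:Team", "Values": ["payments"]}],  # hypothetical tag
    MaxConcurrency="10%",   # how many targets to work on at once
    MaxErrors="5%",         # stop if more than this fraction fails
)
print(execution["AutomationExecutionId"])
```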

Actually, one more thing, just a quick plug. A couple of days ago we launched a new feature for Automation. Oren was talking about creating these automation runbooks; before, you worked through YAML or JSON to author them, but now we have a visual designer. So it's more of a low-code model, and it's a lot easier to get started with creating these runbooks. And for all of those runbooks Oren mentioned, you can go into the console, see the exact code that's being run, and clone the runbook. So if there's ever one that's of interest and maybe you just need another step or two, you can take that existing content and reuse it yourself.

I don't know how I forgot about that; I literally spent two hours playing with it last night. It is a lot of fun, because there is something very magical about a visual editor for code that actually works, that just does what it says it's supposed to do.

All right. So again, you've heard a bunch of theory from me. Let's go back to Eric to make this real.

Yep. So thanks, Oren, for talking about OpsCenter and Automation, using some of those AWS tools to bring in the operational piece working on top of the observability layer.

So again, going back to the central theme of centralizing your operations: starting with one of these member accounts, one of these child accounts within your overall AWS organization, you have that same EC2 instance configured with your CloudWatch alarms, but now we're adding the additional piece of putting OpsCenter on top of it.

So now, when that alarm fires, we're going to have an Ops item created in OpsCenter, which we can then flow into our delegated admin account. So again, in that same scenario of having 50 or 100 AWS accounts, all of these Ops items flow centrally so that we can view them in one single location and start taking action using automation.
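Under the hood, those integrations create Ops items through the Systems Manager API, and you can do the same from your own tooling if a signal comes from somewhere without a built-in integration. A minimal sketch; the title, source, severity, and ARN here are made up for illustration:

```python
import boto3

ssm = boto3.client("ssm")

# Create an Ops item the same way the alarm/EventBridge integrations do,
# so it shows up in OpsCenter alongside everything else.
ops_item = ssm.create_ops_item(
    Title="Auto Scaling group failing to launch instances",   # hypothetical
    Description="EC2 instance launch failed for ASG web-asg in us-east-1",
    Source="custom-monitoring",   # where the signal came from
    Severity="2",
    Category="Availability",
    OperationalData={
        "/aws/resources": {   # links the related resource to the Ops item
            "Value": '[{"arn": "arn:aws:autoscaling:us-east-1:111122223333:'
                     'autoScalingGroup:example:autoScalingGroupName/web-asg"}]',
            "Type": "SearchableString",
        }
    },
)
print(ops_item["OpsItemId"])
```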

When these issues occur, the scenario will of course determine the appropriate remediation. We talk to a lot of customers; some of it requires manual intervention, kicking off these automation runbooks and performing the right task at the right time. It might also mean putting some self-healing in place: when alerts fire, just kick off a runbook to do some initial tasks while we're waiting for our engineers to respond.

Push those executions back into the member accounts and resolve the issue. In the case of EC2, it might be as simple as rebooting the box; depending on the application, you might need to go pull some logs. These are things AWS has out of the box, ready to go, within the Automation capability.

Another sample scenario is backups. If you have routine backups being performed for your compute nodes, you're creating these EBS snapshots. What happens when one of those fails? Well, we can monitor for that type of event using EventBridge, flow it through OpsCenter back into that delegated admin account, and then use some automation to resolve the issue.
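As a sketch of the detection half of that: the rule name below is made up, and the event pattern fields are from memory of the EBS snapshot event format, so verify them against an actual event before relying on this. An EventBridge rule that matches failed snapshot creations might look like:

```python
import json
import boto3

events = boto3.client("events")

# Match EBS "createSnapshot" events that report a failed result. You would
# then point this rule at an OpsItem or automation target (via put_targets
# or the console) so the failure lands in OpsCenter.
events.put_rule(
    Name="ebs-snapshot-failures",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.ec2"],
        "detail-type": ["EBS Snapshot Notification"],
        "detail": {"event": ["createSnapshot"], "result": ["failed"]},
    }),
    State="ENABLED",
)
```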

Final example: an S3 bucket. On that previous slide, the automation runbook on the right-hand side was for Config remediation. Config will monitor your configuration space for compliance, and we can use automation to resolve issues. So if someone happens to flip an S3 bucket from private to public, we can use Security Hub to monitor for that security event and then kick off automation to flip the bucket back into a private state, hopefully before anyone is able to access it.
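The core of that kind of remediation, whether it runs from an AWS-provided runbook or one you write yourself, comes down to a couple of S3 API calls. A minimal sketch, assuming a hypothetical bucket name; a real runbook would add checks and logging around it:

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-team-data-bucket"  # hypothetical bucket

# Re-enable the guardrails on the bucket so new public ACLs and public
# policies are blocked and any existing ones are ignored.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Reset the bucket ACL back to private as well.
s3.put_bucket_acl(Bucket=bucket, ACL="private")
```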

And then finally, going back to what we were talking about earlier: if you are using ServiceNow or Jira, you can integrate these tools with those products. So from their respective consoles, you can have the information that OpsCenter automatically retrieves for you pulled into those systems, and you can also take action: you can initiate these automation runbooks from those consoles as well and start resolving some of these issues.

So we've been talking about what the example architectures look like and what some of these services are. Let's actually go ahead and see them in practice. We're going to take a look at a demo coming from my laptop; it will be on screen momentarily.

There we go. So, like I said before, Systems Manager itself has a lot of different features, and we've tried to segment them into logical areas so that you can find the features that are relevant for you when you need them.

We're going to start today in OpsCenter. I don't have much in this account, but there's a problem with it. I'm not quite sure what the problem is, but I can go dive in and see.

What I'm seeing is that I have a problem with Auto Scaling, and I clearly have a problem because I have two different problems with it, and one of them was created about an hour ago. That was fortuitous, considering the fact that we're here.

All right. So we're going to go in, and what you're going to see here (let me maximize this a little bit so you can see in the back) is, again, instead of just having a ticket that says "Auto Scaling EC2 instance launch failed." This is just an automatic ticket; I didn't create it, by the way. It was automatically created for me when I set up OpsCenter, because OpsCenter said: hey, while you're setting up, do you want me to monitor for the standard AWS things that go wrong? And one of those things happens to be Auto Scaling launch failures, because that is a very common failure scenario.

We see people, to use a technical term, futzing with ASGs, futzing with them wrong, and ending up with this. So one of the rules that OpsCenter created was: if I see a CloudWatch alarm or an EventBridge notification indicating that an ASG, or an instance within an ASG, has failed to launch, I'm going to create an Ops item.

Now, I'm not sure how bad that is yet, but I can go in and double-check: OK, let me see what was going on here. I can see when it was created and where it came from; in this case, it came from EventBridge, which is awesome. I can see what resources were related to it, so I can click through here into the resource. Now, instead of having to go... how many people know where ASGs are in the console, by the way? Where would I find them? You put up your hand, I saw you. Where would I find Auto Scaling groups in the console? Pardon? Yes, well done, thank you: it's under EC2. Just checking. So normally I would have to go to the EC2 console, find Auto Scaling groups, maybe type it in up here, and then go in. But then again, that's yet another tab.

Now, I do know that I have a lot of tabs open here already, but that's because of the demo. Everything right now is in this one tab, and I can just go in here and see: OK, this is my Auto Scaling group, and I can see that there's probably a problem here, which is that my desired capacity is one, max is one, but my min here is zero, which is a little worrying. So I probably have some issues, and I can go double-check and say: OK, what's happened with this Auto Scaling group? Has anything gone wrong with it? And I can see immediately that OpsCenter has gone off, searched through CloudTrail, and found everything that has to do with this Auto Scaling group. You can see that someone (I guess Oren? I'm not sure who that is) has fussed with the Auto Scaling group, and I can even click in this way.

That will open another tab, but you can see what I'm saying: OpsCenter will let you drill further in without having to go to CloudTrail and put together a query that says "find me everything to do with this specific ASG." It's already showing me the exact event that happened, it's showing me exactly what's happened, and then this is just your standard CloudTrail view of what was going on. So I can see that something bad has happened here. But what I can also do, if I go back (actually, I can do it from here as well), is run an automation. OpsCenter has said: OK, I know what's going on here, you have a problem with an ASG, so I'm going to filter through the (let me show you how many) 452 different runbooks that are available to you, and I'm going to tag two as associated here, because they are very commonly run in this scenario.

So, for example, in a scenario where you have a problem with an ASG where an instance is not being launched, or isn't being launched correctly, one of the first things you might want to do is put that instance into standby so you can go look at it and see what's going on with it. Maybe the ASG is not configured correctly, maybe what's being deployed onto the instance is wrong, so you want to take it out of the load balancer, log into the box, do your investigation, and then come back out. So one of the first things you might want to do is just enter standby, and I can run that directly from the console here. And again, at two o'clock in the morning I don't want to have to remember the CLI command for putting something into standby. So I can just run it directly; the instance here is already prepopulated for me, and I can submit it directly.

Now, the problem is that in this case I happen to be the operations team (I really should wear multiple hats). So the hat here is that I'm on the operations team, I'm looking at this, and I don't recognize this as a problem that I've encountered a lot of times. So I'm really worried, and I need someone to help me with this problem, because I can't solve it; I've gone through my normal bits and pieces and I need someone else. And because I'm really worried about my ASG, which is critical, I'm going to start an incident. I'm going to click the button here that just says "Start incident," I'm going to say the team that owns this, I'm going to say "ASG is down," and I think it's critical, so I'm going to start an incident off this. And then OpsCenter will transition me directly into, well, it'll tell me what's associated. And apparently I'm on call, and I just got paged for this incident.

So I just got a text message with a page that says: you have been engaged in an incident, "ASG is down," in the account ending in 4891, and I have a link to go deal with it. I have an email as well, which I'm not going to bring up. But I can also acknowledge it directly from my phone so that it stops paging me, because otherwise this would be a very noisy presentation. So it has now woken me up. The ops team, which hopefully was follow-the-sun and wasn't woken up at two o'clock in the morning, has now paged me, and it is two o'clock in the morning for me, and I am now in an incident, trying to figure out what's going on, having just woken up. I don't have a summary yet, because no one has put one in yet, but I do have a runbook that has already run automatically as part of the incident to try to help me mitigate this issue, so I can jump to the runbooks.

And this is a runbook that was written by my organization that tells me to determine customer impact. It tells me exactly what steps I need to be doing, it tells me I need to go update the summary so that other people can see what's going on, and then I can just click buttons to run the next steps automatically. So think about the runbooks you have on a wiki page or in a document somewhere: they're not living, they're not interactive, they can't run commands to pull data natively. You can take those, translate them into automation documents, and then run them natively within your incidents, so that they're pulling information relevant to your incident, and you have this click-through interface. Again, at two o'clock in the morning you don't have to worry about where the document is, what the link to it is, or "wait, what step was I up to? I don't remember." I know exactly what step I'm up to; it tells me exactly where I am, it tells me what state each step is in, and I can run each one and progress through them.

I also have a timeline of the incident that tells me exactly what's going on. It tells me that I was indeed engaged and that there is an issue here. I can go look at diagnoses; in this case we don't have findings. Findings is an AIOps feature that we launched just prior to re:Invent (there's a whole talk about it, so I'm not going to dive in too deeply) that will go look at recent deployments and recent changes in your environment that may be relevant here. In this case, unfortunately, I made the change that caused this issue yesterday, so it's a little too old for that to trigger on. And then you also have the engagements, so who was engaged; again, that would be myself.

And just to add on the engagement side: he demonstrated getting a text, and calls also work. It also has integrations with AWS Chatbot, so you can get pings in Chime, Slack, or Teams. It has integration with PagerDuty as well, so if you're using PagerDuty for your notification system, you can have Incident Manager integrate with that too.

Sorry, thank you. I can then dive a little deeper into the related items. The related item here would be the Ops item that created this, and then of course we have your standard properties of the actual incident, the ARNs, et cetera. Now, one of the cool things here is that I, unfortunately, cannot solve this; I need some other people. So I can engage other people into the incident, which I'm not going to do right now because I don't want to page anyone else. But one of the things that's really important to me is making sure that everyone can see the same platform, can see exactly the same data. Because what we see when we have large-scale incidents is that everyone comes in with their own metric, and we love metrics, metrics are incredible, but the problem is it's really easy to drown in metrics. So one of the cool things you can do here is create a custom dashboard that gets created automatically and can pull directly from CloudWatch.

Obviously, as you would expect, it wouldn't be a demo if it wasn't real. Normally you can just pull metrics directly; I don't know why the metrics aren't being found, I will follow up on that afterwards. But you can put specific metrics in here, and then everyone has a view of exactly the same metrics and everyone is talking the same language. Because otherwise what happens is you say, oh, go look at this dashboard, go look at that metric, and everyone's running around with competing metrics. This way you get one set of metrics, your true set of metrics, to help you with the mitigation of that incident.

OK. So I've gone in, I've looked through my CloudTrail, and I've figured out that the biggest problem here is... oh, there are my metrics, by the way. I promised the metrics were real. Can I get a clap or something? OK, so just very quickly I'll show you what it actually looks like; I apparently did not hit save hard enough. The point here is that we often have lots of metrics, but I only really need the two metrics that are useful here. For example, here I can see there's a problem with my CPU spiking up and down. So when someone else is paged into the incident, they're already going to see a note here that says "check out the CPU metrics." And so when Bob gets engaged, he'll see a note that says: OK, Oren's already taking a look at it, there's some problem with the CPU metrics, let me go take a look at that, and then go from there. Now, don't look at the real numbers; it's a demo, and this is a very low utilization metric. It should be at 100, right? That's the problem. OK, moving right along.

So what I've just shown you here is how we go from a signal: you had your observability signal that came through from the Auto Scaling group failure, it created an Ops item, I did my initial investigation from that Ops item, I concluded it was worse than I thought, I escalated it to an incident, and I ran through the incident. We did not fully resolve it; that's not the point of this talk today. There are plenty of other talks where you can dive into incident management or resolving some of those issues.

But it does give you a really great segue into saying: OK, this was a node problem. So how do I manage my nodes at scale, and how do I prevent these things from happening in the future?

So we're going back to the slides. The magic hand wave... there we go. All right, we're back on the slides, and we're going to start talking about managing nodes at scale.

One of the best ways to manage nodes at scale is to have something on the nodes that manages them for you. One of the beautiful things about AWS and about running things in the cloud is that, while you're running on our infrastructure, we have zero visibility into what's happening on your instances; they are yours. We have even less visibility into instances that are running, for example, in a hybrid configuration, on premises.

So how do you open up just a little hole for us to be able to manage some very small aspects of your infrastructure? We have an agent. Not really rocket science, unfortunately. The SSM Agent allows you to remotely manage any node. It doesn't matter if it's an EC2 instance, it doesn't matter if it's an IoT device.

There are a bunch of really cool demos out on the floor of folks who have, for example, an IoT car that runs around and is all managed through SSM and IoT, and there's another demo of managing a factory with sensors that runs through SSM.

And then you also have anything that's a server or a VM. So essentially we support Linux, we support macOS, we support Raspberry Pi, Windows Server, et cetera. The agent itself is also open source, which I think is super important.

You are trusting us to host your VMs and your code on our infrastructure, but now you're taking it to the next level, where you're actually going to run our code on your instances. We want you to be able to go look at that code and validate exactly what is running on there. So that's why it's open source on GitHub.

We actually accept contributions. So if you have a really cool feature request for the agent, you are more than welcome to go and submit it there, and the same goes for code; we take code.

And the agent itself allows you to do everything from inventory collection, which we'll talk about in just a moment, to patching, which we'll also talk about in a moment, as well as secure tunneling into those instances.

So, for example, this is what has allowed us to basically eliminate SSH usage across Amazon: using the SSM Agent, and using Session Manager and Fleet Manager, to get into your instances, even if they're behind a VPC, in a very secure manner.
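To make that concrete with a rough sketch (the shell commands and tag are illustrative, and interactive Session Manager shells additionally require the session-manager plugin, so this shows only the non-interactive Run Command path): once the agent is registered, you can see your managed nodes and run commands on them without any inbound SSH at all.

```python
import boto3

ssm = boto3.client("ssm")

# List nodes that have a running, registered SSM Agent (EC2 or hybrid).
nodes = ssm.describe_instance_information()
for node in nodes["InstanceInformationList"]:
    print(node["InstanceId"], node["PlatformName"], node["PingStatus"])

# Run a shell command on every node with a given tag: no SSH keys, no open
# inbound ports, just the agent polling Systems Manager for work.
command = ssm.send_command(
    Targets=[{"Key": "tag:Environment", "Values": ["staging"]}],  # hypothetical tag
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["uname -r", "df -h /"]},
)
print(command["Command"]["CommandId"])
```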

So, Inventory. Inventory allows you to gather inventory, right? It's pretty self-explanatory; we like service names that are not too confusing where we can manage it. It allows you to gather resource metadata about everything and anything that is running on your boxes.

You can even customize it. We have a bunch of standard collectible data: things like kernel version, OS details, network configuration, installed packages. But you can also customize it to collect anything. You want to collect "is this configuration file present, yes or no"? Inventory will let you do that.

We can use inventory for everything from patching, such as knowing what versions of software are running everywhere, to things like: what licenses do I have, how many of them are deployed, where are they deployed, are they configured correctly? What legacy software am I running that's patched but still legacy? As well as: I want to see what's changing over time.

We all know that there is that one dev team out there that, even though you told everyone not to upgrade their MySQL server, has upgraded it and moved to Postgres. This is how you find them and celebrate them.
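As a rough sketch of that kind of query (the filter key and value are from memory of the inventory schema and the application name is illustrative, so treat this as an assumption to verify), you can ask the inventory API which managed nodes report a given installed application:

```python
import boto3

ssm = boto3.client("ssm")

# Find every managed node whose collected application inventory includes a
# PostgreSQL package, i.e. the team that quietly switched databases.
results = ssm.get_inventory(
    Filters=[{
        "Key": "AWS:Application.Name",
        "Values": ["postgresql"],
        "Type": "Equal",
    }]
)
for entity in results["Entities"]:
    print(entity["Id"])  # the managed node ID
```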

The other thing it feeds into is Patch Manager, and this is one of those areas I was talking about before: you don't want anyone worrying about patching. It's 2023, almost 2024; patching should basically be a solved problem. No one should have to worry about patching.

And the way you can do that using Systems Manager is to automate patching at huge scale, and you can use Patch Manager to make that even easier.

Patch Manager allows you to automate your patches using the rules that you care about: what kernel version do you want to go to, how quickly do you want to get there, how aggressively are you patching, what are your reboot policies, et cetera.

It allows you to do scanning, so you can just do a scan: "I just want to know what's happening, don't touch any of my stuff, I don't trust you yet." That's fine. You can scan, and then you'll see that you do trust us and you don't want to worry about it anymore, and you'll click the button that says just set it up so that it runs.
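For a flavor of what that scan data looks like programmatically (a minimal sketch; the instance ID is a placeholder), once a scan has run you can pull each node's patch state, including how many patches are missing, straight from the API:

```python
import boto3

ssm = boto3.client("ssm")

# After a "Scan" operation, each managed node reports a patch state:
# what's installed, what's missing, and when it was last scanned.
states = ssm.describe_instance_patch_states(
    InstanceIds=["i-0123456789abcdef0"]  # placeholder instance ID
)
for state in states["InstancePatchStates"]:
    print(
        state["InstanceId"],
        "missing:", state["MissingCount"],
        "installed:", state["InstalledCount"],
        "last operation ended:", state["OperationEndTime"],
    )
```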

You also want to be able to handle the zero-day case. There's a zero-day vulnerability; we've all been bitten by them, you know them all: Log4j, curl, et cetera. They're still bouncing around out there.

So this is a tried and true mechanism to reduce the amount of time from a zero-day vulnerability to the patch being installed.

And then again, one of the most important things it gives you is consistency. You don't want every group having their own mechanism for patching. We've been there, right? I've had dev teams with the weirdest scripts to do things like reinstalling OSes and rebooting machines, and they're all different, at all different times, and reporting is in seven different places.

This way you have one consistent place that just works. Now, one of the cool things here is that what I'm talking about right now is just within a single account. But as we all know, we don't all use just one account, and I would love you to tell us how we do this across the whole organization, because that's where I think the real power is.

Yes. So, going into the organization focus again, across all of these different accounts, across all of these different regions: earlier this year, actually back at the beginning of the year, we launched patch policies using the Quick Setup capability.

Patch policies allow all of those pieces that Oren was mentioning to be centrally defined across your organization. Put in place a well-defined scan schedule or a well-defined install operation, have control around the types of updates that you're actually scanning for, marking as missing, or approving for installation, and target the different AWS accounts as needed.

You can also deploy multiple iterations of these patch policies. So, like Oren was mentioning, there may be different application workloads that have either different schedules they want to patch on or different criteria; your Active Directory domain controllers or database servers are not going to have the same patch criteria as, say, your web app servers.

So you can use patch policies to target those different accounts and get started within just a couple of minutes. It's going to take the parameters that you define, leverage CloudFormation StackSets, and push out those resources for you, so you don't have to put in the effort of building all of these different pieces together. And when we're talking about the operations side, we always have to keep debugging and troubleshooting in mind.

So you can also specify an S3 bucket to aggregate that data; should patching go wrong, you can dive right in. And then, since we're working with Systems Manager, there are of course always IAM permissions involved, so this helps put the proper ones in place.

So patch policies help set up the operation itself: put in place a patch scan, put in place a patch install. And then we also have to think about how we actually report on it.

When we're reporting on it, we're going to leverage a couple of different AWS services, and it all starts with the resource data sync within Systems Manager. This is something that you would deploy out into each of your accounts and each of your regions, and what it does is take the patch compliance data and that inventory data that Oren was mentioning.

So things like what applications are installed, application versions, driver details, network details, as well as compliance data, and ship it into a central S3 bucket. In that S3 bucket, all of these objects are broken up by account, by region, all the way down to the individual managed instance IDs, as JSON objects.
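Setting one of these up for an account/region is a single API call. A minimal sketch (the sync name, bucket, and region are placeholders; in practice Quick Setup or a StackSet would stamp this out into every account for you):

```python
import boto3

ssm = boto3.client("ssm")

# Continuously ship this account/region's inventory and patch compliance
# data into a central S3 bucket as JSON objects.
ssm.create_resource_data_sync(
    SyncName="org-inventory-sync",                      # placeholder name
    S3Destination={
        "BucketName": "central-ssm-inventory-bucket",   # placeholder bucket
        "SyncFormat": "JsonSerDe",
        "Region": "us-east-1",
    },
)
```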

Since they're just JSON blobs, we can then use things like AWS Glue crawlers to create a database and tables for us. We don't have to build out the schema ourselves or work out how that JSON is formatted; we can use AWS services to put those pieces in place automatically.

Once we have that database and those tables, we can use Amazon Athena to query it and ship out information as a CSV file if people want to check an Excel spreadsheet. Or, rather than building out that query and looking at a plain table, we can visualize the information using Amazon QuickSight: build out some compliance dashboards, some application or inventory data dashboards, and share them out across your environment as needed.

So what are the underlying pieces of this type of architecture? Again, as I was mentioning, the patch policies are all powered by CloudFormation StackSets. Based on the OU that we target, or the regions that we specify, Quick Setup will tell CloudFormation to deploy those resources into the target accounts: put in place my patch scan every single day, and maybe have our patch install configured for dev on the first Saturday, QA on the second Saturday, prod on the third Saturday. We can have multiple iterations of these patch policies to comply with whatever differing schedules are needed.

And then on the other side, the reporting side (I'll bring up a bigger image in just a moment), it's all focused around that resource data sync shipping information into that S3 bucket. Again, these are JSON blobs, so if you do have third-party tools and analytics that can ingest JSON, you can get started with those right away.

But the AWS services that would power the querying and visualizing are Glue, Athena, and QuickSight. And can we actually flip back to the laptop again? Can we flip back? Oren, I need your hand wave, the magic hand wave.

We're going to do another demo. There we go... nope, it didn't work. Oh, there we go. All right. So this is that same diagram, just bigger. This is actually from one of our public workshops; we'll be sharing a QR code in a little bit for how you can get to it.

But I just wanted to bring this up again. We have our EC2 instances, they're performing a patch scan, they're gathering inventory data. Whenever those actions occur, the resource data sync takes that information right away and places it in the S3 bucket. So it's near-real-time data that's made available in that S3 location.

I've had conversations with customers: "I go and perform a patch, how do I know what was installed?" That resource data sync will ship the information for me, and I can query it right away.

So let's go ahead and actually take a look. This is the S3 bucket. In this case, in this organization, all of my delegated services are in a single account; maybe you would have this in a logging account instead. But this is that same S3 bucket, and I have all of that information available to me: what applications are installed, billing information, compliance data, patching information, networking.

Like I mentioned before, it's then further segmented by the account where those resources reside, followed by the region, and then at the bottom are those individual JSON blobs per instance.

So we can take this information and, again using a Glue crawler, build out a database and table around it. This is a sample query that we have in that same workshop, very basic: just pull in my resources that are non-compliant for patching.

Let me minimize this side. Yep. And if we scroll down to the bottom, I can see my list of instances that are non-compliant. I can see very detailed information if I want, or I can get more summarized, high-level information, but this brings together all of that information across each of these different accounts and regions.

And then finally, just as an example, a patch compliance dashboard using QuickSight. We have a workshop that's publicly available that walks you through all of these steps, and it has sample CloudFormation templates as well, so we'll share the QR code in just a moment.

But to put this in an even more real-world context, we'll flip back into the slides and hand it off to Badri to talk about how MuleSoft accomplished this.

Thanks, Eric. That's a lot of information in 40 or 45 minutes, and what we did is take certain things out of it and put them into practice.

So, I'm from MuleSoft. What we do is provide digital building blocks that IT departments can use to automate everything in their environment and integrate all of their data. At MuleSoft, security is our number one guiding principle, so all the products have built-in security and governance.

I specifically work in a team called Runtime Engine and Services, which hosts an integration platform called CloudHub. With this platform, you can deploy applications in 12 regions with a single click of a button. We handle around 120 billion transactions a month. This is a completely EC2-based offering; we host our Mule runtime on those EC2 instances. Now, our problem statement.

In a given month, we release around 12 custom AMIs to our fleet, and we expect customers to take those updates within a specific period of time. If they do not take those upgrades, we force-upgrade them at the end of the month. As you can see, this is basically a release calendar. We released an AMI on October 3rd with zero patches; it's a new AMI, so everything is up to date.

And then, and this is a real story, it happened in October: on October 7th we had four patches released, and then on October 11th the famous curl patch was released, and our security teams were behind us asking us to patch as quickly as possible. But the challenge at MuleSoft is that we can't reboot customer boxes. They are running their workloads, they are a black box to us, but we still have to patch them. How do we do that? That's where live patching comes in.

So, as I said, at any point in time we run 350K to 400K instances. And the other challenge: 70% of these instances are t3.micros. They have limited memory and limited CPU; we really can't do much on them. So we wanted a solution where we can patch these boxes while they are taking traffic, keep the kernel up to date, and, most importantly, not reboot the box: no service interruption, no latency increase, no downtime. And another thing is the Mule runtime that we ship; we want to apply those patches and make sure we don't break it.

Here is our solution and how we did it. Basically, we wanted a cloud-native solution. As Oren was saying, we did not want different solutions across different products; we wanted one single solution that anybody can adopt. Pretty much anybody who has an AWS account will be able to adopt this solution. And we wanted to leverage AWS kernel live patching.

Then, since scanning is resource-intensive and our EC2 boxes are so constrained, we can't afford to scan those boxes in our production environment. So what we do is scan and discover patches in our lower environment. We categorize those patches, test those patches, and then write them into Parameter Store for the different accounts. So now you have parameters in each account for a particular AMI: AMI one, these are the P1 patches; AMI two, these are the P2 patches.
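
As a rough illustration (not MuleSoft's actual naming), the per-account, per-AMI patch lists could be written to Parameter Store along these lines; the parameter path, AMI ID, and package names below are placeholders.

```python
import boto3

ssm = boto3.client("ssm")

# Hypothetical layout: one parameter per AMI per priority tier, written by
# the lower-environment pipeline after patches are scanned and tested.
ami_id = "ami-0123456789abcdef0"                    # placeholder AMI ID
p1_patches = ["example-kernel-fix", "example-openssl-fix"]  # placeholder names

ssm.put_parameter(
    Name=f"/patching/{ami_id}/p1",
    Type="StringList",
    Value=",".join(p1_patches),
    Overwrite=True,
)
```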

Then we have a runbook under a scheduled maintenance window, which basically goes and reads this Parameter Store, picks up which patch to apply for that AMI, and applies the patches one by one. We don't apply all the patches at the same time. We start our scheduled maintenance window on a Saturday, apply the first patch, and then tag the box saying that it is patched. Then in the next eight-hour window we apply another patch, if another patch is available. This keeps happening throughout the week.
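
A minimal sketch of wiring a Run Command task into an existing maintenance window with boto3; the window ID, window target ID, and the patching document name are placeholders standing in for MuleSoft's own runbook, not the real implementation.

```python
import boto3

ssm = boto3.client("ssm")

# Register a Run Command task on an existing maintenance window.
# The hypothetical document reads Parameter Store and applies the next patch.
ssm.register_task_with_maintenance_window(
    WindowId="mw-0123456789abcdef0",                          # placeholder
    Targets=[{"Key": "WindowTargetIds", "Values": ["<window-target-id>"]}],
    TaskType="RUN_COMMAND",
    TaskArn="Custom-ApplyNextPatchFromParameterStore",        # hypothetical doc
    MaxConcurrency="50",   # instances patched in parallel
    MaxErrors="10",        # stop the task after this many failures
    Priority=1,
)
```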

If there is a zero-day patch required, we always have this other path where we use a Lambda function URL, through which we can say, hey, I want to patch this resource group immediately, and it gets patched.
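
A hedged sketch of what that zero-day path could look like: a Lambda handler behind a function URL that kicks off Run Command against a named resource group. The request shape, group name, and document name are assumptions for illustration only.

```python
import json
import boto3

ssm = boto3.client("ssm")

def handler(event, context):
    # Function URL invocations deliver the HTTP payload in event["body"],
    # e.g. {"resource_group": "cloudhub-prod-workers"} (hypothetical shape).
    body = json.loads(event.get("body") or "{}")
    group = body["resource_group"]

    response = ssm.send_command(
        Targets=[{"Key": "resource-groups:Name", "Values": [group]}],
        DocumentName="Custom-ApplyNextPatchFromParameterStore",  # hypothetical
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"commandId": response["Command"]["CommandId"]}),
    }
```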

How did we arrive here? Last year we were basically a complete Amazon Linux 1 shop; all 400,000 of these instances were running Amazon Linux 1. So first we migrated them to Amazon Linux 2. When we bake our AMI, we subscribe to Amazon Linux kernel live patching. What that gives us is the ability to live-patch any kernel for up to three months. So if a customer doesn't take an update, or wants to avoid an upgrade for various reasons, we can still patch the OS without having to restart their application. Then we created a very robust testing framework for all our patches in the lower environment, so we were able to detect and validate patches there.

The next thing we did was take how we segregate our EC2 boxes and turn that into tags, and then we created resource groups using those tags for each application. Then we used certain components of Systems Manager. The first one is Run Command, which has the logic that uses the agent on the box: it reads the Parameter Store, applies the patch, and sends feedback to an SQS queue saying this box was patched, this box was not patched because we did not have the resources, et cetera.
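
For illustration, creating one of those tag-based resource groups with boto3 might look like the following; the tag key, tag value, and group name are placeholders rather than MuleSoft's real tagging scheme.

```python
import json
import boto3

rg = boto3.client("resource-groups")

# Tag-based query: every EC2 instance tagged app=cloudhub-worker (placeholder)
# becomes a member of the group, so Run Command can target the group by name.
query = {
    "ResourceTypeFilters": ["AWS::EC2::Instance"],
    "TagFilters": [{"Key": "app", "Values": ["cloudhub-worker"]}],
}

rg.create_group(
    Name="cloudhub-worker-patching",
    ResourceQuery={"Type": "TAG_FILTERS_1_0", "Query": json.dumps(query)},
)
```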

Here is a screen from our real production environment, where you can see we patched 34,000 instances last Saturday. How long did that take? Around 4 to 6 hours, and we can go faster. We can go faster because with Run Command there is always an option for how fast you want to go; you can parameterize it. You can say, OK, I want to do 10% of my boxes at a time, and if I see an error rate of 2 percent, just stop. So we can play with those numbers and get it right.
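
Those rate controls map to Run Command's MaxConcurrency and MaxErrors parameters; a small sketch, again with placeholder document and resource group names.

```python
import boto3

ssm = boto3.client("ssm")

# Patch 10% of the group in parallel, and stop the whole run
# once 2% of the invocations have failed.
ssm.send_command(
    Targets=[{"Key": "resource-groups:Name", "Values": ["cloudhub-worker-patching"]}],
    DocumentName="Custom-ApplyNextPatchFromParameterStore",  # hypothetical
    MaxConcurrency="10%",
    MaxErrors="2%",
)
```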

So we initially went slow. We said, let's spread it across 24 hours, because a maintenance window can run for up to 24 hours at most. So we scheduled it for 24 hours. Then we shortened it and shortened it, and now we are at 4 to 6 hours, and we're still tuning it.

Now, how do you scale this across other teams? It's very, very easy now that we have this one framework that teams can plug their own devices into. What we do is ask teams: hey, bring in your own resource group, the command is already there, and then use a scheduled maintenance window to schedule your resource group to be patched during that time. This is what I call bring your own device, get it patched.

Where are we today? Luckily, we released this in October, and in the first week after we released it, our security teams said, hey, there is a curl patch coming up, we need to patch immediately. So we used this to immediately patch all our boxes. Now we patch our EC2 boxes twice a week, and we can increase that even further, but we are currently doing it twice a week. Our security teams are very happy, and we now maintain our patching SLA through this process.

We learned a lot, and it took a lot of trial and error to get this going. First, we use resource groups, and a resource group is an eventually consistent database. So there were times where we had a terminated instance in a resource group, our maintenance window would pick it up, and then hang. That was one problem.

Then we ran into concurrency limits with Automation documents, so we moved to Run Command, which helped us scale much faster. Another big problem is that we scan our patches in the lower environment, so there is always a difference: you might see something that requires a patch in one region, but the patch is not yet available in a different region. We are still working through those issues.

All right. Awesome. Thank you, Badri. So again, you know, patching at such a large scale, 400,000-plus EC2 instances twice a week, that's incredible.

So that is going to wrap up... can you go to the next one? So that's going to wrap up our session today. Like I mentioned before, if you're looking to get started with Systems Manager, there are two different QR codes here. The one on the left-hand side is for Skill Builder, where you can get a 100-ish-level introduction to Systems Manager as a broad scope. And then, once you're ready to actually get some hands-on experience, the right-hand QR code is for the 200-, 300-, and 400-level type of topics, where you can try things out for various different use cases, including all of that patching portion that I showed within the console, going through Athena, going through QuickSight and creating dashboards around it.

And then, can you go to the next one? Of course, if you have more questions, we'll be around after the session today, but we also have a booth over in the expo hall at the CloudOps kiosk. Please visit us there. We have lots of different staff experts there, so if you have any questions around observability or operations, we're happy to answer them. And we also have some prizes, of course.

So thank you again, everyone, for taking the time. I know it was a late afternoon session, but I hope you learned a lot. I know it was a lot of content that we went through.

Obligatory: please fill out the session survey. We're metrics-driven, and we're always looking to improve, so let us know. And again, we'll be around afterwards if there are any other additional questions.

Thank you. Thank you. Thank you.
