What’s new with AWS observability and operations


Before we get started, a quick show of hands: how many of you have ever been on call, or maybe are on call right now? Almost the entire audience. I've also been on call plenty of times. And what I find is the worst part is when you get paged at three o'clock in the morning: you're groggy, you're still half asleep, and you have to figure out what's going on with some software system. That's really, really difficult, and what we want to do is help you in those situations and in the other kinds of situations you face when you're on call.

So this is a What's New session. We've been really busy this year releasing new capabilities, and we've launched a lot at re:Invent and even in Werner's keynote. In this session, we're going to be your guides.

So let's begin the journey. My name is Brian Denny and I'm a General Manager on the AWS Observability team. I've been at Amazon for 16 years, and when I first came to Amazon 16 years ago, I'd already been in the industry for over a decade. What struck me very much when I came into Amazon was the amount of observability telemetry data that was at every developer's fingertips so they could figure out what was going on. This was very different from many other companies I'd been in.

I originally joined the shopping cart team on the retail website, and then moved into what was then the Monitoring team and is now the Observability team. Later I'll be joined on stage by my colleague Greg Eppel, a Tech Leader for Cloud Operations.

So let's briefly talk about our agenda. First, we're gonna go through Mission and Goals. I'll then dive into the Life Cycle of an Incident and some of the things you should be thinking about when you're handling an operational incident. Then I'm gonna hand off to Greg, who's gonna take you on the rest of the journey; he's got a whole bunch of live demos in his portion of the presentation. So let's get started.

I love this quote, and Andy Jassy used to say it all the time: “There is no compression algorithm for experience.” At Amazon, as I mentioned, over those 16 years operational excellence has consistently been one of our top priorities, and we've learned a lot about how to operate some of the largest-scale services, retail websites, and device networks on the planet. What we always want to do is bring those operational learnings to you in the form of products.

So our mission. As I said, operational excellence is in our DNA; you can't run a retail website the size and scale of Amazon without focusing heavily on operational excellence. And in our cloud computing offerings at AWS, it's even more important, because you run your workloads on AWS and we have to be up and running for you. So we've taken the operational tooling that we use internally at Amazon and we offer it as AWS services. Just to set some context: Amazon and AWS use CloudWatch to monitor Amazon. That's thousands of developers all over the country, every single day, viewing graphs, fielding alerts, troubleshooting issues and resolving them as close to real time as possible, all inside Amazon. We bring those capabilities to our customers so they can also benefit from the learnings we've already made. And these AWS services are built for operating on AWS as well as on premises and in other clouds.

So, some of your goals when you're contemplating operational excellence. Generally, you want to easily monitor your applications. Increasingly, the application is the customer experience: whether it's booking tickets online, using your boarding pass on a phone going through the airport, playing games online or even ordering food to your house, the customer experience is almost always the application these days. So being able to centrally monitor your applications is incredibly important.

Additionally, now that we have all these distributed architectures around the world producing a whole ton of data, we want to leverage machine learning capabilities to process that data, quickly comprehend what's going on in our systems, understand whether any anomalies in our application behavior have started recently, and figure out how to resolve them. And of course, we all want to save time and money in how we operate, with as much automation as possible, so we can take humans out of the equation and let machines do the undifferentiated heavy lifting of sifting through all these signals.

Here are a few examples of customers we have helped avoid those hard-day scenarios of getting woken up and paged in the middle of the night: Goldman Sachs and Intuit in financial services, Expedia in travel, and partners such as Cognizant, Tech Mahindra and Wipro. A wide variety of customers and partners have benefited from following the AWS Cloud Operations model to ensure their businesses run smoothly without interruptions. And that breadth is what we find super interesting about the observability space: everybody who's running an application really needs to understand how it's operating so they can serve their customers.

By way of example, we've got a couple of stories here. First, EA Sports. The world of gaming has changed a lot since I started playing games. I'm probably aging myself here, but I think my first game was a single-player game on a little LCD screen. Nowadays when my son plays a game, he's sitting in his room with a headset on, talking to his friends, and they're in the same virtual world playing the same game at the same time. Very different world. To enable that, EA Sports use CloudWatch Internet Monitor to really understand what's going on in their application, especially with respect to the network. When an alarm goes off at three o'clock in the morning, one of the key things to know is: is it my application, or is it the infrastructure I'm using? With Internet Monitor you can pinpoint the source of the issue; you can stand down your engineering team and your on-call responders if you know the issue is out on the internet, and then work with AWS, who are already working the issue, to resolve it.

Another example is JPMorgan Chase. As we see with a lot of customers, they frequently have a wide variety of different operational tools, and that makes troubleshooting an incident reasonably difficult. Sometimes your logging is over here in one product and your metrics are over there in another, and trying to correlate all these different systems, stitching them together to find the answer, was really difficult for JPMorgan Chase. So they leveraged Amazon CloudWatch to monitor and troubleshoot with a real-time, unified view across their applications and infrastructure. That has given them game-changing observability and correlation of data across metrics, logs and traces.

Another quote, from Werner: “Everything fails all the time.” We know that inside Amazon we're operating at really large scale. To give you a sense of that scale, CloudWatch is processing in excess of 9 quadrillion metric observations each month. When I first learned that stat I had to look up what a quadrillion was; it's a thousand trillion. By way of example, if you stacked that many dollar bills on top of each other (not end to end, just on top of each other), they would go to the sun and back six times. So it's not a small amount of data, and at that scale failures do happen. The trick is how you mitigate those failures: by architecting to withstand failure, but also by having a recovery-oriented mindset and building recovery-oriented architectures. Observability is critical to helping you understand when you need to do something and what you need to do.

Many of you might have a network operations center like this. Ours probably doesn't look as sci-fi as this one, but whatever yours looks like, the life cycle of an incident is largely the same.

So, the life cycle of an incident. We can boil it down to three distinct phases: first, you've got to detect the issue; second, you want to investigate it; and ultimately, you want to remediate it. Let's dive into each of these in turn.

On the detection side, it's very important to be able to leverage machine learning to help you detect anomalies: process all the data, all the time, evaluate when something has changed, and spot the needle in the haystack as quickly as possible. You also want to map business objectives to your application behavior. That helps you understand, when you get paged, whether this is an issue for actual application performance or whether the CPU is just slightly high and it doesn't matter. And then, when there is an issue, you want to be able to quickly engage your operations teams.

When investigating, you want to automate the discovery of patterns so you can quickly spot what the pattern is and whether something has changed. You want to assess the health and performance of your application and its underlying components, and ultimately analyze the root cause.

And on the remediation side, it's really important to simplify runbook management and automate remediation wherever possible, so you can resolve issues quickly and then learn from all your operational failures.

So this is just a snapshot of all the launches that Greg is gonna talk about. And without further ado, I'm gonna hand you over to Greg to give you some live demos.

Thank you, Brian. Can everyone hear me? Ok. Thumbs up. Ok, awesome. Thanks, Brian, for kicking this off.

Alright. For the rest of the time, I'm going to be going through pretty much all of our launches. There are a few things you'll want to look out for on the slides. First, you'll see this little watermark at the bottom right; that's the Detect, Investigate, Remediate cycle that Brian just talked about, and it will tell you which phase of the presentation we're in.

I'll talk about the launches in that order: the things that help you detect, then the things that help you investigate, and finally the things that help you remediate. The other thing to watch out for are two watermarks at the top right. You'll either see “New at re:Invent,” which means it was launched as of Sunday (we even launched some things in Werner's keynote), or you'll see “New in 2023,” which means we launched it earlier in the year. This just helps you differentiate the launches, and I'll also call out which ones are in preview and which have gone GA.

So let's begin. Going quickly back to the life cycle of an incident, it's important to remember that you don't spend the same amount of time in all three of these steps. It looks like most of you are operators. Detection is usually pretty quick: you have alarms, sometimes you have too many alarms and a lot of alarm noise, but you usually detect the issue fairly quickly. I talk to hundreds of customers, and they don't have a problem with alarms; they have a problem with too many alarms going off. Very few customers tell me it takes too long to detect. Investigation is where you spend most of your time: what's gone wrong? What's the impact? Has it even impacted my customers? Is this a false positive? You spend a lot of time trying to answer those questions. And then remediation is fairly quick too: either you're pushing something to prod through your CI/CD pipelines to try to patch it, or maybe you roll back. So you'll see that a lot of our launches are really in the detection and investigation phases, and then we have a few in remediation. But I wanted to call out that these are not equal; you spend a lot of your time trying to investigate the issue.

So let's start with the detection launches. I'm not going to list them all off, because we're going to go through every single one. But before I do that, I had to stick this somewhere; it doesn't quite belong in detection, but it helps you do detection more quickly. Earlier this year we simplified the getting-started experience for observability and operations. Within Systems Manager, you can now deploy the agent in a single click, a single action within the console, using something called the Default Host Management Configuration, or DHMC. This also allows you to deploy the CloudWatch agent at scale: you can deploy it across your whole organization through AWS Organizations, or to select OUs if you want. And the CloudWatch agent itself was recently updated to support both X-Ray and OpenTelemetry, so you can use one agent to collect your metrics, logs and traces. Here's an example screenshot where you can say “I just want this whole organization to get the CloudWatch agent and the SSM agent.” If you have 7,000 accounts, again, a single action will deploy across all of them and start collecting the data that you need for detection.

This just launched Sunday night into Monday morning. I'm excited about all of these, they're all my favorites, but this is one I really, really like: CloudWatch Logs Anomaly Detection. I'm going to jump right into a demo here; I actually have a demo for each of these sections, and this is the demo for detection. There are three key features here: Pattern Analysis, Compare Analysis, and Log Anomalies.

So let me switch to my demo laptop and I'll walk you through the experience here. One second. Alright, I think you should be able to see that from back there. I'm going to go into CloudWatch Logs Insights.

Now, throughout the presentation you may see in these demos a reference to a pet site or a pet clinic. There's a workshop behind all of this, and the premise is that we provide services for pets: a pet clinic, adoption, et cetera. So if you see these references, they're all tied to a workshop that I can talk about after the session.

So here we have CloudWatch Logs Insights. If you're not familiar with it, we released it a few years ago; it's our log analytics product. The default query against this log group, which I just ran for the last hour, returned 20,000 log events. Without machine learning, you would have to go find the proverbial needle in the haystack. The first capability of Logs Anomaly Detection makes pattern recognition extremely simple.

The way you do this, and I'll update my query here, is to remove these two lines and use the pattern command, or pattern operator, within Logs Insights. I'm going to apply it to the whole message, though you can also apply it just to certain fields. I'll run this query, we'll give it a moment, and what you'll see is that it has taken those 20,000 log events and distilled them down to about 166 different patterns, showing the most common pattern at the top.

We can see here on the event ratio that 58% of the 20,000 log events have this particular pattern, which equates to about 11,000 events. From here I can go inspect this pattern. What's happening is that we're tokenizing the things that are different and keeping what's similar; that's the pattern.

So I can look at, say, token one; in this case it's a timestamp, and you'll see the different values over on the right if I scroll down. Obviously that's going to change quite a bit, so maybe I take a look at token two, which looks like a URL. If I take that, I can come down here and see that, in the subset of those 20,000 log events that fits this pattern, 90% of the time it's that specific value.
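As a mental model for what the pattern command is doing, here's a toy sketch in Python. This is not CloudWatch's actual algorithm, just the idea: mask the variable tokens (timestamps, numbers) with placeholders and count what remains. The log lines are made-up examples.

```python
import re
from collections import Counter

def normalize(line: str) -> str:
    """Collapse the variable parts of a log line into placeholder tokens."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*", "<TIMESTAMP>", line)
    line = re.sub(r"\b\d+\b", "<NUM>", line)
    return line

def patterns(lines):
    """Return (pattern, count, event_ratio) tuples, most common pattern first."""
    counts = Counter(normalize(l) for l in lines)
    total = len(lines)
    return [(p, c, c / total) for p, c in counts.most_common()]

logs = [
    "2023-11-27 10:00:01 GET /api/pets 200",
    "2023-11-27 10:00:05 GET /api/pets 200",
    "2023-11-27 10:00:09 GET /api/pets 500",
    "2023-11-27 10:00:12 POST /api/adopt 201",
]
# Four raw events collapse into two patterns; the top one covers 75% of events.
print(patterns(logs)[0])
```

At real scale the same move turns 20,000 events into a ranked list of a couple hundred patterns, which is exactly what makes the console view skimmable.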

So this is a really powerful feature that uses machine learning to let you dig through a massive amount of logs super quickly and identify patterns. The second piece, which is complementary to the pattern detection, is this: often when you're on call, you're trying to figure out what's gone wrong. I'm seeing error logs; were these error logs here yesterday? Were they here last week, last month?

So the second piece of this launch is the ability to do a compare. I can go in here and say, give me the previous period; again, I'll compare this hour to the last hour. I'm going to rerun the query (it actually updates the query with the diff command), and once this runs it will rerun the pattern matching and start to tell us whether the patterns it's discovering are new, whether they have just started occurring, or whether they were also occurring in the previous period we're looking at.
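In the same toy spirit as before, the compare step boils down to a set comparison over patterns from two time windows. This is an illustration of the idea, not the real implementation; the tokenizer here only masks numbers, and the log lines are invented.

```python
import re

def pattern_of(line: str) -> str:
    # Minimal tokenizer: mask numbers so lines differing only in values match.
    return re.sub(r"\b\d+\b", "<NUM>", line)

def compare_periods(current, previous):
    """Tag each pattern seen in the current window as 'new' or 'recurring'."""
    prev_patterns = {pattern_of(l) for l in previous}
    tags = {}
    for line in current:
        p = pattern_of(line)
        tags[p] = "recurring" if p in prev_patterns else "new"
    return tags

tags = compare_periods(
    current=["error code 500 from cart", "user 42 logged in"],
    previous=["user 7 logged in", "user 9 logged in"],
)
print(tags)  # the error pattern is new; the login pattern recurs
```

A pattern tagged "new" in the window where your alarm fired is usually the first thing worth reading.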

So very quickly, this lets you take the pattern functionality, distill things down to the most common patterns, and then understand what has changed between two different periods. The third piece of this launch is log anomalies. You turn this on at the log group level, and it's an always-on feature that constantly looks at your logs and tries to detect anomalies in them: a change in the patterns, a change in the amount of logs, and keywords such as errors within those logs.

This is just an example; I've had it turned on and it's found a few hundred potential anomalies. You can then train it. You can say that this anomaly, for example, isn't really an anomaly, and go ahead and suppress it; there's a little thumbs up/thumbs down, so you have some ability to make it smarter and make it work better for your use case.

The idea here is that it's always on, always looking for anomalies in your logs, looking for errors, looking for growth; maybe we've seen 300% growth in error logs. And the nice thing is that it generates a metric, so you can take that metric, tie it to an alarm, and then tie it to some sort of incident response plan if you want to.
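To make that "growth in error logs" idea concrete, here's a deliberately simple check: flag the current window when its error count sits too far from a baseline distribution. CloudWatch's detector is far more sophisticated; the function name, the sample counts, and the three-sigma rule are all just illustrative.

```python
from statistics import mean, stdev

def is_anomalous(baseline_counts, current_count, sigma=3.0):
    """Flag the current window if it sits more than `sigma` standard
    deviations away from the baseline error-count distribution."""
    mu = mean(baseline_counts)
    sd = stdev(baseline_counts) or 1.0  # avoid a zero-width band
    return abs(current_count - mu) > sigma * sd

baseline = [10, 12, 11, 9, 10, 12]   # errors per minute under normal operation
print(is_anomalous(baseline, 11))    # an ordinary minute
print(is_anomalous(baseline, 40))    # roughly 300% growth: anomalous
```

The output of a check like this is exactly the kind of signal you'd emit as a metric and then alarm on.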

So that's a quick overview. At the end of this deck we'll share other sessions you can look at; there's a session that dives very deep into this. With a What's New session we have a lot to get through, so let me switch back to my deck and start talking about some of the other detection features that we launched.

I obviously just did the demo, so we'll skip over that. Next, CloudWatch alarm recommendations. Something customers have told us is: you have a lot of metrics, and we can obviously create a lot of alarms off them, so give me more prescriptive guidance as a customer on what I should be looking at and what's really important. This is something we launched earlier in the year: alarm recommendations. I have some screenshots I can show you; it basically gives you in-line metric information as well as recommendations for alarms.

The way you use this: you go to CloudWatch and toggle alarm recommendations, and it will show you the metrics that have recommendations enabled. Once you've done that, for every metric on the screen you can hover over it and we'll give you a little pop-up like this with more information about that metric: what it's measuring, and what the meaningful statistics are for it. From there you can also look at the alarm recommendations, so we can tell you how you should set an alarm on this metric and how you should think about it in an alarm context.

You can create the alarm directly from this screen, or we'll give you the alarm source code. If you want to deploy it through infrastructure as code, we'll give you both the JSON and YAML for CloudFormation, we'll give you Terraform, and we'll give you a CLI command to create it. So you could take the CloudFormation or Terraform and deploy this across a bunch of accounts and Organizations.
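For a sense of what that generated alarm source boils down to, here's a sketch of the parameters in the shape CloudWatch's PutMetricAlarm API accepts. The alarm name, metric, and threshold values below are made-up examples, not one of the console's actual recommendations.

```python
import json

# Illustrative parameters in the shape accepted by PutMetricAlarm.
# The name, metric, and threshold here are hypothetical examples.
alarm = {
    "AlarmName": "pet-clinic-high-5xx",
    "Namespace": "AWS/ApplicationELB",
    "MetricName": "HTTPCode_Target_5XX_Count",
    "Statistic": "Sum",
    "Period": 60,                   # seconds per datapoint
    "EvaluationPeriods": 5,
    "DatapointsToAlarm": 3,         # alarm on 3 breaching datapoints out of 5
    "Threshold": 10.0,
    "ComparisonOperator": "GreaterThanThreshold",
    "TreatMissingData": "notBreaching",
}

# To actually create it (requires AWS credentials and boto3):
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(json.dumps(alarm, indent=2))
```

The "M out of N" shape (DatapointsToAlarm vs. EvaluationPeriods) is what keeps a single noisy minute from paging you.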

Sticking with CloudWatch, we also have dashboard variables. This was earlier this year as well, and it helps you quickly navigate between different operational views. It really reduces the number of dashboards you have to create: often you want to slice and dice the metrics you're looking at, and this allows you to create one dashboard and make it dynamic.

I'll walk you through a screenshot example of how you set this up. I have a bunch of metrics here coming from a product we have called Container Insights within CloudWatch. As you can see, there are CPU, memory and network transfer metrics, and then we have a Namespace dimension with different values on it: kube-system, node configuration, daemon set and so on. So let's take these metrics and build a CloudWatch dashboard with a variable in it.

This is an example of the dashboard without the dashboard variable. You click that little plus button at the top right to add a variable, and the configuration is very simple. I won't explain all the options you have here, but most commonly you'll be doing a property variable replacement: you select which property you want to replace, in our case Namespace, and then you can either provide your own list of values or we can extract the list from the metrics that are sent to CloudWatch.

In this case, I'm telling it to simply grab the namespace from this particular metric, and it tells me it has four unique dimension values that it will use in the variable drop-down. Once we have that, the thing that's different is that we now have a Namespace selector at the top left; this is the dashboard variable. I can go in here and switch from looking at the kube-system metrics to looking at the default metrics. So it lets me slice and dice the data: one dashboard, viewed through different values, on the CloudWatch dashboard.

Continuing on that theme of dashboarding and detection with dashboards, we also launched here at re:Invent additional plugins for Amazon Managed Grafana. This is a feature that extends your visualization experience with plugins of your choice. We now provide 43 prepackaged plugins in Amazon Managed Grafana, and customers can still discover and install over 300 community-built plugins.

Next is AWS User Notifications, which launched earlier in the year. This is not just for CloudWatch alarms; there are a lot of things you can alert on, and you can capture events from different services. You can use User Notifications to surface key alarms, and you can be selective about which alarms you want to see. These will pop up in the notification part of the console, and this works across accounts, regions and services.

Continuing on that theme, we've had a Console Mobile Application for a while now, and it also now supports push notifications. That same notification you set up in the console can actually be pushed to your phone; think about that experience of waking up at 3 a.m. and getting alerted. And all of this ties back to the Amazon Q integration, which was just launched at re:Invent a few days ago in Adam's keynote.

So Q is integrated into these experiences as well: the management console, which you obviously saw in the keynote, but also AWS Chatbot, which integrates very nicely with CloudWatch and alarming, as well as the mobile application. Whether you're using the management console, the mobile app or Chatbot, when that alarm goes off you can start to use Q to help you.

That's really going more into investigation at this point, which is a good segue into the next section; this is where the bulk of the remaining launches in this presentation sit, so I'll jump right in. This is CloudWatch Application Signals. If you were watching Werner's keynote a few hours ago, this was one of the first things he launched. It's a set of best-practice, prebuilt dashboards for service operators, and I'll have a demo for this. There's automatic integration between Application Signals and Container Insights, and it also helps you monitor your application resiliency.

I'm going to explain where this sits in the stack, because there are a lot of capabilities within CloudWatch. At the bottom layer of CloudWatch is instrumentation; I talked about the agents earlier, so we have the CloudWatch agent and something called the CloudWatch Embedded Metric Format. On top of this we have our foundational pieces: metrics, logs and traces. We obviously have integrations with our partners and with things like OpenTelemetry.

We then have our visualizations, like dashboards and the Metrics Explorer. On top of that we have a series of Insights products, and I'll talk a little bit about Container Insights. A lot of the functionality we launched last year was in the digital experience monitoring space: Synthetic Canaries, Real User Monitoring, and Internet Monitor, which Brian talked about with EA Sports.

Application Signals sits here. It takes a lot of this information and presents it to you in that prescriptive dashboard experience, and it ties into Synthetics and RUM; I'll do a demo of this in just a second. In that demo, I'll explain when I'm moving from Application Signals into Container Insights. The app I'm going to show you in Application Signals is a Java-based app running on EKS, alongside Enhanced Container Insights, an update of Container Insights that we launched at KubeCon in Chicago just a few weeks ago.

So let's go back, and it's demo time again. This time I'm going to start with the console, because something else Werner launched just before Application Signals was myApplications.

If I go to my console home right now, you'll see a widget on the right called Applications, and here's the Pet Clinic I was talking about. This gives me a dashboard not only with observability data, surfacing Canaries, alarms and my SLOs, but also other data: security data from Security Hub, cost and usage data, as well as some compute dashboarding.

If we keep scrolling down, there are even things like patch compliance and configuration compliance for the underlying infrastructure running this application. So this is the myApplications experience that Werner launched, tied to the app I'm going to show you here in a minute.

Let's actually go into CloudWatch and I'll show you where Application Signals sits. There's a dedicated section here called Application Signals, and we've moved some things like RUM and Synthetic Canaries into it. The three top items here are brand new as of just a few hours ago. If we go into Services, and let me minimize the left-hand panel, what you're going to see is a prebuilt dashboard; I'm going to dive into it in a second. But first I want to explain how you set this up, because it's incredibly simple, and I really want to stress this point: to enable this on my EKS Java-based workload, I did not have to make any code changes. I simply had to update my EKS cluster and choose a new observability add-on that we provide, and the rest was set up for me.

So as I go through the rest of this demo, just know that in this use case I didn't have to do anything other than click a few buttons to collect the logs, metrics and traces for my application. Now, if you want to do this for something other than an EKS Java workload today, you can do it on EC2, on premises, or even on other cloud providers using the custom option; you will have to do a bit more work there. But if you're running Java on EKS, it's incredibly simple to get started.

The first thing you notice, now that I've explained how I didn't really have to do anything for this to show up, is that I'm getting some key information: services by their SLO status, the top services by fault rate, and information about each of the services in the application that it's found. I have four different services here, and you can see some of them have SLOs; I haven't created an SLO for every single service. Let's go into the Pet Clinic front end. Now I'm looking at one of those four key services that make up my application, and I get more information here: it has automatically started to discover the service operations. So I have four services, and each of them has anywhere between 10 and 30 different service operations; these can be calls to SQS, or a POST or a GET against another service that I've created here.

I'm going to go back and look at a different service, so bear with me one second. Actually, I'll go back into here. What we can see is that there are some faults occurring in that front end, so I'm going to sort by that and take a look. For this particular service operation we're getting a lot of faults, and a really powerful feature of Application Signals is its ability to do automatic correlation. I can click that little circle for that particular point in time, and it's going to find the exact traces that are linked to the faults we saw in the metrics on the widget with the faults and errors.

From here I can dig deeper into that particular trace and try to diagnose what's going on. If we go to this map, I'll explain what's talking to what; you read this from left to right. We have a Synthetic Canary talking to a service on one of our containers, which talks to another container service, which then makes calls to SQS and a remote API. If I scroll down further, you get a timeline view. You can imagine that in a very distributed architecture you have a lot of things talking to a lot of other things, and this gives you a nice timeline view from t-zero to the end: how things are being called, and whether they run sequentially or in parallel. And what we can see here is that there appears to be some issue with SQS, because there's a 400 error here that seems to be bubbling up and surfacing through those containers.

So let's take a look. I'm going to go up and select the last thing that has red on it, which is this container, and look at the exceptions. What we see here is very interesting: there's an exception when we call SQS. There's an SQS API called PurgeQueue, and my application is only allowed to call it once a minute. I'm actually calling it more often than that; I've made two calls. So this is the source of the error in my application.
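The right fix depends on the application, but one option for an API limited like PurgeQueue (one call per queue per 60 seconds) is a small client-side guard. PurgeGuard and its interval are my own illustration, not part of the AWS SDK; purge_fn would wrap the real SQS call.

```python
import time

class PurgeGuard:
    """Skip calls that would violate a once-per-interval limit, per queue."""

    def __init__(self, min_interval: float = 60.0, clock=time.monotonic):
        self.min_interval = min_interval
        self.clock = clock                        # injectable for testing
        self._last_call: dict[str, float] = {}

    def try_purge(self, queue_url: str, purge_fn) -> bool:
        now = self.clock()
        last = self._last_call.get(queue_url)
        if last is not None and now - last < self.min_interval:
            return False                          # too soon: skip, don't error
        self._last_call[queue_url] = now
        purge_fn(queue_url)                       # e.g. wraps sqs.purge_queue(...)
        return True
```

With an injected fake clock you can verify that a second call inside the window is skipped and a call after 60 seconds goes through.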

So that just shows you the workflow, and how quickly you can dive deep and understand what's going on here.

Now, some of the other capabilities. I'll share a session that happens tomorrow morning; there will be a whole hour basically just on Application Signals, and we'll share that session ID in just a second. There are also Service Level Objective, or SLO, capabilities here as well. I have four SLOs: two of them are unhealthy, two of them are healthy. I'll show you basically how to set these up, and you can drive alarms off of these as you get closer to exceeding your error budget.

You have two options here. You can use a service operation, which is the thing I just showed you in Application Signals, where we're surfacing those services and operations automatically; so you can see our Pet Clinic front end, and again, today we support EKS Java workloads. Or maybe you have a different kind of workload; you can still use the functionality here. As long as you vend a custom metric to CloudWatch, you can then create an SLO off that metric. You then specify, and I'll select a service and operation in this case, so we're going to set up an SLO for the GET API for customers. We can set this on either latency or availability, and let's say we go with four nines, since this is a super critical service. You then specify whether you want a rolling interval, like a rolling 24 hours, or a calendar interval, like a calendar day. You then set your attainment goal, and then optionally you can set CloudWatch alarms. These alarms can fire when you fall below your attainment goal, or as a warning when you're getting close, say 80% of the way to exceeding your error budget. Again, like I said, we'll have a dedicated breakout session tomorrow morning that covers this, so I encourage you to try to make it, but it will go up on YouTube as well.
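The error-budget arithmetic behind those thresholds is simple. This is a sketch of the calculation for a request-based SLO (the function and numbers are illustrative, not an AWS API): an attainment goal of 99.9% over 1,000,000 requests allows 1,000 failures, so 800 failures means 80% of the error budget is consumed, the "warning" point mentioned above.

```python
def error_budget_status(goal_pct, total, failed):
    """Return (attainment %, fraction of error budget consumed)
    for a request-based SLO over some interval."""
    attainment = 100.0 * (total - failed) / total
    budget = total * (100.0 - goal_pct) / 100.0   # failures allowed by the goal
    consumed = failed / budget if budget else float("inf")
    return attainment, consumed

# 99.9% goal, 1,000,000 requests, 800 failures:
# attainment is 99.92%, and 80% of the budget is gone.
att, used = error_budget_status(99.9, 1_000_000, 800)
```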

All right, back to the presentation. This is the session; I'll give you a moment to take a screenshot. It's at 9 a.m. Friday, so tomorrow: CO351. Brian's peer, the GM who helped build this service, will be there with one of their PMs, and they'll be doing a really good deep dive on just that launch.

All right, other things to help with investigation, this one for Prometheus. We've had Amazon Managed Service for Prometheus, and one big pain point we heard from customers is: it's a pain to run agents and install OpenTelemetry collectors on our clusters to send this data to Prometheus; can you help us make this easy? This launch is new at re:Invent, and I'm happy to say it's literally a toggle now. When you spin up your EKS cluster, and you can do this after the fact as well, you select that toggle and we spin up a basically agentless collection solution and collect the Prometheus metrics from your EKS cluster. You don't have to run agents anymore, and we'll send the metrics directly to Amazon Managed Service for Prometheus.

OK, so this one's in preview; it also launched earlier in the week, and I really like this one as well. This is CloudWatch natural language query generation. I'm going to show you a little demo in Logs, and this also works for Metrics Insights. It basically allows you to write in natural language, "I want to find X, I want to look at Y," and we'll use machine learning to generate the query you need to do that. This example is Logs Insights, and like I said, it also works in Metrics Insights. I'm going to look at a particular log group, my application logs from my pet site, and run a query to take a look at the logs. There's a lot of Kubernetes metadata in these logs, things like pod name, pod ID, and host. Say I want to find something like the number of errors that occur by pod name in my logs. I can go to the query assist section and type that, so I'll say "count the number of errors by pod name" and click generate query. Give it a second, and now it's updated the query. I haven't had to learn this new syntax. Maybe you're used to SQL; I have 20 years of experience with SQL, and I've used many of our partner products, and there are always these different syntaxes. So this is really a developer acceleration tool that helps you learn the syntax, and sometimes you don't even have to learn it: you can just type natural language and we'll help you query your logs and metrics.
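For a prompt like "count the number of errors by pod name," the generated query would look something like the following Logs Insights statement. This is a sketch: the exact field name for the pod (here `kubernetes.pod_name`) depends on how your logs are structured, and the error filter assumes the word "error" appears in the message.

```
fields @timestamp, @message
| filter @message like /(?i)error/
| stats count(*) as error_count by kubernetes.pod_name
| sort error_count desc
```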

So I showed you this as a Logs experience; there's also a Metrics Insights experience, so you can use that as well.

CloudWatch Live Tail. This launched a few months ago; it's the ability to live tail your logs. I don't remember the exact number, so I'll just caveat that: Brian talked about the volume of metrics we're ingesting, and we also ingest several exabytes of logs every single month. So you can imagine operating at that scale. You can use Live Tail to look at your log groups and, in real time, look at those logs interactively, and you have fine-grained control over this: just recently, a few weeks ago, we added support for regex. And very quickly, you can navigate from this Live Tail view directly into the Logs Insights experience I just showed you.

All right, continuing on that Logs theme. This launched at re:Invent, Sunday night into Monday morning. There's a lot of new functionality in Logs, as you've seen, but not every use case requires all that advanced functionality, and customers want the ability to make trade-offs: I may want to be more cost-effective on these logs, or I won't need this rich feature set on those other logs. So there's a new log class called Logs Infrequent Access. For us-east-1 custom logs, I've put some pricing here.

You can go through the pricing page, but it effectively cuts the cost of log ingestion in half. In us-east-1, for custom logs in the Standard class, you're paying 50 cents per GB ingested, and you get all the features on the left; there are a lot more features in Standard. But if, say, all you need is managed ingestion and storage, encryption of my logs, and the ability to query them, you can now use Infrequent Access to store those logs much more cost-effectively. So you have the power to make that trade-off for your use cases.
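The trade-off is easy to put in numbers. A quick sketch, using the us-east-1 prices mentioned in the talk as assumptions (Standard at $0.50/GB ingested, Infrequent Access at half that; check the pricing page for current figures):

```python
def monthly_ingestion_cost(gb_per_month, price_per_gb):
    """Ingestion cost only; storage and other charges are separate."""
    return gb_per_month * price_per_gb

# 1 TB of custom logs per month, illustrative prices:
standard = monthly_ingestion_cost(1000, 0.50)  # Standard class
ia = monthly_ingestion_cost(1000, 0.25)        # Infrequent Access class
saved = standard - ia                          # the 50% savings
```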

Again, there was a session earlier in the week that dives deep into everything CloudWatch Logs.

Now, this is one of Brian's favorite launches; his team worked on it, and I love this one too. It's CloudWatch multi-data source querying. We provide seven managed data sources to get you started quickly, and this gives you visibility, within the CloudWatch console, into hybrid and multi-cloud metric data. You get this in a single view.

And for a use case we haven't provided, you can also use Lambda and basically build your own. You can see up here that not only can you do hybrid and multi-cloud, but you can also go into things like OpenSearch, Prometheus, RDS, S3, and even Azure Monitor. For the prebuilt integrations it's very easy: in this case I have Prometheus, so I put my data source name in, select my Prometheus workspace, make a decision about whether this should be in the VPC or not, and that's it. In the case of Lambda, we give you a Lambda template to get started, and then you can write your own logic to ingest this data, or basically start querying this data, within CloudWatch.

This is an example of us doing this with Azure. I've set the Azure data source here; you can see the little arrow pointing towards it. Then, and this is a static screenshot, but you would get an autocomplete experience in each of these text boxes, I can select my subscription, my resource group, and then the metric I want to look at. So this is actually pulling the data out of Azure, or in the other cases, out of those systems. And this is not just a visualization-time experience: you can actually drive alarms off of this data as well. So across these other services, you can start to centralize all your metric data within CloudWatch.

All right. We also have CloudWatch Internet Monitor, which Brian talked about earlier; that launched last year, and we had that customer use case he mentioned. One thing we also launched a few months ago is support for Network Load Balancer. So now, within Internet Monitor, you can get visibility into internet performance for your user traffic directed to specific Network Load Balancers. You basically get a more granular level of observability, and you can also send this data to EventBridge and take action on it.

All right: detection, or sorry, remediation, the last one. We have two things here. I do have a quick demo, and we'll talk about the last one, and then we'll talk about getting started and next steps.

This was just launched at re:Invent: a low-code visual designer for runbooks. If you're not familiar with Systems Manager, it has a runbook capability called Automation runbooks. This designer also integrates directly with Amazon CodeGuru Security, so when you're, for example, writing a PowerShell script or a bash script within your runbook, you can use CodeGuru Security to help detect whether there are any policy violations in your code. You can also export this for local development, and you can use it to update existing runbooks. I have a demo, which we'll go into in a second.

All right, so let me go back to my laptop; bear with me one second. I'm going to go to Systems Manager, go to Automation, and create a new runbook. This is the experience you're going to get. Previously, you would create your runbook basically through YAML or JSON, and you still get access to that; I can flip back to the code view, but by default I have this design view.

So if I need to create a runbook, and I need to do something because the alarm has gone off and I need to page Brian at 3 a.m., I can start to develop that runbook here just by dragging and dropping. For example, and it might be a silly one, maybe we need to sleep for a certain period of time before we proceed to the next step, and then after we sleep, we're going to run a script. What you can see here is that when I click on any of the actions I've placed, I get the inputs and outputs of those actions as well, and I can just fill them in.
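In the code view, that two-step sleep-then-script runbook would look roughly like the following Automation document. This is a sketch using the standard `0.3` schema; the step names, the two-minute duration, and the script body are placeholders:

```yaml
schemaVersion: '0.3'
description: Sleep, then run a script (sketch of the runbook built in the demo)
mainSteps:
  - name: waitBeforeRemediation
    action: aws:sleep
    inputs:
      Duration: PT2M           # ISO 8601 duration: wait two minutes
  - name: runRemediationScript
    action: aws:executeScript
    inputs:
      Runtime: python3.8
      Handler: script_handler
      Script: |
        def script_handler(events, context):
            # placeholder remediation logic
            print("hello world")
```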

In the case of run script, it's already prepopulated with a hello-world for Python, and if I were using PowerShell I could do that as well. I can type in my script or copy and paste it, and I can run a security scan on it to look for policy violations; it's not going to find anything in this simple example. Again, Brian talked about saving time for your operators: this really helps you save time, and you can build runbooks much more quickly and effectively.

So it didn't find any vulnerabilities in there. Again, I can continue to make this as complex as I need. It's not just these actions and scripting integrations; there are also the AWS APIs. If I need to do different things, like reboot an EC2 instance or do something else in my network or my VPC, I can also start to bring in all those APIs. And I've even seen a lot of customers kick off another runbook: they have their incident response plan runbook, and then that branches into other runbooks.

So you can also use this to kick off another runbook. This low-code visual design experience just makes runbook creation much, much easier. If I were to save this, it would show me that there are some errors found, some things I haven't completely filled out in this runbook, and there are also some recommendations. So again, it assists your operators in creating and maintaining these runbooks. And it's not just for new runbooks: you can open all the runbooks you already have, or the ones we've given you, and get this experience from those as well.

So let me go back to my deck. The last launch we have here, and then we'll talk about how you can get started and next steps, is on-call schedules with Incident Manager. Systems Manager has a capability called Incident Manager, and you can tie this back to CloudWatch alarms, the alarms we talked about before, so that an alarm getting triggered drives a response plan. Prior to this launch earlier this year, the capabilities you had were: I always send the incident to Greg, no matter whether he's asleep at 3 a.m.; or there's an escalation path, so if Greg doesn't respond in 15 minutes, it goes to Brian, and then to Brian's manager, et cetera.

So there is now a third option in terms of who gets notified, who gets paged at 3 a.m. The way you set this up, you create a response plan; I'm just showing you an example of an existing one. When you create these response plans, you can optionally connect them to the runbook I just talked about in that visual design experience, and then you select your engagement.

So again, before, we could go directly to a specific contact, or we could have a tiered escalation: if I don't answer, it goes to Brian. Now, with on-call schedules, you can set up more sophisticated routing logic. So maybe Brian and I come to an agreement that he is going to handle being on call Monday through Friday, and I'm going to take the weekends.

What you've seen here is that I've set this up for Brian; he's taken the weekdays, and if I were to scroll down, you would see me taking the weekends. What you can see on the calendar on the right is basically that schedule we've agreed to and set up here. So when that alarm goes off and triggers the response plan, Incident Manager will more intelligently route to the right person.
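The routing decision behind that weekday/weekend rotation boils down to a calendar lookup. A minimal sketch of that logic in plain Python (this is only the decision rule, not the Incident Manager API, and the contact names are stand-ins):

```python
from datetime import datetime

def on_call(now, weekday_contact="Brian", weekend_contact="Greg"):
    """Resolve who gets paged under the simple rotation described
    above: weekdays go to one contact, weekends to the other."""
    # datetime.weekday(): Monday is 0 ... Sunday is 6
    return weekend_contact if now.weekday() >= 5 else weekday_contact

# A Saturday 3 a.m. alarm pages the weekend contact
# (2023-12-02 was a Saturday).
who = on_call(datetime(2023, 12, 2, 3, 0))
```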

All right, wrapping this up, back to what Brian said. These are really our goals as a company around observability and operations: we want to provide you with application-centric operations, we want to provide you with intelligent operations using ML-powered capabilities, and we want to save you time and money in how you operate and provide you with efficient operations.

So, just to call out at a high level the really big launches that tie back to these three goals. In terms of application-centric operations: Application Signals, which Werner just talked about in his keynote. This is a really big thing that helps you look at your application, not just your resources, and understand the health of the application.

In terms of intelligent operations, we have Logs anomaly detection: the set of capabilities that helps you find that needle in the haystack really quickly using machine learning. And in terms of helping you save money and time, on the money side we now have the new log class I talked about, which in some cases will reduce the cost of logs by 50%.

And in terms of saving time, we give you the ability to more quickly and easily author your runbooks for your operational workflows.

So if you want to take a screenshot, go ahead; this is everything we just talked about, all the launches. If you want to learn more, I'll first mention the sessions we have. I already referenced that session tomorrow for Application Signals. These others have all occurred; most of them should be on YouTube, and if they're not, they will be by next week. All of these sessions will cover most of what I've covered, they'll just go deeper; obviously, in a What's New session we have a lot to cover.

If you want to get hands-on, these are three resources my team has personally worked on. The first is the observability best practices guide, so take a look at that. The One Observability Workshop is amazing: years ago we set out to say, hey, what if we created a workshop that could demonstrate every observability capability we have, and that's what that workshop does. Within, I think, five minutes of Werner talking about Application Signals in his keynote, we had published an update to the workshop with that new feature. So if you ever need to check out what's new or what's changed, check out that workshop. And then, for more around the incident management and Systems Manager side, take a look at the centralized operations workshop; it goes deeper into authoring runbooks and setting up incident management.

Last thing, really important: we are a data-driven company, and we really appreciate your surveys. Brian and I also really appreciate you coming here at 4 p.m. on a Thursday. Have fun at re:Play. We're happy to take your questions. Thank you.
