Accelerate insights using Amazon CloudWatch Logs ML-powered analytics

Matt: Hey, uh, Nikil, what are you up to there? We've got a session to get going here.

Nikil: I'm trying to find an orange square and a red square, but I can't seem to find them. It's a mess.

Matt: Well, that looks like a pain. Let me give you an easier way to do that, Nikil.

Nikil: What?

Matt: That is so much better. A red square and an orange square.

Nikil: Thank you, Matt. So we have four or five shelves with color-organized blocks, and within each shelf you have different shapes. That's genius, Matt. So much easier. If only there were a way to do the same thing with logs, wouldn't that be nice?

Matt: Well, I've got some good news for you. We're gonna find out about it later in the session today.

Nikil: Awesome. Thank you, Matt. Let me put this away.

Alright, welcome everyone. Thank you for joining us today. My name is Nikil Kapoor, and with me I have Matt. We are product managers with Amazon CloudWatch, and today we will walk you through some of the new launches we announced yesterday within CloudWatch.

First, we'll go through some of the common challenges we hear from customers related to log analysis and how machine learning and the capabilities we announced yesterday will help you overcome those. We'll also show you some of those capabilities in action via a demo.

Next, we'll walk through some of the common challenges we hear around management of logs at scale. When you're dealing with a large volume of logs, how can you manage them effectively, both from a cost standpoint and an efficiency standpoint? That's where we'll bring in our new Infrequent Access log class and how it can help you build better observability workflows.

Let's start with a quick hand raise. How many folks here are familiar with CloudWatch or have used it in the recent past? Oh, wow, awesome. This makes my job a lot easier. Just in case, a quick refresher for anybody who may not be familiar: Amazon CloudWatch is the AWS-native observability service, which offers you a whole suite of capabilities, from foundational capabilities like logs, metrics, and trace ingestion and management, all the way up the stack to digital experience monitoring, using ServiceLens, Synthetics, Real User Monitoring, and Internet Monitor. And we are constantly expanding these capabilities to make your lives easier, so you can do more innovating and less operating.

Let me bring Matt back on to fulfill the promise he just made of showing me a better way to find those logs.

Matt: Alright, thanks Nikil. So now we're gonna jump into how these new features we've just released this week in CloudWatch use machine learning to hopefully make the process of log analysis much quicker and easier.

So, let's start by reviewing a couple of the most common challenges we hear about with regards to log analysis from our customers.

First of all, and I'm sure many folks here in the audience will empathize with this one, for many of our customers there's simply too much log data. Everyone's applications keep growing and growing. Log data keeps growing more diverse and more complex with different compute frameworks, et cetera. And many times when our customers need to go into their logs to deep dive and try to find that needle in a haystack, there's simply so much log data available that it can be quite difficult.

Another common problem we hear about, and this I feel is the fundamental problem of log analysis going back as long as logs have been around, is identifying what has changed in your logs.

So at the end of the day, what are customers typically doing with logs? They're typically troubleshooting application issues and trying to find the root cause of those issues. The question they're really trying to get at is: my application is having some issue now; it was working fine yesterday, a week ago, whenever; tell me what changed. That's fundamentally what all those queries are trying to get at.

So wouldn't it be great if we had a much easier way to categorize and group logs and answer that question? Another really common problem is proactive detection.

So obviously, it's great if you're able to use logs to dive in once you know your application is having an issue, find the root cause of that application issue, and go fix it. But how do you know when those issues are occurring? How do you proactively detect them? Of course, you can implement metrics, alarming, et cetera for the issues that you are aware of. But per this last column here, there's always the problem of unknown unknowns.

So we're often instrumenting observability and monitoring for yesterday's issues. We know about some issue that arose, we instrument alarming around that issue, and OK, we have observability in place to address that problem. But we never know what changes are going to happen tomorrow, what new pathways are going to be introduced in our code, and how we stay on top of this ever-changing, ephemeral log environment.

So let's take a look at how we're going to use these new features to help address these problems. In other words, how do we go from our big unsorted tin of Legos to something nice and easy to analyze, like this?

So first, I'm gonna talk about the three different features today: pattern analysis, comparison analysis, and CloudWatch Logs anomaly detection.

So let's start with this pattern concept. I've got up here on the slide some example log messages, starting with something really simple. It's effectively the same log event repeating over and over. If you're squinting your eyes trying to read this, don't worry about it; the exact log content doesn't really matter. But essentially we have a log level or severity, so we've got a little INFO, a timestamp, a request ID, and then it says: API request received from some customer ID.

So if we're looking at this and kind of trying to sift through it as humans, we can quickly identify this is effectively the same log message repeating over and over. There's some static or recurring components and then some dynamic fields as well.

So what we can do is translate this and cluster all these log events into a single pattern. Here a pattern looks like this: we have that INFO again, and each of those little asterisks is what we call a token, a dynamic or variable field within your logs, then some more static text, another token.

So essentially what we've done here is replaced, or condensed, this big blob of log text down into a single, much easier to read pattern.

So I'm gonna jump over to my demo here and take a look at what this looks like in CloudWatch. Oops.

Alright. So if you're not familiar with the page I'm looking at, this is the CloudWatch Logs Insights page. This is our query experience. What I'm gonna do here is use the default one hour, so we're querying over the last hour. I'm gonna go into the log group selector and select my application log group; here I've just got a little sample Lambda function, and I'm gonna remove that limit.

So now I'm gonna run my Logs Insights query. And now let's say we're responsible for, we're the on-call engineer or builder for our application. We get a page in the middle of the night and we've got to go troubleshoot the logs for this application. So we run our default query and then we run into this big Lego tin issue: we've got to sift through these logs and try to identify what the root cause of this issue might be.
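(For anyone who wants to follow along outside the console, here's a minimal boto3 sketch of running that same kind of default Logs Insights query; the log group name and time window are placeholders, not the ones from the demo.)

```python
import time
import boto3

logs = boto3.client("logs")

# Hypothetical application log group; substitute your own.
LOG_GROUP = "/aws/lambda/my-sample-function"

# A default-style Logs Insights query over the last hour.
end = int(time.time())
query_id = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=end - 3600,
    endTime=end,
    queryString="fields @timestamp, @message | sort @timestamp desc",
)["queryId"]

# Poll until the query finishes, then print each returned event.
while True:
    resp = logs.get_query_results(queryId=query_id)
    if resp["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in resp["results"]:
    print({f["field"]: f["value"] for f in row})
```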

So traditionally, we might try a bunch of different permutations of the query, query for "error", query for "exception", and just dig around in that bin of Legos until we find what we're looking for.

So how are we helping to address that? Well, what we've launched now is this new Patterns tab. So now whenever you run a Logs Insights query, without having to do any extra processing or clicking around, we'll simply return patterns to you.

So with the default query, if you squint your eyes up here, I've got about 4,000 different log events in my result set, and we're able to distill that down into a much easier to analyze 15 patterns.

So if I dig through or scroll down here into my patterns table, I can see all the different patterns that my logs were grouped or clustered into, the count of each pattern, what percentage of the total result set was composed of that pattern, and the extracted severity or log level.
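(The same clustering can also be requested straight from the query language rather than the Patterns tab; below is a minimal boto3 sketch, assuming the `pattern` command and its result fields are available in your Region, again with a placeholder log group name.)

```python
import time
import boto3

logs = boto3.client("logs")

# Cluster the last hour of events into patterns via the query language.
end = int(time.time())
query_id = logs.start_query(
    logGroupName="/aws/lambda/my-sample-function",  # hypothetical log group
    startTime=end - 3600,
    endTime=end,
    queryString="pattern @message",                 # assumed `pattern` command
)["queryId"]

# Poll for completion, then print each pattern with its occurrence count.
while True:
    resp = logs.get_query_results(queryId=query_id)
    if resp["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in resp["results"]:
    fields = {f["field"]: f["value"] for f in row}
    # @pattern / @sampleCount are the field names we assume the command returns.
    print(fields.get("@sampleCount"), fields.get("@pattern"))
```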

So now I have a much easier time eyeballing through. I can filter for different keywords, or I can just browse through here and look for something like an error message or anything that might be interesting to an on-call engineer.

I can also select a pattern, in which case I'll be presented with the pattern inspect view. This lets me dive in and take a closer look at that particular pattern.

Each of these tokens I can click through and see the most common token values. It looks like this first one was a timestamp and the second one was a request ID, so not the most helpful tokens here. But our customers who have been beta testing this found the token feature really useful for things like when it's returning, perhaps, an HTTP error code, so they can see the most common error codes, or things like a customer ID.

So they can see which customers are experiencing the most issues, et cetera. We have some other useful things we can click around through here. Related patterns shows me other patterns that were occurring at the same time, and log samples shows examples of the logs that this pattern was extracted from.

So this is pattern inspect. This is the automated grouping and categorization of your logs, much like our little Lego tins over here. So that's part of the demo; I'm gonna click back to the slides for just a minute and talk about the next feature.

OK, so we talked about patterns and how we use them to group and cluster our logs together. But how can we use this in new and more interesting ways? Now that we've done almost like a SQL GROUP BY to cluster our logs, another useful thing we've added is comparison analysis. What this lets you do is effectively profile your logs using patterns for one time period, then compare that to the patterns extracted for another time period and analyze the differences.

So if we abuse this Lego block analogy one last time: let's say we have our Legos, we leave the room, we come back later, and now all of a sudden there's a new tin with red Lego bricks. That's really easy to identify, because instead of being in an undifferentiated pile, we're able to quickly say, oh OK, here's a new tin, let's move that on. So, last Lego brick joke; I'll leave that alone from here.

So from here, I'm gonna go back to what this looks like in logs. Again, don't worry about reading the exact log messages here, but now, instead of just one log event repeating over and over, we have three different log events repeating. So what we can do is group those logs together and distill them down to just three patterns in this example. But if we come in tomorrow and the logs look something like this, and we're just inspecting this visually, it's a bit difficult to identify what exactly changed. Using patterns, though, we can translate this big blob of logs into patterns and quickly see: oh, OK, there was one new pattern that was introduced, and it's this error log, database transaction failed. So what I'm trying to convey here is that by using patterns, we can easily do a comparison of patterns over two different time periods and use that to answer this fundamental question of what changed in my logs.

So from here, I'm gonna once again go back to the demo and show you what that looks like in CloudWatch. I'm back here in my Logs Insights query experience once again, and what's changed, what's new, is this comparison button up here. So up in the time picker now, instead of just the traditional query my logs over some time period, what I can do now is query my logs over a time period and compare it to some other time period.

Our customers who have been beta testing this have been using it, for example, to compare my logs while my application is having an issue right now to a known healthy period. So we can click this Compare and say: compare the results of my query to the previous period. For my one-hour query time period, we're always gonna mirror it with the same time interval, so that previous period would be the previous one hour. Or the previous day: hey, my application was working fine yesterday, compare the logs to yesterday's logs. Or a week ago, a month ago.

Or I can go pick any custom time period I want. So I'll click Apply and run the query there. Now when I go look at the results of this query, we're once again on the Patterns tab, but instead of seeing the patterns for my query period, I'm actually seeing the change in patterns. I don't know if you can see this well out there, but for example, up here we have an error message that was identified as a new pattern. This was not occurring yesterday, so that's a pretty strong indicator: hey, there's an error here that wasn't there yesterday, that's a pretty good candidate for the possible root cause of my issue. And I can once again dive into this pattern inspect view, and for example, here I see a histogram where the blue is showing me how often this log event was occurring over the last hour, versus that orange flat line down at the bottom showing this log event was not occurring yesterday. So I'm quickly and visually able to identify that, hey, this error message is brand new, wasn't occurring before, that's my probable root cause. And hopefully I can use this information to go quickly resolve my application issue.

I'm particularly excited about this feature. All our customers who have been involved in beta testing have been really excited about it. Some of the use cases I've heard: one is that compare my logs from my application experiencing an issue to a known healthy period, like I've been talking about. Another customer has been using this for their software deployments, so when they're deploying new software, they compare post-deployment and pre-deployment logs to identify anything suspicious. And a variety of other use cases as well.

So I will finish things off with a couple more slides and dive into the last feature, which is logs anomaly detection. We looked at patterns and we looked at diffs, and building on these, we have launched another feature this week, which is CloudWatch Logs anomaly detection.

To take another real-world example, imagine a fruit processing facility. They have these inspection lines, these automated machines that are scanning all the different bits of fruit coming across, looking for things like sticks, stones, et cetera in the line and tossing those out. What we're trying to do with logs anomaly detection is take that same concept and apply it to our logs.

So what I'm gonna do is come over here and go to the brand new Logs Anomalies page within CloudWatch. Once I load that up, I'll see all the anomalies that have been discovered in my logs. The way this feature works is that you'll go into one of the log groups for your applications and enable CloudWatch Logs anomaly detection. In other words, you'll say: hey, CloudWatch, can you help monitor these logs for me and let me know anytime any unusual changes happen. At that point, we'll train a machine learning model on the expected patterns and the volume of each pattern associated with your application.

So kind of like we were using the pattern diff feature before, we're essentially fully automating that process using a machine learning model under the hood. Once you turn that on, CloudWatch will just monitor your logs on an always-on basis behind the scenes, so you can just leave it and we'll automatically surface anomalies that are detected in your logs.

Things like a brand new error message occurring that wasn't there before, or a sudden spike in the volume of some particular log message in your logs. We'll even look at things like particular values within your logs, those tokens that we looked at; if, for example, there's a spike in HTTP 400s, we'll generate an anomaly for that.

Just a quick tour of the UI: we give the reason for the anomaly, we'll assess a priority (these are all based on that pattern, like I mentioned before), and we'll tell you the detection time, so you can see if this is still occurring or if it's kind of stale, previously occurring in your logs. You can create CloudWatch alarms off of these, so you can tie it into your existing alarming workflow, and you can always pivot from the anomalies down into the underlying logs in Logs Insights.

So I'll go from here and just show you what the setup process for that looks like, because it's pretty quick and easy. If you wanted to go try out this feature, what you would do is come over to your log groups page and pick a particular log group that you're interested in monitoring, so let's dive into these logs here. If you scroll on down, you'll find what's new here, which is the Anomaly detection tab. If I click on that, I can simply click that create anomaly detection button. From here, there are some settings, and the cool thing is it's all optional. You don't have to go do a bunch of complex configuration, feed in the expected structure, or do other difficult and cumbersome configuration tasks; we'll just use machine learning to automatically sort that all out from your logs using that pattern capability we were looking at earlier. From here, CloudWatch will just spend a couple of minutes training our model on the logs from your application, and the feature will be active and we'll automatically start surfacing these anomalies anytime they occur.
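(If you'd rather script that setup instead of clicking through the console, here's a rough boto3 sketch; the detector name, log group ARN, and the exact optional parameter values are illustrative assumptions, not taken from the demo.)

```python
import boto3

logs = boto3.client("logs")

# Hypothetical log group ARN; substitute your own (account ID and name are fake).
LOG_GROUP_ARN = (
    "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/my-sample-function"
)

# Turn on anomaly detection for that log group. Only the ARN list is required;
# the other settings shown are optional and the values here are illustrative.
detector = logs.create_log_anomaly_detector(
    logGroupArnList=[LOG_GROUP_ARN],
    detectorName="my-sample-detector",   # hypothetical name
    evaluationFrequency="FIVE_MIN",      # how often new logs are evaluated
    anomalyVisibilityTime=21,            # days an anomaly stays visible
)

# Later, list whatever anomalies the detector has surfaced.
anomalies = logs.list_anomalies(anomalyDetectorArn=detector["anomalyDetectorArn"])
for anomaly in anomalies.get("anomalies", []):
    print(anomaly.get("priority"), anomaly.get("description"))
```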

Alright, so that's the anomaly detection feature, and now we'll go back and just summarize what we looked at. Three features we're really excited to have launched this week at re:Invent. I hope you're able to go try these out, and I hope they really facilitate your team's log investigations.

One is that patterns feature: all you have to do is go into CloudWatch Logs Insights and run any query, and you'll now see that Patterns tab returning these results, so fully automated clustering of your logs comes with all your queries. Comparison analysis is that compare changes in my logs over time capability, again the one I'm probably most personally excited about, a really cool feature that lets you compare logs from a known healthy period to perhaps when your application is experiencing issues. And then finally, always-on anomaly detection, where you take those two capabilities and combine them into a fully automated anomaly detection feature that will surface and detect things like new error messages showing up in your logs.

Matt: Alright, with that said, I'll bring my esteemed colleague Nikil back up to the stage to talk about some other new stuff.

Nikil: Awesome. Thank you so much, Matt, you were right, that is super exciting. Now I just need to figure out a way to build a machine to organize these Legos automatically, just like the pattern command. I'll get started on that. But anyway, moving on to the next topic: we looked at how the log analysis part of the problem, which comes with the large volume of logs, can be solved with these new capabilities. Now, what about log management? How can we make that simpler, easier, and better?

So let's first dive into what those challenges are that I'm referring to, these key log volume and log management challenges that occur due to the large volume of logs. Of course, when you have a lot of logs, you need to be able to store them, effectively manage them in terms of their life cycle, and perform various end-to-end workflow operations on them: how you can access them, what tools to use to query and search. And this becomes especially challenging when you couple it with the fact that not all logs are created equal.

On one hand, you have your critical application logs, which you need real-time detection capabilities for; you need pattern analysis, comparison, anomaly detection, all that good stuff. But then on the other hand, you have debug logs, stack traces, and dev environment logs, which you may not need those capabilities on, but you wanna be able to capture them, so if there was a problem, you can quickly go search and figure out what happened, right?

So this is what we call infrequently accessed logs: logs that are valuable, but only when you need them, right? What we hear our customers doing to overcome this challenge today is a couple of things. Either they will make the hard decision of limiting the logs they are generating, which hinders application visibility, or they will come up with multiple logging solutions just to overcome this cost concern. And all this leads to inefficiencies in your troubleshooting and investigation workflows and takes away valuable builder and developer time.

And we wanna avoid that. So to solve this problem, we announced yesterday a new log class within CloudWatch called Infrequently Accessed Logs. What this class provides is the scale, reliability, security, and fully managed experience that you already have with CloudWatch.

I don't know if you saw the original slide - 11 quadrillion metrics and seven exabytes of logs ingested per month. On average, that's the scale we operate and build our services at. So you get those benefits. At the same time, you're also able to use the same familiar Logs Insights query experience, including cross-account queries, with this class.

So going back to my example: when you need them, you want to be able to find the logs right away, quickly, without having to jump through hoops. And the best part: it's 50% lower ingestion cost compared to your regular class, which we are calling the Standard log class going forward, for custom log ingestion, right?

So you must be wondering, how do we get started with this? It's as easy as selecting the log class of your choice when you're creating a new log group. In your create-log-group workflows, you will now see a new parameter called Log Class, which will have the default option of Standard, which is the log class we already had with all the capabilities and benefits it offers. But you now have the option to select the Infrequently Accessed class if you want.
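(Programmatically, the same choice shows up as a log group class parameter at creation time; a minimal boto3 sketch with hypothetical log group names.)

```python
import boto3

logs = boto3.client("logs")

# Standard class (the default) for critical application logs.
logs.create_log_group(logGroupName="/my-app/api-access")  # hypothetical name

# Infrequently Accessed class for debug/forensic logs you only query ad hoc.
logs.create_log_group(
    logGroupName="/my-app/debug",        # hypothetical name
    logGroupClass="INFREQUENT_ACCESS",   # omit or use "STANDARD" for the default
)
```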

Here is a high-level overview of the capabilities available between the two classes. I won't read through the whole list, but as I mentioned, you get the fully managed scale, security with encryption, cross-account analytics, and Logs Insights queries.

The same familiar tools and service offerings within AWS. And with our Standard log class, not only do you get the capabilities you already had, but you get all the new stuff, including all three capabilities we just announced, now included within your Standard ingestion cost, right?

How do you decide which log class is best for your logs? A simple way to think about this is: any logs which are critical for your application health monitoring and detection, where you need alarming or anomaly detection, still need to be in Standard, so you can get the full suite of benefits and act on any problems that occur in your application.

And for logs which you are only querying on an as-needed basis, primarily for forensic analysis, those can be sent to the Infrequently Accessed class. So now you can bring in all those other logs which previously didn't meet that bar for you from a Standard log class standpoint, and send them via the Infrequently Accessed class.

And as I mentioned on the pricing, this is an example from us-east-1: for custom logs in us-east-1, Standard will continue to be $0.50 per GB as it has been, and the new Infrequently Accessed class will be $0.25 per GB of custom logs ingested.
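(As a rough back-of-the-envelope illustration of those two rates, assuming a made-up 2 TB of custom logs ingested per month and ignoring storage and other charges.)

```python
# Back-of-the-envelope comparison of the two custom-log ingestion rates above,
# for a made-up 2 TB/month volume (storage and other charges not included).
monthly_gb = 2 * 1024                    # 2 TB of custom logs per month
standard_cost = monthly_gb * 0.50        # Standard: $0.50 per GB
infrequent_cost = monthly_gb * 0.25      # Infrequently Accessed: $0.25 per GB

print(f"Standard:              ${standard_cost:,.2f}/month")    # $1,024.00
print(f"Infrequently Accessed: ${infrequent_cost:,.2f}/month")  # $512.00
```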

And by the way, we already have vendor logs, which are logs coming from most of our AWS services directly to CloudWatch, and which are volume tiered, giving you benefits and discounts at scale. In the case of Standard, that starts at $0.50 per GB at the first tier and goes down to $0.05 per GB at the highest tier, so it was already very cost-effective at scale.

Within the Infrequently Accessed class, we have made it even better: your vendor logs pricing within the Infrequently Accessed class will start at $0.25 per GB as well, right?

So let's recap: what do you actually get from this new log class? As we talked about, you're able to consolidate your logs of different types, different values, and different capability requirements in one place, so your developers and your builders are not having to jump from one tool to another to solve that problem, and you're not losing visibility into those logs by limiting their generation or something like that. That links back to improved visibility: because you're not having to make those tradeoffs just because of cost, you get better visibility into your application health.

This new class is purpose built with capabilities focused on ad hoc forensic analysis, right? So we have chosen the capabilities that we know are needed for those types of logs and included them in this class. And by giving you the flexibility to make the cost feature tradeoffs relevant to your use case and your logs, we want to provide you more value from your logging within CloudWatch.

And this is validated by some of our customers as well. We've had customers like Fluent Commerce and DesignCrowd, both of whom have been using this new class for the last few weeks. And as you can see from some of their quotes, they love the flexibility. It's enabled them to enhance their application visibility, increase operational efficiency, better manage cost at scale, and send new workloads to CloudWatch. Best part: without needing to limit the logs that are sent.

We hope you will also try it out and get benefits from this new log class - improve your operational observability.

One last thing - we talked a lot, both Matt and I, about Logs Insights and queries and whatnot and how we can make it better, give you faster insights. But what if we could also make it simpler to get insights regardless of your level of knowledge with the query language, familiarity with CloudWatch, or even your actual application stack?

Yesterday we also announced in preview new natural language querying for both Logs and Metric Insights. With this capability, you can give natural language prompts which will help generate queries on your behalf, in context of your data, both for Logs and Metric Insights.

You'll be able to see the explanation for the queries that have been generated: why a certain command was picked in the generated query, and what each part of the query means. So if you are not familiar with the query language today, it helps you learn, but if you are, it doesn't hinder your ability to investigate today.

And you can also iterate on those queries. We know as part of log analysis, a lot of the times it's an iterative process, you don't know exactly what you want from the first query. So you can update and iterate the query based on the results you see.

Alright, let's do a quick demo of this capability and it's available in preview today. I'm not sure I mentioned the regions - us-east-1 and us-west-2.

Alright. So we are back at our Logs Insights console, and here I will select a log group, and then I'm simply going to click on the query generator icon. It'll open this new prompt. For simplicity's sake, let me delete this entire query. And because, let's say, I don't know the query language and I also want to learn, I'm gonna select "Include explanations for these generated queries".

So I'm just gonna make a simple ask: "Show me errors generated in 5 minute intervals". OK. So I don't really know too much about this function; this is all Matt's demo, as you could see. I don't know what he's been running and what he's not running. I'm just gonna ask a simple question. Let's see what it says.

So it has generated a query for me, and it is giving me an explanation of what it does. It's looking for the keyword "error" by 5 minute intervals. It has used some amazing things which I don't know much about, "bin" and "stats". They look fancy.

Alright, let's see how the results are. Let's run this query as is; I don't know whether it's good or bad, but I'm gonna run this query. Oh, look. It has generated the results by error, and it's pretty easy to decipher, even for somebody like me who may not know much about those queries. OK?

What if I now wanna just see the top, let's say I don't wanna see all the errors, I just wanna see the top 5. So I'm gonna say "Sort by count and show me top 5". Oh, sorry, "Update query". This may have an issue.

Alright, it has generated the query. Let's run it. Alright. There we go. So now it has sorted by error count. Oh, it's in ascending order and it's only showing me 5. So you can iterate on your query as you go to get, you know, the results you need.
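(For reference, hand-written Logs Insights queries roughly equivalent to what the generator produced in this demo might look like the following; treat the exact form as an approximation rather than the generator's literal output.)

```python
# Hand-written approximations of the queries the generator produced in the demo,
# expressed as Logs Insights query strings.

# "Show me errors generated in 5 minute intervals"
errors_by_interval = """
fields @timestamp, @message
| filter @message like /error/
| stats count(*) as error_count by bin(5m)
"""

# "Sort by count and show me top 5" (the demo happened to sort ascending)
top_error_intervals = """
fields @timestamp, @message
| filter @message like /error/
| stats count(*) as error_count by bin(5m)
| sort error_count desc
| limit 5
"""
```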

And this is also available in Metrics Insights, so I won't run through the whole thing again, but hopefully you get the idea. When you go into Metrics, into the CloudWatch Metrics Insights query view, you will see this new option, "Query generator".

So here, just before we started this demo, I gave it a prompt: "Show me invocation counts across various Lambdas". It generated a set of parameters for the SQL query that Metrics Insights runs, and when I ran it, it provided me a graph. And the same capabilities apply here as well.
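(A hand-written guess at the kind of Metrics Insights SQL the generator might produce for that prompt, run here through GetMetricData; the metric and grouping choices are assumptions based on the standard AWS/Lambda Invocations metric, not the demo's actual output.)

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# A guess at the Metrics Insights SQL the generator might produce for
# "show me invocation counts across various Lambdas".
SQL = (
    'SELECT SUM(Invocations) '
    'FROM SCHEMA("AWS/Lambda", FunctionName) '
    'GROUP BY FunctionName'
)

now = datetime.datetime.now(datetime.timezone.utc)
resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "q1",
        "Expression": SQL,   # Metrics Insights SQL runs as a GetMetricData expression
        "Period": 300,
    }],
    StartTime=now - datetime.timedelta(hours=3),
    EndTime=now,
)

# One time series per Lambda function comes back.
for series in resp["MetricDataResults"]:
    print(series["Label"], series["Values"][:5])
```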

You can update the query, you get explanations for any of your queries, and you're able to leverage it to learn as well as iterate on your queries.

Alright, let me switch back to the slides. Was any of this helpful? Are you looking forward to this? Awesome.

Alright. So let's walk you through where you can learn more. Here are some QR codes; take a picture and scan them for the three, well, it's not even three features, it's three buckets of features if you will: anomaly detection and pattern analysis, CloudWatch Logs Infrequently Accessed, and query generation. You can learn more about these through these links.

You can also stop by our Observability kiosk at the AWS Village anytime. We have some live demos, experts you can ask questions to, and some swag, and both Matt and I will be available for Q&A for another 10-15 minutes at the end of this session if you have questions. But we want to be cognizant of your time; I'm sure the first day was long.

And in case you wanna go grab a snack, we're letting folks go a little early, but we will be available to answer questions if you have any.

With that, thank you everyone. One last thing: please, please, please, this helps us tremendously, please fill out the session survey so we can improve this session in the future. A couple of minutes to fill out the survey goes a long way for us. Thank you so much.

