Demonstration of what’s new with AWS observability and operations

The purpose of this session is to do a lot of demos based on things that Werner just announced, like CloudWatch Application Signals and myApplications, as well as some of the other capabilities we launched this year.

So if you think about the life cycle of an incident as an operator, I'll tie the demos back to that. We'll get to the demos very quickly. It's pretty simple at the end of the day: detect, investigate, and remediate.

I've been an operator. How many of you in the crowd have been an operator before? A good chunk of you. This is what you deal with. When there's an operational incident, you have to detect it and figure out what's going on. What's the impact? Are my customers even impacted? You then investigate and eventually remediate. That remediation might be a hotfix that you put on prod bypassing your CI/CD pipeline, it might be a rollback, it might be just logging into that server. But this is a pretty typical life cycle of an incident.

Again, there's a broader breakout session at Mandalay Bay at 4pm where we're going to cover all of these features, but I'm going to demo a few of the key ones in this session right now.

Before I jump into the demo, I think it's important to recognize that when you're in the life cycle of an incident, a customer-impacting event, you don't spend the same amount of time in all three of these stages. Detection is usually very quick: you have alarms set up, and machine learning is powering those alarms. Remediation is usually very quick too: once you figure out what's going wrong, you can typically roll back or fix what's in production fairly quickly. It's the investigation that takes most of the time during an operational incident.

So I'm gonna move over to my demo laptop here. Just one second.

So the first place to start, I'm going to go over things that Werner talked about in his keynote. The one thing you'll see here at the top right of my Console Home is a pet website. This website is running on an EKS cluster, but I won't go too much into the details of what it is. Just think of it as: I'm running a business that stands up pet clinics around the world, so people have a place to take their pets when they get sick.

So I have an application called pet clinic, and I've defined it as an application using the feature we just launched called myApplications. I did this by creating a new application, linking CloudFormation stacks to it, and tagging other resources with one particular tag value.

Once I've created my application here, I can go through the different AWS services, tag those resources, and then bring all of that information into the myApplications view.
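
If you prefer to script this instead of clicking through the console, the association is ultimately driven by tags. Below is a minimal sketch using boto3's Resource Groups Tagging API; the resource ARNs, region, and the awsApplication tag value are placeholders you would swap for your own application's values.

```python
# Sketch: tagging resources so they roll up into an application view.
# Assumes boto3 credentials are configured; the tag key/value convention
# (awsApplication -> application ARN) should be verified in your own console.
import boto3

tagging = boto3.client("resourcegroupstaggingapi", region_name="us-east-1")

resource_arns = [
    # hypothetical ARNs for the pet clinic resources
    "arn:aws:sqs:us-east-1:123456789012:petclinic-queue",
    "arn:aws:dynamodb:us-east-1:123456789012:table/petclinic-owners",
]

response = tagging.tag_resources(
    ResourceARNList=resource_arns,
    Tags={"awsApplication": "arn:aws:resource-groups:us-east-1:123456789012:group/petclinic/abc123"},
)

# Any ARNs that could not be tagged come back here for inspection.
print(response["FailedResourcesMap"])
```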

There's a number of different widgets here, and I have most of them enabled. You'll see at the top right there are my top operations. This is not showing me every single synthetic canary or every single alarm in my account; it's simply within the context of the one application I set up. I'll talk more about Application Signals, which Werner also covered in his keynote, but you're already seeing some of that Application Signals information here at the top right, where I can look at my top operations by latency. There's other information in the application view as well.

There's an integration with Security Hub, which brings in its findings: you can see I have three high-severity but no critical items. The cost and usage data is coming through as well, again scoped to just my application, not my whole account, and then there are other metrics around compute and so on.

And then there's other information here as well, like patch compliance or my configuration compliance with AWS Config. But the focus is primarily going to be on observability.

So I'm going to dive deeper into this application. We'll go into CloudWatch, give it a second to load. I know Werner really just had a screenshot of this, so I'll do a bit of a deeper dive. We now have Application Signals on the left, and there are a few new things you'll see here. I'll go into the first one, which is Services.

The one thing I want to point out is that this enables you to instrument your Java-based applications on EKS today without doing any instrumentation on your own. If you're running a Java-based app on EKS, setting this up is as simple as clicking enable, going to EKS, selecting your cluster, and clicking OK. You can also do this in reverse: go to EKS, and there's a new observability add-on that you can enable within EKS.

So it's very simple to get started. Again, we support Java today, and this is in preview. You can use a custom option as well, so if you're not running Kubernetes on EKS, you're running it on premises, on other cloud providers, or just on EC2, you can still use this, again for Java-based applications today.
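
As a rough sketch of the "do it in reverse from EKS" path, the observability add-on can also be enabled programmatically. The cluster name below is hypothetical, and you should confirm the add-on name and any required IAM role against the current EKS add-on documentation.

```python
# Sketch: enabling the EKS observability add-on from code instead of the console.
# Assumes an existing EKS cluster; the add-on name is the CloudWatch observability
# add-on as documented at the time of writing -- check `aws eks describe-addon-versions`.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

eks.create_addon(
    clusterName="petclinic",                      # hypothetical cluster name
    addonName="amazon-cloudwatch-observability",  # CloudWatch observability add-on
    resolveConflicts="OVERWRITE",                 # take over any manually installed agent config
)
```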

So I already have this enabled, and I'll walk through what it's showing me. Without me instrumenting my application at all, it has automatically discovered that I have four services running on the EKS cluster. I've already set up some SLOs and SLIs, and I'll talk about the tracking we're doing there, but I can also go into each of these individual services and find all the dependencies and the operations under that particular service.

If I click the pet clinic front end, it takes me a level deeper. Now I'm looking at that particular service, and it's pulling in all of its service operations. Again, I need to stress, I didn't have to do anything to enable this other than saying, turn this on for my EKS cluster.

What we can see here is that I have 22 different operations running under this one service, and those operations themselves also have dependencies. For example, on this POST operation I have two SLIs, one of which has breached, but I can also go and look at the dependencies related to it. There's one dependency here, and it's a POST to another API. I may have another service that talks to SQS or Lambda, and you would see all of those dependencies there as well.

So I'm going to go back, and now that we've walked through that, let's take a look at something that has a lot of faults right now. In the case of this POST operation, it's calling an API around owners of pets, and what you can see in the widget on the right are spikes of faults.

The third widget over here is showing faults for this particular operation within that particular service. If I click any of those points, it's going to automatically bring in the correlated traces that are also automatically instrumented in your EKS cluster.

So now I can go and look at why I'm getting three faults. It looks like every minute there's something going on with this particular operation within the service. I click the fault, I get my list of the three faults that occurred, and I can click one of them, go into the individual trace, and understand why I'm getting a fault there.

If we zoom out a little bit, this is showing a trace view. This is X-Ray being surfaced in CloudWatch. I can see here on my trace map that I have a synthetic canary talking to an EKS container, which talks to another EKS container making three calls: one to SQS, one to a remote API, and another one to an SQS queue.

So let's figure out what's going wrong here. Why am I getting an error and a fault in this particular operation? We can go to a segment timeline view and break down all the communication that's happening within that single trace. The trace has connected all these pieces together, and we get a visual timeline of how it breaks down and where things are failing and causing faults.

What we can see here is that on our EKS container there seems to be an error occurring on SQS. Our container is calling another container, which is calling SQS; some sort of error is happening, and it's bubbling back up to our application.
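
For reference, here is roughly what the console is doing on your behalf when you click a fault data point, sketched against the X-Ray API. The service name in the filter expression is an assumption for this pet clinic example.

```python
# Sketch: pulling recent fault traces programmatically, roughly what the
# console does when you click a fault data point.
from datetime import datetime, timedelta, timezone

import boto3

xray = boto3.client("xray", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(minutes=15)

# Find traces from the front-end service that contain a fault (5xx).
summaries = xray.get_trace_summaries(
    StartTime=start,
    EndTime=end,
    FilterExpression='fault = true AND service("pet-clinic-frontend")',
)

trace_ids = [s["Id"] for s in summaries["TraceSummaries"]]

# Fetch the full segment documents for those traces to inspect exceptions.
if trace_ids:
    traces = xray.batch_get_traces(TraceIds=trace_ids[:5])
    for trace in traces["Traces"]:
        print(trace["Id"], len(trace["Segments"]), "segments")
```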

So we can go ahead and take a look at that. I'm going to click the EKS container that makes the call to SQS, and what we'll see if I click the Exceptions tab is that it looks like we're doing something we're not allowed to do with SQS.

There's an operation in SQS called PurgeQueue, and we're rate limited to calling it once every 60 seconds. So what's happening in our application? We have some code that's not respecting that rate limit, and we're trying to call the purge operation twice.
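
In boto3 terms, the failure mode looks something like the sketch below; the queue URL is a placeholder, and the fix in the real application would be to honor the once-per-60-seconds limit rather than retrying immediately.

```python
# Sketch: SQS allows one PurgeQueue call per queue every 60 seconds; a second
# call inside that window raises PurgeQueueInProgress.
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/petclinic-queue"  # placeholder

try:
    sqs.purge_queue(QueueUrl=queue_url)
    sqs.purge_queue(QueueUrl=queue_url)  # second call inside the 60s window -> the fault in the trace
except sqs.exceptions.PurgeQueueInProgress:
    # Respect the rate limit: back off instead of retrying immediately.
    print("Purge already in progress; retry after 60 seconds")
```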

So this shows you, at a high level, how you can look at the whole application and all the service operations underneath it, and then dive in really quickly and root-cause what's going on by looking at the traces correlated to that dashboard.

I'm going to go back up and talk a bit about the SLO and SLI tracking, and then we'll talk about some of the machine learning analytics we've added to CloudWatch Logs. If I go back up to CloudWatch, on the left-hand side here I can go over to service level objectives. You probably saw in the other dashboards I was pulling up that there were some SLOs and SLIs there.

What you can now do in CloudWatch is define your SLO rules and your targets, and we'll continuously measure how well you're tracking against them. You can define an error budget, set some parameters, and even create an alarm. If, say, you've already burned 80% of your error budget for the month, we can fire a CloudWatch alarm and you can then trigger another workflow.

I have four SLOs set up right now. You can have SLOs at both layers, the service level and the operations under that service, and these are the four I've set in my application. You can see one is breached: I went into my application and set some very unrealistic latency goals just to trigger an error, so this is what an unhealthy one looks like, and this would be an example of a healthier SLO that you've set up.

And creating one is actually really simple. If you're already using Application Signals to auto-discover the services and operations on your EKS cluster, you can simply come here and we'll prepopulate that list.

Here are the four services I showed you earlier. We have our pet clinic front end and then our list of operations, and we can add SLOs to each of those operations. Or, if you'd like, you can just tie this to a CloudWatch metric.

So you can use the SLO tracking even if you don't use Application Signals; if you have some sort of CloudWatch metric, you can set your SLI against that. There's other configuration you do here.

You can say, I want to set an interval: it could be a rolling day or a calendar day. It's up to you whether you want a rolling 24 hours or an actual calendar day. Then you set your attainment goal.

Say I want to attain 99.9%, three nines. You can do that, and then you can set some optional CloudWatch alarms: if you've actually gone past that, or you're approaching it, say 90% of the way through the error budget, we can set off an alarm so you can be alerted about it.
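
If you wire the alarm side up yourself against a metric-based SLI, rather than letting the console create it for you, it might look roughly like this. The namespace, metric name, thresholds, and SNS topic are all hypothetical.

```python
# Sketch: an error-budget-style alarm on a hypothetical metric-based SLI.
# The console flow above creates equivalent alarms for you.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="petclinic-availability-slo-warning",
    # Hypothetical SLI metric: percentage of successful requests.
    Namespace="PetClinic/SLI",
    MetricName="AvailabilityPercent",
    Statistic="Average",
    Period=300,                        # evaluate in 5-minute buckets
    EvaluationPeriods=12,              # over the last hour
    Threshold=99.9,                    # the three-nines attainment goal
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=[
        # hypothetical SNS topic that kicks off the on-call workflow
        "arn:aws:sns:us-east-1:123456789012:slo-breach-alerts",
    ],
)
```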

I'm going to show you one more thing, and then I'll probably have five minutes for Q&A. The other interesting thing we launched, this was Sunday night into Monday morning, is CloudWatch Logs anomaly detection.

I'm just going to go into CloudWatch Logs Insights. Prior to this week, Logs Insights was a pretty good analytics tool: you could write queries against your logs. What we've added is a lot of machine learning capability on top of it.

I'm just going to run a query; this is the experience you would have had last week. It's going off and looking for logs over the last hour. The Wi-Fi is a little slow, so it'll take a second to render here, and it didn't find any.

OK, I'm going to switch to a different application's logs, so bear with me. Let's try this one; this should be correct. So in the last hour we've had 20,000 log events for this particular log group.

What you can now do is use a pattern operator against the whole log event, just the message, or just a particular field, and click run query. What it's going to do is look for all the patterns within your logs.

Give it a second to run, and what you'll see is that across those 20,000 log events in the last hour, it has identified 132 different patterns within the logs, sorted by the most frequent pattern.

This first one accounts for roughly half of the 20,000 events, so we can go and actually inspect that pattern. If we drill in here, what you'll see is that we've identified the pattern and tokenized the parts that differ, so you can compare and contrast the tokens.

OK, which of these tokens show up the most? The first one is a timestamp, so that's obviously going to vary a lot; there's a lot of cardinality there. But take a look at token 10, for example: we can see there are three different URLs across those 11,000 log events, and 90% of the time it's one particular URL. This just helps you find that proverbial needle in the haystack more quickly, with one simple command: CloudWatch, go find all the patterns in my logs.
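
The same pattern analysis can also be run outside the console through the Logs Insights query APIs. The log group name below is hypothetical; the query string is just the pattern command shown in the demo.

```python
# Sketch: running a Logs Insights pattern query from code over the last hour.
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs", region_name="us-east-1")

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

query = logs.start_query(
    logGroupName="/petclinic/application",   # hypothetical log group
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString="pattern @message",
)

# Poll until the query completes, then print the discovered patterns.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})
```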

The other common question we get as operators is: oh, I've seen a bunch of errors. Were these happening last week? Is this new, or has it just been going on for a while and I haven't noticed?

So the other capability we added here is this compare feature. We can rerun this query, or any query, and compare it to what happened yesterday. It's going to rerun the query.

We'll give this a second, and effectively what we get here is that we're identifying the patterns again, and now we can see whether these are new patterns that just started occurring today that we didn't see in the earlier period.

So this is a really powerful tool to help you understand what's changed: is what I'm seeing right now during this operational event new? The last piece I'm going to do a really quick demo of is logs anomaly detection, and then I'll open it up for Q&A.

This is the third piece, logs anomaly detection. You turn it on at a per-log-group level, and we're constantly looking at your log groups and trying to identify anomalies in those applications.

Imagine you have a steady state of logs: your application generally produces 10,000 log events every hour. All of a sudden you're seeing errors in your logs more often, and the volume grows by 200%. This will identify that for you.
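
Since this is enabled per log group, a scripted version would look roughly like the following. The detector name and log group ARN are placeholders, and the exact parameters are worth double-checking in the current boto3 documentation because the feature is new.

```python
# Sketch: turning on anomaly detection for a log group from code.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

logs.create_log_anomaly_detector(
    detectorName="petclinic-anomalies",   # hypothetical name
    logGroupArnList=[
        "arn:aws:logs:us-east-1:123456789012:log-group:/petclinic/application",
    ],
    evaluationFrequency="FIFTEEN_MIN",    # how often the model evaluates new log events
)
```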

It's looking for errors and for keywords in your logs, things that look like errors, and it assigns a priority to them. Then you have that pattern view I showed you earlier, and you also get information about when we first detected this.

In this case, we first detected it 23 hours ago, and the last time we detected it was nine minutes ago, so it's still ongoing; it's been going on for about 24 hours. This also generates a metric.

As we find these anomalies popping up in your logs, it creates a metric in CloudWatch. You can create an alarm on that metric, drive the alarm to some sort of runbook, and take care of it through whatever processes you have in place.
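
To pull the "first seen / last seen" details shown above into your own tooling, you could list the detector's anomalies via the API. The detector ARN here is a placeholder, and the response fields should be confirmed against the boto3 documentation.

```python
# Sketch: inspecting the anomalies a detector has found, which is where the
# first-seen / last-seen information above comes from.
import boto3

logs = boto3.client("logs", region_name="us-east-1")

response = logs.list_anomalies(
    # placeholder ARN for the detector created earlier
    anomalyDetectorArn="arn:aws:logs:us-east-1:123456789012:anomaly-detector:example-detector-id",
)

for anomaly in response["anomalies"]:
    # Each anomaly carries its priority, pattern, and first/last-seen timestamps.
    print(anomaly)
```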

So that's a really quick overview. Like I said, I'm going to be at Mandalay Bay at 4pm, where we're doing an hour-long session on what I showed you as well as all of the 15-plus features we've launched between earlier this year and re:Invent.

So I'm happy to take Q&A.
