Observability: Best practices for modern applications

Welcome, everyone. Happy Thursday. I think it's the last session of the day before the party, so hopefully you'll enjoy this session. My name is Roland Barcia. I'm the worldwide director in our worldwide specialist organization, in our app mod team, helping our customers build, deploy, and run cloud-native applications, modern applications. And I'm joined by, and have the honor to present with, Greg.

All right: Greg Apple, tech leader for cloud operations. I cover observability, compliance, and governance. I run an internal tech community of about 1,000 folks, and prior to that I was the observability SA for the organization. All right. Excellent.

So I'll kick off and spend about 10 minutes talking a little bit about modern apps and why they're a little bit difficult to do operations and observability for. Then Greg is going to come up here, talk about some of the best practices, and actually show some hands-on examples. So we're really looking forward to that.

How many people consider themselves admins here? Let's see. OK. SREs? Developers? All right, architects, security folks. So we got a good mixed bag. I think developers are winning, and as a dev guy it warms my heart to see a lot of developers really concerned about observability, because oftentimes in the past it was a day-two problem. We really believe observability is a day-zero problem, and we believe modern apps, if built right, are built to be observed.

I think Werner said this morning that it's an event-driven world now, right? Everything is asynchronous, and that's the nature of modern applications. People struggle with observability in applications, everything from "it worked on my laptop but it didn't work in production" to actually using operations data and logs as a source of data to improve your systems.

So I'm not gonna go through the agenda, because we're gonna live the agenda, right? But let's get into why modern apps are more difficult to observe, and it's fairly simplistic. In my previous history I used to build J2EE applications, or Java EE applications, and they were monolithic applications. Monolithic applications are probably easier to operate, right? You have one big application that you build operations around. You probably had a single programming language, a single platform, tools designed around that platform. It's easy to monitor one thing, right? There might be a lot of things inside of that thing that have built up over years, and there are difficulties and challenges there. But in modern apps we've gotten more and more distributed. We're building applications with all sorts of various types of technologies. We're building microservices, cloud-native apps, different deployment models like Lambda functions, containers, ECS, EKS, even ROSA — tons and tons of services. And you're building lots of little applications that are communicating with each other, maybe across availability zones, across regions, with all these various different technologies.

And so when we get to microservices, right, microservices have a lot of different characteristics, and this is an ever-changing list. Loosely coupled, right? You have microservice-based applications that are calling each other via APIs. If you start building event-driven applications, you have things in between like topics, queues, EventBridge, event routers, streams, and you're chaining events downstream, right? So lots of communication. How many people struggle with downstream observability across components? Let's see, one here. Yeah, big struggle.

The other thing is we say it's stateless. The truth of the matter is there's state in systems everywhere; it's how state is handled, right? And now, even with modern apps, we're seeing people use things like containers and Lambda to build AI/ML pipelines and stateful types of use cases as well, so that's interesting. Polyglot: lots of different programming languages. We'd love everyone to be a full-stack developer and develop everything in the same language, maybe using something like CDK in that same programming language. But the reality is you have front-end applications, maybe in JavaScript and TypeScript using frameworks like React, with a bunch of stuff happening in the browser, calling APIs. Those APIs could be in Java and call something downstream that's an AI/ML pipeline in Python. So you're dealing with more programming languages.

If you're doing asynchronous development, you're losing context along the way, right? You need to carry context from one event to another event, et cetera. There are different types of hosting platforms, depending on the types of applications you're building — I'll get into that. You might be deploying functions in Lambda. You could also be deploying containers using things like ECS or EKS or ROSA, so you're hosted in different environments. You have things like networks in between these things, right? So you might have a Lambda function calling out across VPCs, out to another VPC, calling something in a container, and you have layers within there, you have clusters, et cetera — all different types of things.

All these tools... you know, people use microservices because they want to move faster, they want independent pipelines. So it's advantageous for developers to remove their dependencies and leave all the headache for the operations teams downstream, right, to kind of assemble these things. More emphasis on independent DevOps pipelines, more independence on scaling. You have some components that might scale up to thousands of pods, calling something that's a singleton, calling something that's scaled independently. So now you have to track not just the application layer, but what availability zone, or what region, or what cluster — where did my request go? If you're using things like service meshes, like Istio, you have sidecars and things inside of applications that get added per environment. So all these things create issues that make it hard to monitor these applications — and monitor is a traditional word, right? In reality, what we want to get to is how we deal with this complexity, because we're moving in this direction for a reason. We want to be more agile, we want to move faster, we want to deploy apps, we want our business to function quickly. But at the same time, the intention of microservices has always been to build observable applications. If you think about observability from day zero, building applications that are built to be managed and observed becomes part of your process, and it makes it a lot easier to do that.

And so, for example, it shouldn't be this big, giant, complex headache. Observability, to me — I'm not gonna read the definition — is really a full lifecycle: really thinking about the way you deal with logs in your system, metrics in your system, how you deal with tracing in your system, treating them like first-class citizens as part of your system and having a full lifecycle around them. Being able to troubleshoot problems and resolve things quickly will make sure you don't lose money every time you have a problem in production, et cetera.

How many people here write bug-free code? We have developers... I think I got one. All right. How many people have gotten bugs into production? Absolutely, right — things fail all the time. I'm gonna show a quote in a minute. Of course, app mod domains include things like DevOps, security, et cetera, but thinking about observability from day one — how I'm gonna troubleshoot things — is key, right? And so, you know, Werner says everything fails all the time, and there's the law of thermodynamics, entropy: everything runs down over time. Developers love to build something new and love to say, "Hey, junior developer, take over the cool thing that I wrote. I'm gonna go build the next thing and you should maintain what I build," right? That's like the nature of a developer. And so let's leave them with observable code, right?

One last point I wanna make when we talk about modern apps, a little bit on the developer side — primarily my world, the developer world: we've seen two strategies when it comes to app mod, and customers are having a lot of success with both. I talk about this because when we talk about modern apps, we in AWS wanna make sure we've got you covered on both sides. So we have our AWS serverless approach, serverless first, right? And really that's the notion of looking at AWS and using the tools around AWS and consuming as many managed services as possible: using technologies like Lambda and Fargate to write systems, using AWS VPCs, networking, and security, using infrastructure-as-code tools like CloudFormation and CDK, and using as many managed services as possible.

So technologies like Lambda, App Runner, ECS, Fargate — all these technologies really enable you, especially for teams and companies that don't have large central platform teams. We have other sets of customers that have gone all in on Kubernetes as their primary compute platform, for all sorts of reasons: they want a consistent platform between their on-prem and their primary cloud, they're deploying to different clouds, or they want the ecosystem around Kubernetes. We have managed Kubernetes, EKS, and we also have ROSA, which brings managed OpenShift in a partnership with Red Hat. The CNCF did a survey in 2022: over 60% of Kubernetes workloads that run in the cloud still run on AWS.

So what does this have to do with observability? Well, depending on the strategy, we have customers with different observability stacks. We think about the AWS serverless, native approach, and you can use different stacks across the board — meaning you can use AWS native services with Kubernetes or vice versa. But the point is we have a set of managed services in an observability stack, and we also support the various popular open-source tools in managed form, things like managed Prometheus and managed Grafana, as well as working with partners if you're doing this with EKS. So a lot of these best practices apply across the board.

Greg is gonna focus on the native side — serverless first — with a lot of his demos, but there are other sessions that have happened this week, and you can see the recordings, that talk about, hey, if you're doing EKS or Kubernetes. How many people here are using Lambda? Fargate? ECS? EKS? And who's using ROSA? All right, got one over there. Awesome. So you're gonna hear a lot of great best practices. With that, I'm gonna turn it over to my buddy here, Greg. All right.

Can you guys hear me OK? Good. Awesome. Thanks, Roland. So, yeah, I'm gonna cover four best practices. This is really me distilling things down: I've talked to hundreds of customers over the last few years around observability, and I've taken all these customer conversations and distilled them down to these four best practices. The first one is gonna be navigating your instrumentation options. There are lots of options out there in terms of how you instrument; I'll talk about what we'll do for you and what you will need to do as a customer when it comes to instrumentation.

One thing that's really challenging in modern workloads is that there are a lot more metrics and logs, and that can lead to high cost. So as the second best practice I'll talk about how you optimize the cost of high cardinality. After that — because lots of metrics also potentially means there could be a lot of alarms going off — we'll talk about how we can reduce some alarm fatigue as well. And last but not least, probably the most important and most misunderstood: how do you avoid dangling traces? How do you get that end-to-end trace in your application?

So let's start with navigating your instrumentation options. Instrumentation on AWS is a shared responsibility. We will send a lot of metrics and logs — especially metrics — from our services to CloudWatch, but you shouldn't stop there. You need to take on some of the responsibility to do instrumentation. I'll walk through our best practices on how you should think about the metrics, logs, and traces we send, and what you should be doing in your applications. But I can't stress this enough: this is a shared responsibility. Let's talk about AWS first. This is a vended metric, and it shouldn't be a surprising one: this is Lambda, and it's the number of invocations. It sounds simple, but don't make any assumptions about what this metric means. There is a fantastic page — you probably want to take a picture of this QR code — in our documentation that links to every single service that sends a CloudWatch metric. If you're starting to use a new service, whether it be Kinesis Video Streams, Cognito, or API Gateway, I cannot stress this enough: go look at those metrics. Bookmark this page, and when you go to use another service, understand what we're sending to CloudWatch for you.

So let's take the Lambda example — this is the link to the Lambda docs. If we click through, this is the Lambda docs page, and there are a few things you're going to notice in every single service's documentation. First, we're going to tell you what statistics are available; not every statistic is available for every service and every metric. In this case, this is a Sum statistic — the number of Lambda invocations. For other metrics there may be percentiles, there may be averages. The documentation will explain which ones you should be looking at and how you should interpret that data.

The second piece you'll find in our documentation on our metrics is a plain-English explanation of what that metric is measuring. Again, don't make assumptions about all of this stuff; read through the docs and really try to understand them — there's a lot of rich material here.

And third, last but not least, is the dimensionality. A lot of our services will send a metric across different dimensions. For example, we have our synthetic testing service: we can set up a synthetic canary, and we're going to send a duration metric for that canary. That canary can also have steps within your test, and we'll send a duration for each of the individual steps. So it's really important to understand the dimensionality of these metrics.

Let's talk about logs. This is just an example; there are a lot of services that send logs into CloudWatch, and some services will send logs into S3. Again, when you use services like Lambda, you largely don't need to do anything initially — we will send those metrics and logs automatically for you. And then for some services, like Lambda, enabling tracing is simply a checkbox or a toggle. In this case it's a toggle: you go into your Lambda function, you say "I want to enable X-Ray," and that will start to capture — or we start to vend — traces every time that Lambda function executes. API Gateway is very similar, and we'll talk more about trace propagation later. But let's also talk about what you need to do here, because this is a shared responsibility.

I'm gonna start, first of all, with serverless — more specifically Lambda — and then we'll talk about containers in terms of instrumentation. So instrumentation in Lambda is fairly straightforward. This is an example of a log: there's a console object in the Lambda function, and we can log different types of logs — log, info, warning. Again, very simple to get started: you basically just use this object and start logging. These are just three examples of me sending different types of logs to CloudWatch.
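
A minimal sketch of that kind of Lambda logging in Node.js/TypeScript (the handler shape and field names are illustrative, not the exact code from the slide):

```typescript
// Anything written via console.* from a Lambda function ends up in that
// function's CloudWatch Logs log group. Handler and fields are made up.
export const handler = async (event: { orderId?: string }) => {
  console.log("Received event", JSON.stringify(event));                        // plain log line
  console.info({ level: "INFO", msg: "processing order", orderId: event.orderId });
  console.warn({ level: "WARN", msg: "falling back to default currency" });
  console.error({ level: "ERROR", msg: "downstream call failed", retryable: true });
  return { statusCode: 200 };
};
```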

Now, in terms of metrics, how do you send a metric to CloudWatch? There are a few different ways. You can call the PutMetricData API — I wouldn't recommend that, because you'd be introducing a synchronous call to our API. The more efficient way, and I'll dive deeper into this when we talk about high cardinality, is to use something called Embedded Metric Format. This is just an example of me bringing in a namespace, applying some dimensions, putting the metric, and also setting some properties. I'll talk more about how that works later in the presentation.
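
A rough sketch of what that looks like with the open-source aws-embedded-metrics client for Node.js (the namespace, dimension, and metric names here are invented for illustration):

```typescript
import { metricScope, Unit } from "aws-embedded-metrics";

// Embedded Metric Format: the library writes a structured log line and CloudWatch
// extracts the metric asynchronously, so there is no synchronous PutMetricData
// call in the request path.
export const handler = metricScope((metrics) => async (event: { requestId: string }) => {
  metrics.setNamespace("OrderProcessing");                               // hypothetical namespace
  metrics.putDimensions({ Service: "order-service", Environment: "prod" });
  metrics.putMetric("ProcessingLatency", 142, Unit.Milliseconds);
  // High-cardinality values go on as properties: searchable in the log, not metric dimensions.
  metrics.setProperty("RequestId", event.requestId);
  return { statusCode: 200 };
});
```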

And then we'll talk about traces from a Lambda perspective. Like I said, it's a checkbox to get Lambda to instrument itself so tracing shows up in X-Ray. But if that Lambda function needs to go do something — put an item in an S3 bucket, call another Lambda function — there is some instrumentation that you will need to do. This is just an example of that: I'm using the X-Ray SDK and wrapping my call to S3 in it, and this will further propagate that trace downstream.
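
The speaker's snippet isn't reproduced here, but the same idea in Node.js, assuming the aws-xray-sdk-core package and AWS SDK v3, looks roughly like this (bucket and key are placeholders):

```typescript
import AWSXRay from "aws-xray-sdk-core";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Wrapping the client records a subsegment for every S3 call and carries the
// incoming trace context on the request, so the downstream call shows up on
// the same trace instead of dangling.
const s3 = AWSXRay.captureAWSv3Client(new S3Client({}));

export const handler = async () => {
  await s3.send(new PutObjectCommand({
    Bucket: "my-example-bucket",          // placeholder
    Key: "orders/hello.json",             // placeholder
    Body: JSON.stringify({ hello: "world" }),
  }));
  return { statusCode: 200 };
};
```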

So again, Lambda is fairly straightforward; serverless is fairly straightforward. What about containers? There are a lot of options out there, and I put together this little table of your different options for logs, metrics, and traces. The one thing you'll notice is that there's no one solution out there that currently does all three. We have the AWS Distro for OpenTelemetry, which collects metrics and traces but doesn't do logs. The CloudWatch agent collects logs and metrics but doesn't collect traces. We have things like Fluent Bit and FireLens, which are simply collecting logs; they have nothing to do with metrics and traces. So right now there's nothing that covers all three of these signals, and we'll talk more about my recommendation on how you should think about this and move forward with choosing an instrumentation agent.

"But overall, I would say start with OpenTelemetry and then pick what works in terms of logs. And if you're not familiar with OpenTelemetry, it's basically an open standard to collect observable signals, the metrics, logs and traces. So, metrics and traces right now are GA, logs are not, they're in the draft status for the specifications. So currently you can't in, in a GA sense, collect logs right now for OpenTelemetry.

But the idea of OpenTelemetry is just to unify the collection of these signals. You have your application, and you do some level of manual instrumentation: you're pulling, for example, system logs and infrastructure metrics off your containers, you have your app logs, you're doing tracing, and you have application metrics that you're vending. The idea is that all three of these signal types come to an OpenTelemetry collector. Again, like I referenced in the previous slide, we have our own version of the collector: it's called AWS Distro for OpenTelemetry, or what we call ADOT.

The AWS OpenTelemetry collector sits there and just consumes these signals in an open, standard format. There's an optional enrichment process, so as you bring in your logs, metrics, and traces, you can enrich those signals, and then you can export them — and this is where the power of OpenTelemetry really shines: you can export to multiple locations. You can export, for example, to X-Ray or Jaeger or Zipkin. So you don't have to choose, and you don't have to use a proprietary SDK to collect this information. You can collect it once and send it to multiple places.
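
To make that pipeline concrete, a minimal ADOT collector configuration might look like the sketch below — one OTLP receiver, an optional batch processor, and separate exporters for traces and metrics (all values are placeholders; real configs usually add regions, credentials, and more processors):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch: {}            # optional enrichment/batching stage

exporters:
  awsxray: {}          # traces to AWS X-Ray
  awsemf: {}           # metrics to CloudWatch via Embedded Metric Format

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsemf]
```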

And then the idea is that because you have one central place collecting all three of these signals, we can do correlation. So for the first best practice, just the takeaways, and then we'll move on to the next one. Start with OpenTelemetry for your metrics and traces. If you choose to use our distro, it's supported by AWS Support; you don't have to — you can use other distributions, that's fine — but start with OpenTelemetry. Then choose a logging agent: use FireLens, Fluent Bit, or CloudWatch Logs, but pick what works for you and your workload. And like I mentioned, hopefully — fingers crossed — in 2023 we will see logs go GA in the OpenTelemetry specification, and at that point in the future, when logs are GA as a specification, use OpenTelemetry for all three.

Alright. So let's talk about the second best practice: optimizing the cost of high cardinality. I'll explain what high cardinality is if you're not familiar with it. Here are some metrics I put together. We have a latency metric — this is measuring the latency of something — and we have three different dimensions: request ID, customer ID, and service. We have a few services here: PutItem, ListItem, GetItem. Those first two columns are potential examples of high cardinality, and I'll explain why, and what the implications of high cardinality are.

So let's take a metric like HTTP response — maybe this measures latency. Here's a single metric, and let's say we have two dimensions. We have status code — there are only five status codes: 200, 203, 404, 500, 100 — and we have environment, so we have dev and prod. Just from these two dimensions you can see the cardinality start to build: one metric with two dimensions and a few values each. But let's apply this and scale it out.

Let's think about your more traditional monolith. You have a VM environment — maybe you're running on EC2, maybe you're running on-prem, it doesn't really matter — but it's your typical monolith. You're probably still gonna have a dev and a prod, I would hope. Maybe we only have 10 services; we're starting on that microservices journey and we have 10 different services. We still have five status codes — that doesn't change — and maybe we're load balancing this across 10 instances.

So how many metrics is this? It's about 1,000 potential metrics: once you take all those different values, you could potentially have 1,000 different metrics. Now let's shift to a more modern architecture. Same metric, HTTP response, we're measuring latency, and we still have two environments, dev and prod. Now we have 100 services — we're starting to break down that monolith and we have more microservices. We still only have five status codes, but maybe these services are now super ephemeral, and they're tasks or pods or whatever your container system uses. Let's say we're using ECS and we have about 10,000 tasks — let's just call them pods. How many metrics is this? This is 10 million potential metrics.

So we've gone from 1,000 potential metrics in our monolith to over 10 million in this fictitious example of moving to a modern architecture. Can everyone guess what this costs? If you look at the public pricing in us-east-1 for custom metrics, 1,000 metrics will cost you about $300 a month. Now, as you use more and more metrics, the price per metric gets cheaper, but 10 million metrics would cost just shy of a quarter million dollars a month. And this is not specific to CloudWatch: if you look at other vendors' metrics solutions, metrics just cost a lot of money. Different people charge in different ways, but the end result is, if you're going to send a lot of metrics, you're going to pay for it.
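
As a back-of-the-envelope check of those numbers (the tier prices below are the public us-east-1 custom-metric prices as they were around the time of the talk; treat them as illustrative and check current pricing):

```typescript
// Potential metric count = product of the distinct values of every dimension.
const monolith = 2 /* envs */ * 10 /* services */ * 5 /* status codes */ * 10 /* instances */;  // 1,000
const modern   = 2 /* envs */ * 100 /* services */ * 5 /* status codes */ * 10_000 /* pods */;  // 10,000,000

// Tiered custom-metric pricing (illustrative us-east-1 numbers, per metric per month).
const tiers: Array<[number, number]> = [
  [10_000, 0.30],    // first 10k metrics
  [240_000, 0.10],   // next 240k
  [750_000, 0.05],   // next 750k
  [Infinity, 0.02],  // beyond 1M
];

function monthlyCost(metricCount: number): number {
  let remaining = metricCount;
  let cost = 0;
  for (const [tierSize, price] of tiers) {
    const inTier = Math.min(remaining, tierSize);
    cost += inTier * price;
    remaining -= inTier;
    if (remaining <= 0) break;
  }
  return cost;
}

console.log(monthlyCost(monolith)); // ~$300/month
console.log(monthlyCost(modern));   // ~$244,500/month -- "just shy of a quarter million"
```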

So I work with a lot of customers, including customers that have hundreds of millions of metrics. And traditionally there's been a trade-off you had to make. You could pay that cost and store all of it as metrics — but what are you going to do with 10 million metrics? You can't graph them, and you're certainly not going to alarm on that many metrics. Or you could go with logs: yes, it's more cost-efficient, but logs are inherently slower than metrics. So a lot of customers I worked with had to choose between these.

But my recommendation here is: use both. Again, as Roland said, I'm gonna approach this from a CloudWatch and AWS-native perspective, but this principle applies to any type of workload — you could do this yourself. It just so happens we provide a feature for this called CloudWatch Embedded Metric Format, and in fact the OpenTelemetry collector actually has the ability to send your metric data through this feature into CloudWatch.

So let's just imagine — I'll go to a demo here in a second — we have a piece of telemetry here. You're going to see three colors: in green I have my dimensions; in pink, or salmon, or whatever you want to call that color, that's our potential high cardinality; and then in blue, or aqua, we have our metric. We take this telemetry that we're pulling off systems and applications and servers, we land it as a log first, and then we choose what to actually create a metric from. We don't need to create a metric for every single pod and task ID. Instead, aggregate those metrics — only create aggregated metrics — and this can lead to huge cost savings.

So I'm going to switch to my demo and walk you through a little bit of what this looks like. My laptop is asleep, so bear with me one second. OK, we're over in my demo environment now. I apologize — I'm a .NET developer, so all my demos are going to be in .NET. This is just an example: I created a little application that creates a lot of metrics every few minutes. But this is basically how you would create this manually if you want to send some metrics through CloudWatch Embedded Metric Format.

So I'm going in and creating dimension sets — I'm saying what types of dimensions I want to use — and they're going to vend basically two metrics. The first dimension set is gonna be the service name, the status code, and the environment. I have a second dimension set, which is just the service and environment, and then a third, which is just the service, and I apply those dimension sets. Then I put my metrics here — I'm just generating some random numbers, but I have two metrics I made up: processing latency and processed records.

And then there are some examples of high-cardinality data — for example, request ID; maybe that's a GUID. You should never, ever put a GUID as a dimension on a metric, or, say, a task ID if you're dealing with tens or hundreds of thousands of tasks. This is currently running in the background, so I'll show you what this looks like in CloudWatch.

So if we go over to CloudWatch here, I'll just go to my first page. All this telemetry comes in as a log first, and I'm going to go ahead and run this to show you an example of what it actually looks like when it lands in CloudWatch. Here's an example of that telemetry: we have our processing latency, our records, those dimensions, as well as these high-cardinality properties, and then there's a specification for embedding the metrics within the log. We're defining how this should land as a metric: we're not going to create a metric per task ID, we're going to create metrics based off these aggregate dimensions. So I'll show you how that looks as a metric.
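
For reference, an Embedded Metric Format log document has roughly the shape below: the `_aws` block declares which dimension sets and metrics CloudWatch should extract, while high-cardinality values stay as plain properties (all names and values here are made up to mirror the demo):

```json
{
  "_aws": {
    "Timestamp": 1670000000000,
    "CloudWatchMetrics": [
      {
        "Namespace": "aws-embedded-metrics",
        "Dimensions": [
          ["ServiceName", "StatusCode", "Environment"],
          ["ServiceName", "Environment"],
          ["ServiceName"]
        ],
        "Metrics": [
          { "Name": "ProcessingLatency", "Unit": "Milliseconds" },
          { "Name": "ProcessedRecords", "Unit": "Count" }
        ]
      }
    ]
  },
  "ServiceName": "order-service",
  "StatusCode": "200",
  "Environment": "prod",
  "ProcessingLatency": 142,
  "ProcessedRecords": 7,
  "RequestId": "example-guid-not-a-dimension",
  "TaskId": "task-000123"
}
```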

And here's my namespace in CloudWatch, aws-embedded-metrics. I only have about 104 metrics here, and you can see those dimension sets: I can see my environment and my service. I have four services — a cart service, a search service, a payment service, and an order service — and I have this separated by environment, or maybe I just want to look at the services across all environments. But the idea here is that I'm not creating tens of thousands of metrics; I'm only creating aggregate metrics, things I would want to put on a dashboard or drive alarms with. The beauty of this is that it's still all available as a log, and if you do need to do some sort of aggregation, you can still do that as a log. I'll just go back to Logs Insights here: all that telemetry is still available for me to run my queries and do aggregation. It's gonna be a bit slower because it's a log, but it's still there. So CloudWatch Embedded Metric Format basically gives you the best of both worlds. Again, if you're not using CloudWatch, that's fine — you can still apply this pattern elsewhere.
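
Because every EMF document is still a log event, an aggregation you deliberately did not pre-compute as a metric can still be run on demand. A Logs Insights query sketch against the demo's (illustrative) field names:

```
fields @timestamp
| filter ServiceName = "order-service"
| stats avg(ProcessingLatency) as avgLatency, pct(ProcessingLatency, 95) as p95Latency by TaskId
| sort p95Latency desc
| limit 20
```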

So, our second best practice, just some takeaways: identify potential high-cardinality dimensions. I'll give you an example — I can talk about this publicly because I did a blog post on it. I worked with British Telecom, and in this blog post, if you search online, you'll see they use this feature. They have millions of devices across the United Kingdom — they're called Smart Home hubs — that send a bunch of telemetry every five minutes; let's call it 20 pieces of telemetry. So a million devices times 20 metrics: that's 20 million metrics. They used CloudWatch Embedded Metric Format to bring those 20 million metrics into CloudWatch every few minutes, and it only created about 2,000 metrics on the aggregation points that made sense for the business: how much jitter is coming off these devices in all of England, for example, or Wales, or the whole of the United Kingdom — or maybe just Belfast. So they're an example of a customer using this at scale.

So again: identify those high-cardinality dimensions, ingest your telemetry as a log first, and then create metrics using the appropriate aggregate dimensions that make sense for your business. If you're using CloudWatch, you can leverage the feature I just showed you; if you're not, that's fine. But if you are, and you're using OpenTelemetry, we will send the metrics automatically for you using basically this technology.

Alright, let's move on to our third best practice: reducing alarm fatigue. No surprise — I think everyone in this room would agree that metrics usually drive alarms. Well, we just finished talking about high-cardinality telemetry. Even in the monolith it's really easy to get alarm fatigue; there are a lot of metrics. When we move to a modern architecture we have a lot more metrics to deal with, and so you can imagine the kinds of alarms that would be going off.

I've been there. I've been in an ops role when alarms are going off, wondering what the heck is happening. This is usually the look I have — I would say it's a look of fear and confusion, like, what the heck do I do next? Alarm fatigue is a really big problem in observability.

So the first thing I would say is really think about what you need to alarm on: alarm on key performance indicators and workload metrics. This could be order rate — you can imagine a very high-volume e-commerce website might really care about how many orders are going through a second, or the number of orders processed — or, if you're building event-driven architectures, what's the queue depth on your SQS queue, for example? Think of these important metrics, things that will tell you that something's gone wrong and it's affecting the user, and alert when those workload outcomes are at risk.

A lot of this should actually sound familiar if you have read the Operational Excellence pillar in Well-Architected. Alert when workload anomalies are detected, and automate responses to events as much as possible. Correlation is really important, too. I'll use my .NET background — Windows web servers — and let's go back and think about those monolithic apps.

High CPU is not a great alarm. Please don't ever alarm on high CPU alone; it does not tell you whether the user is having a bad experience. Memory utilization alone may not be a good alarm either. However, if both are happening at the same time — let's say CPU is going up to 100% and memory has just dropped off a cliff — in a Windows workload with IIS that is a very bad thing for users, because what's basically happening is an application pool is recycling: all requests are being stopped and the CPU is going high.

So if you can measure these two things, or correlate two alarms happening at the same time, and then alarm on that, that is very powerful. Like I said, this pauses the HTTP request pipeline for Microsoft's web server, and the users are going to notice. It's sometimes really hard to figure out what you should correlate, and unfortunately you sometimes have to go through these operational events and learn from them — but take that as an opportunity to learn and improve your alarm posture.

Other best practices around alarms: leverage statistical and machine learning algorithms as much as possible. You want to generate an anomaly detection model to show you an expected range, and alarm on that dynamic threshold, not a static threshold — I'll show you why in a few slides and in my demo. There may be valid reasons for spikes and dips in your metrics, and you don't want false positives, so machine learning can really help here and make your alarms more intelligent. Again, because I'm approaching this from a cloud-native, AWS-native perspective, I'm gonna show you two features in CloudWatch that will help you with this, but you want to keep these patterns and principles in mind no matter which observability solution you're looking at.

So I'm gonna switch back to my demo. One of the most powerful and easiest things you can do to get started on creating better alarms is synthetic testing. It doesn't have to be CloudWatch — that's just what I'm showing — but you want synthetic testing. Synthetic testing and synthetic monitoring will give you a representative understanding of what an actual user is experiencing.

So I just have a little test here, a little canary — we call these synthetic canaries — and this runs every single minute; I just started it a few hours ago. Every time this thing runs, it's going to create a bunch of metrics. If I go over here, I get an average duration: when I hit this site, which is a very slow site, it takes about 120 seconds to execute, and I can see how that ebbs and flows throughout the day. But this is a fantastic metric to alarm on, because it's measuring — it should be measuring — what a normal user is going to experience, and it's really easy to get started with this. All you do is go to Synthetics canaries and create a canary.

"If you're familiar with Selenium or Puppeteer, both those are supported. A Synthetic Canary is effectively a Lambda layer running the headless Chrome browser. And so you can use Selenium or you can use Puppeteer. If you don't know either of those, we have a web recorder in the Chrome store. So you can just go ahead and click it on a page and generate a script and then paste it into a Canary and, and get started right away.

We also have blueprints: if you just want to heartbeat something — go to this page, take a screenshot, ensure that I got a 200 response back — you can do that. So again, if you're trying to figure out what you should alarm on and you don't have synthetic testing and monitoring, start with this. I also mentioned machine learning and anomaly detection; this is a really good example of that. I have a traffic generator, and this metric here is orders processed.

So it's setting a metric value, and then every hour it's gonna dip. Imagine an e-commerce site: maybe you're not global, you're just focused on, let's say, the US. You're probably gonna see an ebb and flow — yeah, there are people up at 3 a.m. ordering things off your website, but most of the traffic comes during the day. With static thresholds this is really hard to alarm on, because you have to accommodate these dips, and you're gonna constantly get false positives even though this is expected behavior.

When you turn on anomaly detection, it will learn what normal looks like. It will say, hey, there's a pattern here, I've seen this before, and I don't think this is something we actually should alarm on. Once you enable anomaly detection on a metric, you can go create an alarm, and instead of a static threshold, which is very difficult to maintain, you can simply say: I want an alarm when this goes outside of that band. You can make the band narrower or wider using a parameter for standard deviations, or you can say only when it goes above, or only when it goes below. Maybe for orders processed, going high is a good thing — you just want to alarm when it goes low, because something's wrong.
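
Creating that band-based alarm programmatically looks roughly like this with the AWS SDK for JavaScript v3 (the namespace, metric, and dimension names are the demo's illustrative ones):

```typescript
import { CloudWatchClient, PutMetricAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Alarm when OrdersProcessed falls below the lower bound of the learned band.
async function createAnomalyAlarm(): Promise<void> {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: "orders-processed-anomaly",
    ComparisonOperator: "LessThanLowerThreshold",   // only alarm on unexpected drops
    EvaluationPeriods: 3,
    ThresholdMetricId: "band",                      // compare against the band, not a static number
    Metrics: [
      {
        Id: "orders",
        ReturnData: true,
        MetricStat: {
          Metric: {
            Namespace: "aws-embedded-metrics",      // illustrative namespace from the demo
            MetricName: "OrdersProcessed",
            Dimensions: [{ Name: "ServiceName", Value: "order-service" }],
          },
          Period: 60,
          Stat: "Sum",
        },
      },
      {
        Id: "band",
        ReturnData: true,
        Expression: "ANOMALY_DETECTION_BAND(orders, 2)",  // 2 standard deviations wide
      },
    ],
  }));
}

createAnomalyAlarm().catch(console.error);
```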

The other thing I talked about was correlation, and this is also really important. Think about the metrics you can alarm on, and look for the cases where, let's say, three things happen and they all happen at the same time — that's when I wanna alarm. If you can capture that — and in CloudWatch and in many other systems, you can — you can correlate those alarms. What I have here is an example of that: I have a metric for each of the four services I showed you earlier — the order service, the payment service, search, and cart.

Those all have alarms, but they're not sending anything to anyone; they're just there to either pass or fail. Then I have some Boolean logic — if I go to my alarm rule here — that says this composite alarm should only go into alarm when all four of those alarms are in a bad state. This helps you reduce alarm fatigue, because you're minimizing the number of things that can notify folks, and you're making the alarming more intelligent. And you can have a composite of composites — I believe it's up to 1,500 — so you can build quite an extensive alarm tree within CloudWatch.
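
The Boolean rule the demo describes can be expressed directly as a composite alarm; a sketch with four hypothetical per-service alarm names and a placeholder SNS topic:

```typescript
import { CloudWatchClient, PutCompositeAlarmCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Page someone only when all four service alarms are in ALARM at the same time.
// The child alarms themselves carry no notification actions; only the composite does.
async function createCompositeAlarm(): Promise<void> {
  await cw.send(new PutCompositeAlarmCommand({
    AlarmName: "all-services-down",
    AlarmRule: [
      'ALARM("order-service-alarm")',
      'ALARM("payment-service-alarm")',
      'ALARM("search-service-alarm")',
      'ALARM("cart-service-alarm")',
    ].join(" AND "),
    AlarmActions: ["arn:aws:sns:us-east-1:123456789012:page-oncall"],  // placeholder topic ARN
  }));
}

createCompositeAlarm().catch(console.error);
```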

Just an example of this: you can see our synthetic canary here as well, and here's my composite alarm, called "all services down." This thing should be waking up the CTO — if everything's broken, something's gone horribly wrong. All right, let me go back to my deck and we'll go on to the last best practice. Oh, sorry — in summary: alarm on KPIs and workload metrics, and alarm when your workload outcomes are at risk. CloudWatch Synthetics canaries are a really good way to do this because they simulate the user. Correlate your alarms and notify a person when that correlation happens — you can still have a lot of alarms, just make sure the alarms inside the composite aren't also paging people. And leverage machine learning algorithms to detect anomalies.

So, between these three features in CloudWatch, we have the ability to help you reduce alarm fatigue. And it looks like some pictures are being taken — all of this will be online on YouTube in a few days. All right, last best practice, my favorite: avoiding dangling traces. This really confuses people — it confuses me. It's really hard to understand how tracing works. But first of all, let's talk about what tracing is.

Here's an X-Ray trace ID. It has three components; in our systems, if you look at the request or the environment variable, the trace ID is described this way. The first component is the root — this is the trace ID, the ID that ties everything together. Then we have the parent, which is optional: if you're starting a trace and you don't have anything upstream, the parent is gonna be empty. And then you have a sampling decision. If you deconstruct an X-Ray trace, this is basically what it looks like — these three components.
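
Concretely, that context travels on the X-Amzn-Trace-Id tracing header (or, inside Lambda, the _X_AMZN_TRACE_ID environment variable). The values below are made-up examples of the three components:

```
X-Amzn-Trace-Id: Root=1-63d6f5a2-0123456789abcdef01234567;Parent=53995c3f42cd8ad8;Sampled=1

Root    = the trace ID that ties every span/segment together
Parent  = the immediate upstream span/segment ID (absent when you are starting the trace)
Sampled = the sampling decision (1 = record this trace)
```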

So how does this actually work? Let's take a classic example: API Gateway talks to a Lambda function, and that Lambda function puts an item in an S3 bucket. The ideal trace is something that captures all three of these things. For each of these steps you'll see these circles, for example in CloudWatch ServiceLens. I'll use the OpenTelemetry term, span — we say segment in X-Ray, but I'll use the OpenTelemetry term. So for every span we have our trace ID; let's say the trace ID is 123. That first span, the one that starts the whole trace — API Gateway — let's say its span ID is abc.

So API Gateway talks to Lambda. For API Gateway to Lambda you don't have to do anything — you basically toggle a setting — and a trace context is passed to Lambda. That trace context contains the trace ID 123 and the upstream parent abc. Lambda receives that context, generates its own span ID (or segment ID), def, and that context is then passed further downstream. The trace ID remains the same, 123, but when this gets passed to S3 the parent is different, because the parent is now coming from Lambda.

Let's take a slightly different example. Let's pretend we have an ECS Fargate container, and it's publishing onto Amazon MQ — RabbitMQ, let's say. Now, we've instrumented ECS, our container, with X-Ray or OpenTelemetry, so it's generated the trace ID, it's generated a span, and it's passing that down — but there's nothing in RabbitMQ that knows anything about OpenTelemetry or tracing, so nothing is actually sent there. Then say we have another ECS Fargate container pulling something off the queue. Well, there's no context for it to pick up. What this ends up giving you is a dangling trace: you'll basically see two disconnected services, and they won't be connected whatsoever. So that's the dangling trace problem — I'll come back to it in a second. But how do we get started with tracing if all you're using is containers? This is fairly simple: as long as your language is supported, just install an OpenTelemetry agent and use the SDKs. That's all you need to do.

Unfortunately, with cloud-native services, the answer is really: it depends. So bookmark this page — it will take you to the X-Ray documentation, and these are some of the services that natively support tracing. I'm gonna talk about a few of them quickly, then we'll do a demo and wrap up. Elastic Load Balancing supports X-Ray — a lot of people actually don't know this. You're not going to see an ALB on a trace map, but when an ALB is handling a request, if it detects a trace header it will pass it downstream; if it doesn't, it's gonna generate a new one. So ALBs support tracing. Lambda we already talked about: enabling X-Ray on Lambda means traces automatically get sent when that Lambda function executes; however, instrumentation is required for anything you do further downstream — calling an S3 bucket, calling an API, calling a SQL server or a relational database. And in some cases, like Amazon MSK or Amazon MQ, X-Ray tracing is not supported — and it's not because we can't do it; those are open-source projects, and they don't know anything about X-Ray. EventBridge actually just recently — maybe a few weeks or a few months ago — added tracing support; however, you need to read the documentation and understand that it will propagate a trace only if the event came in through a put-events call, if you're using EventBridge to put the message and then route it somewhere else. Step Functions is supported too.

For Step Functions that call SNS, SQS, or Lambda, it's gonna work natively without you having to do anything; if Step Functions is interacting with Batch, Fargate, or ECS, you will have to do some instrumentation. So the point here is: look at the documentation and understand what's supported and what's not. Let's go back to this example. I don't actually have Fargate tasks — I have console apps I've written, so we'll pretend those are Fargate tasks. You'll just have to take my word for it: this is a .NET app, and it pretends to be a Fargate container. I have an Amazon MQ cluster stood up running RabbitMQ, and this app is going to put a "hello world" message on it. So I'm gonna just run this once, let it run, and I'll go show you in X-Ray what this looks like.

So if I go to X-Ray... just bear with me, it'll take a second to show up. Sorry, bear with me one second — I have my OpenTelemetry collector and I just have to rerun that. So let's try this one more time. All right, we'll go through here: I'm just executing, putting the thing on the queue. All right, something's gone wrong here, but I have a backup, so I can show you basically what this will look like.

So we have a trace. If that code were working, what you would basically see is a trace here with the publisher service. I have a publisher and a consumer service, with the RabbitMQ cluster in the middle. What you would see is this breakdown: the publisher does some work and then calls RabbitMQ, so we get that breakdown. If I go back to my code, on the other side of this I have another service pulling things off this queue, and it's also instrumented. If I go ahead and run this, when it runs and gets captured you're going to see a map that looks like this: there is no connection upstream to the RabbitMQ or to that producer service, because there was no context passed across that service boundary.

So you have to understand which services are gonna pass context automatically and which ones won't. And in the cases where they won't — because they're incapable, or they just don't know about X-Ray — you need to add that context to a header, pass it across the service boundary, and then resume that context. That's basically what I've done in my consumer app. If we scroll down here — sorry if you're not familiar with .NET and RabbitMQ — there's this BasicProperties object here. This is coming from the queue, from the messages I'm putting on RabbitMQ, and on the producer service I've injected my trace ID.

So I injected the trace context, and then I have a function here that extracts that trace context from the attribute I added to my message on the queue. If I change my code here to start a new span and use that parent context, what you end up with is a complete trace. I could choose to create a third circle in here that represents RabbitMQ, but the point is I have two services — we pretend they're Fargate tasks — one putting a thing onto a RabbitMQ queue, the other taking it off. And because I passed the trace context across that service boundary and resumed the context, I get a complete trace.
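
The speaker's demo is .NET; the sketch below shows the same pattern in Node.js with the amqplib client, treating the trace context purely as a string you inject on publish and extract on consume (queue name, broker URL, and header key are placeholders, and how you actually resume the span from the parsed Root/Parent depends on the tracing SDK you use):

```typescript
import * as amqp from "amqplib";

const TRACE_HEADER = "X-Amzn-Trace-Id";   // header key we choose to carry the context in

// Producer: inject the current trace context into the message headers.
async function publish(traceHeader: string): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");   // placeholder broker URL
  const ch = await conn.createChannel();
  await ch.assertQueue("orders");
  ch.sendToQueue("orders", Buffer.from(JSON.stringify({ msg: "hello world" })), {
    headers: { [TRACE_HEADER]: traceHeader },             // e.g. "Root=1-...;Parent=...;Sampled=1"
  });
  await ch.close();
  await conn.close();
}

// Consumer: extract the context and resume it before doing any traced work.
async function consume(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue("orders");
  await ch.consume("orders", (msg) => {
    if (!msg) return;
    const traceHeader = msg.properties.headers?.[TRACE_HEADER] as string | undefined;
    if (traceHeader) {
      const root = /Root=([^;]+)/.exec(traceHeader)?.[1];
      const parent = /Parent=([^;]+)/.exec(traceHeader)?.[1];
      // Start the consumer's span/segment with `root` as the trace ID and `parent`
      // as its parent (exact call depends on your X-Ray or OpenTelemetry SDK), so
      // both sides land on one trace instead of two dangling ones.
      console.log("resuming trace", root, "under parent", parent);
    }
    ch.ack(msg);
  });
}
```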

So I'm gonna go back to my deck — that was the demo. Just to wrap up the fourth best practice: instrument all your code with tracing, and understand which AWS services support tracing and how they support it. I wish there was an easy button, but please, you need to understand how our services work with tracing. If a service doesn't support tracing — like I just showed you with RabbitMQ, an open-source solution that has no concept of X-Ray — you need to pass that trace context across the service boundary, and then on the other end, downstream, resume the context. That will give you a complete trace, and that's going to help you avoid dangling traces.

So to wrap this up, we talked about four best practices. Navigate your instrumentation options: the summary takeaway is to start with OpenTelemetry for your instrumentation, choose something else for logs, and once logs goes GA within OpenTelemetry, use it for all three signals. Optimize the cost of high cardinality, which is a huge problem as you build modern apps and generate more data and more metrics: leverage both metrics and logs, and do not just send everything as a metric — that can get very expensive very quickly — find that balance. Reduce alarm fatigue: correlate your alarms on high-level metrics and use anomaly detection. And last but not least, avoid dangling traces: ensure trace propagation.

So if you look at the different services in your architecture diagram, understand how those traces are going to get propagated across the system. Just some takeaways: we have a really popular observability workshop that deploys a fully functioning app with a traffic generator in it — there is some cost to run that, so be careful — and we've translated it into five different languages; it gets hundreds of thousands of views and is very popular. I do a lot of my demos off of it. There's a link to the AWS Distro for OpenTelemetry GitHub page where you can learn more about OpenTelemetry, and we just published a new Skill Builder course for observability on our training and certification site. So with that, I wanna thank you. Thank you, Roland, for presenting with me. I appreciate everyone showing up here on a Thursday afternoon. Go enjoy re:Play. Thank you.
