Seamless observability with AWS Distro for OpenTelemetry

So today we're going to be talking about how to make your production systems more reliable. I presume many of you are here in this room because you care about operational excellence and you care about reliability. But let's be honest with ourselves for a second: how many of you in this room are on call right now? Some of you. Yeah, because no one else can cover, right? You're some of the more senior engineers at your company, and therefore you're the ones sent to re:Invent, and maybe the people who are not at re:Invent aren't able to cover for you. So you're sitting here, you're on call, and you're worried: what happens if the pager goes off at 2am? You're peacefully sleeping, and then da da, da, da, da, da, da da. That's the sound that fills every engineer's heart with terror, right? Especially at 2am.

Consider yourselves lucky, by the way, that I'm making that sound with my mouth and not playing the PagerDuty sound from my cell phone, because I know that scares everyone. What I hope for all of you is that you walk away from this talk with greater confidence in your own ability to solve problems, whether it's at 10am or at 2am, and that you can spread that level of expertise and knowledge across everyone on your team. So next re:Invent you can make a resolution not to be on call while you're here. That's what I want for all of you.

So we care about debugging as a critical requirement for running production services. It's not enough to architect and design services on paper if they don't actually hold up in production. We have to be able to smoothly roll out our software releases and ensure that our customers are having a high-quality user experience.

So when we're talking about measuring customer experience, we're not just talking about uptime monitors and things staying green or red, right? What we care about is actual customer behavior. Certainly that was the case a long time ago, before the public cloud: in the early 2000s I was working as a game developer, and we had the database server and we had the game server, and if both of them were pinging and accepting connections, the service was healthy and everything was good, right? And if they were down, all of our players were having a bad experience. That doesn't really work today, because you're running large, scaling fleets, of EC2, of EKS; you're running systems where it makes sense that sometimes, maybe because you're using Spot Instances, things are falling out of the fleet, and that's intended. Everything is working normally.

So you can no longer rely on static definitions of up and down. And ideally, we'd like to be more proactive than waiting for the VP of customer success or the CFO or the CEO to ring your phone asking what the heck is going on and why people are flooding the call center with complaints.

So how can we measure the properties of our systems before things get so bad that customers are complaining? And it gets even more complicated, because we have to think about debugging not just problems that we thought about in advance, but novel problems that we've never seen before. Because we're all conscientious engineers, right? None of us goes to work every day and says, I want to do a bad job today, I want to write a bug that's going to go into production and crash production. Not a single one of us wants that.

So by definition, if something goes wrong in production, it's something that we didn't anticipate in advance, which means that we need to be able to figure out what happened in production and resolve it for our customers without saying, "OK, Ms. CEO, I'll get back to you in two weeks once I reproduce the problem in staging," right? That excuse doesn't fly with our executives, and that excuse doesn't fly with our customers.

So, fundamentally, this capability to debug our systems revolves around forming and testing hypotheses about what might have caused the breakage and how we can revert it or get things back to normal as quickly as possible.

So, in a nutshell, this is what the practice of observability is about. Observability is the characteristic of our systems and our people, working together, being able to resolve novel problems without having to push new code to production, just using the telemetry data flowing out of our production systems, analyzing that data to figure out what's going on, and bringing things back to normal.

And it's not just me saying this: the AWS Well-Architected Framework agrees that operational excellence is a priority, as one of the Well-Architected pillars. In particular, this is what the Well-Architected operational excellence pillar says. It says that you must implement observability for your workload, that you have to identify what your key performance indicators are, and that you have to have telemetry data flowing from your systems so that you can meet those key performance indicators. It says we have to be able to understand not just the backend services but also the user experience, to do real user monitoring. And that last one there is especially critical: implement distributed tracing, so that you don't just have a metric that tells you when something is wrong, but you can trace requests through the complicated chain of multiple services that you may be running.

So observability, as I said earlier, is a socio-technical capability. It's not just a question of whether you have the data; it's what you do with it. So let's talk both about the outcomes that we can achieve with good observability and about how we get there.

So, the outcomes that we're trying to achieve: we're not just trying to have operational excellence and resilience. Ideally, we'd also be able to debug whether a release process is slow or flaky, to make sure that our tests are working correctly and that we're testing things both in staging and in production, and to write quality code the first time. We also have to make sure that we're getting insight into what our customers are doing, because after all, does it matter that you shipped a feature if no one used it? It's like having a tree fall in the forest and not make a sound. Finally, we need to get ahead of the game and think about how to resolve our technical debt. Do you have any circular dependencies in your systems? Do you have any single points of failure? And how might we identify them before they break? Those are some of the questions that we might want to answer with observability.

How do we get there? We get there by having high-quality instrumentation that produces telemetry data. We need to store that data in an economical fashion so that we can access it on demand. And most importantly, we have to be able to query it. You cannot be producing all this telemetry data and routing it to /dev/null; that does not actually help you debug your production systems and ensure high availability.

So when we're talking about these questions, let's get a little bit more concrete and tactical about what I might mean when I say that we should have observability. For instance, let's suppose that you get an alarm that says users are experiencing high latency and their queries are timing out because they're taking longer than 500 milliseconds. I might want to know not just how many of these requests there are, but where they're coming from. Who is worst impacted? Is that a causation or a correlation? Which services do they flow through? What versions of those services, and what clients are they using? We have to understand the big picture: what does the healthy population of requests look like, and what differentiates the slow requests from the fast ones?

So some of the data that I might rely upon to answer these questions might come in the form of a trace, a metric, or a log. How many of you in this audience use logs? I should see almost every hand up, right? We're all very familiar with logging. It's the first thing we're taught as developers, right? console.log, print to standard out. And it's really helpful for getting information, on a process-by-process basis, about what's happening inside the internal workings of that process.

The challenge and the trade-off is that logs can be very voluminous, hard to scan, and hard to index, and you have to know which log file and which machine to look at in the first place. So let's turn then to metrics. How many of you here rely on metrics? Great. So metrics are really great for that kind of aggregate, bird's-eye view of what's happening inside of my system. And usually people have started off with the kind of high-level infrastructure metrics, and user behavior metrics maybe measured from your Application Load Balancer. But there's another type of telemetry data that is really helpful as you get more complex and sophisticated, and as you start having microservices interacting with managed services like RDS, where you might want to understand what requests came in and what requests went out as a result of the request I initially received.

How many of you are currently using tracing today? That's more than I expected. That's really excellent. You're ahead of the game; I really appreciate that. I highlight traces because I actually think traces are how you can derive some of the other kinds of telemetry. It's not a matter of playing Pokémon and collecting all of the traces, logs, and metrics; instead, if you collect the data in a way that makes sense, you can produce traces or metrics or logs on demand.

Why do I say this? Well, a trace is a collection of spans, right? It's a set of spans that correspond to a given request ID that came in. So if you're propagating a request ID throughout your system so that you can follow the flow of a request and see which log lines contain that request ID, congratulations: you're now 70% of the way towards implementing tracing. And what does a trace span actually mean? A trace span is just a start time, a duration, and a set of key-value attributes that include, on a mandatory basis, the trace ID or request ID, as well as the answer to a simple question: who called you? Where did you come from? If you have all of those things, you have a hierarchy, and you can assemble it into a causal flow that demonstrates what caused what to happen. And you've got the ability to attach arbitrary metadata.

So you don't necessarily have to console log things anymore. If you have something that might be useful, you can attach it to the currently running trace span and then everything else flows from there because you can aggregate traces together to form metrics, you can view raw trace spans as logs. And this helps us understand our systems in a much less disjointed way than having separate tracing and metrics and logs.
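To make that concrete, here's a minimal sketch of creating one of those spans by hand using the OpenTelemetry Java API that we'll get to in a moment. The tracer name, span name, and attribute are placeholders, not anything prescribed by the spec:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class CheckoutHandler {                       // hypothetical service code
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("checkout-service");

  void chargeCard(String cartId) {
    // startSpan() records the start time; the trace ID and the parent
    // ("who called you") come from the current context.
    Span span = tracer.spanBuilder("charge-card").startSpan();
    try {
      span.setAttribute("app.cart_id", cartId);      // arbitrary key/value metadata
      // ... do the actual work ...
    } finally {
      span.end();                                    // end() fixes the duration
    }
  }
}
```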

So this is where OpenTelemetry comes in. The idea behind OpenTelemetry is to make it easy to generate these kinds of telemetry data and to commoditize it so that you can focus on the higher value parts of understanding what's going on.

So OpenTelemetry is a lot of syllables. So I'm just going to call it OTel for short, but you may see it referred to in either format. So OpenTelemetry is a combination of things. It's both an open standard as well as some software development kits and some additional code to help you adhere to that standard.

OpenTelemetry includes a data format specification, which is called the OpenTelemetry Protocol, or OTLP. We've got 11 language SDKs and more coming, covering most of the popular languages in use today. In addition, OpenTelemetry has integrations with most of the popular libraries in each of those languages. For instance, if you're using Node.js, you might want Express integration, and OpenTelemetry handles it out of the box.
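As a rough sketch of what wiring up one of those SDKs by hand looks like, here's a minimal Java example that registers a tracer provider exporting OTLP spans. The endpoint and service name are placeholders for wherever your collector listens and whatever you call your service; the auto-instrumentation we'll cover shortly can do this for you:

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class OtelSetup {
  public static OpenTelemetry init() {
    // Send spans over OTLP/gRPC to a collector; placeholder endpoint.
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317")
        .build();

    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setResource(Resource.getDefault().merge(Resource.create(
            Attributes.of(AttributeKey.stringKey("service.name"), "checkout-service"))))
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build();

    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .buildAndRegisterGlobal();
  }
}
```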

OpenTelemetry also contains a routing component called the OpenTelemetry Collector, which acts as a Swiss Army knife: it allows you to transmute non-OpenTelemetry data into OpenTelemetry data or vice versa, and it gives you choice in vendors and in how you process the data, so that you're not locked into one specific vendor or one specific SDK.

And speaking of broad backend support, pretty much every single player in the observability and APM space supports receiving OpenTelemetry data. That means you can start off by using X-Ray and CloudWatch, maybe then choose to use Datadog, and then maybe decide that you want to switch to something else, like Splunk or Honeycomb or Lightstep. OpenTelemetry allows you to do this by changing just a few lines of config in your OpenTelemetry Collector, rather than having to rip out and replace all of your instrumentation or your rusty old APM agent.

So OpenTelemetry has seen a lot of success over the past few years. In particular, in 2022 it became the second most popular CNCF project and the third most popular Linux Foundation project. And this year or next year, we're hoping to possibly eclipse Kubernetes in terms of velocity of contributions, because there are so many libraries to integrate with and so many really great feature requests that we've received to help people better understand their systems.

So OpenTelemetry originated as the fusion of two different open standards.

I know, there's that xkcd comic, right? You know, I've got 20 standards, I'm going to create the 21st standard that's going to obsolete the rest of them. But we actually did shut down and sunset OpenTracing and OpenCensus, the two predecessor projects. In 2018, we had the goal of merging these two projects together, and by 2021 we were able to reach general availability for the specification of the data formats and of the APIs. In 2022 we made the tracing and metrics SDKs generally available for the vast majority of languages. And this year, in 2023, we've made the logs specification generally available and ensured broad support across the per-language SDKs, so that people can use just one agent rather than multiple agents per signal and per vendor.

So in a nutshell, what OpenTelemetry solves is it solves the instrumentation piece and the data generation piece, but it's deliberately agnostic about how you actually store the data which enables innovation and competition.

So, hi, I'm Liz Fong-Jones and I'm an AWS Community Hero. I also happen to be the Field CTO at Honeycomb, which is a player in the observability space. But the capacity that I'm here in is as a Hero and as a contributor to OpenTelemetry. I previously held leadership roles in OTel, including being elected to the OpenTelemetry governance committee, and I was one of the first contributors to the Go SDK.

So today we've already covered what OTel is, why it matters, and how it helps you achieve observability. In the remainder of the session, we're going to talk about how to get started practically with OpenTelemetry on AWS. We're going to talk about how to route that telemetry so that you can actually understand your systems. And then, with any remaining time, we'll talk about some advanced techniques to help you get the most out of your data.

So let's talk about how to get started with OpenTelemetry on AWS. Yes, I mentioned a bunch of things about a specification and the OpenTelemetry protocol, but you don't have to worry about those details, because the AWS service teams working on observability have made it integrate with AWS services out of the box.

So the best way to get started with OpenTelemetry is to use the OpenTelemetry agents, which automatically instrument your applications. If your workload is using a supported language and is running on Kubernetes, all you have to do is install the OpenTelemetry Operator into your cluster. The OpenTelemetry Operator will detect which languages your workloads are using and will automatically add instrumentation to those workloads to measure, for instance, all the HTTP requests that are coming into your Node.js Express application.

The Operator will also install OpenTelemetry Collectors as a DaemonSet on each node, so that telemetry can be offloaded from your application as quickly as possible and then buffered for sending further upstream.
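As a hedged sketch of what driving the Operator looks like, you give it an Instrumentation resource describing where telemetry should go, and then opt workloads in (with the upstream Operator that's typically a per-language annotation on the pod template). The names and endpoint here are placeholders:

```yaml
# Instrumentation resource telling the Operator how to auto-instrument pods
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.observability:4317   # placeholder collector address
  propagators:
    - tracecontext
    - baggage
---
# On the workload's pod template, opt in per language, for example Node.js:
# metadata:
#   annotations:
#     instrumentation.opentelemetry.io/inject-nodejs: "true"
```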

So, in particular, there is a vendor-neutral open source Operator published by the OTel community, but your AWS account team is able to support you if you use the OpenTelemetry distribution published by AWS, which is called the AWS Distro for OpenTelemetry. And again, similar to OTel being short for OpenTelemetry, I'm just going to call the AWS Distro for OpenTelemetry "ADOT" for short. Less wordy.

So ADOT is a distribution of the base OpenTelemetry code that's customized for your AWS workloads. It supports a lot of handy things that you may encounter inside of AWS, and it routes out of the box to the AWS observability tooling like CloudWatch, Amazon Managed Prometheus, and X-Ray.

If you're using a language that's not supported out of the box for automatic instrumentation, maybe C++ or something like that, you might need to do a little bit of manual configuration, add the SDK, and compile it into your code. If you're not using Kubernetes, again, we can't go and install an Operator across your cluster; you might have to tell us where those applications are running, but then we'll take it from there and generate the instrumentation that you need.

One new feature as of about a month ago is that the ADOT distributions now support logging, which means that you don't have to rely on a separate logging agent to write data into CloudWatch.

So what does this look like in practice? Well, there's a little bit of manual assembly required. You might go to the EKS console, click on Add-ons, install the AWS Distro for OpenTelemetry add-on, and then there's a big block of JSON and YAML configuration.

And the good news is there is documentation, and the AWS teams are actively working to make this setup process better. But some of the things that you might want to tune are: what are my destinations? What jobs or processes should I collect from? And what signals have I turned on? Am I processing logs only? Am I processing metrics? Am I generating traces? Those are all things that you can set inside of the configuration.
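For a sense of what lives in that configuration block, here's a minimal, hedged sketch of a collector config that receives OTLP and fans traces out to X-Ray and metrics out to CloudWatch via the awsxray and awsemf exporters; exact options vary by ADOT version:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  awsxray:
    region: us-east-1        # traces to X-Ray; adjust region to taste
  awsemf:
    region: us-east-1        # metrics to CloudWatch as EMF

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [awsemf]
```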

By the way, I will make a note - click ops is for demos. In reality, for any production workload, you really should use Infrastructure as Code. It just turns out that pasting a bunch of Terraform and CloudFormation on a screen is not quite as interactive. So I'm showing you the console way but actually use Infrastructure as Code, please.

Great. So that covers your EKS clusters. That covers, you know, maybe even your EC2 workloads if you're manually configuring and starting up an OpenTelemetry collector. How do we deal with serverless workloads?

The answer is that there is an AWS Lambda layer that's built to export OpenTelemetry data: it runs an OpenTelemetry Collector that your function's telemetry gets sent to. You do still have to add the SDK manually, because a layer is not able to reach into your application code and interact with it.

So in your application code, you may need to add the OpenTelemetry SDK, install the resource detectors, and install the hooks that intercept the invoke call and automatically read all of the parameters out of it, like the request ID, API request properties, and so forth.

And once you wrap the AWS Lambda invoke, it will automatically record a start time and an end time, get the duration, get the name of the function that was called, and so forth. It'll generate that data so that you can understand how fast or slow your Lambda invocation is running and some of the properties of the caller.
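If your language's layer doesn't wrap the handler for you, a hand-rolled version of that wrapping with the plain OpenTelemetry API might look roughly like this; the class, tracer name, and attribute are illustrative, and the official instrumentation does more (such as extracting the incoming X-Ray trace header):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import java.util.Map;

public class OrderHandler implements RequestHandler<Map<String, Object>, String> {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("order-handler");   // illustrative name

  @Override
  public String handleRequest(Map<String, Object> event, Context context) {
    // One span per invoke: start time, duration, and the invocation's request ID
    // (attribute name taken from the FaaS semantic conventions).
    Span span = tracer.spanBuilder("handleRequest").startSpan();
    span.setAttribute("faas.invocation_id", context.getAwsRequestId());
    try {
      // ... business logic ...
      return "ok";
    } finally {
      // The collector in the Lambda layer buffers this and flushes it
      // after the invoke returns, so the caller isn't kept waiting.
      span.end();
    }
  }
}
```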

But the spans do still need to leave the process, and this is a situation where the AWS Lambda OpenTelemetry layer can help you. Why do we want a layer rather than writing directly from the Lambda? Well, the answer is that if you are running a standard invoke, you have to complete your processing before you return, because once you return, the function's execution environment is normally frozen, right? So there's no opportunity to offload the telemetry from the invocation that just ran.

But when you add a Lambda layer that contains the OpenTelemetry Collector, the telemetry data gets buffered and then flushed after you've returned from the invoke, so the user who called you is not waiting on the response. You're still billed, of course, for the time that the OpenTelemetry Collector is running, maybe 30 or 50 milliseconds afterwards. But it enables you to get back to your user as quickly as possible and let the telemetry data get flushed to your backend as quickly as possible afterwards.

It also handles the freeze and thaw cycle very nicely. And if you're using Java, you can actually get all of that in a one-stop-shop layer that will automatically add the telemetry. But in most cases, I would just recommend adding the SDK to your code, keeping that separation of concerns, and having the Lambda layer do the routing.

So why do people care about Lambda? Well, in our case at Honeycomb, we have a service that's called Retriever. This is my half golden retriever here on screen. And the thing about dogs, and about a retriever service, is that they really love to play with all the toys, but then they get bored and wander off. Do you really want to have all of those toys lying around on the floor of your house, or would you rather tidy them up when they're not in use?

So Lambda enables us to do massively parallel computation when we're analyzing telemetry data from our customers, and we dogfood things, by the way. Yes, the dog references are intentional. Our company was originally called Hound Technology Inc. So yes.

So our Retriever service likes to play with data, and we dogfood our own product to measure the performance of our query service. This is what that looks like: we have a very, very spiky workload that runs tens of thousands of Lambda processes in parallel in order to crunch data from S3 about telemetry people have previously sent us, and then it aggregates and returns the results, but only if someone actually requests it, which means that the data can sit idle for a while.

So if I'm interested in understanding what's happening inside of production, I might want to use data produced by OpenTelemetry running inside of AWS Lambda. That will enable me to get a count of, for instance, how many traces came through the Lambda service, and which specific parts took the longest, filtering only to the initial invoke request.

And then maybe from there, I might want to do something like get the median duration of the invoke call as measured from the Lambda itself, exporting the data through the OpenTelemetry Lambda layer and back into our dogfood environment.

So that's basically how we think about measuring services: by having them report their own status. And it's not just reporting the basics that you could get through CloudWatch metrics. I could also hypothetically break this down by customer ID, tenant ID, or complexity of request, and really drill in: if there's a problem, I want to be able to figure out which requests are causing a bad customer experience.

So that's how we kind of do the basic routing of the data. But how do we make the data useful? The answer to that is, well, no, you don't send it to /dev/null, you send it to somewhere a little bit more useful.

So OpenTelemetry, because it's vendor neutral, enables you to try a variety of different backends. Out of the box, ADOT supports writing to X-Ray, Amazon Managed Prometheus, and CloudWatch, which gives you basic signal recording and interpretation, and it's supported by Amazon. It works out of the box.

But the thing that X-Ray doesn't necessarily do is let you aggregate across spans. So you might want to use an APM like Lightstep, New Relic, Datadog, Splunk, and so on; there's a list of supporting vendors on the OpenTelemetry website and in the ADOT documentation. But the data from your application is only the start, right? We want to be able to see in a single pane of glass what's going on with your application and the underlying infrastructure.

So this is where it can be really helpful to get the data from CloudWatch about the performance of your RDS instances, about the performance of your ElastiCache, and about what's going on with your API Gateways or ALBs.

So in this specific case, you might want to use CloudWatch Metric Streams with Amazon Kinesis Data Firehose as the delivery mechanism, which supports writing the data in OpenTelemetry format. By the way, this also happens to be about two-thirds cheaper than issuing individual API calls to retrieve individual CloudWatch metrics, because it's just much more efficient to stream the data on a continuous basis.

So how do we set this up?

Well, the first thing that you want to do is a little bit hidden inside of the AWS console: you have to go to CloudWatch and click on Streams under Metrics, and that will allow you to configure a destination for that data.

So this is what it looks like. You'll go to Amazon Kinesis first and configure your Data Firehose, and it will give you a list of delivery streams; if you're just starting out, this will obviously not contain anything. You'll have to specify the HTTP endpoint, that is, which OpenTelemetry protocol destination you're going to write to. You'll want to specify compression to keep your costs down, and you'll want to specify that it should use OpenTelemetry format, which we'll get to in a second. I'm a little bit ahead of my demo.

There we go. Okay. So once you've created the Kinesis Firehose, you go over to the Metric Stream and connect the Metric Stream to that Kinesis Firehose that you created. Again, this is ClickOps; please don't do this, please actually use infrastructure as code, but I'm showing you how to do it in a visually appealing way.

So OTLP 0.7 is currently the protocol version that CloudWatch exports, and vendors currently support it; I'm working with the CloudWatch team to hopefully get them to update it. And you don't have to stream all of your metrics: if you want to keep the cost down, you can choose to stream a subset of metrics as you desire.

And then once you have that data, you can query it inside of your observability tool of choice. Again, we dogfood Honeycomb to measure Honeycomb's performance. So in this case, I really want to understand the RDS CPU utilization, and that enables me to correlate: hey, are these slow requests correlated with bad RDS performance, or with overloading our RDS? It also lets me set alarms. For instance, we know that if we get over 55% CPU saturation on our RDS instances, a cascading failure is coming, and we need to start shedding load or otherwise scaling up our RDS instance.

So that's how to get the basics of telemetry data flowing. All you have to do is specify the endpoint. Once you have the OTel Collectors and the Operator running inside of your cluster, you don't have to do any of this configuration if you're just writing to CloudWatch. But if you're using a third-party observability vendor, there may be API keys to add, and so on and so forth.
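In collector terms, that usually just means one more exporter entry; here's a small, hedged sketch with a placeholder vendor endpoint, header name, and environment variable:

```yaml
exporters:
  otlp/vendor:
    endpoint: api.example-observability.com:443   # placeholder vendor endpoint
    headers:
      x-api-key: ${env:VENDOR_API_KEY}            # placeholder header name and env var
```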

So that kind of covers the automatic instrumentation piece: getting the basic request/response data for every request coming into your service and going out of your service, so that you can understand what your critical path is. But we don't capture certain kinds of contextual data, because we don't know what is and isn't important to you, and we don't know what might be a security risk.

So for instance, the integration for Node.js and Express is never going to capture the entire HTTP POST body and shove it into CloudWatch, right? That would be bad. You would be uploading password hashes to CloudWatch, don't do that.

So instead, our answer to adding instrumentation on fields like user ID, cart ID, cart amount, and so forth is to make it as easy as adding a log statement, but contextualized to your trace span. As you're processing the data that you're getting out of the request and assigning it to a variable, why not just attach a key-value pair that corresponds to that data, so that you have it for analysis later?

And additionally, it's not just limited to adding attributes; you can also create custom trace spans. We know that any request that crosses the network is something that's important to measure, right? If you have an HTTP request coming in, that's important; you want to measure how long it takes. If you have an RDS database call going out, of course you want to measure how long that call takes.

But you might have particular parts of your workload that take longer than others, which you might want to wrap a trace span around, because the thing that no one really wants is to be staring at a long span that seems to be doing nothing, and then all of a sudden it issues a database request and finishes, right?

So you might want to wrap functions that are expensive so that you know more granularly what's going on inside of your distributed traces and services.

So this is what it looks like in Java, for instance, to add a string payload to your currently running trace span. If I'm operating an e-commerce site and I'm trying to measure what people are searching for, I might choose to get the text of the user's search query and shove it into an attribute called app.search_text.

And as you can see, it's literally just like a console log, except it's much nicer on the telemetry analysis side because it's structured: you have that key name to search by, rather than having to search the entire string payload of the log.

Similarly, if you're trying to record numbers: in most languages there are fully generic, typed methods for setting attributes, but there might be something in your specific SDK that specifies whether you want to set a number versus a string, and so forth.

So in Java, again, it's just an attribute like app.search_results or app.search_result_count, which you might set to zero.
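Roughly, both the string and the numeric cases look like this with the OpenTelemetry Java API, attaching the attributes to whatever span is currently active; the method and attribute names here are just the ones from the example above:

```java
import io.opentelemetry.api.trace.Span;

class SearchInstrumentation {
  static void recordSearch(String searchText, long resultCount) {
    // Attach key/value pairs to the span that is currently active for this request.
    Span current = Span.current();
    current.setAttribute("app.search_text", searchText);          // string attribute
    current.setAttribute("app.search_result_count", resultCount); // numeric attribute
  }
}
```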

And as far as wrapping functions, again, this depends on the language. In Python and Java, for instance, it's just a decorator or annotation on your function that says, hey, by the way, treat this call as a span. Ideally you name it the same as your function, but some people prefer not to do that for some reason.

But it allows you to get more granular instrumentation, function by function. And if you need instrumentation for every function, we are working in OpenTelemetry on a profiling signal, but that is far, far from GA yet; we're still working through some of the details.
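In Java, that annotation is @WithSpan; it ships in the instrumentation annotations artifact (the package has moved between releases, so check your version) and is picked up by the auto-instrumentation agent. A hedged sketch:

```java
import io.opentelemetry.instrumentation.annotations.WithSpan;

class PricingService {                        // illustrative class
  // The agent wraps this method in a span; by default the span is named
  // after the class and method, but you can pass an explicit name.
  @WithSpan("calculate-cart-total")
  long calculateCartTotal(String cartId) {
    // ... expensive work worth seeing in the trace waterfall ...
    return 0L;
  }
}
```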

And in terms of wrapping individual code that's not a stand-alone function, you can use a scope inside of Java to wrap specific critical sections of a function. You might also want to set the status code to indicate whether it's a success or a failure.

Because again, we can't infer that from the HTTP status code if it's not an HTTP request.
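Here's a sketch of wrapping a critical section with a scope and recording success or failure explicitly; status and exception recording are part of the standard Span API, while the class and span names are placeholders:

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class BatchJob {
  private static final Tracer tracer = GlobalOpenTelemetry.getTracer("batch-job");

  void processChunk() {
    Span span = tracer.spanBuilder("process-chunk").startSpan();
    // makeCurrent() puts the span in scope so nested work parents onto it.
    try (Scope scope = span.makeCurrent()) {
      // ... the critical section ...
      span.setStatus(StatusCode.OK);
    } catch (RuntimeException e) {
      span.recordException(e);
      span.setStatus(StatusCode.ERROR, "chunk processing failed");
      throw e;
    } finally {
      span.end();
    }
  }
}
```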

So what does the glorious output of this look like? Well, what you wind up with is something that looks like a trace waterfall, right? It has a causal chain of who called who called who, and it tells you where the time was spent.

You know, maybe you're doing a bunch of parallel processing and then you're waiting and blocking until all of the parallel computation finishes. And over here on the right-hand side of the screen, what you can see is a list of attributes, right? That's essentially the structured log that I was talking about; you could think of each of these key-value pairs being output to a log as JSON.

But in this case, we're choosing to be able to slice and dice and analyze them dimension by dimension as well as to see all of the dimensions associated with one given trace span.

So adding custom telemetry does have some challenges at larger scale, if your organization has more than 10 engineers, right? Imagine that you have a situation where you have app.cart_id, app.cartId in camelCase, and app.cart_ID with both the I and the D capitalized, right? That's not a situation that you really want to get into, where you have multiple different fields for the same thing, or where people are disagreeing about whether it's a string or whether it's a number, right?

So semantic conventions are really helpful for resolving this problem, because they allow you to have commonality in how you query across the data regardless of team. And this isn't just an organization-specific problem; it's a cloud problem as well, right?

There are different kinds of instances, there are different kinds of runtimes, right? And you might want to understand the performance per runtime. So for instance, if we're talking about Lambda, the OpenTelemetry AWS Lambda resource detector follows the semantic conventions, and it specifically says that cloud.provider will be aws and cloud.region is us-east-1 or us-west-2, right?

So this allows you to make sure that you are having common patterns in how you name these fields so that you can hop from team to team or from organization to organization and have that familiar querying experience.

So you want that for your organization as well. As you start adding more fields and instrumenting more services, you'll want to make sure that you're writing down, in a wiki or somewhere, what you're naming things, how you're naming them, what the types of those fields are, and so forth.
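One lightweight way to back up that wiki page is a small shared class of typed attribute keys, so nobody reinvents app.cart_ID; this is just an illustrative pattern, not an official OpenTelemetry feature:

```java
import io.opentelemetry.api.common.AttributeKey;

// A single place where the organization agrees on names and types, mirroring
// how OpenTelemetry's own semantic conventions define cloud.provider, cloud.region, etc.
public final class AppAttributes {
  public static final AttributeKey<String> CART_ID =
      AttributeKey.stringKey("app.cart_id");
  public static final AttributeKey<String> SEARCH_TEXT =
      AttributeKey.stringKey("app.search_text");
  public static final AttributeKey<Long> SEARCH_RESULT_COUNT =
      AttributeKey.longKey("app.search_result_count");

  private AppAttributes() {}
}

// Usage: Span.current().setAttribute(AppAttributes.CART_ID, cartId);
```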

Another thing that has bitten people before is that the implementation of OpenTelemetry is not completely uniform throughout AWS because it is an organization that allows teams a high degree of autonomy.

So for instance, if you are routing requests directly from API Gateway to your Lambdas, API Gateway only supports emitting tracing headers in X-Ray format, and OpenTelemetry primarily works with the W3C Trace Context format. Basically, it's a method of encoding: what's the trace ID, what's the parent span ID, and what additional baggage metadata am I passing?

And because X-Ray predates OpenTelemetry, you might need to have something installed or configured in your OpenTelemetry SDK that converts the X-Ray header format into the header format that OpenTelemetry uses.

For now, this is a thing that ADOT makes easier, which you wouldn't necessarily get out of the box with the completely open source OpenTelemetry: having X-Ray trace header propagation turned on, so that you can seamlessly feed in both the logs from your API Gateway and the traces generated from your Lambda, and have those combined together to show end-to-end trace spans.
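With the plain Java SDK, that looks roughly like registering the X-Ray propagator from the OpenTelemetry contrib repository alongside W3C Trace Context; the exact package name should be checked against your version, and ADOT (or the agent's OTEL_PROPAGATORS environment variable) can wire this up for you:

```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.context.propagation.TextMapPropagator;
// From the opentelemetry-aws-xray-propagator contrib artifact; verify the package for your version.
import io.opentelemetry.contrib.awsxray.propagator.AwsXrayPropagator;
import io.opentelemetry.sdk.OpenTelemetrySdk;

public final class PropagatorSetup {
  public static OpenTelemetrySdk init() {
    // Accept and emit both W3C traceparent headers and X-Ray trace headers,
    // so API Gateway (X-Ray format) and OTel-instrumented services join up.
    return OpenTelemetrySdk.builder()
        .setPropagators(ContextPropagators.create(
            TextMapPropagator.composite(
                W3CTraceContextPropagator.getInstance(),
                AwsXrayPropagator.getInstance())))
        .build();
  }
}
```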

Another really great thing you can do with OpenTelemetry: I've mentioned vendor portability and being able to switch between vendors, but what if you didn't have to switch? What if you wanted to use multiple vendors in parallel? Maybe you're doing a proof of concept. Or what if you're in a situation where you want to retain your telemetry data for fast debugging, with query times of less than 10 or 30 seconds,

but your security team is insisting that you have to keep your data for up to 60 days or 120 days or maybe two years? They maybe don't necessarily care how long it takes to query that data, right? It might be OK for it to take five or ten minutes to query a year of data to figure out, you know, did the bad actors get into one particular system?

So because OpenTelemetry is flexible and has an S3 exporter, you can actually use a common data feed from OpenTelemetry to drive both your observability and your security posture. You can write the data to S3, index it with Glue schemas and Athena, and use that for longer-term retention where the query time matters less, and simultaneously send the data to CloudWatch or to Honeycomb or Lightstep or another vendor for your observability and real-time debugging.
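A hedged sketch of that fan-out in collector configuration, using the contrib awss3 exporter alongside an OTLP exporter to your observability backend; the bucket, endpoint, and header are placeholders, and the exporter's option names may differ slightly between versions:

```yaml
exporters:
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: my-telemetry-archive     # placeholder bucket for long-term retention
      s3_prefix: otel
  otlp/backend:
    endpoint: api.example-observability.com:443   # placeholder vendor endpoint
    headers:
      x-api-key: ${env:BACKEND_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [awss3, otlp/backend]    # one data feed, two destinations
```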

So there's a blog post on screen now; you can scan the QR code, or I'll share the slides afterwards and share some resources on my LinkedIn.

So this is kind of a really cool application of the vendor portability.

Another really great thing is that you can add OpenTelemetry, not just to your applications but also to your setup code. Maybe for instance, you're using Chef or Terraform to manage your cloud environment or to manage your individual instances. And there are some really great integrations to write OpenTelemetry trace spans out of your Chef bootstraps.

So you can figure out why your Chef synchronization is taking so long and which recipes are taking so long. And there's a tool written by Equinix called otel-cli that enables you to wrap any shell command in a trace span from the command line, which I think is really cool.

And you can also make your CI debugging easier: figure out why your tests are flaky, why they're taking so long, and what the critical path is, because many common CI tools, for instance CircleCI, support OpenTelemetry measurement of your jobs and of the individual work happening inside of those jobs.

So today we've talked about these four things:

  1. Why you should care about observability and operational excellence and how OpenTelemetry helps.

  2. We've talked about how to install the OpenTelemetry agents via ADOT or via the open source Operator.

  3. And then we talked about how to get value out by routing the data to somewhere that you can analyze it and what kinds of queries you might want to run.

  4. Finally, we've talked about ways of annotating your data further and adding custom attributes and spans in order to help you better understand what's happening inside of your production systems.

So my encouragement to you is to make sure that you're able to get a peaceful night's sleep, not just tonight, not just tomorrow night, not the week after this when you get back from re:Invent, but next re:Invent too. Like let's make sure that you're able to have a quiet night of sleep on call and that the people who are on call can have a calm time debugging incidents in production rather than panicking.

"Oh my god, this incident happened and I'm not going to be able to go back to sleep for two hours because I don't have the data I need to analyze what's going on."

So get a more peaceful night of sleep by introducing OpenTelemetry and observability into your organization.
