Operating with AWS open source observability

All right, good morning, everyone. I'm excited to be here.

Customer interest in using open source software for observability continues to grow. Customers love open source because of the flexibility it offers for customization, the ability to integrate with other open source software they already love and operate in their environments, and the fact that they can stay vendor neutral and live at the edge of the innovation happening in the open source world.

But at the same time, when we talk to customers, they always express concerns about how difficult it is to operate the underlying infrastructure that supports the open source software. It is extremely difficult to operate a highly available, resilient infrastructure that is also cost efficient.

That is where AWS managed open source observability services can come into play.

My name is Imaya Kumar Jagannathan. I'm a Principal Solutions Architect at AWS. In the first part of the session, I'm going to talk about some real-world use cases: we'll discuss customers that faced certain observability challenges, the decisions they made, and how they chose open source software to solve their problems.

Then Gustavo Franco from VTEX will talk about VTEX's journey into using AWS managed open source services to solve their observability needs. After that, Mark will give some insights into how we operate these services at large scale, along with some new features we launched recently.

Before we move forward, I want to set the stage and make sure everybody is on the same page about what observability is. The term originates from control theory: a system is said to be observable when you can infer its internal state by looking at the external signals and symptoms it produces. In software engineering, you can relate this to being able to identify the root cause of an issue you're troubleshooting in an application by looking at the signals you can observe externally.

Then you might ask: what exactly is monitoring? Monitoring is the overall set of actions you perform, whereas observability is about building systems so you can identify and correlate the signals captured from the application and troubleshoot quickly.

The signals we're talking about are primarily metrics, logs, and traces. Metrics are the best way to get a quick view of what is happening in your environment. They help you visualize a particular application parameter, such as CPU utilization or memory utilization, get quick insight into what is happening, alert on it, and so on.

You can compare situations, identify patterns, do anomaly detection, and so on, and it's all very cost efficient. You go to logs when you want to dive deep into a particular context of the application, do more analysis, and find the root cause; that's mostly done after you get an alert, and alerts most often sit on top of metrics.
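To make the metrics signal concrete, here is a minimal sketch using the open source prometheus_client Python library; the metric names, labels, and port are illustrative assumptions, not anything a specific customer uses. A Prometheus server, or an ADOT collector with a Prometheus receiver, could then scrape the exposed endpoint.

```python
# Minimal sketch: exposing request metrics for Prometheus-style scraping.
# Metric names, labels, and the port are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "app_requests_total", "Total requests handled", ["route", "status"]
)
LATENCY = Histogram(
    "app_request_duration_seconds", "Request latency in seconds", ["route"]
)

def handle_request(route: str) -> None:
    start = time.time()
    time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    status = "200" if random.random() > 0.05 else "500"
    REQUESTS.labels(route=route, status=status).inc()
    LATENCY.labels(route=route).observe(time.time() - start)

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_request("/checkout")
```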

With microservice architectures becoming so common, it is natural for customers to look at how to capture signals across service and process boundaries. That's where traces come into play. Traces help you understand service-to-service interactions at the application layer and give you information about service latency, errors, faults, exceptions, et cetera. And by looking at a service map, which is essentially a real-time representation of how your microservice architecture is performing, you can very easily find which particular service is causing a specific problem, so you only need to dive into that service to find the root cause and fix the issue.

At AWS, we believe in giving customers options to pick and choose the services they want, so they can build the architecture that works best for them. On the left-hand side, you see AWS native observability services: CloudWatch and X-Ray primarily, with Container Insights, Lambda Insights, Contributor Insights, and Application Insights.

You can choose purpose-built features for specific use cases. For example, with CloudWatch Synthetics you can build canaries that test your applications from the customer's point of view and find faults, issues, and errors in your websites and APIs before your customers tell you there is a problem. CloudWatch RUM, real user monitoring, captures information from customers actually using your application: page load issues, navigation issues, or any other user experience problems. You can use that information to inform your user experience designers, your service and API designers, and so on. On the right-hand side, you see open source managed services from AWS: Amazon Managed Grafana, Amazon Managed Service for Prometheus, and Amazon OpenSearch Service.

It's really not an either/or choice. Customers can pick and choose any combination they want. We have customers that use both AWS native services and open source based services and build the architecture that works for them. You can collect data from any environment and send it to any of these services using any of these agents: the CloudWatch Agent, Fluentd, Data Prepper, the AWS Distro for OpenTelemetry (ADOT) Collector, and so on. It doesn't matter where you're operating, either: it could be AWS, on premises, or your laptop. You can collect data from anywhere into AWS services.

First, let's talk about AWS Distro for OpenTelemetry. OpenTelemetry itself is very exciting for customers because of the breadth of tooling it provides. It can support any source and any destination: you can collect data from anywhere and send it wherever you want. You can also process the signals in memory, meaning you can add additional context, redact information you don't need, aggregate, correlate, and pass on additional metadata to enrich the signals. That helps you capture really meaningful information, which helps you troubleshoot later when you're working with the data itself.
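This enrichment and redaction normally happens inside the OpenTelemetry Collector's processors, but the same idea can be sketched with the OpenTelemetry Python SDK: a custom span processor that adds context to every span before export. The attribute names and service name below are assumptions for illustration only.

```python
# Sketch only: enrich spans before export, mirroring what a collector
# processor would do in the middle of a telemetry pipeline.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

class EnrichSpans(SpanProcessor):
    def on_start(self, span, parent_context=None):
        # Add extra deployment context to every span while it is recording.
        span.set_attribute("deployment.environment", "production")

    def on_end(self, span):
        # Redacting or dropping attributes at this stage is typically done in
        # the collector; shown here only to mark where it would happen.
        pass

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(EnrichSpans())
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

with trace.get_tracer(__name__).start_as_current_span("charge-card"):
    pass  # application work would go here
```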

All of that is open source, and all of that can happen in a single binary called the OpenTelemetry Collector, which makes it even more exciting for customers. The AWS Distro for OpenTelemetry (ADOT) is essentially a redistribution of the upstream OpenTelemetry project. Every single line of code we write goes to the upstream CNCF project. What we do is pick some of the important receivers, exporters, processors, and extensions — the components of the OpenTelemetry Collector — put them through our AWS review process to make sure they do what they are supposed to do, and redistribute them to customers.

All of the components we pick are based on direct feedback we get from customers. And when customers use the ADOT collector, they also get AWS support, so they have the peace of mind that they're using an enterprise-grade product and someone to rely on when something goes wrong. We also support the Java, JavaScript, Python, and .NET languages.

Let's talk quickly about Amazon Managed Service for Prometheus. Prometheus is a very popular metric monitoring solution; everybody knows about it, and people love PromQL because of the flexibility it offers and how easily it deals with high-cardinality metrics. However, it is also common knowledge that it is extremely hard to operate a highly available Prometheus environment and to provide high availability, resiliency, and long-term storage. It becomes very challenging at scale.

That's where Amazon Managed Service for Prometheus comes into play. This is a fully serverless solution where customers don't have to worry about capacity or availability, and they only pay for what they use. It doesn't matter where your metrics originate: AWS, on premises, or another cloud service provider. You can consolidate all the metrics into one specific place called a workspace and query all of it there, which makes it very easy for customers.
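As a rough sketch of how little there is to stand up, this is what creating and inspecting a workspace can look like with boto3; the alias, tags, and region are placeholders.

```python
import boto3

amp = boto3.client("amp", region_name="us-east-1")

# Create a workspace: a logical, isolated space for metrics, rules, and alerts.
ws = amp.create_workspace(alias="prod-metrics", tags={"team": "platform"})
print(ws["workspaceId"], ws["arn"])

# The remote-write and query endpoints hang off the workspace once it's active.
detail = amp.describe_workspace(workspaceId=ws["workspaceId"])
print(detail["workspace"]["status"]["statusCode"])
print(detail["workspace"]["prometheusEndpoint"])
```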

Then you have Amazon OpenSearch Service. OpenSearch is a fully open source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analytics. Amazon OpenSearch Service is a managed service that makes it easy for you to run the OpenSearch software on AWS. The service provisions all the resources your OpenSearch clusters need and, at the same time, identifies faulty nodes in your cluster and replaces them automatically. That's why it's a managed open source service.
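For context, provisioning a domain is a single API call. Here is a hedged boto3 sketch; the engine version, instance type, counts, and volume size are placeholder choices, not sizing recommendations.

```python
import boto3

opensearch = boto3.client("opensearch", region_name="us-east-1")

# The service provisions the nodes and replaces faulty ones automatically;
# we only describe the shape of the cluster we want.
response = opensearch.create_domain(
    DomainName="log-analytics",
    EngineVersion="OpenSearch_2.11",
    ClusterConfig={
        "InstanceType": "r6g.large.search",
        "InstanceCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
)
print(response["DomainStatus"]["DomainName"], response["DomainStatus"]["Created"])
```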

Then we have Amazon Managed Grafana, which is essentially the visualization layer of this open source stack. It is a fully managed Grafana environment, so again you don't have to worry about infrastructure. You get all the goodness of Grafana while getting the benefits of a managed service. It comes with enterprise-grade plugins, gives you AWS data sources as well as third-party data sources, and you don't have to worry about the underlying infrastructure or availability because we take care of that.
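A minimal boto3 sketch of standing up a workspace; the authentication provider, permission type, and data source list are assumptions chosen for the example.

```python
import boto3

grafana = boto3.client("grafana", region_name="us-east-1")

# A fully managed Grafana workspace; AWS runs the servers, upgrades, and HA.
workspace = grafana.create_workspace(
    workspaceName="observability",
    accountAccessType="CURRENT_ACCOUNT",
    authenticationProviders=["AWS_SSO"],  # or ["SAML"], depending on your setup
    permissionType="SERVICE_MANAGED",
    workspaceDataSources=["PROMETHEUS", "CLOUDWATCH", "XRAY"],
)
print(workspace["workspace"]["id"], workspace["workspace"]["status"])
```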

All right, so now let's dive into the specific use cases I've been talking about. The first one is about a customer with modern workloads. But before that: as part of my role at AWS, I talk to hundreds of customers every year. It's a privilege to work so closely with these customers, discuss their problems, ideate about different solutions and services, and help them design the architecture that works best for them.

This particular customer, which we've been working with for a while, is currently migrating their workloads to AWS. They are moving away from SaaS-based observability software to open source observability solutions. This customer is Fidelity.

Fidelity is a privately owned financial services company that offers products and services for personal investing, brokerage, 401(k)s, institutional investing, and asset management. They have about 6,500 applications running on cloud, they operate about 1,400 AWS accounts today, and 60% of the applications in their entire portfolio run on cloud.

Fidelity is still in the process of adopting a fully open source managed observability stack. They are focused on reducing their operational overhead by using managed open source observability services where they can. They want to build a platform that is vendor neutral and that leverages integrations with the rest of the Fidelity software and infrastructure that exists today.

This is a simplified, high-level representation of Fidelity's observability architecture. Fidelity still has workloads running in on-premises environments.

They have a small portion of workloads running on a third-party cloud service provider, but a substantial presence on AWS. While they have a robust pipeline currently being built to handle different use cases, today they operate their tier-zero and tier-one applications on it.

Those are designed for regional failover availability. Tier-zero and tier-one applications are mission-critical applications that must stay available even if there is an issue in one of the regions they operate in, and they want their observability architecture to be designed the same way.

So today they use the OpenTelemetry Collector to collect metrics from their environments in a particular region and send them to at least one other region as well, at least for tier-zero and tier-one applications. That gives them high availability.

As I mentioned, they're still in the process of rolling this out to production, but they were able to do simulation testing in a pre-production environment and found a lot of benefits. They were able to easily scale for peak telemetry traffic, they achieved great flexibility in designing their signal collection pipelines, and they set up a GitOps process that helped them automate the entire pipeline and deployment.

They were also able to create design patterns for their observability pipelines for different teams, based on application criticality and each team's needs, and this helps them onboard their applications onto the pipeline very easily.

Next, let's look at another customer who had traditional workloads and wanted to move away from self-hosted legacy software. They also had the requirement to monitor workloads in a hybrid environment. This customer is Phillips 66, a global energy company with nearly 150 years of experience in the energy industry, and they depend on thousands of applications to run their businesses around the world. The applications run in a variety of environments: on AWS, on a third-party cloud service provider (a smaller portion), and on a lot of on-premises infrastructure as well.

They were using a legacy monitoring solution that did not scale, was not highly available, and was very difficult to operate. At the same time, they are currently doing a lift-and-shift of their applications to AWS, which as you can imagine is mainly EC2 based. Their goals were to build a fully open source based observability stack and to avoid managing the infrastructure themselves. They are also very cautious about designing an architecture that doesn't leave them, five years from now, in the same situation they are in today.

Their primary goal, like I mentioned, is to collect metrics today. Because they're doing a lift-and-shift, the applications are not fully modernized, so it is easiest to collect metrics and logs. They used the ADOT collector on the virtual machines on premises and on the third-party cloud provider, and on EC2 they used the EC2 service discovery mechanism to discover targets very easily.

They deployed node exporters everywhere they could, enriched the metrics being collected, and added external labels to those metrics to easily identify the environment the metrics are coming from. With that, they were able to bring all of these environments — on premises, third-party cloud service providers, and AWS — into one Grafana dashboard, and they can very quickly filter and see what is happening at any given time.
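As a rough sketch of that pattern, here is a Prometheus-style configuration with EC2 service discovery, external labels identifying the source environment, and remote write to a managed workspace, expressed as a Python dict for illustration. Job names, labels, and the workspace URL are placeholders, and the same ideas map onto the Prometheus receiver and remote-write exporter in the ADOT collector.

```python
# Sketch only: scrape node exporters discovered via EC2 service discovery,
# attach external labels identifying the environment, and remote-write the
# result to an Amazon Managed Service for Prometheus workspace (placeholder).
import yaml

config = {
    "global": {
        "scrape_interval": "30s",
        "external_labels": {"environment": "on-prem-dc1", "provider": "self-hosted"},
    },
    "scrape_configs": [
        {
            "job_name": "node-exporter-ec2",
            "ec2_sd_configs": [{"region": "us-east-1", "port": 9100}],
            "relabel_configs": [
                {"source_labels": ["__meta_ec2_tag_Name"], "target_label": "instance_name"}
            ],
        }
    ],
    "remote_write": [
        {
            "url": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write",
            "sigv4": {"region": "us-east-1"},
        }
    ],
}

print(yaml.safe_dump(config, sort_keys=False))
```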

There are about 200 users, I believe at last count, using this Grafana dashboard today across all of these environments. As a result, they achieved their goal of building a single platform: they have one place to go, no matter where the metrics originate or where the workload is running. They saw a 30% reduction in mean time to resolution as a result of deploying this architecture. But the thing that stands out the most is what the customer said on a phone call about two weeks ago: "My teams come back and tell me that they are able to sleep better." That is such an important statement; when you have teams that are able to rest well, it means they are productive in their work environment.

It also means they can focus most of their time on mission-critical, business-critical operations rather than on operating the underlying infrastructure for observability tooling. They also have future plans to onboard SAP workloads running on AWS into the same architecture, and to monitor Rancher clusters running on bare metal and on virtual machines on premises as well.

The next use case is a very interesting one. This customer has modern use cases: they have tracing enabled in all their environments, and their goal was not only to collect Prometheus metrics but also to extract metrics out of the trace data they're collecting and alert on that. This customer is Choice Hotels. Choice Hotels is one of the largest lodging franchises in the world, a challenger in the upscale segment and a leader in midscale and extended stay.

They have more than 7,400 hotels all over the world with 625,000 rooms, operating in 45 countries and territories with a diverse portfolio of 22 brands. They have been an AWS customer since 2014, they are all in on AWS, and they set a goal of migrating all of their applications to AWS by 2024, in a 10-year time frame. They were using a third-party monitoring solution that did not scale well and did not fit their current requirement of collecting RED metrics — request rate, errors, and duration — out of these traces.

Again, they wanted to build an open source based solution, and at the same time they did not want to end up with additional infrastructure to manage. So they chose Amazon Managed Service for Prometheus and Amazon OpenSearch Service for their use cases. This is what their architecture looks like: they collect metrics by and large from EKS and EC2 instances, and they use the OpenTelemetry Lambda layer we offer to collect traces from Lambda.

What they did is create a gateway pipeline in a separate AWS account, so they have separation of concerns that way. There are multiple ways to solve this problem; this is how they decided to solve it. They used the upstream OTel gateway collector because they wanted a specific component of the OpenTelemetry Collector that is not available in the ADOT collector today, called the span metrics connector.

The span metrics connector — or connectors in general, in other words — are a new kind of OpenTelemetry component that acts as both a receiver and an exporter. A connector lets you take data from one pipeline (for example, spans from a traces pipeline), derive metrics from it, and feed those metrics into another pipeline in the same collector, which can then send them to a destination. That is what they use to extract metrics from their span, or trace, data.
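Here is a sketch of how a connector is wired into a collector pipeline, written as a Python dict for illustration; the component names follow the upstream spanmetrics connector, and the endpoints are placeholders.

```python
# Sketch: the spanmetrics connector is the "exporter" of the traces pipeline
# and the "receiver" of the metrics pipeline, turning span data into RED metrics.
import yaml

collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}}}},
    "connectors": {"spanmetrics": {}},
    "exporters": {
        "otlp/traces-backend": {"endpoint": "traces.example.internal:4317"},
        "prometheusremotewrite": {
            "endpoint": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write"
        },
    },
    "service": {
        "pipelines": {
            "traces": {"receivers": ["otlp"], "exporters": ["spanmetrics", "otlp/traces-backend"]},
            "metrics": {"receivers": ["spanmetrics"], "exporters": ["prometheusremotewrite"]},
        }
    },
}

print(yaml.safe_dump(collector_config, sort_keys=False))
```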

Today they use OpenSearch for logs. They still use a self-managed backend for their traces, but as you know these things continue to evolve, and they are evaluating the long-term fit of that backend for trace storage and querying.

As a result of deploying this architecture, they are able to collect 200 million active time series of Prometheus metrics from their environment, across 5,000 EC2 instances and 537 applications. They also found they could do all of this while saving 40% on infrastructure, which basically checks all the boxes they set out to achieve in the beginning.

The fourth and final use case is about a customer who wanted to migrate from self-hosted Prometheus to a managed Prometheus environment, simply because they were tired of operating a large Prometheus environment themselves. This is the story of Northwestern Mutual.

Northwestern Mutual is the largest direct provider of individual life insurance in the United States. Established in 1857, they are a Fortune 100 company with more than 4.9 million customers today. They have been an AWS customer since 2014 and completed their migration to EKS in 2021. Their main goal is to containerize everything and run it on EKS.

They have hundreds of AWS accounts and thousands of developers using more than 80 AWS services today. With all of that infrastructure in place, Northwestern Mutual found it inefficient to operate a large-scale Prometheus environment themselves, so they decided to choose a service that provides high availability, long-term storage, and resiliency that they can rely on.

This is the architecture they chose. It's obviously very simplified, and the reason is that I want to show how easy it is to replace your existing complex architecture with something very simple. All they did, primarily, was deploy the ADOT Operator add-on that is available in EKS. I mentioned that most of their applications are container based, so this is mainly an EKS use case.

The Operator add-on manages the lifecycle of the OpenTelemetry collector for you: you can tell the Operator to deploy the collector as a deployment, a daemon set, or a sidecar, and it manages everything for you. As a result, they achieved tremendous benefits.

They found 50% savings in infrastructure costs and were able to save half of one FTE's time. They found that their metrics availability increased significantly during firefights. And, just as important, they saw a 35% reduction in pages to their teams, which translates to better work-life balance overall.

Their application teams were also able to build pipelines so applications can be automatically onboarded into this observability architecture. Their MTTR, or mean time to resolve, dropped from 1 to 2 hours to just 10 to 30 minutes, roughly a 75% or greater reduction in most scenarios.

So customers using open source services achieve tremendous benefits. In summary, these services enable customers to attain higher efficiency, operate at very large scale without worrying about infrastructure maintenance, and give them the freedom and ability to integrate their solutions with other systems.

All right. Having said that, I'm going to invite Gustavo to talk about VTEX's journey into using AWS managed open source observability services.

You might be wondering: what is VTEX?

VTEX is the enterprise digital commerce platform. I like to explain to folks that we have the auto-update feature. We are a multi-tenant, composable SaaS platform where customers from all over the globe work with us to unify their order flow under a single pane of glass.

Before we dig deep into observability and the VTEX observability journey, let me tell you a little bit about how technology at VTEX is organized.

Technology at VTEX (by which I mean the combination of engineering, product, and design teams) works so that development and production are a shared responsibility with the Technical Infrastructure team.

And why am I telling you all of this? You probably see in your day-to-day work that Conway's Law really applies. If you're not familiar with it, it basically says that a system's architecture tends to mimic how you structure your teams. We'll talk a little bit more about it as we progress through this presentation.

What about the technical stack we're using?

When it comes to compute, there's quite a bit of Beanstalk. And we also use EKS heavily. Within Beanstalk you're looking at C# as the primary language. And with EKS, it's mostly Node.js and Go.

For networking, we make fair use of ELB, and we also have our primary application router, which we wrote ourselves.

And for data, we use RDS, S3, Redshift and also several other AWS services.

Let me tell you a bit about observability by the numbers at VTEX:

  • Roughly 4 terabytes of logs per day
  • 150 million active time series
  • 2 billion spans ingested each day

How did our observability stack look before we started working with AWS and before our open source journey?

On the left hand side, you can see apps or services and those apps will generate telemetry/observability data and send it to a primary vendor and also to several other vendors. You can see them on the right hand side.

This posed several challenges. Let's talk a little bit about the observability challenges we had with the legacy architecture:

  • Our observability budget was being overrun
  • We had lack of policy control and zero governance
  • There was a perception of a single vendor solution that was in fact a sprawl of shadow IT
  • Developers claimed the experience was good

Let's dig deep into each one of those:

What do we mean by "observability budget was being overrun"? The investment in observability as a percentage of our tech quality investment was accelerating. It became a budget within the budget, which could be good, but the numbers did not look good in our case. And reading social media, this seems to be fairly common for companies of all sizes, possibly because of the next challenge I will discuss.

Lack of policy control and zero central governance - what does that mean? Product teams could generate as much telemetry data as they wanted, with direct export to third parties and nothing in between. I'll make the callback to Conway's Law in a second, I promise. So no technically enforced policies were in place other than some data retention configuration.

The perception of a single vendor solution - what do I mean by that? The telemetry data was almost always exported to one primary vendor, but in fact it was being exported to more vendors with no clear criteria defined. That's the key sentence here.

The developers claimed that the legacy experience was good - the primary vendor stayed with us for 5 years, so there was a ton of institutional knowledge about how to use that stack.

How did we go about mitigating those challenges?

There is a lot more than just the technical side of it all here. I will cover mostly the technical:

First we decided on separating concerns - there is a lot that can be optimized when it comes to telemetry data generation and processing. Aggregation and sampling in the middle is, for us, a separate concern. And last but not least, observability data consumption is a third concern. So we separated those.

When we went about designing the new architecture, we also decided early on to go with an open source first approach. But keep in mind that we would probably end up using one or two vendors or leveraging a partnership there with them.

Why open source? Well, first: community. We want to leverage and contribute back to the community; we use a fair bit of open source at VTEX. We also had personal familiarity with Prometheus, Grafana, and similar systems. And we realized that the stack we wanted to observe integrates well with open source solutions.

So then it was basically an all hands on deck situation - efficiency improvement strategy was key to the company.

Here's how the new architecture looks:

On the left hand side, you can see the applications. In the middle, we have the governance that we were seeking from the get-go. And we leverage OpenTelemetry collectors that we run ourselves on AWS.

The Conway's Law callback here, if you've been waiting, is that because we have now a Technical Infrastructure team in between - that's exactly how we designed the system.

And on the right hand side, we have the data sinks and visualizations with Amazon OpenSearch Service and Amazon Managed Service for Prometheus, and we also use another vendor for a tracing solution.

It wasn't without its fair share of shortcomings and mitigations, of course.

Here are just a few examples of those:

  • We had scalability issues in the beginning between collectors and data sinks that caused some data loss for us. The mitigation was basically to keep a buffer in our collector/aggregation/sampling/filter infrastructure in the middle, so we can weather any downstream issues - basically any data sink issues between OpenTelemetry and the AWS services; a sketch of that buffering follows below. Our scale is not massive, but even so, after root cause analysis we realized it was a good idea to go beyond a single OpenSearch Service cluster and a single Amazon Managed Service for Prometheus workspace. One of each was not enough, due to reliability concerns and even scale.
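As referenced in the bullet above, here is a hedged sketch of the general idea, not our exact configuration: exporter-side buffering and retries in an OpenTelemetry Collector, fanned out to two Prometheus workspaces, rendered from a Python dict. Endpoints and sizes are placeholders, and the field names follow recent versions of the upstream prometheusremotewrite exporter.

```python
# Sketch only: buffer and retry when a downstream data sink has issues,
# writing to two workspaces in different regions for resilience.
import yaml

exporters = {
    "prometheusremotewrite/primary": {
        "endpoint": "https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-PRIMARY/api/v1/remote_write",
        "remote_write_queue": {"enabled": True, "queue_size": 10000, "num_consumers": 5},
        "retry_on_failure": {"enabled": True, "max_elapsed_time": "10m"},
    },
    "prometheusremotewrite/secondary": {
        "endpoint": "https://aps-workspaces.us-east-2.amazonaws.com/workspaces/ws-SECONDARY/api/v1/remote_write",
        "remote_write_queue": {"enabled": True, "queue_size": 10000, "num_consumers": 5},
        "retry_on_failure": {"enabled": True, "max_elapsed_time": "10m"},
    },
}

print(yaml.safe_dump({"exporters": exporters}, sort_keys=False))
```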

How are we doing today? Here's a scorecard:

  • Observability budget was being overrun - I told you this was a challenge. I'm happy to report we're well within budget and AWS gives us more flexible billing options than we had with our previous primary vendor.

  • I told you about the lack of policy control and central governance - that's been dealt with because we have this in-between managed stack that is governed by Technical Infrastructure/SRE.

  • The perceived single vendor solution that was in fact a sprawl of shadow IT and several teams working and using many different vendors on top of or aside from the primary vendor - now that is down to a dual vendor solution.

  • And last but also not least important - the developers claimed the experience was good. In parallel to the observability effort, we rolled out a holistic Voice of Developer program - this is an effort steered by our Developer Experience team. And what we discovered is developers say observability should be roughly the #4 focus area to continue to improve. But we realize there is quite a bit on the user experience side that needs to be done for observability as well. We have a much better technical stack but we can do quite a bit more when it comes to user experience for the developers - that's the feedback we're getting.

So what's the plan ahead to address that and some other challenges still?

  • Evaluate Amazon Managed Grafana to improve developer experience - that's next on our list
  • Provide more VTEX-focused training - we gave more general observability training on the new tools, and now we're thinking of being more specific about VTEX use cases
  • We're also looking to contribute/rebase some of the OpenTelemetry customizations that we apply ourselves. As I mentioned earlier, in the middle of the stack we have OpenTelemetry collectors that we run and customize quite a bit. And now we're looking to rebasing that with the open source community.
  • There is also quite some telemetry cleanup that can be done - I mentioned earlier the split of concerns - and one thing we didn't tackle head on at first was cleaning up the telemetry data being generated by the applications - and that causes/triggers us to have to sample quite a lot. We believe that by cleaning up the data and increasing the quality of the telemetry data being generated, we can relax some of our sampling/aggregation rules - and that will also improve developer experience.

Alright, that's it for me folks! Next I'll call on Mark and he's going to tell you all about the features that I keep asking him for. Thank you!

Thank you, Gustavo. There's a long list of features that you keep asking me for. So thank you for showing us another great use case around open source observability and a new open source journey.

My name is Mark Cheney. I lead product and engineering for our open source observability services at AWS, including Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS Distro for OpenTelemetry.

What I enjoy most about working at AWS is our customer obsession. You see proof of this in a lot of these use cases that you see here - our ability to operate at scale and our operational excellence. Our operational excellence is our commitment to building resilient, available software for our customers. And a lot of what I'm going to talk about today will show proof of that and how we do this in terms of how we manage our own services.

So let's talk about scale, to give you an appreciation of what has happened since we launched these services into general availability two years ago. We handle over 142 trillion signals and collect over 132 trillion metric samples across 14 different regions worldwide.

In my session today, I'll talk to you about how we do monitoring and observability internally for one of my services. I'll also share with you some of the new features that we're launching at re:Invent for some of these services.

AWS builders are also operators - we run, scale, upgrade and proactively remediate problems that impact our customers.

As an engineer, you build it, you operate it. This DevOps culture is at the core of all the services and features we build. I'm sure most of you have talked to AWS Support at some point when using an AWS service — probably everyone. Our first-responder engineers are the most critical point of contact with our customers. Maintaining a short time to respond to issues is very important, because at any point those engineers can escalate issues to our service teams, and it's important for us to make sure they are properly educated: they understand how to run the services and how the services are running, so they can quickly answer the questions our customers have.

We measure SLAs and we measure first-time response. And, most importantly for me, we measure how many tickets actually get escalated to the service team. One of my goals is to always keep escalations below 10% of actual tickets, to improve the overall support workflow for our customers.

We have a primary and a secondary on-call. The majority of our primary on-calls, who deal with all the customer issues and all the high-severity tickets, have a one-day rotation, and our secondary on-calls are on one-week rotations. Moving to a one-day on-call has certainly helped with operator fatigue, reduced a lot of our tickets, and helped us create a better work-life balance for our operators. But it's not the only thing you need to do to reduce high- and low-severity tickets: make sure you properly categorize your tickets, and make sure you prioritize them as part of your sprint planning and quarterly planning activities. Proactive notification has different layers at AWS.

We have aggressive alarms that monitor our p90 and p99 values — we measure everything at AWS as part of our operational excellence — and we also have alarms based on failed canaries that automatically cut tickets for our operators. We then have an escalation management process. That process includes our ability to post to the Personal Health Dashboard and the Service Health Dashboard when customers are being impacted. This is an important aspect of earning trust with customers, because we want you to know if there's a degradation in the service and how quickly we are remediating the issue, whether it's an increase in API error rates or higher latency in some of our APIs.
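As an illustration of that kind of percentile alarm, here is a hedged boto3 sketch; the namespace, metric, dimensions, threshold, and SNS topic are placeholders, not our actual alarm definitions.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Aggressive latency alarm on the p99 of a public API, feeding a paging topic.
cloudwatch.put_metric_alarm(
    AlarmName="query-api-p99-latency",
    Namespace="ExampleService",             # placeholder namespace
    MetricName="QueryLatency",              # placeholder metric (milliseconds)
    Dimensions=[{"Name": "Operation", "Value": "QueryMetrics"}],
    ExtendedStatistic="p99",                # percentile statistic instead of Average
    Period=60,
    EvaluationPeriods=3,
    DatapointsToAlarm=3,
    Threshold=2000.0,                       # illustrative threshold
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-page"],
)
```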

In terms of the investigation and remediation process: as part of our operational excellence process, we monitor the request availability, latency, and error metrics for all of our public APIs — the RED metrics Imaya was talking about: request rate, errors, duration. This helps us measure the availability and performance of all of our public APIs and our console. We also have global and regional dashboards that help us monitor what's causing the most pain to our customers.

We have weekly meetings within our teams to review these operational dashboards. But most importantly, there's also an AWS-wide meeting where you can land on the wheel, and if you're lucky enough, you get to present your operational dashboard in front of all the other AWS services. So you'd better make sure you're representing the right data on those dashboards.

So let me tell you a little bit more about how we manage one of the services I lead. For Amazon Managed Service for Prometheus, we use a cell-based architecture. A lot of the technologies we build and the best practices we develop, we contribute back through open source contributions, blogs, and articles — including articles on this cell-based architecture. If you're not familiar with cell-based architecture, please grab this QR code and have your teams learn about it. This is what we're using internally.

We run everything on native AWS for our service. Our control plane is entirely serverless: Lambda and API Gateway are some of the services we use there, and we also use ECS Fargate as our cell-based router. In our data plane, we use EKS to manage our containerized environment. The cell-based architecture creates isolation and reduces the blast radius within a region, and you have multiple cells across different regions.

So what about the monitoring and observability tools we use? By a show of hands, who uses fewer than two monitoring tools in their environment? Fewer than two — I do want to come talk to you if you're only using two. How about fewer than five? OK. What about more than five? The number of tools continues to grow.

For our Prometheus service, we use five collection services, and we also use five services where we store, query, and analyze observability data. Most importantly, we can visualize all of that data inside Amazon Managed Grafana. Yes, that's right: we also use our own managed open source services to monitor all of these technologies and simplify that.

Starting top-down: for AWS Lambda, we use the native logging to send logs to CloudWatch Logs and traces to X-Ray. For Fargate, we also use the native logging capability to send to CloudWatch Logs. For EKS, we use Fluent Bit to filter logs locally before we send them to CloudWatch Logs. For metrics, we use some CloudWatch Agent to send some metrics to CloudWatch, but the majority of our metrics are Prometheus metrics that we send to Amazon Managed Service for Prometheus. And as I mentioned, most of the dashboards and investigations by our engineers are done using Amazon Managed Grafana.

Let me share a couple of lessons we've learned while operating and deploying this infrastructure for monitoring and observability. One of our leadership principles at AWS is frugality, so managing observability costs for our infrastructure is an important part of our monthly routine. We scrutinize month-over-month costs for the infrastructure we use and for the observability services we use. One of my goals is to always keep the cost of our observability tools below 10%.

Some of the ways we improve our costs include initiatives you may have heard about in Adam's keynote, such as rolling out Graviton, which we hope to share more about next time in terms of cost and performance benefits. Another thing you can take advantage of for logging: we recently announced the patterns command in CloudWatch Logs. The patterns command automatically clusters logs with similar patterns, which makes it very easy to visualize the types and volumes of logs you have. Engineers can get overzealous and try to log everything in full in production; that may work well in development, but at production scale it creates a lot of logs you may not need for investigation and diagnostics, so it's a great opportunity to reduce log costs.

"Everything fails all the time," to quote Dr. Werner Vogels, Amazon's CTO. That's one of the reasons we use these Amazon managed open source services and CloudWatch native services. But the services are only as good as the data being produced: if our agents go down, we're not going to get any visibility. So it's also important to deploy your collectors in a highly available way. That's something we do for our EKS clusters: we always deploy our Prometheus collectors in high availability so they can work independently and produce duplicate metrics, ultimately sent to Amazon Managed Service for Prometheus, which automatically deduplicates the metrics at no additional cost to us — and to you, of course. We also monitor all of the logs and metrics from our agents — Fluent Bit, ADOT — to watch for performance issues such as potential memory leaks, so we can recycle those components.
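As an illustration of the patterns command, here is a hedged sketch that runs it from code through CloudWatch Logs Insights; the log group name is a placeholder and the query string follows the documented pattern command.

```python
import time

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# Cluster similar log lines to see which patterns dominate volume — a quick
# way to find overly chatty logging before trimming it.
query_id = logs.start_query(
    logGroupName="/example/service/application",  # placeholder log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @message | pattern @message",
)["queryId"]

results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Scheduled", "Running"):
    time.sleep(2)
    results = logs.get_query_results(queryId=query_id)

for row in results["results"][:10]:
    print({field["field"]: field["value"] for field in row})
```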

As part of the service architecture you saw in the previous slide, we use several AWS accounts to isolate functions per account. For example, we have five-plus accounts per cell: an account for the control plane, one for the data plane, one for our canaries, one for the console, and so forth. So CloudWatch usage metrics are an important thing that even we, as a service team, have to monitor. Usage metrics help you track how close you are to your service limits — things like the number of EC2 instances or the number of Auto Scaling groups you have. It's very important to detect and forecast what you need in terms of those service quotas so you can put in increase requests through the Service Quotas console or through a support ticket. If you're not doing this already, I certainly recommend putting the right alarms in place to monitor that.
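Here is a hedged sketch of one such alarm, combining CloudWatch usage metrics with the SERVICE_QUOTA metric math function; the example tracks the EC2 on-demand vCPU quota, and the threshold and SNS topic are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when usage exceeds 80% of the EC2 on-demand vCPU quota, so a quota
# increase can be requested before hitting the limit.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-vcpu-quota-80-percent",
    Metrics=[
        {
            "Id": "usage",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Usage",
                    "MetricName": "ResourceCount",
                    "Dimensions": [
                        {"Name": "Service", "Value": "EC2"},
                        {"Name": "Type", "Value": "Resource"},
                        {"Name": "Resource", "Value": "vCPU"},
                        {"Name": "Class", "Value": "Standard/OnDemand"},
                    ],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
            "ReturnData": False,
        },
        {
            "Id": "pct",
            "Expression": "(usage / SERVICE_QUOTA(usage)) * 100",
            "ReturnData": True,
        },
    ],
    Threshold=80.0,
    EvaluationPeriods=1,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:quota-alerts"],
)
```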

We also publish a lot of blogs, articles, and documentation about observability best practices. You can look at our AWS Distro for OpenTelemetry documentation, which has a dedicated section on observability best practices. Feel free to look at those and provide feedback if you need more information on how to get started.

In terms of how we deploy our own monitoring with Amazon Managed Service for Prometheus, we create one Managed Service for Prometheus workspace per region. The workspace is simply an isolation boundary for the metrics, the queries, and the alerting rules we have in that region. That one workspace per region collects all of the EKS data from all the running containerized environments into one centralized workspace, making it easier for us to visually analyze the data.

We also use CloudWatch cross-account observability. As I mentioned, with five accounts per cell, some regional accounts, and several cells per region, you can easily see how you get to 50 or 100-plus accounts, and it gets very hard to know where to go to look for your vended metrics and your CloudWatch logs. With CloudWatch cross-account observability, you can designate a central monitoring account where all of the metrics and logs are available in one place, making it easier to query the data.
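A hedged sketch of wiring that up with the Observability Access Manager (oam) APIs in boto3: the sink is created in the central monitoring account and the link in each source account. Names and ARNs are placeholders, and a sink policy authorizing the source accounts is also required (not shown).

```python
import boto3

# In the central monitoring account: create the sink that receives telemetry.
oam_monitoring = boto3.client("oam", region_name="us-east-1")
sink = oam_monitoring.create_sink(Name="central-observability")
print(sink["Arn"])

# In each source account: link its metrics, logs, and traces to that sink.
# (A PutSinkPolicy in the monitoring account must allow these accounts.)
oam_source = boto3.client("oam", region_name="us-east-1")
oam_source.create_link(
    LabelTemplate="$AccountName",
    ResourceTypes=[
        "AWS::CloudWatch::Metric",
        "AWS::Logs::LogGroup",
        "AWS::XRay::Trace",
    ],
    SinkIdentifier=sink["Arn"],  # placeholder: share the sink ARN out of band
)
```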

That leads into our third practice: naming conventions for data sources in Grafana. For every data source we use, we want to leverage Grafana's strengths when editing dashboards and use template variables to quickly isolate and compare performance across the CloudWatch data and the Prometheus data we consume. To give you an example, for CloudWatch all of our data sources are prefixed with cw-, then the environment (beta, gamma, production), then the region name, and sometimes we even go further with the cell name or cell ID. That makes it a lot easier to take advantage of template variables.

We use a lot of dashboards, and to give you an example of what that can look like, this is part of one of our operational dashboards for monitoring Alertmanager. You can see the stage, the region, and the other variables at the top, and it's easy for us to compare across environments and regions what goes into monitoring the health and performance of, in this case, the Alertmanager component.

This is an example of a memory leak we detected. Again, we contribute back; we use a lot of open source components. Alertmanager is the component in our Prometheus stack that inhibits and routes alerts, for example to SNS, from where you can route them wherever you want. When we were deploying an update to Alertmanager, we noticed a slow memory leak during the baking period. Everybody has probably experienced a memory leak; what matters when you operate at scale is how quickly you detect it as you roll out a change, so that you can, one, roll back the change and, two, remediate the problem. That's exactly what we did. We found the problem using Prometheus metrics, and we also used continuous profiling data to isolate exactly the code that was driving the memory leak. We were then able to roll back the Alertmanager update and fix the problem.
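As an illustration of the kind of query that surfaces a slow leak, here is a hedged sketch that sends a PromQL query to a workspace's query API with SigV4 signing; the workspace ID, metric, and label names are placeholders rather than the actual metrics used in this investigation.

```python
from urllib.parse import urlencode

import boto3
import requests
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

REGION = "us-east-1"
WORKSPACE_ID = "ws-EXAMPLE"  # placeholder workspace ID
# Placeholder PromQL: working-set memory of the alertmanager container.
PROMQL = 'container_memory_working_set_bytes{container="alertmanager"}'

url = (
    f"https://aps-workspaces.{REGION}.amazonaws.com"
    f"/workspaces/{WORKSPACE_ID}/api/v1/query?" + urlencode({"query": PROMQL})
)

# Sign the request with SigV4 for the 'aps' (Managed Prometheus) service.
credentials = boto3.Session().get_credentials()
signed = AWSRequest(method="GET", url=url)
SigV4Auth(credentials, "aps", REGION).add_auth(signed)

response = requests.get(url, headers=dict(signed.headers), timeout=30)
for series in response.json()["data"]["result"]:
    print(series["metric"], series["value"])
```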

You can see the impact that change had. Who uses open source technologies? I hope everybody raises their hand. This is an important part of what we do; one of the reasons I built these services three years ago was to be, and continue to be, thought leaders and good stewards in the open source community, and to have what we build help our engineers contribute back to these projects. These are some of the projects that our teams and our sister open source teams contribute back to.

If you're looking to contribute to a project, or you've never contributed to an open source project before, we'd love to work with you. It's something we do on an active basis; it's part of the career development of our engineers, and it's what they do. You can see a lot of the contributions, like the Alertmanager work I was just talking about, and several others. So if you're interested in learning more about how you can contribute or work upstream, please come and see me. I'd love to know how we can partner better in these areas.

All right, let me walk you through some of the new features we've announced recently for our services. Since we launched these services in general availability two years ago, we've continued to improve security and controls around them, such as adding SOC compliance for both Managed Service for Prometheus and Managed Grafana, and adding a number of other certifications, including ISO and PCI compliance for Prometheus as well.

We've also added network access control for Grafana. This allows you to restrict the inbound traffic coming to a specific workspace based on customer-managed prefix lists as well as VPC endpoints. We've also added secure connections to VPC data sources. I had to add that one back on the slide because I continue to get the question all the time: how can I connect to sensitive workloads running in my private VPCs? It's something we added about a year ago, and it's an important feature if you want to monitor your RDS instances, your self-managed open source databases, or any other self-managed data sources. You can do this today with your Managed Grafana instance.

Scaling and regional expansion is a continuous journey; it's what we do at AWS. At the beginning of the year we announced 500 million active time series metrics per workspace and 30,000 rules per workspace. That's over a 20x improvement since we launched the service in general availability two years ago. Similarly, for Grafana, we support 10,000 provisioned users, 500 concurrent users, and over 2,000 dashboards and data sources, and we're continuing to improve on that based on customer asks. So keep working with your account team, or come and see us, to figure out how we can raise those limits. Ultimately, this gives you the ability to monitor billions of Prometheus metrics across all the regions and workspaces we have.

Next, in terms of a single pane of glass and unifying the experience: we're continuing to improve the experience between Managed Grafana and Managed Service for Prometheus so you can view all of the alarms you have — those 30,000-plus rules you're managing, that are triggering, that you want to silence — creating a much better experience directly in Managed Grafana for managing alarms at that scale.

On managing costs — how many times have we heard cost optimization in this session? We recently announced Amazon EKS multi-cluster cost monitoring. This is an extension of what we had before: through the EKS and Kubecost partnership that happened last year, we supported cost monitoring for one cluster. In, I believe, June of this year, we added the ability to do this across all your clusters, so you can send all your cluster data to a centralized Managed Service for Prometheus workspace and monitor and compare the costs across all your clusters. We had a customer recently in the messaging space who said that within 10 minutes they were able to deploy this to several clusters and find $100,000 worth of savings opportunities around the utilization of their infrastructure. It gives you a breakdown of cost by namespace, by pod, and everything else. Who has 10 minutes to spare to go test that out? Please come and give me feedback on that feature; I'd love to hear what's working there.

OpenSearch trace analytics: this is the ability to take all of the traces you're storing in OpenSearch and use the same OpenSearch plugin not only to query and build analytics on top of your traces, but also to correlate them with the logs you're storing. It also lets you correlate and compare, in a single dashboard, the metrics you're storing in Prometheus or any other data source.

Next, advanced sampling. As part of the AWS Distro for OpenTelemetry, we've added advanced sampling: the ability to group by, and to do tail-based sampling. We've also added support for Kafka, as both receiver and exporter components, in the ADOT distribution. This was important because customers were using Kafka as a telemetry bus: they want to send data to Kafka, filter it, do better aggregation, manipulate the data, and then send it back to an ADOT collector so it can be forwarded to the destination of their choice. And there are some other features we've announced.

Oh, I jumped ahead — my punch line is gone. As a service team, I was looking for volunteers: who loves to deploy agents, manage agents, scale agents? How about making sure you manage configuration drift between all of your clusters and accounts? Any volunteers who want to help me out with that? I think no one loves to do that, right?

Ultimately, one of the great things we just announced at re:Invent is support for an agentless collector. We can discover and collect all of the Prometheus metrics inside your EKS clusters, do it securely, and scale it based on however many endpoints are available to be scraped and how many metrics you want to collect. If you haven't tried it yet, I recommend you go try it.
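A hedged sketch of what enabling the agentless collector can look like through the CreateScraper API in boto3, assuming the request shape at launch; the cluster ARN, subnets, security groups, workspace ARN, and scrape configuration are all placeholders.

```python
import boto3

amp = boto3.client("amp", region_name="us-east-1")

# Minimal Prometheus-style scrape configuration, passed as a blob (placeholder).
scrape_config = b"""
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
"""

scraper = amp.create_scraper(
    alias="eks-prod-scraper",
    scrapeConfiguration={"configurationBlob": scrape_config},
    source={
        "eksConfiguration": {
            "clusterArn": "arn:aws:eks:us-east-1:123456789012:cluster/prod",
            "subnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],
            "securityGroupIds": ["sg-0123456789abcdef0"],
        }
    },
    destination={
        "ampConfiguration": {
            "workspaceArn": "arn:aws:aps:us-east-1:123456789012:workspace/ws-EXAMPLE"
        }
    },
)
print(scraper["scraperId"], scraper["status"])
```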

For those of you who continue to love managing agents, feel free to do that; that's always an option with the ADOT collector. But this is certainly going to simplify some of the operator pain that can exist around managing, scaling, and optimizing your infrastructure for collecting those metrics.

Next up: two years ago we announced traces, last year we announced metrics, and now, working and collaborating with the open source community, we've stabilized and now support logs. So you can officially use ADOT, the AWS Distro for OpenTelemetry, to collect all three signals — metrics, logs, and traces — and send them to the destinations of your choice.

We've added, benchmarked, and performance tested the log receiver, and we've also added support for the CloudWatch Logs exporter. You can also use the OpenTelemetry Protocol (OTLP) wire protocol to send data directly to OpenSearch as well.

Next, Amazon Managed Grafana now supports more than 300 community plugins, including data sources and visualization plugins. As an administrator, you can not only select which community plugins your users have access to, but also test different versions — go figure — before you deploy them to your team, to make sure they work with your dashboards. So you have full control over that inside your Managed Grafana workspace. That was launched about a week ago, I think.

So what does this mean for us? We had five collectors and five different services. Some of the things that are going to change — things we're testing, baking, and will continue to evolve over 2024 for our managed services — are that we're going to reduce the five collectors to four: using the ADOT collector for tracing and logs, and using the new agentless collector feature to discover and collect all Prometheus metrics from all our clusters. That really reduces the operator pain of managing these agents.

Well, yes, we'll come back for questions. During the session we shared four great customer use cases and the benefits of using some of these open source services. I also shared features recently announced for three of these open source observability services: Amazon Managed Grafana for rich data visualization, Amazon Managed Service for Prometheus for high-cardinality metrics, and AWS Distro for OpenTelemetry for instrumentation and collection of all of your signals — metrics, traces, and logs.

So what's next? If you're just getting started and want to learn more about these observability services, check out these workshops, best practices, and accelerators. It's a great opportunity for your teams to start using them and learn how to deploy these technologies. Come visit us — I'll be there tomorrow, at the village — and you can ask us more questions about the open source technologies we have. There's also some swag; everybody needs more swag to carry home and check into their luggage.

All right. Thank you so much.
