SaaS operations in action: Buy with Prime

Hi, everyone. Thank you so much for joining us today for session BWP 301, SaaS Operations in Action for Buy with Prime. Today, we're going to share with you some of the best practices that the Buy with Prime team has learned in operating multi-tenant services. And by the end of the presentation, we'll also share some sample code so that you can get started as well.

My name is David Ramos and I'm a Principal Engineer with Buy with Prime. Today, I'm joined by two great co-presenters: Joan Yon, Senior Solutions Architect at AWS, and Jimin Kim, Solutions Architect for Buy with Prime.

As a quick intro before I get started, I've been with Amazon now for over 11 years. I'm not exactly sure where the time went either. Most of my time has been spent on projects with a machine learning focus, so I've worked on the Fire Phone if anyone remembers that, Alexa, Audible, and most recently Buy with Prime. I joined the Buy with Prime team because I was really impressed by the technical vision: the way that we approach multi-tenant and composable services really enables some interesting use cases, while also posing unique challenges as we learn to scale. And of course, I'm a Prime member, so I love using Prime everywhere.

Today, I'll briefly go over what Buy with Prime is and give a very brief overview of our architecture and multi-tenancy. Then Joan will come on stage and describe an ongoing mechanism that we use to build full confidence that we're enforcing per-tenant isolation of customer resources. Next, Jimin will speak about what metrics the Buy with Prime team uses, before summarizing the different techniques we've shown.

We'll have some time at the end for questions, and I'm a new dad, so I'm obligated to show you some baby pictures. So let's get started.

Buy with Prime lets millions of Prime members shop directly from merchants' online stores with the trusted experience that they expect from Amazon. This includes fast free shipping, a seamless checkout experience, and easy returns.

Buy with Prime is available today for US merchants, and some of you may have already seen the Buy with Prime button on some of your favorite stores. Merchants can integrate Buy with Prime directly onto their website, either by installing a JavaScript code snippet to get the Buy with Prime widget or, for Shopify merchants, by installing the Buy with Prime app for Shopify.

We also offer a growing number of APIs and real-time event subscriptions. We want to meet our customers where they are and be as flexible with our interfaces as possible in order to aid adoption. Buy with Prime is a win-win for merchants and their customers.

Prime members like me get to enjoy Prime shopping benefits and a familiar way to check out, while merchants can nurture customer relationships and build brand loyalty. So far, Buy with Prime has been shown to increase shopper conversion by an average of 25%. This data point measures the average increase in shoppers who placed an order when Buy with Prime was available as a purchase option versus when the button wasn't shown, over the same time period, since we launched in April last year.

We also held our first Prime Day event with our customers. In aggregate, merchants who participated in Prime Day activities experienced a 10x increase in daily Buy with Prime orders and an 8x increase in daily revenue from orders during that Prime Day event period, compared to the month before we announced Prime Day.

So now let's bring our focus to how we actually provide this service to our customers. Our services are all multi-tenant, meaning that a single service supports many distinct instances of use. Ensuring that tenant resources are isolated isn't optional for us; it's fundamental to how we earn merchant and customer trust.

So with that said, let me give a brief overview of the Buy with Prime architecture. This is just to set some context for the main topics of our presentation, which are multi-tenant data isolation and observability.

If you're interested in learning more, you can use the QR code in the top right to see our re:Invent session from last year, which dives a little deeper.

Buy with Prime embraces the microservice architecture pattern in order to achieve developer concurrency and efficiency. This lets us iterate quickly on new features in an independent manner across service teams. As mentioned earlier, we allow merchants to integrate with both our front-end features and our backend APIs. This means that a merchant can either use the Buy with Prime widget as-is or onboard directly to the same data APIs that we use to power those experiences. This lets us and merchants both experiment with new innovations for the shopper experience.

The first aspect of multi-tenancy that I'd like to mention here is that each of our services manages its own AWS accounts, and each of those AWS accounts manages resources for multiple merchants. This cardinality lets the service teams operate independently while also being able to enforce consistent data handling, security, and configuration, as well as simplifying other aspects like logging, metering, and analytics.

Jimin will talk in depth later on how we get visibility into tenant resources across these accounts, which lets us identify trends and debug issues specific to either a single microservice or a specific tenant.

Our microservices aren't developed in complete isolation, however; we enforce some common data handling and communication standards in order to help manage tenant resources. One of these common standards is the tenant ID, a unique identifier that accompanies any activity associated with an individual tenant's resources on every Buy with Prime service.

When a merchant onboards to Buy with Prime, we leverage CloudFormation to synthesize relevant resources across every microservice for that integration. We also generate the tenant ID as a custom resource and propagate that to each of those services, which they can use to tag any related request or data element. This all happens behind the scenes. For the most part, the merchant only needs to be aware of their tenant ID when they're integrating with our APIs so that we can operate on their relevant resources.
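As a rough illustration of that onboarding idea, here is a minimal CDK sketch of generating a tenant ID as a CloudFormation custom resource and handing it to downstream per-tenant resources. All names here are hypothetical placeholders; the actual Buy with Prime onboarding stack is not public.

```typescript
// Minimal sketch (hypothetical names): a CDK stack that mints a tenant ID as a
// CloudFormation custom resource during merchant onboarding, then passes it to
// the per-tenant resources so they can tag requests and data elements.
import { Stack, StackProps, CustomResource, Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as cr from 'aws-cdk-lib/custom-resources';
import { Construct } from 'constructs';

export class MerchantOnboardingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Lambda that generates a unique tenant ID (a UUID) on resource creation.
    const tenantIdGenerator = new lambda.Function(this, 'TenantIdGenerator', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      timeout: Duration.seconds(30),
      code: lambda.Code.fromInline(`
        const crypto = require('crypto');
        exports.handler = async (event) => {
          const id = event.PhysicalResourceId ?? crypto.randomUUID();
          return { PhysicalResourceId: id, Data: { TenantId: id } };
        };
      `),
    });

    const provider = new cr.Provider(this, 'TenantIdProvider', {
      onEventHandler: tenantIdGenerator,
    });

    // The custom resource; its TenantId attribute would be propagated into
    // every per-tenant construct synthesized for this merchant.
    const tenantId = new CustomResource(this, 'TenantId', {
      serviceToken: provider.serviceToken,
    });

    const tenantIdValue = tenantId.getAttString('TenantId');
    // ...pass tenantIdValue into the per-microservice constructs here...
  }
}
```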

So now that every microservice has these resources for each merchant, along with the tenant ID for those merchants, we have all the pieces in place to facilitate fine-grained access to each tenant's resources via temporary assumable IAM sessions. Each service, powered by our identity service, can enforce that a request only accesses the data that is rightly scoped to the associated tenant ID.
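To make that concrete, here is a minimal sketch of how a tenant-scoped temporary session could be vended. The role ARN, table name, and the DynamoDB leading-keys condition are assumptions for illustration, not the actual Buy with Prime implementation.

```typescript
// Minimal sketch (hypothetical role and table names): vending tenant-scoped
// temporary credentials by assuming a role with an inline session policy that
// restricts DynamoDB access to items whose partition key is the tenant ID.
import { STSClient, AssumeRoleCommand } from '@aws-sdk/client-sts';

const sts = new STSClient({});

export async function getTenantScopedCredentials(tenantId: string) {
  const sessionPolicy = {
    Version: '2012-10-17',
    Statement: [
      {
        Effect: 'Allow',
        Action: [
          'dynamodb:GetItem',
          'dynamodb:PutItem',
          'dynamodb:UpdateItem',
          'dynamodb:DeleteItem',
          'dynamodb:Query',
        ],
        Resource: 'arn:aws:dynamodb:*:*:table/OrdersTable',
        Condition: {
          // Only rows whose leading (partition) key matches this tenant.
          'ForAllValues:StringEquals': { 'dynamodb:LeadingKeys': [tenantId] },
        },
      },
    ],
  };

  const response = await sts.send(
    new AssumeRoleCommand({
      RoleArn: 'arn:aws:iam::123456789012:role/TenantDataAccessRole',
      RoleSessionName: `tenant-${tenantId}`,
      Policy: JSON.stringify(sessionPolicy),
      // Tag the session with the tenant ID so CloudTrail and downstream
      // policies can reference it.
      Tags: [{ Key: 'TenantId', Value: tenantId }],
      DurationSeconds: 900,
    }),
  );
  return response.Credentials; // AccessKeyId / SecretAccessKey / SessionToken
}
```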

Again, this was just a cursory look at some of our architectural aspects. I put the QR code up there for the re:Invent session from last year that dives a little deeper. And of course, we're happy to answer any questions after the presentation.

Now, I'll invite Joan up to the stage. Thank you.

Thanks, David, for the intro. Hi, everyone. My name is Joan Yon, Solutions Architect here at AWS. I am super happy to be on stage with you today at 4pm in Vegas. Thank you very much for being here, and I'll make sure you have a lot of fun with us.

So David introduced a little bit of how Buy with Prime works under the hood and what our service does. But I'd like to ask you to take a couple of steps back and really think about why you chose to be here. It is about learning from Buy with Prime's experience and example, and how you can apply these practices in your own environment, right?

So I'd like to tell you the story of why we chose to make this presentation. I personally joined AWS first, then moved to Buy with Prime, so I already had the AWS lens, and then I decided to come back to AWS to meet customers. When I first came to Buy with Prime, honestly, my mind was blown, because they were doing the exact same best practices that AWS SAs tell our customers to do, in very beautiful and elegant ways. And I thought, this is something we should definitely talk to our customers about more: this is how Amazon teams actually do it, and it's something that can benefit other architects and developers out there. That's what this presentation is about.

And our philosophy here is that we do not want any one of you leaving this room empty-handed. So we've got a lot of cool samples out there, and we are excited to share them. But before we do that, I'd like to jog your memory a little bit about operational excellence in the Well-Architected Framework.

This is basically about running and designing your workloads to efficiently serve your business requirements without too much operational overhead. I do not want to go over every single icon here; there are great presentations about this, so please check those out. I'd rather point you to some of the key points that we thought are most important to Buy with Prime.

It is really about automating tasks and continuously improving processes so that we can learn from our failures and never repeat the same problem. A couple of things about us: we are a very young business; we opened up in April 2022. What that means is, one, we have to roll out new business-critical features basically every day, and two, we have to be absolutely fault tolerant, in the sense that we are essentially infrastructure for our merchants. And we have to do all of this while keeping security as job zero, just like any other Amazon service.

But that's not all. Because we are a software-as-a-service provider, there is an additional layer of complexity. We have to know what happens across all tenants and make sure every single merchant of ours experiences the quality we want to provide. We should also know their usage patterns and find out whether there's any way we can optimize their experience based on that usage.

These are things we have to care about on top of the normal operational excellence we'd have for general AWS workloads. Once you put this all in one picture, honestly, it's a lot. So what we'd like to do here is give you two digestible versions of the operational excellence practices Buy with Prime has. The first is performing operations as code: to enable our developers to roll out new features every day while keeping robust tenant isolation boundaries, we embedded tenant isolation validation into our pipeline. This will be the first topic we go deeper into.

And the second is, just like David said five minutes ago, we are a big microservice shop, and we have multiple engineering teams that manage different microservices, in different accounts of course. So it was a little challenging for us to have a standardized way of handling metrics, logs, and all these observability concerns at the same level across different teams. We had to find a way to standardize this in a smart way, and the way we accomplished it is using the AWS CDK, the Cloud Development Kit. We will dig deeper into this in the second part.

When you take a step back and look at these two principles, it all boils down to one thing: confidence. We know our engineering teams have built great things, but we want to really make sure we are setting the same uniform, high standard across every team for our customers, because having the tenancy layer absolutely complicates a lot of things.

So we are very excited to go deeper into this. Without further ado, I'd like to go into the automated tenant isolation validation part. Since this happens as part of the pipeline, I'd like to quickly go through the individual stages we have in the Buy with Prime pipeline.

If you are familiar with Amazon's CI/CD best practices, this will look fairly similar. What happens first is the source check-in. Let's say you have an application; to make that one application work, there are lots of things that have to work together: application source, operations source, test sources, IaC code, and all the dependency packages. If anything changes in one of those, the pipeline kicks off, and then we go through the build phase and do all the fun build stuff: compilation, unit tests, static analysis, and so on.

Once we complete all these build actions, if everything succeeds, we move on to the pre-production environments. For pre-production in Buy with Prime, we have three: Alpha, Beta, and Gamma. In Alpha and Beta, we are mainly validating that the latest code functions as we expected. In Gamma, we do all of this validation, and in addition we make it a little more production-like. What I mean by production-like is that we bring the same deployment configuration as production, we have the same monitoring and alarms set up as production, and we also run the same continuous synthetic testing that we do for production.

Across all of these, Alpha, Beta, and Gamma, we run a variety of integration tests. We validate our assumptions about the mocks and validate that the code works as we expected, and we put those tests inside a rollback window: if any of these tests fail within the predefined time frame, the deployment simply rolls back to the previous version. So you, as the software engineer, have to make sure everything runs within the defined time frame.

And the star of today's show, the automated tenant isolation validation, happens as part of the security integration tests across the different pre-production environments. Once all these integration tests succeed, we finally go to the one-box stage.

Here, what we want is to make sure this new source is backward compatible. What we mean by that is that it can be deployed alongside the existing source. For instance, we need to make sure the new source is not writing any data in a format that the current code cannot parse, because that would simply break the whole system.

So what we do here is we deploy the new source to a small subset of the whole fleet.

Let's say one virtual machine, or even a single container, or a small percentage of Lambda invocations. We set this side by side with the existing version and let the two consume production traffic organically together. Then, say we set a time window of one hour: we let the two versions run together, run the synthetic tests, and make sure we get successful synthetic test results and healthy monitoring metrics.

By doing so, we know that this new source doesn't cause any side effects on its own or while sitting next to the previous version. With the one-box stage complete, you finally hit production and your new feature is out there.

Throughout this whole, pretty intense validation process, we decided to add tenant isolation validation as part of it. You might now think, OK, now we know how the pipeline works, but how does this tenant isolation validation actually work in an automated way?

Let's go a little deeper into this part. The concept might be a little foreign to you, but how we do these things is not rocket science. First, we should go over the architectural factors we have here.

Just like David introduced, we have a control plane and a data plane for our microservices. The control plane holds the metadata for the tenant, and the actual important business data goes to the data plane database. So we have these two data dimensions, and to make sure we are allowing only the right callers to access their own data, we leverage two boundaries: the first is the AWS account ID and the second is the tenant ID.

We use these two as part of the IAM role session metadata and make sure we are getting the expected outcome in each individual scenario. With these IAM roles in place, what happens in action is straightforward.

When the security integration test, the isolation validation part, runs, we instantiate the test runner. Once it is instantiated, it assumes the first IAM role. Since we have allow-listed and non-allow-listed AWS accounts, it starts from a non-allow-listed AWS account, saying "I'm Joan, give me Joan's data."

In this case, it would be something like: I somehow have a Buy with Prime microservice endpoint and I'm trying to access a legitimate merchant's data, which is entirely possible. And of course, for obvious reasons, this should fail. Our microservices are smart enough to know which AWS accounts they should allow or deny, so this gets a 400 exception saying access denied.

Then we move on to the next case: an allow-listed AWS account with a non-matching tenant ID. We do create, read, update, and delete and make sure all of these operations are covered.

In this second case, you can think of it as: I somehow hacked into a Buy with Prime microservice, and now I'm not happy with just Joan's data, I want David's data. I'm carrying Joan's tenant ID but asking for David's data. This should obviously fail as well, with the same error code, a 400 saying access denied.

And finally, you get the positive case: an allow-listed account with a matching tenant ID, because this is most likely coming through a legitimate path that Buy with Prime approves.

These are pretty straightforward, right? It's just about being meticulous about all the corners you can imagine for this tenant activity. But I'd like to give you three key considerations for this test runner. One: you should know your own tenant isolation boundaries and make sure you bring them into the IAM role. For us, it was the AWS account and the tenant ID, but it could be VPCs, or ECS clusters, or EKS clusters for you; whatever boundaries you choose, you have to bring them into the IAM policy as resources or conditions, and really make sure you cover them.

The next consideration is the data dimensions. In our case, it was just the control plane and the data plane, but if you have multiple different databases supporting this, you've got to make sure you cover all those data sources, dimensions, and tables, and cover all the operations your application can perform, so that you cover all the corners of the positive and negative cases.

But everything is always easier said than done, and again, the last thing we want from this presentation is for you to leave empty-handed. So we put together a very simple implementation of the test runner I just described.

If you scan the QR code in the top right corner here, it will take you to the GitHub repository containing this test framework. What it does is assume an IAM role that you provide and try create, read, update, and delete actions against the DynamoDB table that you give it. It's a fairly simple start; what we'd like to accomplish with this is giving you the right starting point.
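For reference, a negative test case in that style might look roughly like the sketch below. This is an illustration of the idea, not the repository's actual code; the role ARN, table name, and key schema are placeholders.

```typescript
// Minimal sketch of an isolation test case (hypothetical names): assume the
// IAM role for one tenant, attempt a write against another tenant's
// partition, and expect DynamoDB to reject it with AccessDeniedException.
import { STSClient, AssumeRoleCommand } from '@aws-sdk/client-sts';
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';

async function assumeTenantRole(roleArn: string, tenantId: string) {
  const sts = new STSClient({});
  const { Credentials } = await sts.send(
    new AssumeRoleCommand({
      RoleArn: roleArn,
      RoleSessionName: `isolation-test-${tenantId}`,
      Tags: [{ Key: 'TenantId', Value: tenantId }],
    }),
  );
  return new DynamoDBClient({
    credentials: {
      accessKeyId: Credentials!.AccessKeyId!,
      secretAccessKey: Credentials!.SecretAccessKey!,
      sessionToken: Credentials!.SessionToken,
    },
  });
}

// Negative case: the caller carries tenant A's ID but targets tenant B's data.
export async function expectCrossTenantWriteDenied(
  roleArn: string,
  tableName: string,
  callerTenantId: string,
  victimTenantId: string,
): Promise<void> {
  const ddb = await assumeTenantRole(roleArn, callerTenantId);
  try {
    await ddb.send(
      new PutItemCommand({
        TableName: tableName,
        Item: { tenantId: { S: victimTenantId }, sk: { S: 'isolation-probe' } },
      }),
    );
    throw new Error('FAIL: cross-tenant write was allowed');
  } catch (err: any) {
    if (err.name !== 'AccessDeniedException') throw err;
    console.log('PASS: cross-tenant write denied as expected');
  }
}
```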

You should bring this back to your office and expand it to the IAM roles you actually have, and bring in your EKS cluster options, VPC options, and so on as part of the IAM role session. And if you have, say, Secrets Manager or different databases, you should definitely expand it to other APIs beyond DynamoDB as well, just like we did in the repository.

Once you figure out the things you'd like to test, the next thing you should do is put that into your pipeline. Let's say you have a CodePipeline like this: once your build phase completes and the dev environment has the new version of the source, you can use CodeBuild as a test runner and have it run the tests for you. It will store the results in an S3 bucket, and you, as a reviewer, can hit approve once you're sure the isolation results are safe, and move on to the next pre-production stage or even to production.
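A minimal CDK sketch of that wiring could look like the following, assuming a simple S3 source and hypothetical project and bucket names; the build and deployment stages are left out for brevity.

```typescript
// Minimal sketch (hypothetical names): a pipeline stage that runs the
// isolation tests in CodeBuild, publishes the results to S3, and gates
// promotion behind a manual approval.
import { Stack, StackProps } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as codebuild from 'aws-cdk-lib/aws-codebuild';
import * as codepipeline from 'aws-cdk-lib/aws-codepipeline';
import * as actions from 'aws-cdk-lib/aws-codepipeline-actions';
import { Construct } from 'constructs';

export class IsolationTestPipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const sourceBucket = new s3.Bucket(this, 'SourceBucket', { versioned: true });
    const resultsBucket = new s3.Bucket(this, 'IsolationTestResults');

    // CodeBuild project that runs the isolation test suite.
    const testRunner = new codebuild.PipelineProject(this, 'IsolationTestRunner', {
      buildSpec: codebuild.BuildSpec.fromObject({
        version: '0.2',
        phases: { build: { commands: ['npm ci', 'npm run isolation-tests'] } },
        artifacts: { files: ['isolation-test-report.json'] },
      }),
    });

    const sourceOutput = new codepipeline.Artifact();
    const testOutput = new codepipeline.Artifact();

    new codepipeline.Pipeline(this, 'Pipeline', {
      stages: [
        {
          stageName: 'Source',
          actions: [
            new actions.S3SourceAction({
              actionName: 'Source',
              bucket: sourceBucket,
              bucketKey: 'source.zip',
              output: sourceOutput,
            }),
          ],
        },
        // ...build and dev-deployment stages would sit here...
        {
          stageName: 'IsolationValidation',
          actions: [
            new actions.CodeBuildAction({
              actionName: 'RunTenantIsolationTests',
              project: testRunner,
              input: sourceOutput,
              outputs: [testOutput],
              runOrder: 1,
            }),
            new actions.S3DeployAction({
              actionName: 'PublishResults',
              bucket: resultsBucket,
              input: testOutput,
              runOrder: 2,
            }),
            new actions.ManualApprovalAction({
              actionName: 'ReviewIsolationResults',
              runOrder: 3,
            }),
          ],
        },
        // ...gamma and production stages would follow...
      ],
    });
  }
}
```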

Hopefully, this equips you with some tools to boost your confidence in your isolation boundaries. I want to share this QR code once again; please feel free to share your feedback, questions, and any issues that you face. We are all ears and would like to help you out as much as we can.

With that, I'd like to invite Jimin over to the stage and let her walk us through one of the most important challenges at Buy with Prime, observability, and how we tackle it. Thank you.

OK. In the second half, I'm going to talk about observability in SaaS. Before that: before joining Buy with Prime, I was an AWS SA for 3.5 years. Just like Joan, I primarily supported startup and ISV customers, and many of them were SaaS providers. So I want to see who here is coming from a SaaS company. Great. And who are the engineers? Great.

So I believe you can relate to this content today. One of the frequently asked questions was: what metrics should be monitored, and what monitoring tools do I recommend? In this session, I do talk about the metrics we monitor and the tools we use, but they might not be the right ones for you. So I want to emphasize the why behind our choices and the benefits we gain.

Let's start by thinking about observability challenges. Since Buy with Prime launched in 2022, we have grown fast, meaning we have more resources, more servers and databases, and therefore more metrics to monitor. If you don't set a clear baseline of what is important from your business perspective, you often end up giving equal weight to all the metrics you measure, and it results in alarm fatigue.

Second, due to the nature of microservices, obtaining observability becomes more complex: you need to track all the dependencies and maintain visibility across AWS accounts. Multi-tenancy also introduces an additional layer of complexity here.

Then how does Buy with Prime handle these challenges? First, we utilize purpose-built dashboards. Tailoring dashboards to specific needs enables us to concentrate on critical issues and quickly gain an understanding of our services' status.

Second, we utilize CDK to standardize our observability configuration. Because each microservice is managed by a different team, keeping our configuration consistent is challenging.

Let's take a look at some of the purpose-built dashboards we have. The first example I'm bringing here is a critical system summary, focusing on availability, latency, and traffic. This is a common dashboard format, and I believe most of you already have a similar one, right?

Let me show you how we utilize this. We hold regular operational metric review meetings. When I first transferred to Buy with Prime, I was shocked by the size and diversity of the group. I expected some technical folks there, but actually VPs are there, and directors from tech and even from sales, and all the engineering teams, managers and engineers. As a large group, we invest a lot of time, so we focus on critical operational practices, and one of them is reviewing this dashboard. With that, we can facilitate a shared understanding of key events and key service status across various stakeholders within a short time. But defining well-designed dashboards is not enough by itself.

It's crucial to have a practice of regularly visiting and reviewing them. For example, we had a traffic peak in the last two days here, which was actually coming from Prime Big Deal Days on October 10 to 11. So it was not an issue, but in a case like this, we expect to have an answer about what caused it, what the impact was, and ideally the action we took.

The next dashboard I want to share is the availability tracker. Being available is the bare-minimum baseline for a service; if it's not available, no one can use it. We listed all the microservices we have; I'm showing only four here, but obviously there are more. Each line indicates an availability goal and the calculated availability during the last 7, 30, and up to 90 days.

I said "calculated" availability; we use this equation to define it, so let's break it down one by one. You might have heard about uptime, calculated as the ratio of uptime to the total running time, typically on a yearly or monthly basis. It's a common approach, but we calculate availability using the number of requests, because that is more closely tied to customers' actual experience.

What does that mean? If you measure availability based on a fixed time frame, a five-minute interruption at midnight is technically the same as one during peak promotion hours like Black Friday or Prime Big Deal Days. But in reality, the impact is not the same.

So we calculate it using the number of requests. Then why don't we include 4xx errors? Aren't they a critical indicator of client-side issues? That's right, but they can sometimes significantly distort the calculation.

Consider this scenario: if our system faces a DDoS attack, many of those requests will result in 4xx errors, like 404 Not Found, 403 Forbidden, or 429 Too Many Requests. If you include all the 4xx errors, it would distort the availability, making it appear more available than it is.

So we remove them. On the other hand, we put 5xx errors in the numerator; they clearly indicate when our system fails to provide service. The same approach applies everywhere in Buy with Prime, and shopper checkout availability is no exception.
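As a small helper, one plausible reading of the calculation just described is sketched below: 4xx errors are excluded from the valid request count entirely, and 5xx errors count as failures. This is an illustration of the idea, not the exact Buy with Prime formula.

```typescript
// Sketch of a request-based availability calculation: ignore client errors
// (4xx) entirely and treat server errors (5xx) as failed requests.
export function calculateAvailability(
  totalRequests: number,
  count4xx: number,
  count5xx: number,
): number {
  const validRequests = totalRequests - count4xx; // ignore client-side errors
  if (validRequests <= 0) return 1;               // no valid traffic to judge
  const failedRequests = count5xx;                // server-side failures
  return (validRequests - failedRequests) / validRequests;
}

// Example: 1,000,000 requests, 50,000 of them 4xx (e.g. a bot probing 404s),
// 95 of them 5xx -> availability ~= 0.9999, unaffected by the 4xx noise.
console.log(calculateAvailability(1_000_000, 50_000, 95));
```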

We place high importance on the shopper experience, which is why we have a dedicated dashboard for it. You are looking at aggregated graphs in both weekly and 30-day views. Shopper checkout looks like a single workflow, but behind the scenes it involves many APIs from different microservices.

That is why we also have a breakdown view. If any of these APIs fail, the shopper wouldn't be able to complete the checkout. If you click into the breakdown dashboard, you'll see the complete list of all the APIs and the top three availability drops.

These top three APIs are the ones with the most significant availability drop over the given time period. With that, we can identify which APIs within which specific microservices were unavailable and how that impacted the aggregated availability.

As you've seen earlier, the monitoring time frames in Buy with Prime typically span from weekly to 30 days. The short time frame spots emerging trends and helps us address issues quickly. Meanwhile, why do we have a longer time frame for the same metric? Some changes come so slowly that they might not be visible within a short-term view, so combining in a longer time frame is an easy way to prevent future potential issues.

Let's get back to the shopper checkout experience. Another key metric we care about is latency. High latency can directly impact shopper satisfaction, but it also has implications for availability, as it can result in API request timeouts. To gauge the checkout experience from the shopper's perspective, we split the checkout journey into three individual steps and measure the time for each in milliseconds.

First, when a shopper clicks the Buy with Prime button, they are prompted to log into their Amazon account.

We measure the time it takes to bring the shopper to the login page. In the second step, after the shopper logs into their account, we need the ship-to address to compute the delivery promise date, and then their payment information is linked.

In this second step, we measure how long it takes to generate the checkout page by obtaining all these pieces of information. Once the shopper completes their review, they click the place order button, and Buy with Prime starts tracking the latency to complete the payment and finally redirect the shopper to the order confirmation page.

By breaking down the shopper checkout journey into smaller chunks, we can pinpoint any specific problem along the way. We have a dedicated dashboard to monitor each individual step's latency, both at p50 and p99, and we have specific targets for them.

P50 and P99 are a common approach to measure users' latency experience. P50 represents the median response time, below which 50% of observations fall. P99, on the other hand, represents the extreme response time that only 1% of observations exceed.

So far, we've looked at some purpose-built dashboards within Buy with Prime: a high-level summary, specific technical performance, and user experience. Where does the data used there, like availability or latency, come from? It's collected by each microservice team that owns the service. It's natural for them to have that data, because they need it to efficiently manage their service, right?

But are we certain that all the different teams are following the same principles? For example, do we know if they are using the same availability calculation, or whether any metric is missing? This is exactly where CDK comes into play.

We have defined a CDK for Buy with Prime to spin up standard AWS resources that follow our own best practices. Engineers from all the different services collaborated on a common set of best practices for Buy with Prime. With this CDK, we created a central repository that is now used by all the different teams. With this, we saved 50 engineering years. Yes, I said that: 50 engineering years. That's a huge amount.

Another session covers how Buy with Prime leverages CDK constructs for scalable architecture. In today's session, we are specifically looking into the monitoring CDK, which is a subset of the Buy with Prime CDK.

The monitoring CDK consists of four different parts: dashboards, metrics, alarms, and logging. From now on, I'm going to demonstrate some CDK code snippets to give you an idea of how these components are integrated within the monitoring construct.

By the way, we released a simplified version of the monitoring CDK as open source, which I will introduce later in more detail.

There are some basic metrics to be monitored. For example, for API Gateway and ALB, it's recommended to collect HTTP 4xx and 5xx errors. In terms of compute, ECS, Fargate, and Lambda are monitored for CPU and memory utilization.

Lambda requires some additional unique metrics like invocation count, invocation duration, or max memory used. Now, let's see how these metrics are accommodated in the monitoring CDK.

Here I'm showing API Gateway. We measure latency using percentiles, represented as p50 and p99 latency, plus integration latency. Just to be clear, integration latency is the latency between API Gateway and the upstream server, whereas API latency encompasses this integration latency along with other API Gateway overhead, so it represents the total response time.

We also monitor availability. Yes, it's not a native metric of API Gateway; it's a custom metric calculated using the equation we discussed earlier. To do that, we need the number of requests and the 4xx and 5xx error counts, which are collected by the CDK.

Because these metrics are predefined in the CDK under the hood, we can mandate that all the necessary data is collected and measured in a consistent way. Here, the API Gateway metrics are exported as an interface, and with that they can be consumed at higher levels, like dashboards or alarms.
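As an illustration, a monitoring construct along these lines might look like the following sketch. The construct and property names are hypothetical, not the actual Buy with Prime monitoring CDK.

```typescript
// Minimal sketch (hypothetical names): a construct that derives the API
// Gateway metrics discussed above and exposes them through an interface so
// dashboards and alarms can consume them consistently.
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as apigateway from 'aws-cdk-lib/aws-apigateway';
import { Construct } from 'constructs';

export interface ApiGatewayMetrics {
  readonly p50Latency: cloudwatch.IMetric;
  readonly p99Latency: cloudwatch.IMetric;
  readonly integrationLatencyP99: cloudwatch.IMetric;
  readonly availability: cloudwatch.IMetric;
}

export class ApiGatewayMonitoring extends Construct implements ApiGatewayMetrics {
  readonly p50Latency: cloudwatch.IMetric;
  readonly p99Latency: cloudwatch.IMetric;
  readonly integrationLatencyP99: cloudwatch.IMetric;
  readonly availability: cloudwatch.IMetric;

  constructor(scope: Construct, id: string, api: apigateway.RestApi) {
    super(scope, id);

    this.p50Latency = api.metricLatency({ statistic: 'p50' });
    this.p99Latency = api.metricLatency({ statistic: 'p99' });
    this.integrationLatencyP99 = api.metricIntegrationLatency({ statistic: 'p99' });

    // Availability is not a native API Gateway metric; it is derived from the
    // request and error counts using the equation discussed earlier.
    this.availability = new cloudwatch.MathExpression({
      expression: '(requests - e4xx - e5xx) / (requests - e4xx)',
      usingMetrics: {
        requests: api.metricCount({ statistic: 'Sum' }),
        e4xx: api.metricClientError({ statistic: 'Sum' }),
        e5xx: api.metricServerError({ statistic: 'Sum' }),
      },
    });
  }
}
```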

Let's see how CloudWatch alarms are configured. A metric has a threshold, and the alarm is based on that, so from the CDK perspective it looks pretty much the same. But what do we do with alarms?

Remember the alarm fatigue I mentioned as one of the observability challenges. To mitigate it, we first clearly categorize the alarms based on the SLA and whether they require immediate remediation from the engineers. If they do, we automate the operational practices using CDK as much as we can.

For example, the CDK will automatically create a ticket in our ticketing system, which will be assigned to an engineer. If the engineer fails to resolve the issue within the given time window, the SLA goes higher and the issue needs to be escalated. This SLA escalation is also defined in CDK.
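A simplified sketch of the alarm side of this could look as follows. The threshold and the SNS-based hand-off are assumptions standing in for the internal ticketing and escalation wiring, which is not public.

```typescript
// Minimal sketch (hypothetical threshold and names): a CDK alarm on the
// derived availability metric that notifies an SNS topic; a subscriber (for
// example a Lambda integrating with a ticketing system) can then cut a
// ticket that carries the alarm details.
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cwActions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import { Construct } from 'constructs';

export function addAvailabilityAlarm(
  scope: Construct,
  availability: cloudwatch.IMetric,
  ticketingTopic: sns.ITopic,
): cloudwatch.Alarm {
  const alarm = new cloudwatch.Alarm(scope, 'AvailabilityBelowTarget', {
    metric: availability,
    threshold: 0.999,                 // hypothetical availability goal
    comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
    evaluationPeriods: 3,             // require a sustained breach to cut a ticket
    alarmDescription: 'Availability dropped below target; see service runbook',
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });

  // The topic's subscriber is whatever files the ticket and drives escalation.
  alarm.addAlarmAction(new cwActions.SnsAction(ticketingTopic));
  return alarm;
}
```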

This means the ticket that the CDK created, along with its automated routing, already encompasses all the details about the alarm. From there, I look at the operational runbook, which lists the specific steps I need to take.

The first thing I need to do is confirm whether the alarm is still going on and whether there is real customer impact. For the investigation, I want to introduce two powerful tools: CloudWatch Logs Insights and Contributor Insights. Let's explore how they are incorporated into the CDK.

Here I'm showing a sample Logs Insights query for the application log. First, it filters log entries where the log level is labeled as error. The results are sorted by timestamp in descending order, limited to 100 entries, and display the log stream and the logging method.

Basically, this query filters and presents error logs with timestamps so we can narrow down the time window and continue to dig in. It's a common approach regardless of the kind of microservice. Creating a Logs Insights query in AWS is not a difficult job, but this kind of query can be reused by any team.
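As a sketch of how such a shared query could be registered through CDK, consider the snippet below. The log field names like loglevel and loggingMethod are assumptions about the application log schema, not the actual Buy with Prime fields.

```typescript
// Minimal sketch: registering the error-triage Logs Insights query described
// above as a shared query definition, so every team gets it for free.
import * as logs from 'aws-cdk-lib/aws-logs';
import { Construct } from 'constructs';

export function addErrorTriageQuery(scope: Construct, logGroupName: string) {
  new logs.CfnQueryDefinition(scope, 'ErrorTriageQuery', {
    name: 'application/recent-errors',
    logGroupNames: [logGroupName],
    queryString: [
      'fields @timestamp, @logStream, loggingMethod, @message',
      // Keep only entries whose log level is labeled as error.
      "filter loglevel = 'ERROR'",
      'sort @timestamp desc',
      'limit 100',
    ].join('\n| '),
  });
}
```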

Then we accommodate these queries in our CDK to eliminate manual effort from the engineers. Again, this is one example of how we saved 50 engineering years.

Contributor Insights is another tool we often use to analyze high-cardinality data. By high cardinality, I mean that the logs contain many distinct values and dimensions. If you look at our Buy with Prime architecture at a high level, each microservice is isolated at the account level.

So the account ID can be a critical dimension in terms of dependencies. Each microservice also has multiple APIs, and as we've seen earlier, we need API-level visibility. And since we identify tenants using the tenant ID, we obviously need tenant-level visibility as well.

For example, when an incident happens, let's say the Auth service was impacted. With Contributor Insights, we can see the top contributors with 5xx errors per API, by tenant.

If your service is a monolith running on a single server, you can easily find top contributors with some Linux commands, right? But in Buy with Prime's complicated architecture, how quickly can you find who they are by tenant ID? What kind of errors are they, server or client? Which API is it? And what about the pattern: is it confined to a specific tenant, or is a particular API generating 5xx errors for multiple tenants?

Contributor Insights significantly simplifies such complex analysis. For example, if you see an account ID or API name stand out in the graph, you can immediately distinguish who is most impacted by 5xx errors and in which API.

Contributor Insights rules are also defined in CDK. Let's take a look. Here I'm showing a rule named "Get top contributors with 5xx errors by key," and I want to draw your attention to the log group name.

FireLens is a container log router for ECS, EKS, and Fargate. Together with a mechanism supported by CloudWatch, you can embed your custom metrics in the form of logs without having to call the PutMetricData API or create a metric filter. So the /aws/containerinsights/${Application} log group is where the rule looks for the 5xx errors.

That is why we have the "top contributors with 5xx errors" rule. The purpose of the key is to classify contributors: who or what is impacting our system performance the most. The key here can be the tenant ID, for example. In Buy with Prime, a tenant is identified by its tenant ID, so if we put the tenant ID in the key, we can see which tenants are impacted by 5xx errors.

If we put the AWS account ID in the key, we can understand which dependency was impacted by 5xx errors. The specific rules or key filters might differ for you, but whatever they may be, you can easily analyze them using Contributor Insights.
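A hedged sketch of such a rule defined through CDK might look like this; the log group path, JSON field names, and application name are assumptions for illustration. Swapping the key to an AWS account ID or API name gives the other views just described.

```typescript
// Minimal sketch (assumed log schema): a Contributor Insights rule, defined
// through CDK, that surfaces the top contributors to 5xx errors keyed by
// tenant ID.
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';

export function addTop5xxByTenantRule(scope: Construct, applicationName: string) {
  new cloudwatch.CfnInsightRule(scope, 'Top5xxByTenant', {
    ruleName: `${applicationName}-top-contributors-5xx-by-tenant`,
    ruleState: 'ENABLED',
    ruleBody: JSON.stringify({
      Schema: { Name: 'CloudWatchLogRule', Version: 1 },
      // FireLens routes the container logs into this Container Insights group.
      LogGroupNames: [`/aws/containerinsights/${applicationName}/application`],
      LogFormat: 'JSON',
      Contribution: {
        // The contributor key: tenant ID here; could be accountId or apiName.
        Keys: ['$.tenantId'],
        // Only count log entries whose HTTP status code is a server error.
        Filters: [{ Match: '$.statusCode', GreaterThan: 499 }],
      },
      AggregateOn: 'Count',
    }),
  });
}
```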

Here's what you can do today. If you like our observability approach using CDK, you can start from our open source. To access the repository, you can scan the QR code. I know it's a little small; I have a bigger version on the last slide, so don't worry about it.

Let me briefly introduce what we have in the open source. The CDK has a library to generate custom dashboards, plus some sample CloudWatch Logs Insights queries and Contributor Insights rules. When you deploy the CDK, it will spin up a sample web application running on ECS Fargate.

This Fargate task has two containers: one is the app, a simple web application, and the sidecar is FireLens. FireLens will collect all the container logs and send them to CloudWatch Logs, so they can be analyzed using Logs Insights or Contributor Insights.

The web application is fronted by an ALB, which is configured to collect access logs and store them in an S3 bucket, so you can also see how that S3 bucket is securely configured. And the ALB's metrics, like HTTP response code counts, are accommodated in the CloudWatch custom dashboard.

Let's recap the key takeaways from the content covered today. In the first half, we introduced a test runner that is embedded into a dedicated test stage of our deployment pipeline, so that we can validate that tenant data is securely isolated and properly protected, and we automate this whenever we release a new code change.

In the second half, I introduced our observability approach using CDK and how we enhance observability at scale. If you don't know where to start, your customers can be a good starting point for deciding what metrics you need to collect and what tools you can use.

Once again, you can access our two open source projects through the QR code and the link. The first contains a simple test runner to validate permissions using the assumed-role technique, so you can develop your own test scenarios from there without having to start from scratch.

The second link is for the observability CDK, where you can find how the different components are defined and how they relate to each other. I hope you like these two open source projects, and if you have any implementation ideas or feedback, please open an issue on GitHub. That way you can help not only us but also other developers.

Lastly, if you are interested in Buy with Prime, you can register yourself on the list. There have been other Buy with Prime sessions in the past, and if you're still interested, you can check them out later on YouTube.

I know it's getting late; thank you for coming today, and please make sure you fill out the session survey for us. Thank you for joining, and enjoy the rest of the event!
