Manage resource lifecycle events at scale with AWS Health

Hi, everyone. As we kick off, can I get a quick show of hands from anybody who's had challenges managing database engine version upgrades or keeping current on EKS versions across large sets of AWS resources?

All right, several people raising their hands, so you're in the right place. Today we're talking about AWS Health planned lifecycle events and how they simplify the process of managing and tracking these events to completion.

My name is Andrew Riley; I'm a Principal TAM in AWS Enterprise Support, and I'm joined by James Aitken, Senior Engineering Manager for AWS Health.

Hi, I'm Patty Rafferty from the Cloud Business Office at Cox Automotive.

All right. I'm going to introduce you to lifecycle events: why they exist, what your expectations should be, and how AWS Health plays a role in keeping you informed of these events and helping you track through them.

I'm also going to cover best practices and key areas I want you to keep focused on while you're working through these events, and ways you should consider integrating AWS Health into your processes.

James is going to deep dive on new features in planned lifecycle events and how they simplify all of these processes.

And Patty's going to walk us through Cox Automotive's customer journey managing lifecycle events at scale in a large organization.

We'll also try to leave a little time for Q&A at the end. With that, let's get right into it.

So what exactly are lifecycle events? The simplest way to think about them is that these are changes which require action from you to avoid impact to your services and the applications you run on AWS.

They are different from standard maintenance events like hardware maintenance or reboots. For a database engine upgrade, for example, you might need to test that your stored procedures or database plugins work and perform the same way before and after that upgrade.

Part of the reason these exist is that the open source managed products you use, like Amazon RDS or Amazon EKS, follow community support timetables.

That's a certain number of months or years of standard support before their lifecycles end, and before that lifecycle ends, you need to have been aware of the event, planned for that upgrade, and tested and validated your applications to ensure they keep running smoothly.

It's good to think of this in terms of the AWS shared responsibility model.

In the shared responsibility model, AWS is responsible for the resilience of the cloud, and you are responsible for your resilience in the cloud.

The areas to highlight here: as the foundation for your applications on AWS, the services you use are responsible for keeping current on the versions they provide to you.

And AWS Health is responsible for keeping you informed of these events: when they occur, what's required of you, and what dates are involved.

On your side, you're responsible for maintaining awareness of these events, planning around these lifecycles, and making room for your teams to get the work done.

You want to ensure your teams have the tools and processes in place to work these events with minimal impact to your applications. And to give you a sense of the upcoming events: for 2024, these services require routine upgrades as part of their normal lifecycles.

Configuration changes like certificate rotations or version upgrades can fall into this category, and as each lifecycle event ends, you want to keep in mind when the next one is occurring and be able to plan around that.

So you want a recurring process to manage that, and this is where AWS Health forms an integral part of that process.

The primary purpose of AWS Health is to keep you informed when an event at AWS might impact your applications or your resources, and these events will take different forms.

For instance, during a service event, the purpose of Health is to keep you informed about the ongoing event: what it affects, which Availability Zones or Regions a service might be impaired in, and what actions you might take to mitigate that event.

Events can also be forward-looking, like planned lifecycle events, where AWS Health's role is keeping you informed of the event, giving guidance around working through it, and allowing you to plan around these events within your normal planning cycles.

You're probably familiar with the email notifications that come from AWS Health; that's normally where customers get started with this. But we want you to be aware that the same data is available through multiple endpoints.

That's the AWS Health Dashboard, events delivered to your default Amazon EventBridge bus, and, for customers using Business or Enterprise Support, the AWS Health API. This is to enable you to use these events in the way that works best for you and your teams.
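To make that concrete, here's a minimal sketch of the API route, assuming a Business or Enterprise Support plan (the Health API requires one) and the global endpoint in us-east-1; it simply lists upcoming scheduled-change events.

```python
# Minimal sketch: list upcoming scheduled-change events via the AWS Health API.
# Assumes a Business or Enterprise Support plan; the Health API uses a global
# endpoint served from us-east-1.
import boto3

health = boto3.client("health", region_name="us-east-1")

paginator = health.get_paginator("describe_events")
pages = paginator.paginate(
    filter={
        "eventTypeCategories": ["scheduledChange"],
        "eventStatusCodes": ["upcoming", "open"],
    }
)

for page in pages:
    for event in page["events"]:
        print(event["eventTypeCode"], event.get("startTime"), event["arn"])
```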

The goal is that AWS Health gives you the ability to quickly determine what's important to you: what services are affected, which versions are coming, and what actions you're responsible for. Through Health, you should be able to handle these events across your organization with minimal manual effort managing and reporting on the event itself, allowing you to focus on doing the work involved, the upgrades or other actions required of you.

In part, this comes from providing data programmatically, reducing your need to query resources through the console or APIs, or to escalate to your account teams or the service teams for further clarification on these events.

And last, to allow you to confirm that you're done, that you've resolved any risks, and that you're ready to move on.

All right, so how do you get started managing lifecycle events? There are three areas I want you to keep in mind here: identification, tracking, and access.

For identification, you need awareness of these events. How are you doing that? What's your procedure to keep in touch here?

Which versions and which dates are you responsible for? By delivering Health events, this is a key function of AWS Health: keeping you informed of the event.

First, you need to know what you need to test and execute on. Are these database engine version upgrades? Are they certificate rotations? Do you have the right process in place to manage these?

Again, AWS Health provides descriptive information about the event: what documentation or best practices are available, with links out to blogs that will give you guidance on ways to work through this event with less effort on your part.

As an example, one of the events for Amazon RDS pointed to one of their newer features, blue/green deployments, so you'd be able to test against your production database without causing any impact.
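As a hedged illustration of that RDS feature, the sketch below stages a blue/green deployment with boto3 so an engine upgrade can be rehearsed against a synchronized copy before switchover; the deployment name, source ARN, and target version are all hypothetical.

```python
# Sketch: stage an RDS blue/green deployment to test an engine upgrade before
# switchover. All identifiers below are hypothetical placeholders.
import boto3

rds = boto3.client("rds")

response = rds.create_blue_green_deployment(
    BlueGreenDeploymentName="orders-db-mysql-upgrade",          # hypothetical name
    Source="arn:aws:rds:us-east-1:111122223333:db:orders-db",   # hypothetical source ARN
    TargetEngineVersion="8.0.36",                                # version under test
)
print(response["BlueGreenDeployment"]["Status"])
```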

Do you have a complete picture of the affected resources for one of these events? And what's the scope of that across your organization?

Here, by providing resources with the event, AWS Health is reducing that secondary reporting I talked about and allowing you to get started.

And if you're in a large or distributed organization, how are you identifying who the owners of these resources are and getting them to take action?

Here, having resource IDs or ARNs in the event allows you to enrich this with further data from your change management databases or ITSM systems. Even the tags on these resources help you identify owners, the stage of the environment, and the priorities on these resources.
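A small sketch of that enrichment idea: pull the affected resource ARNs from a Health event and look up their tags. The event ARN is a placeholder, and tag keys like "owner" and "env" are assumptions about your own tagging scheme.

```python
# Sketch: enrich a Health event's affected resources with their tags so work
# can be routed by owner or environment. Tag keys and the event ARN below are
# placeholders/assumptions.
import boto3

health = boto3.client("health", region_name="us-east-1")
tagging = boto3.client("resourcegroupstaggingapi")

event_arn = "arn:aws:health:us-east-1::event/RDS/AWS_RDS_PLANNED_LIFECYCLE_EVENT/example"  # placeholder

entities = health.describe_affected_entities(filter={"eventArns": [event_arn]})["entities"]
arns = [e["entityValue"] for e in entities if e["entityValue"].startswith("arn:")]

# GetResources accepts a list of ARNs (up to 100 per call) and returns their tags.
for mapping in tagging.get_resources(ResourceARNList=arns[:100])["ResourceTagMappingList"]:
    tags = {t["Key"]: t["Value"] for t in mapping["Tags"]}
    print(mapping["ResourceARN"], tags.get("owner"), tags.get("env"))
```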

For planning: what does your process for work or sprint planning look like?

What cycle are you on? How much lead time do you need to take on one of these events? And can you make space for your team to get the work done?

AWS Health's goal here is to give you the most timely notice of these events, so they aren't surprises and you can incorporate them into your normal planning cycles.

That way you can avoid interrupts and urgent actions on your part. You want to determine how far you've progressed through one of these events and, more importantly, use data to do it.

Do you automate this today? I know many people use spreadsheets to track a list of resources and check them off as they go. But by programmatically integrating with AWS Health events, you can take this information into other tools like ITSM, ticketing, or issue-tracking systems, and get a picture of your status through these events, who the responsible parties are, and where they are in the process.

And now you can see that across the organization, potentially. How do you know you're done?

How do you know, and have confidence, that you're complete on this event? What's your source of truth? Is it that spreadsheet I talked about, can you lean on an ITSM tool here, or do you need something external and authoritative to let you really know there are no more resources out there that you're not aware of?

For those endpoints that I mentioned, what's the best way for you to consume these events? You're all receiving these emails, but what about the dashboard? Do you work in the GUI? Is that the best way for you to stay informed of these events and see the upcoming ones, or do you work programmatically?

Does delivery through EventBridge work for you? Can you create push notifications or project these into other systems you use through there? And again, the Health API is available if you'd like to integrate into your own custom tooling. Who needs the ability to see these events?

Health delivers these events to every individual account in your organization, but what does your organizational structure look like?

Do you have a cloud center of excellence or central platform teams that are responsible for each of these services and shepherding the individual owners through them?

By enabling organizational view for AWS Health, you can get a consolidated view of all the events and all the resources in your organization in one place and get a better handle on that.
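For that consolidated view, here's a minimal sketch using the organization-level Health API, assuming organizational view has been enabled and this runs from the management account or a delegated administrator account.

```python
# Sketch: organization-wide listing of upcoming scheduled changes via the
# Health organization APIs. Assumes organizational view for Health is enabled.
import boto3

health = boto3.client("health", region_name="us-east-1")

paginator = health.get_paginator("describe_events_for_organization")
for page in paginator.paginate(
    filter={
        "eventTypeCategories": ["scheduledChange"],
        "eventStatusCodes": ["upcoming", "open"],
    }
):
    for event in page["events"]:
        print(event["eventTypeCode"], event.get("startTime"), event["statusCode"])
```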

OK, the last thing here is which tools are best for your teams to use. There are a lot of options out there and a lot of different ways teams work: some teams work in ChatOps, some do work assignments through ticketing systems.

I want you to think about whether you're an organization with operational teams that don't use the AWS console, and maybe they aren't even receiving those emails if they're not on the right lists.

How are you making this data available to those teams? Here I want to stress the importance of leaning on tools and ticketing to make this work, because they are much more accountable than "did you see that email?"

In working with you on integrations like these for AWS Health, we've heard certain feedback.

First up: you really want to enable straightforward integrations with these other tools, and you want descriptive information that's programmatically accessible.

That means metadata properties on these events. If you think of those emails, there's a lot of information in them: the services, the versions, the dates things are occurring. But while humans parse text really easily, it's difficult to create rules around it, especially with multiple dates in the emails or other things you might want to factor in.

You want a better way to do this. You want the resources consistently available in a standard format.

This allows easier enrichment activities and programmatic integrations: looking them up by the same IDs in other systems, in those change management databases or ITSM systems you're using, to see what stage of environment or which owners are involved. And you want organizational visibility.

Many organizations have very strict rules around how the organization management account is used in their AWS organization.

You want to avoid workarounds, like having to get approval for a least-privilege role to assume in that account just to get this information.

So you've asked for something better, and here's James to talk about how AWS Health has addressed this and more.

Thanks, Andrew. I'm now going to talk to you about the improvements we've made to AWS Health to help you better manage planned lifecycle events.

We've delivered these improvements across the three key areas Andrew was just talking about: identification, tracking, and accessing these events across your organization.

First, I'm going to show you how we're using standardized data and processes to help you identify these events early and give you more time to assess the impact and plan the mitigation across your organization.

Secondly, I'll show you a new feature we call resource burn down, which allows you to natively track progress against these events in AWS Health.

Finally, I'm going to talk to you about the improvements we've made in our support for AWS Organizations to help you manage these events at scale.

First up, let's talk about how we're making it easier for you to identify these important events.

Previously, when we wanted to communicate about planned lifecycle events, we used the account notifications category in AWS Health, and there were a few problems with this approach.

Firstly, we used a generic operational notification event type, making it difficult to distinguish these events from the myriad of others you would have in this category.

Secondly, you would have to read through lengthy text in the event description to figure out what change was happening, when it was happening, et cetera. This is very time consuming and also doesn't lend itself to automation.

Thirdly, these types of events are visible on the AWS Health Dashboard for a period of seven days before they move into the event log, making it difficult to find those events again.

With this change, we are moving planned lifecycle events into the scheduled changes category on AWS Health, and they will remain visible there for the entire duration of the change, meaning you will not lose visibility of these important events before you've had a chance to take action. We've also standardized the naming of these events.

They will contain the phrase "planned lifecycle event", and this change is also reflected in the event type code that's used, which I'll talk more about when we come to the API and EventBridge schema changes.

Another issue we had with account notifications is that in order to send you reminder notifications about these events, we would have to create new events with new event ARNs.

Once again, this created work for you to read through these event descriptions to figure out: is this an event I've seen before, or is it a new event?

For planned lifecycle events, the event ARN is going to remain fixed throughout the whole course of the event, and you will receive a single event ARN per Region for all the resources in your organization.

Now when you share links to that event ARN, you know it won't change throughout the course of the event.

An exception here will be for those of you with very large organizations, where perhaps one day you have hundreds of thousands of resources impacted by a planned change, in which case we would have to send you multiple event ARNs.

But once again, those event ARNs would remain fixed throughout the whole course of the event. Another thing that was difficult with account notifications is that you would have to read through those event descriptions to figure out when the change would actually happen.

Generally for account notification event types, the start date of the event represented the date and time at which we sent the notification to you. For planned lifecycle events, the event start time will reflect the time that the change is actually going to take place: for example, when software reaches end of support, when we start doing auto upgrades, or when a certificate expires. This makes it really easy for you to see at a glance when the change is going to happen, and for machines to reason about it.

As I mentioned previously, these events will remain visible in the scheduled changes tab on the AWS Health Dashboard until the change date passes, and the event status code will reflect whether the event is still upcoming, ongoing, or completed. These events will be archived 90 days after the change date passes.

To help you assess the impact of a planned lifecycle event on your organization, you can now see at a glance how many resources are affected by the event on the AWS Health Dashboard, in the affected resources column. The total reflected there will be for your account if you're using the account view, or for your whole organization if you're using the organization view on Health.

This functionality is also available via APIs using DescribeEntityAggregates or DescribeEntityAggregatesForOrganization; these APIs are available for those of you with Enterprise or Business Support plans. We're also making planned lifecycle events resource specific wherever possible, so you know exactly which resources are in scope for the change.
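Here's a minimal sketch of the account-level aggregate call; the event ARN is a placeholder, and the organizations variant follows the same pattern with per-account breakdowns.

```python
# Sketch: affected-resource counts for a planned lifecycle event without
# listing every entity. The event ARN is a placeholder.
import boto3

health = boto3.client("health", region_name="us-east-1")

event_arn = "arn:aws:health:us-east-1::event/EKS/AWS_EKS_PLANNED_LIFECYCLE_EVENT/example"  # placeholder

aggregates = health.describe_entity_aggregates(eventArns=[event_arn])
for aggregate in aggregates["entityAggregates"]:
    print(aggregate["eventArn"], aggregate["count"])
```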

Previously, there was a lot of variability in the way resources were specified on planned changes, if at all. For some events, we would just specify which account contained affected resources, and this would once again leave you with work to do to figure out which resources you needed to take action on: for example, which Lambdas do I need to upgrade the Node.js version on? Other events would specify resource IDs only, making it difficult for you to carry out automated actions, like looking up who the resource owners are. Along with this, we'll also be making sure that you receive full resource ARNs on planned lifecycle events, allowing you to carry out automations like enriching the data.

We understand that you, like us, operate on long planning cycles, so we want to make sure that you have enough time to plan for these planned lifecycle events. Whenever possible, we are making sure that you get at least six months of notice for major changes and at least three months for minor changes. There will obviously be some exceptions to this, for example when there are urgent security upgrades that need to go out.

On the AWS Health Dashboard, we've introduced a new calendar view, which provides a way to visualize these changes month by month. You're able to go back three months to look at past events, as well as up to a year in advance to see all the planned lifecycle events that are coming your way, either for your account or your organization. And in this calendar view, you're able to drill into the details of those events just as you would from the table view.

I'm now going to take you through some of the schema changes you'll find on the APIs that support these changes we made to planned lifecycle events. Just to note that these schema changes are backward compatible, so if you do have any existing integrations, they will not be broken by these changes.

As I mentioned before, you'll be able to recognize these events by their event type codes, because they will have a planned lifecycle event suffix, and the event type category will be scheduled change. For events that are still in the future, the event status code will show as upcoming. When the event date has passed, or you've carried out all actions on the planned lifecycle event, the status code will show as closed. We've also added additional event metadata into our APIs to help you determine the exact change that's taking place.

For example, for software version deprecations, you'll find a deprecated version field which gives you the exact version being deprecated. You'll have noticed earlier that there are, for example, multiple Kubernetes version deprecations happening in 2024.
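A hedged sketch of reading that metadata through DescribeEventDetails follows; the event ARN is a placeholder and the exact metadata key names vary by service, so treat the deprecated-version key as illustrative.

```python
# Sketch: read the structured event metadata and description for a planned
# lifecycle event. Event ARN is a placeholder; metadata keys vary by service.
import boto3

health = boto3.client("health", region_name="us-east-1")

event_arn = "arn:aws:health:us-east-1::event/EKS/AWS_EKS_PLANNED_LIFECYCLE_EVENT/example"  # placeholder

details = health.describe_event_details(eventArns=[event_arn])
for detail in details["successfulSet"]:
    print(detail["event"]["eventTypeCode"], detail["event"]["statusCode"])
    print(detail.get("eventMetadata", {}))  # e.g. a deprecated-version entry
    print(detail["eventDescription"]["latestDescription"][:200])
```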

To recap this section: planned lifecycle events will be found in the scheduled changes category with standardized naming, and we will use a single event ARN across your entire organization to communicate all updates on a planned lifecycle event. The event start time will reflect the time at which the change actually comes into effect. Planned lifecycle events will be resource specific, with the full resource ARNs supplied to you, and we'll be giving you enough time to plan for these events, with at least six months' notice of major changes.

The next section talks about how we're helping you track your progress against these events using our dynamic resource burn down feature. Previously, when using account notifications, you would once again get a static view of all the resources at event creation time, and this would leave you with work to do to create in-house solutions to manually track those changes, like Andrew was referring to earlier; maybe you used spreadsheets to track your actions against these resources.

Now, for planned lifecycle events, we make use of the resource status code to indicate whether or not action still needs to be taken on a resource. In the case where action is still required on your part, the resource status code will be pending, and once you've taken action, for example performing an upgrade, the resource status code will change to resolved. On the AWS Health Dashboard, you'll be able to see a convenient breakdown of the pending versus resolved resources for your account or organization, depending on the view you're using. And this information is also available via our entity aggregate APIs for those of you with Enterprise or Business Support plans.
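To illustrate, here's a small burn-down sketch that tallies pending versus resolved entities for one event; the event ARN is a placeholder, and a Business or Enterprise Support plan is assumed.

```python
# Sketch: compute a simple burn-down for one event by counting affected
# entities per status code. The event ARN is a placeholder.
from collections import Counter

import boto3

health = boto3.client("health", region_name="us-east-1")

event_arn = "arn:aws:health:us-east-1::event/RDS/AWS_RDS_PLANNED_LIFECYCLE_EVENT/example"  # placeholder

counts = Counter()
paginator = health.get_paginator("describe_affected_entities")
for page in paginator.paginate(filter={"eventArns": [event_arn]}):
    for entity in page["entities"]:
        counts[entity.get("statusCode", "UNKNOWN")] += 1

print(dict(counts))  # e.g. {'PENDING': 2, 'RESOLVED': 1}
```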

To see how this works, I've got a simple example I'm going to walk you through. You have three running RDS MySQL instances supporting your business applications. Two of these instances you created a long time ago, so they're running on MySQL 5.7. When the planned lifecycle event for MySQL 5.7 kicks off, these two instances will be picked up by a scan by the RDS service team. An event will be created for you listing these two resources as requiring action, with their status codes set to pending, and you'll receive a notification of that event on the AWS Health endpoint of your choice.

For example, on the AWS Health Dashboard, through the APIs (once again, if you have an Enterprise or Business Support plan), or on Amazon EventBridge if you've configured a rule for Health. Once you receive that notification, you decide to take action, and you do an upgrade on one of those database instances. After some period of time, the RDS team does another scan of all your resources and detects that you performed this upgrade.

That event will be updated, the resource status code set to resolved, and you'll once again receive a notification about that change on the endpoint of your choice. When you've carried out all the actions across your organization, the event status code will change to completed on the dashboard, or it'll show as closed if you're using the APIs. Should any new resources be detected using old versions, for example if someone spins up an old version of MySQL in your org, that will be detected and added back to the same event, and the event status code will reopen.

So you will always have an accurate view of all the required actions across your organization for this planned change. Let's look at the schema changes for resource burn down, which affect the affected entity object. Just as mentioned before, the full resource ARNs will be found in the entity value field on the affected entity, and the last updated time will be the time of the last scan of the resource we performed where a change was detected.

As I mentioned, any resources that still require action will have a pending status code, and once you take action, that status code will change to resolved and the last updated time will be updated accordingly.

To recap this section about tracking planned lifecycle events: we use a resource status code of pending or resolved to indicate whether or not you still need to take action on that resource, and we regularly update those resource status codes to make sure you have a dynamic view of resource burn down across your organization using AWS Health. The event status code will change to completed on the AWS Health Dashboard, or closed if using the APIs, once you've taken action for all resources for that event.

The last section I'll be talking about is accessing these planned lifecycle events at scale. We've heard from a lot of you that you've not been able to make use of the AWS Health organization view, either on the dashboard or the APIs, because you don't have access to the management account for AWS Organizations. Many companies have strict policies around accessing that account.

We're happy to announce that you now have the ability to use a delegated administrator account to access all of the Health events across your entire organization. For example, you can now log into the AWS Health Dashboard with a delegated administrator account to see all of the events across your entire organization, or use that delegated administrator account to access our organization view APIs if you have an Enterprise or Business Support plan.
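Here's a minimal sketch of that setup, run from the management account: enable organizational view for Health and register a member account (placeholder account ID) as the delegated administrator.

```python
# Sketch: enable the Health organization view and delegate administration to a
# member account. Run from the management account; the account ID is a placeholder.
import boto3

health = boto3.client("health", region_name="us-east-1")
org = boto3.client("organizations")

# One-time: allow AWS Health to aggregate events across the organization.
health.enable_health_service_access_for_organization()

# Delegate Health administration to a member account.
org.register_delegated_administrator(
    AccountId="111122223333",                 # placeholder member account ID
    ServicePrincipal="health.amazonaws.com",
)
```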

Another recent development is that we've introduced the Health organizations view on our EventBridge endpoint. Before now, you would have to go and set up rules for the aws.health event source in every account across your entire organization to receive Health events for all of those accounts. Now, when you enable AWS Organizations for Health, in addition to sending the events to the member account event bus, we send copies of that event to the event buses of the management account and any or all of the delegated administrator accounts you've set up for your organization.

This means that you're now able to process a single stream of events on a single event bus to capture all of the Health events across your entire organization. This greatly simplifies the setup of AWS Health on EventBridge, and it also helps you avoid missing events when new accounts are created in your organization and perhaps you forget to set up an EventBridge rule for that account.

Now let's look at the schema changes for EventBridge for the Health detail type. Just before we get into that, a quick note on the account field used in the base EventBridge schema: if you're using the organizations view for Health, that account will be either the member account, the management account, or a delegated administrator account.

I'll talk in a later slide about how we're helping you distinguish which is the actual account where the change is happening. As before, you'll find the event type codes, and these will have the planned lifecycle event suffix. A really neat thing you can do on EventBridge now, using the suffix matching pattern, is create a rule to capture all planned lifecycle events, and then you can use that rule to send those events to a different target for specialized processing.
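Here's a sketch of such a rule created with boto3; the Lambda target ARN is a placeholder, the suffix string assumes the planned lifecycle event naming described above, and a Lambda target would also need a resource-based permission for EventBridge, which is omitted here.

```python
# Sketch: an EventBridge rule that suffix-matches planned lifecycle events and
# routes them to a dedicated target. Target ARN and suffix string are assumptions.
import json

import boto3

events = boto3.client("events")

events.put_rule(
    Name="planned-lifecycle-events",
    EventPattern=json.dumps({
        "source": ["aws.health"],
        "detail-type": ["AWS Health Event"],
        "detail": {
            "eventTypeCategory": ["scheduledChange"],
            "eventTypeCode": [{"suffix": "PLANNED_LIFECYCLE_EVENT"}],  # assumed suffix
        },
    }),
)

events.put_targets(
    Rule="planned-lifecycle-events",
    Targets=[{
        "Id": "lifecycle-handler",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:lifecycle-handler",  # placeholder
    }],
)
```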

As I mentioned before, we're using a single event ARN for all communication about a given planned lifecycle event, but we will be sending you multiple updates; for example, as you carry out actions, we will send you updates on EventBridge for resources that get resolved. So we've introduced a communication ID field to help you distinguish those multiple communications on the same event. As with the APIs, we include the event metadata, which will include things like the specific versions that have been deprecated.

Full resource ARNs are provided as before, and this is helpful on EventBridge because it helps you create automations and perform data enrichment easily, for example looking up resource owners or even carrying out an automated action on a resource. The last updated time once again reflects the time at which the change was detected.

As I mentioned, you'll be receiving messages on EventBridge indicating which resources you've resolved once you've taken action. If any new resources were detected in the interim, these will also be included in the update on EventBridge, and you will find a status code of pending, which indicates those. As I alluded to earlier, we've included an additional field in the Health detail type to help you determine which is the affected account, and unsurprisingly, that's called the affected account field.

Another feature we've introduced on EventBridge is pagination of Health events over multiple EventBridge messages. You might be aware that there's a maximum size limit for an EventBridge message, and that limit can easily be exceeded when there are many resources associated with a planned lifecycle event. So we've developed the ability to paginate, as I said, over multiple messages, and to support that we've introduced two new fields which help you track the total number of pages to expect as well as the page number you're receiving. The communication ID field will also contain a suffix including the page number, to help you distinguish the paginated results for the same communication.
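A rough sketch of stitching those pages back together before processing follows; the field names (communicationId, page, totalPages, affectedEntities) and the page-suffix separator are assumptions to verify against your own events.

```python
# Sketch: buffer paginated Health messages by communication ID and process them
# once all pages have arrived. Field names and the page-suffix format are assumptions.
from collections import defaultdict

buffers = defaultdict(dict)  # base communication id -> {page number: entities}


def handle_detail(detail: dict) -> None:
    base_id = detail["communicationId"].rsplit("-", 1)[0]  # assumed page-suffix separator
    page = int(detail.get("page", "1"))
    total = int(detail.get("totalPages", "1"))
    buffers[base_id][page] = detail.get("affectedEntities", [])

    if len(buffers[base_id]) == total:
        entities = [e for p in sorted(buffers[base_id]) for e in buffers[base_id][p]]
        print(f"{base_id}: {len(entities)} affected resources across {total} page(s)")
        del buffers[base_id]
```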

Finally, we've added some new out-of-the-box integrations for ServiceNow and Jira Cloud using the AWS Service Management Connectors. Using these connectors, you're able to create issues or tickets in these systems in response to Health events without the need to do any coding. I'd like to invite you to join Andrew later today for a demo session to learn more about how to programmatically integrate with Health to get these events into the operational tooling of your choice.

OK, to recap this section: we've introduced delegated administrator support for AWS Organizations across all of the Health endpoints, and you can now receive Health events on the delegated administrator account's event bus, so you have a single stream of all Health events across your whole organization in that event bus.

With that, I'm going to hand over to Patty, who's going to talk you through how Cox Automotive is using AWS Health for planned lifecycle events.

"You can see our brands listed across the bottom here. These are all well known in the automotive industry and a couple of them, particularly AutoTrader and Kelly Blue Book, are well known to the public as well.

And I'll just say before I move on from this slide, as we've brought all these brands together over the years, the cloud has been a big power up for us in terms of streamlining development platforms and workflows.

Cox Automotive has more than 40,000 dealer clients across five continents. We partner with more than 200 OEMs, that's vehicle manufacturers, and we've got more than 1,600 lender clients in our network. Take all that together and we touch three out of four vehicle transactions in North America in at least one way.

We've got some of our sizzle stats here. I'm just gonna pick a few of them. So we list almost 5 million vehicles daily and that generates 80 million leads, 48 million credit applications and 18 billion vehicle valuations annually.

We're a software company. Our products are software products. And the name of our agile development organization is the Cox Automotive Product Group. We've got over 500 scrum teams organized into 100 plus release trains that roll up into 30 plus delivery streams. And at the top we've got 10 portfolios and I share all of this just to give a bit of a sense of scale, the breadth and depth of our organization as we start to think about today's topic.

A quick look at our AWS estate - we've always had a multi-account strategy and today we're running 500 plus workloads across 1,400 plus accounts again just for some scale.

Okay, that brings me back to the Cloud Business Office, which is our Center of Excellence for cloud optimization. We have other Centers of Excellence as well for things like product management and agile solution delivery. And we think of these as force multipliers across our company.

For the CBO specifically, some of our main responsibilities include cloud partner management, cloud architecture guidance, cloud cost optimization, and cloud native upskilling.

Here's a quote from Chris Dillon, our VP of Architecture: "All of our architects have a goal to modernize our ecosystem, driving toward a healthy modern manageable technology environment. The new and improved AWS Health will help us measure our success against that goal and deliver actionable information to the teams that need it."

So when we think about all of this, AWS Health and planned lifecycle events, we think of it two ways. The first almost goes without saying - we need to avoid production impact. And the second is what Chris is talking about here, this ongoing initiative to modernize our ecosystem. And in trying to achieve that historically, we have definitely run into some of the challenges that Andrew talked about earlier.

We needed a consistent lead time - we had dev teams that just couldn't plan and execute this work in a consistent, predictable way, and as a result, we've had some last-minute scrambles to get some of this work done at the 11th hour. We needed to identify impacted resources - this is a big one. Sometimes we can build dashboards from our own systems; that's kind of been our best-case scenario. Sometimes we can't, and we have to lean on our account team to feed us static lists of impacted resources, and sometimes they can't do it themselves either. They have to go all the way back to the service team, and that just gets too slow and cumbersome.

We needed better visibility - our dev teams have got to be able to see their local impact quickly and easily. And we have relied on sort of a patchwork of dashboards, spreadsheets and Slack channels.

We needed structured data to link workloads and teams - the email notifications are informative but they're free text and without being able to do much programmatically, this has been a manual effort historically.

And we needed enterprise oversight - a centralized team like mine needs a top level view to drive this work to completion across the organization.

So we're excited for the powerful new features coming to address these issues:

  • Standardized lead time for predictability and planning - James mentioned that we'll get 6 months of lead time whenever possible for major changes. Now we operate on a quarterly planning cadence. So if you think about how that plays out, it means that no matter when during a quarter an event is announced, we still have a whole other quarter ahead of us before we're going to get to that final date. And that means we can just work it into our regular planning cycles, which is great.

  • Standardized ARN format is big. This eliminates the manual reporting that I talked about a minute ago altogether. And so instead of having to do that work for each event, we'll get a definitive listing of impacted resources straight from the source.

  • Burn down tracking for distributed visibility - this will give our dev teams quick and easy access to see their local impact, to understand which resources need to be addressed, and track their progress against that.

  • The EventBridge integration gives us the structured data we need to link in workloads and teams - more on that in a minute.

  • And the last one here, the delegated administrator feature for enterprise oversight - this means that someone like me doesn't have to mess around in the management account in order to oversee all of this and I definitely prefer not to do that.

Let's take a look at the integration that we built. Starting on the left here, these are our member accounts, and we've got Health events being consolidated into the delegated admin account with EventBridge. Then we've got an EventBridge rule, just like James mentioned, to filter for planned lifecycle events specifically. We invoke Lambda to write those to S3, and then S3 bucket notifications feed those through SQS to trigger the ingestion into our Snowflake data warehouse.
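Here's a rough sketch of just the Lambda stage of a pipeline like that: take the planned lifecycle event from EventBridge and land it in S3, where bucket notifications can feed the warehouse loader. The bucket name and key layout are hypothetical.

```python
# Sketch of the Lambda stage: write each planned lifecycle event from
# EventBridge to S3 for downstream ingestion. Bucket name and key layout are
# hypothetical.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "health-lifecycle-events"  # hypothetical bucket


def handler(event, context):
    detail = event["detail"]
    key = "planned-lifecycle/{service}/{account}/{event_id}.json".format(
        service=detail.get("service", "unknown"),
        account=event.get("account", "unknown"),
        event_id=event["id"],
    )
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode("utf-8"))
    return {"written": key}
```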

You also see in the top right ServiceNow, which is where we keep our catalogs of team data and workload data. And we have a similar ingestion pattern to bring that information into our Snowflake data warehouse as well.

And so at the end of all that, what we have is all the information we need to operationalize this work brought together in one place.

Here's a quick screenshot of a dashboard we built on top of that. This is just the header and a nice quick visual of impacted workloads, but there are a couple of elements I want to highlight here:

  • The first one might not look like much - it's just a humble service filter. But previously, we've had to build a new dashboard for each event. What we have now is an automated mechanism where these events are all coming into one system programmatically without having to do that manual effort, and then we can come along and filter for the one we're interested in. So it's actually kind of a big deal for us.

  • And the other one is the enriched detail view - apologies if this one's a little bit of an eye chart, and we've got the account numbers and the ARNs blurred out here. This is a listing of some resources impacted by the EKS 1.24 deprecation, which we used as our example here. And of course, we get the resources from AWS Health through EventBridge. But if you look at the next columns - portfolio, delivery stream, release train - I talked earlier about how those are the levels in our product group organization. These are the teams responsible for these resources, and then all the way to the right we've got workload identifiers. So now we understand the applications and systems that are impacted here as well.

And so this is merging all the data to make this operational for us. We can now make sure that the appropriate leaders and teams are aware and have this work planned and scoped and we can track it through to completion.

A quick look at our roadmap because there's more we'd like to do here as well:

  • The first thing is probably integration with our agile management system. Now that we know the workloads and teams associated with the events and resources, we can drop agile artifacts right in the backlogs of the dev teams that need to take action, which means they can plan and execute that work in the system they use every day.

  • And another potential roadmap item on the horizon is to leverage our tagging strategy, so we can essentially attach the events themselves to the impacted resources using tags. That would give us a whole new, or additional, perspective on impacted resources across our estate - across services, across events, across teams, across workloads - however we want to look at it.

That's the Cox Automotive story on AWS Health and planned lifecycle events. Thank you very much, and I will hand back to Andrew.
