Sustainable architecture: Past, present, and future

Good afternoon, everyone. I'm Stefan Kwal, a Solutions Architect at AWS, and my job is to help customers build sustainable architectures on AWS. This is SUS302, Sustainable Architecture: Past, Present, and Future.

Originally we planned to title this "A Sustainability Carol," but we were asked to drop that. So we went with this title instead.

Good news for you - nobody in this session will sing. What I will do instead is:

  1. Talk about the motivation of application teams in the past to drive resource efficiency in architectures.

  2. Talk about what we observe today at customers who run sustainable architectures, sustainable workloads, and sustainable IT departments at a broader scope.

  3. Share what we expect in the future to change on the customer side and AWS side to be more successful in achieving our goals.

The goal for this session is that you will leave with an idea of how you can find things to optimize and improve.

Optimization for resource efficiency is nothing new. Architects and developers are used to optimizing - shaving seconds off response times, build processes, and deployment processes.

I looked at my bookshelf in preparation and found a Java optimization book - over 400 pages of general, applicable rules for performance management. It was about creating a plan - measure, test, optimize, learn, repeat. It was very specific to the application implementation - arrays vs vectors, string handling, etc. - rather than architectural changes across the full stack.

The focus was because the audience asked for it. They had an initial budget and infrastructure boundary. Everything needed to fit within those boxes. When facing performance issues, it was difficult to increase the boundary. Instead it was easier to tweak the application to fit.

The application often needed fewer resources than the boundary, but could never use more. This started changing around 2006, when AWS announced EC2. Customers could scale out as needed for peaks, then scale back down. Thanks to economies of scale, EC2 was always highly efficient, multi-tenant infrastructure compared to on-prem, single-tenant data centers.

Because EC2 was so efficient to scale out, a rebound effect kicked in - it became popular to use more of that efficient infrastructure. As a developer tasked with speeding up an application, you now had two choices:

  1. Tweak the application code

  2. Give the application more EC2 boxes horizontally to make it faster

It's okay to use more resources, as long as business demand grows. Even a decade ago, Werner Vogels advocated for cost-aware architectures - track and model cost efficiency, use cloud elasticity, and leverage economies of scale. The Well-Architected pillars launched with cost and performance best practices too.

I can't put an exact date on it, but growing awareness of the effects of climate change also raised stakeholder pressure to become more sustainable. Customers made sustainability goals part of their decision criteria. Other stakeholders made it key in deciding who to work for and whom to invest in.

Companies communicated their goals in public pledges like the Climate Pledge. Amazon was a founding member in 2019, and it now has over 400 signatories. The pledge is to be net-zero carbon by 2040, 10 years ahead of the Paris Agreement. When companies set net-zero goals, that includes all departments, including IT.

Awareness of sustainability and decarbonization also picked up the Greenhouse Gas Protocol - a way to quantify carbon emissions in three scopes:

  • Scope 1: Direct emissions like burning fuel, wood, diesel generators

  • Scope 2: Emissions from purchased energy like electricity

  • Scope 3: Everything else - indirect emissions in the supply chain like equipment, data center materials, buildings

Carbon accounting creates a complete picture and allocates emissions to teams and products. Each team can influence the emissions related to their value - IT applications, shop floors, etc.
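As a minimal sketch of such an allocation - with made-up team names and figures, not anyone's actual accounting method - emissions can be split in proportion to each team's share of a resource metric:

```python
# Sketch: allocate an organization's carbon emissions to teams in
# proportion to their share of resource usage. Team names and numbers
# are hypothetical, purely for illustration.

def allocate_emissions(total_kgco2e, usage_by_team):
    """Split total emissions proportionally to each team's usage."""
    total_usage = sum(usage_by_team.values())
    return {
        team: total_kgco2e * usage / total_usage
        for team, usage in usage_by_team.items()
    }

# Example: vCPU-hours consumed per team in one month (made up)
usage = {"shop-frontend": 6000, "data-platform": 3000, "internal-it": 1000}
shares = allocate_emissions(500.0, usage)  # 500 kgCO2e for the month
print(shares)  # data-platform gets 3000/10000 * 500 = 150.0 kgCO2e
```

Real carbon accounting uses more elaborate allocation keys, but the principle - emissions follow usage share - stays the same.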

With raised sustainability attention, IT faces new pressure to optimize too. Sustainability is a non-functional requirement now.

That covers the history and motivation. Next I'll discuss the processes, tools, and culture we see at customers successful in optimization for sustainability.

A one-stop shop for evaluating architectures is the Well-Architected Framework. Launched 10 years ago, it provides a consistent approach via pillars for security, reliability, performance, cost and, for the past two years, sustainability.

The framework has best practices, but also important concepts like the Shared Responsibility Model, design principles, improvement processes, KPIs and metrics.

The Shared Responsibility Model means AWS and customers each do their part. AWS builds efficient infrastructure; customers make architectural decisions and write efficient code. Ultimately both AWS and customer choices lead to impact.

Customers use services efficiently. AWS architects services for efficiency.

Less abstract - let's look at service usage as an input and cost as an output. Usage has pricing terms that generate cost.

Emissions are more complex. The AWS Customer Carbon Footprint tool provides scope 1 and 2 emissions for total AWS usage. At the payer account level you see the full AWS organization. At the individual account level you see that account's emissions.

The data is available to download and analyze. It covers the past 3 years at a monthly granularity. You can also forecast changes as AWS uses more renewables.

To calculate this, AWS processes all data on actual energy used per service. The tool reports with a 3 month lag though. For continuous optimization, more timely metrics are needed as a proxy for emissions.

The AWS Cost and Usage Report provides service usage. Go further upstream to understand operational processes and business needs fulfilled. Divide business metrics by resource metrics to calculate sustainability KPIs.
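As a concrete illustration of that division - with purely hypothetical numbers - a sustainability KPI could relate vCPU-hours consumed to orders processed:

```python
# Sketch of a sustainability KPI as described above: resource usage
# divided by a business metric. All figures are illustrative only.

def sustainability_kpi(resource_usage, business_output):
    """Resource usage per unit of business value delivered."""
    return resource_usage / business_output

# Month 1: 2,400 vCPU-hours for 120 thousand orders.
# Month 2, after optimization: 2,000 vCPU-hours for 125 thousand orders.
kpi_before = sustainability_kpi(2400, 120)  # vCPU-hours per 1,000 orders
kpi_after = sustainability_kpi(2000, 125)
print(kpi_before, kpi_after)  # 20.0 16.0 - lower is better
```

The KPI falling from 20 to 16 shows improvement even though business demand grew, which is exactly the decoupling the next section describes.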

That covers the tools and metrics. Next I'll discuss the best practices.

The idea is that resource usage is decoupled from the growth of your business. When your business grows and your resource usage grows with it, the ratio stays the same. But if business demand is shrinking and you still use the same resources, that could be cause for alarm.

This ratio is nothing new - you have seen it earlier in this talk. It's what Werner called cost-aware architecture. The difference is that we look not only at cost but at resource usage, independent of any commercial instruments that are just discounts and have no direct impact on sustainability.

To get started with this approach, I want to demo quickly the Cloud Intelligence dashboards. These dashboards are an open source framework which provide dashboards for cost, for understanding Compute Optimizer, for understanding your Marketplace performance and use. And we also have a dashboard here for sustainability metrics.

To get started, let's switch to it in the browser. This is a public demo page which you can also open on your mobile or your computer. It contains synthetic data about AWS organizations and accounts, so you can explore the features. Each dashboard has multiple tabs; I've switched here to the sustainability dashboard.

In this proxy metrics dashboard, each tab has two controls. First is the date range - what range of the usage reports you want to visualize, here the last six months. Second is the organization - which payer account you want to look at.

For each of these tabs, you see the relevance to sustainability explained in text that comes from the Well-Architected pillar or from our web pages, so you can understand what each visual displays. Here you see all regions on a map, with the distribution of your AWS service usage by cost. So what do the different colors mean?

If you paid close attention to the Amazon Sustainability Report for 2022, you've seen that 19 AWS regions use electricity that is 100% attributable to renewable energy. These are presented here in a light yellowish color, and the other regions in orange. When you look at this, you should think: where do you expect your users to be? Your deployments should be in some proximity - you don't serve applications to users around the world by sending packets around the world. But also think about deploying new applications in regions where there are renewable energy projects by Amazon.

Also on this page is a development of this over time. And renewable energy is good, but the greenest energy is the one we don't use. So that's why we're also looking at usage here with the compute proxies.

This page shows compute resources - the distribution of compute across EC2, Fargate on ECS, Fargate on EKS, and Lambda. What you can directly spot on this graph is that the top account is obviously using the most resources, and all accounts here show the orange color, which means EC2. I have to scroll a bit, or enlarge the scope a bit, to see the first account that is also using Lambda and ECS.

So imagine you are the team here at the top, using exclusively EC2. Under the shared responsibility model, the utilization of these instances and keeping them patched with the latest and greatest performance improvements is your responsibility. That is different for the team below. If I scroll down, I find the same bars, but colored by the EC2 instance families that are used.

I see, for example, that the team at the top is predominantly using one instance family, the C5a. This often indicates that the team fell in love with a certain instance family and just uses it every time. But it's also an indicator that maybe a different instance type would better suit the needs of the application - like picking, instead of a compute-optimized instance, a smaller memory-optimized R instance or a general-purpose M instance to right-size.

Second, some of these are older generation types like R3 and R4 - and you know we already have much newer generations. Here we have predominantly C5 and just a single R6 instance. If the team moves to later instance types, to more modern hardware, that also brings more resource efficiency.

At the bottom of the dashboard, you also see the development over time of how the resources were used, so you can understand when there was an improvement, or when you started using so many instances, for example.

The controls here are the same as on the previous page, but we also see two additional controls: grouping by linked account, and an environment tag. So with a consistent tagging strategy, I could say I want to use a department code tag to group those bars instead of accounts.

You also see this control labeled "Include Business KPI" - that is the concept I showed before, the ratio of resource use divided by the business need. This one is backed by a table that just has random data for business metrics. Let me show you how this works.

Everything is in S3 and Athena. Looking at the business metric table, you see the business metric for each hour; these are aggregated and then joined to the usage data to create those ratios, so you understand how the business KPI has developed at hourly granularity. You can fill that with your own data: if you have an application, a certain workload, you can put its metrics in S3 and make them available to this dashboard.
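A minimal sketch of that hourly join - with synthetic data, since the demo table is random anyway - could look like this:

```python
# Sketch of the join the dashboard performs: hourly business metrics
# (as you would load them into S3/Athena) joined to hourly resource
# usage to produce a KPI per hour. All data below is synthetic.

business = {  # hour -> orders processed (hypothetical business metric)
    "2023-11-01T10": 500,
    "2023-11-01T11": 800,
}
usage = {  # hour -> vCPU-hours consumed (from the usage report)
    "2023-11-01T10": 40,
    "2023-11-01T11": 50,
}

# Join on the hour and divide usage by business need, guarding
# against hours with no business activity.
hourly_kpi = {
    hour: usage[hour] / business[hour]
    for hour in business
    if hour in usage and business[hour] > 0
}
print(hourly_kpi)  # vCPU-hours per order, per hour - lower is better
```

In the actual dashboard this join is an Athena SQL query over the Cost and Usage Report, but the ratio it computes is the same.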

And there are not only compute proxies but also storage proxies, which show the same kind of usage for EBS, for S3, and also for data transfer, all visible in this tool.

Good, going back to the presentation. Check out this tool, and let us know in the GitHub project what feedback you have, to get started with proxy metrics.

I want to quickly show examples of customers applying this concept of sustainability metrics. TUI is one of the world's leading global tourism platform companies. To reach their sustainability goals - and that's what I like about this quote - every team must contribute, so IT too, of course. Therefore they have implemented this sustainability proxy metrics and KPI approach to create visibility into the resources used by all of their application teams.

The insights are then used in regular reviews and optimization cycles, which lead to action. An example of that is Flight Margin Brain. This tool calculates 70 million fares daily on AWS. With the sustainability KPI concept, the application teams gained visibility, identified improvements and, for example, reduced the caching layer's compute resources by 40% with no impact on the customer experience.

To achieve that, TUI built dashboards like the one on the right. A proxy metric score is calculated and displayed on the dashboard. The proxy metric score is a sum of weighted resource usage combined into a single figure - that could be, for example, the number of nodes in a cluster, the vCPU-hours, the utilization, or the concurrent executions of a Lambda function.

This approach of weighted resource use also lets you factor multiple resources into a single figure, and lets you incentivize certain resource usage over others - like using EC2 Spot, or using managed services instead of self-managed services.
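A hedged sketch of such a score - the metric names and weights below are hypothetical, not TUI's actual values - could look like this:

```python
# Sketch of a weighted proxy metric score: several resource metrics
# combined into one figure via weights. Names and weights are made up
# to illustrate the mechanism, e.g. favoring serverless over
# always-on cluster nodes.

WEIGHTS = {
    "cluster_nodes": 10.0,      # each always-on node weighs heavily
    "vcpu_hours": 1.0,
    "lambda_concurrency": 0.5,  # incentivize serverless usage
}

def proxy_metric_score(observed):
    """Weighted sum of resource usage; lower is better."""
    return sum(WEIGHTS[name] * value for name, value in observed.items())

score = proxy_metric_score(
    {"cluster_nodes": 6, "vcpu_hours": 120, "lambda_concurrency": 40}
)
print(score)  # 6*10 + 120*1 + 40*0.5 = 200.0
```

Tuning the weights is how you encode incentives: raising the weight on self-managed capacity and lowering it on Spot or managed services steers teams toward the usage patterns you want to encourage.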

On the right side, in the top graph, you see the nodes this team was using. In the middle, you see the calls that run within the defined SLA, and at the bottom you see how these influence each other. First you see that at night, when customer demand is lower and the cache is not scaling in, the proxy metric score gets worse - the lower the score, the better.

But the overall reduction of nodes in that caching cluster brings the baseline down into the green area, which they defined as their goal.

The second customer example is VMware, a provider of multi-cloud and virtualization services. They are making use of sustainability KPIs to optimize for resource efficiency as well. Resource efficiency is nothing new for them either, as they have provided virtualization solutions for more than two decades.

In 2022, they started various initiatives with AWS to drive sustainable software development: to understand how their builders and engineering teams can build more sustainable software, and how to support them in doing so.

They started with incentivization - gamification, recognition of the teams who built the most sustainable designs - and they also established dedicated roles and committees to make sure that sustainability concerns are heard in the decision-making process.

Through this process of establishing a sustainable culture, VMware has learned a lot, and they are integrating these learnings into their tools to help their application teams fast-track improvements. They aligned their tooling with the Well-Architected pillar for sustainability and integrated it into their carbon measurement tooling.

They started with VMware Tanzu CloudHealth, which was originally a FinOps platform. Now it gets the additional perspective of a GreenOps overview - showing application teams not only cost, but also indicators of resource efficiency through proxy metrics and KPIs.

The tool is currently in private beta but once generally available, it can also be used by customers to identify opportunities to right size, to identify opportunities to reduce the carbon emissions and receive recommendations.

As you can see from the examples here with the dashboard from VMware and TUI, visibility is key and if you have this visibility, you should rethink some common principles.

For example: is it important to keep all the data for traceability, and for researching why something was slow and how it can be improved? Should you log everything forever?

Second, should you back up the backup of the backup?

And third, are faster responses night and day really addressing your users' main pain points if they come at the expense of the additional resources needed for them?

The pillar's best practices provide a compilation of these questions and others. It's like a checklist to go through your workload and all aspects of your application to find improvements. Each best practice describes the anti-patterns you may have, so you can recognize them, and links to implementation guides - blogs, documentation, etc. - on how to get started.

The best practices are grouped into the topics of region selection, alignment to demand, software and architecture, data, hardware and services, and process and culture.

Optimizing always requires change. Sometimes you can make this change without affecting the customer experience and without breaking SLAs, as you have seen with TUI. If you optimize, you must know what the boundary of your SLAs is, and whether you can perhaps move them to make trade-offs. That means you must know the SLA.

So ask yourself: do you know the availability SLA of the workload you built? Often that's not the case. But if you know the SLAs, you can see whether you overperform them. And if you overperform, that is also an indication of resource savings potential.

Like you could reduce the I/O and processing resources by approximating just the results, you could reduce the compute resources if you allowed higher response times for your users and you could reduce idle online capacity if your availability requirements can be relaxed.

And while we don't know what the future holds for sustainability, we definitely know that to achieve the goals we have set for the coming years and decades, we need to build on the past and present - and also accelerate.

I want to focus on the role of innovation in this future: what you can do as customers, and where we as AWS are continuing to innovate. I said both AWS and customers do their share; each contributes significantly and independently, in their own silo, to more sustainable architectures.

AWS continues to innovate for resource efficiency in the data centers - for example, in the hardware and the servers - and here I want to highlight two topics for the future.

The first one lies in the pretty near future. Amazon is on a path to power all operations with 100% renewable energy by 2025. As you can see in the Amazon Sustainability Report, we reached 90% across our business in 2022, and of course we are continuing the investments.

What does that mean for AWS users? In the demo, I already called out 19 regions that use energy 100% attributable to renewables. And it's especially important that AWS and Amazon invest in renewable energy projects in electricity grids that are heavily dependent on fossil fuels.

This year we added 13 new renewable energy projects across the Asia Pacific region - for example, in South Korea, India, and China. We also introduced the first solar farm built on brownfield, that is, land that had been abandoned due to industrial pollution.

The second topic for the future is 2030. Sustainability is not only about decarbonization, but also about other resources like water. At last year's re:Invent, we announced that we will be water positive by 2030: we will return more water to the communities in which we operate than we use in our direct operations. Since then, we have already improved our water usage efficiency metric from 0.25 liters of water per kilowatt-hour to 0.19 liters per kilowatt-hour.

The key here is increasing sustainable water use - from sources like recycled water, reusing cooling water, and supporting projects that increase water availability in communities around the world. Just this November, for example, we announced six new water replenishment projects across Australia, India, Indonesia, Spain, and the US.

And while AWS continues to invest in the resource efficiency of our data centers and services, we can do even better if you give us signals: what workloads you run, and how flexible those workloads are, to indicate where we can make trade-offs. By signaling your constraints - and perhaps which constraints can be relaxed - you enable trade-offs for increased efficiency and service optimization. That can be as simple as picking a certain service or feature over another. You shift the responsibility to AWS to optimize and make these trade-offs for resource efficiency. It's our goal to take as much responsibility as possible, so you can focus on building for your business.

Peter's keynote on Monday was essentially a love song to all the service choices you now have for running databases, caches, and data warehouses, while we take care of reducing the overhead you carry just for handling the peaks of your usage. Another example from the compute world: with the Glue Flex execution class, you can signal flexibility in terms of the priority and the time you have for executing your jobs, and AWS Glue translates that into using spare compute capacity instead of dedicated hardware for your ETL pipelines. Who has already used M7i-flex instances? Anybody? OK - we introduced them in August.

So you haven't used them yet. Similarly, you can signal flexibility in compute capacity here to allow us to make these trade-offs and increase resource efficiency. And compute is not the only area where we can do that - you can also signal flexibility in storage, while AWS makes the trade-offs for hardware efficiency and energy efficiency.

Let's look at S3 Glacier Instant Retrieval. When we look at storage, hard drives remain the best way to store and serve data that you need to be able to access immediately. But while disk sizes grew a lot over the last decades, IOPS didn't, which means that on a per-terabyte basis those disks even got slower. That does not help in scaling the demand of a single storage tier: either you have a very full but very idle hard drive, or you have very active drives that are underutilized from a storage perspective. The solution is that a share of the disk can also be used for S3 Glacier Instant Retrieval, which sees less frequent access. If you want to take a deeper look, Peter DeSantis went into even greater detail in his 2021 keynote. With these demand signals - just picking another storage class - AWS can increase resource efficiency, and, by the way, for those of you who are convinced by US dollars, also reduce the cost of your workloads.

Now we've covered the customer signals that enable trade-offs. At the same time, you know your workload best and know how to optimize it. And don't get me wrong: we don't want to increase the responsibility on your side to optimize, but we want to offer more transparency, more tools, more selection - overall, more convenience for you to make informed decisions and to make it easier for you to optimize.

If you think back to the topics of the Well-Architected pillar, most also have prescriptive tooling that continuously analyzes your usage and understands where the potential for optimization is. These tools check for typical improvements: CodeGuru can identify expensive code that you can optimize. S3 Storage Lens can help you identify cold storage buckets where you have barely accessed objects. Compute Optimizer, a very important service, checks for underutilized EC2 instances and EBS volumes and recommends a better instance type; it also gives you insights into better Lambda configurations. Just in 2023, Compute Optimizer had over a dozen feature launches with new recommendations and new insights into your workloads. And last but not least, the Cloud Intelligence Dashboards are also a tool you should use. There's another tool called Sustainability Scanner. We launched it this year; it's an open source tool that checks your infrastructure against the Well-Architected sustainability pillar best practices, and it does this at build time, before you deploy your applications to production.

It's an open source tool that validates your CloudFormation templates against a customizable set of checks. It generates a report with a sustainability score and includes suggested improvements that you can apply to your templates. It's based on AWS CloudFormation Guard, an open source policy-as-code tool for evaluating CloudFormation templates. It comes with a predefined set of rules - CloudWatch Logs retention, for example, and the use of energy-efficient configurations for instances and Lambda functions. It's not a one-size-fits-all approach: you can disable rules that are not a fit for your type of application, and you can extend the rules with those you have found to be successful in your own optimization journey.

Let's see how this works in practice. Here's a CloudFormation template; in this infrastructure-as-code snippet, it creates just a Lambda function and a CloudWatch log group. I have installed Sustainability Scanner and call it with the CloudFormation template. The output follows an inverted scoring - the lower, the better - and it fails two rules here. One is that the Lambda function's architecture configuration is not arm64, so not Graviton. The other is the CloudWatch Logs retention period, which by default is indefinite - storing everything forever. While it is very beneficial to run this locally before you check in your code in distributed teams, it's even more important to run it as part of your build and deployment lifecycle. If you're using GitHub workflows, you can use the provided GitHub Action to run it: you typically have a workflow file, and you just need to define this action and the step in which it should be executed.
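The two fixes the scanner flags can be sketched as follows - here applied to a template held as JSON, with hypothetical resource names; note that the scanner reports findings but does not rewrite templates for you:

```python
# Sketch: apply the two improvements the scanner suggests to a
# CloudFormation template loaded as JSON. Resource names are
# hypothetical and the Lambda definition is trimmed for brevity.
import json

template = json.loads("""
{
  "Resources": {
    "MyFunction": {
      "Type": "AWS::Lambda::Function",
      "Properties": {"Runtime": "python3.12", "Handler": "index.handler"}
    },
    "MyLogGroup": {
      "Type": "AWS::Logs::LogGroup",
      "Properties": {}
    }
  }
}
""")

# Fix 1: run the function on arm64 (Graviton) instead of the default x86_64
template["Resources"]["MyFunction"]["Properties"]["Architectures"] = ["arm64"]

# Fix 2: don't keep logs forever; set a finite retention period
template["Resources"]["MyLogGroup"]["Properties"]["RetentionInDays"] = 30

print(json.dumps(template, indent=2))
```

`Architectures` and `RetentionInDays` are the actual CloudFormation property names for these settings; re-running the scanner on the patched template should bring both findings to zero.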

I've already set up a new branch for that change, into which I check in and push the code. As soon as it is pushed, it gets checked in GitHub. This is the merge request that was created; checks are included, among them the Sustainability Scanner, and it runs automatically - that's the red X here. It shows the same results in a condensed form of the earlier JSON output. The person processing the merge request can then accept it, act on it, or give it back to the committer so things get fixed. I'm doing that here: I set the architecture to arm64 and set a retention period for the log group. Then I push the changes, and after a moment the Sustainability Scanner reports in another comment that the score is now zero. So it can enforce best practices for sustainability in your build and deployment process.

There are lots of tools, and in the future there will of course be more, but I want to highlight three that I've shown and map them to their use cases. If you want to report on the carbon emissions of your total AWS resources and see their history, use the Customer Carbon Footprint Tool. If you want to quantify input metrics and calculate a performance indicator to do a showback to your application teams, use sustainability proxy metrics and KPIs. And if you just want to find out what to improve and get actionable recommendations for your application and infrastructure, use the tooling I showed on the previous overview.

And what would a talk about the future be without generative AI? Generative AI has the potential to democratize AI for customers and everyone else, and we can also use it to tackle sustainability challenges. But we should ensure that we minimize the environmental impact. Among the different options for using generative AI shown here, we see customers using the full range: starting with prompt engineering, using retrieval-augmented generation (RAG), fine-tuning, and training complete models - and these options have increasing energy profiles. It's important to start with the right strategy and keep your resource usage in mind. Start experimenting with existing foundation models; often prompt engineering will give you good results. Get feedback from your customers, use fine-tuning and RAG to achieve better results, and if full training is required, use efficient silicon like Trainium. And no matter what your workload is - AI or not - make sure you apply the Well-Architected best practices and do a Well-Architected review.

And as the customer examples have shown - especially that quote from TUI - it's important that everybody in the company contributes their share to resource efficiency and sustainability. Every team needs to be involved to achieve sustainability goals. So it's not only the how but also the what: think about where the emission drivers are within your industry, and how you can use technology to increase resource efficiency and tackle sustainability challenges. Almost all industries have common needs, like energy optimization, optimizing facility and building management, carbon accounting, etc. AWS has over 200 fully featured services to build solutions for that, and the AWS Solutions Library already combines them into ready-to-deploy solutions, partner software, and guidance you can use to get started quickly with your use case.

Let's recap. I have three calls to action. First, start tracking and visualizing your progress: establish a showback mechanism for your application teams so they know how they are doing, based on the Customer Carbon Footprint Tool and on sustainability proxy metrics and KPIs. Second, if you want to optimize, start with directionally correct improvements: review the best practices of the Well-Architected pillar, use the recommendation tooling, and focus where you have the biggest opportunity to optimize. And last, use technology also to solve these data problems - to be more sustainable through the cloud, optimizing not only IT but also everything outside of IT that emits carbon.

With that, we are at the end. We are a data-driven company, and we are interested in your feedback so that we can improve these sessions. If you have another minute, please use the app to rate this session. Other than that, have a nice rest of the day, and thank you for attending this session.
