Optimizing TCO for business-critical analytics

Adam Driver: Welcome to NT 209. The host today will be myself, Adam Driver. That is my name. There's a famous actor named Adam Driver, but I'm the Adam Driver that actually works at AWS, not the one that's on the big screen. A very warm welcome. I'm joined today by Kevin Lewis.

Kevin Lewis: Hi everybody, I'm Kevin Lewis. I'm with AWS Professional Services focused on data analytics.

Adam: Great. So we're gonna be tag teaming the presentation together. By way of introduction, I lead a solution architecture team that covers our data portfolio - so that's databases, analytics, and a little piece of the AI/ML stuff that you've been hearing a lot about this week.

By a show of hands, how many of you is this your first re:Invent? Have people been to more than five re:Invents? Can I see the first-timers? Maybe a hand or two? Ok. Has anybody been to more than five? I see one in the back. Ok.

So these events have just grown across Vegas and obviously the population of folks that are joining. So a huge thank you again. I know this is a competitive time slot with Swami's keynote happening and a lot of releases, but you'll be able to watch that on YouTube later at your own convenience. So thanks again for joining us. We're super glad to have you here.

I wanted to start with a Forbes article. I think most people subscribe to different trade magazines, but this was a great one, and I'll put the QR code up here in case you want to take a picture of it or link to it. It was a great article from January 2023 on the hidden costs of moving to the cloud, and where the overspending resides in these environments given the optimization that's been happening the last few years. And given that more than half of IT budgets are soon to be dedicated to the cloud, it's important to find that overspend or waste.

And I'm sure a lot of you came to this session to find ways to save money and direct those funds to other investment areas within your organization, within your company. And we believe it's good to see how you compare to other organizations to figure out whether you're on track, whether you're ahead, or whether there are areas of optimization that could actually happen.

So that's what we hope to explore with you today. We think we have some really good strategies specific to the analytics domain to help with that and catch it early, because if we wait and we don't find it until we're already in production, it's a lot harder to build those controls in place and identify that cost and have the right projections going forward.

So let's look at a couple of facts. I like to play word games and number games. Didn't do so well on Wordle this morning, for those of you that play Wordle. But 51 and 30 - do you have any idea what those are? They're just two numbers, right? One's about half of something, the other one's about a third of something. Knowing that they're percentages helps, but you still don't know exactly what we're talking about. It could be how many eggs you can eat or how many eggs are left - we could have a lot of fun with that - but just knowing the percentages doesn't help you. These are Gartner predictions.

So I'm going to share with you what each percentage stands for, because on its own it's hard to understand what it means. The first one - 51% - is the prediction that IT spending on public cloud will exceed 51% of overall IT spending by 2025. So I don't know where you stand with your organization, but the growth of data and the investments in the cloud continue to accelerate - the hockey stick effect is definitely happening.

So I'm sure you're familiar with your cloud spending, but a lot of customers in the role I play are asking: how can I save more money? How can I optimize this to free up resources - dollars, FTEs, head count - to go invest in other things? The more important number is the 30% of waste. For every dollar, you're burning 30 cents, and if you can find a way to put that investment back into your organization, it's a tremendous windfall of flexibility and agility to invest in new projects that might be on hold, or projects that you're not able to resource at the speed you'd like.

There's a concept I'm sure you're aware of: FinOps, or cloud financial operations - all these different approaches that have emerged to help organizations answer those questions and provide recommendations. So hopefully, through some of the strategies we share today, we'll help you identify how to maximize your investments. You'll still choose where you want to invest, but at least you'll have some ability to control costs as things grow and become more and more mature.

So let's transition to a couple of themes. How many have heard that data is the new oil of the digital age? There was an Economist article that said data is more valuable than oil, and all these different takes - we all know data is important, but there are a couple of themes I just wanted to talk about. Obviously, with prioritization and sharing, this most valuable resource is continuing to grow at a phenomenal rate, and we want to make sure people have an opportunity to seize it and actually make data a differentiator for their organization.

This will allow people to make decisions faster - approve or reject a request a lot faster - but more importantly, push the data and the decisions out to the knowledge workers, the people who can actually make the right decision if they have access to this data. Organizations are constantly working to do that.

How many have heard that data is like water? There have been articles over the last 10 years saying it flows everywhere, you can be surrounded by it, but how much of it is actually usable - like how much water on earth is usable, right? We're not able to use every piece of data that we're collecting. Companies are drowning in their data: they don't know how to gain access to it, they don't know where it is, why it's duplicated or triplicated, why they're building these pipelines to move it from point A to point B with some small conversions, maybe enhancing data quality along the way.

We think there are ways to clean the data at the water-district level rather than cleaning it bucket after bucket after bucket. So there are synergies and efficiencies that can happen to allow you to maximize your data. Again, we know data is fuel - there was a tagline, riffing on "culture eats strategy for breakfast," about eating data for breakfast - but it is fuel for your organization. A lot of companies are identifying how to maximize that investment and then deliver value, and we believe data can be your differentiator.

We believe AWS, from an analytics perspective and more broadly, has some really good approaches to this that we'll walk you through. We're going to start with the five key strategies that Kevin and I have collaborated on. He'll take the first couple, I'll take the next few, and we'll share the fifth one together.

So Kevin over to you.

Kevin: Ok, strategies to optimize total cost of ownership. So we'll run through these five strategies to optimize TCO. Think of it like a whirlwind tour. We're not going to go through sort of step by step instructions. But at the end of each section, if you want to get more detail and more specific instruction on any one of these strategies, we're going to offer you some resources to do that.

A lot of the things that we're going to talk about are actually quite easy to implement, but you have to be aware of them. So a lot of what we're going to be talking about today is for awareness. You'll find - and we'll point it out as we go through - that some of these things are actually quite simple.

But before we get into the technical recommendations, let's start with sort of the least technical one, and perhaps the most important for optimizing total cost of ownership: developing a good data strategy.

Ok. So why is it important to develop a good data strategy to improve total cost of ownership? Well, first, it reduces redundant development. It turns out that the most cost-efficient data pipeline you can build is the one you don't have to build at all. As Adam was talking about, a lot of organizations have a lot of redundant development. Having a good data strategy can dramatically simplify this and reduce your need to build data pipelines from the same data sources over and over again just because different consumers have slightly different data needs.

It also provides input for optimal design. So for people who design databases and data structures and so on, it's very useful to understand how that data is going to be used and it's also useful to have some lead time. So they have the time they need to build in good design from the beginning.

And then it also reduces time to find and use data. Because if we're deploying data rationally, then it's going to be much easier for application developers and end users to find the data they're looking for because it's not as scattered all over the organization as it would be without a good data strategy in place.

And finally, and possibly most importantly, as it relates to total cost of ownership, a good data strategy helps refine and drive the scope for your data deployment projects and all of the supporting data capabilities.

So let's show you what this looks like. Data strategy involves a lot of things, but the centerpiece of a good data strategy is a roadmap - a tangible, deployable roadmap that guides your program. And the start of any good data strategy is targeted business initiatives - not data initiatives, but really just any major business initiatives. Certainly generative AI is driving a lot of net-new business initiatives for many organizations, and then within them we have the applications and use cases that we're targeting.

Ok. So this is the foundation, this is the beginning point of a good data strategy. It turns out that a lot of organizations do not start here, they'll start from the data and work their way down from there and it causes all kinds of problems.

And I'll show you how we use this to drive scope. But for our purposes here: if I understand and have some foresight into the business initiatives and use cases that I'm targeting, then I can better plan the data capabilities - the data that needs to be deployed, the data management capabilities based on specific targeted data quality issues that might impact my initiatives, what I need to do to the architecture and security, and the people, process, and operating model side. With that foresight into the business initiatives and use cases, I can plan these capabilities much more rationally, thereby reducing development time and getting much wider reuse and integration.

So when we talk about refining scope - again, a lot of organizations don't quite get this right, because they focus on the data first when in fact they should be focusing on the use cases first. If I know my specific targeted use case for the next iteration of implementation, I can use that to narrow the scope of what data needs to be delivered, what data management capabilities are needed, and so on, because it directly enables the targeted use cases within a targeted initiative. That takes a massive number of possibilities for scope and narrows it down.

So what you end up doing is basically two things with each delivery: you deliver direct business value, because you're supporting targeted initiatives, and secondly, you're building out your capability piece by piece in a shared and organized manner. This is a very significant enabler for total cost of ownership, but it also drives all other aspects of your data analytics program.

All right. So I told you this would be quick. We have put together a master class on data governance - this points to a video that goes into building out the roadmap in the manner we just described, in more detail.

All right. So now let's get into some of the more technical enablers. But before we get into any specific services, let's talk about choosing the right tool for the job, because that's the first part of optimizing total cost of ownership: making sure we're actually using the right tool in the right part of the ecosystem, and there are many to choose from.

So the first thing that you'll want is a reference architecture. Now, this is just a simplified view - again, we'll give you resources to get into more detail on how to build these. But the idea is to understand, logically, what elements you need within your data analytics ecosystem. You'll notice there are no services noted here, just functions. For example, we have the data lake, where you'll need to store high-volume raw data of all kinds and process and refine that data. We have a place for a data warehouse to enable analytics against highly structured data. Of course, components for AI, and then components for visualizing that data and developing applications and dashboards and all of those things.

So by understanding the components that I'm going to need logically, then I can take a look at the sea of possibilities around services and find where they fit within this logical architecture. So it just sort of helps me wrap my mind around all the things that I'm going to need and help organize the services and where they fit.

Now, we're not going to be able to go through all of these, but just for some examples of how this works: AWS Lake Formation obviously has a home in the data lake, to organize S3 files, provide some structure, and provide fine-grained access controls. You have EMR for complex processing of that data using Spark and Hive and Presto, things like that. Then you have Glue, which can combine data from multiple sources and link it semantically. Importantly, you'll want to take advantage of its data quality capability, so you don't have to hand-code all of the data quality rules - it can recommend rules and give you mechanisms that help accelerate your data quality processing as part of this pipeline. Then Redshift to enable data warehousing, again for analytics on highly structured data. And then OpenSearch Service is becoming more important within the AI space, for example in its role as a vector data store: when you take your proprietary company data and vectorize it, it has to be stored somewhere, and OpenSearch is playing an increasing role in that process.

So again, the point here, understand the functions that you're going to need and then use that to organize your thinking around what services are going to best meet the needs of that part of the architecture.

All right. So again, this is going to be a tour of things that we can do on an automated basis to optimize services. We'll go through some examples, but when you look these up, you'll find that they're very easy to implement.

So again, this is for awareness, and these are things that can have a pretty significant impact on your optimization. But you have to know about them, and you have to flip switches to turn them on. So you have to take some action in many of these cases.

So we're going to start with automatic scaling. We know infrastructure cost tends to increase over time as you use more and more resources. If you provision resources yourself, you have to predict that demand; this gives you more control, but it can also lead to over-provisioning, which can lead to excessive cost if you're not careful.

So with automated scaling, the idea is to let the system take a look at the actual demand and fit the scaling based on what the system is actually doing in terms of resource requirements.

All right. So again, we're just going to run through a few examples of how this works with some specific services. AWS Glue has a capability called auto scaling. Without auto scaling, you have a fixed set of workers throughout the course of a job - think of a worker as simply a unit of compute; just a simple way to think about it.

Now, with auto scaling, the idea is to let the system decide, based on the resource requirements of the job, how many workers are needed at different points within the execution of that job. And again, it's very simple to implement: you just switch it on for your jobs through AWS Glue Studio or through the command line interface.
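To make "just turn it on" concrete, here is a minimal boto3 sketch of creating a Glue job with auto scaling enabled. It assumes the `--enable-auto-scaling` job argument on Glue 3.0 or later; the job name, script location, role, and worker counts are placeholders, not anything from the talk.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical job name, script path, and IAM role, for illustration only.
glue.create_job(
    Name="sales-refinement-job",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    GlueVersion="4.0",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-etl-bucket/scripts/refine_sales.py",
        "PythonVersion": "3",
    },
    WorkerType="G.1X",
    NumberOfWorkers=20,  # acts as the ceiling once auto scaling is on
    DefaultArguments={
        # Flip the switch: let Glue add or remove workers based on the job's needs.
        "--enable-auto-scaling": "true",
    },
)
```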

But a question that's going to come up a lot here is: why don't all the jobs just work this way? How do I decide whether I want to use auto scaling or not? Why wouldn't I use it in all cases? The answer is that you have to experiment with this. There could be some latency, for example, as it shifts from more workers to fewer workers, that you want to keep an eye on. And the way I think about this - and this is going to apply to all of the optimizations we're talking about - is that very often there's an option where you can control how these things work, or you can let the system figure it out. I like to compare this to a transmission in a car: you can have a stick shift or you can have an automatic. If you're a really good driver and you really care about the small optimizations you can make by taking care of it yourself, then maybe you can get better gas mileage and better performance with a stick shift. But what we've also seen over time is that automatic transmissions get better and better, to the point where the stick shift is much rarer than it used to be. So that's a way to think about each of these services: can you do better on your own, with your own skills, knowledge, experience, and time, versus just letting the system do it?

So that's going to generalize to a lot of these things we talk about and Adam is going to talk a lot about serverless technology. So that's really kind of the thought process and where you want to use these things.

All right. So we'll go through these fairly quickly. Amazon Redshift also has a scaling capability called concurrency scaling - again, very easy to implement. You just turn it on for your Redshift cluster, and it will scale out to multiple Redshift clusters very quickly based on the needs of the workload. Now, the good news is you actually get one free hour of this per 24-hour period. You have to keep an eye on it, but we've done some analysis, and it turns out that one free hour per day is enough for 97% of our customers to turn on this capability with no additional cost. So again, you want to keep an eye on it, but this is something you need to be aware of. Very easy to implement.
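On the provisioned (cluster) side, concurrency scaling is governed per WLM queue plus a cluster parameter that caps how many transient clusters can be added. A hedged boto3 sketch of the parameter-group piece, with a hypothetical parameter group name; the matching WLM queue would have its concurrency scaling mode set to `auto`.

```python
import boto3

redshift = boto3.client("redshift")

# Hypothetical parameter group; it must be attached to your cluster separately.
redshift.modify_cluster_parameter_group(
    ParameterGroupName="analytics-prod-params",
    Parameters=[
        {
            # Cap how many transient clusters concurrency scaling may add.
            "ParameterName": "max_concurrency_scaling_clusters",
            "ParameterValue": "2",
            "ApplyType": "dynamic",
        }
    ],
)
```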

For EMR, there's managed scaling, which is very similar. It will adjust the number of instances in your cluster. Another benefit is that above a certain threshold, you can tell the system to use Spot Instances to scale up, which are much, much lower cost than other instances. So this can save you additional money, and Adam is going to talk a little bit more about Spot Instances.
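As a rough illustration, attaching a managed scaling policy to an existing cluster looks something like this with boto3. The cluster ID and capacity numbers are hypothetical; the on-demand ceiling is what pushes the extra capacity onto Spot.

```python
import boto3

emr = boto3.client("emr")

# Hypothetical cluster ID; limits are in instances for instance-group clusters.
emr.put_managed_scaling_policy(
    ClusterId="j-EXAMPLE12345",
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 20,
            # Capacity above this on-demand ceiling is provisioned with Spot.
            "MaximumOnDemandCapacityUnits": 5,
            "MaximumCoreCapacityUnits": 10,
        }
    },
)
```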

All right. Now let's talk about storage. Optimizing storage cost - and again, automating it - is something to be aware of: very easy to implement, but you do have to be aware of it. Amazon S3 Intelligent-Tiering: you're probably aware that Amazon S3 has multiple storage tiers that trade cost and performance against each other. In fact, a new one was announced recently for very high-performance workloads. With Intelligent-Tiering, you let the system figure it out: data can move automatically among tiers based on its behavior, such as the last time it was accessed. All you have to do is choose Intelligent-Tiering as the storage class when you write objects to your S3 buckets, set some parameters, and the system will take care of choosing the right tier automatically.
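A minimal sketch of two routes into Intelligent-Tiering with boto3: writing new objects straight into that storage class, and transitioning existing ones with a lifecycle rule. Bucket, key, and file names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Option 1: write new objects directly into Intelligent-Tiering.
with open("events.parquet", "rb") as f:
    s3.put_object(
        Bucket="my-raw-data-bucket",
        Key="clickstream/2024/01/events.parquet",
        Body=f,
        StorageClass="INTELLIGENT_TIERING",
    )

# Option 2: transition existing objects with a bucket lifecycle rule.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-raw-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```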

Same with OpenSearch: indexes can grow significantly in volume, and you can let Amazon OpenSearch move those indexes to the right storage tier based on a combination of cost, performance, and the parameters that you set for that process.

All right. And let's talk about performance optimization. You may be aware of materialized views. The idea here is that for frequently joined, frequently aggregated data, you can pre-aggregate and pre-join that data so that when queries are submitted, the system doesn't have to do those joins on the fly, and it will continue to maintain these views automatically so that you don't have to worry about them.

Well, there's an additional capability: in addition to creating materialized views that you see the need for based on your observations of system performance, Redshift can do this for you. Again, you just turn it on for your Redshift cluster, and it will sense, based on the workloads, what data is very commonly joined and aggregated, and it will build those materialized views and maintain them automatically on your behalf.
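The fully automated flavor is enabled at the cluster or workgroup level rather than per view, so there's little code to show beyond turning it on. For the user-defined flavor mentioned above, here's a hedged sketch of creating an auto-refreshing materialized view through the Redshift Data API; the workgroup, database, and table names are hypothetical.

```python
import boto3

rsd = boto3.client("redshift-data")

# Create a user-defined, auto-refreshing materialized view via the Data API.
rsd.execute_statement(
    WorkgroupName="analytics-serverless",   # or ClusterIdentifier + credentials
    Database="dev",
    Sql="""
        CREATE MATERIALIZED VIEW mv_daily_sales
        AUTO REFRESH YES
        AS
        SELECT sale_date, store_id, SUM(amount) AS total_sales
        FROM sales
        GROUP BY sale_date, store_id;
    """,
)
```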

Ok. And OpenSearch has a similar capability with Auto-Tune. The idea is to optimize things like memory, cache, and queue sizes - things it can adjust automatically based on the workload. You just turn this on for your OpenSearch domain and it will take care of it for you; some of these changes happen automatically throughout the day, and others may have to occur in a maintenance window that you define.
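Turning Auto-Tune on is roughly a one-call change; a hedged boto3 sketch with a hypothetical domain name (changes that need a blue/green deployment are scheduled separately, for example via a maintenance or off-peak window).

```python
import boto3

aos = boto3.client("opensearch")

# Enable Auto-Tune on an existing managed domain (name is a placeholder).
aos.update_domain_config(
    DomainName="observability-logs",
    AutoTuneOptions={
        "DesiredState": "ENABLED",
    },
)
```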

And again, OpenSearch is getting much more important in architectures due to its capability to store vector embeddings for generative AI. So as you explore OpenSearch for that purpose or just search, you're going to want to be aware of this auto tune capability because again, very easy, very effective, but you have to be aware of it, you have to turn it on.

And finally, for software optimizations, let's talk about what you can do with SageMaker automatic model tuning. Again, all of these have been examples - for every service, you want to explore and look for the automation and what you can do with it. But let's talk real quick about SageMaker.

So, automatic model tuning. The idea is that when data scientists are building models, they may change what we call hyperparameters across different executions of training and testing. For example, the number of layers in a neural network is a hyperparameter that data scientists will adjust, then test, and see how it works. Well, with automatic model tuning, you can take all of these hyperparameters and ask the system, within certain boundaries, to run and test, run and test, adjust, and then give you the most optimal output.

So that's going to save you a lot of time. And also, with this capability you can use Spot Instances, because it does checkpointing. The idea is that when you use Spot Instances, sometimes they can be taken away, which could disrupt the job. But with checkpointing, it can automatically restart those jobs so you don't have a significant impact. So you get to take advantage of Spot Instances without worrying about potential disruptions - again, a big, big savings for you.
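Here's a hedged sketch of what that looks like with the SageMaker Python SDK: a tuning job over a couple of hyperparameter ranges, training on Spot capacity with checkpointing so interrupted jobs can resume. The image URI, role, bucket paths, metric regex, and ranges are all placeholders.

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

# All names, URIs, and ranges below are illustrative placeholders.
estimator = Estimator(
    image_uri="<your-training-image-uri>",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,                              # train on Spot capacity
    max_run=3600,                                         # cap on training seconds
    max_wait=7200,                                        # must be >= max_run with Spot
    checkpoint_s3_uri="s3://my-ml-bucket/checkpoints/",   # enables safe restarts
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_acc=([0-9\\.]+)"}
    ],
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-4, 1e-1),
        "num_layers": IntegerParameter(2, 8),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)

# Kick off the tuning job against a placeholder training channel.
tuner.fit({"train": "s3://my-ml-bucket/train/"})
```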

All right. That was a lot. That was a very quick tour. This is a resource that you can use to do both of these things. You can look at the right tool for the job and it's going to help guide you on what are the right analytic services to use throughout the ecosystem. And then for each one of those, you can go deeper into what kinds of optimizations are available for each of the specific services.

Kevin: So now I'm going to turn it over to Adam, where we are going to talk about choosing the right infrastructure.

Adam: All right, thank you, Kevin, for covering those two strategies. I also think there's an important strategy embedded in a couple of these that Adam, our CEO, covered in his keynote. If you didn't get a chance to see it, it's definitely worth diving into this zero-ETL concept.

We talked about it at a couple of different events, and specifically at the New York Summit, where we're taking Aurora instances and automatically moving that data into Redshift. I think there were four announcements yesterday - some were DynamoDB into OpenSearch, which Kevin was just covering, and then some other RDS automatic zero-ETL integrations. Again, a huge time saver: you don't have to use Glue to move the data, you don't have to use Informatica, one of our great partners, to move that data. It happens seamlessly, which is fantastic.

I think there might be something playing in the front there - sorry, I thought it was maybe mine, but it wasn't. So, on to the third strategy: choosing the right infrastructure. Again, I think zero-ETL is something to keep an eye on. We're investing a lot there, so please dive into it and see if it's appropriate for your organization.

All right. So it all starts with the choices that we make. In fact, there were silicon announcements discussed in the keynote as well. We have a couple of different options here. I think most people are probably familiar with our Nitro System - a lightweight hypervisor with dedicated Nitro cards. I'll keep it at a 200 level, but that's one choice, and it's a great system that a lot of customers are using. Graviton - I think Graviton4 was announced yesterday, right? So we continue to innovate, which is incredibly important for our business, to support customers getting more for less. And last but not least, this is where we're investing a lot of time: the Inferentia chips and Trainium environments - machine learning acceleration, high-performance computing, advanced computing, really large numbers of inputs coming into these environments. Trainium2 was just announced and will be available; I think Graviton4 might be in preview. Those were covered during Adam's keynote yesterday. But we know that having choice here allows you to build and deliver the environments that make the most sense.

I'm going to transition to touch on serverless, because serverless is something a lot of folks are interested in. Lambda was one of the first serverless services that we provided, and there are others, but analytics has been investing a lot of effort to reduce total cost of ownership across all of the different services listed here. So this is a comprehensive list. Kevin touched on Redshift; I just want to call out Athena for interactive analytics. There's a tremendous number of connectors for Athena, and a lot of people like getting that interactive view across multiple data sets and being able to derive quick insights really easily. So having Athena in our portfolio - a service that has always been serverless - has really helped customers out.

EMR Kevin touched on, for big data with Spark, Hive, Presto, etc. And I just wanted to highlight QuickSight: if you hadn't heard, QuickSight was part of the keynote yesterday with some storytelling capabilities, where you can just write in natural language and it will build a storyboard and a narrative for you. You can say you want it really short or really long, and these are the charts and visualizations you want to include or exclude, and it will handle that with coaching and guidance, in natural language, right there on the screen. Really cool new capabilities. But I want to go to OpenSearch Service and follow on with what Kevin was talking about, because this is a really fast win for customers.

I'll share a customer example later on the classic log-explosion problem: bringing logs in, being able to quickly gather them, and giving you insight into those logs - obviously setting thresholds and figuring out where there might be dependencies or errors over time. The serverless deployment of OpenSearch allows you to automatically take care of everything you need under the covers.

Now, the key is to have thresholds with some of this, because sometimes this is where waste can creep in. You design a system, you deploy it, and you want it to fit within a certain range: you want the performance to be the best, and you're happy if it scales up, but it should also scale back down when it doesn't need to be at the capacity it currently is. Guardrails need to be put in place - that's the right way to say it - so that you're providing the right guidance and expectations. So again, it's easy to administer and really cost-effective, scaling up and scaling down, and it learns over time. It's not just a point in time: it learns your applications, the usage of those applications, and what the predictable performance has been, and feeds that back in to manage it automatically for you.

So with OpenSearch, I've seen customers get rolling with this in a few weeks - in some cases a little longer to get to production - but it's one of those quick wins if you need indexing. And as Kevin talked about, vectorizing your data with the vector capability that's built in takes you into that generative AI space.
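For the guardrails point above, OpenSearch Serverless lets you cap account-level capacity and then create collections against it. A hedged boto3 sketch with hypothetical names and OCU limits; note that an encryption security policy covering the collection name must exist before creation succeeds.

```python
import boto3

aoss = boto3.client("opensearchserverless")

# Guardrails first: cap how far serverless capacity in the account may scale.
# The OCU (OpenSearch Compute Unit) limits here are placeholders.
aoss.update_account_settings(
    capacityLimits={
        "maxIndexingCapacityInOCU": 10,
        "maxSearchCapacityInOCU": 10,
    }
)

# Then create a collection; VECTORSEARCH is the type you'd pick for embeddings.
aoss.create_collection(
    name="product-embeddings",
    type="VECTORSEARCH",
    description="Vector store for generative AI retrieval (illustrative)",
)
```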

The second one I wanted to talk about is MSK, so streaming - obviously having this be serverless and allowing people to consume it quickly. There are five things I wanted to touch on. The first is capacity on demand: not just giving you the capacity, but charging you only for what you actually use, rather than giving you more than you need and you using a subset of it. Then you pay for processing time - it's throughput, not end-to-end processing time: what is it actually doing, how are we streaming this, are we streaming it directly into Redshift or into a variety of different data stores that could be capturing it? One thing that is really attractive to customers is the automatic partition placement, which can drive performance or drive savings in cost.

So there are some choices that are made. The fourth one is compatibility: we want to make sure all of these serverless options are compatible with the cluster-deployed versions. Maybe I didn't cover that at the beginning, but there's a choice - for Redshift, you can do a managed, clustered environment where you do the configuration and setup, you control and configure everything, and you make Redshift yours because you have the skill set and the capability, or you can take Redshift Serverless.

We want to make sure you can mix and match. You can use Redshift Serverless, maybe as a test environment or a pre-prod environment, and once you know the workload and exactly what its performance characteristics need to be, you can choose what you use in production - you can mix and match across environments. And we want these serverless analytics products to be fully compatible, so you're not rewriting anything, you're not testing and retesting to make sure something didn't change; we make that as simple as we possibly can. That really drives consistency, so you can innovate and deliver new workloads instead of spending your time testing and retesting along the way.

So let's look at one more, and this is one Kevin talked about with Redshift. I'm going to highlight the box in red, because I think this is where, over time, Redshift is learning from itself, and serverless is the one that is able to do that. You can take the logs from the cluster - the self-managed version - and do your own analysis, but you also have the flexibility to use serverless, where we will figure out what the usage patterns are, whether it's a small, medium, or large query, and exactly how, from a workload management perspective, it should be executed and what that approach should be. That will allow you to optimize your costs.

So again, these are fundamental things: automatic maintenance, patching while it's still running - so we can upgrade and patch certain parts of the cluster and then work our way across others with zero downtime - a really attractive feature of serverless that customers are embracing. I don't know if we have any heavy Redshift users in the audience, but for net-new workloads, people have been starting with serverless to at least get that foundation in place, and then deciding, when they get to that pre-prod environment, what is best for their environment. So we're happy to answer questions along the way on some of this too.

Pricing comes up. I don't spend a lot of time on pricing, but I always think it's good for presentations to cover some of the pricing options, because this is about TCO, right? And there are some choices you can make, or influence within your organization, to say: you know what, we might be able to save money if we do the following. So that's what I wanted to cover here.

There are three primary pricing approaches, and I'll go through this fast, but it's also helpful in case someone's not really familiar with them. On-demand: this is the classic stateful, spiky workload. I want to spin up an instance, I want to spin up an account, I want to start using it, and you pay for what you use at the time. It's on-demand pricing: we give you the resources when you request them, when you hit the deploy button, when you start configuring. That's one of the most common ways that folks get started. It's not the cheapest - it's on demand, so we're making sure we have the capacity available for you whenever you need it.

We've seen a lot of customers moving to Savings Plans, and a Savings Plan can give you up to around 72% savings versus on-demand pricing over a one- to three-year commitment. So on-demand, on the left, is probably the most convenient but doesn't have the built-in savings; a Savings Plan, in the middle, will definitely cut the cost down for what you're using and deploying. It's a commitment to a certain amount where you have a steady-state usage workload: I know what the workload is - we moved an Oracle database to Oracle on EC2, or SQL Server; those are common things people do where the usage isn't going to change and the application is moving as well. It's predictable: we know exactly what it costs and what value it provides to our organization.

The third option is Spot, and Kevin touched on this. These are ephemeral instances: I just need a little bit more horsepower, I don't want to pay full price for it, and it's not a stateful application - it's more of a fault-tolerant application. If it stops for five seconds and then picks up again, that's ok for this type of app; it's flexible, it's a stateless application. Spot Instances are really good for additional capacity - huge savings - but they're not always available, you might not get exactly what you want, and they might not be there as long as you want. So Spot is one where you can drive a lot of savings, but you have to identify the workload and make sure you're prepared for all the resources being there, then maybe being gone for 10 seconds, and then coming back.

So those are the different pricing options. Having said all this, most customers use all three, right? It's not "I'm on the left with on-demand only" or "I'm only doing Spot" - it's figuring out the workload, figuring out your usage. You're probably very familiar with your usage, but we're going to cover some things that will help show you what it is, plus some newly announced capabilities for maximizing the purchase options.
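As a toy illustration of why mixing the three matters, here's some back-of-the-envelope arithmetic with entirely made-up rates and hours; real numbers depend on region, service, instance family, and commitment terms.

```python
# Hypothetical hourly rates, for illustration only.
on_demand_rate = 1.00                        # $/hour baseline
savings_plan_rate = 0.28 * on_demand_rate    # ~72% off at the top end of Savings Plans
spot_rate = 0.30 * on_demand_rate            # Spot discounts vary widely

steady_state_hours = 500   # predictable base load  -> Savings Plan
variable_hours = 180       # spiky interactive use  -> on demand
batch_hours = 50           # interruption-tolerant  -> Spot

blended = (steady_state_hours * savings_plan_rate
           + variable_hours * on_demand_rate
           + batch_hours * spot_rate)
all_on_demand = (steady_state_hours + variable_hours + batch_hours) * on_demand_rate

print(f"All on-demand: ${all_on_demand:,.2f}/month")
print(f"Blended mix:   ${blended:,.2f}/month")
```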

This is a chart that is sometimes more helpful than the left-to-right view of three choices - it's more of a bar chart. On the bottom is the Savings Plan, on-demand is the orange in the middle, and Spot is the green at the top, in case someone's color-blind (I should have shaded these differently). The Savings Plan is that consistent amount at the bottom - it's a known workload: my data warehouse, my ERP system, it might be anything, but it's consistent, the same amount every month, maybe going up or down a hair but not much. On-demand picks up a little more for some experimenting, and then Spot Instances let you do those workloads like data ingestion or data quality, where as long as it gets done by the morning, it doesn't have to get done within the hour.

Here's an example with EMR. These are just four different options, and your pricing will vary. On the left is an on-premises environment - this could be a Cloudera environment, a Hortonworks environment, or MapR, if people are still using MapR, running on premises.

For big data workloads there are server costs, network costs, labor costs - you can read the box, there are a lot of costs in there. Then, when you get to EMR - the second box is running Amazon EMR in the cloud - you end up with some fees associated with it, you've got some support costs, and you'll negotiate pricing, but we've seen it come to almost half of what it costs to run on premises.

Um moving that workload to Amazon and then we've got optimized where you're actually investing and optimizing and looking at different ways to cut costs. So like that is the instance type that you're using, whether you're using Graviton or, and what version of Graviton are you using? And then if you're really running with EC2 spot instances, you can see that cost drop dramatically.

And then serverless is the other component. I want to call out that serverless appears to be cheaper than optimized - we drew it that way in the graphic - but if serverless is something you're using and it's a known workload that you're able to optimize, it might actually be cheaper to use optimized Amazon EMR rather than going serverless. It depends on the workload, on its elasticity, on the performance characteristics, and on how much work you're willing to do to manage that environment versus saying: serverless, I want you to take care of it.

As Kevin was saying with the transmission analogy: maybe you take care of how I get to 60 miles an hour. As long as we get there, I don't want to have to grind through two or three gears while I have a cup of coffee - no, not a phone - in my hand. However you want to get there. Ok.

Glue Flex pricing - I just wanted to cover this briefly because Kevin touched on Glue, so I wanted to give you the pricing context for Glue. We have two options, standard and Flex. Standard has been around for a very long time and will continue to be there; it's a great option to get things started - it's the standard execution class. But Flex allows tremendous flexibility in how people leverage Glue, which is our data integration capability and is highly used in these analytics deployments. We're really seeing the ability to control cost for less time-sensitive workloads with the Flex controls we can put in place.
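Switching an existing job to Flex is roughly a one-argument change at run time; a minimal boto3 sketch with a hypothetical job name.

```python
import boto3

glue = boto3.client("glue")

# Run an existing job on the Flex execution class; Flex suits jobs that can
# tolerate variable start-up and run times in exchange for a lower rate.
glue.start_job_run(
    JobName="nightly-sales-refinement",
    ExecutionClass="FLEX",
)
```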

Provisioned capacity for Amazon Athena is another one we've invested a lot in. There used to be really just one choice - you would pay per query - and now you can also pay for provisioned compute. We touched on Amazon Athena: if you just want to do some interactive analysis and connect to a variety of different data sources, it's a really great solution to get started. And it's pay for the queries that actually run - $5 per terabyte scanned. Very reasonable, and a great way to minimize costs until you figure out exactly what you need.

Rather than just overspending - that's the term I would use - and figuring out later how to cut it back, start with something like Amazon Athena. What we've added, though, is the ability to request and reserve blocks of capacity, so you pay for the capacity when you activate it - almost like a reserved instance. I didn't really cover reserved instances in the pricing section, but it's like a reserved set of hours: I know I want to crunch a significant amount of work, I want to pay for that compute and have it ready to use, and then have it go away - I only need it for two hours a week, as an example.
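A hedged sketch of both models: the on-demand arithmetic at roughly $5 per terabyte scanned, and reserving a block of dedicated capacity and assigning a workgroup to it. The reservation name, DPU count, and workgroup are hypothetical.

```python
import boto3

# On-demand Athena pricing is roughly $5 per TB of data scanned (check your
# region), so a query scanning 200 GB costs on the order of:
tb_scanned = 200 / 1024
print(f"~${5 * tb_scanned:.2f} for this query")

# The newer alternative: reserve dedicated capacity (in DPUs) and assign
# workgroups to it. Names and sizes below are placeholders.
athena = boto3.client("athena")
athena.create_capacity_reservation(
    Name="weekly-crunch",
    TargetDpus=24,  # reservations start at 24 DPUs
)
athena.put_capacity_assignment_configuration(
    CapacityReservationName="weekly-crunch",
    CapacityAssignments=[{"WorkGroupNames": ["heavy-batch-workgroup"]}],
)
```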

That brings us to the last strategy - we're almost there, I think we have about 15 minutes remaining. We can obviously field a couple of questions if you have any - it's an intimate group, spread out a little bit - and otherwise we will definitely meet you in the back for any questions you might have.

The last one is monitoring, because if we're not capable of not just log monitoring but actually monitoring the applications, the entire foundation, the entire workload in the end - and if we don't make that easy - then that's a spot where waste starts coming in, right? Because people are so focused on standing up the workload and optimizing the environment that they perhaps don't see the forest for the trees.

So we've provided a couple of capabilities, and I wanted to start with Expedia. I don't know if there's someone in the audience from Expedia - I'm guessing not - but Expedia is a customer we've been working with a lot, and their challenge was: how do I take all these logs from AWS services, non-AWS services, ISV services, our own internal work - how do we bring all of that in, via CloudTrail, application logs, even Docker logs - and funnel it to Amazon OpenSearch? And the reason they did that is they didn't have the manpower or the infrastructure to handle the growth of these logs.

We've been talking about the growth of data, but the growth of logs is equally cumbersome and can be challenging: how can we scale the limited resources we have to spend on this to meet the data requirements? And so, through a variety of pilots and experimentation, they ended up using Kibana as a front end on top of Amazon OpenSearch to get - I think it's less than 30 seconds - almost real-time investigation into their issues.

We all have issues, but identification is key. So these things would bubble up, and they were able to integrate it with a lot of different identity management solutions from AWS and really allow it to be dynamic and deliver that application monitoring. A lot of people are calling it observability - there are a lot of different variants of that - but this is one that was really straightforward.

But in terms of managing the cost of it, they went through four steps. The first was to plan and evaluate: what does that migration look like, what does the pricing calculator say, what is this going to cost, what kind of instances do I need, is this a Savings Plan candidate or something that can run in a Spot environment - how crucial is it? Then, figuring out what those costs will be, how do we want to manage them within our accounts and our instances - maybe it's linking a couple of accounts together and taking a larger budget approach.

And then, how do we want to deliver it? How do we make it more deliverable and actionable, to identify and find new opportunities across a variety of different cloud financial management areas? And finally, where do we think we can save? So again, these are four steps that a lot of customers - you all - are doing too, but would you want us to help you automate it?

And so we're going to walk through a couple of things that we believe we're delivering, and continuing to deliver, to take some of this work away and wrap it up in something like an application, so you can follow the investment and follow the spend across your environments.

The first one is the AWS Cost and Usage Report. I hope you have seen this - maybe other people in your organization are responsible for it - but it allows you to organize and understand your data, with customization of exactly what resources you are using, by environment, by application, by account, etcetera. And that is incredibly helpful, rather than trying to do this on your own, perhaps with Tableau reports or QuickSight reports or Kibana dashboards.

Then we believe the Trusted Advisor service, along with AWS Budgets, really helps you monitor your spending: you can create budgets for different accounts, different organizations, different parts of the business, and you can project and compare actuals versus forecast - are we ahead of our spend, are we behind, are we finding savings along the way, where do we actually sit? And Cost Explorer - with reserved instance pricing, standard pricing, Savings Plan pricing, and so on - these things allow you to really right-size and spot anomalies as well.

So if I know I'm planning to spend X and we're way over X, then what changed? Why is this an anomaly? Why is this an outlier rather than in line with what I expected? There are some really powerful dashboards, and Kevin's going to show you some of the applications that have been built that let you click through rather than building it on your own. Obviously, in some cases you'll modify or customize it to meet your needs.

And last but not least is Compute Optimizer - coming up with recommendations for you rather than having you do the analysis and ask, ok, what do I do next? Here are recommendations we think will help guide you, based on the resources, their utilization, and their performance characteristics across a couple of different services, as well as the sizing of those. There are a couple listed there - EC2, EBS, and AWS Lambda.
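If you want to pull those recommendations programmatically rather than reading them in the console, a minimal boto3 sketch looks something like this (it assumes the account has already opted in to Compute Optimizer).

```python
import boto3

co = boto3.client("compute-optimizer")

# Print current instance type, top-ranked alternative, and the finding label.
for rec in co.get_ec2_instance_recommendations()["instanceRecommendations"]:
    current = rec["currentInstanceType"]
    options = rec.get("recommendationOptions", [])
    if options:
        print(current, "->", options[0]["instanceType"], rec["finding"])

# Similar calls exist for other resource types, e.g.:
# co.get_ebs_volume_recommendations()
# co.get_lambda_function_recommendations()
```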

So with that, I'm going to pass it over to Kevin for one example that he's going to walk through, and then we'll bring it to a close.

Kevin: All right, thank you, Adam.

All right. So what Adam just went through on cost monitoring - these are things that are available out of the box that you can use right away. But what I want to talk about is turning the dial up a bit, because some of the more mature organizations want to be much more proactive in monitoring their costs and include things that are very specific to their business, so they can analyze costs within their environment even more than what you get out of the box.

I'm going to talk about some things you can do to bring that kind of company-specific information into this cost management process. The first is to take advantage of tagging. It's a fairly simple and straightforward way to do this, but you have to have a tagging strategy. The basic idea is that you can label resources based on a number of organizing criteria.

So for example, you can organize it by application, by team, by department; you can organize it by environment - you can have multiple tags, with tags for development, test, and production. That way, when you're using the services that help you analyze cost, you can analyze it by your own organization's structure, applications, and so on, which helps you know where the costs are coming from: not just the individual services you're spending money on, but which applications, which departments, which environments, and so on.
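Once tags are in place and activated as cost allocation tags, you can slice spend by them programmatically. A hedged sketch using the Cost Explorer API, with a hypothetical tag key and a fixed example month.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Break one month's cost down by a cost-allocation tag; the tag key
# "application" is hypothetical and must be activated for cost allocation first.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "application"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]          # e.g. "application$customer-360"
    amount = group["Metrics"]["UnblendedCost"]["Amount"]
    print(tag_value, f"${float(amount):,.2f}")
```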

Another thing customers commonly do is look at unit economics. You'll find that over time, as usage goes up, cost tends to go up. Now, is that a good thing or a bad thing? Well, it might mean that you're actually driving a lot more value to the business.

So in order to fully analyze this, it's very useful to understand the cost per some unit that you define - for example, cost per customer transaction. If the cost per customer transaction is actually going down, that might be an indicator that you're managing your costs well, even if total costs are going up, because you're driving more value to the business.

So understanding unit cost, in addition to total cost, can be very insightful. You have to figure out what that unit is for you - what unit represents value to your business - and you can pick more than one, so that when you analyze unit cost, you can analyze it by a number of factors that are specific to your business.

For example, things like daily active customers, users of applications or of the website, page clicks, miles driven in a vehicle, or seconds of video streamed. You just have to think, for your business, about the things that are enabled by the analytics capabilities and the cloud services: what are the units of value that you want to track?

And then what you can do is take data from the services provided to help you manage cost, combine that with your value drivers - and again, you can have the tagging strategies and those things built in - and bring it into a dashboard that you can continuously improve and use to monitor, set up alerts, and so on. And with these new generative AI capabilities, you're going to be in a position to use natural language to investigate where your opportunities to improve cost are.
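The calculation itself is deliberately simple; the work is in picking the unit and sourcing both numbers reliably. A toy sketch with invented figures:

```python
# Toy unit-economics calculation: all numbers are invented for illustration.
monthly_platform_cost = 42_000.00     # e.g. from Cost Explorer, filtered by tag
customer_transactions = 3_500_000     # a business value driver you track yourself

cost_per_transaction = monthly_platform_cost / customer_transactions
print(f"${cost_per_transaction:.4f} per customer transaction")

# Trend it month over month: total cost can rise while unit cost falls,
# which usually means the platform is delivering more value per dollar.
```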

So let's bring this full circle as we wrap up. Basically, what we're showing here is a business use case. We talked about the roadmap at the beginning - just think of cost management as another use case, another initiative to put on that data and analytics roadmap.

And once you do that, as you implement your solutions, you can take the information that comes from those implementations and feed it back into your cost management solution. For example, you can begin tracking labor costs - what does it cost to implement the solution through these various work streams - and you can add new value drivers based on the new applications you're implementing, so you can extend the unit economics.

In addition to total costs - again, this is total cost of ownership - we want a good understanding of people, process, and technology, and a good data strategy, a good roadmap, with work streams for all of those elements. Those are all opportunities to feed back into this dashboard to get a complete picture of your ecosystem and the cost and value associated with it.

And then of course, you can leverage the very data architecture that you're using to enable other use cases. So again, this becomes just another use case that you leverage your data architecture to enable. Ok. Again, that was very quick, but this is a good resource that will take you step by step through this whole unit economics process and how to build capabilities so that you can enable a dashboard like the one that we showed you.

So I will let Adam wrap up. Go ahead, Adam.

Adam: Yeah, we're just going to do this together. So we're at the end of what we wanted to cover with you, but we did want to share some more QR codes - I hope we're not drowning you in QR codes. There's one on the left for some cost optimization training that's available, which you can sign up for to go deeper on cost optimization and identifying that waste.

And on the right was just a link to ProServe, in case you need help with additional outside resources coming in from AWS to maybe help drive that forward. So with that, we're at the end of the material. We would love to entertain some questions in the back, or right outside of this room, if you have any.

But a huge thank you for joining us on this journey of level-200 TCO. A shameless plug: we do like surveys, we love data, so please just give us your feedback in the survey.
