Reduce your operational and infrastructure costs with Amazon ECS

Thank you for joining everyone and welcome to today's session.

In case you're in the wrong room, we're doing CON308, about reducing your operational and infrastructure costs with ECS. My name is Vibhav Agarwal, I'm a senior product manager with Amazon ECS. I've been with ECS for about two and a half years now.

Over the course of this time, I've had an opportunity to interact with a number of customers, help them in a number of different ways, and also impact the product right across the data plane, the control plane, the scheduler, deployments, and so on.

A quick heads up: we have a lot of content to share with you all today, so we might not have a ton of time left at the end for Q&A, but I promise to stick around. I'm also around till Friday, here at re:Invent. So if there are any questions that you would like to talk about afterwards, please do feel free to reach out.

So before we go ahead and look at the agenda for today's session, I wanted to take a moment to set a quick preamble. It's adapted from the AWS Well-Architected Framework. In principle, my pitch to you today is to share all of the best practices that I've learned speaking to customers and our principal engineering team over the past two and a half years - best practices which, when utilized well, can help you save up to 30% in cost, or in some cases even 73%, which we've seen with our customers.

Quickly walking through the agenda for today: I'll start with a quick overview of Amazon ECS for those who might not be super familiar or not already using ECS. After that, I'll spend a bunch of time talking about how you can optimize your workloads with ECS, across three subsections. And finally, I'll hand it over to Francis, one of the customers I've worked very closely with over the past two and a half years, who will be sharing the journey of how they have built their platform on ECS and how they've been able to run their organization in a super efficient manner.

A quick show of hands before I move on: how many in the audience are using ECS today? Pretty cool, so about 70% of folks. For those in the audience who aren't using ECS already, ECS is a fully managed container orchestration solution. ECS was introduced about eight years ago at re:Invent, and since then we have added a number of capabilities over time. When it started, you could use ECS to run your containerized applications on EC2. Later, in 2018, we introduced Fargate, which is a serverless compute engine for running containers. And most recently, in 2021, we launched ECS Anywhere, which allows you to run containers on your own compute running in your own data centers or at the edge.

There are two key statistics that I'd love to double-click on here.

First, 2.25 billion tasks every week. That's the scale at which customers rely on ECS: every week we see customers launch over 2.25 billion tasks on ECS, and this is distributed across hundreds of thousands of customers.

The second big stat I'd like to focus on here is that 65% of all new AWS container customers choose ECS as their first platform for running containers. Many of these customers, in fact, might be new to AWS itself and often see ECS as their first intro to the cloud.

Before we move on, a quick intro to Fargate for those who might not already be familiar with it. Fargate is a serverless compute engine for containers. When you run containers on Fargate, you don't need to think about provisioning or managing the underlying instance. You don't need to think about the underlying host operating system. You don't need to think about patching. You don't need to think about things like observability agents or local storage. All of that is taken care of.

The way we think about ECS is that the ECS value proposition closely ties with the AWS value proposition as a whole. And that really boils down to two simple things: performing at scale while providing availability and reliability, and giving you a simple interface to manage and run applications and containers. We can help you unlock cost savings and better availability, help you get to production faster, and help you get features out more quickly so that you can innovate faster for your customers. That's why, increasingly, we think of ECS as your platform team. Our goal is to provide you opinionated and managed experiences - with the right level of flexibility, of course - but with the two key goals I identified earlier: improving time to production and reducing operational overhead, so that you and your developers can focus on what really matters.

With the intro out of the way, I'd like to spend a couple of minutes on how I'll be going through the rest of the session. I'll be talking through three key principles here.

First, operating less. Why? Because when you operate less, you have more time to do other things that matter more. If you spend less time patching hosts, managing orchestration software, and managing AMIs, you get more time to focus on innovating and improving all the other things that move the needle for your business.

Secondly, utilizing more efficiently. The way I think about it, the best way to save costs is to use your compute more efficiently. So we'll do a deep dive on how you can run your containers efficiently with ECS and how that can unlock cost savings for you.

Finally, we'll talk through some infrastructure choices on AWS that you can make to reduce your TCO quite substantially. When you do all of this, it can unlock significant cost savings for you.

So let's start with the first pillar I talked about: operating less. Customers want to focus on innovation, and the goal of containers is to facilitate that innovation. However, it does take time to get there: you need to build out your application, encode it into containers, and finally deploy them into production. And once that's done, it's not done - you need to continue to manage your application, you need to maintain and patch hosts, and you need to observe and make continuous improvements.

Now, if you use any other container orchestration solution, you not only have to manage all of this, you also have to manage the orchestrator itself, because it's not a fully managed solution. However, when you're using Amazon ECS and Fargate, you can focus purely on your application - which is your code - and unlock innovation that way.

Before we go into the how, a quick note on why we think this way. Increasingly, we see ECS as a serverless container orchestrator. What that means is that you don't need to think about a control plane in ECS. When you create a cluster in ECS, it's not a physical entity that lives in your AWS account; it's purely a logical namespace that resides in your account, which you can use for distributing applications in the manner that works well for your organization. There's no charge for the control plane, and it is provided as a fully managed service by ECS, running at immense scale as we already showed.

And secondly, Fargate is a serverless compute engine for containers. As I said earlier, you don't need to think about hosts, you don't need to manage capacity, you don't need to think about AMIs or operating system agents - none of that.

So let's go back and take a closer look at how ECS manages all of this, starting at the build stage. While there are a number of different ways that you can deploy to ECS - the APIs, the CLI, CloudFormation, the console, what have you - increasingly we think about higher-level interfaces which customers can use to develop on ECS, and they take a number of different shapes.

The first is ECS blueprints. These are Terraform-based blueprints, available as an open source project on GitHub and maintained by AWS, which allow you to get started really quickly with your first deployments in case you're already using Terraform. And it's not only a getting-started solution; it's something that allows you to start with best practices and continuously evolve that configuration over time, so that you're architecting with best practices from the get-go.

Next, CDK extensions. Similar to the ECS blueprints for Terraform, the CDK extensions provide higher-level constructs - CDK being an interface for infrastructure as code which allows you to provision infrastructure in the programming language of your choice.
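As a rough sketch of what a higher-level construct looks like, here is the CDK's ECS patterns module in Python; all names and the container image are illustrative, and this assumes CDK v2. One construct stands up the cluster, task definition, Fargate service, and load balancer together:

```python
from aws_cdk import App, Stack
from aws_cdk import aws_ecs as ecs
from aws_cdk import aws_ecs_patterns as ecs_patterns

app = App()
stack = Stack(app, "EcsPatternStack")

# A single high-level construct: VPC, cluster, ALB, and Fargate service.
ecs_patterns.ApplicationLoadBalancedFargateService(
    stack, "WebService",
    cpu=256,                # 0.25 vCPU
    memory_limit_mib=512,
    desired_count=2,
    task_image_options=ecs_patterns.ApplicationLoadBalancedTaskImageOptions(
        image=ecs.ContainerImage.from_registry("public.ecr.aws/nginx/nginx:latest"),
    ),
)

app.synth()
```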

Finally, the Copilot CLI. Copilot is an opinionated CLI which lets you work with applications as higher-order constructs, so that you don't need to think about things like creating a build pipeline or going in and looking at infrastructure and application specs. It's almost as simple as saying: here's my container, AWS, please run it for me. We'll take a closer look at the Copilot CLI in a couple of minutes.

I'd like to spend a couple of minutes here. Traditionally, when you deploy an application on EC2, there's a shared responsibility model that you're signing up for: there are some things that you manage, whereas other things AWS manages. When you run a traditional EC2 instance, AWS typically manages the virtual machine itself as well as the physical server. However, everything above that layer - your application, the operating system it's running on, the runtime, any storage, monitoring, and logging plugins that you need deployed on your hosts - all of that needs to be managed by you.

Over time, that can add significant overhead. This was the reason we introduced ECS back in 2014: to take some of that responsibility away from you so that you can focus more on the application layer. When you use ECS on EC2, you can use ECS-optimized AMIs, which provide a super lightweight, managed Amazon Machine Image you can use to run your hosts, including the container runtime, storage and logging plugins, and the operating system. However, this still lives in your account, and you still need to manage everything associated with patching and so on.

With Fargate, however, all of that is abstracted entirely by ECS. You don't interact with or see the host operating system, you don't see the container runtime, and you can choose not to worry about monitoring or logging plugins. All of that comes fully baked, so that you can focus purely on your application. And that is why we are seeing customers increasingly choose Fargate as their first way to deploy containers on ECS, or on AWS in general.

Moving on, another key topic that I want to cover is observability. ECS brings observability out of the box. In 2019, we launched CloudWatch Container Insights, which allows you to simply say: hey, ECS, I want observability baked in for my cluster - and ECS automatically provides application metrics for all applications running on your cluster, at the granularity that you want. For logging, ECS provides FireLens, which allows you to send your logs to many destinations, depending on what your logging platform is.
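Turning on Container Insights for an existing cluster is a one-line setting. A minimal sketch with boto3 (the cluster name is illustrative):

```python
import boto3

ecs = boto3.client("ecs")

# Enable Container Insights metrics collection for one cluster.
ecs.update_cluster_settings(
    cluster="my-cluster",
    settings=[{"name": "containerInsights", "value": "enabled"}],
)
```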

However, observability isn't just about observing metrics and logs; it's also about observing costs. And this can be super important when you're trying to cut down costs, given the current macroeconomic environment. As those of you who already run containers might know, monitoring cost for containers is hard. If you're launching containers which last for only a few minutes, and which share resources on an EC2 host that is shared by multiple different containers, it can be really hard to keep track of how much compute is being utilized by which application and to allocate that cost at an application level. This is a really hard problem.

However, when you use ECS with Fargate, ECS provides you task-level cost visibility out of the box. When you run your application with ECS, you can see how much you're paying per application instead of how much you're paying per box. What this means is that by using tags, you can get to really granular cost data for your application, which you would not be able to do otherwise, even in a shared multi-tenant environment.
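A minimal sketch of the tagging side of this with boto3 - names are illustrative, and this assumes the tags are later activated as cost allocation tags in your billing settings so they show up in Cost Explorer:

```python
import boto3

ecs = boto3.client("ecs")

# Tag the service and propagate tags from the task definition to every task,
# so spend can be sliced per application rather than per instance.
ecs.create_service(
    cluster="my-cluster",
    serviceName="checkout-api",
    taskDefinition="checkout-api:7",
    desiredCount=3,
    propagateTags="TASK_DEFINITION",
    enableECSManagedTags=True,   # adds managed aws:ecs:* cluster/service tags
    tags=[{"key": "team", "value": "payments"}],
)
```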

Before we end the section, I did want to briefly double-click on AWS Copilot, especially for those of you who are new to ECS. Copilot is built with Well-Architected principles in mind. What that means is that when you launch your first application on ECS with Copilot, you're already well architected from the outset: your first application itself uses load balancers efficiently, it uses health checks, it has observability - all of that. What's more, Copilot also makes it easy for you to set up deployment pipelines, which make it really fast and easy for your development teams to push out code. Copilot also makes it easier for you to troubleshoot and operate, especially in environments where developers run and manage the entire stack.

Taking a step back before we move to the next section: when we think about all of the things I spoke about combined, consider the amount of time it saves you to not think about managing the control plane, the data plane, host patching. All of that time can be used for driving innovation, for improving your release velocity, and for spending time on other things that cut down your costs.

With that said, let's move on to the next pillar: utilizing more. I hope most of you are familiar with this image - it's the image that always pops up in my mind when I think about running containers efficiently. Just like in Tetris, the goal is to fill up lines perfectly and increase your score. When you run containers efficiently and utilize your compute efficiently, you save costs. And that's why, as I said, the best way to optimize your cost is to improve the efficiency of utilization of your compute.

Before we move on, for those of you who aren't super familiar with ECS, I want to take a quick moment to talk through some ECS terminology. Not a lot - just three quick terms.

First, cluster. A cluster in ECS is nothing more than a namespace. It's not a physical entity that resides in your account; it's fully managed by ECS.

You can create as many clusters as you want. An ECS task is the smallest unit of compute in ECS. A task is a group of 1 to 10 containers which typically constitute a single unit of an application or a microservice.

And finally, services - a service typically represents a group of running tasks which together form a microservice. The service construct in ECS provides inbuilt resiliency. What that means is that if a task in a service crashes, ECS ensures that it will bring it back up so that your application remains highly available.

So with that out of the way, let me take a step back to talk about the three key dimensions that you can think about to improve compute utilization:

  1. Size of the task - When you run a container on ECS, you need to configure how much compute you need for your container - say 1 vCPU and 4GB of memory. But depending upon how much of that your application actually uses, there might be an opportunity to cut costs.

  2. Number of tasks in a service - How much traffic your application sees over the course of a day can vary a lot. You can use this to make sure that your applications are always right sized.

  3. Compute capacity in the cluster - This represents the synthesis of everything else. If you have n instances running, you want to ensure those n are being utilized to the right extent possible. And by doing that, you can obviously save cost.

Let's start by looking at the task level. In this example scenario, it's pretty clear that the task is becoming CPU bound, whereas there's a lot of unused memory. In this case, you could reduce the amount of memory and increase the CPU you have reserved, and by doing that, you could likely save some cost. Right sizing tasks is important - you need to look at historical metrics over a reasonable period of time to make sure you're right sizing correctly.

You can use Container Insights to monitor historical utilization data for each task and look at the average and peak utilization to make sure you've sized your task correctly.
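As a minimal sketch of pulling that historical data with boto3 - the cluster and task definition family names are illustrative - this compares average and peak CPU over two weeks:

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Container Insights publishes per-task-definition-family utilization metrics.
stats = cloudwatch.get_metric_statistics(
    Namespace="ECS/ContainerInsights",
    MetricName="CpuUtilized",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "TaskDefinitionFamily", "Value": "checkout-api"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=14),
    EndTime=datetime.utcnow(),
    Period=3600,                      # hourly datapoints
    Statistics=["Average", "Maximum"],
)

# Average vs. peak tells you whether the reservation is oversized.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```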

With that said, let's move on to the next big pillar - number of tasks in your service. The amount of traffic that your application serves can vary a lot. Let's say this application needed 10 tasks at peak times. In some cases, it might need 6, 7, 8 or even 5 tasks to respond to traffic.

To respond to that, ECS allows you to configure auto scaling to scale your service horizontally so that it automatically responds to traffic. What this means is that ECS automatically observes utilization and based on the target metric scales up or down to ensure your application is always right sized - available but not overprovisioned.

You can scale based on a variety of metrics - ECS metrics like CPU and memory, metrics from other AWS services, or custom application metrics - whatever is right for your application.
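A minimal sketch of target-tracking service auto scaling with boto3 (cluster and service names are illustrative); this holds average CPU around 60% while keeping the task count between 2 and 10:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's desired count as the scalable dimension.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/checkout-api",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Attach a target-tracking policy on average CPU utilization.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/checkout-api",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "ScaleInCooldown": 300,  # scale in slowly
        "ScaleOutCooldown": 60,  # scale out quickly
    },
)
```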

With that said, let's move on to utilization at the cluster level. This is fundamentally a hard problem, because a cluster might have, say, 10 instances and 10 services running on it, with each service requiring a different number of tasks over time.

To make sure the cluster size is sufficient but not overprovisioned is hard. In 2019, ECS introduced capacity providers which automatically look at not just current needs but near future needs and right size the cluster. However, you likely still need some spare capacity for availability.
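A minimal sketch of wiring up a capacity provider with managed scaling via boto3; the Auto Scaling group ARN and names are illustrative, and a target capacity below 100% is what leaves that spare headroom:

```python
import boto3

ecs = boto3.client("ecs")

# Wrap an existing Auto Scaling group in a capacity provider; ECS managed
# scaling keeps the group near 90% utilized, leaving ~10% headroom.
ecs.create_capacity_provider(
    name="asg-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": (
            "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:"
            "uuid:autoScalingGroupName/ecs-asg"
        ),
        "managedScaling": {"status": "ENABLED", "targetCapacity": 90},
        "managedTerminationProtection": "ENABLED",
    },
)

# Make it the cluster's default strategy.
ecs.put_cluster_capacity_providers(
    cluster="my-cluster",
    capacityProviders=["asg-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "asg-capacity-provider", "weight": 1}
    ],
)
```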

With Fargate, you don't need to worry about underlying instances at all - you just run tasks on compute provided by AWS. So you only need to worry about right sizing your task and service. As long as those are fine, you will never over reserve and under utilize capacity you're paying for.

This is why we've seen customers like Edmunds reduce costs up to 30% by switching to Fargate.

Quickly moving on, the infrastructure choices you make when running your application on AWS can impact your bill. The three key choices are:

  1. Hardware type
  2. Capacity type
  3. Making reservations/commitments for discounts

First, AWS Graviton is AWS's custom silicon, introduced a few years ago, which provides high performance at up to 20% lower cost compared to same-size x86-based instances. ECS is integrated with Graviton, so you can run tasks on it efficiently at lower cost.
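A minimal sketch of targeting Graviton from a Fargate task definition with boto3; names and the image URI are illustrative, and the image itself must be built for arm64:

```python
import boto3

ecs = boto3.client("ecs")

# runtimePlatform is what selects ARM64 (Graviton) for Fargate tasks.
ecs.register_task_definition(
    family="checkout-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="1024",     # 1 vCPU
    memory="2048",  # 2 GB
    runtimePlatform={
        "cpuArchitecture": "ARM64",
        "operatingSystemFamily": "LINUX",
    },
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout-api:arm64",
            "essential": True,
        }
    ],
)
```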

Second, Spot capacity is spare AWS capacity made available via Spot at up to 90% off on-demand prices. Containers make it easier to run workloads on Spot because containerized workloads are typically already built to tolerate interruption. ECS handles interruptions automatically and gracefully drains connections, so in-flight requests are processed before shutting down.

When you use Fargate Spot, the underlying instance pool is diversified, so interruptions are less likely. Overall, ECS makes it easier to run on Spot.
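A minimal sketch of mixing on-demand Fargate and Fargate Spot in one service via a capacity provider strategy (names are illustrative): keep a baseline of two tasks on regular Fargate for availability, then place three of every four additional tasks on Spot.

```python
import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="my-cluster",
    serviceName="checkout-api",
    taskDefinition="checkout-api:8",
    desiredCount=10,
    capacityProviderStrategy=[
        # "base" tasks always run on on-demand Fargate.
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        # Beyond the base, weight 3:1 in favor of Fargate Spot.
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
)
```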

Finally, Savings Plans provide flexible discounted pricing based on a commitment over one or three years. Compute Savings Plans apply automatically to your ECS usage, including Fargate. EC2 Instance Savings Plans, however, only apply to EC2.

With a 3-year commitment, you can save up to 52% on Fargate costs. We've seen customers like Katina reduce costs up to 73% by moving from Lambda to ECS on EC2 Spot.

Architecting your applications to use compute efficiently with the right types allows you to unlock significant savings.

I've worked with Francis at CleverTap and been amazed at how he runs an organization at scale with a small team. Over to you Francis!

Thank you, I'm excited to be here and talk about how CleverTap uses ECS to keep us agile and operate infrastructure at scale.

CleverTap provides an omni-channel customer engagement and user retention platform powered by TesseraDB, the world’s first purpose-built database for engagement and retention. This enables us to surface analytics and insights into user behaviors via cohorts, funnels, trends, pivots, and more. Based on the data, we can engage users across messaging channels like push, in-app, text, WhatsApp, email in real time. We enable app owners to retain and grow their user base.

I'm Francis Pereira, VP of Infrastructure and Security at CleverTap. I've been helping keep the lights on for 16 years, right out of college. The last 8 years I've been at CleverTap, and today I'll talk about how we do ECS - our journey running infrastructure for 8 years that led us to fall in love with containers, and where ECS fits in.

From 2012 to 2016, in the beginning, there wasn't much ECS. We bootstrapped EC2 instances for stateful and stateless workloads, using user data to bootstrap configuration management, which would bring in the runtime and the application and start it up.

But there were too many moving parts and too many things that could go wrong - package repositories being offline during bootstrap, causing failures mid-scale-up. There had to be a better way.

You could bake the runtime and application into an AMI, but that had drawbacks too...

But this meant that for every build you're baking an AMI - and that's a lot of AMIs in an account. That's when we came across Docker: a way to package runtime dependencies along with the application, without the kernel and the other things that are required to keep the operating system going and interact with the hardware.

And so in 2017, we went to what I think of as container version one, where we packaged the application inside a container and deployed it with AWS CodeDeploy - no ECS. This worked, but then we realized we were reinventing a container orchestration engine, and that's when we went all in with ECS.

So we've been on ECS since 2018, for both stateful and stateless workloads - everything orchestrated with ECS. But then something interesting started to happen. Starting mid-June this year, we made the decision to move all of our stateless workloads to Fargate. I'm going to talk to you about Fargate and why we chose to do that.

A containerized application is surrounded by its runtime dependencies - that is the root file system. As a team, this is where we want to focus our time and money. There is no business benefit in focusing on patching operating system packages or hardening the operating system. But don't get me wrong: I'm not saying you shouldn't be doing it - it's super critical - it's just that there's no business benefit. Having run security engineering for a while, you have to figure out how to balance time spent on these repetitive things against time spent on what really takes your business forward.

So, just like we don't think about the hypervisor - when was the last time you thought about patching a hypervisor? - just like you don't think about virtualization, what if you could make the operating system go away? That was our motivation for Fargate. Fargate makes us forget the operating system, so we can focus on what is key: the app and the runtime environment, hardening and securing it.

The other thing is that ECS is great on EC2, but you have to manage the resources inside of the cluster - resources in the form of EC2 instances where your containers run. The thing is, those resources are decoupled from the containers, so you can find yourself in a state where the service is trying to scale up but you don't have the underlying compute to feed that scale-up.

In our case, we had custom alarms and Lambda functions all glued together to preempt when a service is likely going to scale, and then feed that capacity in just before the service scales up, so that there are enough resources to run it. But why do it? You should be able to say: hey, ECS, run this container - and boom, it should come to life. That's where Fargate comes in, and that's why we made the decision to go all in and move all of our stateless compute over to Fargate.

This was about a three-month exercise, and I'm going to tell you the things we learned - not just from the Fargate migration, but from operating ECS over the last four years. You would expect cluster events - all these things that are going on inside of a cluster, such as containers starting up, containers scaling out, containers registering with a load balancer, containers being terminated, containers stuck in a restart loop - to show up in CloudTrail, so you can put up monitoring, alerts, alarms, that kind of stuff. But that's not how it is.

And then it struck us: just like you don't see queries to an RDS database in CloudTrail, and S3 data events don't get put in CloudTrail by default, these are events happening inside of a service - so they don't show up in CloudTrail. I wish somebody had told us that up front.

The good news is that you can send these events to EventBridge, and from EventBridge you can take them to whatever monitoring system you've got and do whatever you would have done with them if they had shown up in CloudTrail.
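A minimal sketch of that routing with boto3 - the rule name and target ARN are illustrative - matching ECS task, service, and deployment state changes:

```python
import boto3

events = boto3.client("events")

# Match the ECS state-change events that never appear in CloudTrail.
events.put_rule(
    Name="ecs-state-changes",
    EventPattern=(
        '{"source": ["aws.ecs"],'
        ' "detail-type": ["ECS Task State Change",'
        ' "ECS Service Action",'
        ' "ECS Deployment State Change"]}'
    ),
)

# Forward them to a queue (or any other target) your monitoring system reads.
events.put_targets(
    Rule="ecs-state-changes",
    Targets=[
        {"Id": "monitoring", "Arn": "arn:aws:sqs:us-east-1:123456789012:ecs-events"}
    ],
)
```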

The other thing is that tasks can't be protected. You can take an EC2 instance out of a load balancer to observe the application's behavior without exposure to traffic. If you attempt to do that with ECS - both on EC2 as well as on Fargate - the task is destroyed.

There are also some kernel capabilities that don't work on Fargate. For most use cases you probably won't encounter this, but if you do - and I want to tell you up front - there are these nuances, and you should go check them out.

So I told you it took about three months to move from ECS on EC2 to ECS on Fargate. And that's weird, right? These are containers; they should just move from one place to another. Why does it take so long? Turns out the application used instance IAM credentials when making calls to other AWS services.

Now, when you move the application over to Fargate, you no longer control the underlying host, so you cannot attach an IAM role to it. Fargate forces you to use the task IAM role. And suddenly you have to go find every place there is an AWS client in your application and change it to use the default credentials provider chain.

I wish somebody had told me this up front. I wish it were documented, and I wish the AWS APIs said: warning, don't hard-code the task IAM role or instance IAM role - use the default credentials provider chain instead. That's an interesting one: it attempts to use whatever credentials are available. If the task role is available, it tries to use those credentials, and if it isn't, it falls back to your instance role. This allows your application to run on ECS Fargate as well as on EC2.
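In boto3 terms, a minimal sketch of the difference: the default chain is what you get when you pass no explicit credentials, and it checks environment variables, then the container credentials endpoint (the task role on Fargate), then the instance role on EC2.

```python
import boto3

# Default credentials provider chain: the same code works on Fargate
# (task role) and on EC2 (instance role) with no changes.
s3 = boto3.client("s3")
print([b["Name"] for b in s3.list_buckets()["Buckets"]])

# Anti-pattern: hard-wiring credentials, e.g.
#   boto3.client("s3", aws_access_key_id=..., aws_secret_access_key=...)
# or code that only reads the instance metadata endpoint - this is exactly
# what breaks when the application moves from EC2 to Fargate.
```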

And this is what set us back - a simple thing in the move from EC2 to Fargate set us back by three months. You can orchestrate stateful workloads with ECS on EC2, with the state persisted on an EBS volume, using some hacks that basically pin a service to the specific place where you want it to be. But on Fargate, you can only do persistence with EFS; there's just no support for persistence with EBS. And so when we moved these containers over from EC2 to Fargate, we spent a lot of time trying to figure out what was wrong - why isn't it working the way it should? Turns out ECS Exec is the answer: you can go drop in, even though you don't control the underlying EC2 instance, to see what's going on - just like, if it were EC2, you could SSH or SSM in and tail the Docker containers.

So ECS Exec is definitely a win, but it needs to be enabled up front. And when you're running stateful workloads on ECS, it turns out that if you want to change the instance type of the underlying container instance, you have to destroy it - terminate it, get rid of it - and then bring it back to life so that ECS can register it. If you just update it in place - shut it down, bring it back up as a different instance type, up or down - then ECS is bluntly going to refuse to use it.

And having run containers for the last four years - this is independent of ECS - here are my takeaways. In the beginning, we built custom containers because we had to adapt upstream containers for our use case. Over time, containers became mainstream and we realized we were doing it wrong. There is almost always a way to make a container behave and adapt to your environment from the outside, or via sidecar containers that hold configuration files which can be mounted into a stock upstream container - thereby letting you not have to worry about building those containers, build pipelines, and things like that.

The other thing is ECS Exec. Like I told you, we spent a lot of time going in and figuring out why a container wasn't doing what it should: is it really reading my values? Is it picking up all these environment variables or not? For that, ECS Exec is pretty cool. It's also cool if the service is breaking and you're in the middle of trying to fix it. But the thing here is that it needs to be enabled up front. It's not something you can turn on on the fly - not some checkbox you can flip at runtime. You have to say up front, in the task definition, that you're enabling ECS Exec.
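A minimal sketch of both halves with boto3 (cluster, service, and task identifiers are illustrative, and the task role needs the SSM messages permissions for this to work):

```python
import boto3

ecs = boto3.client("ecs")

# Enable ECS Exec on the service; running tasks must be replaced to pick it up.
ecs.update_service(
    cluster="my-cluster",
    service="checkout-api",
    enableExecuteCommand=True,
    forceNewDeployment=True,
)

# Then you can drop into a running container, even on Fargate.
ecs.execute_command(
    cluster="my-cluster",
    task="arn:aws:ecs:us-east-1:123456789012:task/my-cluster/0123456789abcdef",
    container="app",
    interactive=True,
    command="/bin/sh",
)
```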

Then there's the sidecar deployment model. Instead of packing your container into one large monolith - for example, imagine an application with a logging agent along with it: your application runs, and the logging agent ships all of your logs to some central system so they're there for analysis - instead of putting those two together into one monolithic container, stick to the one-application-per-container deployment model.

In our example, that works out to two containers in a task: the application itself and the log-forwarding agent. You can then create a mount point shared between the containers, enabling the log-forwarding agent to read the application logs and ship them to your central monitoring system.
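A minimal sketch of that task definition with boto3 - images, paths, and sizes are illustrative - two containers sharing one volume:

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="app-with-log-sidecar",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    volumes=[{"name": "app-logs"}],  # shared scratch volume
    containerDefinitions=[
        {
            # The application writes its logs to the shared volume.
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:latest",
            "essential": True,
            "mountPoints": [
                {"sourceVolume": "app-logs", "containerPath": "/var/log/app"}
            ],
        },
        {
            # The sidecar reads the same mount and forwards the logs.
            "name": "log-forwarder",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/log-agent:latest",
            "essential": False,
            "mountPoints": [
                {
                    "sourceVolume": "app-logs",
                    "containerPath": "/var/log/app",
                    "readOnly": True,
                }
            ],
        },
    ],
)
```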

Having run ECS for the last four years, here's what I have to tell you as takeaways. It can be frustrating to debug why containers are behaving in a specific way - if one is just stuck in a random restart loop, there's no way to tell what's going on. So use remote logging drivers that take your standard out and ship it to a remote logging system; suddenly you have visibility into what the standard out of the container is. You could do this on EC2 too, but you can't go docker logs on Fargate - so specifically when going to Fargate, this is absolutely critical and saves so much time.
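A minimal sketch of one such remote log driver, awslogs, configured in the task definition with boto3 (names are illustrative; FireLens/awsfirelens would be the analogous choice for non-CloudWatch destinations):

```python
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="checkout-api",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/checkout-api:latest",
            "essential": True,
            # Ship stdout/stderr to CloudWatch Logs, so a crash-looping
            # task is still debuggable where `docker logs` isn't an option.
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/checkout-api",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "app",
                },
            },
        }
    ],
)
```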

Then there are all these events going on - depending on how large your deployment is, how many clusters you have, how many tasks and services - there's just crazy stuff going on all the time. It makes sense, if you're doing ECS on EC2, to ship your ECS agent logs over to a central monitoring system, so that when something behaves weirdly you at least have something to look at. And irrespective of ECS on EC2 or ECS on Fargate, shipping the cluster events themselves off to a central monitoring system gives you a 3,000-foot view into what's going on.

And then, my favorite recommendation: clusters are cheap. They are zero maintenance and they cost zero. In our case, we literally deploy one microservice per cluster. It's very simple when you have to explain it to new people - people you've just hired, people getting onto your team, developers. It's simple: one cluster runs one service, and you're done. That's all I've got.
