Tales of AWS or any other cloud cost optimisations

Let me tell you how most, if not all, uneducated cost optimisations start, hit a wall and never achieve the long-term results they set out to realise, and how you can avoid some of the pitfalls and save a few thousand, if not millions, of $/£/€ along the way.

This post is meant for all those companies that are not flooded with VC capital and don’t have the budget to hire seven-figure teams of engineers to look after their cloud estate (a reasonable/unreasonable argument can be made that if you can’t afford said teams, you’re better off going the “buy” route instead of “build”, but I digress; that can be a topic for a different post).

The beginning: Reservations

As with any “build-vs-buy” tale, there comes a time in every project that uses a cloud provider when your Finance/Accounting colleagues schedule a meeting with a title that goes a little something like this: “AWS Cost Analysis”; “Cloud Costs Review”; “Finance - Cloud Sync”. This is usually because some high-level person has seen the current cloud spend and uttered some expletives at how far over budget a project or department is. Said meeting ends up becoming a multi-part engagement in “explain to me what these lines in our bill are, why they’re so huge and how you can make them smaller” that would be better replaced by having said colleagues attend the AWS Cost Management Curriculum or enrol in the AWS Cloud Financial Management Course.

Amid the back and forth between both parties, the engineering team will inevitably respond to some comment about the EC2 spend with the following: “Well, if you’d allow us to reserve instances for 1 or even 3 years, we could reduce that spend by up to 60%”.

And herein lies the first pitfall. To think that the answer to a large expenditure halfway through the year is to double down and pony up a multi-year commitment in the same financial year is equivalent to seeing that the living room is on fire and proposing that the best solution is to start a controlled burn in the hallway to protect the rest of the house. While technically this might be the “smart” thing to do, to the uneducated it sounds like a crazy plan, and it usually results in a long process that involves creating a large spreadsheet.

Said spreadsheet will inevitably go ten rounds up and down the org chart until it’s deemed “too expensive to do this year” and is shelved alongside the other “todo projects”.

Reservations, much like Volume Discounts, are a cost-saving strategy that requires forethought, time, reliable forecasting data and a mature understanding of the organization’s use of a given resource. Clearly, none of these is in place, or you wouldn’t be doing this exercise. Instead, Reservations should be either the middle or the last step in the cost optimization plan. After you’ve gathered most or all of the above requirements, you can approach your cloud partner of choice to discuss terms.
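
To see why the forecasting data matters, here is a minimal sketch of the break-even arithmetic behind a reservation. The hourly rates are hypothetical placeholders (not real AWS prices), and the model assumes a reservation is billed for every hour of the year whether the instance runs or not.

```python
HOURS_PER_YEAR = 8760


def break_even_utilisation(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of the year an instance must run before reserving is cheaper."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_hourly / on_demand_hourly


def yearly_saving(on_demand_hourly: float, reserved_hourly: float,
                  utilisation: float) -> float:
    """On-demand cost minus reserved cost for one year; negative means a loss."""
    on_demand_cost = on_demand_hourly * HOURS_PER_YEAR * utilisation
    reserved_cost = reserved_hourly * HOURS_PER_YEAR  # paid whether or not it runs
    return on_demand_cost - reserved_cost


if __name__ == "__main__":
    # Hypothetical rates: 0.10/h on demand, 0.06/h effective reserved rate.
    print(break_even_utilisation(0.10, 0.06))
    print(yearly_saving(0.10, 0.06, utilisation=0.3))  # negative: the reservation loses money
```

With these made-up rates, an instance that runs less than roughly 60% of the year costs more reserved than on demand, which is why you want usage data before committing.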

Bonus savings: RDS and ElastiCache can also be reserved, and since they are usually always-on assets, it’s a no-brainer to reserve capacity in these categories.

The middle: rightsizing and scheduling

Once you’ve cleared that first hurdle, you start arriving at more reasonable solutions, such as turning off or scaling down your environments during off-hours and weekends and rightsizing your instances according to their use.
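
As a rough illustration of what off-hours scheduling is worth, the sketch below computes the share of on-demand compute hours you stop paying for when an environment only runs during working hours. The 12-hours-a-day, 5-days-a-week schedule is an assumption for the example, not a recommendation.

```python
HOURS_PER_WEEK = 24 * 7


def off_hours_saving(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly on-demand hours saved by running only on a schedule."""
    if not (0 <= hours_per_day <= 24 and 0 <= days_per_week <= 7):
        raise ValueError("schedule must fit inside a week")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # Assumed schedule: up 12 hours a day, Monday to Friday.
    print(f"{off_hours_saving(12, 5):.0%} of compute hours saved")
```

This only covers hourly compute charges; storage and other always-billed resources keep costing money while the instances are stopped.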

Assuming that you’re using infrastructure as code in the form of CloudFormation, Azure Resource Manager (ARM) templates or Terraform, this should be a quasi-trivial task of rolling out a new version of your infra. However, if you have a less mature infrastructure setup, or you’ve had enough staff turnover that most of it hasn’t been run in months/years, then this might end up being just the opportunity you needed to get that “modernisation” plan implemented.

The second pitfall, and one that’s all too easy to fall into, is to begin reducing and downsizing all the development environments and tooling instances to the point where your teams can go grab a full 3-course meal while waiting for a build/deployment pipeline to finish. You should focus instead on the following strategies:

Multi-Tenancy

Make use of cloud multitenancy (ECS, EKS, GKE, AKS, Fargate, …) to use your compute estate to its fullest capacity. If your product stack is partially or even fully dockerised, then it makes no sense to have its components running on separate machines. Economies of scale matter here, and container schedulers are there to ensure your workloads come up and stay up.

Spot instances are not (that) dangerous or scary

Consider shifting all non-production workloads/environments partially (or fully) to Spot Instances. With fleet autoscaling groups, it’s a no-brainer to switch your non-critical and non-persistent workloads (read: web and application servers, not DBs) to spot or a mixed pricing model. Just beware that if there’s an AZ outage, that capacity can be pulled to serve on-demand requests. So if your workloads are even mildly important, consider setting an “on-demand” percentage value >0% to guarantee that if your spots do get pulled, your environment will at least be a little less broken. An extra step you might want to consider is switching your bastions to Spot Instances as well; these are usually transient machines that don’t hold any persistent data.
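
To make the “on-demand percentage” idea concrete, here is a small sketch of how a desired capacity might be split between on-demand and spot. The parameter names mirror the `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity` settings of an EC2 Auto Scaling mixed instances policy, but the function itself is hypothetical and the rounding-up behaviour is my assumption, not documented AWS behaviour.

```python
import math
from typing import Tuple


def split_capacity(desired: int, on_demand_base: int,
                   on_demand_pct_above_base: int) -> Tuple[int, int]:
    """Return (on_demand, spot) instance counts for a desired capacity."""
    if desired < 0 or on_demand_base < 0:
        raise ValueError("capacities must be non-negative")
    if not 0 <= on_demand_pct_above_base <= 100:
        raise ValueError("percentage must be between 0 and 100")
    base = min(desired, on_demand_base)  # the base is filled with on-demand first
    above_base = desired - base
    # Assumption: fractional on-demand instances are rounded up.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


if __name__ == "__main__":
    # 10 instances, 2 always on-demand, 25% of the remainder on-demand.
    print(split_capacity(10, on_demand_base=2, on_demand_pct_above_base=25))
```

Even a small on-demand floor means a spot interruption degrades the environment rather than taking it down entirely.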

Scale your Staging/NonProd environments down when not in use

One of the biggest fallacies when it comes to modern Software Development is that you need a staging environment the size of production all the time. In fact, you need a staging environment that’s architecturally the same as production, and only the same size when you’re doing stress testing. At any other time, it only needs to be as big as your testing userbase requires. So allow it to scale down and shut down overnight and you’ll see massive savings there. Bonus points for seeing how your scalability works under increased stress and how fast it scales once your performance testing suite starts hitting those load-balancers. If this last sentence is foreign to you, feel free to reach out and I’ll happily point you in the direction of someone who’ll provide their time for a reasonable fee to help you scale your environments.

Implement dynamic scaling in Production environments

[Image] Source: https://www.slideshare.net/AmazonWebServices/ent101-embracing-the-cloud-final

If you’ve successfully implemented scaling in non-prod environments, then it’s time to tackle production. Unless you’re Google or Amazon and you operate a 24/7 service, your usage pattern most likely resembles this image. This is a perfect opportunity to leverage scheduled and monitoring-based scaling. Understand your usage patterns and adapt to them.
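
Scheduled scaling boils down to a lookup from time of day to desired capacity. The sketch below shows that lookup in isolation; the schedule values are invented for the example, and in practice your provider’s scheduled scaling actions would do this for you rather than your own code.

```python
from datetime import datetime
from typing import List, Tuple

# Hypothetical schedule: (hour the entry takes effect, desired instance count).
SCHEDULE: List[Tuple[int, int]] = [(7, 6), (19, 2)]


def desired_capacity(now: datetime,
                     schedule: List[Tuple[int, int]] = SCHEDULE) -> int:
    """Return the capacity of the most recent schedule entry at `now`."""
    ordered = sorted(schedule)
    # Before the first entry of the day, yesterday's last entry still applies.
    capacity = ordered[-1][1]
    for start_hour, size in ordered:
        if now.hour >= start_hour:
            capacity = size
    return capacity


if __name__ == "__main__":
    for hour in (3, 10, 22):
        print(hour, desired_capacity(datetime(2024, 1, 1, hour)))
```

Monitoring-based scaling then handles whatever the schedule did not predict, so the two are complementary rather than alternatives.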

The (never-ending) end: Processes, Architecture and Reservations

So now that you’ve addressed most of the quick wins, it’s time to address the systemic issues. In this section, I will focus on AWS for the most part due to some idiosyncrasies of its billing model, but rest assured other cloud providers have their own constraints. The reasons that caused you to start this exercise most likely stem from the following sources:

  • Less than adequate Architectural Design
  • Incorrect usage of High Availability capabilities
  • Lack of Platform hygiene practices and processes
  • Lack of Adequate Knowledge

Let’s dive deeper into some of these topics:

Architectural Design Review

One of the most common designs to have when starting in the cloud involves logically isolating environments (dev-test-stage-prod) or stages of environments (dev-nonprod-prod) in different VPCs. These VPCs, if designed securely, will require NATs or NAT Gateways (times the number of AZs), an Internet Gateway per VPC, separate clusters or autoscaling groups (which, for small environments, rarely benefit from economies of scale) and many other components. If you’re then required to connect environments via private routes rather than using publicly exposed endpoints, you’ll also have to consider either VPC Peering or other forms of connection. All of which will cost you a considerable amount of money.

Compare this with oversizing your non-production VPC subnets and setting up your automation to allow for side-by-side deployments of multiple environments within them. Need to connect two logical components? Security groups can easily be referenced by name. Need to have secure private connections between multiple environments? Internal load balancers are readily available and easy to configure. The benefits go on and on. Just ensure that you’re tagging and enforcing naming conventions so your inventory doesn’t get out of hand and you’ll be fine. (Obviously, this does not apply to highly regulated environments which need to follow pre-defined standards and reference architectures.)

Many other small changes can have big impacts on how efficiently you use the resources you pay for.

High Availability Capabilities = Expensive Network Traffic

Corey Quinn put it best with this tweet:

It takes skill to do a spit-take without a drink! 🥃
I won’t spend too much time explaining how AZ network charges on AWS work (I’ll leave that to Corey’s amazing blog), but suffice it to say that it costs twice as much as assumed every time data goes from Zone A to Zone B, and four times as much if you happen to respond with a payload from B to A, you know, as every modern system does. So you can see how easy it is for a poor multi-AZ design to get expensive quickly.
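
The “twice” and “four times” above can be worked through with a small sketch. It assumes cross-AZ traffic is billed per GB on both the sending and the receiving side, and it uses 0.01 per GB per side as a placeholder rate; check your own bill or the current pricing page for the real number.

```python
def cross_az_cost(request_gb: float, response_gb: float,
                  rate_per_gb_per_side: float = 0.01) -> float:
    """Cost of a request/response exchange between two availability zones."""
    # Assumption: every GB is charged once leaving one AZ and once entering the other.
    sides_billed = 2
    return (request_gb + response_gb) * sides_billed * rate_per_gb_per_side


def naive_estimate(request_gb: float, rate_per_gb: float = 0.01) -> float:
    """What you might assume: one charge, one direction."""
    return request_gb * rate_per_gb


if __name__ == "__main__":
    one_way = cross_az_cost(1000, 0)
    round_trip = cross_az_cost(1000, 1000)
    print(naive_estimate(1000), one_way, round_trip)
```

With equal-sized requests and responses, the round trip comes out at four times the naive one-way estimate, which is the multiplier described above.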

Unless you need your dev environments to survive a full AZ outage, they probably don’t need to have their web tier in AZ-A, app tier in AZ-B and DB in AZ-C. The name of the game for non-critical workloads is Availability Zone Affinity.

Does your dev1 “inventory” service have an identifiable data flow? Can that flow be isolated from other services? If the answers to those questions are yes, then you can probably pick one AZ at random in your favourite region, drop your database + containers/servers in the subnets assigned to that AZ, set the right affinity constraints (if you are using ECS, for example, the ECS documentation covers this; for EC2, placement groups have you covered) and see your cross-AZ network traffic charges drop to zero for that data flow (you may even see an improvement in latency). Apply this methodology to all the non-critical components you can and you’ll see a massive reduction in your AWS bill.
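
For ECS on EC2 container instances, AZ affinity can be expressed as a task placement constraint. The sketch below only builds the constraint structure in the shape accepted by the `placementConstraints` parameter of the ECS service and task APIs; the helper name and the zone value are hypothetical, and it does not apply to Fargate, where you would restrict the subnets instead.

```python
from typing import Dict, List


def az_affinity_constraints(availability_zone: str) -> List[Dict[str, str]]:
    """Build an ECS placement constraint pinning tasks to one availability zone."""
    if not availability_zone:
        raise ValueError("availability_zone must not be empty")
    return [
        {
            "type": "memberOf",
            # Cluster query language expression on the built-in AZ attribute.
            "expression": f"attribute:ecs.availability-zone == {availability_zone}",
        }
    ]


if __name__ == "__main__":
    # Hypothetical zone; pass the result as placementConstraints when
    # creating the service or running the task.
    print(az_affinity_constraints("eu-west-1a"))
```

Remember to place the database in the same zone, otherwise the traffic between the tasks and the data store still crosses AZs.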

Process and automation are kings

Now that you’ve tagged all your infrastructure and it’s living neatly together, you can go ahead and begin automating the shutdown and termination of unruly infrastructure. Fortunately, there’s an open-source product that can get you started on this journey. Cloud Custodian is an amazing Cloud Security, Governance, and Management tool that, in its own words:

(…) enables users to be well managed in the cloud. The simple YAML DSL allows you to easily define rules to enable a well-managed cloud infrastructure, that’s both secure and cost optimized. It consolidates many of the ad-hoc scripts organizations have into a lightweight and flexible tool, with unified metrics and reporting. Custodian supports managing AWS, Azure, and GCP public cloud environments.

Do note that Cloud Custodian does not require you to use Terraform to manage your infrastructure, nor does it care whether you use ARM Templates or the gcloud alpha CLI to spin up your clusters. All it cares about is that the infrastructure that should be up from 9–5 has a certain set of tags and meets the right filters. Anything that doesn’t meet those gates can be “actioned” to achieve the desired outcome. You can choose to shut it down, to email the relevant team or even to outright terminate an instance a few minutes after it was launched.
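
The filter-then-action idea can be sketched in a few lines. This is not Cloud Custodian’s API or policy syntax, just an illustration of the logic; the tag names, the `Schedule` value and the action labels are all hypothetical.

```python
from datetime import datetime
from typing import Dict

# Hypothetical tagging standard for this example.
REQUIRED_TAGS = {"Owner", "Environment"}


def action_for(instance: Dict, now: datetime) -> str:
    """Decide what to do with an instance based on its tags and the time."""
    tags = instance.get("tags", {})
    missing = REQUIRED_TAGS - set(tags)
    if missing:
        # Untagged infrastructure fails the gate outright.
        return "notify-and-stop"
    in_office_hours = now.weekday() < 5 and 9 <= now.hour < 17
    if tags.get("Schedule") == "office-hours" and not in_office_hours:
        return "stop"
    return "keep"


if __name__ == "__main__":
    tagged = {"tags": {"Owner": "team-a", "Environment": "dev",
                       "Schedule": "office-hours"}}
    print(action_for(tagged, datetime(2024, 1, 2, 10)))   # a Tuesday morning
    print(action_for(tagged, datetime(2024, 1, 6, 12)))   # a Saturday
    print(action_for({"tags": {}}, datetime(2024, 1, 2, 10)))
```

Whatever tool you use, the policy stays this simple: tags declare intent, filters find what violates it, and actions enforce it.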

There are many other products, and even custom-built solutions based on trusty bash scripts or Jenkins jobs, that can achieve a similar result, but the mindset shouldn’t change. If you’re paying for resources on an on-demand basis, then only keep them up for as long as you need them. Everything else should be shut down, with automation in place to restart it into a useful state.

Education is the silver bullet

That’s it, that’s the message. Educate your users and your engineers to operate with a “cost-aware” mentality and they’ll be the force for the change you want to see in your landscape.

Closing

The TLDR for this post is rather simple: your Cloud costs are a direct result of the design decisions and operating procedures you have. Download your AWS bills, understand your usage patterns, iterate on improving your design and procedures, and your costs will start declining. If you need help, feel free to reach out and I’ll link you to people who know their stuff.

Translated from: https://medium.com/swlh/tales-of-aws-or-any-other-cloud-cost-optimisations-83726fde97dc
