Part 2: Real World Apache Spark Cost Tuning Examples

EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Below is a screenshot highlighting some jobs at Expedia Group™ that were cost tuned using the principles in this guide. I want to stress that no code changes were involved; only the spark-submit parameters were changed during the cost tuning process. Pay close attention to the Node Utilization column, highlighted in yellow.

[Screenshot: node configurations of Spark jobs before and after cost tuning, with the resulting cost savings]
Cost reductions of Apache Spark jobs achieved by getting the node utilization right — costs are representative

Here you can see how improving the CPU utilization of a node lowered the cost of each job. If too many Spark cores compete for a node's CPUs, time slicing occurs, which slows down the Spark cores and in turn hampers job performance. If too few Spark cores are utilizing the node's CPUs, then we are wasting money on that node's time because some of its CPUs go unused.
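The trade-off above can be expressed as a simple ratio. Here is a minimal sketch (the figures are illustrative, not taken from the article's screenshot): Spark cores scheduled on a node divided by the node's vCPUs, where values above 1.0 indicate oversubscription (time slicing) and values well below 1.0 indicate wasted capacity.

```python
def node_cpu_utilization(executor_cores: int,
                         executors_per_node: int,
                         node_vcpus: int) -> float:
    """Fraction of a node's vCPUs occupied by Spark cores.

    Values above 1.0 mean the node is oversubscribed (time slicing);
    values well below 1.0 mean paid-for CPUs are sitting idle.
    """
    return (executor_cores * executors_per_node) / node_vcpus

# Hypothetical example: 4 executors with 5 cores each on a 16-vCPU node
util = node_cpu_utilization(executor_cores=5, executors_per_node=4, node_vcpus=16)
print(f"{util:.2f}")  # 1.25 -> oversubscribed, time slicing likely
```

A value near (but not above) 1.0 is the target when cost tuning, though as noted below it does not need to be perfect.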

You may also notice that perfect node CPU utilization was not achieved in every case. This will happen at times and is acceptable. Our goal is to improve node CPU utilization every time we cost tune rather than trying to get it perfect.


Actual Job Costs

Determining actual job costs is pretty difficult. At Expedia, we built pipelines that combine hourly cost data from AWS with job-level cost allocations from Qubole, which help us determine actual job-level costs.

For those who don't have pipelines built to determine their job costs, check with your data management platform. Qubole recently introduced Big Data Cost Explorer to help its users easily identify job costs. For EMR users, AWS provides Cost Explorer, which you can learn to set up via this link.

AWS EC2 Pricing

[Image: a breakdown of AWS EC2 pricing: mostly EC2 instance cost, but also costs from EBS volumes, data transfer, and the data management platform]

Let’s dig a little into AWS pricing for Spark jobs since most platforms use AWS for their cloud computing needs. The biggest cost for Spark jobs is by far the EC2 instances/nodes used during the job. With that said, there are other charges from AWS that may impact the cost of your job in certain situations.


Data Transfer: This charge covers transferring data out of AWS as well as transferring data between AWS regions, billed on a per-GB basis. Loading data into AWS, or operating on data already in AWS, is free.

EBS volumes: This charge is for any AWS storage used when your Spark job persists datasets to EBS volumes. To be clear, this charge is not for writing data to S3; it applies when a Spark job persists data to disk. Again, if no data is persisted, this charge will be zero. It is billed per GB per month.
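Per GB-per-month billing is prorated by the time the volume exists, so a short-lived Spark job pays only a small slice of the monthly rate. A rough sketch (the $0.10/GB-month price is an illustrative gp2-style figure, not a quoted AWS rate):

```python
def ebs_cost(gb_provisioned: float,
             hours_attached: float,
             price_per_gb_month: float = 0.10) -> float:
    """Prorated EBS charge: GB-months billed only for the hours the volume exists.

    price_per_gb_month is an illustrative figure, not a quoted AWS rate.
    """
    hours_per_month = 730  # common AWS billing convention for a month
    return gb_provisioned * price_per_gb_month * (hours_attached / hours_per_month)

# 500 GB of scratch space attached for a 2-hour Spark job: roughly $0.14
print(round(ebs_cost(500, 2), 3))
```

This is why the EBS line item is usually nominal compared to the EC2 instance cost of the same job.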

And finally, whatever data management platform you are using (Qubole, EMR, Databricks, etc.) also adds its own charge to your job. In EMR's case, a per-second, per-instance surcharge is added on top of your EC2 instance costs, which also needs to be accounted for. Qubole adds a similar charge through its QCU (Qubole Compute Unit) values.


Estimating job costs

For the purposes of this guide, I’m only going to focus on how to estimate the costs of your EC2 instances since that’s the biggest factor affecting the overall costs of your jobs. The other charges are either nominal or will scale linearly with your EC2 instance costs.


Before I do, I need to mention there are several methods for reducing EC2 instance costs — like dynamic allocation and spot nodes — that we are going to ignore for this exercise. We are ignoring these additional cost reduction methods because they don't affect the goal of what we are trying to achieve: aligning executors across the optimum number of nodes.

Because dynamic allocation complicates the estimation process, with the number of executors changing wildly throughout a job, I recommend disabling dynamic allocation during cost tuning when trying to estimate job costs. We do this so we can gauge the efficiency of the executor configuration. You can turn dynamic allocation back on when you are done cost tuning to save more money.
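For a cost-tuning run, that means pinning the executor count via spark-submit confs. A minimal sketch (the conf keys are standard Spark settings; the executor count of 20 is an illustrative value, not a recommendation from the article):

```python
# Confs for a cost-tuning run: dynamic allocation off, executor count pinned
# so node utilization can be measured against a stable number of executors.
cost_tuning_confs = {
    "spark.dynamicAllocation.enabled": "false",  # standard Spark config key
    "spark.executor.instances": "20",            # illustrative fixed executor count
}

# Rendered as spark-submit flags:
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(cost_tuning_confs.items()))
print(flags)
# --conf spark.dynamicAllocation.enabled=false --conf spark.executor.instances=20
```

Once tuning is done, removing the `spark.dynamicAllocation.enabled=false` override restores dynamic allocation.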

We estimate job costs as follows:

1) Determine how many executors fit on a node by dividing available node memory by the total executor memory size (executor_memory + overhead memory).

2) Determine EC2 cost for your node type by looking on this AWS EC2 Instance Pricing page.


3) Multiply the node count by the EC2 node cost and by the run time (expressed in fractional hours).

While the costs calculated will only be an initial estimate, they are valuable because we can use them to compare job costs across various configurations.
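The three steps above can be sketched as a single function. All of the inputs here are illustrative: the node price would come from the AWS EC2 pricing page mentioned in step 2, and the memory figures from your own job's configuration.

```python
import math

def estimate_job_cost(total_executors: int,
                      executor_memory_gb: float,
                      overhead_memory_gb: float,
                      node_memory_gb: float,
                      node_price_per_hour: float,
                      runtime_hours: float) -> float:
    """Estimate EC2 cost for a Spark job using the three steps above.

    node_price_per_hour comes from the AWS EC2 pricing page; all example
    values are illustrative.
    """
    # Step 1: executors that fit on one node, by memory
    executors_per_node = int(node_memory_gb // (executor_memory_gb + overhead_memory_gb))
    # Nodes needed to host the requested number of executors
    nodes = math.ceil(total_executors / executors_per_node)
    # Steps 2-3: node count x hourly node price x runtime in fractional hours
    return nodes * node_price_per_hour * runtime_hours

# 20 executors of 8 GB + 2 GB overhead on 64 GB nodes,
# at a hypothetical $0.50/hour, for a 1.5-hour run:
# 6 executors fit per node -> 4 nodes -> 4 x 0.50 x 1.5 = $3.00
print(estimate_job_cost(20, 8, 2, 64, 0.50, 1.5))  # 3.0
```

Re-running the same function with a different executor layout gives a directly comparable number, which is exactly how the configurations in the screenshot above would be compared.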

In the next part, we will look at how to determine efficient executor configurations for the EC2 nodes your batch Spark job runs on.

Series contents

Learn more about technology at Expedia Group


Translated from: https://medium.com/expedia-group-tech/part-2-real-world-apache-spark-cost-tuning-examples-42390ee69194
