Long-Running Spark Jobs on GCP Using Dataproc with Preemptible Instances

Hadoop/Spark Using Dataproc on Google Cloud

Dataproc is the go-to option for running a Hadoop cluster on top of Google Cloud Platform and is definitely a LOT easier than managing a cluster manually. Since Hadoop is part of the big data ecosystem, it often carries a big price tag to match that big data aspect. To reduce this price tag, many customers use preemptible instances as worker nodes in their clusters.

For the uninitiated, preemptible instances are virtual machines built on the excess compute capacity a cloud provider has at a given moment, and they can be reclaimed when that capacity is needed elsewhere, so it helps to think of them as temporary virtual machines. They may or may not be available at any given point in time depending on the availability of computing resources, and they may be reclaimed at any moment with little to no notification. To offset this potential downside, the price of these instances is massively discounted, up to 80% according to Google, versus traditional virtual machine instances.
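To put that discount in perspective, here is a back-of-the-envelope cost comparison in Python. The hourly rate below is a made-up placeholder, not real GCP pricing; only the "up to 80%" figure comes from Google.

```python
# Hypothetical on-demand hourly rate for one worker VM; real GCP pricing
# varies by machine type and region -- this number is for illustration only.
ON_DEMAND_HOURLY = 0.19
PREEMPTIBLE_DISCOUNT = 0.80  # "up to 80%" discount, per Google

preemptible_hourly = ON_DEMAND_HOURLY * (1 - PREEMPTIBLE_DISCOUNT)

def monthly_cost(hourly_rate: float, n_workers: int, hours: int = 730) -> float:
    """Approximate monthly cost for a fleet of identical worker VMs."""
    return hourly_rate * n_workers * hours

print(f"10 standard workers:    ${monthly_cost(ON_DEMAND_HOURLY, 10):,.2f}/month")
print(f"10 preemptible workers: ${monthly_cost(preemptible_hourly, 10):,.2f}/month")
```

At fleet scale the difference compounds quickly, which is why preemptible workers are so attractive for batch workloads that can tolerate interruptions.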

These instances are often attached to Dataproc clusters to reduce costs significantly or to add extra processing capacity when needed.

A scenario brought up to us at DoiT International quite often is that a customer has an existing Hadoop cluster, or needs a new one, that will run Spark jobs for long periods of time (hours or even days), but the cluster must be able to scale to the required load and be priced as cheaply as possible. Most of the time, Dataproc with preemptible instances is our and Google's recommended option for this.

A question that some of our more risk-averse customers have asked is: how does Dataproc handle preemptible instances being reclaimed by Google, especially during very long-running jobs processing mission-critical data?

In order to answer this question, I created an experiment to simulate this happening in a production batch-load environment and determine how GCP's managed Hadoop service would react.

Spark Checkpointing

First, some background on how Spark handles moving workloads between virtual machines or nodes that may exist for one operation but not for the next.

Spark has a concept called checkpointing, which at a very high level writes the current state of an RDD or DataFrame (think of a dataset inside Spark) to disk. This is useful because it places a "bookmark" in your job so that if a virtual machine becomes unhealthy (dies or otherwise becomes unavailable), another instance can pick up from the last bookmark and start from there.

In this case, if a cluster uses preemptible instances and one is reclaimed, the existence of a checkpoint will allow processing to continue mostly uninterrupted from that checkpoint on a different worker node.

Here is a quick example of how to set up a checkpoint in PySpark. It sets the checkpoint directory, selects a column from a DataFrame, and checkpoints the result before writing it as a Parquet file to HDFS:

spark.sparkContext.setCheckpointDir('gs://bucket/checkpoints')
events_df = df.select('event_type')
# checkpoint() returns the checkpointed DataFrame, so keep the result
events_df = events_df.checkpoint()
events_df.write.format("parquet").save("/results/1234/")

Now assume you have a Dataproc cluster with a single master node, 2 worker nodes, and 2 preemptible worker nodes running the code above. If one of the preemptible workers is reclaimed by Google during the final line of the example, while the results are being written, Spark will detect the node failure and reschedule the work onto another node. Instead of starting over at the beginning of the example, it will resume at the last line, because a checkpoint was saved right before it.

This is a very simple example, but when you have a Spark job with 100+ operations that may take 5 minutes each, this can be a lifesaver, especially with a large fleet of preemptible instances where any number of them might disappear between operations.
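To make the time savings concrete, here is a toy pure-Python simulation (not Spark) of a job with many sequential operations, where each simulated node loss replays only the work since the last "bookmark". All numbers here (failure rate, step counts) are invented for illustration.

```python
import random

def run_pipeline(steps: int, checkpoint_every: int, failure_rate: float = 0.01) -> int:
    """Toy model: execute `steps` sequential operations, saving a checkpoint
    every `checkpoint_every` steps. A simulated node loss rewinds progress
    to the last checkpoint instead of to step 0. Returns the total number
    of step executions performed, including replayed work."""
    last_checkpoint = 0
    step = 0
    executed = 0
    while step < steps:
        executed += 1
        step += 1
        if step % checkpoint_every == 0:
            last_checkpoint = step            # "bookmark" progress so far
        if random.random() < failure_rate:    # a node is reclaimed
            step = last_checkpoint            # replay only since the bookmark
    return executed

random.seed(7)
frequent = run_pipeline(200, checkpoint_every=5)
random.seed(7)
rare = run_pipeline(200, checkpoint_every=200)
print(f"executions with frequent checkpoints: {frequent}")
print(f"executions with a single final checkpoint: {rare}")
```

With the same failure sequence, frequent checkpoints bound the replayed work per failure, while checkpointing only at the end forces a restart from scratch each time a node is lost.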

The Experiment: A Long-Running Spark Job on Dataproc

While this sounds great in theory, there seems to be very little actual validation of it documented online for Dataproc, which appears to be why customers considering Dataproc for their big data workloads have so much uncertainty about it.

TL;DR: yes, it works and runs the job to completion.

Here is how I set up this environment so others can recreate it for validation as required:

In order to test this and not let any potential preemption events slip by, I created a test environment that would theoretically be running a Spark job 24/7, to see what occurred when a preemptible VM was reclaimed and/or replaced.

This test environment consists of a Dataproc cluster with one master node, 2 worker nodes, and 2 preemptible worker nodes; a batch (non-streaming) Spark job that runs in a little over 30 minutes; and a Cloud Scheduler job that launches the Spark job every 30 minutes so that it runs as close to 24/7 as possible. I chose N1 machine types for the worker nodes since they are older and more likely to be reclaimed; it turns out E2 instances are reclaimed far less often than N1 instances. The Spark job itself is very basic: I pulled an open dataset from BigQuery and performed a multitude of random expensive operations on it, such as joins, cross joins, and random sample aggregations, to simulate a real data processing job that spreads the work across all nodes in the cluster.

To monitor when a preemptible instance was reclaimed and/or replaced, I created a custom metric on the managed instance group Dataproc creates for the cluster (usually named dataproc-cluster-<cluster name>) and put it on a dashboard graph. The dips and rises on this graph showed when preemptible instances were reclaimed and/or replaced, which gave me the timestamps to use for filtering the logs.

[Image: Example metric on a dashboard showing managed instance group size]

Once this metric was validated against the instance and job master node logs, I started running the job over the long Labor Day weekend here in the US, plus the following Tuesday, to get a full weekend and a busy workday into the analysis. One thing to note is that preemptible instances run for a maximum of 24 hours at a time; they are reclaimed, and a restart is attempted, at the 24-hour mark.

Over the course of this time, there were multiple dips and rises in the graph as instances were reclaimed and replaced (see the graphic above for an example). There were some "false positives" where jobs were still starting up or performing reads when the reclaim occurred; these showed the intended behavior but were not easy examples to present for the purposes of this article. A textbook example that allowed easy visualization of the behavior did occur, though: right after a checkpoint operation, in the middle of writing a DataFrame to HDFS, the preemptible instance doing the writing was reclaimed and restarted, leaving some very clear logs behind.

The exception thrown, from the Dataproc master log, was this:

20/09/08 20:05:36 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 18.0 in stage 21.0 (TID 1490, cluster-4b46-sw-41l5.c.project-id.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:288)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

This was exactly what I was looking for, as it showed the machine being reclaimed during a write operation, conveniently right after a checkpoint had been taken. After this exception there were a few sets of warnings like the following, showing the failure to communicate with and schedule work on the reclaimed node (executor 2), confirming that the node was reclaimed:

20/09/08 20:06:41 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on cluster-rand-sw-4c7x.c.project-id.internal: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 73.0 in stage 63.0 (TID 9855, cluster-rand-sw-4c7x.c.project-id.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 74.0 in stage 63.0 (TID 9860, cluster-rand-sw-4c7x.c.project-id.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 2 idle
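When sifting through exported driver log text for events like these, a small filter script can help. This is just a convenience sketch; the patterns are based on the diagnostic strings shown above.

```python
import re

# Matches YARN's diagnostic for a container whose host disappeared,
# as seen in the log excerpt above.
LOST_NODE = re.compile(r"Container released on a \*lost\* node")
EXECUTOR_ID = re.compile(r"executor (\d+)")

def lost_executors(log_lines):
    """Return the set of executor ids that appear on 'lost node' log lines."""
    ids = set()
    for line in log_lines:
        if LOST_NODE.search(line):
            match = EXECUTOR_ID.search(line)
            if match:
                ids.add(int(match.group(1)))
    return ids
```

Feeding it the WARN lines above would report executor 2 as lost.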

After a few seconds and a few sets of those exceptions, the job continued on as normal and passed a job success value back to Dataproc. I verified that the data had been written correctly to the destination folder on HDFS, showing that the operation completed successfully as intended.

When I switched over to the logs of the new preemptible instance that replaced the reclaimed one, I found it had started exactly where the processing on the reclaimed instance had left off. Note that this was a random but excellent example, because the task was pushed back onto the replacement instance instead of another worker node; this will not happen most of the time (1 out of 9 times for me in this experiment). Here are the log entries from the new instance:

{
"insertId": "j96wpu5rh8p09edb5",
"jsonPayload": {
"message": "src: /10.128.0.9:55928, dest: /10.128.0.8:9866, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1208291363_17, offset: 0, srvID: 3b9b065f-15f4-49d7-a9ad-a5a2136e4ce1, blockid: BP-2070054281-10.128.0.10-1599249511859:blk_1073816330_75506, duration(ns): 556814753645",
"class": "org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace",
"filename": "hadoop-hdfs-datanode-cluster-rand-w-1.log"
},
"resource": {
"type": "cloud_dataproc_cluster",
"labels": {
"project_id": "project-id",
"cluster_uuid": "3de29175-f051-4aa5-9dee-e9925bfabec2",
"region": "us-central1",
"cluster_name": "cluster-rand"
}
},
"timestamp": "2020-09-08T19:06:15.035Z",
"severity": "INFO",
"labels": {
"compute.googleapis.com/resource_id": "5331347012694516446",
"compute.googleapis.com/resource_name": "cluster-rand-w-1",
"compute.googleapis.com/zone": "us-central1-a"
},
"logName": "projects/project-id/logs/hadoop-hdfs-datanode",
"receiveTimestamp": "2020-09-08T19:06:21.477492444Z"
}
{
"insertId": "j96wpu5rh8p09edb6",
"jsonPayload": {
"class": "org.apache.hadoop.hdfs.server.datanode.DataNode",
"filename": "hadoop-hdfs-datanode-cluster-rand-w-1.log",
"message": "PacketResponder: BP-2070054281-10.128.0.10-1599249511859:blk_1073816330_75506, type=LAST_IN_PIPELINE terminating"
},
"resource": {
"type": "cloud_dataproc_cluster",
"labels": {
"project_id": "project-id",
"cluster_uuid": "3de29175-f051-4aa5-9dee-e9925bfabec2",
"region": "us-central1",
"cluster_name": "cluster-rand"
}
},
"timestamp": "2020-09-08T19:06:15.035Z",
"severity": "INFO",
"labels": {
"compute.googleapis.com/resource_id": "5331347012694516446",
"compute.googleapis.com/zone": "us-central1-a",
"compute.googleapis.com/resource_name": "cluster-rand-w-1"
},
"logName": "projects/project-id/logs/hadoop-hdfs-datanode",
"receiveTimestamp": "2020-09-08T19:06:21.477492444Z"
}

Unfortunately, in none of the available logs could I find entries showing the instance reading back from the checkpoint directory, but it did continue exactly where the job had left off and then completed the remaining operations.

Conclusion

To conclude this experiment: Dataproc handled the reclaiming and replacement of a preemptible instance node exactly as the design of Hadoop and Spark intends. Google's engineers have done a wonderful job of ensuring that Dataproc seamlessly handles "failures" of worker nodes, i.e., when a preemptible instance is reclaimed without warning.

Translated from: https://blog.doit-intl.com/long-running-spark-jobs-on-gcp-using-dataproc-with-preemptible-instances-721e33d6c09
