Long-Running Spark Jobs on GCP Using Dataproc with Preemptible Instances

Hadoop/Spark Using Dataproc on Google Cloud

Dataproc is the go-to option for running a Hadoop cluster on top of Google Cloud Platform and is definitely a LOT easier than managing a cluster manually. Since Hadoop is part of the big data ecosystem, it often carries a big price tag to match that big data aspect. To reduce this price tag, many customers use preemptible instances as worker nodes in their clusters.

For the uninitiated, preemptible instances are virtual machines built on the excess compute capacity a cloud provider has at a given moment, and they can be reclaimed when that capacity is needed elsewhere, so it helps to think of them as temporary virtual machines. They may or may not be available at any given point in time depending on the availability of computing resources, and they may be reclaimed at any moment with little to no notification. To offset this potential downside, the price of these instances is massively discounted, up to 80% according to Google, versus traditional virtual machine instances.
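To put that discount in perspective, here is a back-of-the-envelope cost comparison in Python. The hourly rate below is a made-up placeholder, not real GCP pricing; only the "up to 80%" figure comes from Google.

```python
# Hypothetical on-demand hourly rate for one worker VM; real GCP pricing
# varies by machine type and region -- this number is for illustration only.
ON_DEMAND_HOURLY = 0.19
PREEMPTIBLE_DISCOUNT = 0.80  # "up to 80%" discount, per Google

preemptible_hourly = ON_DEMAND_HOURLY * (1 - PREEMPTIBLE_DISCOUNT)

def monthly_cost(hourly_rate: float, n_workers: int, hours: int = 730) -> float:
    """Approximate monthly cost for a fleet of identical worker VMs."""
    return hourly_rate * n_workers * hours

print(f"10 standard workers:    ${monthly_cost(ON_DEMAND_HOURLY, 10):,.2f}/month")
print(f"10 preemptible workers: ${monthly_cost(preemptible_hourly, 10):,.2f}/month")
```

At fleet scale the difference compounds quickly, which is why preemptible workers are so attractive for batch workloads that can tolerate interruptions.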

These instances are often attached to Dataproc clusters to reduce costs significantly or to add extra processing capacity when needed.

A scenario brought up to us at DoiT International quite often is that a customer has an existing Hadoop cluster, or needs a new one, that will run Spark jobs for long periods of time (hours or even days), but the cluster must be able to scale to the required load and be priced as cheaply as possible. Most of the time, Dataproc with preemptible instances is our and Google's recommended option for this.

A question that some of our more risk-averse customers have asked is: how does Dataproc handle preemptible instances being reclaimed by Google, especially during very long-running jobs processing mission-critical data?

In order to answer this question, I created an experiment to simulate this happening in a production batch-load environment and determine how GCP's managed Hadoop service would react.

Spark Checkpointing

First, some background on how Spark handles moving workloads between virtual machines or nodes that may exist for one operation but not for the next.

Spark has a concept called checkpointing, which at a very high level writes the current state of an RDD or DataFrame (think of a dataset inside Spark) to disk. This is useful because it places a "bookmark" in your job so that if a virtual machine becomes unhealthy (dies or otherwise becomes unavailable), another instance can pick up from the last bookmark and start from there.

In this case, if a cluster uses preemptible instances and one is reclaimed, the existence of a checkpoint will allow processing to continue mostly uninterrupted from that checkpoint on a different worker node.

Here is a quick example of how to set up a checkpoint in PySpark. It sets the checkpoint directory, selects a column from a DataFrame, and checkpoints the result before writing it as a Parquet file to HDFS:

spark.sparkContext.setCheckpointDir('gs://bucket/checkpoints')
events_df = df.select('event_type')
# checkpoint() returns the checkpointed DataFrame, so keep the result
events_df = events_df.checkpoint()
events_df.write.format("parquet").save("/results/1234/")

Now assume you have a Dataproc cluster with a single master node, 2 worker nodes, and 2 preemptible worker nodes running the code above. If one of the preemptible workers is reclaimed by Google during the final line of the example, while the results are being written, Spark will detect the node failure and reschedule the work onto another node. Instead of starting over at the beginning of the example, it will resume at the last line, because a checkpoint was saved right before it.

This is a very simple example, but when you have a Spark job with 100+ operations that may take 5 minutes each, this can be a lifesaver, especially with a large fleet of preemptible instances where any number of them might disappear between operations.
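To make the time savings concrete, here is a toy pure-Python simulation (not Spark) of a job with many sequential operations, where each simulated node loss replays only the work since the last "bookmark". All numbers here (failure rate, step counts) are invented for illustration.

```python
import random

def run_pipeline(steps: int, checkpoint_every: int, failure_rate: float = 0.01) -> int:
    """Toy model: execute `steps` sequential operations, saving a checkpoint
    every `checkpoint_every` steps. A simulated node loss rewinds progress
    to the last checkpoint instead of to step 0. Returns the total number
    of step executions performed, including replayed work."""
    last_checkpoint = 0
    step = 0
    executed = 0
    while step < steps:
        executed += 1
        step += 1
        if step % checkpoint_every == 0:
            last_checkpoint = step            # "bookmark" progress so far
        if random.random() < failure_rate:    # a node is reclaimed
            step = last_checkpoint            # replay only since the bookmark
    return executed

random.seed(7)
frequent = run_pipeline(200, checkpoint_every=5)
random.seed(7)
rare = run_pipeline(200, checkpoint_every=200)
print(f"executions with frequent checkpoints: {frequent}")
print(f"executions with a single final checkpoint: {rare}")
```

With the same failure sequence, frequent checkpoints bound the replayed work per failure, while checkpointing only at the end forces a restart from scratch each time a node is lost.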

The Experiment: A Long-Running Spark Job on Dataproc

While this sounds great in theory, there seems to be very little actual validation of it documented online for Dataproc, which appears to be why customers considering Dataproc for their big data workloads have so much uncertainty about it.

TL;DR: yes, it works and runs the job to completion.

Here is how I set up this environment so others can recreate it for validation as required:

In order to test this and not let any potential preemption events slip by, I created a test environment that would theoretically be running a Spark job 24/7, to see what occurred when a preemptible VM was reclaimed and/or replaced.

This test environment consists of a Dataproc cluster with one master node, 2 worker nodes, and 2 preemptible worker nodes; a batch (non-streaming) Spark job that runs in a little over 30 minutes; and a Cloud Scheduler job that launches the Spark job every 30 minutes so that it runs as close to 24/7 as possible. I chose N1 machine types for the worker nodes since they are older and more likely to be reclaimed; it turns out E2 instances are reclaimed far less often than N1 instances. The Spark job itself is very basic: I pulled an open dataset from BigQuery and performed a multitude of random expensive operations on it, such as joins, cross joins, and random sample aggregations, to simulate a real data processing job that spreads the work across all nodes in the cluster.

To monitor when a preemptible instance was reclaimed and/or replaced, I created a custom metric on the managed instance group Dataproc creates for the cluster (usually named dataproc-cluster-<cluster name>) and put it on a dashboard graph. The dips and rises on this graph showed when preemptible instances were reclaimed and/or replaced, which gave me the timestamps to use for filtering the logs.

[Image: Example metric on a dashboard showing managed instance group size]

Once this metric was validated against the instance and job master node logs, I started running the job over the long Labor Day weekend here in the US, plus the following Tuesday, to get a full weekend and a busy workday into the analysis. One thing to note is that preemptible instances run for a maximum of 24 hours at a time; they are reclaimed, and a restart is attempted, at the 24-hour mark.

Over the course of this time, there were multiple dips and rises in the graph as instances were reclaimed and replaced (see the graphic above for an example). There were some "false positives" where jobs were still starting up or performing reads when the reclaim occurred; these showed the intended behavior but were not easy examples to present for the purposes of this article. A textbook example that allowed easy visualization of the behavior did occur, though: right after a checkpoint operation, in the middle of writing a DataFrame to HDFS, the preemptible instance doing the writing was reclaimed and restarted, leaving some very clear logs behind.

The exception thrown, from the Dataproc master log, was this:

20/09/08 20:05:36 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 18.0 in stage 21.0 (TID 1490, cluster-4b46-sw-41l5.c.project-id.internal, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:288)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

This was exactly what I was looking for, as it showed the machine being reclaimed during a write operation, conveniently right after a checkpoint had been taken. After this exception there were a few sets of warnings like the following, showing the failure to communicate with and schedule work on the reclaimed node (executor 2), confirming that the node was reclaimed:

20/09/08 20:06:41 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on cluster-rand-sw-4c7x.c.project-id.internal: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 73.0 in stage 63.0 (TID 9855, cluster-rand-sw-4c7x.c.project-id.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 74.0 in stage 63.0 (TID 9860, cluster-rand-sw-4c7x.c.project-id.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1599249516460_0228_01_000003 on host: cluster-rand-sw-4c7x.c.project-id.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
20/09/08 20:06:41 WARN org.apache.spark.ExecutorAllocationManager: Attempted to mark unknown executor 2 idle
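When sifting through exported driver log text for events like these, a small filter script can help. This is just a convenience sketch; the patterns are based on the diagnostic strings shown above.

```python
import re

# Matches YARN's diagnostic for a container whose host disappeared,
# as seen in the log excerpt above.
LOST_NODE = re.compile(r"Container released on a \*lost\* node")
EXECUTOR_ID = re.compile(r"executor (\d+)")

def lost_executors(log_lines):
    """Return the set of executor ids that appear on 'lost node' log lines."""
    ids = set()
    for line in log_lines:
        if LOST_NODE.search(line):
            match = EXECUTOR_ID.search(line)
            if match:
                ids.add(int(match.group(1)))
    return ids
```

Feeding it the WARN lines above would report executor 2 as lost.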

After a few seconds and a few sets of those exceptions, the job continued on as normal and passed a job success value back to Dataproc. I verified that the data had been written correctly to the destination folder on HDFS, showing that the operation completed successfully as intended.

When I switched over to the logs of the new preemptible instance that replaced the reclaimed one, I found it had started exactly where the processing on the reclaimed instance had left off. Note that this was a random but excellent example, because the task was pushed back onto the replacement instance instead of another worker node; this will not happen most of the time (1 out of 9 times for me in this experiment). Here are the log entries from the new instance:

{
"insertId": "j96wpu5rh8p09edb5",
"jsonPayload": {
"message": "src: /10.128.0.9:55928, dest: /10.128.0.8:9866, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-1208291363_17, offset: 0, srvID: 3b9b065f-15f4-49d7-a9ad-a5a2136e4ce1, blockid: BP-2070054281-10.128.0.10-1599249511859:blk_1073816330_75506, duration(ns): 556814753645",
"class": "org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace",
"filename": "hadoop-hdfs-datanode-cluster-rand-w-1.log"
},
"resource": {
"type": "cloud_dataproc_cluster",
"labels": {
"project_id": "project-id",
"cluster_uuid": "3de29175-f051-4aa5-9dee-e9925bfabec2",
"region": "us-central1",
"cluster_name": "cluster-rand"
}
},
"timestamp": "2020-09-08T19:06:15.035Z",
"severity": "INFO",
"labels": {
"compute.googleapis.com/resource_id": "5331347012694516446",
"compute.googleapis.com/resource_name": "cluster-rand-w-1",
"compute.googleapis.com/zone": "us-central1-a"
},
"logName": "projects/project-id/logs/hadoop-hdfs-datanode",
"receiveTimestamp": "2020-09-08T19:06:21.477492444Z"
}
{
"insertId": "j96wpu5rh8p09edb6",
"jsonPayload": {
"class": "org.apache.hadoop.hdfs.server.datanode.DataNode",
"filename": "hadoop-hdfs-datanode-cluster-rand-w-1.log",
"message": "PacketResponder: BP-2070054281-10.128.0.10-1599249511859:blk_1073816330_75506, type=LAST_IN_PIPELINE terminating"
},
"resource": {
"type": "cloud_dataproc_cluster",
"labels": {
"project_id": "project-id",
"cluster_uuid": "3de29175-f051-4aa5-9dee-e9925bfabec2",
"region": "us-central1",
"cluster_name": "cluster-rand"
}
},
"timestamp": "2020-09-08T19:06:15.035Z",
"severity": "INFO",
"labels": {
"compute.googleapis.com/resource_id": "5331347012694516446",
"compute.googleapis.com/zone": "us-central1-a",
"compute.googleapis.com/resource_name": "cluster-rand-w-1"
},
"logName": "projects/project-id/logs/hadoop-hdfs-datanode",
"receiveTimestamp": "2020-09-08T19:06:21.477492444Z"
}

Unfortunately, in none of the available logs could I find entries showing the instance reading back from the checkpoint directory, but it did continue exactly where the job had left off and then completed the remaining operations.

Conclusion

To conclude this experiment: Dataproc handled the reclaiming and replacement of a preemptible instance node exactly as the design of Hadoop and Spark intends. Google's engineers have done a wonderful job of ensuring that Dataproc seamlessly handles "failures" of worker nodes, i.e., when a preemptible instance is reclaimed without warning.

Translated from: https://blog.doit-intl.com/long-running-spark-jobs-on-gcp-using-dataproc-with-preemptible-instances-721e33d6c09
