How We Built a Serverless Spark Platform: A Video Tour of Data Mechanics


Our mission at Data Mechanics is to give data engineers and data scientists the ability to build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.


So, we built a serverless Spark platform, a more easy-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera and Hortonworks.


In this video, we will give you a product tour of our platform and some of its core features:


  1. How to connect a Jupyter notebook to the platform and play with Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration
  3. How to monitor logs and metrics for your Spark app from our dashboard
  4. How to track the cost, stability, and performance of your jobs (recurring apps) over time
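As a sketch of the programmatic submission in item 2, here is what assembling an application config might look like. The payload field names below are illustrative placeholders, not the documented Data Mechanics API schema:

```python
import json

# Hypothetical payload for submitting a Spark app through a REST API.
# Field names are illustrative placeholders, not the exact API schema.
def build_submission(app_name, docker_image, main_file, arguments=()):
    """Assemble a JSON-serializable config describing one Spark app run."""
    return {
        "appName": app_name,
        "image": docker_image,           # Docker image packaging the dependencies
        "mainApplicationFile": main_file,
        "arguments": list(arguments),
    }

payload = build_submission(
    app_name="daily-etl",
    docker_image="myregistry/spark-app:latest",
    main_file="local:///app/etl.py",
    arguments=["--date", "2020-09-01"],
)
print(json.dumps(payload, indent=2))
# An actual submission would then POST this payload to the platform's
# REST endpoint (e.g. with requests.post), or an Airflow operator would
# do the same on a schedule.
```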
Data Mechanics Intro to Spark & Product Tour

What makes Data Mechanics a Serverless Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to make them stable and performant. Here are some parameters we tune:


  • The container sizes (memory, CPU) — to keep your app stable (avoid OutOfMemory errors), to optimize the binpacking of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (Memory-bound, CPU-bound, I/O-bound)
  • The default number of partitions used by Spark to increase its degree of parallelism.
  • The disk sizes, shuffle and I/O configurations to make sure data transfer phases run at their optimal speed.
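To make the idea concrete, here is a toy illustration of the kind of adjustment described above; the real autotuning logic is far more involved, but the config keys are standard Spark properties:

```python
# Toy autotuning heuristic: react to the outcome of the previous run.
# The config keys are standard Spark properties; the thresholds and
# growth factors here are made up for illustration.
def tune_configs(prev_configs, last_run):
    """Adjust Spark configs based on the outcome of the previous run."""
    configs = dict(prev_configs)
    if last_run.get("oom", False):
        # Memory-bound: grow executor memory by 50% to avoid OutOfMemory.
        mem_gb = int(configs["spark.executor.memory"].rstrip("g"))
        configs["spark.executor.memory"] = f"{int(mem_gb * 1.5)}g"
    if last_run.get("avg_task_seconds", 0) < 1:
        # Tasks too short: scheduling overhead dominates, so halve the
        # default number of shuffle partitions.
        parts = int(configs["spark.sql.shuffle.partitions"])
        configs["spark.sql.shuffle.partitions"] = str(max(parts // 2, 1))
    return configs

base = {"spark.executor.memory": "4g", "spark.sql.shuffle.partitions": "200"}
tuned = tune_configs(base, {"oom": True, "avg_task_seconds": 0.5})
print(tuned)  # memory grown to 6g, shuffle partitions halved to 100
```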

Our automated tuning feature is trained on the past runs of a recurrent application. It will automatically react to changes to code or to input data, such that your apps stay stable and performant over time, without the need for manual action from you.


[Image] How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:


  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider
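The application-level scaling is plain Spark configuration. The properties below are standard Spark settings; `shuffleTracking` is what allows dynamic allocation to work on Kubernetes (Spark 3.0+), where there is no external shuffle service. The executor bounds are example values:

```python
# Standard Spark properties enabling dynamic allocation; min/max
# executor counts are example values.
dynamic_allocation_confs = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Required on Kubernetes, which has no external shuffle service:
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
}

# These would be passed when building the session or on the command
# line; here we just render them as spark-submit flags.
for key, value in sorted(dynamic_allocation_confs.items()):
    print(f"--conf {key}={value}")
```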

This model lets each app work in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.


我们的云原生容器化 (Our cloud-native containerization)

Data Mechanics is deployed on a Kubernetes cluster in our customers’ cloud accounts (while most other platforms still run Spark on YARN, Hadoop’s scheduler).


This deployment model has key benefits:


  • An airtight security model: our customers’ sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use our set of pre-built, optimized Spark Docker images or build their own Docker images to package their dependencies in a reliable way. Learn more about using custom Docker images on Data Mechanics.
  • Integration with the rich tooling of the Kubernetes ecosystem.
  • Cloud agnosticism: Data Mechanics is available on AWS, GCP, and Azure.
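Under the hood, running a containerized Spark app on Kubernetes uses standard spark-submit flags. Here is a minimal sketch; the cluster endpoint and image name are placeholders:

```python
# Sketch of launching a Spark app on Kubernetes with a custom Docker
# image, using standard spark-submit flags. The API server URL, image
# name, and application file below are placeholders.
def k8s_submit_command(api_server, image, app_file):
    """Build the spark-submit command for cluster-mode on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_file,
    ]

cmd = k8s_submit_command(
    api_server="https://example-cluster:6443",
    image="myregistry/spark-app:latest",
    app_file="local:///app/etl.py",
)
print(" ".join(cmd))
```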
[Image] The Pros and Cons of running Apache Spark on Kubernetes (instead of YARN)

Our serverless pricing model

Competing data platforms’ pricing models are based on server uptime. For each instance type, they’ll charge you an hourly fee, whether this instance is actually used to run Spark apps or not. This puts the burden on Spark developers to efficiently manage their clusters and make sure they’re not wasting resources due to over-provisioning or parallelism issues.



Instead, the Data Mechanics fee is based on the sum of the duration of all the Spark tasks (the units of work distributed by Spark, reported with a millisecond accuracy). This means our platform only makes money when our users do real work. We don’t make money:


  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)
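A toy back-of-the-envelope comparison of the two pricing models; all numbers and rates below are made up for illustration:

```python
# Hypothetical numbers: a cluster is up for 10 hours, but the summed
# duration of all Spark tasks (the billable unit here) is only 4 hours.
cluster_uptime_hours = 10.0
spark_task_hours = 4.0

uptime_rate = 0.50   # $/instance-hour (made-up rate)
task_rate = 0.50     # $/task-hour (made-up rate)

uptime_fee = cluster_uptime_hours * uptime_rate   # charged even when idle
task_fee = spark_task_hours * task_rate           # charged only for real work

print(f"uptime-based fee: ${uptime_fee:.2f}")   # $5.00
print(f"task-based fee:   ${task_fee:.2f}")     # $2.00
```

Idle time, stragglers, and driver-only phases all fall into the gap between the two totals, which is why the platform has an incentive to scale idle apps down.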

As a result, Data Mechanics will aggressively scale down your apps when they’re idle, so that we reduce your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs will typically cover or exceed the fee we charge for our services.


I’d like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we’ll invite you to a shared Slack channel — we use Slack for our support and we’re very responsive there. We’ll send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice, and once we have these permissions we’ll deploy Data Mechanics and you’ll be ready to get started using our docs.


There are other features which we didn’t get to cover in this post — like our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and machine learning model tracking and serving. So stay tuned, and reach out if you’re curious to learn more.


Translated from: https://towardsdatascience.com/how-we-built-a-serverless-spark-platform-video-tour-of-data-mechanics-583d1b9f6cb0
