AWS EMR vs AWS Batch vs Ray Framework

Let’s say you have a lot of data to process and you are going through a portfolio of tools and approaches to decide on your next step. It can get very confusing, and even after picking a tool or approach and committing to it, you may still doubt your choice.

This blog post is about:

  • Giving you a quick overview of a few technologies in this ecosystem
  • Describing what these technologies are meant for

Here is a quick overview:

AWS EMR:

  • Library: the Python PySpark library is needed.

  • Coding style: needs to be adjusted to work with the PySpark library.

  • Very good at preparing the data before acting on it.
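To illustrate the coding-style adjustment mentioned above, here is a minimal sketch that expresses the same per-customer total twice: once as a plain Python loop and once as a PySpark expression of the kind you might run on EMR. The column names `customer_id` and `amount`, the function names, and the input path are assumptions made for illustration only.

```python
from collections import defaultdict


def totals_plain(rows):
    """Plain Python: loop over the records and accumulate in a dict."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["customer_id"]] += float(row["amount"])
    return dict(totals)


def totals_pyspark(spark, path):
    """PySpark: describe the aggregation and let Spark distribute it.

    `spark` is an existing SparkSession; `path` is a hypothetical CSV
    location with a header row (for example an S3 prefix).
    """
    # Imported here so the plain version above runs without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.read.csv(path, header=True, inferSchema=True)
    return df.groupBy("customer_id").agg(F.sum("amount").alias("total"))


sample = [
    {"customer_id": "c1", "amount": "10.5"},
    {"customer_id": "c2", "amount": "4"},
    {"customer_id": "c1", "amount": "2.5"},
]
print(totals_plain(sample))
```

The point of the contrast is that the PySpark version has no explicit loop: you describe the result as DataFrame operations, which is the adjustment in style the bullet refers to.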

AWS Batch:

  • Library: no library needed.

  • Coding style: stays the same as for any batch application you would normally write.

  • Not good at preparing unstructured data, but very good at taking an already prepared file, looping through it, and performing actions for each record.
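As a rough sketch of the kind of ordinary batch script these bullets describe, the following uses only the standard library: it reads an already prepared CSV and performs an action for each record. The `perform_action` function and the `id`/`name` columns are hypothetical stand-ins for whatever your job actually does.

```python
import csv
import io


def perform_action(record):
    """Hypothetical per-record action; a real job might call an API here."""
    return f"processed {record['id']}"


def run_batch(lines):
    """Loop through an already prepared CSV and act on each record."""
    return [perform_action(record) for record in csv.DictReader(lines)]


# In a real job `lines` would be an open file; in-memory text keeps this runnable.
prepared = io.StringIO("id,name\n1,alice\n2,bob\n")
results = run_batch(prepared)
print(results)
```

Nothing in this script is specific to AWS Batch, which is the point: the same code could be packaged into a container and run there unchanged.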

Ray framework (Python Ray library):

  • Library: the Python Ray library is needed.

  • Coding style: a slight adjustment, but for the most part you write your functions and classes as usual. You decorate the functions/classes and, when you invoke them, you indicate that you want them to run in a “remote/distributed” way. The same code works locally on your machine and distributed across a cluster.

  • Good at BOTH preparing the data and processing each record in the prepared file in a distributed fashion.
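The decorator-based style described above can be sketched as follows. The function names are assumptions for illustration, and the Ray import is done lazily so that the local path runs even where Ray is not installed; in typical Ray code you would instead place `@ray.remote` directly above the function definition.

```python
def square(x):
    """Regular Python function; nothing Ray-specific inside."""
    return x * x


def run_local(values):
    """Call the function the ordinary way, in this process."""
    return [square(v) for v in values]


def run_with_ray(values):
    """Run the same function as Ray tasks (requires `pip install ray`)."""
    import ray  # imported lazily so the local path works without Ray

    ray.init(ignore_reinit_error=True)
    # Equivalent to putting `@ray.remote` above `def square`.
    remote_square = ray.remote(square)
    futures = [remote_square.remote(v) for v in values]  # schedule the tasks
    return ray.get(futures)  # block until all results are back


print(run_local([1, 2, 3]))
```

The body of `square` is identical in both paths; only the way it is invoked changes, which is the "slight adjustment" the bullet refers to.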

Latest documentation on Ray: https://docs.ray.io/en/latest/

Screenshots of a presentation summarizing how Ray approaches distributed programming using function and class decorators while keeping regular Python code within the functions and classes.

Let’s now look into some use cases and see where each one of these could be useful.

Diagram 1 (below) is the typical machine learning example. You have data coming from many different sources and you are ingesting that data into a data lake. Then, as part of the machine learning process, you need distributed processing components that do the following:

  • Read a lot of files/data from the data lake and create a big sample CSV file for the machine learning algorithm to use for training.
  • Read a lot of files/data from the data lake to generate and maintain an up-to-date version of all features/attributes for each customer that you may need to predict/infer for.
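To make the second task concrete, here is a single-machine sketch of keeping the most recent feature row per customer when records arrive from several sources. The field names (`customer_id`, `updated_at`, `visits`) and the sample data are invented for illustration; at data-lake scale the same logic would be expressed in a distributed engine rather than a Python loop.

```python
def latest_features(records):
    """Keep the most recent feature row per customer.

    Each record is a dict with the hypothetical keys `customer_id` and
    `updated_at` (an ISO date string, so string comparison orders by date),
    plus any feature columns.
    """
    latest = {}
    for rec in records:
        current = latest.get(rec["customer_id"])
        if current is None or rec["updated_at"] > current["updated_at"]:
            latest[rec["customer_id"]] = rec
    return latest


source_a = [{"customer_id": "c1", "updated_at": "2020-01-01", "visits": 3}]
source_b = [
    {"customer_id": "c1", "updated_at": "2020-02-01", "visits": 5},
    {"customer_id": "c2", "updated_at": "2020-01-15", "visits": 1},
]
features = latest_features(source_a + source_b)
print(features["c1"]["visits"])
```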

This use case is a good example of where you could use the following:

  • AWS EMR or AWS Glue (with Apache Spark as the back-end engine)
  • Ray framework

Diagram 1

Let’s consider another example. Diagram 2 (below) is an example where you already have some CSV/JSON files prepared and now need to traverse the records within those files and perform a list of actions for each record. One solution is a single-threaded application that traverses the records and performs the actions; however, this would be extremely slow with a large number of records. Another way is to write a multi-threaded batch application in the programming language of your choice, but this limits you to a single server, so you cannot scale horizontally. Next, you can start thinking about splitting these files/records across multiple servers and processing them in parallel. Instead of building a custom solution for this, you can use the one AWS already has:

  • AWS Batch
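One way the splitting of records across servers could look from inside each worker is sketched here. AWS Batch array jobs set the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable in each child job; the `SHARD_COUNT` variable, the function names, and the round-robin split are assumptions chosen for illustration, not something the original article prescribes.

```python
import os


def shard(records, index, count):
    """Return every `count`-th record starting at `index` (round-robin split)."""
    return records[index::count]


def main(records):
    # Set by AWS Batch for array jobs; defaults to 0 for a local run.
    index = int(os.environ.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    # Hypothetical variable you would pass in yourself, equal to the array size.
    count = int(os.environ.get("SHARD_COUNT", "1"))
    return [f"processed {r}" for r in shard(records, index, count)]


all_records = ["r0", "r1", "r2", "r3", "r4"]
print(shard(all_records, 1, 2))
```

Because every worker derives its share from its own index, no coordination between workers is needed, and running the script locally without those variables simply processes everything.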

For example, AWS Batch allows you to develop a standard Python application and execute it within the AWS Batch environment, which takes care of all the distribution and scaling. What you need to set up in AWS Batch is:

  • Compute Environment (minimum, desired, and maximum vCPU settings)

  • Job Queue (associated with a compute environment)

  • Job Definition (Docker container with your application, plus vCPU and memory settings)

  • Jobs (pull all of the above together in the form of a job, which also lets you choose whether to use a single EC2 instance or multiple)
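The four pieces above can be sketched with boto3 roughly as follows. Every resource name, subnet ID, security group, role, and image here is a placeholder, and using an array job to fan the work out is just one option picked for illustration. The request builders are kept separate from the API calls so the shapes can be inspected without AWS credentials.

```python
def compute_environment_request(name):
    """Compute Environment: minimum, desired, and maximum vCPUs."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",
        "computeResources": {
            "type": "EC2",
            "minvCpus": 0,
            "desiredvCpus": 2,
            "maxvCpus": 16,
            "instanceTypes": ["optimal"],
            "subnets": ["subnet-PLACEHOLDER"],
            "securityGroupIds": ["sg-PLACEHOLDER"],
            "instanceRole": "ecsInstanceRole",
        },
    }


def job_queue_request(name, env_name):
    """Job Queue: associated with a compute environment."""
    return {
        "jobQueueName": name,
        "priority": 1,
        "computeEnvironmentOrder": [{"order": 1, "computeEnvironment": env_name}],
    }


def job_definition_request(name, image):
    """Job Definition: container image plus vCPU and memory settings."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,
            "command": ["python", "batch_app.py"],  # hypothetical entry point
            "resourceRequirements": [
                {"type": "VCPU", "value": "1"},
                {"type": "MEMORY", "value": "2048"},
            ],
        },
    }


def submit_job_request(name, queue, definition, array_size):
    """Job: ties queue and definition together; array_size fans the work out."""
    return {
        "jobName": name,
        "jobQueue": queue,
        "jobDefinition": definition,
        "arrayProperties": {"size": array_size},
    }


def create_all(batch, image, array_size):
    """`batch` is a boto3 client, e.g. boto3.client("batch").

    A real script would wait for each resource to become VALID
    before creating the next one; that is omitted here.
    """
    batch.create_compute_environment(**compute_environment_request("demo-env"))
    batch.create_job_queue(**job_queue_request("demo-queue", "demo-env"))
    batch.register_job_definition(**job_definition_request("demo-job-def", image))
    return batch.submit_job(
        **submit_job_request("demo-job", "demo-queue", "demo-job-def", array_size)
    )
```

In practice these resources are often created once through the console or infrastructure templates, with only the job submission done from code.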

This approach allows you to execute the batch in a very scalable way, but you have to make sure that the downstream systems can handle the scale. For example, in the diagram below, Action A is an API call to a service that you or your sister team may own. It could also be a service owned by a third party. You need to make sure that it can handle the extra traffic coming from your AWS Batch application.
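One simple way to protect a downstream service, shown here as a sketch only, is to cap the call rate inside each worker. The `Throttle` class and `call_action_a` function are hypothetical names, and the limit is per worker: with N workers running in parallel, the service would see up to N times this rate, so the per-worker value has to be chosen with the total in mind.

```python
import time


class Throttle:
    """Allow at most `rate` calls per second by sleeping between calls."""

    def __init__(self, rate, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0

    def wait(self):
        now = self.clock()
        if now < self.next_allowed:
            # Too early: pause until the next slot opens.
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval


def call_action_a(records, throttle, action):
    """Apply `action` to each record, pausing as needed to respect the rate."""
    results = []
    for record in records:
        throttle.wait()
        results.append(action(record))
    return results


throttle = Throttle(rate=100)
print(call_action_a(["r1", "r2"], throttle, lambda r: f"called Action A for {r}"))
```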

Diagram 2

I hope that this article paints the big picture of this ecosystem so that you can research more and find an optimal solution for your use cases.

Thank you for reading. Keep geeking out!

Almir Mustafic

Translated from: https://medium.com/@almirx101/aws-emr-vs-aws-batch-vs-ray-framework-6c447910504f
