Spark & Databricks: Important Lessons From My First Six Months

If you’re reading this article then perhaps, like me, you have just started a new tech job and are trying to leverage Spark & Databricks for big data operations. Whilst Databricks has a friendly-looking UI that sits over the complex internal workings of Spark, do not be fooled; there are many traps and pitfalls that new users can fall into. These can lead to highly inefficient coding practices, causing ‘hanging’ operations or inexplicable errors that will leave you scratching your head.

In my first six months of using Spark, I learned two very important lessons which drastically improved the performance of my code and helped me to program with a mindset oriented around distributed computing. I would like to share these lessons with you to help develop your own understanding and potentially fast track you through some problems you may currently be facing in your work.

I will illustrate these lessons through the problems caused, some of the theory behind them, and some practical usage examples which could aid in the understanding of these common Spark issues.

1. Understanding Partitions

1.1 The Problem

Perhaps Spark’s most important features for data processing are its DataFrame structures. These structures can be accessed in a similar manner to a Pandas DataFrame, for example, and support a PySpark API that enables you to perform most of the same transformations and functions.
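
To make the comparison concrete, here is a minimal sketch contrasting the two APIs. It assumes a Databricks notebook where the spark session is predefined, and the file path and column name are purely hypothetical:

    import pandas as pd

    # pandas: the whole table lives in the memory of a single machine
    pdf = pd.read_csv("/tmp/sales.csv")
    pdf_filtered = pdf[pdf["amount"] > 100]

    # PySpark: the same intent expressed almost identically, but the data is a distributed DataFrame
    sdf = spark.read.csv("/tmp/sales.csv", header=True, inferSchema=True)
    sdf_filtered = sdf.filter(sdf["amount"] > 100)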

However, treating a Spark DataFrame in the same manner as a Pandas DataFrame is a common mistake, as it means that a lot of Spark’s powerful parallelism is not leveraged. Whilst you may be interacting with a DataFrame variable in your Databricks notebook, this does not exist as a single object on a single machine; in fact, the physical structure of the data is vastly different under the surface.

When first starting to use Spark, you may find that some operations take an inordinate amount of time even though you feel that only quite a simple operation or transformation is being applied. A key lesson for dealing with this problem, and for understanding Spark in earnest, is learning about partitions of data: how they exist in the physical realm and how operations are applied to them.

1.2 The Theory

Beneath Databricks sits Apache Spark, a unified analytics engine designed for large-scale data processing that boasts up to 100x the performance of the now somewhat outdated Hadoop. It utilises a cluster computing framework that enables workloads to be distributed across multiple machines and executed in parallel, which brings great speed improvements over using a single machine for data processing.

Distributed computing is the single biggest breakthrough in data processing since limitations in computing power on a single machine have forced us to scale out rather than scale up.

Nevertheless, whilst Spark is extremely powerful, it must be used correctly in order to gain the maximum benefit from using it for big data processing. This means changing your mindset from one where you may have been dealing with single tables sitting in a single file on a single machine, to this massively distributed framework where parallelism is your superpower.

In Spark, you will often be dealing with data in the form of DataFrames, an intuitive and easy-to-access structured API that sits above Spark’s core, specialised and fundamental data structure known as the RDD (Resilient Distributed Dataset). These are logical collections of data partitioned across machines (distributed) and can be regenerated from a logical set of operations even if a machine in your cluster goes down (resilient). The Spark SQL and PySpark APIs make interaction with these low-level data structures very accessible to developers who have experience in those respective languages; however, this can lead to a false sense of familiarity, as the underlying data structures themselves are so different.

Distributed datasets that are common in Spark do not exist on a single machine but exist as RDDs across multiple machines in the form of partitions. So although you may be interacting with a DataFrame in the Databricks UI, this actually represents an RDD sitting across multiple machines. Consequently, when you call transformations, it is key to remember that these are not instructions that are all applied locally to a single file; in the background, Spark is optimising your query so that these operations can be performed in the most efficient way across all partitions (explanation of Spark’s Catalyst optimiser).

Figure 1 — Partitioned Datasets (image by the author)

Taking the partitioned table in Figure 1 as an example, if a filter were called on this table, the Driver would actually send instructions to each of the workers to perform a filter on each coloured partition in parallel, before combining the results together to form the final result. As you can see, for a huge table partitioned into 200+ partitions the speed benefit will be drastic when compared to filtering a single table.

The number of partitions an RDD has determines the parallelism that Spark can achieve when processing it. This means that Spark can run one concurrent task for every partition your RDD has. Whilst you may be using a 20-core cluster, if your DataFrame only exists as one partition, your processing speed will be no better than if the processing were performed by a single machine, and Spark’s speed benefits will not be observed.
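
As a quick illustration, a DataFrame’s partition count can be inspected and changed directly. The sketch below assumes a Databricks notebook where spark is predefined; the data and numbers are only illustrative:

    # Generate some example data; .rdd exposes the underlying RDD of the DataFrame
    df = spark.range(0, 1_000_000)

    # One concurrent task can run per partition, so this number caps the parallelism available
    print(df.rdd.getNumPartitions())

    # Spread the data over 20 partitions so a 20-core cluster can work on it in parallel
    df_20 = df.repartition(20)
    print(df_20.rdd.getNumPartitions())   # 20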

1.3 Practical Usage

This idea can be confusing at first and requires a switch in mindset to one of distributed computing. By switching your mindset it can be easy to see why some operations may be taking much longer than usual. A good example of this is the difference between narrow and wide transformations. A narrow transformation is one in which a single input partition maps to a single output partition, for example a .filter()/.where(), in which each partition is searched for the given criteria and will output at most a single partition.

Figure 2 — Narrow transformation mapping (image by the author)

A wide transformation is a much more expensive operation and is sometimes referred to as a shuffle in Spark. A shuffle goes against the ethos of Spark, which is that moving data should be avoided at all costs, as this is the most time-consuming and expensive aspect of any data processing. However, in many instances a wide transformation is obviously necessary, such as when performing a .groupBy() or a join.

Figure 3 — Wide transformation mapping (image by the author)

In a narrow transformation, Spark will perform what is known as pipelining, meaning that if multiple filters are applied to the DataFrame then these will all be performed in memory. This is not possible for wide transformations, which means that results will be written to disk, causing the operation to be much slower.
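
A minimal sketch of the difference, assuming spark is predefined and using a hypothetical bucketing column; the Exchange node mentioned in the comments is how a shuffle typically shows up in the physical plan:

    from pyspark.sql import functions as F

    df = spark.range(0, 1_000_000).withColumn("bucket", F.col("id") % 10)

    # Narrow: chained filters map partition-to-partition and are pipelined in memory
    narrow = df.filter(F.col("id") > 100).filter(F.col("bucket") == 3)
    narrow.explain()

    # Wide: grouping needs rows with the same key on the same machine, forcing a shuffle
    wide = df.groupBy("bucket").count()
    wide.explain()   # look for an Exchange (shuffle) step in the printed physical plan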

This concept forces you to think carefully about how to achieve different outcomes with the data you are working with and how to most efficiently transform data without adding unnecessary overhead.

There are also some practical ways in which you can use partitioning to your benefit. These include .partitionBy() and .repartition() (this article explains both). By controlling the size and form of the partitions used in a table, operation speeds can increase exponentially (think indexes in SQL). Both of these operations add overhead to your processes, but by partitioning on a given column or set of columns, filters can become a lot quicker. This is most beneficial if you know that a certain column is going to be used extensively for filtering.
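
For example, a column that is filtered on constantly can drive both the in-memory layout and the on-disk layout. This sketch assumes spark is predefined and uses hypothetical column names and an output path:

    from pyspark.sql import functions as F

    events = spark.range(0, 1_000_000).withColumn(
        "country", F.when(F.col("id") % 2 == 0, "GB").otherwise("US")
    )

    # Repartition in memory on the column that later filters and joins will use heavily
    events = events.repartition("country")

    # Persist the data partitioned on disk so reads that filter on `country`
    # only need to scan the matching folders
    events.write.mode("overwrite").partitionBy("country").parquet("/tmp/events_by_country")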

2. Spark is Lazy… Really Lazy!

2.1 The Problem

The feature of Spark that is definitely the most frustrating for a new user is its lazy evaluation, as it goes against everything you have previously taken for granted in programming. Many developers have spent many hours in a code editor, setting breakpoints and stepping through code to understand what is happening at each step as the logical order of the code progresses. Similarly, in a Jupyter Notebook, it is easy to run each cell and know the exact state of variables and whether processes have been successful or not.

The issue with this is that when you call a certain transformation on a Pandas DataFrame in Jupyter, it is carried out instantaneously and the transformed variable sits there in memory, ready to be accessed by the user. Conversely, in Spark, transformations are not applied as soon as they are called; instead, the transformations are saved and a plan of transformations is built up, ready to be applied only when they are required.

To the new user, this can lead to the confusing scenarios of:

  • Complex operations in a Databricks cell taking only a matter of milliseconds.
  • Code exiting with errors at unexpected points.

2.2 The Theory

This feature is known as lazy evaluation, and whilst this feature of Spark is hard to get used to, it is actually one of the key design decisions that makes Spark so fast and gives it its boasted 100x speed over technologies like Hadoop. Moving data is computationally expensive, and if after each transformation the intermediate table had to be written to disk, then the overall process would take a long time, especially with large tables.

In Spark, there are two different types of operations that can be called: transformations and actions. Transformations are, as the name suggests, any operations that can be applied to a DataFrame to modify it in some way and present the data in a different form. What is different about Spark is that when these transformations are called, rather than the necessary computation actually being applied, Spark builds up an optimised Physical Plan, ready to execute your code when an action is called. An action is a command such as .count(), .collect() or .save() which actually requires the data to be computed.
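
A minimal sketch of this behaviour, assuming spark is predefined in a Databricks notebook: the transformation lines return almost instantly, and only the final action triggers any real work on the cluster:

    from pyspark.sql import functions as F

    df = spark.range(0, 10_000_000)

    # Transformations: nothing is computed here, Spark only extends the plan
    transformed = (
        df.filter(F.col("id") % 2 == 0)
          .withColumn("doubled", F.col("id") * 2)
    )

    # Action: the plan is optimised and executed across the cluster at this point
    print(transformed.count())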

The process is as follows:

1. Write DataFrame/Dataset/SQL Code.

2. If this code is valid, it is converted to a Logical Plan (logical operations).

3. Spark transforms this Logical Plan to a Physical Plan (how these operations will be carried out on the cluster).

4. Spark then executes this Physical Plan (RDD manipulations) on the cluster (when an action is called).

This process allows Spark to perform optimisations within the plan before any code is actually run. This includes operations such as predicate pushdown, where a filter that is applied at the end of a set of transformations in code is pushed to the front of the physical plan, ensuring that the transformations are applied to a smaller set of data and are therefore faster.
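
You can watch this happen by asking Spark for its plans. The sketch below assumes spark is predefined and reuses the hypothetical Parquet path from earlier; for a Parquet source the pushed-down filter typically appears in the scan node of the physical plan:

    df = spark.read.parquet("/tmp/events_by_country")

    # The filter is written last, after the projection...
    result = df.select("id", "country").filter("country = 'GB'")

    # ...but explain(True) prints the logical and physical plans, where the predicate
    # has been pushed down towards the file scan (e.g. as a PushedFilters entry)
    result.explain(True)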

2.3 Practical Usage

In practice, this problem often manifests itself when simple operations cause large, hard-to-digest errors at runtime, typically after a long list of transformations has been applied to a DataFrame or RDD. This makes Spark notoriously hard to debug for new users, as it can be very hard to identify which exact operation has caused a pipeline of operations to fail; typical debugging methods like print statements and breakpoints lose all meaning when the code they intersect has not actually been executed.

Figure 4 — Example Process (image by the author)

Take Figure 4 for example. In this simple pipeline, an aggregate function is applied to test_df, which is then joined to a second DataFrame (transformations), before a collect action is called. As previously discussed, the transformations will be used to generate a Logical (and then Physical) Plan which will be executed when required. At stage 1, however, the transformed DataFrame does not exist: it is merely an idea of what you want the data to look like. This means that any prints or success checks applied at this stage are not actually testing whether the code is successful; they just assert that the logical plan has been created.
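
Here is a minimal sketch of the pipeline in Figure 4, assuming spark is predefined; test_df, new_df and the column names are hypothetical stand-ins for the tables in the figure:

    from pyspark.sql import functions as F

    test_df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 5.0)], ["key", "value"])
    new_df = spark.createDataFrame([(1, "a"), (2, "b")], ["key", "label"])

    # Stage 1: aggregation - a transformation, so only the logical plan grows
    agg_df = test_df.groupBy("key").agg(F.sum("value").alias("total"))

    # Stage 2: join - still just another step added to the plan
    joined_df = agg_df.join(new_df, on="key", how="inner")

    # Stage 3: action - the physical plan finally runs, and real errors surface here
    rows = joined_df.collect()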

The same is true of new_df: the join to this table is just another step in the logical plan, so again at this stage you cannot say for certain whether the code is successful, although it may appear as such in Databricks. This idea also explains why joining two tables with millions of rows (or more) takes only a second in a Databricks cell: it is merely adding this step to the plan.

When we get to stage 3, however, and call a collect, the physical plan that has been created is executed on the cluster and all of the desired operations will occur. It is at this point that, if any of the actual operations are invalid, errors will be thrown.

Remember: the logical plan only checks that the code is valid; the execution of the physical plan is what will reveal errors in your operations.

In the simple example above you can probably see how it would be easy to trace the error back to its source; however, in lengthy pipelines this can become convoluted, especially as the physical plan may not follow the same order as the logical one.

To overcome this problem and debug effectively, it is necessary to isolate operations and test thoroughly before creating huge pipelines, as otherwise issues will only surface down the line when you begin to read and write the data. Further to this, it can be highly beneficial to generate small test datasets to test the expected behaviour of functions and operations. If you have to read in huge tables and collect them to the driver in order to test your functionality, your debugging time will increase drastically.
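
As a sketch of that approach, a transformation can be exercised against a tiny hand-built DataFrame before it is ever pointed at a production table. spark is assumed to be predefined, and add_discount is a purely hypothetical function under test:

    from pyspark.sql import DataFrame, functions as F

    def add_discount(df: DataFrame) -> DataFrame:
        # Hypothetical business rule under test: 10% off any amount over 100
        return df.withColumn(
            "discounted",
            F.when(F.col("amount") > 100, F.col("amount") * 0.9).otherwise(F.col("amount")),
        )

    # A tiny, hand-made dataset makes the expected output easy to reason about
    sample = spark.createDataFrame([(1, 50.0), (2, 150.0)], ["order_id", "amount"])
    add_discount(sample).show()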

Conclusion

In conclusion, Spark is an amazing tool that has made data processing at scale much quicker and simpler than ever before. Despite it requiring a change of mindset to leverage it properly, it is definitely worth gaining a deeper understanding of its inner workings, if for no other reason than to appreciate its ingenious design.

With Spark, it is definitely useful to understand how it works under the hood, as it is so different from any other technology in use. With Databricks providing such a low barrier to entry when using Spark, it is easy to start using bad practices early on, causing large cluster bills and long run times. But with some education its potential can be truly exploited, leading to huge improvements in efficiency and performance.

Translated from: https://towardsdatascience.com/spark-databricks-important-lessons-from-my-first-six-months-d9b26847f45d
