PySpark: unit, integration and end-to-end tests

Introduction

Many years ago software developers understood the benefits of testing their code, but testing is not yet as widespread in the big data world. In this article I intend to show one way of creating and running PySpark tests.

There are many articles out there explaining how to write tests and integrate them into CI pipelines. When working with Spark, however, I did not manage to find any good documentation or patterns that could help me create and automate tests the way I used to with other frameworks.

This article explains the way I run PySpark tests. Hopefully it will be useful to big data developers looking for ways to improve the quality of their code and, at the same time, of their CI pipelines.

Unit, integration and end-to-end tests

These are the kinds of tests we may need when working with Spark. I am not mentioning others such as smoke tests or acceptance tests because I think they are outside the scope of this article.

  • Unit Tests: at this level we deal with code that does not require a Spark Session in order to work. This kind of code also does not talk to the outside world.

  • Integration Tests: at some point we will need a Spark Session. At this level we test Spark transformations, and in many cases we have to deal with external systems such as databases, Kafka clusters, and so on.

  • End-to-end Tests: our application will probably be composed of several Spark transformations working together to implement some feature required by a user. Here we test the whole application.

You must be wondering why I am not drawing the typical test pyramid. The answer is simple: most of the time, when working with data, we deal with integration tests rather than unit tests, so the test pyramid does not make much sense in our case.

PySpark project layout

This is a strongly opinionated layout, so do not take it as the only or best solution. I think this layout should work for any use case, but if it does not work for you, I hope it at least brings some inspiration or ideas to your own testing implementation.

├── src
│   └── awesome
│       ├── app
│       │   └── awesome_app.py
│       ├── job
│       │   └── awesome_job.py
│       └── service
│           └── awesome_service.py
└── tests
    ├── endtoend
    │   ├── fixtures
    │   └── test_awesome_app.py
    ├── integration
    │   ├── fixtures
    │   └── test_awesome_job.py
    ├── shared_spark_session_helper.py
    └── unit
        └── test_awesome_service.py

It is a src layout using pytest, with three different packages for our PySpark application: app, job and service. On the pytest side we use three different folders: endtoend, integration and unit.

Application layout

app package

Under this folder we find the modules in charge of running our PySpark applications. Typically there will be only one PySpark application. For example, in the layout above, awesome_app.py contains the __main__ entry point required to run the application.
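
For illustration, here is a minimal sketch of what awesome_app.py might look like. The AwesomeJob class, the argument handling and the app name are assumptions made for this example, not the code of the linked project.

# src/awesome/app/awesome_app.py -- illustrative sketch, names are assumptions
import sys

from pyspark.sql import SparkSession

from awesome.job.awesome_job import AwesomeJob  # hypothetical job class, sketched below


def main(source_path, output_path):
    # getOrCreate reuses an already running Spark Session if there is one,
    # which is what lets SharedSparkSessionHelper inject its own session in tests.
    spark = SparkSession.builder.appName("awesome-app").getOrCreate()
    AwesomeJob(spark).run(source_path, output_path)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])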

job package

A Spark application should implement some kind of transformation. Modules under this package run Spark jobs that require a Spark Session. For example, awesome_job.py could contain Spark code implementing one or several transformations.
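
A hedged sketch of what such a job module could contain; the AwesomeJob class and the column names are made up for this example.

# src/awesome/job/awesome_job.py -- illustrative sketch, class and columns are assumptions
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


class AwesomeJob:
    def __init__(self, spark: SparkSession):
        self.spark = spark

    def run(self, source_path, output_path):
        # Read, transform and write: the whole job needs a Spark Session.
        df = self.spark.read.json(source_path)
        self.transform(df).write.mode("overwrite").parquet(output_path)

    @staticmethod
    def transform(df: DataFrame) -> DataFrame:
        # A trivial transformation: keep active users and count them per country.
        return (df.filter(F.col("active"))
                .groupBy("country")
                .agg(F.count("*").alias("active_users")))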

service package

Sometimes business logic does not require a Spark Session in order to work. In such cases we can implement that logic in a separate module.
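
A sketch of the kind of logic that can live in the service package; the function below is a made-up example.

# src/awesome/service/awesome_service.py -- illustrative sketch
def normalize_country_code(country: str) -> str:
    # Pure business logic: no Spark Session and no I/O involved.
    return country.strip().upper()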

Pytest layout

unit folder

Folder where our unit tests will reside, understanding unit tests as those that neither require a Spark Session in order to work nor talk to the outside world (file systems, databases, and so on).
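
A unit test for the hypothetical service function sketched above could look like this:

# tests/unit/test_awesome_service.py -- sketch based on the hypothetical service above
from awesome.service.awesome_service import normalize_country_code


def test_normalize_country_code():
    # No Spark Session and no external systems involved.
    assert normalize_country_code("  es ") == "ES"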

integration folder

Contains tests that require a Spark Session or deal with the outside world. There is also a fixtures folder where we can store the data sets required to run our tests. Under this path we find the tests that check Spark transformations.
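
A sketch of such an integration test, consistent with the hypothetical AwesomeJob above; the fixture file, the expected values and the import path of the helper are assumptions and may need adjusting to your pytest configuration.

# tests/integration/test_awesome_job.py -- illustrative sketch
from awesome.job.awesome_job import AwesomeJob
from shared_spark_session_helper import SharedSparkSessionHelper  # assumed import path


class TestAwesomeJob(SharedSparkSessionHelper):
    def test_transform_counts_active_users_per_country(self):
        # The Spark Session comes from the shared helper, the data from the fixtures folder.
        df = self.spark.read.json("tests/integration/fixtures/users.json")
        result = {row["country"]: row["active_users"]
                  for row in AwesomeJob.transform(df).collect()}
        assert result == {"ES": 2, "FR": 1}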

endtoend folder

This folder includes tests that run a whole PySpark application and check that the results are correct. Spark applications can be composed of multiple transformations; tests under this path check the application as a whole. As in the integration folder, there is a fixtures folder where we can include data sets for testing our applications.
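
An end-to-end test could drive the whole application through its entry point; again, paths, names and assertions are assumptions consistent with the sketches above.

# tests/endtoend/test_awesome_app.py -- illustrative sketch
from awesome.app import awesome_app
from shared_spark_session_helper import SharedSparkSessionHelper  # assumed import path


class TestAwesomeApp(SharedSparkSessionHelper):
    def test_app_writes_expected_output(self):
        output_path = str(self.temp_path / "output")  # temp_path created by setup_method
        awesome_app.main("tests/endtoend/fixtures/users.json", output_path)
        result = self.spark.read.parquet(output_path)
        assert result.count() > 0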

Shared Spark Session

One of the biggest problems to solve when running Spark tests is isolation: running one test should not affect the results of another. To achieve this we need a Spark Session per set of tests, so that the results of one set do not affect other sets that also require a Spark Session.

So we need to implement a mechanism that lets us start, clear and stop a Spark Session whenever we need it (before and after a set of related Spark tests).

As stated before, we will be using pytest. Pytest works with fixtures but also lets us use the classic xunit-style setup. We will use this xunit style for our Shared Spark Session. The details of the implementation are explained below, and a minimal sketch of the class follows the list:

  • setup_class method: we want to share the same Spark Session across a set of tests. The boundary of that set is a test class, which means that within one test class all the tests can share the same Spark Session.

  • spark_conf class method: different sets of tests may need Spark Sessions with different configurations. The spark_conf method lets us load a Spark Session with the configuration required by each set of tests.

  • Embedded Hive: spark-warehouse and metastore_db are folders used by Spark when Hive support is enabled. Different Spark Sessions in the same process cannot use the same folders, so we need to create random folders for every Spark Session.

  • setup_method: creates a temporary path, which is useful when our Spark tests end up writing results to some location.

  • teardown_method: clears and resets the Spark Session at the end of every test. It also removes the temporary path.

  • teardown_class method: stops the current Spark Session once the set of tests has run. This way we are able to start a new Spark Session if it is needed (if there is another set of tests requiring the use of Spark).
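
Below is a minimal sketch of how SharedSparkSessionHelper could be implemented following the description above. The linked project is the reference implementation; attribute names, the Derby metastore configuration and the clean-up details here are assumptions.

# tests/shared_spark_session_helper.py -- minimal sketch, details are assumptions
import shutil
import tempfile
import uuid
from pathlib import Path

from pyspark import SparkConf
from pyspark.sql import SparkSession


class SharedSparkSessionHelper:
    spark = None

    @classmethod
    def spark_conf(cls):
        # Override in subclasses when a set of tests needs a different configuration.
        return SparkConf().setMaster("local[*]").setAppName(cls.__name__)

    @classmethod
    def setup_class(cls):
        # Random warehouse and metastore folders so that several Hive-enabled
        # sessions started in the same process do not collide with each other.
        cls._session_path = Path(tempfile.gettempdir()) / ("spark-" + uuid.uuid4().hex)
        warehouse = cls._session_path / "spark-warehouse"
        metastore = cls._session_path / "metastore_db"
        conf = (cls.spark_conf()
                .set("spark.sql.warehouse.dir", str(warehouse))
                .set("spark.hadoop.javax.jdo.option.ConnectionURL",
                     "jdbc:derby:;databaseName=%s;create=true" % metastore))
        cls.spark = (SparkSession.builder
                     .config(conf=conf)
                     .enableHiveSupport()
                     .getOrCreate())

    def setup_method(self, method):
        # Temporary path for tests that write results to some location.
        self.temp_path = Path(tempfile.mkdtemp())

    def teardown_method(self, method):
        # Clear cached data and temporary views so the next test starts clean,
        # then remove the temporary path.
        self.spark.catalog.clearCache()
        for table in self.spark.catalog.listTables():
            if table.isTemporary:
                self.spark.catalog.dropTempView(table.name)
        shutil.rmtree(self.temp_path, ignore_errors=True)

    @classmethod
    def teardown_class(cls):
        # Stop the session so the next test class can start a fresh one,
        # and remove the per-session folders.
        cls.spark.stop()
        shutil.rmtree(cls._session_path, ignore_errors=True)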

How it works

The basic idea behind SharedSparkSessionHelper is that there is one Spark Session per Java process, stored in an InheritableThreadLocal. When calling the getOrCreate method of SparkSession.Builder we end up either creating a new Spark Session (and storing it in the InheritableThreadLocal) or reusing an existing one.

So, for example, when running an end-to-end test, because SharedSparkSessionHelper is loaded before anything else (by means of the setup_class method), the application under test will use the Spark Session launched by SharedSparkSessionHelper.

Once the test class is finished, the teardown_class method stops the Spark Session and removes it from the InheritableThreadLocal, leaving our test environment ready for a new Spark Session. This way, tests using Spark can run in isolation.
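
A small check that illustrates this behaviour, assuming the helper sketched earlier: inside a test, application code that calls getOrCreate gets back the session that setup_class already started.

# sketch: a test class extending the helper can verify the session is reused
from pyspark.sql import SparkSession

from shared_spark_session_helper import SharedSparkSessionHelper  # assumed import path


class TestSessionReuse(SharedSparkSessionHelper):
    def test_application_code_reuses_the_test_session(self):
        # getOrCreate finds the session started in setup_class instead of creating a new one.
        assert SparkSession.builder.getOrCreate() is self.spark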

Awesome project

This article would be nothing without a real example. Following this link you will find a project with tox, pipenv, pytest, pylint and pycodestyle where I use the SharedSparkSessionHelper class.

The Awesome project uses an object-oriented programming style, but the same approach applies to procedural code; the only requirement is to write test classes that can be used with the SharedSparkSessionHelper class. In your Python projects you will often end up using object-oriented style, procedural style, or both at the same time. Feel free to use one or the other depending on what you need.

Of course, this application can be run on any of the currently available clusters, such as Kubernetes, Apache Hadoop YARN, Spark running in cluster mode, or any other of your choice.

Conclusion

Testing Spark applications can seem more complicated than with other frameworks, not only because of the need to prepare data sets but also because of the lack of tools for automating such tests. With the SharedSparkSessionHelper class we can automate our tests in an easy way, and it should work smoothly with both pytest and unittest.

I hope this article was useful. If you enjoy messing around with big data, microservices, reverse engineering or any other computer stuff and want to share your experiences with me, just follow me.

Translated from: https://medium.com/@gu.martinm/pyspark-unit-integration-and-end-to-end-tests-c2ba71467d85
