Unit testing Apache Spark Structured Streaming using MemoryStream

Unit testing Apache Spark Structured Streaming jobs using MemoryStream is a non-trivial task. Sadly enough, the official Spark documentation still lacks a section on testing. In this post, therefore, I will show you how to start writing unit tests for Spark Structured Streaming.

What is MemoryStream?

A Source that produces value stored in memory as they are added by the user. This Source is intended for use in unit tests as it can only replay data when the object is still available.

Spark SQL Docs

MemoryStream is one of the streaming sources available in Apache Spark. This source allows us to add and store data in memory, which is very convenient for unit testing. The official docs emphasize this, along with a warning that data can be replayed only when the object is still available.

MemoryStream takes a type parameter, which gives our unit tests much-desired type safety. The API of this abstraction is rich but not overwhelming; check it out for a full reference.

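Before the full example, here is a minimal sketch (my own, not taken from the post; the value names are illustrative) of the handful of MemoryStream calls used later: creating the stream, exposing it as a streaming Dataset, adding data and committing the processed offset.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}

val spark = SparkSession.builder().master("local[1]").appName("memorystream-sketch").getOrCreate()
implicit val sqlCtx = spark.sqlContext
import spark.implicits._

val stream = MemoryStream[String]              // the type parameter gives the test type safety
val ds = stream.toDS()                         // a streaming Dataset[String] backed by the in-memory source
val offset = stream.addData("some record")     // enqueue data; returns the offset of the added batch
// ... run a streaming query over ds and process the data ...
stream.commit(offset.asInstanceOf[LongOffset]) // mark everything up to that offset as processed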

Before writing any unit test, let’s create some sample job which will be tested.

Writing a Spark Structured Streaming job

The job I will be using for the testing has a simple role: read data from a Kafka data source and write it to a Mongo database. In your case, the set of transformations and aggregations will probably be much richer, but the principles stay the same.

First, I read the Kafka data source and extract the value column. I am also applying a prebuilt rsvpStruct schema, but that is specific to my code sample.

import spark.implicits._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.from_json

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "rsvp")
  .load()

val rsvpJsonDf = df.selectExpr("CAST(value as STRING)")
val rsvpStruct = Schema.getRsvpStructSchema
val rsvpDf = rsvpJsonDf.select(from_json($"value", rsvpStruct).as("rsvp"))
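
The post does not include Schema.getRsvpStructSchema itself. Purely as an illustration, a hand-written, trimmed-down version that is consistent with the fields referenced later in the post (the venue in the sample payload and rsvp.event.event_name in the final assertion) could look like the sketch below; the real Meetup RSVP schema is much larger.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

object Schema {
  // Hypothetical, minimal schema: only the nested fields this post actually touches.
  def getRsvpStructSchema: StructType = StructType(Seq(
    StructField("venue", StructType(Seq(
      StructField("venue_name", StringType)
    ))),
    StructField("event", StructType(Seq(
      StructField("event_name", StringType)
    )))
  ))
}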

Next, I am outputting the content of the parsed data into a Mongo collection. That is done in a foreachBatch action, which loops through every batch in the current result set and allows us to perform an arbitrary operation. In my case, I am using the Mongo Spark connector for persisting the batches.

rsvpDf.writeStream
  .outputMode("append")
  .option("checkpointLocation", "/raw")
  .foreachBatch({ (batchDf: DataFrame, batchId: Long) =>
    val outputDf = batchDf.select("rsvp.*")
    outputDf.write
      .format("mongo")
      .mode("append")
      .save()
  })
  .start()
  .awaitTermination()
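
The mongo format used above comes from the MongoDB Spark connector, which is not bundled with Spark, so the connector has to be on the classpath and an output URI configured for the save() call to work. A minimal, assumed session configuration for connector versions that expose the mongo source name might look like the following; the URI, database and collection names are placeholders:

val spark = SparkSession.builder()
  .appName("RSVP Ingest")
  // Placeholder URI pointing at a local MongoDB; database "meetup" and collection "rsvp_raw" are assumed.
  .config("spark.mongodb.output.uri", "mongodb://localhost:27017/meetup.rsvp_raw")
  .getOrCreate()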

Now that we have the sample job created, let’s start writing a unit test using the MemoryStream class.

Unit testing Spark using MemoryStream

The preliminary step in writing a unit test is to have some sample data the tests can use. In my case, I am using a stringified JSON that contains the information I normally receive from my Kafka data source. You should use any data that fits your use case.

val sampleRsvp = """{"venue":{"venue_name":"Capitello Wines" ...

Next, inside the test case, we create our SparkSession. Nothing special has to be set here, just the typical config you use.

import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.{LongOffset, MemoryStream}
import org.apache.spark.sql.functions.from_json
import org.scalatest.funsuite.AnyFunSuite

class SaveRawDataTest extends AnyFunSuite {
  test("should ingest raw RSVP data and read the parsed JSON correctly") {
    val spark = SparkSession.builder()
      .appName("Save Raw Data Test")
      .master("local[1]")
      .getOrCreate()
    ...

Going further, we define our MemoryStream using the String type parameter. I use String because that's the type of my sample data. To create the MemoryStream, two imports are needed: the sqlContext, which can be obtained from the SparkSession, and the Spark implicits library, which includes the needed type encoders. Lastly, I convert the stream to a Dataset.

implicit val sqlCtx: SQLContext = spark.sqlContext
import spark.implicits._

val events = MemoryStream[String]
val sessions = events.toDS

Converting to a Dataset allows us to make the first assertion: we can check whether the Dataset is indeed a streaming one. You can do this using the following code.

assert(sessions.isStreaming, "sessions must be a streaming Dataset")

Next, I add the transformations that have been included in the sample job. In this case, however, I will not be checking whether the data has been saved in Mongo; I am only interested in whether the raw stringified JSON has been parsed correctly and I can run SQL queries on it. The transformations I use are:

// The same schema the job uses (Schema.getRsvpStructSchema shown earlier)
val rsvpStruct = Schema.getRsvpStructSchema
val transformedSessions = sessions.select(from_json($"value", rsvpStruct).as("rsvp"))

val streamingQuery = transformedSessions
  .writeStream
  .format("memory")
  .queryName("rawRsvp")
  .outputMode("append")
  .start()

It is important to use the memory format so that the actual data can be queried in a further step to make the assertions. It is also useful to name those results, using the queryName option.

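Under the hood, the memory sink registers the streaming results as a temporary in-memory table named after the query, so once some data has been processed it can be inspected like any other table. For instance (illustrative only, reusing the rawRsvp name from above):

// Both calls read from the in-memory table registered by the memory sink as "rawRsvp".
spark.sql("select * from rawRsvp").show(truncate = false)
spark.table("rawRsvp").printSchema()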

Lastly, I add my sample data to the MemoryStream instance, process it and commit the offsets, as shown below:

val batch = sampleRsvp
val currentOffset = events.addData(batch)

streamingQuery.processAllAvailable()
events.commit(currentOffset.asInstanceOf[LongOffset])

The very last step is to query the committed data and run some assertions on it. In my case, I run a SQL query to get the event_name property and run assertions against that.

val rsvpEventName = spark.sql("select rsvp.event.event_name from rawRsvp")
  .collect()
  .map(_.getAs[String](0))
  .head

assert(rsvpEventName == "Eugene ISSA & Technology Association of Oregon December Cyber Security Meetup")
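
One detail the post leaves out is cleanup. In a real test suite it is worth stopping the streaming query when the test is done (and the SparkSession once the whole suite has finished) so resources do not leak between tests. Assuming the names used above:

// Stop the in-memory streaming query started in this test; stop the session after the last test.
streamingQuery.stop()
spark.stop()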

Summary

I hope you have found this post useful. If so, don’t hesitate to like or share it. Additionally, you can follow me on my social media if you fancy 🙂

Translated from: https://medium.com/swlh/unit-testing-apache-spark-structured-streaming-using-memorystream-8e77e97c5f5d
