Testing with Apache Spark and Python

Apache Spark, and PySpark in particular, are fantastically powerful frameworks for large-scale data processing and analytics.  In the past I’ve written about Flink’s Python API a couple of times, but my day-to-day work is in PySpark, not Flink.  With any data processing pipeline, thorough testing is critical to ensuring the veracity of the end result, so along the way I’ve learned a few rules of thumb and built some tooling for testing PySpark projects.


Breaking out lambdas

When building out a job in PySpark, it can be very tempting to over-use lambda functions.  For instance, a simple map function that just takes an RDD of lists and converts each element to a string could be written two ways:


rdd = rdd.map(lambda x: [str(y) for y in x])

or:


def stringify(x):
    return [str(y) for y in x]

...
rdd = rdd.map(stringify)

The second version, while more verbose, exposes a pure Python function that we can re-use and unit test.

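Because stringify is now just a plain Python function, it can be unit tested with no Spark machinery at all. A minimal sketch, run with something like pytest, assuming the function lives in a module named transforms (a hypothetical name for illustration):

from transforms import stringify

def test_stringify_converts_every_element():
    # Pure Python: no SparkContext or cluster required.
    assert stringify([1, 2.5, None]) == ['1', '2.5', 'None']
    assert stringify([]) == []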

For a more involved example, let’s write a similar function that takes in an RDD of dictionaries and stringifies the values of any key in a set of keys, keylist.  The first way:


rdd = rdd.map(lambda x: {k: (str(v) if k in keylist else v) for k, v in x.items()})

or the more testable way:


def stringify_values(x, keylist):
    return {k: (str(v) if k in keylist else v) for k, v in x.items()}

...

rdd = rdd.map(lambda x: stringify_values(x, keylist))
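
The same applies to stringify_values: because keylist is passed in explicitly, edge cases are easy to pin down in an ordinary unit test. A small sketch along the same lines (module and test names are again hypothetical):

from transforms import stringify_values

def test_stringify_values_only_touches_listed_keys():
    row = {'a': 1, 'b': 2.5, 'c': 'x'}
    out = stringify_values(row, keylist={'a', 'b'})
    # Keys in keylist are stringified; everything else is left untouched.
    assert out == {'a': '1', 'b': '2.5', 'c': 'x'}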

This simple rule of thumb goes a long way toward increasing the testability of a PySpark codebase, but sometimes you do need to test the Spark-y portions of the code.


DummyRDD

One way to do that, for larger-scale tests, is to just run a local instance of Spark for the sake of the tests, but this can be slow, especially if you have to spin Spark contexts up and down over and over for different tests (if you do want to do that, here is a great example of how to).  To get around that, I’ve started a project to write a mock version of PySpark that uses pure Python data structures under the hood to replicate PySpark behavior.

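If you do go the local-Spark route, a session-scoped pytest fixture is a common way to pay the context start-up cost only once per test run. A rough sketch of what that can look like (the fixture, master, and app names here are my own choices, not from the original post):

import pytest
from pyspark import SparkConf, SparkContext

@pytest.fixture(scope='session')
def sc():
    # One real local SparkContext shared by every test in the session,
    # so the JVM start-up cost is paid a single time.
    conf = SparkConf().setMaster('local[2]').setAppName('pyspark-tests')
    context = SparkContext(conf=conf)
    yield context
    context.stop()

Even with a shared context, every test that touches real Spark pays scheduling overhead, which is part of what makes a pure-Python mock attractive.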

The mock is only intended for testing, and doesn’t begin to approach the full capabilities or API of PySpark (notably the DataFrame and Dataset APIs), but it is getting pretty close to a full implementation of the RDD functionality.  Check out the source here: https://github.com/wdm0006/DummyRDD.  DummyRDD works by implementing the underlying RDD data structure simply as a Python list, so that you can use Python’s map, filter, etc. on that list as if it were an RDD.  Of course, Spark is lazily evaluated, so to get comparable outcomes we actually store copies of each intermediate step in memory; large Spark jobs run with the dummy backend will therefore consume large amounts of memory, but for testing this may be fine.


A quick example, showing off some of the methods that are implemented:


import os
import random

from dummy_spark import SparkContext, SparkConf
from dummy_spark.sql import SQLContext
from dummy_spark import RDD

__author__ = 'willmcginnis'

# make a spark conf
sconf = SparkConf()

# set some property (won't do anything)
sconf.set('spark.executor.extraClassPath', 'foo')

# use the spark conf to make a spark context
sc = SparkContext(master='', conf=sconf)

# set the log level (also doesn't do anything)
sc.setLogLevel('INFO')

# maybe make a useless sqlcontext (nothing implemented here yet)
sqlctx = SQLContext(sc)

# add pyfile just appends to the sys path
sc.addPyFile(os.path.dirname(__file__))

# do some hadoop configuration into the ether
sc._jsc.hadoopConfiguration().set('foo', 'bar')

# maybe make some data
rdd = sc.parallelize([1, 2, 3, 4, 5])

# map and collect
print('\nmap()')
rdd = rdd.map(lambda x: x ** 2)
print(rdd.collect())

# add some more in there
print('\nunion()')
rdd2 = sc.parallelize([2, 4, 10])
rdd = rdd.union(rdd2)
print(rdd.collect())

# filter and take
print('\nfilter()')
rdd = rdd.filter(lambda x: x > 4)
print(rdd.take(10))

# flatmap
print('\nflatMap()')
rdd = rdd.flatMap(lambda x: [x, x, x])
print(rdd.collect())

# group by key
print('\ngroupByKey()')
rdd = rdd.map(lambda x: (x, random.random()))
rdd = rdd.groupByKey()
print(rdd.collect())
rdd = rdd.mapValues(list)
print(rdd.collect())

# foreachPartition
print('\nforeachPartition()')
rdd.foreachPartition(lambda x: print('partition: ' + str(x)))

The README contains a list of all implemented methods, which will gradually grow over time.

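To tie the two halves together, here is a rough sketch of how the dummy backend could exercise the earlier stringify function through the RDD API itself, using only methods shown above (the test name and the transforms module are hypothetical):

from dummy_spark import SparkConf, SparkContext

from transforms import stringify

def test_stringify_through_the_rdd_api():
    # The mock context runs entirely in-process on Python lists.
    sc = SparkContext(master='', conf=SparkConf())
    rdd = sc.parallelize([[1, 2], [3.5, None]])
    assert rdd.map(stringify).collect() == [['1', '2'], ['3.5', 'None']]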

Conclusion

Translated from: https://www.pybloggers.com/2016/03/testing-with-apache-spark-and-python/
