Calling a class method inside a PySpark transformation raises an error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Cause:
Spark does not allow SparkContext to be accessed inside an action or transformation. If the function you pass to an action or transformation references self, Spark serializes the entire object and ships it to the worker nodes. That object holds the SparkContext, so even if you never access it explicitly, it is still captured in the closure, and the error is raised.
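The mechanism can be reproduced with plain pickle, no Spark required. A minimal sketch, assuming a threading.Lock as a stand-in for the unserializable SparkContext (the class and attribute names here are hypothetical):

```python
import pickle
import threading

class Model:
    def __init__(self):
        # Stand-in for SparkContext: a lock is likewise unpicklable.
        self.ctx = threading.Lock()

    def transform(self, row):
        # Bound method: even though it never touches self.ctx,
        # serializing it means serializing self, and with it self.ctx.
        return row.split(',')[0]

m = Model()
try:
    pickle.dumps(m.transform)  # drags the whole instance along
    captured = False
except TypeError:
    captured = True

print(captured)  # True: the bound method cannot be serialized
```

This is exactly what happens on the driver before the task is shipped: the closure serializer walks from the function to self, and from self to the SparkContext.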
Solution:
Define the method being called as a static method with @staticmethod, so it carries no reference to the instance:
class model(object):
    @staticmethod
    def transformation_function(row):
        # A plain function: no reference to self, so nothing else is serialized.
        row = row.split(',')
        return row[0] + row[1]

    def __init__(self):
        # sc is the SparkContext, assumed to exist on the driver.
        self.data = sc.textFile('some.csv')

    def run_model(self):
        # Reference the staticmethod through the class, not through self.
        self.data = self.data.map(model.transformation_function)
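By contrast with a bound method, a staticmethod accessed through the class is a plain function that carries no reference to any instance, so pickling it records only the function's qualified name. A Spark-free sketch with stand-in names (Model and ctx are hypothetical, mirroring the snippet above):

```python
import pickle
import threading

class Model:
    def __init__(self):
        # Stand-in for SparkContext: unpicklable, but never captured below.
        self.ctx = threading.Lock()

    @staticmethod
    def transformation_function(row):
        parts = row.split(',')
        return parts[0] + parts[1]

# Accessed through the class, the staticmethod is an ordinary function:
# pickling it serializes a name reference, not an instance.
blob = pickle.dumps(Model.transformation_function)
fn = pickle.loads(blob)
print(fn('a,b,c'))  # ab
```

This is why `self.data.map(model.transformation_function)` succeeds: only the function travels to the workers, not the model instance or the SparkContext it holds.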
Reference:
https://stackoverflow.com/questions/32505426/how-to-process-rdds-using-a-python-class