Calling a class method inside a PySpark transformation raises an error:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Cause:
Spark does not allow SparkContext to be accessed inside an action or transformation. If the function you pass to an action or transformation references self, Spark serializes the entire object and ships it to the worker nodes. That object holds the SparkContext, so even if you never access it explicitly, it is still captured in the closure, and the error is raised.
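The mechanism can be reproduced with plain pickle, no Spark required. A minimal sketch, assuming a threading.Lock as a stand-in for the unserializable SparkContext (the class and attribute names here are hypothetical):

```python
import pickle
import threading

class Model:
    def __init__(self):
        # Stand-in for SparkContext: a lock is likewise unpicklable.
        self.ctx = threading.Lock()

    def transform(self, row):
        # Bound method: even though it never touches self.ctx,
        # serializing it means serializing self, and with it self.ctx.
        return row.split(',')[0]

m = Model()
try:
    pickle.dumps(m.transform)  # drags the whole instance along
    captured = False
except TypeError:
    captured = True

print(captured)  # True: the bound method cannot be serialized
```

This is exactly what happens on the driver before the task is shipped: the closure serializer walks from the function to self, and from self to the SparkContext.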
Solution:
Define the method being called as a static method with @staticmethod, so it carries no reference to the instance:
class model(object):
    @staticmethod
    def transformation_function(row):
        # A plain function: no reference to self, so nothing else is serialized.
        row = row.split(',')
        return row[0] + row[1]

    def __init__(self):
        # sc is the SparkContext, assumed to exist on the driver.
        self.data = sc.textFile('some.csv')

    def run_model(self):
        # Reference the staticmethod through the class, not through self.
        self.data = self.data.map(model.transformation_function)
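By contrast with a bound method, a staticmethod accessed through the class is a plain function that carries no reference to any instance, so pickling it records only the function's qualified name. A Spark-free sketch with stand-in names (Model and ctx are hypothetical, mirroring the snippet above):

```python
import pickle
import threading

class Model:
    def __init__(self):
        # Stand-in for SparkContext: unpicklable, but never captured below.
        self.ctx = threading.Lock()

    @staticmethod
    def transformation_function(row):
        parts = row.split(',')
        return parts[0] + parts[1]

# Accessed through the class, the staticmethod is an ordinary function:
# pickling it serializes a name reference, not an instance.
blob = pickle.dumps(Model.transformation_function)
fn = pickle.loads(blob)
print(fn('a,b,c'))  # ab
```

This is why `self.data.map(model.transformation_function)` succeeds: only the function travels to the workers, not the model instance or the SparkContext it holds.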
Reference:
https://stackoverflow.com/questions/32505426/how-to-process-rdds-using-a-python-class