I'm iterating over files to collect information, in a dictionary, about the values in their columns and rows. I have the following code, which works locally:

    from collections import defaultdict

    def search_nulls(file_name):
        separator = ','
        nulls_dict = {}
        fp = open(file_name, 'r')
        null_cols = {}
        lines = fp.readlines()
        for n, line in enumerate(lines):
            line = line.split(separator)
            for m, data in enumerate(line):
                data = data.strip('\n').strip('\r')
                if str(m) not in null_cols:
                    null_cols[str(m)] = defaultdict(lambda: 0)
                if len(data) <= 4:
                    null_cols[str(m)][str(data)] = null_cols[str(m)][str(data)] + 1
        return null_cols

    files_to_process = ['tempfile.csv']
    results = map(lambda file: search_nulls(file), files_to_process)
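For reference, the per-column counting pattern used above, in miniature, with a couple of inline sample rows (hypothetical data) instead of a file:

```python
from collections import defaultdict

rows = ["a,,x", "b,,x"]  # sample CSV lines standing in for the file contents
null_cols = {}
for line in rows:
    for m, data in enumerate(line.split(',')):
        if str(m) not in null_cols:
            null_cols[str(m)] = defaultdict(lambda: 0)
        if len(data) <= 4:
            null_cols[str(m)][str(data)] += 1

# column '1' saw the empty string in both rows
```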
The code above runs fine without Spark.

I commented out the last two lines above and tried it with Spark, since this is a prototype that will need to run distributed:

    os.environ['SPARK_HOME'] =
    conf = SparkConf().setAppName("search_files").setMaster('local')
    sc = SparkContext(conf=conf)
    objects = sc.parallelize(files_to_process)
    resulting_object = \
        objects.map(lambda file_object: find_nulls(file_object))
    result = resulting_object.collect()
With Spark, however, this fails with the following error:

      File "/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
        process()
      File "/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/python/lib/pyspark.zip/pyspark/serializers.py", line 267, in dump_stream
        bytes = self.serializer.dumps(vs)
      File "/python/lib/pyspark.zip/pyspark/serializers.py", line 415, in dumps
        return pickle.dumps(obj, protocol)
    TypeError: expected string or Unicode object, NoneType found
I can't find any obvious reason for the failure, since it runs perfectly locally, and I'm not sharing any files between worker nodes. In fact, I'm only running this on my local machine.

Does anyone know why this might fail?
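One check I can reproduce outside Spark, on the assumption that the worker serializes the function's return value with the stdlib pickler (the traceback ends in `pickle.dumps`): trying to pickle a structure shaped like what `search_nulls` returns.

```python
import pickle
from collections import defaultdict

# mimic the returned structure: a dict of defaultdicts whose
# default_factory is a lambda, as in search_nulls
null_cols = {'0': defaultdict(lambda: 0)}
null_cols['0']['NULL'] += 1

try:
    pickle.dumps(null_cols)
    print("picklable")
except Exception as exc:
    # a lambda default_factory cannot be pickled by the stdlib pickler
    print("not picklable:", type(exc).__name__)
```

I'm not sure whether this is the same failure Spark hits, since the error messages differ, but it's the only part of the return value that plain pickle rejects for me.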