I am trying to serialize a Spark RDD by pickling it, and read the pickled file directly into Python.
a = sc.parallelize(['1','2','3','4','5'])
a.saveAsPickleFile('test_pkl')
I then copy the test_pkl files to my local machine. How can I read them directly into Python? When I try the standard pickle package, it fails as soon as I attempt to read the first pickle part of 'test_pkl':
pickle.load(open('part-00000','rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/pickle.py", line 1370, in load
    return Unpickler(file).load()
  File "/usr/lib64/python2.6/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib64/python2.6/pickle.py", line 970, in load_string
    raise ValueError, "insecure string pickle"
ValueError: insecure string pickle
I assume that the pickling method Spark uses is different from Python's pickle method (correct me if I am wrong). Is there any way to pickle data from Spark and read the pickled object directly into Python from the file?
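That assumption is essentially right: saveAsPickleFile does not write a raw pickle stream. It wraps batches of pickled objects in a Hadoop SequenceFile container, whose header begins with the magic bytes b"SEQ", which is why plain pickle.load chokes on the part files. A minimal check (a sketch; the helper name is mine) makes the cause of the ValueError visible:

```python
def looks_like_sequencefile(path):
    """Return True if the file starts with the Hadoop SequenceFile magic.

    SequenceFiles begin with the three bytes b"SEQ" followed by a version
    byte, so a part file written by saveAsPickleFile will match, while a
    file written by the standard pickle module will not.
    """
    with open(path, "rb") as f:
        return f.read(3) == b"SEQ"
```

Running this on part-00000 should return True, confirming the file is a SequenceFile rather than a pickle stream.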
Solution
This is possible using the sparkpickle project. It is as simple as:
import sparkpickle

with open("/path/to/file", "rb") as f:
    print(sparkpickle.load(f))
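If installing an extra package is not an option, another route is to collect the RDD on the driver and dump it with the standard pickle module, so the file can later be read anywhere with plain pickle.load. A sketch (the literal list stands in for rdd.collect(), and it assumes the data fits in driver memory):

```python
import pickle

# Stand-in for rdd.collect() on the driver; with a live SparkContext this
# would be: data = sc.parallelize(['1','2','3','4','5']).collect()
data = ['1', '2', '3', '4', '5']

# Write a plain pickle stream instead of Spark's SequenceFile container.
with open('test_pkl_plain.pkl', 'wb') as f:
    pickle.dump(data, f)

# Any Python process can now read it back with the standard library alone.
with open('test_pkl_plain.pkl', 'rb') as f:
    restored = pickle.load(f)
```

The trade-off is that collect() pulls the whole dataset to one machine, so this only works when the RDD is small enough to fit there.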