PySpark can't pickle method_descriptor

If anything below is inaccurate, feel free to leave a comment and discuss, thanks~

To be honest, I haven't fully understood the theory behind this error; explanations from more experienced readers are welcome.

Broken code:

from impala.dbapi import connect

is_test = False
host = '192.168.0.1' if is_test else '192.168.0.1'  # test/production hosts (anonymized to the same IP here)

# The connection and cursor are created on the driver.
conn = connect(host=host, port=25001, timeout=3600)
cursor = conn.cursor()
sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'

data = [('0', '1', "a", '23.0', 'a'), ('1', '3', "C", '-23.0', 'a'), ('2', '3', "A", '-21.0', 'a'), ('3', '2', "B", '-19.0', 'a')]
rdd = sc.parallelize(data)  # sc is the SparkContext provided by the pyspark shell
rdd2 = rdd.map(lambda x: cursor.execute(sql, x))  # the lambda closes over `cursor`, which cannot be pickled
rdd2.collect()



Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 824, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2470, in _jrdd
    self._jrdd_deserializer, profiler)
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2403, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2389, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/usr/local/lib/python2.7/site-packages/pyspark/serializers.py", line 568, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/usr/local/lib/python2.7/site-packages/pyspark/cloudpickle.py", line 918, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python2.7/site-packages/pyspark/cloudpickle.py", line 249, in dump
    raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: TypeError: can't pickle cStringIO.StringO objects

Working code:

from impala.dbapi import connect

def fun2(row, is_test=True):
    # Connect inside the function, so each executor builds its own
    # connection instead of receiving a pickled one from the driver.
    host = '192.168.0.1' if is_test else '192.168.0.1'
    conn = connect(host=host, port=25001, timeout=3600)
    cursor = conn.cursor()
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    cursor.execute(sql, row)

data = [('0', '1', "a", '23.0', 'a'), ('1', '3', "C", '-23.0', 'a'), ('2', '3', "A", '-21.0', 'a'), ('3', '2', "B", '-19.0', 'a')]
rdd = sc.parallelize(data)
rdd2 = rdd.map(fun2)  # fun2 itself pickles fine; only plain data crosses the driver/executor boundary
rdd2.collect()

REASON:

Spark pickles the function passed to map together with everything it closes over, so it tries to serialize the cursor (and, through it, the connection) in order to ship them to the executors. That is bound to fail: a live database connection wraps sockets and I/O buffers that are meaningless in another process, let alone on another machine, and a deserialized copy could not carry the original's read/write session anyway. The failure can be reproduced by trying to broadcast the connection object; in this instance pickling broke on an internal I/O object (the cStringIO.StringO in the traceback above).
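For example (a minimal sketch, assuming the driver-side conn from the broken code above is still in scope), broadcasting fails immediately because Broadcast pickles its value eagerly:

sc.broadcast(conn)  # fails with a similar PicklingError: the connection's internal I/O objects can't be serialized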

The problem was partly solved by connecting to the database inside the map function. But that opens one connection per RDD element, which is far too many, so I switched to partition-level processing (see the sketch below), cutting the connection count from about 20k (one per row) to about 8-64 (one per partition). Spark's developers should consider providing an initialization function/script for executors to avoid this kind of dead end.
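A minimal sketch of that partition-level rewrite, assuming the same test_db.test_table schema and impyla client as above (insert_partition is a helper name I'm introducing):

from impala.dbapi import connect

def insert_partition(rows):
    # One connection per partition instead of one per row.
    conn = connect(host='192.168.0.1', port=25001, timeout=3600)
    cursor = conn.cursor()
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    for row in rows:
        cursor.execute(sql, row)
    conn.close()

rdd.foreachPartition(insert_partition)  # an action, so no collect() is needed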

Suppose such an init function ran once on every executor: each executor would then hold its own connection (a small pool, or connections to separate zookeeper nodes), the init function and the map functions would share the same scope, the problem would be gone, and the resulting code would be faster than the workaround I found. At the end of the job Spark would free/unload these variables and the program would end.
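Until Spark offers that, one common way to approximate the init function is a lazily initialized module-level connection. The sketch below is an assumption-laden illustration, not Spark's API: _get_cursor and insert_row are hypothetical names, and the pattern relies on the functions living in a module shipped to the executors (e.g. via --py-files), where the module global can survive across tasks running in the same Python worker:

from impala.dbapi import connect

_cursor = None  # per-Python-worker singleton

def _get_cursor():
    global _cursor
    if _cursor is None:  # first task on this worker pays the connection cost
        conn = connect(host='192.168.0.1', port=25001, timeout=3600)
        _cursor = conn.cursor()
    return _cursor

def insert_row(row):
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    _get_cursor().execute(sql, row)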

References:

Spark can't pickle method_descriptor: https://stackoverflow.com/questions/28142578/spark-cant-pickle-method-descriptor
