How do I make Hadoop find imported Python modules when using Python UDFs in Pig?

I am using Pig (0.9.1) with UDFs written in Python. The Python scripts import modules from the standard Python library. I have been able to run the Pig scripts that call the Python UDFs successfully in local mode, but when I run on the cluster it appears the Hadoop job Pig generates is unable to find the imported modules. What needs to be done?

For example:

Does python (or jython) need to be installed on each task tracker node?

Do the python (or jython) modules need to be installed on each task tracker node?

Do the task tracker nodes need to know how to find the modules?

If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

Solution

Does python (or jython) need to be installed on each task tracker node?

Yes, since it's executed on the task trackers.

Do the python (or jython) modules need to be installed on each task tracker node?

If you are using a third-party module, it needs to be installed on the task trackers as well (geoip, etc.).
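
For example, a minimal sketch of installing such a module on every node, assuming pip is available and that you keep a (hypothetical) tasktracker_hosts file listing the nodes:

    # run from a machine with ssh access to the cluster;
    # tasktracker_hosts is a placeholder file with one hostname per line
    for host in $(cat tasktracker_hosts); do
        ssh "$host" 'sudo pip install geoip'
    done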

Do the task tracker nodes need to know how to find the modules? If so, how do you specify the path (via an environment variable - how is that done for the task tracker)?

From the book "Programming Pig":

register is also used to locate resources for Python UDFs that you use in your Pig Latin scripts. In this case you do not register a jar, but rather a Python script that contains your UDF. The Python script must be in your current directory.
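
For illustration (the file and function names here are my own, not from the book), a minimal UDF in myudfs.py, placed in the directory you launch Pig from:

    # myudfs.py -- Pig's Jython script engine provides the outputSchema decorator
    @outputSchema("word:chararray")
    def reverse(word):
        return word[::-1]

And in the Pig Latin script:

    REGISTER 'myudfs.py' USING jython AS myfuncs;
    words = LOAD 'input.txt' AS (word:chararray);
    out   = FOREACH words GENERATE myfuncs.reverse(word);
    DUMP out;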

And this part is also important:

A caveat, Pig does not trace dependencies inside your Python scripts and send the needed Python modules to your Hadoop cluster. You are required to make sure the modules you need reside on the task nodes in your cluster and that the PYTHONPATH environment variable is set on those nodes such that your UDFs will be able to find them for import. This issue has been fixed after 0.9, but as of this writing not yet released.
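
One way to do that (a sketch only; the right place to set the variable depends on how your cluster is managed, and the module path below is a placeholder) is to export PYTHONPATH in hadoop-env.sh on every task node and restart the task trackers:

    # $HADOOP_HOME/conf/hadoop-env.sh, on each task tracker node
    # /usr/local/lib/pig_modules is a placeholder for wherever your modules live
    export PYTHONPATH=/usr/local/lib/pig_modules:$PYTHONPATH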

And if you are using Jython:

Pig does not know where on your system the Jython interpreter is, so you must include jython.jar in your classpath when invoking Pig. This can be done by setting the PIG_CLASSPATH environment variable.
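
For example (the jython.jar path is a placeholder for wherever Jython is installed on your machine):

    # make the Jython interpreter visible to Pig before launching the script
    export PIG_CLASSPATH=/usr/local/lib/jython.jar:$PIG_CLASSPATH
    pig myscript.pig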

To summarize: if you are using streaming, you can use the SHIP clause in Pig, which sends your executable files to the cluster (a sketch follows). If you are using a UDF, then as long as it compiles (note the point about Jython above) and has no third-party dependencies that you haven't already put on PYTHONPATH or installed on the cluster, the UDF is shipped to the cluster when the script is executed. (As a tip, it makes your life much easier if you keep your UDF's simple dependencies in the same folder as the Pig script when registering.)
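
A minimal sketch of the streaming case (the script name my_script.py is my own placeholder):

    -- DEFINE ... SHIP sends the local file to every task node;
    -- STREAM then pipes each record through it
    DEFINE my_cmd `my_script.py` SHIP('my_script.py');
    raw       = LOAD 'input' AS (line:chararray);
    processed = STREAM raw THROUGH my_cmd;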

Hope this clears things up.
