Python--spark job--linux

Notes for my own reference.

Goal:

Package a Python project as a Spark job that is triggered by an upstream Spark job.

Run a Python script as a Spark job:

spark-submit --py-files pkg.zip main.py
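When the job is launched from a script rather than by hand, the same command can be assembled in Python. This is only a minimal sketch: the helper names `build_submit_command` and `submit` are made up for illustration, and `--master yarn` with `--deploy-mode client` is an assumption about the cluster, not something taken from the upstream job.

```python
import subprocess


def build_submit_command(main_file, py_files, deploy_mode="client"):
    """Return the spark-submit argument list for a Python entry point."""
    return [
        "spark-submit",
        "--master", "yarn",              # assumption: the cluster runs on YARN
        "--deploy-mode", deploy_mode,
        "--py-files", py_files,
        main_file,
    ]


def submit(main_file="main.py", py_files="pkg.zip"):
    """Run spark-submit and return its exit code (needs spark-submit on PATH)."""
    return subprocess.call(build_submit_command(main_file, py_files))
```

Building the command as a list avoids shell quoting problems when a path contains spaces.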

Packaging:

Ways to package a Python project:

Zip: zip -r HiveValidation.zip ./

Libs: bin/pip install --install-option --install-lib=./ pyspark

virtualenv: pip install virtualenv

URL: https://stackoverflow.com/questions/17486578/how-can-you-bundle-all-your-python-code-into-a-single-zip-file

The link above, which I read earlier, packages with virtualenv. With Anaconda it is more convenient: it is a tool for managing Python environments and lets you choose the Python version.

Steps:

0.  List the Python environments on the machine

conda env list

1.  Create a clean Python environment. For Python 2 or 3, conda installs the latest release of the requested version by default; python=2.7 gave me 2.7.16

conda create -n env_name python=2.7

2.  Activate the environment env_name

conda activate env_name 

3.  Put the required dependencies in requirements.txt

pandas
numpy
tomorrow
thrift

4.  Install all dependencies listed in requirements.txt into the deps folder

pip install -r requirements.txt -t deps   

5.  Zip everything under the current path except the files under dist

zip -r pkg.zip ./ -x "dist/*"     
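If the zip command is not available on the build machine, the last step can be approximated with the standard-library zipfile module. This is a sketch, not the method used above: `build_package` is a hypothetical helper, and it only skips the top-level folders named in `exclude_dirs`.

```python
import os
import zipfile


def build_package(src_dir=".", zip_path="pkg.zip", exclude_dirs=("dist",)):
    """Zip src_dir recursively, skipping the top-level folders in exclude_dirs."""
    src_abs = os.path.abspath(src_dir)
    zip_abs = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_abs, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, dirs, files in os.walk(src_dir):
            if os.path.abspath(root) == src_abs:
                # Prune excluded folders so os.walk never descends into them.
                dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for name in files:
                full_path = os.path.join(root, name)
                if os.path.abspath(full_path) == zip_abs:
                    continue  # do not add the archive to itself
                archive.write(full_path, os.path.relpath(full_path, src_dir))
    return zip_abs
```

Calling `build_package()` from the project root writes pkg.zip next to main.py, with deps/ included and dist/ left out.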

Questions:

Question0: java.lang.IllegalStateException: User did not initialize spark context!

Solution: switch the deploy mode from YARN cluster to YARN client (my Python project does not use Spark itself; the Spark job only triggers it)

YARN cluster vs. YARN client: https://blog.csdn.net/kaaosidao/article/details/77948121

# SparkContext
from pyspark import SparkContext, SparkConf
sc = SparkContext()

Question1: cannot find/load modules (standard libs or custom modules) with Python 3.7

Solution:

Step1: create two test envs: one with the full deps for running and packing, one with no deps for testing the unpacked package

Step2: pip install -r requirements.txt -t deps

  SubQuestion1.1: pip cannot install sasl

  Solution: pip install git+https://github.com/JoshRosen/python-sasl.git@fix-build-with-newer-xcode (a patched version of sasl)

Step3: pack, upload and run; the custom libs still cannot be found (see python.log), perhaps because of the Python version

Step4: check the Python env of the remote Spark, go to Question2

Question2: move the env from Python 3.7 (local machine) to Python 2.7.5 (remote Spark)

Solution:

Step1: print the remote Python version, which is 2.7.5; the local Python is 3.7, so create a new Python 2.7.16 env with Anaconda

Step2: go through the solution to Question1 again; the modules/libs still cannot be found, so I doubted whether the uploaded packages were referenced correctly

Tips: print the working directory, its contents and the Python version from main.py

import os

print('current path: {}'.format(os.getcwd()))
print('Dirs: {}'.format(os.listdir(os.getcwd())))

os.system('echo Current python: `python --version`')     # terminal output: Current python: 'python 2.7.5'
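Note that `python --version` in a shell reports whichever `python` comes first on PATH, which is not necessarily the interpreter running main.py. A small sketch that asks the running interpreter directly (`describe_runtime` is a hypothetical helper name):

```python
import os
import platform
import sys


def describe_runtime():
    """Collect basic facts about the interpreter that is running this script."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "cwd": os.getcwd(),
        "cwd_entries": sorted(os.listdir(os.getcwd())),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print("{}: {}".format(key, value))
```

The code uses only `str.format` and the standard library, so it should run unchanged on both Python 2.7 and Python 3.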

Question3: the uploaded zip file is not unzipped on the remote side, so the deps cannot be imported, which explains the exceptions above

Solution: unzip the archive from main.py (or read the files directly from the zip)

os.system('unzip -q -o ./pkg.zip')  # -q: quiet, -o: overwrite existing files without prompting
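The same extraction can be done without relying on an `unzip` binary being installed on the remote machine. A minimal sketch with the standard-library zipfile module; `extract_package` is a hypothetical helper, and pkg.zip in the working directory is assumed from the command above.

```python
import os
import zipfile


def extract_package(zip_path="pkg.zip", target_dir="."):
    """Extract zip_path into target_dir, overwriting existing files."""
    if not zipfile.is_zipfile(zip_path):
        # Also covers the case where the archive was never uploaded.
        raise IOError("not a zip archive: {}".format(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    return os.path.abspath(target_dir)
```

Raising early when the archive is missing gives a clearer error than a later ImportError for a module under deps.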

Question4: the remote Python prefers its own system libraries when they exist, instead of the uploaded deps

Solution: put the uploaded deps ahead of the system libs with sys.path.insert

import sys
sys.path.insert(0, 'deps')

Tips: add the current path, deps/ and python/ to sys.path so that .py files under them can be imported

import os
import sys

sys.path.insert(0, os.getcwd())
sys.path.insert(0, os.path.join(os.getcwd(), 'deps'))
sys.path.insert(0, os.path.join(os.getcwd(), 'python'))
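The path setup can be wrapped in a helper that skips folders missing from the package and avoids duplicate entries when it is called more than once. This is a sketch under the folder layout assumed above (deps/ and python/ next to main.py); `prepend_paths` is a hypothetical name.

```python
import os
import sys


def prepend_paths(subdirs=("python", "deps", ""), base_dir=None):
    """Put the given folders at the front of sys.path, highest priority first.

    An empty string in subdirs stands for base_dir itself.
    """
    base_dir = base_dir or os.getcwd()
    added = []
    # Insert in reverse so the first entry of subdirs ends up at sys.path[0].
    for subdir in reversed(subdirs):
        path = os.path.join(base_dir, subdir) if subdir else base_dir
        if not os.path.isdir(path):
            continue  # skip folders that are not in the uploaded package
        if path in sys.path:
            sys.path.remove(path)
        sys.path.insert(0, path)
        added.insert(0, path)
    return added
```

With the default arguments the resulting order is python/, deps/, then the working directory, which matches the three `sys.path.insert` calls.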

Question5: failed to load/restore the pyspark environment on the remote machine

Solution: delete the PySpark-related environment variables, then reload

```python
import os

# Remove the PySpark variables from the environment, if they are set
if 'PYSPARK_GATEWAY_PORT' in os.environ:
    del os.environ['PYSPARK_GATEWAY_PORT']
if 'PYSPARK_GATEWAY_SECRET' in os.environ:
    del os.environ['PYSPARK_GATEWAY_SECRET']
if 'SPARK_HOME' in os.environ:
    del os.environ['SPARK_HOME']
```

Question6: Python 2.7.5 cannot import pandas because the numpy, pytz and dateutil versions do not match (perhaps they are too new)

Solution: remove -info, then rebuild the packages on Linux; this needs a server with root access
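To see which version of each package is actually picked up on the remote machine, and from which folder, something like the following can be run from main.py. It is a diagnostic sketch only; `module_origin` is a hypothetical helper and it does not fix the mismatch itself.

```python
import importlib


def module_origin(name):
    """Return the version and file of a module, or the error raised on import."""
    try:
        module = importlib.import_module(name)
    except Exception as exc:  # report any import-time failure, not only ImportError
        return {"name": name, "error": repr(exc)}
    return {
        "name": name,
        "version": getattr(module, "__version__", None),
        "file": getattr(module, "__file__", None),
    }
```

Printing `module_origin(p)` for numpy, pytz, dateutil and pandas shows whether each one comes from deps/ or from the system site-packages.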

Question7: cannot find module sasl on Linux

Solution: rebuild it on Linux as root

pip install sasl

Question8: Hive query failed: 'ascii' codec can't encode character u'\xb0' in position 63: ordinal not in range(128)

(Python 2.7 handles string encoding differently from Python 3; the workaround below is for Python 2 only)

import os
import sys

os.environ['LC_ALL'] = 'en_US.utf8'
reload(sys)
sys.setdefaultencoding('utf8')
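`sys.setdefaultencoding` is not available on Python 3, so code that has to run on both versions cannot rely on it. One version-independent alternative is to encode text explicitly before handing it to an API that expects bytes. This is a sketch: `to_utf8_bytes` is a hypothetical helper, and where it would be applied in the Hive query code is not shown in these notes.

```python
import sys

# On Python 2 the text type is `unicode`; on Python 3 it is `str`.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821


def to_utf8_bytes(value):
    """Encode text as UTF-8 bytes; pass any other value through unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value
```

For the character in the error message, `to_utf8_bytes(u'\xb0')` gives the two bytes `\xc2\xb0` instead of raising.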

Reference: https://blog.csdn.net/crazyhacking/article/details/39375535
