Notes for my own reference.
Mission:
Package a Python project as a Spark job that is triggered by an upstream Spark job.
Run the Python Spark job:
spark-submit --py-files pkg.zip main.py
package:
Python package with:
Zip: zip -r HiveValidation.zip ./
libs: bin/pip install --install-option --install-lib=./ pyspark
virtualenv : pip install virtualenv
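A link I read earlier packaged the project with virtualenv, but Anaconda is more convenient if you have it: it is a Python management tool that lets you pick the Python version.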
Steps:
0. List the Python environments on this machine
conda env list
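1. Create a clean Python environment. For a given Python 2/3 version spec, conda installs the newest matching release by default; python=2.7 installed 2.7.16 here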
conda create -n env_name python=2.7
2. Activate the environment env_name
conda activate env_name
3. List the required dependencies in requirements.txt
pandas
numpy
tomorrow
thrift
4. Install every dependency listed in requirements.txt into the deps folder
pip install -r requirements.txt -t deps
5. Zip everything under the current path except the files under dist
zip -r pkg.zip ./ -x "dist/*"
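Step 5 can also be done from Python when the `zip` command is not available. This is a minimal sketch using only the standard library; the helper name `build_pkg` and its parameters are illustrative assumptions, not part of the original workflow.

```python
import os
import zipfile


def build_pkg(src_dir, zip_path, exclude_dirs=("dist",)):
    """Zip everything under src_dir except the top-level dirs in exclude_dirs.

    Hypothetical helper that mirrors: zip -r pkg.zip ./ -x "dist/*"
    """
    src_dir = os.path.abspath(src_dir)
    zip_path = os.path.abspath(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(src_dir):
            if root == src_dir:
                # Prune excluded top-level directories in place so
                # os.walk never descends into them.
                dirs[:] = [d for d in dirs if d not in exclude_dirs]
            for name in files:
                full = os.path.join(root, name)
                if full == zip_path:
                    continue  # never add the archive to itself
                zf.write(full, os.path.relpath(full, src_dir))
    return zip_path
```

Calling `build_pkg(".", "pkg.zip")` from the project root should produce an archive with main.py and deps/ at the top level, which is the layout the later steps expect.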
Questions:
Question0: java.lang.IllegalStateException: User did not initialize spark context!
Solution: Change the deploy mode from YARN cluster to YARN client (my Python project doesn't need Spark itself; the Spark job is only the trigger)
YARN cluster vs. YARN client: https://blog.csdn.net/kaaosidao/article/details/77948121
# SparkContext
from pyspark import SparkContext, SparkConf
sc = SparkContext()
Question1: Cannot find/load modules (standard libs or custom modules) under Python 3.7
Solution:
Step1: Create two test envs: one with the full deps for running/packing, one with no deps for testing the unpacked archive
Step2: pip install -r requirements.txt -t deps
SubQuestion1.1: cannot pip install sasl
Solution: pip install git+https://github.com/JoshRosen/python-sasl.git@fix-build-with-newer-xcode (a fixed version of sasl)
Step3: Pack, upload, and run. Custom libs still cannot be found (see python.log), perhaps because of the Python version
Step4: Check the Python env on the remote Spark machine; go to Question2
Question2: Move the env from Python 3.7 (local machine) to Python 2.7.5 (remote Spark)
Solution:
Step1: Print the remote Python version. It is 2.7.5 while the local Python is 3.7, so create a new Python 2.7.16 env with Anaconda
Step2: Repeat the Question1 solution. Modules/libs still cannot be found, so check whether the uploaded packages are actually being picked up
Tips:
import os
print('current path: {}'.format(os.getcwd()))
print('Dirs: {}'.format(os.listdir(os.getcwd())))
os.system('echo Current python: `python --version`')  # terminal output: Current python: 'python 2.7.5'
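The checks above can be gathered into one helper called at the top of main.py. This is a sketch; `describe_runtime` is a hypothetical name. It reads the version from `sys.version_info`, which describes the interpreter actually running the script, whereas `python --version` describes whichever `python` comes first on the PATH.

```python
import os
import sys


def describe_runtime():
    """Return a dict describing the interpreter and the working directory."""
    cwd = os.getcwd()
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "executable": sys.executable,
        "cwd": cwd,
        "entries": sorted(os.listdir(cwd)),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print("{}: {}".format(key, value))
```

If pkg.zip appears in `entries` but deps does not, the archive was uploaded but not extracted.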
Question3: The uploaded zip file is not unzipped on the remote machine, so the deps cannot be imported, which accounts for the exceptions
Solution: Unzip the archive from main.py (or read files directly from the zip)
os.system('unzip -q -o ./pkg.zip')  # -q: quiet, -o: overwrite without prompting
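Shelling out to `unzip` assumes the binary exists on the remote node. A sketch of the same step with the standard `zipfile` module follows; the helper name `extract_pkg` and its defaults are illustrative, with the archive name taken from the packaging step.

```python
import os
import zipfile


def extract_pkg(zip_path="pkg.zip", dest="."):
    """Extract zip_path into dest, overwriting existing files (like unzip -o)."""
    if not os.path.isfile(zip_path):
        raise IOError("archive not found: {}".format(zip_path))
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```

Pure-Python modules can also be imported straight from a zip that is on sys.path, but compiled extensions such as numpy have to be on disk, so extracting is the safer choice for this dependency set.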
Question4: The remote Python prefers its own system packages when they exist, instead of the uploaded deps
Solution: Put the uploaded deps ahead of the system libs with sys.path.insert
import sys
sys.path.insert(0, 'deps')
Tips: Add the current path, deps/, and python/ to sys.path so that .py files under deps/ can be imported
import os
sys.path.insert(0, os.getcwd())
sys.path.insert(0, os.path.join(os.getcwd(), 'deps'))
sys.path.insert(0, os.path.join(os.getcwd(), 'python'))
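Because each `sys.path.insert(0, ...)` pushes earlier entries down, the last insert wins. The sketch below makes that order explicit and avoids duplicate entries on repeated calls; `prepend_paths` is a hypothetical helper, and its default order reproduces the result of the three inserts above (python/, then deps/, then the current path).

```python
import os
import sys


def prepend_paths(subdirs=("python", "deps", ""), base=None):
    """Put base/<subdir> at the front of sys.path; earlier subdirs win.

    An empty subdir stands for base itself. The names come from the
    notes above; adjust them to the archive layout.
    """
    base = base or os.getcwd()
    paths = [os.path.join(base, d) if d else base for d in subdirs]
    # Insert in reverse so the first subdir ends up at index 0.
    for path in reversed(paths):
        if path in sys.path:
            sys.path.remove(path)
        sys.path.insert(0, path)
    return paths
```

Call it after the archive has been extracted and before importing pandas, numpy, or any other uploaded dependency.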
Question5: Failed to load/restore the pyspark environment on the remote machine
Solution: Delete the following environment variables and reload
```python
import os

# Drop these variables if they are set; pop() is a no-op for missing keys.
for key in ('PYSPARK_GATEWAY_PORT', 'PYSPARK_GATEWAY_SECRET', 'SPARK_HOME'):
    os.environ.pop(key, None)
```
Question6: Python 2.7.5 cannot import pandas because the numpy, pytz, and dateutil versions do not match (perhaps they are too new)
Solution: Remove the -info directories and recompile the packages on Linux; this needs a server with root access
Question7: Cannot find module sasl on Linux
Solution: Recompile on Linux as root
pip install sasl
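Question8: Hive query failed: 'ascii' codec can't encode character u'\xb0' in position 63: ordinal not in range(128)
(Python 2.7 differs from Python 3 here)
import os
import sys
os.environ['LC_ALL'] = 'en_US.utf8'
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')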
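Changing the default encoding affects the whole process and only works on Python 2. A narrower option is to encode the query text explicitly before sending it. The sketch below runs on both Python 2.7 and 3; `to_utf8` is a hypothetical helper, and whether the Hive client in use accepts byte strings is an assumption to verify.

```python
import sys

PY2 = sys.version_info[0] == 2
# 'unicode' exists only on Python 2; the conditional keeps Python 3
# from ever evaluating the name.
text_type = unicode if PY2 else str  # noqa: F821


def to_utf8(value):
    """Return text as UTF-8 bytes; byte strings pass through unchanged."""
    if isinstance(value, text_type):
        return value.encode("utf-8")
    return value
```

For example, `to_utf8(u'\xb0')` yields the two bytes `\xc2\xb0` instead of raising the 'ascii' codec error.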