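Procedure
1. Share a local directory with the virtual machine; see the earlier post 《通过virtualbox实现虚拟机中共享本地目录》 (sharing a local directory with a virtual machine through VirtualBox).
2. For installing Python and related problems, see "Install Python 3 on CentOS 6.5 Server".
3. Spark is of course required; see 《centos单机安装Spark1.4.0》 (single-node Spark 1.4.0 on CentOS). Spark uses Hadoop; see 《centos单机安装Hadoop2.6》 (single-node Hadoop 2.6 on CentOS).
4. Install and configure the remote side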
vi /etc/profile
Add this line: export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip
source /etc/profile
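To confirm that the PYTHONPATH line took effect, a short script can list any entries that do not exist on disk. This is only a minimal sketch: the helper name missing_pythonpath_entries is hypothetical and is not part of Spark or of the setup above.

```python
import os


def missing_pythonpath_entries(pythonpath):
    """Return the PYTHONPATH entries that do not exist on disk."""
    # Split on the platform separator (":" on Linux) and skip empty entries.
    entries = [p for p in pythonpath.split(os.pathsep) if p]
    return [p for p in entries if not os.path.exists(p)]


if __name__ == "__main__":
    # An empty list means every entry (directory or zip file) was found.
    print(missing_pythonpath_entries(os.environ.get("PYTHONPATH", "")))
```

If the py4j zip shows up in the output, the version in the file name probably differs from the one shipped with the installed Spark release.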
# Install pip and py4j
Download pip-7.1.2.tar
tar -xvf pip-7.1.2.tar
cd pip-7.1.2
python setup.py install
pip install py4j
# Avoid the tty check when running over ssh
cd /etc
chmod 640 sudoers
vi /etc/sudoers
#Defaults requiretty
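A quick way to tell whether the tty requirement is still active is to scan the sudoers text for an uncommented requiretty line. The sketch below is deliberately simplistic: the function name requiretty_active is hypothetical, and it only recognizes the plain global form of the directive, not per-user or negated variants.

```python
import re


def requiretty_active(sudoers_text):
    """Return True if an uncommented 'Defaults requiretty' line is present."""
    # Lines starting with "#" are comments, so they never match this pattern.
    pattern = re.compile(r"^\s*Defaults\s+requiretty\b")
    return any(pattern.match(line) for line in sudoers_text.splitlines())
```

On the remote machine it would be called as root with the contents of /etc/sudoers; a result of False suggests that sudo no longer insists on a tty.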
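5. Local PyCharm settings
Settings > Project Interpreter:
Project Interpreter > Add remote (prerequisite: Python is installed successfully on the remote side):
Note: if Python is installed under a different path, change the interpreter path accordingly, for example:
Run > Edit Configurations (prerequisite: the local directory is shared with the virtual machine successfully):
6. Test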
import os
import sys

# Point PySpark at the Spark installation on the remote machine.
os.environ['SPARK_HOME'] = '/root/spark-1.4.0-bin-hadoop2.6'
sys.path.append("/root/spark-1.4.0-bin-hadoop2.6/python")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    print("Successfully imported Spark Modules")
except ImportError as e:
    # Exit with a non-zero code so PyCharm shows the run as failed.
    print("Can not import Spark Modules", e)
    sys.exit(1)
Result:
ssh://root@192.168.22.250:22/usr/bin/python -u /mnt/shared/test01/test01a.py
Successfully imported Spark Modules
Process finished with exit code 0
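The test script hard-codes the Spark path, and the py4j zip under python/lib carries a version number in its file name. One possible way to avoid repeating those strings is to look the zip up with a glob. This is a sketch under assumptions: the helper names spark_python_paths and add_spark_to_path are hypothetical, and the layout assumed is python/ and python/lib/py4j-*-src.zip under SPARK_HOME, as in the PYTHONPATH line above.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and py4j zip(s) that PySpark needs on sys.path."""
    python_dir = os.path.join(spark_home, "python")
    # The bundled py4j zip is versioned (e.g. py4j-0.8.2.1-src.zip), so glob for it.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_path(spark_home):
    """Set SPARK_HOME and append any missing PySpark paths to sys.path."""
    os.environ["SPARK_HOME"] = spark_home
    for path in spark_python_paths(spark_home):
        if path not in sys.path:
            sys.path.append(path)
```

Calling add_spark_to_path('/root/spark-1.4.0-bin-hadoop2.6') before the pyspark imports would then replace the two hard-coded lines in the test script.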
A slightly more complex example:
import sys

sys.path.append("/root/programs/spark-1.4.0-bin-hadoop2.6/python")

try:
    import numpy as np
    import scipy.sparse as sps
    from pyspark.mllib.linalg import Vectors

    # Dense vectors: a NumPy array or a plain Python list.
    dv1 = np.array([1.0, 0.0, 3.0])
    dv2 = [1.0, 0.0, 3.0]
    # Sparse vectors: an MLlib SparseVector or a single-column SciPy csc_matrix.
    sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
    sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape=(3, 1))
    print(sv2)
except ImportError as e:
    print("Can not import Spark Modules", e)
    sys.exit(1)
Result
ssh://root@192.168.22.250:22/root/programs/python3/bin/python -u /mnt/shared/test01/test01a.py
(0, 0) 1.0
(2, 0) 3.0
Process finished with exit code 0
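The three arrays passed to csc_matrix are (data, indices, indptr), which is easy to misread. The pure-Python sketch below decodes such a triple into (row, column, value) entries to show why the result above has two entries; the function name csc_entries is hypothetical and is not part of SciPy.

```python
def csc_entries(data, indices, indptr):
    """List the (row, col, value) triples stored in a CSC (data, indices, indptr) triple."""
    entries = []
    for col in range(len(indptr) - 1):
        # Column `col` owns the slice indptr[col]:indptr[col + 1] of data and indices.
        for k in range(indptr[col], indptr[col + 1]):
            entries.append((indices[k], col, data[k]))
    return entries


# Same arrays as sv2: one column holding 1.0 in row 0 and 3.0 in row 2.
print(csc_entries([1.0, 3.0], [0, 2], [0, 2]))  # [(0, 0, 1.0), (2, 0, 3.0)]
```

Here indptr = [0, 2] says that the single column covers positions 0 and 1 of data and indices, which gives rows 0 and 2.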
Q&A
Q: sudo: sorry, you must have a tty to run sudo
A:
cd /etc
chmod 640 sudoers
vi /etc/sudoers
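#Defaults requiretty   # Comment out the "Defaults requiretty" line. With that line active, sudo requires a tty; once it is commented out, sudo can also run without one, for example in the background.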
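Q: VirtualBox's Shared folder feature reports a "broken shared folder" error
A: See the post on sharing a local directory with the virtual machine mentioned above.
Q: The import fails, sometimes with "cannot import name accumulators" and sometimes with "cannot import name py4j"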
A:
Download pip-7.1.2.tar
tar -xvf pip-7.1.2.tar
cd pip-7.1.2
python setup.py install
pip install py4j
Done!
References
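When these import errors come and go, it can help to ask the interpreter directly which packages it can see. The following is a rough diagnostic sketch, assuming Python 3.4 or later (importlib.util.find_spec does not exist in Python 2); the function name missing_modules is hypothetical.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means both packages are visible to this interpreter.
    print(missing_modules(["pyspark", "py4j"]))
```

Run it with the same remote interpreter that PyCharm uses, since each Python installation has its own search path.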
https://edumine.wordpress.com/2015/08/14/pyspark-in-pycharm/
http://renien.github.io/blog/accessing-pyspark-pycharm/
http://www.tuicool.com/articles/MJnYJb
and others.