Install PySpark on macOS and call it from PyCharm. I work mostly in Python these days, so I wanted to set up PySpark and invoke it from PyCharm, letting Python drive Spark directly on the local machine and run Spark programs.
This article covers, in order: downloading the software, installation, the first round of configuration, coding, a first run, a second round of configuration, and the final successful run. Without further ado, here is the process.
1. Download the packages:
jdk-8u131-macosx-x64.dmg
spark-2.1.0-bin-hadoop2.6.tgz
2. Install the Spark environment
(1) Install the JDK with the default settings.
(2) Extract spark-2.1.0-bin-hadoop2.6.tgz and complete the related configuration. Assume the directory is /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6.
(3) Switch to /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/bin and run pyspark. If the installation succeeded, the interactive PySpark shell starts up (shown as a screenshot in the original post).
3. Configure the packages PyCharm needs to call PySpark
This also shows how to work around macOS permission restrictions and install a third-party Python package by hand.
To call PySpark from PyCharm, the pyspark package must be importable. Copy the pyspark folder from /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/ into /Library/Python/2.7/site-packages/. (Note: that is my Python installation path; on other systems it may be C:\Anaconda2\Lib\site-packages or C:\Python27\Lib\site-packages.)
(1) First locate /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/, then copy the entire pyspark folder into /Library/Python/2.7/site-packages/:
localhost:python a6$ pwd
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python
localhost:python a6$ cd pyspark/
localhost:pyspark a6$ ls
__init__.py broadcast.pyc context.py find_spark_home.py java_gateway.pyc profiler.py rddsampler.pyc shell.py statcounter.pyc streaming version.pyc
__init__.pyc cloudpickle.py context.pyc find_spark_home.pyc join.py profiler.pyc resultiterable.py shuffle.py status.py tests.py worker.py
accumulators.py cloudpickle.pyc daemon.py heapq3.py join.pyc rdd.py resultiterable.pyc shuffle.pyc status.pyc traceback_utils.py
accumulators.pyc conf.py files.py heapq3.pyc ml rdd.pyc serializers.py sql storagelevel.py traceback_utils.pyc
broadcast.py conf.pyc files.pyc java_gateway.py mllib rddsampler.py serializers.pyc statcounter.py storagelevel.pyc version.py
(2) Find the Python package directory, /Library/Python/2.7/site-packages/:
localhost:python a6$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
>>> exit()
This confirms that the package directory is /Library/Python/2.7/site-packages/.
(3) Copy the pyspark folder. The first attempt fails with a permission error, so fall back to sudo (and a recursive copy), as follows:
localhost:site-packages a6$ pwd
/Library/Python/2.7/site-packages
localhost:site-packages a6$ mkdir pyspark
mkdir: pyspark: Permission denied
localhost:site-packages a6$ sudo mkdir pyspark
Password:
localhost:pyspark a6$ pwd
/Library/Python/2.7/site-packages/pyspark
localhost:pyspark a6$ cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: ./__init__.py: Permission denied
cp: ./__init__.pyc: Permission denied
cp: ./accumulators.py: Permission denied
cp: ./accumulators.pyc: Permission denied
cp: ./broadcast.py: Permission denied
cp: ./broadcast.pyc: Permission denied
…………
cp: ./join.pyc: Permission denied
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: ./profiler.py: Permission denied
cp: ./profiler.pyc: Permission denied
cp: ./rdd.py: Permission denied
cp: ./rdd.pyc: Permission denied
cp: ./rddsampler.py: Permission denied
cp: ./rddsampler.pyc: Permission denied
cp: ./resultiterable.py: Permission denied
cp: ./resultiterable.pyc: Permission denied
cp: ./serializers.py: Permission denied
cp: ./serializers.pyc: Permission denied
localhost:pyspark a6$ sudo cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/streaming is a directory (not copied).
localhost:pyspark a6$ sudo cp -rf /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
localhost:pyspark a6$ ls
__init__.py broadcast.pyc context.py find_spark_home.py java_gateway.pyc profiler.py rddsampler.pyc shell.py statcounter.pyc streaming version.pyc
__init__.pyc cloudpickle.py context.pyc find_spark_home.pyc join.py profiler.pyc resultiterable.py shuffle.py status.py tests.py worker.py
accumulators.py cloudpickle.pyc daemon.py heapq3.py join.pyc rdd.py resultiterable.pyc shuffle.pyc status.pyc traceback_utils.py
accumulators.pyc conf.py files.py heapq3.pyc ml rdd.pyc serializers.py sql storagelevel.py traceback_utils.pyc
broadcast.py conf.pyc files.pyc java_gateway.py mllib rddsampler.py serializers.pyc statcounter.py storagelevel.pyc version.py
localhost:pyspark a6$
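The manual sudo cp -rf above can also be scripted from Python with shutil.copytree, which copies a directory tree recursively like cp -r. A minimal sketch, shown here with temporary placeholder directories rather than the real Spark paths (which would still require elevated privileges for /Library/Python/2.7/site-packages/):

```python
import os
import shutil
import tempfile

# Placeholder source tree standing in for .../python/pyspark
src = tempfile.mkdtemp()
os.makedirs(os.path.join(src, "ml"))  # subpackages are why a recursive copy is needed
open(os.path.join(src, "__init__.py"), "w").close()

# shutil.copytree copies recursively, like `cp -r`; the destination
# directory must not already exist
dst = os.path.join(tempfile.mkdtemp(), "pyspark")
shutil.copytree(src, dst)
print(sorted(os.listdir(dst)))  # ['__init__.py', 'ml']
```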
4. Writing Python code that uses PySpark
The code is as follows:
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('words.txt')
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()
The content of words.txt used by the code is:
good bad cool
hadoop spark mlib
good spark mlib
cool spark bad
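For readers without a working Spark installation yet, the same flatMap / map / reduceByKey pipeline can be mimicked in plain Python to check the expected output. This is just an illustration of the logic, not part of the article's setup:

```python
from collections import Counter

# The words.txt content listed above
text = """good bad cool
hadoop spark mlib
good spark mlib
cool spark bad"""

# flatMap(split) -> one word per element; map/reduceByKey(add) -> Counter
counts = Counter(word for line in text.splitlines() for word in line.split())
for word, count in sorted(counts.items()):
    print("%s: %i" % (word, count))
```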
5. First run
(1) The first run fails with an error, so some extra configuration is needed:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Could not find valid SPARK_HOME while searching ['/Users/a6/Downloads/PycharmProjects', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7']
Process finished with exit code 255
(2) There is in fact one more place to configure.
In PyCharm's menu bar, open Run => Edit Configurations, then add an environment variable (the spot marked in red in the original post's screenshot).
Add SPARK_HOME pointing at the Spark installation directory, as shown in the red box.
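As an alternative to the PyCharm dialog, the variable can also be set in code, as long as it happens before pyspark is imported. A minimal sketch, assuming the installation path used throughout this article:

```python
import os

# Must run before `from pyspark import SparkContext`; pyspark's
# find_spark_home consults this variable at import time. The path is
# the one assumed in this article; adjust it to your own installation.
os.environ["SPARK_HOME"] = "/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6"
print(os.environ["SPARK_HOME"])
```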
(3) Run again
This time the program produces the correct output:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/13 16:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/13 16:30:48 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.2.32.96 instead (on interface en0)
17/10/13 16:30:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
bad: 2
spark: 3
mlib: 2
good: 2
hadoop: 1
cool: 2
Process finished with exit code 0
6. PySpark learning material
The extracted folder spark-2.1.0-bin-hadoop2.6/examples/src/main/python contains many example programs worth studying; the wordCount in this article is a lightly modified version of one of them.
7. Finding the Python installation directory and where third-party modules are installed
Knowing the Python home path makes everything else straightforward.
For example:
localhost:python a6$ python
Python 2.7.10 (default, Feb 7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
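Instead of eyeballing sys.path, the site-packages directory can also be asked for directly. A small sketch using the standard-library sysconfig module:

```python
import sysconfig

# "purelib" is the install location for pure-Python third-party modules
# (site-packages on macOS, dist-packages on some Linux distributions)
site_packages = sysconfig.get_paths()["purelib"]
print(site_packages)
```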
8. For an example of installing PySpark on Windows and calling it from PyCharm, see the URL referenced in the original post: