Installing PySpark on macOS and calling it from Python in PyCharm


Install PySpark on macOS and call it from Python in PyCharm. I mostly work in Python these days, so I wanted to set up PySpark and call it from PyCharm.

With PySpark, Python can drive Spark directly on the local machine and run Spark programs. This article walks through the whole process: downloading the software, installing it, the first round of configuration, writing the code, the first run, the second round of configuration, and finally a successful run. Without further ado, here it is step by step.

1. Download the packages:

jdk-8u131-macosx-x64.dmg
spark-2.1.0-bin-hadoop2.6.tgz

2. Install the Spark environment

(1) Install the JDK with its default settings.
(2) Unpack spark-2.1.0-bin-hadoop2.6.tgz and do the related configuration. Assume the target directory is /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6.
(3) Switch to /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/bin and run pyspark. If the installation succeeded, the PySpark interactive shell starts up (the original post showed a screenshot of the shell banner here).
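As a quick sanity check inside that shell, you can ask the SparkContext that the shell creates for you (a minimal sketch, not from the original post; the exact output depends on your build):

>>> sc.version
u'2.1.0'
>>> sc.parallelize([1, 2, 3]).count()
3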


3. Configure the PySpark package so PyCharm can import it

This step also shows how to get around macOS permission problems when installing a third-party Python package into the system Python by hand.
To call PySpark from PyCharm, the package must be importable. Copy the pyspark folder under /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/ into /Library/Python/2.7/site-packages/ (note: that is where my Python is installed; some readers may have C:\Anaconda2\Lib\site-packages or C:\Python27\Lib\site-packages instead).

(1) First locate the directory /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/, then copy the entire pyspark folder into /Library/Python/2.7/site-packages/:
localhost:python a6$ pwd
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python
localhost:python a6$ cd pyspark/
localhost:pyspark a6$ ls
__init__.py        broadcast.pyc        context.py        find_spark_home.py    java_gateway.pyc    profiler.py        rddsampler.pyc        shell.py        statcounter.pyc        streaming        version.pyc
__init__.pyc        cloudpickle.py        context.pyc        find_spark_home.pyc    join.py            profiler.pyc        resultiterable.py    shuffle.py        status.py        tests.py        worker.py
accumulators.py        cloudpickle.pyc        daemon.py        heapq3.py        join.pyc        rdd.py            resultiterable.pyc    shuffle.pyc        status.pyc        traceback_utils.py
accumulators.pyc    conf.py            files.py        heapq3.pyc        ml            rdd.pyc            serializers.py        sql            storagelevel.py        traceback_utils.pyc
broadcast.py        conf.pyc        files.pyc        java_gateway.py        mllib            rddsampler.py        serializers.pyc        statcounter.py        storagelevel.pyc    version.py


(2) Find the Python package directory, /Library/Python/2.7/site-packages/:
localhost:python a6$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
>>> exit()

This confirms that the package directory is /Library/Python/2.7/site-packages/.
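You can also verify programmatically that this directory is on the interpreter's import path (a one-off check, not from the original post):

>>> import sys
>>> '/Library/Python/2.7/site-packages' in sys.path
True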

(3) Copy pyspark over. The plain copy fails for lack of permissions, so work around it with sudo:
localhost:site-packages a6$ pwd
/Library/Python/2.7/site-packages
localhost:site-packages a6$ mkdir pyspark
mkdir: pyspark: Permission denied
localhost:site-packages a6$ sudo mkdir pyspark
Password:
localhost:site-packages a6$ cd pyspark/
localhost:pyspark a6$ pwd
/Library/Python/2.7/site-packages/pyspark
localhost:pyspark a6$ cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: ./__init__.py: Permission denied
cp: ./__init__.pyc: Permission denied
cp: ./accumulators.py: Permission denied
cp: ./accumulators.pyc: Permission denied
cp: ./broadcast.py: Permission denied
cp: ./broadcast.pyc: Permission denied
…………
cp: ./join.pyc: Permission denied
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: ./profiler.py: Permission denied
cp: ./profiler.pyc: Permission denied
cp: ./rdd.py: Permission denied
cp: ./rdd.pyc: Permission denied
cp: ./rddsampler.py: Permission denied
cp: ./rddsampler.pyc: Permission denied
cp: ./resultiterable.py: Permission denied
cp: ./resultiterable.pyc: Permission denied
cp: ./serializers.py: Permission denied
cp: ./serializers.pyc: Permission denied
localhost:pyspark a6$ sudo cp /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/ml is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/mllib is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/sql is a directory (not copied).
cp: /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/streaming is a directory (not copied).
localhost:pyspark a6$ sudo cp -rf  /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/pyspark/* ./
localhost:pyspark a6$ ls
__init__.py        broadcast.pyc        context.py        find_spark_home.py    java_gateway.pyc    profiler.py        rddsampler.pyc        shell.py        statcounter.pyc        streaming        version.pyc
__init__.pyc        cloudpickle.py        context.pyc        find_spark_home.pyc    join.py            profiler.pyc        resultiterable.py    shuffle.py        status.py        tests.py        worker.py
accumulators.py        cloudpickle.pyc        daemon.py        heapq3.py        join.pyc        rdd.py            resultiterable.pyc    shuffle.pyc        status.pyc        traceback_utils.py
accumulators.pyc    conf.py            files.py        heapq3.pyc        ml            rdd.pyc            serializers.py        sql            storagelevel.py        traceback_utils.pyc
broadcast.py        conf.pyc        files.pyc        java_gateway.py        mllib            rddsampler.py        serializers.pyc        statcounter.py        storagelevel.pyc    version.py
localhost:pyspark a6$
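After the copy, a quick import from any directory confirms the package is visible (a sketch; note that pyspark also needs py4j at import time, which the sys.path listing above shows is already installed as an egg):

>>> import pyspark
>>> print pyspark.version.__version__
2.1.0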

4. Python code that drives PySpark
The code is as follows:
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    # Create a SparkContext; the app name shows up in the Spark UI and logs.
    sc = SparkContext(appName="PythonWordCount")
    # Read the input file as an RDD of lines.
    lines = sc.textFile('words.txt')
    # Split lines into words, map each word to (word, 1), then sum per key.
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    # Bring the results back to the driver and print them.
    output = counts.collect()
    for (word, count) in output:
        print "%s: %i" % (word, count)
    sc.stop()


The words.txt referenced in the code contains:
good bad cool 
hadoop spark mlib 
good spark mlib 
cool spark bad
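If you want the script to create its own input file, a one-off snippet like this works (hypothetical convenience code, not part of the original example; the filename must match the sc.textFile() call above):

# Write the sample input into the current working directory.
with open('words.txt', 'w') as f:
    f.write('good bad cool\nhadoop spark mlib\ngood spark mlib\ncool spark bad\n')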


5. First run

(1) The first run fails with an error, so some configuration is still missing:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Could not find valid SPARK_HOME while searching ['/Users/a6/Downloads/PycharmProjects', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7/site-packages/pyspark', '/Library/Python/2.7']

Process finished with exit code 255


(2) One more thing needs to be configured.
In PyCharm's menu bar, open Run => Edit Configurations and add an environment variable: set SPARK_HOME to the Spark installation directory, i.e. /Users/a6/Applications/spark-2.1.0-bin-hadoop2.6 (the original post marked the relevant fields with red boxes in screenshots).
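Alternatively, if you would rather not touch the run configuration, the same effect can be achieved in code by exporting SPARK_HOME before pyspark is imported (a minimal sketch; the path is the installation directory from step 2):

import os

# Must execute before `from pyspark import SparkContext`:
# pyspark looks up SPARK_HOME when it launches the JVM gateway.
os.environ.setdefault('SPARK_HOME', '/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6')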




(3) Run again, and this time the correct result comes out:
/System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/a6/Downloads/PycharmProjects/test_use_hbase_by_thrift/test_python_local_use_spark.py
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/10/13 16:30:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/13 16:30:48 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.0.1; using 10.2.32.96 instead (on interface en0)
17/10/13 16:30:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
/Users/a6/Applications/spark-2.1.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling
bad: 2
spark: 3
mlib: 2
good: 2
hadoop: 1
cool: 2
Process finished with exit code 0


6. Where to learn PySpark

The folder spark-2.1.0-bin-hadoop2.6/examples/src/main/python from the archive unpacked above contains many example programs to learn from; the wordCount in this article is a lightly modified version of the example there.

7. Finding the Python installation directory and where third-party modules live
Once you know the Python home path, the rest is straightforward.


Here is an example:

localhost:python a6$ python
Python 2.7.10 (default, Feb  7 2017, 00:08:15)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.34)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.path
['', '/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg', '/Library/Python/2.7/site-packages/py4j-0.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/redis-2.10.6-py2.7.egg', '/Library/Python/2.7/site-packages/MySQL_python-1.2.4-py2.7-macosx-10.12-intel.egg', '/Library/Python/2.7/site-packages/thrift-0.10.0-py2.7-macosx-10.12-intel.egg', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python27.zip', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-darwin', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/plat-mac/lib-scriptpackages', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-tk', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-old', '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/lib-dynload', '/Library/Python/2.7/site-packages', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python', '/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/PyObjC']
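The sys and site modules can also report this directly, instead of having you read sys.path by eye (a minimal sketch; site.getsitepackages() exists in stock CPython 2.7 but is missing in some virtualenv setups):

import sys
import site

print sys.executable          # path to the interpreter binary
print sys.prefix              # the Python home directory
print site.getsitepackages()  # candidate directories for third-party modules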

8. For an example of installing PySpark on Windows and calling it from PyCharm, see:

