1. Install Python 3
Install Python 3 and add it to the system PATH environment variable: D:\Python37;D:\Python37\Scripts
Use pip to install the py4j dependency:
pip install py4j
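To confirm the install worked, a quick sanity check from a Python shell (this just verifies the package is importable and shows where pip placed it):

import py4j
print(py4j.__file__)  # e.g. somewhere under D:\Python37\Lib\site-packages\py4j\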
2. Install IntelliJ IDEA
Install IntelliJ IDEA, then install the Python plugin and configure the Python interpreter for the IDE.
3. Download the Hadoop package
Extract it locally and set the HADOOP_HOME environment variable.
4. Download the Spark package
Extract it locally and set the SPARK_HOME environment variable.
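As an aside, if you prefer not to change the system-wide settings, the same two variables can be set at the top of the driver script before the SparkContext is created. A minimal sketch, assuming the archives were extracted to these hypothetical D:\ paths:

import os

# Hypothetical extraction directories; adjust to your own layout.
os.environ["HADOOP_HOME"] = r"D:\hadoop-2.6.0"
os.environ["SPARK_HOME"] = r"D:\spark-2.0.1-bin-hadoop2.6"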
Copy SPARK_HOME\python\pyspark into the Python installation directory: Python37\Lib\site-packages\
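If you would rather not copy files into site-packages at all, an alternative (not the approach taken here) is to put Spark's bundled Python sources directly on sys.path; Spark also ships py4j as a source zip under python\lib. A sketch, using the same hypothetical extraction path as above:

import glob
import os
import sys

spark_home = r"D:\spark-2.0.1-bin-hadoop2.6"  # hypothetical path
sys.path.insert(0, os.path.join(spark_home, "python"))
# Spark bundles py4j as a zip; put it on the path as well.
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])

from pyspark import SparkConf, SparkContext  # should now resolve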
Create a Python project in IDEA and test with the following code:
from pyspark import SparkConf, SparkContext

# Run Spark locally with a single worker thread.
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

# Build an RDD from an in-memory list and keep only the strings containing "pandas".
lines = sc.parallelize(["pandas", "cat", "i like pandas"])
word = lines.filter(lambda s: "pandas" in s)
print(word.count())
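With a working setup this prints 2, since "pandas" and "i like pandas" both pass the filter.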
On the first attempt, however, running the code failed with an error.
Cause analysis: a Python version incompatibility (the PySpark shipped with Spark 2.0.1 predates Python 3.7), but I did not want to reinstall a matching older Python.
Solution: download the following three files:
https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
https://github.com/apache/spark/blob/master/python/pyspark/serializers.py
https://github.com/apache/spark/blob/master/python/pyspark/util.py
Overwrite the matching files in Python37\Lib\site-packages\pyspark with the three files above,
and also replace the copies inside spark-2.0.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark.
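The site-packages half of that overwrite can be scripted; a minimal sketch, assuming the three downloaded files sit in a hypothetical D:\pyspark_patch folder (the copies inside pyspark.zip still need to be replaced by hand, e.g. through Windows Explorer):

import shutil
from pathlib import Path

patch_dir = Path(r"D:\pyspark_patch")  # hypothetical download location
site_pkg = Path(r"D:\Python37\Lib\site-packages\pyspark")

for name in ("cloudpickle.py", "serializers.py", "util.py"):
    shutil.copy2(patch_dir / name, site_pkg / name)  # overwrite in place
    print("overwrote", site_pkg / name)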
Re-run the test code; this time it runs successfully.
Note: versions used: Python 3.7, spark-2.0.1-bin-hadoop2.6, hadoop-2.6.0.
References:
https://blog.csdn.net/sisteryaya/article/details/68945705
https://blog.csdn.net/wangxiao7474/article/details/81205426