1. Preface
I mostly work in Python these days, so I wanted to install pySpark and call it from Anaconda2. The files used are:
(1)jdk-8u91-windows-x64.exe
(2)spark-1.6.0-bin-hadoop2.6.0.tgz
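2. Installation
(1) Install the JDK with the default options.
(2) Extract spark-1.6.0-bin-hadoop2.6.0.tgz first. Assume the target directory is E:\spark-1.6.0-bin-hadoop2.6.0
(3) Configure the environment variables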
SPARK_HOME=E:\spark-1.6.0-bin-hadoop2.6.0
Add %SPARK_HOME%\bin and %SPARK_HOME%\python to Path
Now open cmd and type pyspark. If nothing is wrong, the PySpark shell starts and prints the Spark welcome banner.
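If the shell does not start, it can help to confirm that the variables from step (3) are actually visible to new processes. The following is a minimal sketch, not part of the original setup: `check_spark_env` is a hypothetical helper that inspects an environment mapping and reports which of the entries described above are missing.

```python
import os


def _norm(path):
    # Windows paths are case-insensitive; also ignore trailing slashes.
    return path.strip().rstrip("\\/").lower()


def check_spark_env(env):
    """Return a list of problems found in the given environment mapping."""
    spark_home = env.get("SPARK_HOME", "")
    if not spark_home:
        return ["SPARK_HOME is not set"]
    raw_path = env.get("Path") or env.get("PATH") or ""
    entries = set(_norm(p) for p in raw_path.split(";") if p.strip())
    problems = []
    for sub in ("bin", "python"):
        expanded = _norm(spark_home) + "\\" + sub   # e.g. e:\spark-...\bin
        literal = "%spark_home%\\" + sub            # unexpanded form
        if expanded not in entries and literal not in entries:
            problems.append("Path is missing %SPARK_HOME%\\" + sub)
    return problems


if __name__ == "__main__":
    # An empty result means both entries from step (3) were found.
    for problem in check_spark_env(os.environ):
        print(problem)
```

Run it from a freshly opened cmd window, since windows that were already open do not pick up changed environment variables.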
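(4) To call pySpark from Anaconda2, the package has to be importable. Copy the pyspark folder from E:\spark-1.6.0-bin-hadoop2.6.0\python into C:\Anaconda2\Lib\site-packages **(note: this is where my Python is installed; for some readers it may be C:\Python27\Lib\site-packages)**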
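A quick way to check that the copy worked is to try importing the packages from the Anaconda interpreter. The sketch below uses a hypothetical helper, `missing_modules`, that simply attempts each import; pyspark also needs py4j, so both names are checked.

```python
def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            __import__(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means both packages are importable from this interpreter.
    print(missing_modules(["pyspark", "py4j"]))
```

If either name is printed, the interpreter running the script is probably not the one whose site-packages folder received the copy.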
3. pyCharm wordCount example
- Create a new wordCount.py file and add the code:
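import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    # Read the input file as an RDD of lines
    lines = sc.textFile('words.txt')
    # Split each line into words, pair each word with 1, then sum per word
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        # print() with a single argument works on both Python 2 and Python 3
        print("%s: %i" % (word, count))
    sc.stop()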
Running the script prints one line per word, for example:
19: 1
Michael: 1
Andy: 1
30: 1
29: 1
Justin: 1
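To see what the flatMap / map / reduceByKey chain computes without starting Spark, the same steps can be mimicked on an ordinary Python list. This is only an illustrative sketch: `local_word_count` is a made-up name, and it runs on a single machine rather than on an RDD.

```python
from operator import add


def local_word_count(lines, sep=" "):
    """Mimic flatMap -> map -> reduceByKey(add) on a plain list of strings."""
    # flatMap: split every line and flatten the pieces into one list
    tokens = [tok for line in lines for tok in line.split(sep)]
    # map: pair each token with the count 1
    pairs = [(tok, 1) for tok in tokens]
    # reduceByKey(add): combine the values of identical keys with add
    counts = {}
    for key, value in pairs:
        counts[key] = add(counts[key], value) if key in counts else value
    return counts


if __name__ == "__main__":
    for word, count in local_word_count(["a b", "b c"]).items():
        print("%s: %i" % (word, count))
```

As with `collect()` on an RDD, the order of the printed pairs should not be relied on.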
If you run into one of the following errors:
1. ModuleNotFoundError: No module named 'py4j'
conda install py4j
or pip install py4j
2. ImportError: cannot import name accumulators
ImportError: No module named py4j.java_gateway
Set SPARK_HOME and the Python path inside the script:
import os
import sys
from operator import add

# Path for spark source folder
os.environ['SPARK_HOME'] = "D:\\ProgramFiles\\spark-1.6.0-bin-hadoop2.6"
# Append pyspark to Python Path
sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))

# Import pyspark only after the paths above are set; otherwise the import
# can fail before the Spark folder has been added to sys.path
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile('E:\\testData\\spark\\spark1.6\\people.txt')
    counts = lines.flatMap(lambda x: x.split(',')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)
    output = counts.collect()
    for (word, count) in output:
        print("%s: %i" % (word, count))
    sc.stop()
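If `No module named py4j.java_gateway` still appears and py4j is not installed through conda or pip, another option is to put the py4j source zip bundled with Spark on `sys.path` as well. The sketch below assumes the zip sits under `python\lib` of the Spark folder and matches `py4j-*-src.zip` (the version in the file name differs between Spark releases); the helper names are hypothetical.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the sys.path entries needed to import pyspark from spark_home."""
    python_dir = os.path.join(spark_home, "python")
    # Assumed layout: the bundled py4j zip is in python/lib, version varies.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Set SPARK_HOME and append the pyspark and py4j locations to sys.path."""
    os.environ["SPARK_HOME"] = spark_home
    for entry in spark_python_paths(spark_home):
        if entry not in sys.path:
            sys.path.append(entry)
```

Call `add_spark_to_sys_path("D:\\ProgramFiles\\spark-1.6.0-bin-hadoop2.6")` before the `from pyspark import SparkContext` line, adjusting the path to your own Spark folder.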
4. pySpark learning resources
(1)http://spark.apache.org/docs/latest/api/python/pyspark.html
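(2) The folder extracted above contains many sample programs under E:\spark-1.6.0-bin-hadoop2.6.0\examples\src\main\python that are worth studying. The wordCount in this post is taken from there, with a few small changes.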
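Note:
If you are using Python 3.5+, the next step decides whether the setup will work.
1. In the D:\spark\spark-2.0.1-bin-hadoop2.7\bin folder, find the pyspark file and open it with Notepad++.
2. Find export PYSPARK_PYTHON and change it to export PYSPARK_PYTHON=python3
3. Save the file, and you're done.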
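When running from a script or from pyCharm rather than through the pyspark launcher, a possible alternative is to set the same PYSPARK_PYTHON variable from Python before the SparkContext is created. This is a sketch under that assumption, not a replacement for the steps above; `use_current_interpreter` is a hypothetical helper name.

```python
import os
import sys


def use_current_interpreter(env=None):
    """Point PYSPARK_PYTHON at the interpreter that is running this script."""
    env = os.environ if env is None else env
    # sys.executable is the full path of the running Python binary
    env["PYSPARK_PYTHON"] = sys.executable
    return env["PYSPARK_PYTHON"]


if __name__ == "__main__":
    print(use_current_interpreter())
```

The call has to come before `SparkContext(...)` is constructed, because the variable is read when the context starts.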