Spark runtime environment setup reference: https://blog.csdn.net/max_cola/article/details/78902597
The corresponding environment variables:
#java
export JAVA_HOME=/usr/local/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
#python
export PYTHON_HOME=/usr/local/python3
export PATH=$PYTHON_HOME/bin:$PATH
#spark
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
#add spark to python
export PYTHONPATH=/usr/local/spark/python
#add pyspark to jupyter
export PYSPARK_PYTHON=/usr/local/python3/bin/python3  # Two Python versions are installed, so PYSPARK_PYTHON must point at the right one; otherwise PySpark jobs fail at launch.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --allow-root'
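With PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS set as above, running `pyspark` launches a Jupyter notebook instead of the plain REPL. Assuming the exports live in ~/.bashrc, here is a minimal Python sketch to confirm they took effect after `source ~/.bashrc`:

import os

# Print each variable so a typo or an un-sourced profile shows up immediately.
for var in ("JAVA_HOME", "PYTHON_HOME", "SPARK_HOME", "PYTHONPATH",
            "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON",
            "PYSPARK_DRIVER_PYTHON_OPTS"):
    print(var, "=", os.environ.get(var, "<not set>"))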
A Spark example written in Python:
# -*- coding: utf-8 -*-
from __future__ import print_function
from pyspark import *

if __name__ == '__main__':
    sc = SparkContext("local[4]")
    sc.setLogLevel("WARN")
    # Word count: pair each word with 1, then sum the counts per key.
    rdd = sc.parallelize("hello Pyspark world".split(" "))
    counts = rdd \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.foreach(print)
    sc.stop()
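If the job runs, every word in the sample sentence appears exactly once, so the expected output (partition order may vary) is:

('hello', 1)
('Pyspark', 1)
('world', 1)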
In practice, running it produced the following error:
Traceback (most recent call last):
  File "test1.py", line 3, in <module>
    from pyspark import *
  File "/usr/local/spark/python/pyspark/__init__.py", line 46, in <module>
    from pyspark.context import SparkContext
  File "/usr/local/spark/python/pyspark/context.py", line 29, in <module>
    from py4j.protocol import Py4JError
ImportError: No module named py4j.protocol
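The import fails because pyspark depends on py4j, the bridge it uses to talk to the JVM. Spark ships py4j as a zip under /usr/local/spark/python/lib, but that zip is not on the interpreter's import path, so from py4j.protocol import Py4JError cannot be resolved.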
Solution:
# Change into the Python installation's site-packages directory
cd /usr/local/python3/lib/python3.6/site-packages
# Copy over the py4j package bundled with Spark
cp /usr/local/spark/python/lib/py4j-0.10.7-src.zip ./
# Unzip it in place
unzip py4j-0.10.7-src.zip
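With py4j unpacked into site-packages, the imports that failed above should now succeed. A quick check, run with the same python3:

# Re-run the exact imports from the traceback to confirm the fix.
from py4j.protocol import Py4JError
from pyspark import SparkContext

print("py4j and pyspark import correctly")

An alternative that avoids copying files is to append the zip itself to PYTHONPATH, e.g. export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH.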