1. Install Python
2. pip install pyspark==2.4.4
3. Install Java: JDK 1.8
4. Install Hadoop 2.8.2
5. On Windows, download the bin files from https://github.com/srccodes/hadoop-common-2.2.0-bin and copy them over the corresponding files in Hadoop's bin directory
Configure the environment variables:
HADOOP_HOME: the directory one level above Hadoop's bin
JAVA_HOME: the directory one level above Java's bin
Add to PATH: %HADOOP_HOME%\bin;%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin
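Before running anything, it can help to confirm that the two variables are actually visible to Python. A minimal check (the "<not set>" fallback is just a placeholder string of this sketch):

#!/usr/bin/python
# coding=utf-8
import os

# Print the variables PySpark relies on; a missing HADOOP_HOME on Windows
# typically surfaces later as a winutils.exe error.
for var in ("HADOOP_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))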
Then run the following code:
#!/usr/bin/python
# coding=utf-8
import traceback

from pyspark import SparkContext

try:
    print("begin")
    # Local mode: no cluster is needed for this smoke test
    sc = SparkContext(appName="IIP_Recommend_System", master="local")
    words = sc.parallelize(
        ["scala",
         "java",
         "hadoop",
         "spark",
         "akka",
         "spark vs hadoop",
         "pyspark",
         "pyspark and spark"
         ])
    # Trigger an actual job by counting the RDD's elements
    counts = words.count()
    print(counts)
    '''
    The Start!
    '''
    sc.stop()
except:
    traceback.print_exc()
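If the environment is set up correctly, the script prints begin followed by 8 (the number of elements in the parallelized RDD); a stack trace printed instead usually points to a missing HADOOP_HOME or an incompatible Python version, per the table below.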
The correspondence between pyspark and python versions is:
pyspark          | python
-----------------|--------
< 2.4.4          | <= 3.6
2.4.4 <= v < 3.0 | <= 3.7
>= 3.0           | <= 3.8
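A quick way to see which pair is actually installed, for comparison against the table above:

#!/usr/bin/python
# coding=utf-8
import sys

import pyspark

# Interpreter version and the pyspark package version
print("python :", sys.version.split()[0])
print("pyspark:", pyspark.__version__)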
To enable Kafka support, add spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar to the Lib\site-packages\pyspark\jars directory of the Python environment (other versions have not been tried). A minimal streaming sketch follows.
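With that jar in place, a direct stream can be created through KafkaUtils from the 0-8 connector. This is a minimal sketch, assuming a broker at localhost:9092 and a topic named test-topic (both are placeholder values; substitute your own):

#!/usr/bin/python
# coding=utf-8
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# local[2]: one thread to receive data, one to process it
sc = SparkContext(appName="KafkaSmokeTest", master="local[2]")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# "localhost:9092" and "test-topic" are placeholders for this example
stream = KafkaUtils.createDirectStream(
    ssc, ["test-topic"], {"metadata.broker.list": "localhost:9092"})

# Each element is a (key, value) pair; print the message values
stream.map(lambda kv: kv[1]).pprint()

ssc.start()
ssc.awaitTermination()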