Software to prepare:
1. jdk1.8
2. scala-2.11.8
3. python3.7
4. hadoop-2.7.1
5. winutils
6. spark-2.3.2-bin-hadoop2.7
7. PyCharm
1. Install the JDK
After installing the JDK, configure the environment variables: set JAVA_HOME and append %JAVA_HOME%\bin to PATH. A quick sanity check follows.
Reference:
the CSDN post "windows 安装jdk1.8" by 小猫不会去楼兰捉虫
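A minimal sketch to confirm the JDK is reachable from Python (this assumes java.exe is on PATH; note that java -version reports on stderr rather than stdout):

import subprocess

# The JDK should be on PATH after step 1; `java -version` writes to stderr.
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr)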
2. Install Scala
Download Scala from the official site, extract it to a directory of your choice, and add its bin directory to PATH; a similar check follows.
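A sketch for checking the Scala install (assumes step 2 put scala.bat on PATH; it is a batch script on Windows, so the call goes through the shell):

import subprocess

# scala.bat is a batch script, so shell=True is needed on Windows.
subprocess.run("scala -version", shell=True)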
3. Install Python
Download Python from the official site, install it, and add it to PATH.
4. Install Hadoop
Download from the official site a version at least as new as the Hadoop version your Spark build targets (hadoop2.7 here), extract it, set HADOOP_HOME, and append its bin directory to PATH.
5. Install winutils
Download from: GitHub - steveloughran/winutils: Windows binaries for Hadoop versions (built from the git commit ID used for the ASF release). Pick the build matching your Hadoop version and copy winutils.exe into Hadoop's bin directory; the sketch below verifies the result.
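A minimal check that steps 4 and 5 took effect (assumes HADOOP_HOME was set as described above):

import os

# HADOOP_HOME should point at the extracted Hadoop directory,
# and winutils.exe should sit in its bin subdirectory.
hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
if hadoop_home:
    print("winutils.exe present:",
          os.path.exists(os.path.join(hadoop_home, "bin", "winutils.exe")))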
6. Install Spark
Download the archive from the official site, extract Spark to a directory of your choice, set SPARK_HOME, and add %SPARK_HOME%\bin to PATH.
Spark download: https://archive.apache.org/dist/spark/
Verify that pyspark starts correctly, as shown below.
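To verify, open a new command prompt and run pyspark. The interactive shell should start and pre-create a SparkContext bound to the name sc; a quick smoke test inside that shell (0 + 1 + ... + 99 = 4950):

>>> sc.parallelize(range(100)).sum()
4950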
7. PyCharm development environment setup
(1) Create a new project and choose the interpreter.
(2) Run => Edit Configurations => Environment variables
PYTHONPATH=D:\MyProgram\spark-2.3.2-bin-hadoop2.7\python  # Python sources shipped with Spark
SPARK_HOME=D:\MyProgram\spark-2.3.2-bin-hadoop2.7  # Spark install directory
(both paths must match the directory Spark was extracted to in step 6)
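A small sketch to run under that configuration, confirming PyCharm passes the variables through:

import os
import sys

# Both variables come from Run => Edit Configurations above.
print("SPARK_HOME =", os.environ.get("SPARK_HOME"))
print("PYTHONPATH =", os.environ.get("PYTHONPATH"))
# If PYTHONPATH took effect, Spark's python directory shows up on sys.path.
print([p for p in sys.path if "spark" in p.lower()])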
(3) Settings => Project Structure => Add Content Root, and add the two archives below (the py4j version varies across Spark releases, so use the zip actually present in your lib directory; the sketch after this list prints it):
$SPARK_HOME\python\lib\py4j-0.10.6-src.zip
$SPARK_HOME\python\lib\pyspark.zip
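To see exactly which archives your distribution ships (the D:\MyProgram path below is only the example install location from step (2)):

import glob
import os

# List the zips under $SPARK_HOME\python\lib; these are what Add Content Root needs.
spark_home = os.environ.get("SPARK_HOME", r"D:\MyProgram\spark-2.3.2-bin-hadoop2.7")
print(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))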
(4) Test that the environment works
Install the pyspark package, pinning the version to match the Spark install (a bare pip install pyspark pulls the latest release, which may not match Spark 2.3.2):
pip install pyspark==2.3.2
Test code:
from pyspark import SparkConf, SparkContext

# 1. Create the SparkConf and SparkContext
conf = SparkConf().setMaster("local").setAppName("lichao-wordcount")
sc = SparkContext(conf=conf)

# 2. Business logic: a classic word count
data = ["hello", "world", "hello", "word", "count", "count", "hello"]
rdd = sc.parallelize(data)
resultRdd = rdd.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
resultColl = resultRdd.collect()
for line in resultColl:
    print(line)

# 3. Shut the context down once the job is done
sc.stop()
Run output (the order of the tuples may vary across runs):
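With this data set the collected pairs carry these counts, typically printed as:

('hello', 3)
('world', 1)
('word', 1)
('count', 2)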